[postgis-users] Help with Bad Query Plan

Mon Jan 12 12:09:27 PST 2009

I folks, thanks for the e-mails.  Unfortunately I did not have access to
my e-mail at the weekend.  I have, however, created a solution having
read www.postgresql.org/docs/8.3/static/explicit-joins.html - which
might only be required pre-geo 3.1 (see comments at bottom).  The query
plan comments simply might not be of interest if 3.1 addresses this
issue...anyway, the query plans that my current database comes up with
are pretty bad.

set from_collapse_limit=1;
SELECT possible_geography.report_id
FROM (SELECT gl.report_id, g.the_geom FROM gdors_geography g,
gdors_geolink gl WHERE gl.temp_report_date BETWEEN '2008-01-06' AND
'2009-01-06' AND g.gid = gl.geom_id) AS possible_geography, (SELECT
g.the_geom FROM gdors_geography g WHERE g.gid=3) AS selected_geography 
WHERE ST_Intersects(possible_geography.the_geom,
selected_geography.the_geom);

"Nested Loop  (cost=0.00..1448.63 rows=1 width=4) (actual
time=93.485..16959.766 rows=81 loops=1)"
"  Join Filter: ((g.the_geom && g.the_geom) AND
_st_intersects(g.the_geom, g.the_geom))"
"  ->  Index Scan using gdors_geography_pkey on gdors_geography g
(cost=0.00..8.27 rows=1 width=3014) (actual time=0.017..0.020 rows=1
loops=1)"
"        Index Cond: (gid = 3)"
"  ->  Nested Loop  (cost=0.00..1436.61 rows=250 width=3018) (actual
time=0.037..14.822 rows=275 loops=1)"
"        ->  Seq Scan on gdors_geolink gl  (cost=0.00..214.00 rows=250
width=8) (actual time=0.019..4.098 rows=275 loops=1)"
"              Filter: ((temp_report_date >= '2008-01-06'::date) AND
(temp_report_date <= '2009-01-06'::date))"
"        ->  Index Scan using gdors_geography_pkey on gdors_geography g
(cost=0.00..4.88 rows=1 width=3018) (actual time=0.018..0.026 rows=1
loops=275)"
"              Index Cond: (g.gid = gl.geom_id)"
"Total runtime: 16960.739 ms"

(An index on the geolink date field decreased the time by 0.5 seconds).
Not bad for 81 matches!

Basically (as I'm sure you know), the from_collapse_limit prevents the
subqueries from being flattened out into a bigger query...so literally
do as I say.  I'm not sure if anyone can create a join that has the same
result?  That would remove the need to reset from_collapse_limit.  Don't
worry if Geo 3.1 solves this though!

For those following I wanted to clarify a couple of things.  I have
loaded the ESRI world dataset, which is 18MiB, and so the whole concept
of IO limitations is doubtful assuming the correct query plan.  Outside
the scope of the issue but FYI the spatial intersect query in ArcMap and
shapefiles takes approximately one second and with SDE binary/Oracle ~3
seconds.
Secondly, although ST_Intersects with open geometries may not be the
most efficient method, my view was that a better query plan would
significantly improve the performance.  The date restriction limits
geolink to about 10 results to join with geometries.  All the database
needs to do is intersect the selected geometry (Asia, with the 10
geometries).

Explaining some of PostgreSQL query plans (The one below uses a new
'continent table'):

-- Additional Table (continent table has no index and contains eight
continent geometries - Africa, Antarctica, Asia, Australia, Europe,
North America, Oceania and South America)
-- 2 rows, ~7000ms.
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, continent
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3
AND ST_Intersects(selected_geography.the_geom, geography.the_geom);

"Hash Join  (cost=31.91..247.09 rows=21 width=8) (actual
time=8226.836..8229.534 rows=2 loops=1)"
"  Hash Cond: (geolink.geom_id = geography.gid)"
"  ->  Seq Scan on gdors_geolink geolink  (cost=0.00..214.00 rows=257
width=8) (actual time=0.046..2.432 rows=275 loops=1)"
"        Filter: ((temp_report_date >= '2008-01-06'::date) AND
(temp_report_date <= '2009-01-06'::date))"
"  ->  Hash  (cost=9.39..9.39 rows=1802 width=4) (actual
time=8226.653..8226.653 rows=104 loops=1)"
"        ->  Nested Loop  (cost=0.00..9.39 rows=1802 width=4) (actual
time=1568.557..8226.289 rows=104 loops=1)"
"              Join Filter: _st_intersects(selected_geography.the_geom,
geography.the_geom)"
"              ->  Seq Scan on continent selected_geography
(cost=0.00..1.10 rows=1 width=364553) (actual time=0.012..0.018 rows=1
loops=1)"
"                    Filter: (gid = 3)"
"              ->  Index Scan using gdors_geography_the_geom on
gdors_geography geography  (cost=0.00..8.27 rows=1 width=2914) (actual
time=2.402..53.014 rows=188 loops=1)"
"                    Index Cond: (selected_geography.the_geom &&
geography.the_geom)"
"                    Filter: (selected_geography.the_geom &&
geography.the_geom)"
"Total runtime: 8229.772 ms"

The above query plan looks better, although my understanding is that it
is still doing a lot of work - creating a (hash) set of results based on
the intersection (many geometries), that for each geolink result the
geography hash table is queried create the join.  To give an example,
assuming 100 geometries are within Asia...but only 10 of them have
reports...it is getting a collection of geolink results (10)...and at
the same time getting a collection of geometries, so 100, and then
joining 10 reports to 10 of the geometries - out of the 100, which we
have done a full intersect query on!?!  It would be quicker to join the
geolink results with geometries and then intersect that small number
(intersect 10...not 100...or 1000s if more geometries).  Please correct
me if I am wrong.  If that is the case then that is only going to be
exacerbated with greater numbers of geometries...rather than collecting
a very small subset of reports first, looking up their geometries, and
then doing a spatial intersection.

FYI I did manage to get the following query run on PostgreSQL 8.2 with a
spatial index on geography.the_geom.

EXPLAIN ANALYZE SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3
AND ST_Intersects(selected_geography.the_geom, geography.the_geom);

"Nested Loop  (cost=0.00..91.85 rows=1 width=8) (actual
time=19204.738..1418353.900 rows=81 loops=1)"
"  ->  Nested Loop  (cost=0.00..16.56 rows=1 width=4) (actual
time=16360.043..1415704.994 rows=6392 loops=1)"
"        Join Filter: _st_intersects(selected_geography.the_geom,
geography.the_geom)"
"        ->  Index Scan using gdors_geography_pkey on gdors_geography
selected_geography  (cost=0.00..8.27 rows=1 width=2910) (actual
time=0.041..0.045 rows=1 loops=1)"
"              Index Cond: (gid = 3)"
"        ->  Index Scan using gdors_geography_the_geom on
gdors_geography geography  (cost=0.00..8.27 rows=1 width=2914) (actual
time=3.584..22581.913 rows=19432 loops=1)"
"              Index Cond: (selected_geography.the_geom &&
geography.the_geom)"
"              Filter: (selected_geography.the_geom &&
geography.the_geom)"
"  ->  Index Scan using gdors_geolink_pkey on gdors_geolink geolink
(cost=0.00..75.28 rows=1 width=8) (actual time=0.401..0.403 rows=0
loops=6392)"
"        Index Cond: (geolink.geom_id = geography.gid)"
"        Filter: ((temp_report_date >= '2008-01-06'::date) AND
(temp_report_date <= '2009-01-06'::date))"
"Total runtime: 1418355.448 ms"

Clearly that query was just slow...intersecting geometries where 99%
(ish) were going to be thrown away.

## Other Questions ##

(IR at Paragon Corp).  Yes I have tried && - shown in the previous e-mails.
I also believe that the plans executing _st_intersects on all of the
geometries whilst index scanning for reports based on date is really
quite suboptimal; I assume that I'm reading the plans correctly :).

(Paul Ramsey).  Your query, below, does not finish either in any
reasonable amount of time (I guess ~23mins).  It is not too surprising
given the number of geometries and plans that my database is coming up
with but it does serve to demonstrate the importance of a good query
plan; your 'Introduction to PostGIS' presentation is great btw.
SELECT COUNT(*) FROM gdors_geography g1, gdors_geography g2 WHERE
ST_Intersects(g1.the_geom, g2.the_geom) AND g2.gid = 3;
Also, you request for a full EXPLAIN ANALYZE I think is answered in the
'continent' table example.

(Regina Obe).  I have not changed the postgresql.conf file although I am
interested in what settings you would change, especially considering the
small amount of the data; FYI the Asia shapefile is ~600k - quite big
for this dataset but still pretty small.  I did experiment with
increasing the shared_buffer size to 512MB.  The Windows install that I
currently have, PostgreSQL 8.2, has the following postgis_full_version()
output:
"POSTGIS="1.3.5" GEOS="3.0.3-CAPI-1.4.2" PROJ="Rel. 4.6.1, 21 August
2008" USE_STATS"
My Ubuntu machine has "POSTGIS="1.3.3" GEOS="3.0.0-CAPI-1.4.1"
PROJ="Rel. 4.6.0, 21 Dec 2007" USE_STATS"
CLEARLY NOT GEOS 3.1.  I guess that could be the primary cause of my
problem?  It doesn't appear that 3.1 is available on Windows.  I guess
it will be easiest for me to compile PostgreSQL on Ubuntu and then apply
the latest PostGIS?  Comments welcome :)
-----------------------------------------
**************************************************************************
The information contained in the EMail and any attachments is
confidential and intended solely and for the attention and use of
the named addressee(s). It may not be disclosed to any other person
without the express authority of the HPA, or the intended
recipient, or both. If you are not the intended recipient, you must
not disclose, copy, distribute or retain this message or any part
of it. This footnote also confirms that this EMail has been swept
for computer viruses, but please re-sweep any attachments before
opening or saving. HTTP://www.HPA.org.uk
**************************************************************************