[postgis-users] Performance issue with ST_INTERSECTS

Thu Nov 3 14:30:09 PDT 2011

Hello List, 

This is my first post so feel free to let me know if I'm missing any critical info or if this is better suited for the pgsql-performance list. 

Essentially, I'm troubleshooting a performance issue with a spatial intersection query. The query sums the lengths of line features that intersect polygon features. The line layer is very dense. The polygon layer only has 8 features, but each polygon does have a lot of points.

It's taking almost 50 minutes to run and I'm wondering if the planner is choosing a less than optimal query plan. One reason I'm thinking this is based on a bad row estimate in the explain analyze output. Can anything be done to resolve this bad estimate? Or is it unavoidable? And more importantly: Is the query plan that the planner is choosing the fastest one that it could use despite the bad row estimate? 

-----Version Info----- 
Production Server: 
PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit 
POSTGIS="1.5.1" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009 " LIBXML="2.7.6" USE_STATS

We also tested the query against a test server with more recent postgres product versions (the same query plan and bad row estimate occurred): 
PostgreSQL 9.1.1 on x86_64-pc-linux-gnu, compiled by gcc-4.6.real (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, 64-bit 
POSTGIS="1.5.3" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009 " LIBXML="2.7.8" USE_STATS (procs from 1.5 r5385 need upgrade)

-----SQL Statement----- 
SELECT r.gid, SUM(ST_LENGTH(ST_INTERSECTION(r.the_geom,n.the_geom))) * 0.000621371192 AS milesum 
FROM psaregions r join nhd100kstreams n ON ST_INTERSECTS(r.the_geom, n.the_geom) 
WHERE n.fcode = 46003 
GROUP BY r.gid 

-----Explain Analyze Output----- 

See here: http://explain.depesz.com/s/Esx 

HashAggregate (cost=509.88..512.02 rows=8 width=1195362) (actual time=2823674 .455..2823674 .458 rows=8 loops=1) 
-> Nested Loop (cost=0.00..70.36 rows=87904 width=1195362) (actual time=23.893..675119.593 rows=90268 loops=1) 
Join Filter: _st_intersects(r.the_geom, n.the_geom) 
-> Seq Scan on psaregions r (cost=0.00..1.08 rows=8 width=1193798) (actual time=0.005..0.019 rows=8 loops=1) 
-> Index Scan using nhd100kstreams_the_geom_gist on nhd100kstreams n (cost=0.00..8.40 rows=1 width=1564) (actual time=0.043..214.728 rows=41631 loops=8) 
Index Cond: (r.the_geom && n.the_geom) 
Filter: (n.fcode = 46003) 
Total runtime: 2823674.517 ms 

***Notice actual rows in the first index scan is 41631, but the estimate is 1. 

-----Additional Info----- 
- I've run vacuum analyze on the tables in question. That has not helped. (Autovacuum does run regularly also.) 
- In terms of history, this query has always been slow. 
- Various configuration settings: 
shared_buffers: 3GB (Machine RAM: 12GB) 
work_mem: 2GB 
maintenance_work_mem: 1GB 
effective_cache_size: 2GB (this was originally set to the default; we increased it to 2GB and restarted the server, but it didn't change the query plan.) 
- Hardware info: Dell PowerEdge R710 running Ubuntu 10.04.2 LTS. 

-----Table and Index Schema----- 

Table "public.psaregions" 
Column | Type | Modifiers 
------------+-----------------------+---------------------------------------------------------- 
gid | integer | not null default nextval('psaregions_gid_seq'::regclass) 
psa_region | character varying(50) | 
np_length | numeric | 
xxx | numeric | 
the_geom | geometry | 
sm_geom | geometry | 
miles46006 | numeric | 
miles46003 | numeric | 
llextent | character varying | 
Indexes: 
"psaregions_pkey" PRIMARY KEY, btree (gid) 
"psaregions_region_idx" btree (psa_region) 
"psaregions_sm_geom_gist" gist (sm_geom) 
"psaregions_the_geom_gist" gist (the_geom) 

Table "public.nhd100kstreams" 
Column | Type | Modifiers 
------------+-----------------------+-------------------------------------------------------------- 
gid | integer | not null default nextval('nhd100kstreams_gid_seq'::regclass) 
objectid | integer | 
comid | integer | 
fdate | date | 
resolution | integer | 
gnis_id | character varying(10) | 
gnis_name | character varying(65) | 
lengthkm | numeric | 
reachcode | character varying(14) | 
flowdir | integer | 
wbareacomi | integer | 
ftype | integer | 
fcode | integer | 
shape_leng | numeric | 
enabled | smallint | 
geo_geom | geometry | 
sm_geom | geometry | 
the_geom | geometry | 
Indexes: 
"nhd100kstreams_pkey" PRIMARY KEY, btree (gid) 
"nhd100kstreams_fcode_idx" btree (fcode) 
"nhd100kstreams_geo_geom_gist" gist (geo_geom) 
"nhd100kstreams_sm_geom_gist" gist (sm_geom) 
"nhd100kstreams_the_geom_gist" gist (the_geom) 

-----Table Metadata: ----- 
psaregions: 8 records; very static, no regular updates/inserts 
nhd100kstreams: 234954 records; 137,676 records with fcode = 46003; very static, no regular updates/inserts

Looking forward to hearing your feedback. 

Thank you kindly, 
Shira Bezalel