[postgis-users] trouble getting nearest neighbor with ST_DWithin and ST_Distance

karsten vennemann karsten at terragis.net
Wed Sep 16 00:45:18 PDT 2009

Hi PostGISers,
sorry this is a long one ;)
I am trying to use the ST_DWithin and ST_Distance functions to get a list of nearest neighboring roads in relation to a point location. The dataset is big its the tiger shape files which I loaded into PostGIS and clustered on the gist_index, SRID is 9102003 (or EPSG code 102003) an Albers-Equal-Area projection which has units in meters. I researched earlier posts and it seemed that several recommendations pointed out that the use of ST_DWithin and ST_Distance would do exactly what I wanted to do.
I created an spatial index on the roads layer and afterwards ran vacuum analyze on the roads layer.
CREATE INDEX roads_gidx ON roads USING gist (the_geom); ALTER TABLE roads CLUSTER ON roads_gidx;
and a second index on the transformed geom:
CREATE INDEX roads_tranform_9102003_gidx ON roads USING gist (st_transform(the_geom, 9102003)) WHERE the_geom IS NOT NULL;
My "dream" SQL query which I wanted to run is "Find all road segments within 50 meter from a point"" e.g. like this

SELECT * FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003), GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50) 
ORDER BY ST_Distance(transform(roads.the_geom,9102003), GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003)) limit 1;
However this query was so slow that I never managed yet to wait it out yet :(  .. I believe it would be a couple of hours  ...
Thus I tried to run it in parts to fnd out what is =going on
Since no index was used I tried to convince the query planner to use a bounding box (which ST_DWithin should use anyway) by adding one manually:
SELECT * FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003),  GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003), 50) and  
transform(roads.the_geom,9102003) && GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003) LIMIT 1;
The query returns one row and 
Explain Analyze give me:
"Limit  (cost=0.00..2450234.77 rows=1 width=879) (actual time=1277669.052..1277669.053 rows=1 loops=1)"
"  ->  Seq Scan on roads  (cost=0.00..2450234.77 rows=1 width=879) (actual time=1277669.051..1277669.051 rows=1 loops=1)"
"        Filter: ((transform(the_geom, 9102003) && '0103000020B3E28A000100000005000000000000C01B3F41C100000040B0BB0E41000000C01B3F41C100000060D0BE0E41000000A0E93E41C100000060D0BE0E41000000A0E93E41C100000040B0BB0E41000000C01B3F41C100000040B0BB0E41'::geometry) AND _st_dwithin(transform(the_geom, 9102003), '0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry, 50::double precision) AND (transform(the_geom, 9102003) && '0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry) AND ('0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry && st_expand(transform(the_geom, 9102003), 50::double precision)))"
"Total runtime: 1277669.134 ms" 
I am wondering why the spatial index was not used which - at least in theory - should be obvious to the query planner ....
This discussion forum thread from August 2009 discusses similar issues:
 <http://www.nabble.com/-PostGIS---230%3A-st_expand-seems-to-affect-the-execution-order-wich-affects-st_dwithin-tt24809346.html#a24809346> http://www.nabble.com/-PostGIS---230%3A-st_expand-seems-to-affect-the-execution-order-wich-affects-st_dwithin-tt24809346.html#a24809346
Namely that PostGIS performs an sequential scan instead of an index scan. From what I understand in the above thread it seems that the query planner for large/huge tables (like my tables of all US roads loaded from TIGER) does somehow not get the correct cost inputs to make the decision to use the spatial index and this might get solved in PostGIS 1.1 or 1.5 possibly (I'm using PostGIS 1.3.6RC1)?
I also tried to change the cost for the functions used to convince PostGIS to do the bounding box operation first but with no success. I also temporarily to "disabling" or better "discouraging" seq scans by: set enable_seqscan=false and after the query setting it back to "normal": set enable_seqscan=true;  - still PostGIS does the seq scan first.

Another forum thread seems to report  the same issue:  <http://www.nabble.com/use-of-spatial-index-in-nearest-neighbour-query---to25291925.html#a25291925> http://www.nabble.com/use-of-spatial-index-in-nearest-neighbour-query---to25291925.html#a25291925
So would anybody have suggestions how I can work around this behavior and still get the closets roads from a point location or will that just not work right now on huge data sets?
Is there any way to really force PostGIS to use the gist index ?
I am using the following software versions:
PostgreSQL 8.3.7 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu 4.3.3-5ubuntu4) 4.3.3
POSTGIS="1.3.6RC1" GEOS="3.1.0-CAPI-1.5.0" PROJ="Rel. 4.6.1, 21 August 2008" USE_STATS

Karsten Vennemann
Seattle, WA, USA 
 <http://www.terragis.net> www.terragis.net
More details:
Strangely the above query also works faster when using SRID 4326 in degress which doesn't seem to make sense (radius = 50 degrees - no way I don't get it)
SELECT * FROM roads WHERE ST_DWithin(roads.the_geom, transform(GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003),4326), 50) and  
roads.the_geom && transform(GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003),4326) LIMIT 1;
This returns one record in 4607 ms.
Explain of this query spits out:
"Limit  (cost=0.00..9.24 rows=1 width=879) (actual time=437.187..437.188 rows=1 loops=1)"
"  ->  Index Scan using roads_gidx on roads  (cost=0.00..9.24 rows=1 width=879) (actual time=437.184..437.184 rows=1 loops=1)"
"        Index Cond: ((the_geom && '0103000020E6100000010000000500000000000060837E65C000000080970C2AC000000060837E65C0000000206DBE5540000000A006FD51C0000000206DBE5540000000A006FD51C000000080970C2AC000000060837E65C000000080970C2AC0'::geometry) AND (the_geom && '0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry))"
"        Filter: ((the_geom && '0103000020E6100000010000000500000000000060837E65C000000080970C2AC000000060837E65C0000000206DBE5540000000A006FD51C0000000206DBE5540000000A006FD51C000000080970C2AC000000060837E65C000000080970C2AC0'::geometry) AND _st_dwithin(the_geom, '0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry, 50::double precision) AND (the_geom && '0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry) AND ('0101000020E610000018925BB2067D5EC0C8CF2D22DA7C4240'::geometry && st_expand(the_geom, 50::double precision)))"
"Total runtime: 437.500 ms"

I also tried to use a bounding box with a manually created polygon but I am running into the same issues - its running slow doing a seq scan: 1519138 ms:
SELECT * FROM roads WHERE transform(roads.the_geom,9102003) && ST_Polygon(ST_GeomFromText('LINESTRING(-2026098.7705385 1354954.1944117,-2026979.4410859 1354954.1944117,-2026979.4410859 1356078.528747,-2026098.7705385 1356078.528747,-2026098.7705385 1354954.1944117)'), 9102003) and ST_intersects(transform(roads.the_geom,9102003), ST_Polygon(ST_GeomFromText('LINESTRING(-2026098.7705385 1354954.1944117,-2026979.4410859 1354954.1944117,-2026979.4410859 1356078.528747,-2026098.7705385 1356078.528747,-2026098.7705385 1354954.1944117)'), 9102003));
Explain Analyze of the query above gives me:
"Seq Scan on roads  (cost=0.00..2144881.90 rows=1 width=879) (actual time=1491788.995..1519081.606 rows=45 loops=1)"
"  Filter: ((transform(the_geom, 9102003) && '0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry) AND (transform(the_geom, 9102003) && '0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry) AND _st_intersects(transform(the_geom, 9102003), '0103000020B3E28A000100000005000000DA0242C572EA3EC115F7C431CAAC34416B01EB70E3ED3EC115F7C431CAAC34416B01EB70E3ED3EC1A1F65B872EB13441DA0242C572EA3EC1A1F65B872EB13441DA0242C572EA3EC115F7C431CAAC3441'::geometry))"
"Total runtime: 1519138.528 ms"

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20090916/b5232d94/attachment.html>

More information about the postgis-users mailing list