[postgis-users] trouble getting nearest neighbor with ST_DWithin and ST_Distance

Wed Sep 16 01:39:39 PDT 2009

karsten vennemann wrote:

> Hi PostGISers,
> sorry this is a long one ;)
>  
> I am trying to use the ST_DWithin and ST_Distance functions to get a 
> list of nearest neighboring roads in relation to a point location. The 
> dataset is big its the tiger shape files which I loaded into PostGIS and 
> clustered on the gist_index, SRID is 9102003 (or EPSG code 102003) an 
> Albers-Equal-Area projection which has units in meters. I researched 
> earlier posts and it seemed that several recommendations pointed out 
> that the use of ST_DWithin and ST_Distance would do exactly what I 
> wanted to do.
>  
> I created an spatial index on the roads layer and afterwards ran vacuum 
> analyze on the roads layer.
> CREATE INDEX roads_gidx ON roads USING gist (the_geom); ALTER TABLE 
> roads CLUSTER ON roads_gidx;
>  
> and a second index on the transformed geom:
> CREATE INDEX roads_tranform_9102003_gidx ON roads USING gist 
> (st_transform(the_geom, 9102003)) WHERE the_geom IS NOT NULL;
>  
> My "dream" SQL query which I wanted to run is "Find all road segments 
> within 50 meter from a point"" e.g. like this
> SELECT * FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003), 
> GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50)
> ORDER BY ST_Distance(transform(roads.the_geom,9102003), 
> GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003)) limit 1;
>  
> However this query was so slow that I never managed yet to wait it out 
> yet :(  .. I believe it would be a couple of hours  ...
>  
> Thus I tried to run it in parts to fnd out what is =going on
> Since no index was used I tried to convince the query planner to use a 
> bounding box (which ST_DWithin should use anyway) by adding one manually:
>  
> SELECT * FROM roads WHERE ST_DWithin(transform(roads.the_geom,9102003),  
> GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003), 50) and 
> transform(roads.the_geom,9102003) && 
> GeomFromText('POINT(-2260485.2563774 251816.03973006)', 9102003) LIMIT 1;
> The query returns one row and
> Explain Analyze give me:
>  
> "Limit  (cost=0.00..2450234.77 rows=1 width=879) (actual 
> time=1277669.052..1277669.053 rows=1 loops=1)"
> "  ->  Seq Scan on roads  (cost=0.00..2450234.77 rows=1 width=879) 
> (actual time=1277669.051..1277669.051 rows=1 loops=1)"
> "        Filter: ((transform(the_geom, 9102003) && 
> '0103000020B3E28A000100000005000000000000C01B3F41C100000040B0BB0E41000000C01B3F41C100000060D0BE0E41000000A0E93E41C100000060D0BE0E41000000A0E93E41C100000040B0BB0E41000000C01B3F41C100000040B0BB0E41'::geometry) 
> AND _st_dwithin(transform(the_geom, 9102003), 
> '0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry, 
> 50::double precision) AND (transform(the_geom, 9102003) && 
> '0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry) AND 
> ('0101000020B3E28A0082F9D0A0023F41C163FE5D5140BD0E41'::geometry && 
> st_expand(transform(the_geom, 9102003), 50::double precision)))"
> "Total runtime: *1277669*.134 *ms*"
>  
> woaahh!
> I am wondering why the spatial index was not used which - at least in 
> theory - should be obvious to the query planner ....
>  
> This discussion forum thread from August 2009 discusses similar issues:
> http://www.nabble.com/-PostGIS---230%3A-st_expand-seems-to-affect-the-execution-order-wich-affects-st_dwithin-tt24809346.html#a24809346
>  
> Namely that PostGIS performs an sequential scan instead of an index 
> scan. From what I understand in the above thread it seems that the query 
> planner for large/huge tables (like my tables of all US roads loaded 
> from TIGER) does somehow not get the correct cost inputs to make the 
> decision to use the spatial index and this might get solved in PostGIS 
> 1.1 or 1.5 possibly (I'm using PostGIS 1.3.6RC1)?
>  
> I also tried to change the cost for the functions used to convince 
> PostGIS to do the bounding box operation first but with no success. I 
> also temporarily to "disabling" or better "discouraging" seq scans by: 
> set enable_seqscan=false and after the query setting it back to 
> "normal": set enable_seqscan=true;  - still PostGIS does the seq scan first.
> Another forum thread seems to report  the same issue: 
> http://www.nabble.com/use-of-spatial-index-in-nearest-neighbour-query---to25291925.html#a25291925
>  
> So would anybody have suggestions how I can work around this behavior 
> and still get the closets roads from a point location or will that just 
> not work right now on huge data sets?
> Is there any way to really force PostGIS to use the gist index ?
>  
> I am using the following software versions:
> PostgreSQL 8.3.7 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real 
> (Ubuntu 4.3.3-5ubuntu4) 4.3.3
> POSTGIS="1.3.6RC1" GEOS="3.1.0-CAPI-1.5.0" PROJ="Rel. 4.6.1, 21 August 
> 2008" USE_STATS
>  
> Cheers
> Karsten

Hi Karsten,

What is the SRID of the geometry column in your roads table? The reason 
I ask is that in your query you're forcing PostGIS to reproject every 
single row on the roads table which is why it seq scans.

Assuming that all the geometries in your roads table are SRID 9102003, 
then as a starting point, the following query should be quick:

SELECT * FROM roads WHERE ST_DWithin(roads.the_geom, 
GeomFromText('POINT(-2026631.3499604 1355302.2855377)', 9102003), 50)

HTH,

Mark.

-- 
Mark Cave-Ayland - Senior Technical Architect
PostgreSQL - PostGIS
Sirius Corporation plc - control through freedom
http://www.siriusit.co.uk
t: +44 870 608 0063

Sirius Labs: http://www.siriusit.co.uk/labs