[postgis-users] https://postgis.net/docs/geometry_distance_knn.html and index usage

Wed Jun 22 19:04:19 PDT 2022

Hi Lars,

I'm just a user, but we have handled some pretty big datasets so I can
understand your problem. I might be misunderstanding, but to confirm you
are asking the system to cross join all of the possible points (something
like 63483 * 4 * 542 * 4), use st_distance to convert them to geography and
calculate the distance between every combination of these points, store
this in an unindexed in memory table and and then sort it by a different
but as yet uncalculated value (using the knn operator) so that you can find
the largest value?

I wonder if the search strategy could be refined.

a) The Explain output suggests to me that most of the time in the query is
calculating all the distances in the join table (and I'm guessing
calculations). Having done this, why then invoke the knn approach to the
same points to find the distances to sort, when you have already calculated
this in the first join operation. It may be more efficient to sort by the
result of the st_distance function.

b) rather than force the calculation of distance between all the points
geographically, find the furthest apart polygons using their inherent
geometry, then calculated the exact distance for only the two points.

I'd play around a bit, but the srid you are using covers a small area, so
differences between using spheroid and not should not create massive errors.

cheers

Ben

On Wed, 22 Jun 2022 at 22:22, Lars Aksel Opsahl <Lars.Opsahl at nibio.no>
wrote:

> Hi
>
> I have two simple tables.
>
>   Table "pg_temp_30.g1temp"
>  Column |          Type          | Collation | Nullable |
> Default
>
> --------+------------------------+-----------+----------+------------------------------------
>  geo    | geometry(Polygon,4258) |           |          |
>  id     | integer                |           | not null | nextval(
> 'g1temp_id_seq'::regclass)
> Indexes:
>     "g1temp_pkey" PRIMARY KEY, btree (id)
>     "g1temp_geo_idx" gist (geo)
>
>
>                                   Table "pg_temp_30.g2temp"
>  Column |          Type          | Collation | Nullable |
> Default
>
> --------+------------------------+-----------+----------+------------------------------------
>  geo    | geometry(Polygon,4258) |           |          |
>  id     | integer                |           | not null | nextval(
> 'g2temp_id_seq'::regclass)
> Indexes:
>     "g2temp_pkey" PRIMARY KEY, btree (id)
>     "g2temp_geo_idx" gist (geo)
>
>
> And I need to find the polygon in g2temp that has the longest distance to
> any polygon in d1temp.
> g1temp has 542 rows and table g2temp has 63483 rows. All polygons are
> simple rectangles.
>
> I run analyze
> analyze g1temp;
> analyze g2temp;
>
>
> And then run this query
>
> EXPLAIN ANALYZE
> WITH index_query AS (
> SELECT g2.geo , ST_Distance(g1.geo, g2.geo,true) AS d, 1 as runnning_jos
> FROM
> g1temp AS g1,
> g2temp AS g2
> ORDER BY g1.geo <-> g2.geo desc
> LIMIT 1
> )
> SELECT *
> FROM index_query;
>
> And I get this
>
> QUERY PLAN
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------------
>  Subquery Scan on index_query  (cost=43613127.36..43613153.64 rows=1
> width=132) (actual time=69029.773..69029.775 rows=1 loops=1)
>    ->  Limit  (cost=43613127.36..43613153.63 rows=1 width=140) (actual
> time=69029.771..69029.772 rows=1 loops=1)
>          ->  Result  (cost=43613127.36..947419646.12 rows=34407786
> width=140) (actual time=69029.769..69029.770 rows=1 loops=1)
>                ->  Sort  (cost=43613127.36..43699146.83 rows=34407786
> width=292) (actual time=69019.575..69019.576 rows=1 loops=1)
>                      Sort Key: ((g1.geo <-> g2.geo)) DESC
>                      Sort Method: top-N heapsort  Memory: 29kB
>                      ->  Nested Loop  (cost=0.00..43441088.43
> rows=34407786 width=292) (actual time=0.184..58446.116 rows=34407786
> loops=1)
>                            ->  Seq Scan on g2temp g2  (cost=0.00..1245.33
> rows=63483 width=120) (actual time=0.020..31.744 rows=63483 loops=1)
>                            ->  Materialize  (cost=0.00..14.63 rows=542
> width=160) (actual time=0.000..0.030 rows=542 loops=63483)
>                                  ->  Seq Scan on g1temp g1  (cost=0.00..11.92
> rows=542 width=160) (actual time=0.014..0.202 rows=542 loops=1)
>  Planning Time: 0.320 ms
>  Execution Time: 69029.872 ms
> (12 rows)
>
> I do understand that this may take time, but I wonder why I see this "Seq
> scan" both on g1 and g2 ?
>
> From what I read on https://postgis.net/docs/geometry_distance_knn.html it
> should use indexes  in some way.
>
> Have also tested with casting geography when creating table g1 and g2 and
> that did not make any difference.
>
> Tested also with big workmem (1000 MB ) and the result was the same.
>
> I am running on POSTGIS="3.3.0dev 3.1.0alpha2-1532-gc8eedf3ae"
> [EXTENSION] PGSQL="120" GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1"
> LIBXML="2.9.10" LIBJSON="0.13.1" LIBPROTOBUF="1.3.3" WAGYU="0.5.0
> (Internal)" TOPOLOGY
> (1 row)
>
> Thanks.
>
> Lars
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
>

-- 

[image: Ausvet Logo] <https://www.ausvet.com.au/>

Dr Ben Madin

BVMS MVPHMgmt PhD MANZCVS GAICD
Managing Director
Mobile:
+61 448 887 220 <+61448887220>
E-mail:
ben at ausvet.com.au
Website:
www.ausvet.com.au
Skype: benmadin
Address:
5 Shuffrey Street
Fremantle, WA 6160
Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20220623/d8a73f1f/attachment.htm>