[postgis-users] https://postgis.net/docs/geometry_distance_knn.html and index usage

Thu Jun 23 02:00:22 PDT 2022

>From: postgis-users <postgis-users-bounces at lists.osgeo.org> on behalf of Ben Madin <ben at ausvet.com.au>
>Sent: Thursday, June 23, 2022 4:04 AM
>To: PostGIS Users Discussion <postgis-users at lists.osgeo.org>
>Subject: Re: [postgis-users] https://postgis.net/docs/geometry_distance_knn.html and index usage
>
>Hi Lars,
>
>I'm just a user, but we have handled some pretty big datasets so I can understand your problem. I might be misunderstanding, but to confirm you are asking the system to cross join all of the possible points (something like 63483 * 4 * 542 * 4), use st_distance to convert them to geography and calculate the distance between every combination of these points, store this in an unindexed in memory table and and then sort it by a different but as yet uncalculated value (using the knn operator) so that you can find the largest value?
>
>I wonder if the search strategy could be refined.
>
>a) The Explain output suggests to me that most of the time in the query is calculating all the distances in the join table (and I'm guessing calculations). Having done this, why then invoke the knn approach to the same points to find the distances to sort, when you have already calculated this in the first join operation. It may be more efficient to sort by the result of the st_distance function.
>
>b) rather than force the calculation of distance between all the points geographically, find the furthest apart polygons using their inherent geometry, then calculated the exact distance for only the two points.
>
>I'd play around a bit, but the srid you are using covers a small area, so differences between using spheroid and not should not create massive errors.
>
>cheers
>
>Ben

Hi

Thanks for comments I played around this more after reading your mail.

First here is some more background info.

This tables represent work areas and and rows are moved between g2_job_done and g2_job_ready_to_start all the time.

We have 40 workers ready to do jobs and they work in parallel , but we want them do work as far part as possible .

So the first work is to find a place as far away other jobs done/started as possible to reduce spatial conflicts and database deadlocks.
To find the next work cell on using Postgres advisory locks to avoid concurrency errors.

Here are the number of rows in base tables now, the 542 is after an ST_union and then doing ST_dump.

SELECT count(*) from g2_job_done;
 count
-------
  2677

SELECT count(*) from g2_job_ready_to_start;
 count
-------
 63483

I played around with CROSS JOIN LATERAL and that reduced the time more than 50 %.
The light  red cell are jobs ready to start the yellow cells are done/started jobs

In the attached image the SQL below returns the green "diamond".

EXPLAIN ANALYZE
SELECT g2.id AS next_id,
       g2.geo::geometry(Polygon,4258) AS next_geo,
       g2.dist
FROM
(SELECT ST_Union(geo)::geometry(MultiPolygon,4258) AS geo FROM g2_job_done g1) g1
CROSS JOIN LATERAL (
  SELECT g2.geo, g2.id, g1.geo <-> g2.geo AS dist
  FROM g2_job_ready_to_start AS g2
  ORDER BY dist desc
) g2
LIMIT 1;

                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=86075.54..86075.56 rows=1 width=132) (actual time=26985.368..26985.370 rows=1 loops=1)
   ->  Nested Loop  (cost=86075.54..87503.93 rows=63483 width=132) (actual time=26985.365..26985.367 rows=1 loops=1)
         ->  Aggregate  (cost=412.39..412.41 rows=1 width=32) (actual time=232.006..232.007 rows=1 loops=1)
               ->  Seq Scan on g2_job_done g1  (cost=0.00..52.77 rows=2677 width=120) (actual time=0.018..1.010 rows=2677 loops=1)
         ->  Sort  (cost=85663.15..85821.85 rows=63483 width=132) (actual time=26753.353..26753.353 rows=1 loops=1)
               Sort Key: ((((st_union(g1.geo))::geometry(MultiPolygon,4258)) <-> g2.geo)) DESC
               Sort Method: quicksort  Memory: 18399kB
               ->  Seq Scan on g2_job_ready_to_start g2  (cost=0.00..80599.08 rows=63483 width=132) (actual time=0.253..26699.266 rows=63483 loops=1)
 Planning Time: 0.729 ms
 Execution Time: 26986.358 ms
(10 rows)

I have tested with point as you suggested like this,

CREATE TABLE g2_job_ready_to_start_p AS SELECT
ST_PointOnSurface(g.geo)::geometry(Point,4258) as geo
FROM g2_job_ready_to_start g;
CREATE INDEX ON g2_job_ready_to_start_p USING gist (geo);
ALTER TABLE g2_job_ready_to_start_p ADD COLUMN id serial primary key;

EXPLAIN ANALYZE
SELECT g2.id AS next_id,
       g2.geo::geometry(Point,4258) AS next_geo,
       g2.dist
FROM
(SELECT ST_Union(geo)::geometry(MultiPolygon,4258) AS geo FROM g2_job_done g1) g1
CROSS JOIN LATERAL (
  SELECT g2.geo, g2.id, g1.geo <-> g2.geo AS dist
  FROM g2_job_ready_to_start_p AS g2
  ORDER BY dist desc
) g2
LIMIT 1;

 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=85730.04..85730.06 rows=1 width=44) (actual time=10420.754..10420.756 rows=1 loops=1)
   ->  Nested Loop  (cost=85730.04..87158.43 rows=63483 width=44) (actual time=10420.751..10420.752 rows=1 loops=1)
         ->  Aggregate  (cost=412.39..412.41 rows=1 width=32) (actual time=231.318..231.319 rows=1 loops=1)
               ->  Seq Scan on g2_job_done g1  (cost=0.00..52.77 rows=2677 width=120) (actual time=0.016..0.624 rows=2677 loops=1)
         ->  Sort  (cost=85317.65..85476.35 rows=63483 width=44) (actual time=10189.429..10189.429 rows=1 loops=1)
               Sort Key: ((((st_union(g1.geo))::geometry(MultiPolygon,4258)) <-> g2.geo)) DESC
               Sort Method: quicksort  Memory: 6496kB
               ->  Seq Scan on g2_job_ready_to_start_p g2  (cost=0.00..80253.58 rows=63483 width=44) (actual time=0.182..10159.231 rows=63483 loops=1)
 Planning Time: 0.526 ms
 Execution Time: 10420.818 ms
(10 rows)

Then we use the g2_job_ready_to_start_p and that reduced the time down 10 seconds but that cases another problem as you see in the image.
In the attached image the above SQL returns the dark red point.

Because the cells have different sizes because we use content based grids (smaller celle where have density of data) so may then get cells that share a border with another cell ready for work.

Since indexes only seems work when we need find close cell a not far away cells as I was hoping. If this is expected behavior, we can not use this index straight forward  .

In some cases it now take more time find out where to work than actually doing the work :) and I was hoping to speed up this a lot by using a simple index.

Thanks

Lars

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20220623/08155eb5/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot 2022-06-23 at 10.55.25.png
Type: image/png
Size: 81121 bytes
Desc: Screenshot 2022-06-23 at 10.55.25.png
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20220623/08155eb5/attachment.png>