[postgis-users] Spatial Join Index Use
Paul Ramsey
pramsey at cleverelephant.ca
Thu Jan 7 13:04:05 PST 2021
Probably your trajectories are not all that spatially distinct, so the spatial index doesn't give you much leverage. You might want to chop them up into much smaller parts. At a minimum, from a storage/speed point of view, getting them below the page size (~200 points) will make retrieval faster, while also making them spatiall way more atomic, so the index has something to work with.
P.
> On Jan 7, 2021, at 1:01 PM, Temir Karakurum <temirkarakurum at gmail.com> wrote:
>
> Hi Paul,
>
> Thanks for your prompt response (big fan of your work!)
>
> I tried it out with 0.01 and 0.001 degrees after your message.
> It is the same query plan but actually takes longer .
>
> 0.1 degrees ~ 360 sec, 0.01 degrees ~ 760 sec, 0.001 degrees ~ 1013 sec
>
> As per the properties of data, these are GPS traces collected in Beijing. There are 18670 trajectories. Average trajectory length is 1332 points, whereas max traj length is 92K and min traj length is 3 points. There are ~ 25 million points in the whole thing. Table takes 330MB disk space on my machine.
>
> tf (the result of temporal filtering) is only 63047 rows.
>
> For 0.001 degrees, the query plan is as follows:
>
> "Gather (cost=38404.10..557088.43 rows=23555 width=8) (actual time=1252.954..1013020.530 rows=17319 loops=1)"
> " Workers Planned: 2"
> " Workers Launched: 2"
> " -> Parallel Hash Join (cost=37404.10..553732.93 rows=9815 width=8) (actual time=600.525..1011078.265 rows=5773 loops=3)"
> " Hash Cond: (tf.id2 = t2.id)"
> " Join Filter: st_dwithin(t1.track, t2.track, '0.001'::double precision)"
> " Rows Removed by Join Filter: 15243"
> " -> Parallel Hash Join (cost=870.07..7313.93 rows=26270 width=32026) (actual time=3.574..9.264 rows=21016 loops=3)"
> " Hash Cond: (t1.id = tf.id1)"
> " -> Parallel Seq Scan on traj2 t1 (cost=0.00..6003.79 rows=7779 width=32022) (actual time=0.004..2.650 rows=6223 loops=3)"
> " -> Parallel Hash (cost=541.70..541.70 rows=26270 width=8) (actual time=3.429..3.429 rows=21016 loops=3)"
> " Buckets: 65536 Batches: 1 Memory Usage: 3008kB"
> " -> Parallel Seq Scan on tf (cost=0.00..541.70 rows=26270 width=8) (actual time=0.007..1.096 rows=21016 loops=3)"
> " -> Parallel Hash (cost=6003.79..6003.79 rows=7779 width=32022) (actual time=157.429..157.430 rows=6223 loops=3)"
> " Buckets: 4096 Batches: 16 Memory Usage: 2688kB"
> " -> Parallel Seq Scan on traj2 t2 (cost=0.00..6003.79 rows=7779 width=32022) (actual time=131.213..132.778 rows=6223 loops=3)"
> "Planning Time: 0.680 ms"
> "JIT:"
> " Functions: 60"
> " Options: Inlining true, Optimization true, Expressions true, Deforming true"
> " Timing: Generation 5.081 ms, Inlining 95.254 ms, Optimization 183.575 ms, Emission 114.379 ms, Total 398.289 ms"
> "Execution Time: 1013024.443 ms"
>
> Best,
> Temir
>
> On Thu, Jan 7, 2021 at 11:01 PM <postgis-users-request at lists.osgeo.org> wrote:
> Send postgis-users mailing list submissions to
> postgis-users at lists.osgeo.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> or, via email, send a message with subject or body 'help' to
> postgis-users-request at lists.osgeo.org
>
> You can reach the person managing the list at
> postgis-users-owner at lists.osgeo.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of postgis-users digest..."
>
>
> Today's Topics:
>
> 1. Spatial Join Index Use (Temir Karakurum)
> 2. Re: Spatial Join Index Use (Paul Ramsey)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 7 Jan 2021 22:45:45 +0300
> From: Temir Karakurum <temirkarakurum at gmail.com>
> To: postgis-users at lists.osgeo.org
> Subject: [postgis-users] Spatial Join Index Use
> Message-ID:
> <CANP_axJ9STdN+rudRw+KngZ-bzRZBbFVfms7GUtxwo52okt=Ew at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi everyone,
>
> I have GPS trajectory data in LineStringM format (GeoLife data formatted as
> described in Anita Graser's blog post). I am trying to find trajectories
> that are close to each other via ST_DWithin and ST_CPAWithin.
>
> I have pre-filtered the trajectories with respect to time-ranges
> (tstzrange) and saved results in another table named `tf`. This table has
> two fields id1 and id2 corresponding to the pairs that satisfy the temporal
> filtering (overlap via &&) requirements.
>
> Then I apply ST_DWithin to find trajectories within 0.1 degrees of each
> other. Ideally this would be the second pre-filtering step before finally
> refining with the spatiotemporal function CPAWithin. However, I cannot seem
> to get postgis to use the available indexes. I ran the following query:
>
> EXPLAIN ANALYZE
> SELECT t1.id, t2.id
> FROM tf
> JOIN traj2 t1 ON t1.id = tf.id1
> JOIN traj2 t2 ON t2.id = tf.id2
> WHERE ST_DWithin(t1.track, t2.track,0.1)
>
> I have the following indexes on the GeoLife trajectory table called `traj2`:
> Indexes:
> "idx_traj2_ep_geog" gist (geography(end_point))
> "idx_traj2_ep_geom" gist (end_point)
> "idx_traj2_id" btree (id)
> "idx_traj2_sp_geog" gist (geography(start_point))
> "idx_traj2_sp_geom" gist (start_point)
> "idx_traj2_time_range" gist (time_range)
> "idx_traj2_time_track2d" gist (track)
> "idx_traj2_time_track2d_geog" gist (geography(track))
> "idx_traj2_time_track3d" gist (track gist_geometry_ops_nd)
> "idx_traj2_tracker" btree (tracker)
> Access method: heap
> Options: parallel_workers=8
>
> and the following on `tf`:
> Indexes:
> "idx_tf_id1" btree (id1)
> "idx_tf_id2" btree (id2)
> Access method: heap
> Options: parallel_workers=8
>
> I have run VACUUM ANALYZE before running EXPLAIN ANALYZE and the results
> are as follows:
>
> "Gather (cost=38404.10..557039.81 rows=23555 width=8) (actual
> time=182.494..362202.326 rows=48253 loops=1)"
> " Workers Planned: 2"
> " Workers Launched: 2"
> " -> Parallel Hash Join (cost=37404.10..553684.31 rows=9815 width=8)
> (actual time=217.140..361763.181 rows=16084 loops=3)"
> " Hash Cond: (tf.id1 = t1.id)"
> " Join Filter: st_dwithin(t1.track, t2.track, '0.1'::double
> precision)"
> " Rows Removed by Join Filter: 4931"
> " -> Parallel Hash Join (cost=870.07..7265.31 rows=26270
> width=32026) (actual time=3.649..10.452 rows=21016 loops=3)"
> " Hash Cond: (t2.id = tf.id2)"
> " -> Parallel Seq Scan on traj2 t2 (cost=0.00..6003.79
> rows=7779 width=32022) (actual time=0.003..2.695 rows=6223 loops=3)"
> " -> Parallel Hash (cost=541.70..541.70 rows=26270 width=8)
> (actual time=3.452..3.452 rows=21016 loops=3)"
> " Buckets: 65536 Batches: 1 Memory Usage: 3008kB"
> " -> Parallel Seq Scan on tf (cost=0.00..541.70
> rows=26270 width=8) (actual time=0.008..0.930 rows=21016 loops=3)"
> " -> Parallel Hash (cost=6003.79..6003.79 rows=7779 width=32022)
> (actual time=132.408..132.408 rows=6223 loops=3)"
> " Buckets: 4096 Batches: 16 Memory Usage: 2688kB"
> " -> Parallel Seq Scan on traj2 t1 (cost=0.00..6003.79
> rows=7779 width=32022) (actual time=121.500..123.018 rows=6223 loops=3)"
> "Planning Time: 3.335 ms"
> "JIT:"
> " Functions: 60"
> " Options: Inlining true, Optimization true, Expressions true, Deforming
> true"
> " Timing: Generation 3.102 ms, Inlining 79.410 ms, Optimization 173.707
> ms, Emission 110.965 ms, Total 367.184 ms"
> "Execution Time: 362206.853 ms"
>
> My current settings are
> shared_buffers: 2GB, work_mem: 64MB, maintenance_work_mem: 512MB,
> effective_cache_size: 4GB.
> (I am on a Ubuntu 18.04 laptop machine with 16 GB RAM and i7 2.30 GHz 16
> core CPU) (I have the following postgis installation: POSTGIS="3.1.0
> 5e2af69" [EXTENSION] PGSQL="120" GEOS="3.7.1-CAPI-1.11.1 27a5e771"
> PROJ="Rel. 4.9.3, 15 August 2016" LIBXML="2.9.4" LIBJSON="0.12.1"
> LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" )
>
> I have also tried setting STORAGE EXTERNAL as per Paul Ramsay's blog post
> on spatial joins to no avail.
>
> As a final note, I have SET parallel_workers=8 but query planner seems to
> be stuck with 2 workers launched/planned.
>
> I am fairly new to GIS and postgis altogether, so I might be missing
> something very simple and straightforward. I would be grateful if you guys
> can point me in the right direction.
>
> Best regards,
> Temir
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20210107/ace0841a/attachment-0001.html>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 7 Jan 2021 11:53:03 -0800
> From: Paul Ramsey <pramsey at cleverelephant.ca>
> To: PostGIS Users Discussion <postgis-users at lists.osgeo.org>
> Subject: Re: [postgis-users] Spatial Join Index Use
> Message-ID: <F6650C97-BA31-484A-A1EE-1C3D60D159B8 at cleverelephant.ca>
> Content-Type: text/plain; charset=us-ascii
>
> Without more analysis if your data, it's hard to say, but an initial guess is that 0.1 degrees is actually quite a large distance so the st_dwithin join condition just isn't selective enough to run indexed in one of the inner filters. If you're *sure* you want it run first, you could try hauling it out into a CTE.
>
> P
>
> > On Jan 7, 2021, at 11:45 AM, Temir Karakurum <temirkarakurum at gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I have GPS trajectory data in LineStringM format (GeoLife data formatted as described in Anita Graser's blog post). I am trying to find trajectories that are close to each other via ST_DWithin and ST_CPAWithin.
> >
> > I have pre-filtered the trajectories with respect to time-ranges (tstzrange) and saved results in another table named `tf`. This table has two fields id1 and id2 corresponding to the pairs that satisfy the temporal filtering (overlap via &&) requirements.
> >
> > Then I apply ST_DWithin to find trajectories within 0.1 degrees of each other. Ideally this would be the second pre-filtering step before finally refining with the spatiotemporal function CPAWithin. However, I cannot seem to get postgis to use the available indexes. I ran the following query:
> >
> > EXPLAIN ANALYZE
> > SELECT t1.id, t2.id
> > FROM tf
> > JOIN traj2 t1 ON t1.id = tf.id1
> > JOIN traj2 t2 ON t2.id = tf.id2
> > WHERE ST_DWithin(t1.track, t2.track,0.1)
> >
> > I have the following indexes on the GeoLife trajectory table called `traj2`:
> > Indexes:
> > "idx_traj2_ep_geog" gist (geography(end_point))
> > "idx_traj2_ep_geom" gist (end_point)
> > "idx_traj2_id" btree (id)
> > "idx_traj2_sp_geog" gist (geography(start_point))
> > "idx_traj2_sp_geom" gist (start_point)
> > "idx_traj2_time_range" gist (time_range)
> > "idx_traj2_time_track2d" gist (track)
> > "idx_traj2_time_track2d_geog" gist (geography(track))
> > "idx_traj2_time_track3d" gist (track gist_geometry_ops_nd)
> > "idx_traj2_tracker" btree (tracker)
> > Access method: heap
> > Options: parallel_workers=8
> >
> > and the following on `tf`:
> > Indexes:
> > "idx_tf_id1" btree (id1)
> > "idx_tf_id2" btree (id2)
> > Access method: heap
> > Options: parallel_workers=8
> >
> > I have run VACUUM ANALYZE before running EXPLAIN ANALYZE and the results are as follows:
> >
> > "Gather (cost=38404.10..557039.81 rows=23555 width=8) (actual time=182.494..362202.326 rows=48253 loops=1)"
> > " Workers Planned: 2"
> > " Workers Launched: 2"
> > " -> Parallel Hash Join (cost=37404.10..553684.31 rows=9815 width=8) (actual time=217.140..361763.181 rows=16084 loops=3)"
> > " Hash Cond: (tf.id1 = t1.id)"
> > " Join Filter: st_dwithin(t1.track, t2.track, '0.1'::double precision)"
> > " Rows Removed by Join Filter: 4931"
> > " -> Parallel Hash Join (cost=870.07..7265.31 rows=26270 width=32026) (actual time=3.649..10.452 rows=21016 loops=3)"
> > " Hash Cond: (t2.id = tf.id2)"
> > " -> Parallel Seq Scan on traj2 t2 (cost=0.00..6003.79 rows=7779 width=32022) (actual time=0.003..2.695 rows=6223 loops=3)"
> > " -> Parallel Hash (cost=541.70..541.70 rows=26270 width=8) (actual time=3.452..3.452 rows=21016 loops=3)"
> > " Buckets: 65536 Batches: 1 Memory Usage: 3008kB"
> > " -> Parallel Seq Scan on tf (cost=0.00..541.70 rows=26270 width=8) (actual time=0.008..0.930 rows=21016 loops=3)"
> > " -> Parallel Hash (cost=6003.79..6003.79 rows=7779 width=32022) (actual time=132.408..132.408 rows=6223 loops=3)"
> > " Buckets: 4096 Batches: 16 Memory Usage: 2688kB"
> > " -> Parallel Seq Scan on traj2 t1 (cost=0.00..6003.79 rows=7779 width=32022) (actual time=121.500..123.018 rows=6223 loops=3)"
> > "Planning Time: 3.335 ms"
> > "JIT:"
> > " Functions: 60"
> > " Options: Inlining true, Optimization true, Expressions true, Deforming true"
> > " Timing: Generation 3.102 ms, Inlining 79.410 ms, Optimization 173.707 ms, Emission 110.965 ms, Total 367.184 ms"
> > "Execution Time: 362206.853 ms"
> >
> > My current settings are
> > shared_buffers: 2GB, work_mem: 64MB, maintenance_work_mem: 512MB, effective_cache_size: 4GB.
> > (I am on a Ubuntu 18.04 laptop machine with 16 GB RAM and i7 2.30 GHz 16 core CPU) (I have the following postgis installation: POSTGIS="3.1.0 5e2af69" [EXTENSION] PGSQL="120" GEOS="3.7.1-CAPI-1.11.1 27a5e771" PROJ="Rel. 4.9.3, 15 August 2016" LIBXML="2.9.4" LIBJSON="0.12.1" LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" )
> >
> > I have also tried setting STORAGE EXTERNAL as per Paul Ramsay's blog post on spatial joins to no avail.
> >
> > As a final note, I have SET parallel_workers=8 but query planner seems to be stuck with 2 workers launched/planned.
> >
> > I am fairly new to GIS and postgis altogether, so I might be missing something very simple and straightforward. I would be grateful if you guys can point me in the right direction.
> >
> > Best regards,
> > Temir
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
> ------------------------------
>
> End of postgis-users Digest, Vol 227, Issue 5
> *********************************************
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
More information about the postgis-users
mailing list