[postgis-devel] PointM(x,y,t) and gist_geometry_ops_nd

Thu May 12 13:45:41 PDT 2022

Hi,

Thanks for the summon. I think the question is much better analyzed in the
MobilityDB working group, so Esteban please feel free to add the findings
you have to the discussion :)

I would like to see the EXPLAIN (ANALYZE, VERBOSE, BUFFERS) from your
queries so that we're sure that there is indeed an over-scanning somewhere
and not a runaway Bitmap scan.

Indeed the first thing is that the penalty will be calculated in a
non-proportional way. What you get is essentially a 2D index that
occasionally gets indexed internally by Z too. In the past designs I've
mitigated this by using SRID=3857 which is nice for that kind of data,
since it looks good and also assumes things travel 1 meter per second which
is not far from the truth.

Second thing is our GiST rework adventure's finding by Aliaskandr Kalenik
on how Postgres handles multi-column gist indexes page split: after a split
on first column it checks if the resulting pages have tuples that can have
penalty=0 when inserted into either page and re-runs the split for them
based on second column. I suspect this will be happening a lot in AIS data.
Nothing similar happens to dimensions in 2D or ND gist implementation and
may be a way for improvement.

Third one that comes to mind is that our ND implementation is not getting
as much love as our 2D gist implementation: picksplit there is still not
updated to Alexander Korotkov's method from 2012, there is no universal
strategy for sorting build since it's shared between XYZ and lon-lat-ele
projections, and it is generally a varlen thing with a lot of conditional
loops on dimensions (which are still up to 4). For fair comparison, can you
create (geom_nd, T) index on it with geometry being 3D with M forced to 0
and check what is the pure difference between 2D and ND ops?

Fourth thing is that people with that kind of data don't want point-based
indexing of movement tracks, if someone disappears for 10 minutes I still
want to see where they were before and after that if I'm doing a slice. You
want short tracks that stitch together. MobilityDB has that abstraction,
and on that kind of data picksplit will work better and gist tree will be
nicer, since you now can't just split at arbitrary Z/M and get nice
separated buckets of data on first try.

Also in real data there will be an ID of vehicle and that usually will also
be in the indexing partition.

Hope this helps.

Darafei.

чт, 12 мая 2022 г. в 22:54, Paul Ramsey <pramsey at cleverelephant.ca>:

> All (but especially Darafei),
>
> I grabbed some AIS data (7M ship movements) to do up a blog post on
> indexing spatio-temporal data and some of the quirks therein.
>
> The data were long/lat/timestamp, as one would expect. I built a simple 2D
> index and did some queries that included an ST_Intersects() and a time
> range, where the filter was such that the result set was quite selective (a
> few thousand points) and it ran (predictably) slowly, as it had to choose
> either a temporal or a spatial index, and then plow through the remaining
> results in brute force.
>
> The naive approach took about 4s.
>
> Then I created a new table with XYT data, by doing PointM(long, lat,
> extract(epoch from ts)). Then I built an nd index on that table.
>
> Then I built a XYT query box that was the same as my original filter. And
> I ran it on the new table.
>
> It was faster than the naive approach, but only by about 4x. It took about
> 1s.
>
> Then I added the btree_gist extension and built a multi-key index on
> timestamp and geometry using the 2d ops. Running the same query as the
> naive approach, I got a 40ms return time. So, definitely this is the
> winning approach.
>
> However, I am left with the nd XYT approach  being really slow, and
> wanting to mentally model why.
>
> Here's my guess:
>
> In terms of data range, you have longitude, which in my data is no more
> than about 10 units wide. And similar for latitude. And then I have the
> time range, which is on a 1 second basis for a day so 86400 units.
>
> The nd penalty function looks at how the bounds volume and "edge" metrics
> are affected by adding new points. Given the huge difference in dimensional
> size, no matter what data is being added, the T dimension is always going
> to dominate. So the index will be made up of splits on T, over and over and
> over, until the variance in XY finally starts to get closer to the variance
> in T (when we're talking about time slices of a second or so).
>
> So the performance of queries on this index is going to be pretty bad for
> time ranges of more than a couple seconds, since the index will provide no
> spatial filtering at all on larger time ranges.
>
> Basically the moral of the story is "you cannot expect an ND index on
> non-spatial dimensions mixed with spatial dimensions to work unless you
> massage your dimensions to have proportions that match your expected query
> pattern".
>
> That's my story.
>
> Does that sound right to you?
>
> P
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20220512/baf5c4dc/attachment.htm>