[postgis-devel] PointM(x,y,t) and gist_geometry_ops_nd

Paul Ramsey pramsey at cleverelephant.ca
Mon May 16 08:54:43 PDT 2022


Thank you, Darafei, for sharing your experience. I'm glad this is not a new and exciting problem, but rather a well-worn one.

So, I did try using Mercator instead of geographic coordinates to give the spatial dimensions more weight, and it resulted in a better-performing index (about 700 ms vs 1000 ms for the same query), but still much worse than the multi-key GiST index.
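
A minimal sketch of that experiment, assuming a source table ais(geom geometry(Point,4326), ts timestamptz); the table and column names are illustrative placeholders, not the actual schema from my test:

  -- Build an XYT PointM in web mercator, with time as epoch seconds in M
  CREATE TABLE ais_xyt_3857 AS
  SELECT ST_MakePointM(
           ST_X(ST_Transform(geom, 3857)),
           ST_Y(ST_Transform(geom, 3857)),
           extract(epoch FROM ts)) AS geom_xyt
  FROM ais;

  -- Index it with the N-dimensional opclass
  CREATE INDEX ais_xyt_3857_nd_idx
    ON ais_xyt_3857 USING gist (geom_xyt gist_geometry_ops_nd);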

I also timed just using a timestamp index and a spatial index separately, letting the system bitmap them together, and, surprise: this ended up being the actual fastest option of them all. So "do nothing special and just add single indexes on the columns of interest" won the whole race.
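
For reference, the "nothing special" setup is just the following (again with illustrative table/column names, and a placeholder envelope and time range):

  CREATE INDEX ais_ts_idx   ON ais (ts);
  CREATE INDEX ais_geom_idx ON ais USING gist (geom);

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*)
    FROM ais
   WHERE ST_Intersects(geom, ST_MakeEnvelope(-124.2, 48.2, -123.2, 49.2, 4326))
     AND ts BETWEEN '2022-05-01' AND '2022-05-02';

and the planner combines the two bitmap index scans with a BitmapAnd.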

This feels a little counter-intuitive, but maybe not: perhaps the bar an explicit spatio-temporal approach has to clear is already quite high.

Anyway, I was thinking about what such an explicit approach might look like:

* put the XYT into a fixed-length key to get rid of the expense of the varlena key
* compute the penalty proportionally instead of absolutely... so a given new key expands an existing node by N% in the time dimension and M% in the spatial dimensions (a toy illustration follows just below). That still leaves a magic number to balance time against space, but at least it doesn't get washed out by something arbitrary like the input unit size
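
Here is that toy illustration in plain SQL, with made-up box numbers (this is just the arithmetic, not index code): the expansion a new key causes is expressed as a fraction of the node's current extent in each dimension, so it no longer depends on whether X/Y are degrees or meters, or T is seconds or days.

  WITH node(xmin, xmax, tmin, tmax) AS (VALUES (0.0, 2.0, 0.0, 600.0)),
       newkey(x, t)                 AS (VALUES (2.5, 630.0))
  SELECT (greatest(newkey.x, node.xmax) - least(newkey.x, node.xmin))
             / (node.xmax - node.xmin) - 1 AS x_expansion,  -- 0.25: 25% wider in X
         (greatest(newkey.t, node.tmax) - least(newkey.t, node.tmin))
             / (node.tmax - node.tmin) - 1 AS t_expansion   -- 0.05: 5% wider in T
    FROM node, newkey;

An absolute penalty would see +30 units in T versus +0.5 units in X and split on T every time; proportionally, X is the bigger expansion.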

That still leaves splits to be calculated, but maybe a similar proportional approach can be applied there too.

Anyone know any good papers on spatio-temporal indexing?

P

> On May 12, 2022, at 1:45 PM, Дорофей Пролесковский <me at komzpa.net> wrote:
> 
> Hi,
> 
> Thanks for the summons. I think this question has been analyzed in much more depth in the MobilityDB working group, so Esteban, please feel free to add the findings you have to the discussion :)
> 
> I would like to see the EXPLAIN (ANALYZE, VERBOSE, BUFFERS) output from your queries, so that we're sure there is indeed over-scanning somewhere and not a runaway bitmap scan.
> 
> Indeed, the first thing is that the penalty is calculated in a non-proportional way. What you get is essentially a 2D index that occasionally gets indexed internally by Z too. In past designs I've mitigated this by using SRID=3857, which is nice for that kind of data: it looks good, and it effectively assumes things travel 1 meter per second, which is not far from the truth.
> 
> The second thing is a finding from our GiST rework adventure, by Aliaksandr Kalenik, about how Postgres handles page splits in multi-column GiST indexes: after a split on the first column, it checks whether the resulting pages contain tuples that would have penalty=0 when inserted into either page, and re-runs the split for those based on the second column. I suspect this will be happening a lot with AIS data. Nothing similar happens across dimensions in the 2D or ND GiST implementations, and that may be an avenue for improvement.
> 
> The third thing that comes to mind is that our ND implementation does not get as much love as our 2D GiST implementation: its picksplit is still not updated to Alexander Korotkov's method from 2012, there is no universal strategy for sorted build since the code is shared between XYZ and lon-lat-ele use cases, and it is generally a varlena thing with a lot of conditional loops over dimensions (which can still be up to 4). For a fair comparison, can you create a (geom_nd, T) index with the geometry as 3D, M forced to 0, and check what the pure difference between the 2D and ND ops is?
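
For what it's worth, a sketch of the comparison being suggested here, using the same illustrative ais(geom, ts) table as above (btree_gist supplies the GiST opclass for the timestamp, and ST_Force3DM pads the geometry to XYM with M=0):

  CREATE EXTENSION IF NOT EXISTS btree_gist;
  CREATE INDEX ais_geomnd_ts_idx
    ON ais USING gist (ST_Force3DM(geom) gist_geometry_ops_nd, ts);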
> 
> The fourth thing is that people with this kind of data don't want point-based indexing of movement tracks: if someone disappears for 10 minutes, I still want to see where they were before and after that when I take a time slice. You want short tracks that stitch together. MobilityDB has that abstraction, and on that kind of data picksplit will work better and the GiST tree will be nicer, since you can no longer just split at an arbitrary Z/M and get nicely separated buckets of data on the first try.
> 
> Also, in real data there will be a vehicle ID, and that will usually also be part of the indexing/partitioning scheme.
> 
> Hope this helps.
> 
> Darafei.
> 
> Thu, 12 May 2022 at 22:54, Paul Ramsey <pramsey at cleverelephant.ca>:
> All (but especially Darafei),
> 
> I grabbed some AIS data (7M ship movements) to do up a blog post on indexing spatio-temporal data and some of the quirks therein.
> 
> The data were long/lat/timestamp, as one would expect. I built a simple 2D index and ran some queries that included an ST_Intersects() and a time range, where the filter was quite selective (a result set of a few thousand points). It ran (predictably) slowly, as the planner had to choose either the temporal or the spatial index and then plow through the remaining rows by brute force.
> 
> The naive approach took about 4s.
> 
> Then I created a new table with XYT data, by doing PointM(long, lat, extract(epoch from ts)). Then I built an nd index on that table.
> 
> Then I built an XYT query box equivalent to my original filter and ran it on the new table.
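
A hedged sketch of what such a query box might look like against the ND index: an N-dimensional bounding-box overlap (&&&) against a two-corner PointM line. The envelope coordinates, epoch range, and the ais_xyt table name are illustrative.

  SELECT count(*)
    FROM ais_xyt
   WHERE geom_xyt &&& ST_MakeLine(
           ST_MakePointM(-124.2, 48.2, extract(epoch FROM timestamptz '2022-05-01')),
           ST_MakePointM(-123.2, 49.2, extract(epoch FROM timestamptz '2022-05-02')));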
> 
> It was faster than the naive approach, but only by about 4x. It took about 1s.
> 
> Then I added the btree_gist extension and built a multi-key index on timestamp and geometry using the 2D ops. Running the same query as the naive approach, I got a 40 ms return time. So this is definitely the winning approach.
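
The multi-key setup is roughly this (illustrative names again; btree_gist provides the GiST opclass for timestamptz, and the geometry column gets the default 2D ops):

  CREATE EXTENSION IF NOT EXISTS btree_gist;
  CREATE INDEX ais_ts_geom_idx ON ais USING gist (ts, geom);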
> 
> However, I am left with the ND XYT approach being really slow, and wanting a mental model of why.
> 
> Here's my guess:
> 
> In terms of data range: longitude in my data spans no more than about 10 units, and similarly for latitude. The time range, at one-second resolution over a day, spans 86400 units.
> 
> The nd penalty function looks at how the bounds volume and "edge" metrics are affected by adding new points. Given the huge difference in dimensional size, no matter what data is being added, the T dimension is always going to dominate. So the index will be made up of splits on T, over and over and over, until the variance in XY finally starts to get closer to the variance in T (when we're talking about time slices of a second or so). 
> 
> So the performance of queries on this index is going to be pretty bad for time ranges of more than a couple seconds, since the index will provide no spatial filtering at all on larger time ranges.
> 
> Basically the moral of the story is "you cannot expect an ND index on non-spatial dimensions mixed with spatial dimensions to work unless you massage your dimensions to have proportions that match your expected query pattern".
> 
> That's my story.
> 
> Does that sound right to you?
> 
> P
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel


