[postgis-devel] PointM(x,y,t) and gist_geometry_ops_nd

Thu May 12 12:54:26 PDT 2022

All (but especially Darafei),

I grabbed some AIS data (7M ship movements) to do up a blog post on indexing spatio-temporal data and some of the quirks therein.

The data were long/lat/timestamp, as one would expect. I built a simple 2D index and did some queries that included an ST_Intersects() and a time range, where the filter was such that the result set was quite selective (a few thousand points) and it ran (predictably) slowly, as it had to choose either a temporal or a spatial index, and then plow through the remaining results in brute force.

The naive approach took about 4s.

Then I created a new table with XYT data, by doing PointM(long, lat, extract(epoch from ts)). Then I built an nd index on that table.

Then I built a XYT query box that was the same as my original filter. And I ran it on the new table.

It was faster than the naive approach, but only by about 4x. It took about 1s.

Then I added the btree_gist extension and built a multi-key index on timestamp and geometry using the 2d ops. Running the same query as the naive approach, I got a 40ms return time. So, definitely this is the winning approach.

However, I am left with the nd XYT approach  being really slow, and wanting to mentally model why.

Here's my guess:

In terms of data range, you have longitude, which in my data is no more than about 10 units wide. And similar for latitude. And then I have the time range, which is on a 1 second basis for a day so 86400 units.

The nd penalty function looks at how the bounds volume and "edge" metrics are affected by adding new points. Given the huge difference in dimensional size, no matter what data is being added, the T dimension is always going to dominate. So the index will be made up of splits on T, over and over and over, until the variance in XY finally starts to get closer to the variance in T (when we're talking about time slices of a second or so). 

So the performance of queries on this index is going to be pretty bad for time ranges of more than a couple seconds, since the index will provide no spatial filtering at all on larger time ranges.

Basically the moral of the story is "you cannot expect an ND index on non-spatial dimensions mixed with spatial dimensions to work unless you massage your dimensions to have proportions that match your expected query pattern".

That's my story.

Does that sound right to you?

P