Intersection tests with curved polygons
Paul Ramsey
pramsey at cleverelephant.ca
Thu Jan 9 09:49:52 PST 2025
> On Jan 8, 2025, at 1:13 AM, Andrea Aime <andrea.aime at geosolutionsgroup.com> wrote:
>
> In the full dataset, there are around 10k records having actual curved edges, out of 36k.
> I was guessing the ideal would be to check if curves are there, and change on the fly the test.
> Something like this:
>
> SELECT ogc_fid FROM kiinteisto_alue WHERE geom && ST_GeomFromText('POINT (25492818 6677399.98)', 3879) and
> case
> when ST_HasArc(geom) then ST_Distance(geom, ST_GeomFromText('POINT (25492818 6677399.98)', 3879)) = 0
> else ST_Intersects(geom, ST_GeomFromText('POINT (25492818 6677399.98)', 3879))
> end;
>
> And yet, a pgbench of the above only returns 14500 TPS, whilst the simple distance test for all,
> gets up to around 20000.... looks like a ST_Distance test is faster even for geometries with straight segments.... isn't this odd?
> In my mind, checking if two geometries interfere (in any way) should be less expensive than calculating their
> exact distance.
>
> By the way, thanks for the details, I love to hear about how things are implemented!
>
This is getting into conference talk levels of depth, and without actually profiling it I am still kind of guessing, but…
So, the ST_Distance algorithm always runs native to PostGIS, so there’s never a “copy to GEOS” cost on it.
In addition, it’s a clever beast, and if both arguments are disjoint, it will perform a light sort of the edges and use that sorted set to prune the number of edge comparisons necessary to calculate the distance.
Now, your test pattern is one fixed point, and a changing array of polygonal things. This is a little backwards from the usual workload you see in, for example, a spatial join, where the execution will be a nested loop driven by the polygons.
Your workload means you are avoiding one of our best performance enhancers in the boolean predicates, the cached internal index. When polygons repeat, we cache an index on their edges, which makes things like predicates much faster. However, your pattern is the reverse. Because there is no benefit of preparing a point, and the polygon input changes every time, so there’s no way to prepare and cache it.
Even more complex though, in ST_Intersects there’s special logic for point-polygon tests, and that logic applies a quick indexing of the polygon edges every time. The implication (though it depends to some extent on how complex your polygons is) is that the cost of that quick indexing is somewhat higher than the cost of the light sorting applied by the distance algorithm. This point in poly special case avoids GEOS, so there’s no conversion overhead, so we can say with some certainty that you have probably sussed out a difference in performance between the index building and sorting approaches.
With just a little more (more!) complexity in the code base we could add another case into ST_Intersects to only build the index after seeing two polygons that are the same, and to otherwise delegate to a ST_Distance == 0 condition for non-cached cases.
Anyways, it’s quite a rats nest, and we’ve been squeezing pieces of performance out of it for a long time. (I didn’t even mention TOAST caching!)
ATB<
P
More information about the postgis-users
mailing list