[postgis-devel] Optimizing contains/within

Chris Hodgson chodgson at refractions.net
Mon Jul 23 14:43:37 PDT 2007


I don't have a good understanding of the level of optimization done to 
the geos code, although I feel there is probably room for improvement 
(likely at the cost of simplicity/understandability/synchronization with 
JTS). What I believe to be a bigger problem is the levels of abstraction 
between the SQL language, the postGIS internal structures, and the GEOS 
internal structures, and the fact that we aren't doing anything to 
"bridge the gaps" in usage.

If one was going to code an optimal solution to the problem "for each of 
these very many points, see if they are within this one polygon", one 
would build an indexed representation of the polygon that is designed 
for quickly finding the sides that intersect any given y-axis. However, 
the "point_in_ring" function is designed to find if just one point is 
within in this one ring, and so it doesn't build this indexed polygon 
structure, or if it does, it throws it away afterward.

Even if geos had a "batch optimized" function like "points_in_ring()" 
that was designed to use the more optimized strategy, it would be 
impossible to use it because your SQL query would likely look something 
like:

SELECT point_in_poly(a.geom_column, b.geom_column) from a join b;

and so the SQL plan would be to call point_in_poly() many times, not 
points_in_poly().

I think there are two solutions to this problem. The possibly simpler 
but less generic solution would be to create some new functions in geos 
and postgis that are designed to work on arrays/lists/sets of objects, 
so that you could use SQL like this:

SELECT points_in_polys( (select geom_col from a), (select geom_col from b) )

And be returned an array of results. This would provide a more optimized 
way to do the more commonly used "batch" functions. Eventually someone 
would want to be able to do every sort of function in the same sort of 
batch method, and we'd end up with a lot of functions with plurals in 
the names, and people would likely complain about how difficult it is to 
work with the results of these functions (big ugly arrays) and how it 
isn't very flexible.

I think the more generic solution that would enable you to do queries 
like the first one, but still taking advantage of "batch optimizations", 
is to somehow cache any data structures that geos creates to do any 
operations on geometries. The kind of structures that need to be cached 
are, first of all, the geos version of the geometry (so that it doesn't 
have to be converted from the postgis one each time), and then, any 
indexes or graph structures that are created for performing operations. 
If this was implemented, then when point_in_ring() is called, the ring 
would either have attached to it, or have a way to look up in a cache 
held somewhere in the geos session, the y-indexed sides of the ring. The 
very first call would create the index and store it there, and then 
future calls could reuse it.

I'm not entirely certain whether these objects would get cached on the 
postgis/postgresql side, or just within the GEOS "session", and I don't 
know exactly how it would work, but I really feel that something like 
this needs to be implemented in order to provide reasonable performance 
for queries involving a large number of spatial operations or 
comparisons on the same inputs. And, we are getting more and more 
complaints about this specific performance problem.

I'd welcome further discussion on this possible solution... and I still 
agree that you should definitely profile your query to see where it is 
spending it's time. But I suspect that the #1 spot will be in the loop 
over the sides in the point_in_ring function, and the only way to reduce 
the looping is to look in a (cached) index instead of the entire list.

Chris

Robert W. Burgholzer wrote:
> I am looking at the geos code for the contains calls for a point in a
> multipolygon, which appears to boil down to the function:
> 
> point_in_ring(POINTARRAY *pts, POINT2D *point)
> 
> I am a real green C-coder, so if anyone could provide some basic feedback to my
> questions, that would be great. Sorry if they are stupid or ridiculously basic.
>  If anyone could simply tell me whether my assumptions below are true or false,
> I would really appreciate it:
> 
> 1. If I am looking at the correct function,
> 2. The time-consuming aspect is related to the sheer number of rings that a
> given complicated polygon may contain.  This little function, point_in_ring,
> could literally be called thousands of times for each shape - multiplied by the
> number of possible points. (can be a really HUGE number)
> 3. The code uses a fairly compact algorithm that first determine the number of
> sides of the ring that cross the y-axis defined by the y-coord of the point,
> then adding up the number to the right and left of the point, an odd number
> means the point is inside, an even number means the point is outside. (I got
> this explanation from http://alienryderflex.com/polygon/ )
> 4. Given that point_in_ring may be called millions of times for a complex shape
> with numerous points to evaluate, any small increase in efficiency in memory
> addressing or other could cause a significant performance improvement.
> 
> 
> This is where my knowledge of C comes to a grinding halt. I have seen some
> information related to pointers, structures and such, so I was wondering what
> degree of confidence the function writers have regarding the use of such things
> in this particular function.
> 
> Thanks,
> r.b.
> 
> 
> --
> Robert W. Burgholzer
> --
> Finding the occasional straw of truth awash in a great ocean of confusion and
> bamboozle requires intelligence, vigilance, dedication and courage.  But if we
> don't practice these tough habits of thought, we cannot hope to solve the truly
> serious problems that face us -- and we risk becoming a nation of suckers, up
> for grabs by the next charlatan who comes along.
> -- Carl Sagan, "The Fine Art of Baloney Detection," Parade, February 1, 1987
> 
> Online Workout Editor:
> http://www.swat-swimming.org/workoutlog/
> http://soulswimmer.dynalias.net/
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel




More information about the postgis-devel mailing list