[postgis-devel] Optimizing contains/within
Chris Hodgson
chodgson at refractions.net
Mon Jul 23 14:43:37 PDT 2007
I don't have a good understanding of the level of optimization done to
the geos code, although I feel there is probably room for improvement
(likely at the cost of simplicity/understandability/synchronization with
JTS). What I believe to be a bigger problem is the levels of abstraction
between the SQL language, the postGIS internal structures, and the GEOS
internal structures, and the fact that we aren't doing anything to
"bridge the gaps" in usage.
If one was going to code an optimal solution to the problem "for each of
these very many points, see if they are within this one polygon", one
would build an indexed representation of the polygon that is designed
for quickly finding the sides that intersect any given y-axis. However,
the "point_in_ring" function is designed to find if just one point is
within in this one ring, and so it doesn't build this indexed polygon
structure, or if it does, it throws it away afterward.
Even if geos had a "batch optimized" function like "points_in_ring()"
that was designed to use the more optimized strategy, it would be
impossible to use it because your SQL query would likely look something
like:
SELECT point_in_poly(a.geom_column, b.geom_column) from a join b;
and so the SQL plan would be to call point_in_poly() many times, not
points_in_poly().
I think there are two solutions to this problem. The possibly simpler
but less generic solution would be to create some new functions in geos
and postgis that are designed to work on arrays/lists/sets of objects,
so that you could use SQL like this:
SELECT points_in_polys( (select geom_col from a), (select geom_col from b) )
And be returned an array of results. This would provide a more optimized
way to do the more commonly used "batch" functions. Eventually someone
would want to be able to do every sort of function in the same sort of
batch method, and we'd end up with a lot of functions with plurals in
the names, and people would likely complain about how difficult it is to
work with the results of these functions (big ugly arrays) and how it
isn't very flexible.
I think the more generic solution that would enable you to do queries
like the first one, but still taking advantage of "batch optimizations",
is to somehow cache any data structures that geos creates to do any
operations on geometries. The kind of structures that need to be cached
are, first of all, the geos version of the geometry (so that it doesn't
have to be converted from the postgis one each time), and then, any
indexes or graph structures that are created for performing operations.
If this was implemented, then when point_in_ring() is called, the ring
would either have attached to it, or have a way to look up in a cache
held somewhere in the geos session, the y-indexed sides of the ring. The
very first call would create the index and store it there, and then
future calls could reuse it.
I'm not entirely certain whether these objects would get cached on the
postgis/postgresql side, or just within the GEOS "session", and I don't
know exactly how it would work, but I really feel that something like
this needs to be implemented in order to provide reasonable performance
for queries involving a large number of spatial operations or
comparisons on the same inputs. And, we are getting more and more
complaints about this specific performance problem.
I'd welcome further discussion on this possible solution... and I still
agree that you should definitely profile your query to see where it is
spending it's time. But I suspect that the #1 spot will be in the loop
over the sides in the point_in_ring function, and the only way to reduce
the looping is to look in a (cached) index instead of the entire list.
Chris
Robert W. Burgholzer wrote:
> I am looking at the geos code for the contains calls for a point in a
> multipolygon, which appears to boil down to the function:
>
> point_in_ring(POINTARRAY *pts, POINT2D *point)
>
> I am a real green C-coder, so if anyone could provide some basic feedback to my
> questions, that would be great. Sorry if they are stupid or ridiculously basic.
> If anyone could simply tell me whether my assumptions below are true or false,
> I would really appreciate it:
>
> 1. If I am looking at the correct function,
> 2. The time-consuming aspect is related to the sheer number of rings that a
> given complicated polygon may contain. This little function, point_in_ring,
> could literally be called thousands of times for each shape - multiplied by the
> number of possible points. (can be a really HUGE number)
> 3. The code uses a fairly compact algorithm that first determine the number of
> sides of the ring that cross the y-axis defined by the y-coord of the point,
> then adding up the number to the right and left of the point, an odd number
> means the point is inside, an even number means the point is outside. (I got
> this explanation from http://alienryderflex.com/polygon/ )
> 4. Given that point_in_ring may be called millions of times for a complex shape
> with numerous points to evaluate, any small increase in efficiency in memory
> addressing or other could cause a significant performance improvement.
>
>
> This is where my knowledge of C comes to a grinding halt. I have seen some
> information related to pointers, structures and such, so I was wondering what
> degree of confidence the function writers have regarding the use of such things
> in this particular function.
>
> Thanks,
> r.b.
>
>
> --
> Robert W. Burgholzer
> --
> Finding the occasional straw of truth awash in a great ocean of confusion and
> bamboozle requires intelligence, vigilance, dedication and courage. But if we
> don't practice these tough habits of thought, we cannot hope to solve the truly
> serious problems that face us -- and we risk becoming a nation of suckers, up
> for grabs by the next charlatan who comes along.
> -- Carl Sagan, "The Fine Art of Baloney Detection," Parade, February 1, 1987
>
> Online Workout Editor:
> http://www.swat-swimming.org/workoutlog/
> http://soulswimmer.dynalias.net/
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
More information about the postgis-devel
mailing list