[postgis-devel] Optimizing contains/within
Chris Hodgson
chodgson at refractions.net
Mon Jul 23 15:39:23 PDT 2007
Oh, my, I must be more out of touch with the development than I
thought... I didn't realize we already had this implemented. Is this
only for the point_in_poly test? The other one that comes to mind as
benefiting from a cached structure is the graph created for any of the
intersects/contains/etc comparisons or intersection/union/etc operations.
Chris
Mark Leslie wrote:
> The current implementation is of the second form. An RTREE index
> (http://lin-ear-th-inking.blogspot.com/2007/06/packed-1-dimensional-r-tree.html)
> is used to quickly determine candidate ring segments and is cached in
> the PostgreSQL function call context. Repeated requests against the
> same polygon will reuse the cached index. This is not yet implemented
> for MultiPolygons or MultiPoints.
> Mark
>
> Chris Hodgson wrote:
>> I don't have a good understanding of the level of optimization done to
>> the geos code, although I feel there is probably room for improvement
>> (likely at the cost of simplicity/understandability/synchronization with
>> JTS). What I believe to be a bigger problem is the levels of abstraction
>> between the SQL language, the postGIS internal structures, and the GEOS
>> internal structures, and the fact that we aren't doing anything to
>> "bridge the gaps" in usage.
>>
>> If one was going to code an optimal solution to the problem "for each of
>> these very many points, see if they are within this one polygon", one
>> would build an indexed representation of the polygon that is designed
>> for quickly finding the sides that intersect any given y-axis. However,
>> the "point_in_ring" function is designed to find if just one point is
>> within in this one ring, and so it doesn't build this indexed polygon
>> structure, or if it does, it throws it away afterward.
>>
>> Even if geos had a "batch optimized" function like "points_in_ring()"
>> that was designed to use the more optimized strategy, it would be
>> impossible to use it because your SQL query would likely look something
>> like:
>>
>> SELECT point_in_poly(a.geom_column, b.geom_column) from a join b;
>>
>> and so the SQL plan would be to call point_in_poly() many times, not
>> points_in_poly().
>>
>> I think there are two solutions to this problem. The possibly simpler
>> but less generic solution would be to create some new functions in geos
>> and postgis that are designed to work on arrays/lists/sets of objects,
>> so that you could use SQL like this:
>>
>> SELECT points_in_polys( (select geom_col from a), (select geom_col from
>> b) )
>>
>> And be returned an array of results. This would provide a more optimized
>> way to do the more commonly used "batch" functions. Eventually someone
>> would want to be able to do every sort of function in the same sort of
>> batch method, and we'd end up with a lot of functions with plurals in
>> the names, and people would likely complain about how difficult it is to
>> work with the results of these functions (big ugly arrays) and how it
>> isn't very flexible.
>>
>> I think the more generic solution that would enable you to do queries
>> like the first one, but still taking advantage of "batch optimizations",
>> is to somehow cache any data structures that geos creates to do any
>> operations on geometries. The kind of structures that need to be cached
>> are, first of all, the geos version of the geometry (so that it doesn't
>> have to be converted from the postgis one each time), and then, any
>> indexes or graph structures that are created for performing operations.
>> If this was implemented, then when point_in_ring() is called, the ring
>> would either have attached to it, or have a way to look up in a cache
>> held somewhere in the geos session, the y-indexed sides of the ring. The
>> very first call would create the index and store it there, and then
>> future calls could reuse it.
>>
>> I'm not entirely certain whether these objects would get cached on the
>> postgis/postgresql side, or just within the GEOS "session", and I don't
>> know exactly how it would work, but I really feel that something like
>> this needs to be implemented in order to provide reasonable performance
>> for queries involving a large number of spatial operations or
>> comparisons on the same inputs. And, we are getting more and more
>> complaints about this specific performance problem.
>>
>> I'd welcome further discussion on this possible solution... and I still
>> agree that you should definitely profile your query to see where it is
>> spending it's time. But I suspect that the #1 spot will be in the loop
>> over the sides in the point_in_ring function, and the only way to reduce
>> the looping is to look in a (cached) index instead of the entire list.
>>
>> Chris
>>
>> Robert W. Burgholzer wrote:
>>> I am looking at the geos code for the contains calls for a point in a
>>> multipolygon, which appears to boil down to the function:
>>>
>>> point_in_ring(POINTARRAY *pts, POINT2D *point)
>>>
>>> I am a real green C-coder, so if anyone could provide some basic
>>> feedback to my
>>> questions, that would be great. Sorry if they are stupid or
>>> ridiculously basic.
>>> If anyone could simply tell me whether my assumptions below are true
>>> or false,
>>> I would really appreciate it:
>>>
>>> 1. If I am looking at the correct function,
>>> 2. The time-consuming aspect is related to the sheer number of rings
>>> that a
>>> given complicated polygon may contain. This little function,
>>> point_in_ring,
>>> could literally be called thousands of times for each shape -
>>> multiplied by the
>>> number of possible points. (can be a really HUGE number)
>>> 3. The code uses a fairly compact algorithm that first determine the
>>> number of
>>> sides of the ring that cross the y-axis defined by the y-coord of the
>>> point,
>>> then adding up the number to the right and left of the point, an odd
>>> number
>>> means the point is inside, an even number means the point is outside.
>>> (I got
>>> this explanation from http://alienryderflex.com/polygon/ )
>>> 4. Given that point_in_ring may be called millions of times for a
>>> complex shape
>>> with numerous points to evaluate, any small increase in efficiency in
>>> memory
>>> addressing or other could cause a significant performance improvement.
>>>
>>>
>>> This is where my knowledge of C comes to a grinding halt. I have seen
>>> some
>>> information related to pointers, structures and such, so I was
>>> wondering what
>>> degree of confidence the function writers have regarding the use of
>>> such things
>>> in this particular function.
>>>
>>> Thanks,
>>> r.b.
>>>
>>>
>>> --
>>> Robert W. Burgholzer
>>> --
>>> Finding the occasional straw of truth awash in a great ocean of
>>> confusion and
>>> bamboozle requires intelligence, vigilance, dedication and courage.
>>> But if we
>>> don't practice these tough habits of thought, we cannot hope to solve
>>> the truly
>>> serious problems that face us -- and we risk becoming a nation of
>>> suckers, up
>>> for grabs by the next charlatan who comes along.
>>> -- Carl Sagan, "The Fine Art of Baloney Detection," Parade, February
>>> 1, 1987
>>>
>>> Online Workout Editor:
>>> http://www.swat-swimming.org/workoutlog/
>>> http://soulswimmer.dynalias.net/
>>> _______________________________________________
>>> postgis-devel mailing list
>>> postgis-devel at postgis.refractions.net
>>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
More information about the postgis-devel
mailing list