API for optimized predicates (was Re: [postgis-devel] 1.3.3)

Tue Apr 1 17:20:22 PDT 2008

How large were your polys? If they're not big enough the work of caching 
and looking in the cache can easily overtake the savings of caching the 
prepared versions. ie. caching can take as much or more time than 
preparing for smallish geometries. I think the cut-off there, for the 
memcmp version, was quite high like in the 100's of vertices. The larger 
the poly's are the more useful the caching becomes. The customer that 
wants to make use of this feature has poly's in the 1000's of vertices 
(state/county boundaries in 50-20k resolution or something like that).

I think the identity value in the geometry might be the best solution. 
It might need to be a UUID (or at least a 64-bit int) in order avoid 
running out of Ids. Consider that we may have to put these Ids into 
literal geometries that are used, which means we'd cycle through one or 
more Ids for every query run with a literal geometry value in it. I 
suppose it would only be necessary for variable length geometry types, 
ie. points and bboxes wouldn't need it (would save id cycling but 
complicate code logic with special cases). I've heard that generating 
"real" UUIDs from the system can be slow but we only need them to be 
unique within our domain so perhaps there is a shortcut there (maybe 
even as simple as a sequence?)

Rambling thoughts at the end of the day...

Chris

Paul Ramsey wrote:
> I gave this a try, but in the three-parameter case it caused the
> backend to crash and in the two-parameter case provided the same speed
> as the usual un-prepared approach...
> 
> I was testing with st_contains(polycolumn, pointcolumn), with 80 polys
> and 7000 points.
> 
> P
> 
> On Mon, Mar 31, 2008 at 3:50 PM, Ben Jubb <benjubb at refractions.net> wrote:
>>  Hiya,
>>  I've attached a patch to lwgeom_geos_c.c, modifying its 1st arg caching
>> behaviour.
>>  The third argument is used as before, as a surrogate key, and the caching
>> will use that as its key;
>>  UNLESS the key is NULL.
>>  If the key is NULL, the predicates use the memcmp technique to determine if
>> the cached prepared geometry is in sync with the first arg.
>>  Note that the two caching approaches have essentially independent caches.
>>  This patch is intended for testing purposes only.
>>  enjoy
>>  b
>>
>>
>>
>>  Paul Ramsey wrote:
>>  A unique-on-insert ID would be another approach. It would, however,
>> involve a disk-format change, so we're talking about pretty big
>> hammers here, regardless of whether we did a hash or a uuid.
>>
>> Ben, maybe just stick some small timing statements into your current
>> code... start time, end time, and then do a noop memcmp with start/end
>> times as well. That way we can compare the memcmp times to the total
>> times.
>>
>> P.
>>
>> On Mon, Mar 31, 2008 at 10:17 AM, Martin Davis <mbdavis at refractions.net>
>> wrote:
>>
>>
>>  (renaming this thread, since the current one is way overloaded)
>>
>>  I agree with Paul and Mark - there should be a simple function signature
>>  for the fast preds. The more complex one can be provided as well, but
>>  it will need to be VERY well documented, since it's a tricky thing to grok.
>>
>>  re spatial hash - would you really trust a hash to confirm identity? I
>>  don't think I would...
>>
>>  Would another alternative would be to assign a hidden unique ID to each
>>  geom entered into the DB. This could be a honking big integer, or maybe
>>  a UUID.
>>
>>  Paul Ramsey wrote:
>>  > The problem is that the memcmp hit gets worse in exactly the cases
>>  > were we expect better and better performance from the prepared
>>  > algorithm... still, it would be nice to know what that hit is...
>>  > compared to the original, unprepared time, it will be small, but
>>  > compared to the prepared-with-id-API implementation... hard to say.
>>  >
>>  > Something to resolve before 1.4... It's unfortunate that all the
>>  > *fast* tests can only falsify identity, not confirm it. I was talking
>>  > to a fellow who has done a spatial db implementation on a proprietary
>>  > system, and he was pleased with the idea of a "geographic hash" that
>>  > he can calculate for each shape and use to test identity. If we were
>>  > to do something like that, it would have to be optional, like the bbox
>>  > calculation is currently.
>>  >
>>  > P.
>>  >
>>  > On Mon, Mar 31, 2008 at 2:51 AM, Mark Cave-Ayland
>>  > <mark.cave-ayland at siriusit.co.uk> wrote:
>>  >
>>  >> On Friday 28 March 2008 23:53:53 Ben Jubb wrote:
>>  >> > Howdy,
>>  >> > In my testing, I did see a performance hit when using the memcmp test,
>>  >> > although it was noticable only in the largest of my test geometries
>>  >> > (5000 vertices or so).
>>  >> > The three parameter form seemed like the best way to go because the
>>  >> > whole point of the prepared version of the functions was to get the
>> best
>>  >> > possible performance. The cases when the performance matters most is
>>  >> > with large geoms, and then the cost of doing the memcmp is the
>> highest.
>>  >> > Using a third argument seemed the simplest way to get the best
>> possible
>>  >> > performance from the predicates, with a minimal increase in the
>>  >> > complexity of the interface.
>>  >> > I agree it would be nice to have a single form for those predicates
>> that
>>  >> > automatically determines the most efficient manner to do the tests,
>> but
>>  >> > there didn't seem to be any efficient way to accomplish that.
>>  >> >
>>  >> > b
>>  >>
>>  >>
>>  >> Hi Ben,
>>  >>
>>  >> Well I think it really comes down to what exactly is the performance hit
>> and
>>  >> how did you measure it? Which platform/OS/C library did you use?
>> Obviously
>>  >> there will be *some* overhead having the extra memcmp() in there but
>> does it
>>  >> matter? For example, if the overhead is just 1-2s on a 30s query then
>> that
>>  >> doesn't really matter. Then again, if the overhead is 1s on a 3s query
>> then
>>  >> that is significant.
>>  >>
>>  >> Since this is a new feature then I'd be inclined to say that for a first
>> cut
>>  >> we should keep the standard API, and depending on the reports we get
>> back,
>>  >> look at improving it later. That seems a lot more preferable to having a
>>  >> fairly nasty API hack that will catch a lot of people out :(
>>  >>
>>  >>
>>  >>
>>  >> ATB,
>>  >>
>>  >> Mark.
>>  >>
>>  >> --
>>  >> Mark Cave-Ayland
>>  >> Sirius Corporation - The Open Source Experts
>>  >> http://www.siriusit.co.uk
>>  >> T: +44 870 608 0063
>>  >> _______________________________________________
>>  >> postgis-devel mailing list
>>  >> postgis-devel at postgis.refractions.net
>>  >> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>  >>
>>  >>
>>  > _______________________________________________
>>  > postgis-devel mailing list
>>  > postgis-devel at postgis.refractions.net
>>  > http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>  >
>>  >
>>
>>  --
>>  Martin Davis
>>  Senior Technical Architect
>>  Refractions Research, Inc.
>>  (250) 383-3022
>>
>>  _______________________________________________
>>  postgis-devel mailing list
>>  postgis-devel at postgis.refractions.net
>>  http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
>>
>>  _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
>>
>> _______________________________________________
>>  postgis-devel mailing list
>>  postgis-devel at postgis.refractions.net
>>  http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
>>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel