[postgis-devel] An Experiment

Mon Sep 21 16:42:29 PDT 2009

This book

http://www.amazon.com/Foundations-Multidimensional-Structures-Kaufmann-Computer/dp/0123694469

has a great little section on r-trees which gives capsule summaries of
various papers, and even better summaries of some of the empirical
testing that was done on different r-tree variants. Among other
things, they found that the Ang/Tan linear algorithm actually produced
a better index than the Guttman quadratic time algorithm, which is
sort of counter-intuitive, but c'est la vie.

And it also shows some visualizations which sure make one want to have
an R*Tree. However, the GiST infrastructure just doesn't provide that
option. So we're left with really very few knobs to twist (penalty and
picksplit).

However, perhaps we are wasting our time :) there was a very very
interesting talk at GeoWeb this year by some guys selling a RDF
triple-store, about how they did their spatial index. And they
basically did a one-level grid system, but even simpler, just encoding
things in strips, and it performed, because the strips could be read
in the fastest possible way off the disk: linearly in one direction.

In a similar vein, the QIX structure used by Mapserver has some really
really dumb behavior for points, creating an absurdly deep tree by
default, and yet, doesn't really suffer from it. Our trouble is, we
don't have a solid enough handle on where our performance bottlenecks
are, so we could very well be focussing on the wrong (but very
interesting) thing by looking at, for example, tree structure instead
of disk access.

Another interesting performance test would be r-tree versus
b-tree-over-geohash. The latter is more naive and un-balanced, but has
the advantage (when clustered) of being nicely spatially
autocorrelated, thereby providing potentially better disk access
profiles for our classic bounding box query.

P.

On Mon, Sep 21, 2009 at 4:19 PM, Chris Hodgson <chodgson at refractions.net> wrote:
> Hey Paul, it was actually almost a year ago that I looked into this stuff. I
> started a blog to publish my findings but I realized that a solution was
> probably outside the scope of postgresql/postgis and so I didn't end up
> going any further with it. Also (as I'm sure many other wannabe bloggers
> have discovered) I haven't done anything interesting enough to blog about
> since :)
>
> http://cognitivelychris.blogspot.com/
>
> (I originally had it on chris.ca but I ended up selling that domain.)
>
> Anyways, I explain in there why the real problem is progressive insertion of
> *variable density data*. It doesn't matter whether you order it randomly or
> semi-geo-correllated - unless it is just perfect (ie. you pre-process it
> with a magic process that knows about how the index will be built) you end
> up with a relatively poor index because the dense areas effectively take
> over the index and mess it up.
>
> I came up with some interesting algorithms to make what I envisioned to be
> nearly perfectly balanced r-tree indexes but they require sorting the
> bounding boxes for all of the data in the table at once, which isn't
> necessarily feasible with large data sets and is definitely outside of the
> postgresql workflow of inserting one record at a time. I'll try to make
> another post with some of my prettier pictures.
>
> Chris
>
> Paul Ramsey wrote:
>>
>> Kevin,
>>
>> Many many moons ago, you and Chris showed me the visual representation
>> of an index, and lo, it had a damn funny layout compared to what we
>> expected.
>>
>> Having now gone through the picksplit code with a fine comb, I'm
>> pretty sure I know why. I don't have any general means of preventing
>> that (an R*Tree would be awesome, but GiST lacks the infrastructure to
>> build one, at this time) I do have an observation it would be fun to
>> see tested.
>>
>> First, note that spatial data tends to be delivered to databases in
>> shape files, and that the data in those files has a tendency to be
>> spatially autocorrelated. (Watch the draw in JUMP or ArcView.)
>> Spatially autocorrelated data is pretty much guaranteed to generate
>> screwed up R-Trees, since the top nodes are going to be built around
>> one small area, then expanded in odd ways to fit new data as is fills
>> in areas outside the original node boundary.
>>
>> A better R-Tree is going to be build against randomized inputs. So,
>> try two tables, one loaded and built the old fashioned way, another
>> built by creating a second table with "create table random_mytable as
>> select * from mytable order by random()". Obviously you'll need enough
>> records in the tables to make it interesting (bearing in mind there
>> are hundreds of entries in each index node).
>>
>> If you get a chance to run those, I'd love to hear how it goes :)
>>
>> Paul
>>
>> PS - Of course, once the random-based index is built, clustering the
>> table against the index to bring the tuples back into some kind of
>> spatial correlation will be necessary for performance purposes.
>> Probably all these minor fine-tuning things make ****-all difference
>> in the big picture. But it's fun to fiddle.
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>