[postgis-devel] An Experiment

Tue Sep 22 09:58:45 PDT 2009

Yeah, it was a while ago we last looked at this (great write-up Chris).  I do recall looking at many different ways to 
order the input of geometries before creating a gist index (order by random() being one of them).

For those that don't know what we are talking about, the creating of a gist index involves making a tree of bounding 
boxes.  At the lowest level, the individual bounding boxes correspond to the individual geometries.  At higher levels, 
these bboxes are grouped together into some sort of groupings.  At the highest level, there is a handful of bboxes that 
cover your entire dataset.  When you invoke the index in a spatial query, it compares your supplied bbox with the tree 
nodes to [quickly] walk down the tree to identify the target collection of geometries.  (Others can correct me if my 
overly simplistic definition is off base)

The main function of interest for us is the picksplit method defined by the GiST api.  When the index is create and 
geometries are added to the tree, eventually some subtree needs to get split to create two smaller trees.  Now, 
logically, it makes sense that the best performance might be obtained when the bounding boxes at the subtree nodes are 
as square as possible and non-overlapping.

It would seem that the ordering of the geometries before creating a gist index plays a huge role in what the index looks 
like.  As Chris posted in his blog, a typical gist index in PostGIS "looks" horrible "looks" like it should perform 
badly.  He posted an image that shows one level of an index where the tree nodes are very long and skinny and very much 
overlapping.  This means that at any given point, you are likely to select many subtrees when trying to get to your 
target geometries at the lowest levels.

I found that the best "looking" gist trees were created by ordering the geometries spatially, like using a grid specific 
to the dataset (I would think Paul's geohash would make nice looking trees).  The subtrees created tended to be more 
spatially disjoint.  This is good since this means that close geometries are grouped together and are also "close" on 
disk (which means potentially fewer disk accesses).

Another approach I found that worked well was repeatedly creating a spatial index, cluster on the index (which orders 
the table based on the spatial index), then drop and recreate the index.  After a few iterations, the tree looks like it 
should perform well.

Having said all that, I'm kinda with Paul on this ... this may just be a waste of time.  We don't have a solid enough 
handle on where the bottleneck is.  With a gist implementation, I'm not sure we can do much better than simply 
clustering a table based on a spatial index.

Cheers,
Kevin

Paul Ramsey wrote:
> This book
> 
> http://www.amazon.com/Foundations-Multidimensional-Structures-Kaufmann-Computer/dp/0123694469
> 
> has a great little section on r-trees which gives capsule summaries of
> various papers, and even better summaries of some of the empirical
> testing that was done on different r-tree variants. Among other
> things, they found that the Ang/Tan linear algorithm actually produced
> a better index than the Guttman quadratic time algorithm, which is
> sort of counter-intuitive, but c'est la vie.
> 
> And it also shows some visualizations which sure make one want to have
> an R*Tree. However, the GiST infrastructure just doesn't provide that
> option. So we're left with really very few knobs to twist (penalty and
> picksplit).
> 
> However, perhaps we are wasting our time :) there was a very very
> interesting talk at GeoWeb this year by some guys selling a RDF
> triple-store, about how they did their spatial index. And they
> basically did a one-level grid system, but even simpler, just encoding
> things in strips, and it performed, because the strips could be read
> in the fastest possible way off the disk: linearly in one direction.
> 
> In a similar vein, the QIX structure used by Mapserver has some really
> really dumb behavior for points, creating an absurdly deep tree by
> default, and yet, doesn't really suffer from it. Our trouble is, we
> don't have a solid enough handle on where our performance bottlenecks
> are, so we could very well be focussing on the wrong (but very
> interesting) thing by looking at, for example, tree structure instead
> of disk access.
> 
> Another interesting performance test would be r-tree versus
> b-tree-over-geohash. The latter is more naive and un-balanced, but has
> the advantage (when clustered) of being nicely spatially
> autocorrelated, thereby providing potentially better disk access
> profiles for our classic bounding box query.
> 
> P.
> 
> On Mon, Sep 21, 2009 at 4:19 PM, Chris Hodgson <chodgson at refractions.net> wrote:
>> Hey Paul, it was actually almost a year ago that I looked into this stuff. I
>> started a blog to publish my findings but I realized that a solution was
>> probably outside the scope of postgresql/postgis and so I didn't end up
>> going any further with it. Also (as I'm sure many other wannabe bloggers
>> have discovered) I haven't done anything interesting enough to blog about
>> since :)
>>
>> http://cognitivelychris.blogspot.com/
>>
>> (I originally had it on chris.ca but I ended up selling that domain.)
>>
>> Anyways, I explain in there why the real problem is progressive insertion of
>> *variable density data*. It doesn't matter whether you order it randomly or
>> semi-geo-correllated - unless it is just perfect (ie. you pre-process it
>> with a magic process that knows about how the index will be built) you end
>> up with a relatively poor index because the dense areas effectively take
>> over the index and mess it up.
>>
>> I came up with some interesting algorithms to make what I envisioned to be
>> nearly perfectly balanced r-tree indexes but they require sorting the
>> bounding boxes for all of the data in the table at once, which isn't
>> necessarily feasible with large data sets and is definitely outside of the
>> postgresql workflow of inserting one record at a time. I'll try to make
>> another post with some of my prettier pictures.
>>
>> Chris
>>
>> Paul Ramsey wrote:
>>> Kevin,
>>>
>>> Many many moons ago, you and Chris showed me the visual representation
>>> of an index, and lo, it had a damn funny layout compared to what we
>>> expected.
>>>
>>> Having now gone through the picksplit code with a fine comb, I'm
>>> pretty sure I know why. I don't have any general means of preventing
>>> that (an R*Tree would be awesome, but GiST lacks the infrastructure to
>>> build one, at this time) I do have an observation it would be fun to
>>> see tested.
>>>
>>> First, note that spatial data tends to be delivered to databases in
>>> shape files, and that the data in those files has a tendency to be
>>> spatially autocorrelated. (Watch the draw in JUMP or ArcView.)
>>> Spatially autocorrelated data is pretty much guaranteed to generate
>>> screwed up R-Trees, since the top nodes are going to be built around
>>> one small area, then expanded in odd ways to fit new data as is fills
>>> in areas outside the original node boundary.
>>>
>>> A better R-Tree is going to be build against randomized inputs. So,
>>> try two tables, one loaded and built the old fashioned way, another
>>> built by creating a second table with "create table random_mytable as
>>> select * from mytable order by random()". Obviously you'll need enough
>>> records in the tables to make it interesting (bearing in mind there
>>> are hundreds of entries in each index node).
>>>
>>> If you get a chance to run those, I'd love to hear how it goes :)
>>>
>>> Paul
>>>
>>> PS - Of course, once the random-based index is built, clustering the
>>> table against the index to bring the tuples back into some kind of
>>> spatial correlation will be necessary for performance purposes.
>>> Probably all these minor fine-tuning things make ****-all difference
>>> in the big picture. But it's fun to fiddle.
>>> _______________________________________________
>>> postgis-devel mailing list
>>> postgis-devel at postgis.refractions.net
>>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>>
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel