[postgis-devel] An Experiment

Mon Sep 21 16:19:55 PDT 2009

Hey Paul, it was actually almost a year ago that I looked into this 
stuff. I started a blog to publish my findings but I realized that a 
solution was probably outside the scope of postgresql/postgis and so I 
didn't end up going any further with it. Also (as I'm sure many other 
wannabe bloggers have discovered) I haven't done anything interesting 
enough to blog about since :)

http://cognitivelychris.blogspot.com/

(I originally had it on chris.ca but I ended up selling that domain.)

Anyways, I explain in there why the real problem is progressive 
insertion of *variable density data*. It doesn't matter whether you 
order it randomly or semi-geo-correllated - unless it is just perfect 
(ie. you pre-process it with a magic process that knows about how the 
index will be built) you end up with a relatively poor index because the 
dense areas effectively take over the index and mess it up.

I came up with some interesting algorithms to make what I envisioned to 
be nearly perfectly balanced r-tree indexes but they require sorting the 
bounding boxes for all of the data in the table at once, which isn't 
necessarily feasible with large data sets and is definitely outside of 
the postgresql workflow of inserting one record at a time. I'll try to 
make another post with some of my prettier pictures.

Chris

Paul Ramsey wrote:
> Kevin,
>
> Many many moons ago, you and Chris showed me the visual representation
> of an index, and lo, it had a damn funny layout compared to what we
> expected.
>
> Having now gone through the picksplit code with a fine comb, I'm
> pretty sure I know why. I don't have any general means of preventing
> that (an R*Tree would be awesome, but GiST lacks the infrastructure to
> build one, at this time) I do have an observation it would be fun to
> see tested.
>
> First, note that spatial data tends to be delivered to databases in
> shape files, and that the data in those files has a tendency to be
> spatially autocorrelated. (Watch the draw in JUMP or ArcView.)
> Spatially autocorrelated data is pretty much guaranteed to generate
> screwed up R-Trees, since the top nodes are going to be built around
> one small area, then expanded in odd ways to fit new data as is fills
> in areas outside the original node boundary.
>
> A better R-Tree is going to be build against randomized inputs. So,
> try two tables, one loaded and built the old fashioned way, another
> built by creating a second table with "create table random_mytable as
> select * from mytable order by random()". Obviously you'll need enough
> records in the tables to make it interesting (bearing in mind there
> are hundreds of entries in each index node).
>
> If you get a chance to run those, I'd love to hear how it goes :)
>
> Paul
>
> PS - Of course, once the random-based index is built, clustering the
> table against the index to bring the tuples back into some kind of
> spatial correlation will be necessary for performance purposes.
> Probably all these minor fine-tuning things make ****-all difference
> in the big picture. But it's fun to fiddle.
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>