[GRASS-dev] vector libs: file based spatial index

Tue Jun 23 12:50:03 EDT 2009

Hi,

I now have a completely file-based spatial index for the vector libs that
could reduce memory consumption considerably. The spatial index file is
usually 2 - 3 times larger than the topo file; for point datasets it is
about 5 times larger. In GRASS 6.x the whole spatial index is always
built from scratch and kept in memory.
My implementation is completely file based, even when creating or
updating a vector. This obviously comes with a speed penalty because
reading from memory is faster than reading from file. Even with all sorts
of tricks and relying on the system to cache files, building the
file-based spatial index takes about twice as long as building the
memory-based one, e.g. v.build takes twice as long (on my test system).
That's less than I was afraid of, but still twice as long.
Other new features are a dynamic 2D or 3D spatial index (it was fixed to
3D, even for 2D vectors), higher portability across different platforms,
a speed increase and more control over memory consumption from
translating all recursive functions into non-recursive ones, and
recycling of dead space in files (following the principle of Radim's
suggestion in the vector TODO). It amounts to a complete rewrite of the
rtree library, I'm afraid.
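
To illustrate what translating a recursive function into a non-recursive
one means here, below is a minimal sketch of an R-tree style search that
keeps its own fixed-size stack instead of recursing. It is not the actual
rtree code: the struct layout, MAXCARD, MAXDEPTH and search() are all
made up for illustration, and the tree is kept in memory to keep the
example short.

```c
#include <stdio.h>

#define MAXCARD  4   /* max entries per node (illustrative value) */
#define MAXDEPTH 32  /* upper bound on tree height (assumption) */

struct rect { double min[2], max[2]; };

/* Hypothetical node layout, not the real rtree structs */
struct node {
    int level;                    /* 0 = leaf */
    int count;                    /* number of used entries */
    struct rect box[MAXCARD];     /* bounding box of each entry */
    struct node *child[MAXCARD];  /* used when level > 0 */
    int id[MAXCARD];              /* used when level == 0 */
};

static int overlap(const struct rect *a, const struct rect *b)
{
    int i;

    for (i = 0; i < 2; i++)
        if (a->min[i] > b->max[i] || b->min[i] > a->max[i])
            return 0;
    return 1;
}

/* Depth-first search with an explicit stack. Stores up to maxhits ids
 * in hits[] and returns the number stored, or -1 if the tree is deeper
 * than MAXDEPTH. Memory use is fixed, whatever the tree looks like. */
static int search(struct node *root, const struct rect *q,
                  int *hits, int maxhits)
{
    struct { struct node *n; int next; } stack[MAXDEPTH];
    int top = 0, nhits = 0, i;
    struct node *n;

    stack[0].n = root;
    stack[0].next = 0;

    while (top >= 0) {
        n = stack[top].n;
        if (stack[top].next >= n->count) {  /* node done, go back up */
            top--;
            continue;
        }
        i = stack[top].next++;              /* next entry of this node */
        if (!overlap(&n->box[i], q))
            continue;
        if (n->level > 0) {                 /* descend into child */
            if (top + 1 >= MAXDEPTH)
                return -1;
            top++;
            stack[top].n = n->child[i];
            stack[top].next = 0;
        }
        else if (nhits < maxhits) {
            hits[nhits++] = n->id[i];
        }
    }
    return nhits;
}
```

The point of this design is that the stack size is known in advance, so
the search cannot exhaust the call stack on a badly shaped tree.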

Considering that a file-based spatial index is only useful for massive
vectors where memory can become a limiting factor, I hesitate to commit
to trunk. A more sophisticated compromise would be to build the spatial
index in memory and then write it out to disk. Opening an old vector
read-only (Vect_open_old()) would be much faster, and spatial queries
would not be much slower than before, because only the head of the
spatial index is loaded and searches in the file are fairly fast. When
opening a new vector, the spatial index could be built in memory and then
written out. When opening an old vector for update, the spatial index
could be read into memory completely and written out when closing.
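
A rough sketch of that compromise, assuming a simple layout with a small
head followed by fixed-size node records: the index is written out in one
go after being built in memory, and a read-only open loads the head only
and fetches nodes from the file on demand. The structs and the idx_*()
functions are hypothetical and do not reflect the real sidx file format.

```c
#include <stdio.h>

/* Hypothetical on-disk layout: one head, then fixed-size node records.
 * Writing raw structs like this is NOT portable across platforms; a
 * real format would have to fix byte order and field sizes. */
struct idx_head { long n_nodes; long data_pos; };
struct idx_node { int id; double box[4]; };

/* Write an index that was built in memory: head first, then all nodes. */
static int idx_write(FILE *fp, const struct idx_node *nodes, long n)
{
    struct idx_head h;

    h.n_nodes = n;
    h.data_pos = (long)sizeof(h);  /* node records start after the head */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    if (fwrite(&h, sizeof(h), 1, fp) != 1)
        return -1;
    if (n > 0 && fwrite(nodes, sizeof(*nodes), (size_t)n, fp) != (size_t)n)
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/* Read-only open: load the head only, no node is read yet. */
static int idx_read_head(FILE *fp, struct idx_head *h)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return fread(h, sizeof(*h), 1, fp) == 1 ? 0 : -1;
}

/* Fetch a single node from the file when a search needs it. */
static int idx_read_node(FILE *fp, const struct idx_head *h, long i,
                         struct idx_node *out)
{
    if (i < 0 || i >= h->n_nodes)
        return -1;
    if (fseek(fp, h->data_pos + i * (long)sizeof(*out), SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof(*out), 1, fp) == 1 ? 0 : -1;
}
```

Opening for update would be the reverse of idx_write(): read all records
into memory, modify them there, and write everything out again on close.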

What to do now? Leave it all in memory as in GRASS 6, build in memory
then write out (with the risk of running out of memory on massive
datasets), or always keep it in file? I won't commit any time soon (I'm
also waiting for the lib/raster commotion to settle down). I need
feedback on how to proceed, or whether I should forget about it.

Markus M