[GRASS-dev] vector libs: file based spatial index

Thu Jun 25 08:43:39 EDT 2009

Paul Kelly wrote:
> On Thu, 25 Jun 2009, Markus GRASS wrote:
>
> [...]
>>> very little about in this case. E.g. for completely random access
>>> there might not be a lot of gain.
>>
>> It is completely random, the next chunk to be read/written can be
>> anywhere in the file.
>>
> [...]
>>> But if there was random access only within a certain section of a
>>> file, that section could be mapped into memory and access would then
>>> be quite fast.
>>
>> Does it make sense to always map a different section of the file? This
>> mapping would then need to replace each call to fread/fwrite because of
>> completely random access. I'm currently using the same method like for
>> the coor file which seems to work fine so far.
>
> No, I can't imagine that would give much improvement. The scenario I
> had hoped for was that when processing a particular sub-region of a
> vector map, all the needed data from the index would be near each
> other in memory 
Unfortunately not, you need the whole search tree also for processing a
subregion. The index is not a liner array but a search tree, and for a
search tree I don't know how to keep spatially close items close to each
other in file, particularly when modifying the tree.
> (and thus in the file), thus that area of the file (I was thinking up
> to several hundred megabytes max. for performance reasons, but this
> could obviously scale with increased datasizes) could be mapped into
> memory and intensive I/O (or just reading) done there, with the
> memory-mapping mechanism automatically taking care of the reads and
> writes from/to the underlying file.
>
> From what you have said it sounds like that would need a complete
> redesign of the algorithm used for indexing, or might not be possible.
> The other option would be to map the whole of the index file into memory;
Then we can just as well keep the whole index in memory as before ?
> it might be worth trying, but as you have already tried to optimise
> reading/writing performance based on knowledge of the algorithm, it is
> unlikely the OS could do any better. Still might be worth trying, but
> maybe too much work. So I think I should shut up now since (as is
> obvious) I don't know how the indexing algorithm works...
It's essentially a glorified binary search tree where a node has not 2
but (in this case) 10 children. As for BSTs, a search always starts at
the root of the tree and then descends down to end nodes. Such a search
is also performed for insertion and deletion. With the current
implementation, the location in file is completely mixed up and has very
little to do with spatial distance or search tree structure.

The principle of a search tree is very fast and should be kept for
speed. AFAIU the nature of the tree does not allow to have spatially
nearby features close to each other in file, if only because the root
node and following nodes spatially cover the whole vector, i.e. on that
lowest level all features are close to each other as far as the search
tree is concerned. I'm happy that I found a way to write out the index
in a way that allows search in file, always keeping memory requirements
down to a very few KB. There are other file-based RTree implementations,
but I don't understand them, cryptic code, no documentation (SQLite for
example has some RTree, didn't test it, also Norbert Beckmann's R*-tree
implementation).

Markus M