[GRASS-dev] vector libs: file based spatial index

Markus GRASS markus.metz.giswork at googlemail.com
Thu Jun 25 02:51:54 EDT 2009


Moritz Lennert wrote:
> Markus:
>> If an old vector is opened just for reading (v.what, v.info, probably
>> also d.vect), the fastest solution is probably to only load the header
>> of the spatial index, as is done for the coor file, and perform spatial
>> queries in file. This is very fast AFAICT.
>
> Then the main issue is during editing ? 
Yes. Modules where a speed decrease will be observed are e.g. v.in.ogr,
v.clean, v.buffer. Other modules, e.g. v.select and v.kernel, build
topology together with the spatial index only once at the end, when the
new vector is ready. In these cases there will be no obvious speed
difference.
> I haven't used vector files, yet, that have caused memory problems,
> but I have had serious speed problems...So, I would plead for whatever
> makes things faster.
Speed problems can have many reasons and need to be tackled one at a
time. Poor speed can be caused by either the module or the vector libs
or both.
>> Hmm yes, I think these massive datasets are still the exception, now and
>> then someone tries to work with huge vectors but this is not the
>> everyday case (maybe because it takes so long...).
>
> They will probably become more normal (cf Lidar), so we should plan
> with them in mind, without making everyone else "suffer" because a few
> people need to use them.
Lidar is a special case. I don't see a reason to drag topo and the
spatial index along with Lidar data; maybe the spatial index, but not
topo (I will reply to Hamish too).
>
>> 2) have
>> an env variable (could work), 
>
> As this retains flexibility for the future, I would favor this, but
> have no idea of what this entails in terms of increase of complexity
> of the code.
Two different rtree libraries, one memory based and one file based, need
to be available and maintained. Higher level stuff needs to prepare
temporary files only if needed, lower level stuff (diglib/spindex.c and
spindex_rw.c) needs to handle both cases properly. Some functions must be
present in two versions, file based and memory based. Not sure how much
change is required in the headers, particularly dig_structs.h.
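
To illustrate the env variable idea, here is a minimal sketch of how a
backend could be selected. Everything in it is an assumption made for
illustration: the type sidx_backend, the functions and the accepted
value "file" are hypothetical and are not part of the existing rtree
or diglib code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical description of a spatial index backend.
 * Not the actual GRASS rtree API. */
typedef struct {
    const char *name;
    int needs_tempfile; /* higher level code prepares a temp file only if set */
} sidx_backend;

static const sidx_backend SIDX_MEMORY = { "memory", 0 };
static const sidx_backend SIDX_FILE = { "file", 1 };

/* Pick a backend from the value of an env variable (may be NULL).
 * Anything other than "file" falls back to the memory based index,
 * i.e. the current behaviour stays the default. */
const sidx_backend *sidx_choose_backend(const char *value)
{
    if (value != NULL && strcmp(value, "file") == 0)
        return &SIDX_FILE;
    return &SIDX_MEMORY;
}

/* Convenience wrapper: look the variable up in the environment.
 * The variable name is supplied by the caller; no name is fixed here. */
const sidx_backend *sidx_backend_from_env(const char *varname)
{
    assert(varname != NULL);
    return sidx_choose_backend(getenv(varname));
}
```

With something like this, only the code that opens a vector has to look
at the environment; everything below it just asks the chosen backend
whether a temporary file is needed.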

>
> The largest file I have used is about 125000 areas with a topo file
> weighing 42M, so taking your worst estimation, this would mean around
> 200MB of spatial index, which is still largely acceptable for me.
When testing LFS in the vector libs, I had vectors with > 2GB coor file,
800MB topo (>400000 areas). Importing them required 5 - 6GB memory,
handling them was simply a PITA, lots of coffee breaks.
>
> I find it a bit difficult to give you a definitive answer on the base
> of theory alone. Do you have any means of testing the impact of one
> choice over the other for different use cases (editing, v.build,
> v.what - the latter especially when using in the GUI) ?
v.build is slower when keeping the modified spatial index in file. That
is expected, because v.build means rebuilding the spatial index, which
causes lots of file IO if intermediate data are stored on file.

v.what works like a charm when used from the GUI :-), both with query in
display mode and query in edit mode (db speed is the limiting factor).
Nothing needs to be rebuilt: a full 113 bytes are read (sidx header),
then the search can immediately take place in file.
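
A rough sketch of what "search in file" means: read a small header,
then walk the tree by seeking to one node at a time, so memory use stays
constant no matter how large the index is. The on-disk layout below
(sidx_header, disk_node, MAXCARD) is a toy layout invented for this
example. It is not the real sidx format and the toy header is not the
113 byte header mentioned above.

```c
#include <assert.h>
#include <stdio.h>

#define MAXCARD 4 /* branches per node in this toy layout */

typedef struct { double w, s, e, n; } rect; /* 2D bounding box */

typedef struct {
    int count;           /* number of used branches */
    int level;           /* 0 = leaf */
    rect box[MAXCARD];
    long child[MAXCARD]; /* leaf: feature id, else file offset of child node */
} disk_node;

/* Toy header holding only the root offset. Structs are written raw,
 * so files are only readable on the machine that wrote them. */
typedef struct { long root_offset; } sidx_header;

static int overlap(const rect *a, const rect *b)
{
    return a->w <= b->e && b->w <= a->e && a->s <= b->n && b->s <= a->n;
}

/* Search the subtree stored at file offset pos. Stores at most max ids
 * in ids and returns the running total of hits (which may exceed max). */
static int search_node(FILE *fp, long pos, const rect *q,
                       long *ids, int max, int found)
{
    disk_node node;
    int i;

    if (fseek(fp, pos, SEEK_SET) != 0 ||
        fread(&node, sizeof node, 1, fp) != 1)
        return found; /* I/O error: report what was found so far */

    for (i = 0; i < node.count; i++) {
        if (!overlap(&node.box[i], q))
            continue;
        if (node.level == 0) {
            if (found < max)
                ids[found] = node.child[i];
            found++;
        }
        else
            found = search_node(fp, node.child[i], q, ids, max, found);
    }
    return found;
}

/* Read the header only, then search directly in the file.
 * Returns the number of hits, or -1 if the header cannot be read. */
int sidx_search_file(FILE *fp, const rect *q, long *ids, int max)
{
    sidx_header head;

    assert(fp != NULL && q != NULL);
    if (fseek(fp, 0L, SEEK_SET) != 0 ||
        fread(&head, sizeof head, 1, fp) != 1)
        return -1;
    return search_node(fp, head.root_offset, q, ids, max, 0);
}
```

Only one node is held in memory per recursion level, which is why
nothing has to be loaded or rebuilt before the first query.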

I would suggest that I first implement a new version where the spatial
index is always written out when a new or modified vector is closed.
Intermediate data are still stored in memory. Opening an old vector in
read-only mode would then be faster, while opening an old vector in
update mode would work the same as it does currently, i.e. the spatial
index is loaded to memory. This can then be tested and polished, and
once that is stable, an env var could be added to keep the spatial index
in file when modifying (Vect_open_new or Vect_open_update). This would
only be needed for massive vectors.
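
The plan above boils down to a small decision table, sketched here in C.
The enums and the function name are hypothetical and only summarize the
proposal; they do not exist in the vector libs. The keep_in_file flag
stands for the env var of the second stage.

```c
#include <assert.h>

/* Hypothetical open modes, loosely mirroring reading an old vector,
 * Vect_open_update and Vect_open_new. */
typedef enum { OPEN_OLD_READONLY, OPEN_UPDATE, OPEN_NEW } open_mode;

/* Where the spatial index lives while the vector is open. */
typedef enum { SIDX_HEADER_ONLY, SIDX_IN_MEMORY, SIDX_IN_FILE } sidx_storage;

/* Stage 1 of the proposal: call with keep_in_file = 0.
 * Stage 2: keep_in_file reflects the env var, for massive vectors. */
sidx_storage sidx_storage_for(open_mode mode, int keep_in_file)
{
    assert(mode == OPEN_OLD_READONLY || mode == OPEN_UPDATE ||
           mode == OPEN_NEW);

    /* Read-only: load just the header, query in file. */
    if (mode == OPEN_OLD_READONLY)
        return SIDX_HEADER_ONLY;

    /* New or updated vector: memory by default, file only on request. */
    return keep_in_file ? SIDX_IN_FILE : SIDX_IN_MEMORY;
}
```

In both stages the index would be written out when a new or modified
vector is closed, so the read-only case never depends on the flag.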

Markus M


