[GRASS5] large point data sets

Helena Mitasova hmitaso at unity.ncsu.edu
Sun Jan 15 00:21:34 EST 2006


I have just submitted a modification of v.surf.rst that allows it to
read millions of points imported by v.in.ascii -btz (that is, 3D points
x,y,z without topology).
It was tested with both level 1 and level 2 data, but only with a few
data sets, so more testing is needed to make sure that we did not break
anything.
Thanks to Radim, Andy and Jaro for advice and implementation.

Helena

As a late response to the lively discussion regarding bug #3877
r.to.vect (I was away), here is a summary of where we stand with large
vector point data sets:

Currently you can import with v.in.ascii -btz (other options not
tested), display with d.vect ... and interpolate with v.surf.rst.
Tested with:
1.8 mil points: read in seconds by both v.in.ascii and v.surf.rst;
22 mil points: v.in.ascii needs 15 min to read them (there may still be
a way to make this faster), v.surf.rst reads the imported file in a
minute (the interpolation itself takes much longer).

This is by no means a full solution and so far applies only to 3D
vector points (x,y,z).
Just for the record, the following still needs to be done (and there is
surely more):

1. skip check for building topology when it is not needed:

Radim's suggestions for v.info:
I think that it is necessary to handle differently level 1 and level 2.
The reported info will be different and you have to decide if you
want to calculate extension for example for level 1 or it will not be
available.
Instead of
   Vect_set_open_level (2);
   Vect_open_old_head (&Map, in_opt->answer, mapset);
use
   level = Vect_open_old_head (&Map, in_opt->answer, mapset);
and then
   if ( level >= 2 ) {
       // current report
   } else if (level == 1) {
      // print information available on level 1, i.e. (probably)
      // without number of elements and without extension
      // or scan all elements and count them
   } else {
      // error
   }

Radim's suggestion for v.to.rast:
Open the vector like in v.info, then if it is level < 2, print warning
and do not call do_areas().
do_lines() must be rewritten to use Vect_read_next_line()
(sequential access) instead of Vect_read_line() (random access).

Something similar to the above could be done in r.to.vect for writing
the vector file.
I am not sure about g.region vect=

Question: there is a really useful program, s.to.rast2, in outgoing
that computes the average, min and max of the points found in each grid
cell. Can this be ported/added to v.to.rast? Does it need topology?
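
To make the question concrete, here is a minimal sketch of that kind of
per-cell statistic. It is not the s.to.rast2 code (I have not copied
anything from it); the Grid and CellStat types and the cellstat_*
functions are hypothetical names made up for illustration, and the grid
is assumed to be described by its north-west corner and cell size. The
point of the sketch is that each point is looked at once, in whatever
order it arrives, so at least this formulation needs only a sequential
read of the coordinates.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical per-cell accumulator; not a GRASS structure. */
typedef struct {
    long n;              /* number of points that fell in the cell */
    double sum;          /* sum of z values, for the average */
    double min, max;     /* smallest and largest z seen so far */
} CellStat;

/* Hypothetical grid description (assumed: origin at north-west corner). */
typedef struct {
    double west, north;  /* upper-left corner of the grid */
    double ewres, nsres; /* cell size in x and y */
    int rows, cols;
} Grid;

/* Reset all cells; 'cells' must hold rows * cols elements. */
void cellstat_init(CellStat *cells, const Grid *g)
{
    size_t i, ncells = (size_t)g->rows * (size_t)g->cols;

    for (i = 0; i < ncells; i++) {
        cells[i].n = 0;
        cells[i].sum = 0.0;
        cells[i].min = DBL_MAX;
        cells[i].max = -DBL_MAX;
    }
}

/* Add one point. Returns 1 if it fell inside the grid, 0 otherwise. */
int cellstat_add(CellStat *cells, const Grid *g,
                 double x, double y, double z)
{
    double dcol, drow;
    CellStat *c;

    assert(g->ewres > 0.0 && g->nsres > 0.0);

    dcol = (x - g->west) / g->ewres;
    drow = (g->north - y) / g->nsres;
    /* Range check on the doubles before converting to an index. */
    if (dcol < 0.0 || drow < 0.0 || dcol >= g->cols || drow >= g->rows)
        return 0;

    c = &cells[(size_t)drow * (size_t)g->cols + (size_t)dcol];
    c->n++;
    c->sum += z;
    if (z < c->min)
        c->min = z;
    if (z > c->max)
        c->max = z;
    return 1;
}

/* Average z of a cell; only valid for cells with at least one point. */
double cellstat_mean(const CellStat *c)
{
    assert(c->n > 0);
    return c->sum / (double)c->n;
}
```

A caller would run cellstat_init() once, call cellstat_add() for every
point as it is read, and then write out min, max and cellstat_mean()
for each cell with n > 0. Memory use depends on the number of cells,
not on the number of points.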

2. programs that need topology (level 2) should say that topology is
missing and that the user should run v.build
(most do it already, or they say "cannot read at level2",
which could be replaced by "topology is missing, run v.build")

For anything that needs topology and/or DBF:

3. V_build should be modified so that it does not freeze the machine:
- it should exit with some message (split your data or get a bigger
computer),
or V_build should be modified to avoid swapping. Here are the
suggestions from Hamish and Radim, but I don't quite understand them,
so I don't feel I can do anything about it:

Hamish suggests:
Once number of points is known, a calculation of memory use could be
done (~300 bytes per vector point?, best create a test point & sizeof()
rather than hardcode "300"). G_malloc() and G_free() could be
called as a test, which will call G_fatal_error() if the dataset is too
huge to complete. I don't think this reduces the need for a solution,
just makes the failure friendlier.
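
A rough sketch of what such a pre-check might look like is below. It is
only an illustration under assumptions: TestPoint is a made-up struct
standing in for whatever the library really stores per point (the true
per-point cost, including the spatial index, would have to be measured),
estimate_bytes() and memory_probe() are hypothetical helpers, and plain
malloc()/free() are used in place of G_malloc()/G_free() so that the
example compiles without GRASS.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Made-up stand-in for the per-point bookkeeping; the real cost per
 * point would have to come from the actual library structures. */
typedef struct {
    double x, y, z;
    long offset;
    int cat;
} TestPoint;

/* Estimated memory for npoints points, using sizeof() of a test
 * structure rather than a hardcoded constant.
 * Returns 0 if the product does not fit in size_t. */
size_t estimate_bytes(size_t npoints, size_t bytes_per_point)
{
    assert(bytes_per_point > 0);
    if (npoints > SIZE_MAX / bytes_per_point)
        return 0;               /* overflow: certainly too large */
    return npoints * bytes_per_point;
}

/* Try to allocate nbytes and free it again right away.
 * Returns 1 if the allocation succeeded, 0 otherwise. */
int memory_probe(size_t nbytes)
{
    void *p;

    if (nbytes == 0)
        return 0;
    p = malloc(nbytes);
    if (p == NULL)
        return 0;
    free(p);
    return 1;
}
```

A module would call memory_probe(estimate_bytes(npoints,
sizeof(TestPoint))) before building and print a friendly message if it
returns 0. One caveat: on systems that overcommit memory, malloc() can
succeed even when the memory is not really available, so a successful
probe is not a guarantee that the build will avoid swapping.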

Radim explains:
The spatial index occupies a lot of memory but it is necessary for
topology building. Also, it takes a long time to release the memory
occupied by the spatial index (dig_spidx_free).

The function building topology (Vect_build) is usually called
at the end of a module (before Vect_close), so it is faster to call
exit() and let the operating system release all the memory.
By default the memory is not released.

It is possible to call Vect_set_release_support() before Vect_close()
to force the memory to be released, but that takes a long time on large
files. Currently most of the modules do not release the spatial index
and work like this:
main
{
      Vect_open_new()
      //writing new vector

      Vect_build()
      Vect_close()  // memory is not released
}

you can add Vect_set_release_support():

main
{
      Vect_open_new()
      // writing new vector

      Vect_build()
      Vect_set_release_support()
      Vect_close()  // memory is released
}

but that only makes the module take longer.

It makes sense to release the spatial index if it is used only at the
beginning of a module, or in permanently running programs like QGIS.
For example:

main
{
      Vect_open_old()
      // select features using spatial index,
      // e.g. Vect_select_lines_by_box()
      Vect_set_release_support()
      Vect_close()  // memory is released

      // do some processing which needs memory
}

4. the dbf problem has not been addressed; a solution will be needed
for multiple point attributes (e.g. multiple returns, intensities etc.),
see Roger's comment re bug #3877 r.to.vect

One little wish:
add computation and output of number of points to v.in.ascii -b



