[GRASS-dev] Re: [GRASS GIS] #426: v.in.ogr: split long boundaries

Thu Jan 29 18:37:35 EST 2009

#426: v.in.ogr: split long boundaries
--------------------------+-------------------------------------------------
  Reporter:  neteler      |       Owner:  grass-dev at lists.osgeo.org
      Type:  enhancement  |      Status:  new                      
  Priority:  major        |   Milestone:  6.5.0                    
 Component:  Vector       |     Version:  unspecified              
Resolution:               |    Keywords:                           
  Platform:  All          |         Cpu:  All                      
--------------------------+-------------------------------------------------
Comment (by mmetz):

 The lack of boundary splitting is only one of the things that lead to
 complaints about v.in.ogr. I have added boundary splitting to my local
 copy of develbranch_6 and found sometimes substantial speed improvements
 (up to 6x as fast) and sometimes none (more or less the same, give or
 take a few seconds). The gain depends on properties of the vector to be
 imported that are independent of its size and of the number of features
 in it (see below). I have used ESRI Shapefiles for testing because they
 are the most common vectors to be imported and they contain only what is
 called "simple features", most importantly no topology.

 I found four ways to improve (ESRI Shapefile) vector import with
 v.in.ogr:

 1) the code comment "TODO: is it necessary to build here? probably not,
 consumes time"

 Indeed, a partial build with GV_BUILD_BASE is sufficient and results in a
 substantial speed improvement for large vectors.

 2) when area cleaning is desired (no -c flag), support for the output
 vector is released, the vector is closed, opened for update and a partial
 build with GV_BUILD_BASE is done

 This gives me an error with large vectors: the size of the coor file does
 not match the size recorded in the topology (the topo file). I disabled
 that part of the code and now it works. There must have been a reason for
 closing and reopening the vector, but I cannot think of one. Then again,
 I'm not that deep into GRASS vector processing, so please help me out. My
 argument is that v.in.ogr is far from finished with the vector at that
 point, so it is safer to keep it open and keep working on it. For me,
 this both speeds up the import and, most importantly, avoids import
 failures.

 3) split boundaries when imported vectors have many areas (> 500)

 I make this No 3) because No 2) is crucial (it avoids import failures)
 and No 2) depends on No 1). Splitting boundaries is just another speed
 improvement, but probably welcome to many users because the gain can be
 considerable. With my proposed method, boundaries are split using a
 threshold for boundary length. Whenever a new vertex is added to a
 boundary and the boundary length exceeds that threshold, the boundary is
 written out and a new boundary is started with the same Cats, if given.
 The reasoning behind the threshold is that a useful value is a function
 of vector area size and the number of areas. I propose to use
 map unit / ln(features). Map unit is sqrt(area size), which is reasonable
 for boundary length. For the bounding box of a boundary, rather than its
 length, I would use area size directly (keeping units identical). Using
 ln(features) avoids creating tiny thresholds when the vector to be
 imported contains very many features. I would understand if you think
 that this is nonsense, but it works, really! Both for a global map of the
 world with political boundaries in latlon and for a vector with watershed
 basins in UTM with an extent of 150x150 km. Anyway, boundaries will only
 be split if the -s flag is set (keeping compatibility with 6.4.0).
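
 The threshold and the splitting rule described above can be sketched as
 follows. This is a minimal illustration in Python, not the actual patch
 to v.in.ogr (which is written in C). The names split_threshold and
 split_boundary are hypothetical, and the sketch assumes planar
 coordinates, that a new boundary starts at the vertex where the previous
 one ended, and that no splitting is wanted when there are fewer than two
 features (ln(1) would give a division by zero).

```python
import math


def split_threshold(area_size, n_features):
    """Threshold = map unit / ln(features), with map unit = sqrt(area size)."""
    if n_features < 2:
        # Assumption: ln(1) == 0 would divide by zero, so never split here.
        return math.inf
    return math.sqrt(area_size) / math.log(n_features)


def split_boundary(vertices, threshold):
    """Cut one boundary, given as a list of (x, y) vertices, into pieces.

    A piece is written out as soon as its accumulated length exceeds the
    threshold, so a piece can overshoot the threshold by one segment.
    """
    pieces = []
    current = [vertices[0]]
    length = 0.0
    for prev, point in zip(vertices, vertices[1:]):
        current.append(point)
        length += math.hypot(point[0] - prev[0], point[1] - prev[1])
        if length > threshold:
            pieces.append(current)  # "write out" the boundary
            current = [point]       # assumed: next piece shares this vertex
            length = 0.0
    if len(current) > 1:
        pieces.append(current)      # remainder below the threshold
    return pieces


# Example: a 150x150 km extent in metres with 3421 features.
threshold = split_threshold(150000.0 ** 2, 3421)
```

 Because the divisor grows only logarithmically, going from a few hundred
 to many thousand features shrinks the threshold by a modest factor
 rather than by orders of magnitude.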

 4) use a temporary vector (suggested not by me but by Radim Blazek)

 Do all the processing and cleaning in a temporary vector, then copy only
 alive lines to the output vector. In case of ESRI Shapefile polygon
 import, this might reduce the size of the coor file by a factor of 2 (all
 boundaries used by GRASS are present twice in the shapefile). That would
 not only speed up further processing but also be safer. Thinking about
 it, this should be No 1) because, as Radim Blazek suggested, every module
 should do that to keep the size of the coor file small and speed up
 vector processing.
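
 The effect of copying only alive lines can be illustrated with a toy
 model. This is a Python sketch, not GRASS code: the coor file is modelled
 as an append-only list in which cleaning marks lines as dead instead of
 removing them. Line, remove_duplicates and copy_alive are hypothetical
 names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    coords: tuple        # vertices of the boundary
    alive: bool = True   # dead lines still occupy space in the file


def remove_duplicates(lines):
    """Mark every repeated boundary as dead, ignoring its direction."""
    seen = set()
    for line in lines:
        # Use the same key for a boundary and its reversed copy.
        key = min(line.coords, line.coords[::-1])
        if key in seen:
            line.alive = False
        else:
            seen.add(key)


def copy_alive(lines):
    """Copy only alive lines into a fresh output list."""
    return [Line(line.coords) for line in lines if line.alive]


# Two neighbouring polygons store their shared edge twice, once per polygon.
temp = [
    Line(((0, 0), (1, 0))),
    Line(((1, 0), (0, 0))),  # same edge, opposite direction
    Line(((1, 0), (1, 1))),
]
remove_duplicates(temp)
output = copy_alive(temp)
```

 In this model the temporary list keeps all three entries after cleaning,
 while the output holds only the two alive ones.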

 My ultimate test shapefile, the one I referred to above, has a size of
 4.6 MB and contains 3421 polygons. I would laugh at that and expect it to
 import in seconds. Then I noticed that the coor file created by GRASS is
 4.5 GB = 4608 MB (no typo). That is the size after reading in all
 boundaries, before cleaning. I'm not done with cleaning yet and am
 confident that the size will go over 5 GB; the maximum possible size
 should at least stay below 8 GB. If anybody out there gets as far as
 "Remove duplicates:" with the current v.in.ogr version in 6.4.0.RC3 or
 devbr_6, I will be ready to deliver a substantial prize. Let's say I pay
 for your next pizza delivery :-) I will send that shapefile on request.

 This innocent-looking little shapefile has one polygon with thousands of
 islands; that polygon is responsible for the large coor file and the long
 time needed to import the shapefile. I will manage to import it (in some
 more hours, it is still busy) and thus prove that my suggested
 improvement No 2) does indeed make sense.

 Markus M

-- 
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:1>
GRASS GIS <http://grass.osgeo.org>