[GRASS-dev] Re: [GRASS GIS] #426: v.in.ogr: split long boundaries
GRASS GIS
trac at osgeo.org
Thu Jan 29 18:37:35 EST 2009
#426: v.in.ogr: split long boundaries
--------------------------+-------------------------------------------------
Reporter: neteler | Owner: grass-dev at lists.osgeo.org
Type: enhancement | Status: new
Priority: major | Milestone: 6.5.0
Component: Vector | Version: unspecified
Resolution: | Keywords:
Platform: All | Cpu: All
--------------------------+-------------------------------------------------
Comment (by mmetz):
The lack of boundary splitting is only one of the aspects that lead to
complaints about v.in.ogr. I have added boundary splitting to my local copy
of develbranch_6 and found sometimes substantial speed improvements (up to
6x as fast), sometimes none (more or less the same, give or take a few
seconds). It depends on properties of the vector to be imported that are
independent of its size and the number of features in it (see below). I
have used ESRI Shapefiles for testing; they are the most common vectors to
be imported and they have only what is called "simple features", most
importantly no topology.
I have found four ways to improve (ESRI Shapefile) vector import
with v.in.ogr:
1) code comment quoting "TODO: is it necessary to build here? probably
not, consumes time"
Indeed, a partial build with GV_BUILD_BASE is sufficient and results in a
substantial speed improvement for large vectors.
2) when area cleaning is desired (no -c flag), support for the output
vector is released, the vector is closed, opened for update and a partial
build with GV_BUILD_BASE is done
This gives me an error with large vectors: the size of the coor file does
not match the size given by topology (the topo file). I disabled that part
of the code and now it works. There must have been a reason to do that but
I can not think of a reason. But then I'm not that deep into GRASS vector
processing, please help me out. My argument is that v.in.ogr is far from
finished with that vector and that it is safer to keep it open and keep
working on it. For me, this both improves speed and, most importantly,
avoids import failures.
3) split boundaries when the vectors to be imported have many areas (> 500)
I make this No 3) because No 2) is crucial, it avoids import failures, and
No 2) depends on No 1). Splitting boundaries is just another speed
improvement but probably welcome for many users because the speed
improvement can be quite a bit. With my proposed method, splitting
boundaries is done with a threshold for boundary length. Whenever a new
vertex is added to a boundary and the boundary length exceeds that
threshold, the boundary is written out and a new boundary started with the
same Cats if given. The reasoning to determine the threshold is that a
useful threshold is a function of vector area size and the number of
areas. I propose to use map unit / ln(features). Map unit is sqrt(area
size), reasonable for boundary length. For the bounding box of a boundary
rather than its length, I would use area size directly (to keep units
identical). Using ln(features) avoids creating tiny tiny thresholds when
many many features are in the vector to be imported. I would understand if
you think that this is nonsense, but it works, really! Both for a global
map of the world with political boundaries in latlon and for a vector with
watershed basins in UTM with a 150x150 km extent. Anyway, splitting
boundaries will only happen if the -s flag is set (keeping compatibility
with 6.4.0).
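The threshold and the splitting rule described in No 3) can be sketched in a few lines. This is only an illustration in Python, not the code added to v.in.ogr: the function names split_threshold and split_boundary are made up here, "area size" is assumed to be a single number in squared map units, and a boundary is assumed to be a plain list of (x, y) tuples. Each new part starts at the last vertex of the previous part, so the pieces stay connected.

```python
import math


def split_threshold(area_size, n_features):
    """Length threshold: sqrt(area size) / ln(number of features)."""
    if n_features <= 1:
        # ln(1) = 0, so there is no finite threshold; never split.
        return math.inf
    return math.sqrt(area_size) / math.log(n_features)


def split_boundary(vertices, threshold):
    """Cut a boundary (list of (x, y) tuples) into shorter parts.

    A part is written out as soon as adding a vertex makes its
    length exceed the threshold; the next part starts at that vertex.
    """
    if len(vertices) < 2:
        return []
    parts = []
    current = [vertices[0]]
    length = 0.0
    for prev, pt in zip(vertices, vertices[1:]):
        current.append(pt)
        length += math.hypot(pt[0] - prev[0], pt[1] - prev[1])
        if length > threshold:
            parts.append(current)  # "write out" the finished part
            current = [pt]         # start a new part at the shared vertex
            length = 0.0
    if len(current) > 1:
        parts.append(current)      # remaining tail shorter than threshold
    return parts


# A 10x10 square ring, split with an arbitrary threshold of 15 map units.
ring = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
parts = split_boundary(ring, threshold=15.0)
```

In a real import the category values of the original boundary would be attached to every part; that bookkeeping is left out of the sketch.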
4) use a temporary vector (suggested not by me but by Radim Blazek)
Do all the processing and cleaning in a temporary vector, then copy only
alive lines to the output vector. In case of ESRI Shapefile polygon
import, this might reduce the size of the coor file by a factor of 2 (all
boundaries used by GRASS are present twice in the shapefile). That would
not only be a speed improvement for further processing but also be safer.
Thinking about it, that should be No 1) because, as Radim Blazek suggested,
every module should do that to keep the size of the coor file small and
speed up vector processing.
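The idea behind No 4) can be shown with a toy model. The sketch below is not the GRASS vector API: a "vector" is assumed to be a plain list of dicts with a hypothetical "alive" flag, and the helper names are invented for the illustration. Cleaning only marks duplicate boundaries as dead, so they keep taking up space in the temporary vector, and the final copy drops them.

```python
def mark_duplicates_dead(lines):
    """Mark every repeated boundary as dead, ignoring vertex order."""
    seen = set()
    for line in lines:
        coords = tuple(line["coords"])
        # The same boundary may be stored in either direction.
        key = min(coords, coords[::-1])
        if key in seen:
            line["alive"] = False
        else:
            seen.add(key)


def copy_alive(lines):
    """Copy only the lines that are still alive to a new vector."""
    return [dict(line) for line in lines if line["alive"]]


# Temporary vector: the shared edge of two polygons is present twice.
temp_vector = [
    {"coords": [(0, 0), (0, 1)], "alive": True},
    {"coords": [(0, 1), (0, 0)], "alive": True},  # same edge, reversed
    {"coords": [(0, 1), (1, 1)], "alive": True},
]
mark_duplicates_dead(temp_vector)
output_vector = copy_alive(temp_vector)
```

The temporary vector still holds all three lines after cleaning, while the output vector holds only the two that are alive.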
My ultimate testing shapefile that I referred to above has a size of 4.6
MB and 3421 polygons. I would laugh at that and expect seconds to import
it. Then I noticed that the coor file created by GRASS is 4.5 GB = 4608 MB
large (no typo). That is the size after reading in all boundaries, before
cleaning. I'm not done yet with cleaning and am confident that the size
will go over 5 GB; the maximum possible size should in any case stay below
8 GB. If anybody out there gets as far as "Remove duplicates:" with the
current v.in.ogr version in 6.4.0.RC3 or devbr_6, I will be ready to
award a substantial prize. Let's say I pay for your next pizza delivery :-)
I will send that shapefile on request.
This innocent-looking little shapefile has one polygon with thousands of
islands; that one is responsible for the large coor file and the long time
needed to import that shapefile. I will manage to import it (some more
hours later, still busy) and thus prove that my suggested improvement No
2) does indeed make sense.
Markus M
--
Ticket URL: <http://trac.osgeo.org/grass/ticket/426#comment:1>
GRASS GIS <http://grass.osgeo.org>