[GRASS-dev] vector large file support
Markus Metz
markus.metz.giswork at googlemail.com
Tue Feb 3 04:51:50 EST 2009
While looking into ways to optimize v.in.ogr, I noticed that GRASS
does not support coor files larger than 2 GB. With topological
information stored in that file, and often many dead lines wasting
space, the coor file can easily exceed 2 GB nowadays. While v.in.ogr was
cleaning one particular vector, the coor file size went up to 9 GB. I
killed v.in.ogr before it finished; the coor file resulting from
writing out that vector might well have been above 10 GB. GRASS can
process such large coor files (to a degree) as long as topo is kept in
memory, e.g. with v.in.ogr and v.clean, and can close such a vector and
write it to disk. But that vector cannot be opened again: the topo file
stores the size of the coor file, and that stored value cannot exceed
2 GB (integer limit), so it does not match the actual coor file size
and opening fails with an error.
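To illustrate the integer limit described above, here is a minimal,
self-contained C sketch. It does not use the GRASS vector library; the
function names are hypothetical, and the assumption that only the low
32 bits of the size survive is for illustration only (what GRASS
actually writes to the topo file in this case is not stated in this
message).

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte count a signed 32-bit field can hold: 2 GiB - 1. */
#define MAX_SIZE_32 ((int64_t)INT32_MAX)

/* Returns 1 if size_bytes fits a signed 32-bit size field, else 0. */
int size_fits_32(int64_t size_bytes)
{
    return size_bytes >= 0 && size_bytes <= MAX_SIZE_32;
}

/* What remains if only the low 32 bits of a size are kept (assumption).
 * Conversion to uint32_t is well defined in C: it is taken modulo 2^32. */
uint32_t low_32_bits(int64_t size_bytes)
{
    assert(size_bytes >= 0);
    return (uint32_t)size_bytes;
}

/* Hypothetical open-time check: does the recorded size match the
 * actual size of the file on disk? */
int recorded_size_matches(uint32_t recorded, int64_t actual_bytes)
{
    return (int64_t)recorded == actual_bytes;
}
```

With a 9 GB file (9 * 1024 * 1024 * 1024 bytes), size_fits_32() returns
0 and the low 32 bits amount to 1 GB, so a check like
recorded_size_matches() fails, which is the kind of mismatch described
above.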
I want to propose some solutions to this problem:
high-level
Modules modifying the coor file, e.g. v.in.ogr, v.clean, v.overlay,
v.buffer, should do all the processing in a temporary vector and at the
end copy only the alive lines to the final output vector;
Vect_copy_map_lines() does that. When importing shapefiles with areas I
noticed a coor file size reduction by a factor of 2 to 5, which is quite
a lot (e.g. a 1 GB coor file can be melted down to 200 MB, much nicer).
This is also suggested in the vector TODO [1], I'm just pressing again.
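The effect of copying only alive lines can be shown with a toy model.
This is a sketch, not GRASS code: struct toy_line, copy_alive_lines()
and total_bytes() are hypothetical names, and a real module would call
Vect_copy_map_lines() on open vector maps instead of working on an
array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for one line record in a coor file (hypothetical). */
struct toy_line {
    int alive;      /* 1 = alive, 0 = dead (deleted or rewritten) */
    long nbytes;    /* space the record occupies in the file */
};

/* Copy only the alive records from in[0..n-1] to out, preserving order.
 * out must have room for n records. Returns the number of records kept. */
size_t copy_alive_lines(const struct toy_line *in, size_t n,
                        struct toy_line *out)
{
    size_t i, kept = 0;

    assert(n == 0 || (in != NULL && out != NULL));
    for (i = 0; i < n; i++) {
        if (in[i].alive)
            out[kept++] = in[i];
    }
    return kept;
}

/* Sum of the space occupied by all records, dead or alive. */
long total_bytes(const struct toy_line *lines, size_t n)
{
    size_t i;
    long sum = 0;

    for (i = 0; i < n; i++)
        sum += lines[i].nbytes;
    return sum;
}
```

For four records of 100, 300, 50 and 550 bytes where the second and
fourth are dead, the input occupies 1000 bytes and the compacted output
150 bytes.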
low-level
The coor file size is stored in memory as type long (a 32 bit integer on
a 32bit system, and on my 64bit Linux with 32bit compatibility),
counting the bytes of the coor file. That gives the 2 GB limit. When
closing the vector, this number is written to the topo file. When
opening that vector again, this number is read from the topo file and
compared to the actual coor file size; this is where the 2 GB limit
takes effect.
If this coor file size information in the topo file is just a safety
check and not needed to process the coor file, it could be omitted
altogether, making the supported coor file size unlimited (limited by
the current system and filesystem). All references to the coor file size
would need to be removed from the vector library.
If the coor file size stored in the topo file is indeed needed to
properly process the coor file, the respective variables must be
something other than long in order to support coor files larger than 2
GB, maybe long long? The same goes for all intermediate variables in the
vector library that store the coor file size.
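One possible alternative to long long, not proposed in this message and
given here only as an assumption, is the POSIX type off_t together with
the _FILE_OFFSET_BITS=64 feature macro, which asks for a 64 bit file
offset type even on 32bit glibc systems. The sketch below is standalone
C, not vector library code, and file_size_bytes() is a hypothetical
helper name.

```c
/* Both macros must be defined before any system header is included. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64        /* request a 64-bit off_t */
#endif
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* expose fseeko() and ftello() */
#endif

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Returns the size of the file at path in bytes, or -1 on error.
 * off_t is used so that sizes above 2 GB can be represented. */
off_t file_size_bytes(const char *path)
{
    FILE *fp;
    off_t size;

    assert(path != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return (off_t)-1;
    if (fseeko(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return (off_t)-1;
    }
    size = ftello(fp);   /* position at end of file == file size */
    fclose(fp);
    return size;
}
```

Whether off_t, long long or a fixed-width type such as int64_t fits the
vector library best would depend on how the size is written to and read
from the topo file, which this sketch does not address.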
Looking at limits.h, long can be as wide as int or as wide as long long
(the latter only on true 64 bit systems). I use Linux 64bit with 32bit
compatibility; here long is like int. Could someone more familiar with
type limits and type declarations on different systems please help?
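Since the width of long differs between platforms (it is commonly 64
bits on 64bit Unix-like systems but stays 32 bits on 64bit Windows), it
seems safer to check than to assume. A minimal sketch using only
limits.h follows; the function names are hypothetical, and
static_assert requires a C11 compiler.

```c
#include <assert.h>
#include <limits.h>

/* The C standard guarantees long long has at least 64 bits;
 * this compile-time check only documents that fact. */
static_assert(sizeof(long long) * CHAR_BIT >= 64,
              "long long must be at least 64 bits wide");

/* Storage width of each type in bits on the current platform. */
int bits_of_int(void)       { return (int)(sizeof(int) * CHAR_BIT); }
int bits_of_long(void)      { return (int)(sizeof(long) * CHAR_BIT); }
int bits_of_long_long(void) { return (int)(sizeof(long long) * CHAR_BIT); }

/* Returns 1 if long can count more bytes than 2 GB - 1, else 0. */
int long_exceeds_2gb(void)
{
    return LONG_MAX > 2147483647L;
}
```

If long_exceeds_2gb() returns 0 on a given build, any coor file size
kept in a long is capped at 2 GB there, regardless of what the
filesystem supports.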
I also suspect some integer overflow for large coor files in rtree,
maybe someone in the know could look into that? :-)
Regards,
Markus M
[1] http://freegis.org/cgi-bin/viewcvs.cgi/*checkout*/grass6/doc/vector/TODO