[GRASS-dev] vector large file support

Markus Metz markus.metz.giswork at googlemail.com
Tue Feb 3 04:51:50 EST 2009


While looking into possibilities to optimize v.in.ogr, I noticed that 
GRASS does not support coor files larger than 2 GB. With topological 
information stored in that file, and often many dead lines wasting 
space, the coor file can easily exceed 2 GB nowadays. While v.in.ogr was 
cleaning one particular vector, the coor file size went up to 9 GB. I 
killed v.in.ogr before it finished; the resulting coor file after 
writing out that vector may well have been above 10 GB. GRASS can 
process such large coor files (to a degree) as long as topo is kept in 
memory, e.g. with v.in.ogr and v.clean, and can close such a vector and 
write it to disk. But that vector cannot be opened again: the topo file 
stores the size of the coor file, and that stored value cannot exceed 
2 GB (integer limit), so it no longer matches the actual file size and 
opening fails with an error.
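
To make the integer limit concrete, here is a minimal sketch of the 
arithmetic. The helper name fits_in_32bit() is hypothetical and not part 
of the vector library; it only assumes that the size is held in a signed 
32-bit integer, as described above.

```c
#include <stdint.h>

/* Largest byte count a signed 32-bit integer can hold: 2^31 - 1,
 * i.e. 2147483647 bytes, just under 2 GB. */
#define MAX_SIZE_32BIT INT32_MAX

/* Hypothetical helper: nonzero if a byte count can be stored in a
 * signed 32-bit field without overflow. */
static int fits_in_32bit(int64_t nbytes)
{
    return nbytes >= 0 && nbytes <= MAX_SIZE_32BIT;
}
```

With this check, fits_in_32bit(9LL * 1024 * 1024 * 1024) is 0: the 9 GB 
coor file from the example above has no valid 32-bit representation.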

I want to propose some solutions to this problem:
high-level
Modules modifying the coor file, e.g. v.in.ogr, v.clean, v.overlay, 
v.buffer, should do all the processing in a temporary vector and at the 
end copy only alive lines to the final output vector; 
Vect_copy_map_lines() does that. When importing shapefiles with areas I 
noticed a coor file size reduction by a factor of 2 to 5, which is quite 
a lot (e.g. a 1 GB coor file can be melted down to 200 MB, much nicer). 
This is also suggested in the vector TODO [1]; I'm just pressing again.
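
The idea of copying only alive lines can be illustrated with a toy 
model. This is a sketch, not the library code: struct toy_line and 
copy_alive_lines() are hypothetical, and real coor entries are 
variable-length records rather than fixed structs. In a module, 
Vect_copy_map_lines() would do the actual work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one line in the coor file. */
struct toy_line {
    int alive;      /* 1 = alive, 0 = dead (deleted or rewritten) */
    size_t nbytes;  /* bytes this line occupies in the file */
};

/* Copy only alive lines from src to dst (dst must have room for n
 * entries). Stores the number of lines kept in *n_out and returns
 * the total bytes of the compacted output. */
static size_t copy_alive_lines(const struct toy_line *src, size_t n,
                               struct toy_line *dst, size_t *n_out)
{
    size_t i, kept = 0, bytes = 0;

    assert(src != NULL && dst != NULL && n_out != NULL);

    for (i = 0; i < n; i++) {
        if (src[i].alive) {
            dst[kept++] = src[i];
            bytes += src[i].nbytes;
        }
    }
    *n_out = kept;
    return bytes;
}
```

Dead lines contribute nothing to the output, so the compacted size 
depends only on the alive lines, however much cleaning went on in the 
temporary vector.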

low-level
The coor file size is stored in memory as type long (a 32 bit integer on 
a 32bit system, and on my 64bit Linux with 32bit compatibility), 
counting bytes of the coor file. That gives the 2 GB limit. When closing 
the vector, this number is written to the topo file. When opening that 
vector again, this number is read from the topo file and compared to the 
actual coor file size; this comparison is where the 2 GB limit bites.
If this coor file size information in the topo file is just a safety 
check and not needed to process the coor file, it could be omitted 
altogether, so that the supported coor file size would be limited only 
by the current system and filesystem. All references to the coor file 
size would need to be removed from the vector library.
If the coor file size stored in the topo file is indeed needed to 
properly process the coor file, the respective variables must be 
something other than long in order to support coor files larger than 2 
GB, maybe long long? The same goes for all intermediate variables in the 
vector library storing the coor file size.
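
If the size is kept, one way to make it independent of the width of long 
is to hold it in a fixed 64-bit type and write it as 8 bytes in a fixed 
byte order. The sketch below assumes exactly that. encode_size() and 
decode_size() are hypothetical names, not the vector library's portable 
I/O routines, and changing the field width would also change the topo 
file format.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the vector library. */

/* Store a 64-bit byte count in buf as 8 little-endian bytes. */
static void encode_size(uint64_t size, unsigned char buf[8])
{
    int i;

    for (i = 0; i < 8; i++)
        buf[i] = (unsigned char)((size >> (8 * i)) & 0xff);
}

/* Read back a byte count written by encode_size(). */
static uint64_t decode_size(const unsigned char buf[8])
{
    uint64_t size = 0;
    int i;

    for (i = 0; i < 8; i++)
        size |= (uint64_t)buf[i] << (8 * i);
    return size;
}
```

A 10 GB size survives the round trip unchanged, regardless of whether 
long is 32 or 64 bits on the machine reading or writing the file.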
Looking at limits.h, long can be like int or like long long (only on 
true 64 bit systems). I use Linux 64bit with 32bit compatibility; here 
long is like int. Could someone more familiar with type limits and type 
declarations on different systems please help?
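
The width of long can be checked at compile time or run time rather than 
read off limits.h by hand. A minimal sketch with hypothetical helper 
names follows. As far as the C standard goes, long is only guaranteed to 
be at least 32 bits: it is 32 bits in 32-bit builds and on 64-bit 
Windows, and 64 bits in native 64-bit Linux builds. long long is 
guaranteed to be at least 64 bits since C99.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if `long` can count bytes beyond the
 * signed 32-bit limit on the platform this is compiled for. */
static int long_holds_large_sizes(void)
{
    return LONG_MAX > INT32_MAX;
}

/* Hypothetical helper: same question for `long long`. C99 requires
 * LLONG_MAX to be at least 2^63 - 1, so this is nonzero on any
 * conforming compiler. */
static int long_long_holds_large_sizes(void)
{
    return LLONG_MAX > INT32_MAX;
}
```

If long_holds_large_sizes() returns 0 for a given build, that build is 
subject to the 2 GB limit as long as the size is stored in a long.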

I also suspect some integer overflow in rtree for large coor files; 
maybe someone in the know could look into that? :-)

Regards,

Markus M

[1] http://freegis.org/cgi-bin/viewcvs.cgi/*checkout*/grass6/doc/vector/TODO


