[GRASS-dev] vector large file support

Markus Metz markus.metz.giswork at googlemail.com
Sat Feb 7 11:44:39 EST 2009


Glynn Clements wrote:
> lseek() always uses off_t. Originally it used long (hence the name
> "l"seek), but that's ancient history; you won't find such a system
> outside of a museum.
>
> _FILE_OFFSET_BITS determines whether off_t is 32 or 64 bits. If it's
> 64 bits, many of the POSIX I/O functions (open, read, write, lseek)
> are redirected to 64-bit equivalents (open64, read64, etc).
>   
That's why I asked if fseek, fread, fwrite etc. can be replaced with 
lseek, read, write etc. :-) There would be no need to check 
HAVE_LARGEFILES with lseek etc., just compile with 
-D_FILE_OFFSET_BITS=64.
Do I understand right that fseeko and ftello are only needed on 32-bit 
systems that want -D_FILE_OFFSET_BITS=64? fseek, for example, takes a 
long offset (and ftell returns long), and long is 64 bits on my 64-bit 
Linux. I guess that's why I can write coor files > 2GB with the current 
vector libs.
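
To make the long vs. off_t question concrete, here is a minimal sketch 
that reports both widths. The function names are made up for this 
example and are not part of any GRASS library. On a 64-bit Linux both 
types are normally 64 bits wide; on a 32-bit glibc system off_t only 
becomes 64 bits wide when _FILE_OFFSET_BITS=64 is defined before the 
first system header is included.

```c
/* Normally passed as -D_FILE_OFFSET_BITS=64 on the compiler command
 * line; it must be defined before any system header is included. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Width in bits of long, the offset type of fseek() and ftell(). */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Width in bits of off_t, the offset type of fseeko(), ftello()
 * and lseek(). */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}

/* Print both widths. If off_t is wider than long, fseek()/ftell()
 * cannot address the whole file and fseeko()/ftello() are needed. */
void report_offset_widths(void)
{
    printf("long: %d bits, off_t: %d bits\n", long_bits(), off_t_bits());
}
```

Calling report_offset_widths() from a small main() on each target 
system shows whether the plain stdio calls are enough there.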
> It's not worth using "raw" I/O just to avoid this issue. Apart from
> anything else, there's a potentially huge performance hit, as the
> vector library tends to use many small read/write operations. Using
> low-level I/O requires a system call for each operation, while the
> stdio interface will coalesce these, reading/writing whole blocks.
>   
Interesting and good to know. So we do need G_fseek() and G_ftell().
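
A rough sketch of what such wrappers might look like. Everything here 
is an assumption for illustration: the names sketch_fseek() and 
sketch_ftell() stand in for the proposed G_fseek() and G_ftell(), whose 
signatures are not settled, and HAVE_FSEEKO is a hypothetical 
configure-time macro. The idea is to keep buffered stdio but take an 
off_t, falling back to fseek()/ftell() with a range check where 
fseeko()/ftello() are not available.

```c
/* Normally passed as -D_FILE_OFFSET_BITS=64 on the compiler command
 * line; it must be defined before any system header is included. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Seek with an off_t offset. Returns 0 on success, -1 on error.
 * HAVE_FSEEKO is an assumed macro, not taken from the GRASS build. */
int sketch_fseek(FILE *fp, off_t offset, int whence)
{
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    /* fseek() takes a long: refuse offsets it cannot represent. */
    if (offset > (off_t)LONG_MAX || offset < (off_t)LONG_MIN) {
        errno = EOVERFLOW;
        return -1;
    }
    return fseek(fp, (long)offset, whence);
#endif
}

/* Report the current position as off_t, or -1 on error. */
off_t sketch_ftell(FILE *fp)
{
#ifdef HAVE_FSEEKO
    return ftello(fp);
#else
    return (off_t)ftell(fp);
#endif
}
```

Library code would then call only the wrappers, so the decision 
between fseek and fseeko is made in one place.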
>   
>> The problem I see is that offset values are stored in topo and cidx 
>> (e.g. the topo file knows that line i is in the coor file at offset o). 
>> So if the topo file was written with 64-bit off_t but the current 
>> compiled library uses 32-bit off_t, can this 32-bit library somehow get 
>> these 64-bit offset values out of the topo file?
>>     
>
> In the worst case, it can just perform 2 32-bit reads, and check that
> the high word is zero and the low word is positive.
>   
Uff. Some more safety checks in the code. From a coding perspective it's 
easier just to request a topology rebuild. Annoying for the user though. 
OTOH, since the coor file size check is done before anything is read 
from the coor file, the libs could say something like "Sorry, that 
vector is too big for you. Please recompile GRASS with LFS" (more 
friendly phrasing needed). Also potentially annoying. But if the coor 
file size check is passed (<= 2GB), the high word must always be zero, 
otherwise it would refer to an offset beyond EOF. You could just use the 
low word value. Would you have to swap high word and low word if the 
byte order of the vector is different from the byte order of the current 
system? That can happen when e.g. a whole GRASS location is copied to 
another system. I think not, because the vector libs use their own fixed 
byte order. I would really just request a topology rebuild to avoid all 
this hassle.
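
For reference, the "two 32-bit reads" approach could be sketched as 
below. This is an illustration only: it assumes the 8-byte offset is 
stored most significant byte first, which is my assumption and not 
checked against the actual topo format, and decode_offset32() is a 
made-up name. Because the words are assembled from single bytes with 
shifts, the byte order of the current system does not matter and no 
swapping is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit word from 4 bytes stored most significant byte
 * first (assumed on-disk order). Independent of host byte order. */
static uint32_t word_from_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode an 8-byte on-disk offset for a library built with a 32-bit
 * off_t. Returns 0 and stores the offset in *out if it fits,
 * -1 if it is out of range for a signed 32-bit value. */
int decode_offset32(const unsigned char buf[8], int32_t *out)
{
    uint32_t high = word_from_be(buf);     /* first 32-bit read */
    uint32_t low = word_from_be(buf + 4);  /* second 32-bit read */

    if (high != 0)                   /* offset is 4 GiB or larger */
        return -1;
    if (low > (uint32_t)INT32_MAX)   /* sign bit set: 2 GiB or larger */
        return -1;

    *out = (int32_t)low;
    return 0;
}
```

A caller that gets -1 would stop and report that the vector needs a 
library built with LFS.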
> If the topo file contains any offsets which exceed the 2GiB range,
> then the coor file will be larger than 2GiB. If you aren't using
> _FILE_OFFSET_BITS=64, open()ing the coor file will likely fail.
>   
Opening the coor file is not even attempted with the current code in 
this situation, because the coor file size stored in the topo header 
cannot be larger than 2GB and this size is used for a safety check 
before opening the coor file. Actually, I don't know what would happen 
on a 32-bit system. If new vector libs are compiled without LFS, does a 
32-bit system have a chance to find out that the coor file is too large? 
To be precise, when calling stat(path, &stat_buf), what would be the 
maximum possible value of stat_buf.st_size on a 32-bit system? Likely 
LONG_MAX.
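
As far as POSIX goes, stat() is specified to fail with errno set to 
EOVERFLOW when the file size cannot be represented in st_size, rather 
than returning a truncated value, so a build without LFS should be able 
to notice an oversized coor file. A small sketch of such a check 
follows; the enum and the function name check_file_size() are 
hypothetical and only serve this example.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Result codes for this sketch (hypothetical names). */
enum size_check { SIZE_OK = 0, SIZE_TOO_LARGE = 1, SIZE_ERROR = -1 };

/* Check whether the size of the file at 'path' can be represented by
 * this build's off_t. On SIZE_OK, *size holds the size in bytes. */
enum size_check check_file_size(const char *path, off_t *size)
{
    struct stat sb;

    if (stat(path, &sb) != 0) {
        /* EOVERFLOW: the size does not fit into st_size, which
         * typically means the file is too big for a non-LFS build. */
        return (errno == EOVERFLOW) ? SIZE_TOO_LARGE : SIZE_ERROR;
    }

    *size = sb.st_size;
    return SIZE_OK;
}
```

On SIZE_TOO_LARGE the library could print the "please recompile with 
LFS" message before attempting to open the file.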
> OTOH, this amounts to a format change, so you may as well just add a
> new field to the header. Either way, the version number needs to be
> increased.
>   
Increasing the minor version number of topo should be sufficient, but 
the backwards compatibility minor version number of topo must also be 
increased to enforce rebuilding of topology when vectors written with 
new libs are opened with old libs; the old libs will then write new 
topo and cidx files. I would try to keep the coor version numbers as 
they are, that would at least give backwards/forwards portability of 
vector files. cidx version numbers could stay unchanged, only that 
offset values could be stored as 64-bit. But topo is read first, so the 
information in the header of the topo file can (must?) be used for 
safety checks. I guess we are lost if someone produces a topo file 
>2GB, but a vector with such a large topo file would be a nightmare to 
work with anyway. No idea if this still holds true in say 5 years from 
now (my largest topo file so far is 600MB, unworkable though because 
there is no LFS in the vector libs and coor is >2GB).
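
The version logic could be sketched roughly like this. It assumes that 
the topo header carries a backwards compatibility version meaning "the 
oldest library version that can read this file"; the struct and the 
function names are hypothetical and not taken from the vector libs.

```c
/* A major/minor version pair as it might appear in a topo header
 * (hypothetical layout). */
struct fmt_version {
    int major;
    int minor;
};

/* Return 1 if version a is newer than version b, else 0. */
static int version_newer(struct fmt_version a, struct fmt_version b)
{
    return a.major > b.major ||
           (a.major == b.major && a.minor > b.minor);
}

/* Return 1 if a library of version 'lib' can use a topo file whose
 * backwards compatibility version is 'file_back', or 0 if topology
 * has to be rebuilt first. */
int topo_usable(struct fmt_version lib, struct fmt_version file_back)
{
    return !version_newer(file_back, lib);
}
```

With the backwards compatibility minor number increased, an old 
library gets 0 here and rebuilds topology, while a new library still 
accepts files written by old libs.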

I think we could soon come up with a detailed plan of action: what are 
the currently known caveats, what should be done where in what order to 
get LFS into the vector libs. Anybody taking on this task would profit 
from such a guideline, with a big warning that the suggested changes may 
not be sufficient, that something may have been missed, and that the 
list of caveats is most likely not complete.

Lots of "if"s and "but"s and "?" in this post of mine.

PS: Thanks for your patience, Glynn.


