[GRASS-dev] vector large file support

Glynn Clements glynn at gclements.plus.com
Fri Feb 6 21:40:46 EST 2009


Markus Metz wrote:

> What about off_t lseek(int fd, off_t offset, int whence) ?
> From the GNU C library: "The lseek function is the 
> underlying primitive for the fseek, fseeko, ftell, ftello and rewind 
> functions [...]" lseek is used in libgis and several modules, I didn't 
> see something like the above #ifdef construct.

lseek() always uses off_t. Originally it used long (hence the name
"l"seek), but that's ancient history; you won't find such a system
outside of a museum.

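_FILE_OFFSET_BITS determines whether off_t is 32 or 64 bits. If it's
64 bits, the POSIX I/O functions which are affected by the offset size
(open, lseek, stat, ftruncate) are redirected to 64-bit equivalents
(open64, lseek64, etc). read() and write() don't take an offset, so
they have no separate 64-bit variants.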
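
As a rough illustration of the above, the following sketch reports the
width of off_t for the translation unit it is compiled in. The function
names are made up for this example, and the redirection to the *64
variants is glibc behaviour; other platforms may simply have a 64-bit
off_t unconditionally.

```c
/* Must be defined before any system header is included; in a real
 * build this normally comes from CFLAGS (-D_FILE_OFFSET_BITS=64). */
#define _FILE_OFFSET_BITS 64

#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: width of off_t in bits for this build. */
static int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}

/* Hypothetical helper: print the width, e.g. as a sanity check. */
static void report_off_t_bits(void)
{
    printf("off_t is %d bits\n", off_t_bits());
}
```

Removing the #define on a 32-bit glibc system would be expected to
give a 32-bit off_t instead.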

> Not an option for the 
> vector libs I assume, because these would need to be largely rewritten 
> when using lseek instead of fseek, read instead of fread and so on 
> (using file descriptor instead of stream pointer throughout). Probably a 
> nonsense idea anyway.

It's not worth using "raw" I/O just to avoid this issue. Apart from
anything else, there's a potentially huge performance hit, as the
vector library tends to use many small read/write operations. Using
low-level I/O requires a system call for each operation, while the
stdio interface will coalesce these, reading/writing whole blocks.
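
For what it's worth, staying with stdio doesn't prevent the use of
off_t: fseeko() and ftello() take and return off_t rather than long. A
minimal sketch, assuming a POSIX system; read_at() is a hypothetical
helper, not something in the vector library:

```c
#define _POSIX_C_SOURCE 200809L /* exposes fseeko() in strict -std= modes */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: seek to `offset` and read `len` bytes through
 * the stdio stream, so small reads can be served from its buffer.
 * Returns 0 on success, -1 on a failed seek or a short read. */
static int read_at(FILE *fp, off_t offset, void *buf, size_t len)
{
    if (fseeko(fp, offset, SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, len, fp) == len ? 0 : -1;
}
```

The stream pointer stays in use throughout, so nothing needs to be
rewritten around file descriptors.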

> > I think that the code which reads these files needs functions to
> > read/write off_t values at the size used by the file, not the size
> > used by the code.
> >
> > I.e. if the code is built for 64-bit off_t, it should still be able to
> > directly read/write files using a 32-bit off_t. Code built for 32-bit
> > off_t should also directly read/write files which use a 64-bit off_t,
> > subject to the constraint that only 31 bits are non-zero (if you have
> > a 32-bit off_t, attempting to open a file >=2GiB will fail, as will
> > attempting to enlarge a file beyond that size).
> >   
> The problem I see is that offset values are stored in topo and cidx 
> (e.g. the topo file knows that line i is in the coor file at offset o). 
> So if the topo file was written with 64-bit off_t but the current 
> compiled library uses 32-bit off_t, can this 32-bit library somehow get 
> these 64-bit offset values out of the topo file?

In the worst case, it can just perform 2 32-bit reads, and check that
the high word is zero and the low word is non-negative when treated as
a signed value (i.e. its top bit is clear).

If the topo file contains any offsets which exceed the 2GiB range,
then the coor file will be larger than 2GiB. If you aren't using
_FILE_OFFSET_BITS=64, open()ing the coor file will likely fail.
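
The two-read check mentioned above might look something like the
following. This is only a sketch: the function names are invented, and
it assumes the 64-bit offset is stored as 8 consecutive bytes in a byte
order which the caller already knows (passed in as a flag here).

```c
#include <stdint.h>
#include <stdio.h>

/* Read one 32-bit word in the given byte order. Returns 0 on
 * success, -1 on a short read. */
static int read_u32(FILE *fp, int little_endian, uint32_t *out)
{
    unsigned char b[4];

    if (fread(b, 1, 4, fp) != 4)
        return -1;
    if (little_endian)
        *out = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    else
        *out = (uint32_t)b[3] | ((uint32_t)b[2] << 8) |
               ((uint32_t)b[1] << 16) | ((uint32_t)b[0] << 24);
    return 0;
}

/* Read a 64-bit on-disk offset into a 32-bit signed value.
 * Returns 0 on success, -1 on a short read, -2 if the offset
 * doesn't fit in 31 bits. */
static int read_off64_as_32(FILE *fp, int little_endian, int32_t *out)
{
    uint32_t first, second, hi, lo;

    if (read_u32(fp, little_endian, &first) != 0 ||
        read_u32(fp, little_endian, &second) != 0)
        return -1;

    /* The low word comes first on disk in little-endian order. */
    if (little_endian) {
        lo = first;
        hi = second;
    } else {
        hi = first;
        lo = second;
    }

    if (hi != 0 || lo > (uint32_t)INT32_MAX)
        return -2;

    *out = (int32_t)lo;
    return 0;
}
```

A caller getting -2 knows that the file refers to data beyond 2GiB and
can report that rather than using a truncated offset.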

> Granted that these 
> values are in the 32-bit range. I have really no idea if this can be 
> done, my suggestion would be to rebuild topology if there is a mismatch 
> between off_t size used in the topo file and off_t size used by the 
> current library. The other way around may be less problematic, when you 
> have a 64-bit off_t library and a topo file with 32-bit offset values. 
> As long as you know what off_t size was used to write the topo and cidx 
> files. And now the mess starts, I'm afraid. The header of the topo file 
> would need to get modified so that it holds the off_t size used to write 
> this file. This information must be available before any attempt is made 
> to retrieve an offset value from the topo file. Then do some safety 
> checking if the offset values can be properly retrieved, if no, request 
> rebuilding topology...

As an alternative, you could just examine the header size, which
precedes the header. As the header contains offsets, the size will
vary with the offset size.

OTOH, this amounts to a format change, so you may as well just add a
new field to the header. Either way, the version number needs to be
increased.
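
If a new header field were added, the reader could decide up front how
to treat the stored offsets. A sketch of that decision, assuming a
hypothetical field which holds the offset size in bytes (4 or 8); the
names are invented and this doesn't describe the actual topo header
layout:

```c
/* How a library with a given off_t size can treat a file. */
enum off_compat {
    OFF_COMPAT_DIRECT,      /* every stored offset fits in the library's off_t */
    OFF_COMPAT_RANGE_CHECK, /* stored offsets are wider; check each one */
    OFF_COMPAT_BAD_HEADER   /* size field holds an unexpected value */
};

/* file_off_size: value of the hypothetical header field (bytes).
 * lib_off_size:  (int)sizeof(off_t) in the library doing the reading. */
static enum off_compat classify_off_size(int file_off_size, int lib_off_size)
{
    if (file_off_size != 4 && file_off_size != 8)
        return OFF_COMPAT_BAD_HEADER;
    if (file_off_size <= lib_off_size)
        return OFF_COMPAT_DIRECT;
    return OFF_COMPAT_RANGE_CHECK;
}
```

Only the range-check case needs the "high word is zero" test; a failed
check there would be the point at which to ask for the topology to be
rebuilt.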

-- 
Glynn Clements <glynn at gclements.plus.com>

