[GRASS-dev] vector large file support

Glynn Clements glynn at gclements.plus.com
Sun Feb 8 02:44:35 EST 2009


Markus Metz wrote:

> Do I understand correctly that fseeko and ftello are only needed on 32-bit 
> systems that want -D_FILE_OFFSET_BITS=64? ftell, for example, returns a 
> long, which is 64 bits on my 64-bit Linux; I guess that's why I can write 
> coor files larger than 2GB with the current vector libs.

Yes. There's no point in using them unless off_t is larger than long
(i.e. 64-bit off_t versus 32-bit long).

> > It's not worth using "raw" I/O just to avoid this issue. Apart from
> > anything else, there's a potentially huge performance hit, as the
> > vector library tends to use many small read/write operations. Using
> > low-level I/O requires a system call for each operation, while the
> > stdio interface will coalesce these, reading/writing whole blocks.
>
> Interesting and good to know. So we do need G_fseek() and G_ftell()

Yes. Those would be useful regardless of anything related to the
vector format.
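
A rough sketch of what such wrappers might look like follows. The
signatures, and the HAVE_FSEEKO macro (assumed to be set by the build
system when fseeko/ftello exist), are assumptions for illustration and
are not taken from the GRASS source. The point is that callers always
pass an off_t, and the wrapper decides whether the stdio call can
represent it.

```c
/* Request a 64-bit off_t where the platform supports it; this must
 * come before any system header is included. */
#define _FILE_OFFSET_BITS 64
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical wrapper: seek using an off_t offset. */
int G_fseek(FILE *fp, off_t offset, int whence)
{
    assert(fp != NULL);
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    /* fseek() takes a long; refuse offsets that would be truncated. */
    long loff = (long)offset;

    if ((off_t)loff != offset) {
        errno = EOVERFLOW;
        return -1;
    }
    return fseek(fp, loff, whence);
#endif
}

/* Hypothetical wrapper: report the current position as an off_t. */
off_t G_ftell(FILE *fp)
{
    assert(fp != NULL);
#ifdef HAVE_FSEEKO
    return ftello(fp);
#else
    return (off_t)ftell(fp);
#endif
}
```

Because both wrappers still go through stdio, the buffering that
coalesces many small reads and writes is kept.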

> >> The problem I see is that offset values are stored in topo and cidx 
> >> (e.g. the topo file knows that line i is in the coor file at offset o). 
> >> So if the topo file was written with 64-bit off_t but the current 
> >> compiled library uses 32-bit off_t, can this 32-bit library somehow get 
> >> these 64-bit offset values out of the topo file?
> >>     
> >
> > In the worst case, it can just perform 2 32-bit reads, and check that
> > the high word is zero and the low word is positive.
> 
> Uff. Some more safety checks in the code. From a coding perspective it's 
> easier just to request a topology rebuild. Annoying for the user though. 
> OTOH, that coor file size check is done before anything is read from the 
> coor file, the libs could say something like "Sorry, that vector is too 
> big for you. Please recompile GRASS with LFS" (more friendly phrasing 
> needed). Also potentially annoying.

Right. But if you have a >=2GiB coor file with a 32-bit off_t, the OS
will refuse to open() the coor file regardless of any checks GRASS
performs.

> But if the coor file size check is 
> passed (<= 2GB), the high word must be always zero, otherwise it would 
> refer to an offset beyond EOF. You could just use the low word value. 
> Would you have to swap high word and low word if the byte order of the 
> vector is different from the byte order of the current system?

Yes.

> Can 
> happen when e.g. a whole grass location is copied to another system. I 
> think not because the vector libs use their own fixed byte order. I 
> would really just request a topology rebuild to avoid all this hassle.

Bear in mind that a GRASS database may be on a networked file system,
and accessed by both 32- and 64-bit systems, and by both big- and
little-endian systems.

Also, the user shouldn't need write permission in order to read a map. 
Or, rather, don't assume that the user has write permission for a map
which they are reading.

> > If the topo file contains any offsets which exceed the 2GiB range,
> > then the coor file will be larger than 2GiB. If you aren't using
> > _FILE_OFFSET_BITS=64, open()ing the coor file will likely fail.
> 
> Opening the coor file is not even attempted with the current code in 
> this situation, because the coor file size stored in the topo header can 
> not be larger than 2GB and this size is used for a safety check before 
> opening the coor file. Actually, I don't know what would happen on a 
> 32-bit system. If new vector libs are compiled without LFS, does a 
> 32-bit system have a chance to find out that the coor file is too large? 
> To be precise, when calling stat(path, &stat_buf), what would be the 
> maximum possible value of stat_buf.st_size in 32-bit? Likely LONG_MAX.

Effectively; using stat() on a file >=2GiB results in:

       EOVERFLOW
              (stat()) path refers to a file whose size cannot be represented
              in the type off_t.  This can occur when an application compiled
              on a 32-bit platform without -D_FILE_OFFSET_BITS=64 calls stat()
              on a file whose size exceeds (1<<31)-1 bytes.

-- 
Glynn Clements <glynn at gclements.plus.com>

