[GRASS-dev] vector large file support

Mon Feb 9 06:06:34 EST 2009

Markus Metz wrote:

> I have 
> read "The issues" and understand the problem, but some sort of 
> implementation of G_fseek and G_ftell is needed, otherwise modules and 
> libraries need a workaround like the iostream library is doing now. 
> Instead of having many (potentially different) workarounds, one proper 
> solution is preferable. This may not be easy, and as much as I like 
> tackling difficult problems, here I can only say: Please do it!

I have added G_fseek() and G_ftell() to 7.0 in r35818.
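
For anyone following along, a wrapper of this kind might look roughly
like the sketch below. This is not the code committed in r35818: the
names lfs_seek() and lfs_tell() are hypothetical, and the sketch assumes
a POSIX system that provides fseeko() and ftello().

```c
/* Ask for a 64-bit off_t where the platform supports it. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical wrapper (not the GRASS API): seek to a 64-bit offset.
 * Returns 0 on success, -1 on failure. If off_t cannot represent the
 * requested offset, errno is set to EOVERFLOW. */
int lfs_seek(FILE *fp, int64_t offset, int whence)
{
    off_t off = (off_t)offset;

    assert(fp != NULL);

    /* Round-trip check: fails only when off_t is narrower than 64 bits. */
    if ((int64_t)off != offset) {
        errno = EOVERFLOW;
        return -1;
    }
    return fseeko(fp, off, whence);
}

/* Hypothetical wrapper (not the GRASS API): current position, or -1. */
int64_t lfs_tell(FILE *fp)
{
    assert(fp != NULL);
    return (int64_t)ftello(fp);
}
```

The point of keeping this in one place is that callers never need to
know whether the platform's off_t is 32 or 64 bits wide.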

> >> As you suggested, 2 32bit reads can be done, and 
> >> depending on the endian-ness of the host system either the high word 
> >> value or the low word value used.
> >>     
> >
> > The low word is always used. That might be the first word or the
> > second word, but it's always the low word.
> 
> I was confused by the endianness and mixed up low/high word with 
> first/second word. With the current code, the low word would be the 
> second word when doing two 32-bit reads on a 64-bit buffer, 
> regardless of any endianness mismatch. In this case, the libs would 
> have to check whether the high word is != 0 and, if so, exit with an 
> ERROR message, right?

Right. The files are always written big-endian, so the high word will
always be first in the file.

As well as checking that the high word is zero, you also need to check
that the low word is <= 0x7fffffff (off_t is signed, so a 32-bit off_t
tops out at 2GiB, not 4GiB).
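
The two checks can be sketched as follows. The function name
read_offset32() and the in-memory buffer interface are assumptions made
for illustration; this is not existing library code. It decodes an
8-byte big-endian offset and accepts it only if it fits in a signed
32-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit word from 4 bytes stored big-endian. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: decode a 64-bit big-endian offset from buf
 * for use on a system with a 32-bit signed off_t.
 * Returns 0 and stores the offset in *out on success,
 * -1 if the offset cannot be represented. */
int read_offset32(const unsigned char buf[8], int32_t *out)
{
    uint32_t high, low;

    assert(buf != NULL && out != NULL);

    high = be32(buf);     /* big-endian file: high word comes first */
    low = be32(buf + 4);  /* low word is the second word */

    if (high != 0)
        return -1;        /* offset is 4GiB or more */
    if (low > 0x7fffffffU)
        return -1;        /* would be negative as a signed 32-bit value */

    *out = (int32_t)low;
    return 0;
}
```

Because the words are assembled byte by byte, the result does not depend
on the byte order of the host.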

> >> When writing offsets, it would be easiest (also safest?) to always use 
> >> sizeof(off_t) of the libs. There will be no mix of different offset 
> >> sizes because topo and cidx are currently written anew whenever the 
> >> vector is updated.
> >>     
> >
> > It would be both easiest and safest. Although it would be preferable
> > to use 32 bits if that is known to be sufficient, I don't know whether
> > this is feasible.
> 
> I don't think so. With v.in.ogr, you have no chance to estimate the coor 
> file size. Coming back to my test shapefile for v.in.ogr with a total 
> size below 5MB, that thing results in a coor file > 8GB with cleaning 
> and > 4GB without cleaning. When working on a grass vector, each module 
> would have to estimate the increase of the coor file. Most modules copy 
> the input vector to the output vector, do the requested modifications on 
> the output vector and write out the output vector. You would have to do 
> some very educated guessing on the size of the final coor file, 
> considering the expected amount of dead lines and the expected amount of 
> additional vertices, to decide if a 32bit off_t would be sufficient. 
> Instead I would prefer to use 64 bits whenever possible. Personally, I 
> would regard 32bit support as a courtesy, but please don't start a 
> discussion about that.

The issue is whether the coor file size is known at the point that you
start writing the topo/cidx files. If the files are generated
concurrently, then choosing 32-bit offsets isn't feasible. If the coor
file is generated first, then it is.
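
Assuming the final coor file size is known before topo/cidx are written,
the choice could be sketched like this. Both function names are
hypothetical and nothing here reflects the actual topo/cidx writing
code; it only shows picking 4 or 8 bytes per offset and storing the
value big-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: bytes needed per stored offset for a coor file of the
 * given size. 4 bytes suffice only while every offset fits in a signed
 * 32-bit value. */
size_t offset_width(uint64_t coor_size)
{
    return coor_size <= 0x7fffffffULL ? 4 : 8;
}

/* Hypothetical: store offset in buf as a big-endian value of
 * width bytes (4 or 8), high byte first. */
void put_offset(unsigned char *buf, size_t width, uint64_t offset)
{
    size_t i;

    assert(buf != NULL);
    assert(width == 4 || width == 8);
    assert(width == 8 || offset <= 0x7fffffffULL);

    for (i = 0; i < width; i++)
        buf[i] = (unsigned char)(offset >> (8 * (width - 1 - i)));
}
```

If the coor size is not known up front, the safe fallback is to always
use the 8-byte width.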

-- 
Glynn Clements <glynn at gclements.plus.com>