[GRASS-dev] vector large file support
Glynn Clements
glynn at gclements.plus.com
Mon Feb 9 06:06:34 EST 2009
Markus Metz wrote:
> I have
> read "The issues" and understand the problem, but some sort of
> implementation of G_fseek and G_ftell is needed, otherwise modules and
> libraries need a workaround like the iostream library is doing now.
> Instead of having many (potentially different) workarounds, one proper
> solution is preferable. This may not be easy, and as much as I like
> tackling difficult problems, here I can only say: Please do it!
I have added G_fseek() and G_ftell() to 7.0 in r35818.
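For readers without the tree at hand, here is a minimal sketch of what such wrappers could look like. It is not the code from r35818: the names sketch_fseek() and sketch_ftell() are hypothetical, and HAVE_FSEEKO is assumed to be set by the build system when fseeko()/ftello() exist. On a 32-bit glibc host, off_t is only 64 bits wide if the build also defines _FILE_OFFSET_BITS=64.

```c
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical wrappers, not the code committed in r35818.
 * HAVE_FSEEKO is assumed to be defined by the build system when
 * fseeko()/ftello() are available. */

/* Seek using an off_t offset. Returns 0 on success, -1 on failure. */
int sketch_fseek(FILE *fp, off_t offset, int whence)
{
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    long loff = (long)offset;

    /* Without fseeko(), refuse offsets that do not survive the cast to long. */
    if ((off_t)loff != offset)
        return -1;
    return fseek(fp, loff, whence);
#endif
}

/* Report the current position as an off_t, or -1 on failure. */
off_t sketch_ftell(FILE *fp)
{
#ifdef HAVE_FSEEKO
    return ftello(fp);
#else
    return (off_t)ftell(fp);
#endif
}
```

The point of a single pair of wrappers is that callers only ever deal with off_t, and the decision between fseeko() and the checked fseek() fallback is made in one place.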
> >> As you suggested, two 32-bit reads can be done, and
> >> depending on the endianness of the host system either the high word
> >> value or the low word value is used.
> >>
> >
> > The low word is always used. That might be the first word or the
> > second word, but it's always the low word.
>
> I got confused by the endianness and mixed up low/high word with
> first/second word. With the current code, the low word would be the
> second word when doing two 32-bit reads on a 64-bit buffer,
> independent of any endianness mismatch. In this case, the libs would have
> to check if the high word is != 0 and then exit with an ERROR message,
> right?
Right. The files are always written big-endian, so the high word will
always be first in the file.
As well as checking that the high word is zero, you also need to check
that the low word is <= 0x7fffffff (off_t is signed, hence the limit
being 2GiB not 4GiB).
> >> When writing offsets, it would be easiest (also safest?) to always use
> >> sizeof(off_t) of the libs. There will be no mix of different offset
> >> sizes because topo and cidx are currently written anew whenever the
> >> vector is updated.
> >>
> >
> > It would be both easiest and safest. Although it would be preferable
> > to use 32 bits if that is known to be sufficient, I don't know whether
> > this is feasible.
>
> I don't think so. With v.in.ogr, you have no chance to estimate the coor
> file size. Coming back to my test shapefile for v.in.ogr with a total
> size below 5MB, that thing results in a coor file > 8GB with cleaning
> and > 4GB without cleaning. When working on a grass vector, each module
> would have to estimate the increase of the coor file. Most modules copy
> the input vector to the output vector, do the requested modifications on
> the output vector and write out the output vector. You would have to do
> some very educated guessing on the size of the final coor file,
> considering the expected amount of dead lines and the expected amount of
> additional vertices, to decide if a 32bit off_t would be sufficient.
> Instead I would prefer to use 64 bits whenever possible. Personally, I
> would regard 32bit support as a courtesy, but please don't start a
> discussion about that.
The issue is whether the coor file size is known at the point that you
start writing the topo/cidx files. If the files are generated
concurrently, then it isn't feasible. If the coor file is generated
first, then it is.
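If the coor file is complete before topo/cidx are written, the choice could be as simple as the sketch below. The function sketch_offset_size() is hypothetical and only shows the size test; it assumes the final coor size is passed in as a 64-bit value.

```c
#include <stdint.h>

/* Hypothetical helper: pick the number of bytes used to store offsets
 * into a coor file of the given, already known, size.
 * Returns 4 or 8, or -1 for an invalid (negative) size. */
int sketch_offset_size(int64_t coor_size)
{
    if (coor_size < 0)
        return -1;
    /* Every offset into the file fits in a signed 32-bit value as long
     * as the file itself does not exceed 0x7fffffff bytes. */
    return coor_size <= INT32_MAX ? 4 : 8;
}
```

With the sizes mentioned above, a coor file over 4GB would get 8-byte offsets, while a small vector would keep 4-byte offsets and remain readable by builds with a 32-bit off_t.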
--
Glynn Clements <glynn at gclements.plus.com>