[GRASS-dev] vector large file support
Markus Metz
markus.metz.giswork at googlemail.com
Sun Feb 8 14:45:41 EST 2009
Glynn Clements wrote:
> Markus Metz wrote:
>
>> I like the point of Ivan that off_t is the native type for file offsets.
>> Could G_fseek then use fseeko whenever fseeko is available (ditto for
>> ftello)?
>>
>
> Well, that's the general idea. The only advantage of fseek/ftell is
> that they are always available.
>
I think grass7 should get LFS support as far as possible, because today's
datasets can easily exceed the 2GB limit. Before modules can get LFS,
the underlying libraries must be enabled. According to the LFS wish list
in the wiki, these are the vector libs and the DB libs. For that,
the fseeko/ftello problem needs to be solved for 32bit systems. I have
read "The issues" and understand the problem, but some sort of
implementation of G_fseek and G_ftell is needed; otherwise modules and
libraries need a workaround like the one the iostream library uses now.
One proper solution is preferable to many (potentially different)
workarounds. This may not be easy, and as much as I like tackling
difficult problems, here I can only say: please do it!
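
A minimal sketch of what such a wrapper pair could look like, only to
make the request concrete. The names sketch_fseek/sketch_ftell stand in
for the proposed G_fseek/G_ftell, and HAVE_FSEEKO is assumed to be a
macro defined by configure when fseeko/ftello exist; neither is taken
from existing code. The fallback refuses offsets that do not fit in a
long instead of silently truncating them.

```c
#include <stdio.h>
#include <limits.h>
#include <errno.h>
#include <sys/types.h>   /* off_t */

/* Seek using fseeko when the build system says it is available
 * (HAVE_FSEEKO is an assumed configure macro), else fall back to fseek. */
int sketch_fseek(FILE *fp, off_t offset, int whence)
{
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    /* fseek takes a long: reject offsets that would be truncated */
    if (offset > LONG_MAX || offset < LONG_MIN) {
        errno = EOVERFLOW;
        return -1;
    }
    return fseek(fp, (long)offset, whence);
#endif
}

/* Report the current position as off_t, or -1 on error. */
off_t sketch_ftell(FILE *fp)
{
#ifdef HAVE_FSEEKO
    return ftello(fp);
#else
    return (off_t)ftell(fp);
#endif
}
```

With something like this in one place, modules and libraries would call
the wrapper and never need their own fseek/fseeko workaround.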
>
>>> Bear in mind that a GRASS database may be on a networked file system,
>>> and accessed by both 32- and 64-bit systems, and by both big- and
>>> little-endian systems.
>>>
>>> Also, the user shouldn't need write permission in order to read a map.
>>> Or, rather, don't assume that the user has write permission for a map
>>> which they are reading.
>>>
>> OK, the biggest problem is to support reading a vector written with
>> sizeof(off_t) == 8 when the libs use sizeof(off_t) == 4, without
>> rebuilding topology.
>>
>
> The biggest problem is when the compiler doesn't provide a 64-bit
> integral type (off_t doesn't necessarily have to be 64 bits).
>
There is a handy function called buf_alloc() in the vector libs that
allocates a temporary buffer of whatever size is needed for reading the
content of any of the vector files. You could then read this
temporary buffer in chunks of the size supported by the current vector
libs. The code is essentially there and would need only minor adjustment.
>
>> As you suggested, 2 32bit reads can be done, and
>> depending on the endian-ness of the host system either the high word
>> value or the low word value used.
>>
>
> The low word is always used. That might be the first word or the
> second word, but it's always the low word.
>
I got confused by the endianness and mixed up low/high word with
first/second word. With the current code, the low word would be the
second word when doing 2 32bit reads on a 64bit sized buffer,
independent of an endianness mismatch. In this case, the libs would have
to check whether the high word is != 0 and, if so, exit with an ERROR
message, right?
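
To illustrate the check I mean, here is a rough sketch, not the actual
library code. The function name decode_off8_to_32 and the
file_is_big_endian flag (standing for the byte order recorded with the
file) are made up for this example. It splits the 8 stored bytes into
two 32bit words, picks the low word according to the stored byte order,
and fails if the value does not fit into a signed 32bit offset.

```c
#include <stdint.h>

/* Decode an 8-byte offset stored in buf into a signed 32-bit value.
 * file_is_big_endian: non-zero if the file stores the most significant
 * byte first (hypothetical flag for this sketch).
 * Returns 0 on success, -1 if the offset does not fit in 32 bits. */
int decode_off8_to_32(const unsigned char *buf, int file_is_big_endian,
                      int32_t *out)
{
    uint32_t first = 0, second = 0, high, low;
    int i;

    for (i = 0; i < 4; i++) {
        if (file_is_big_endian) {
            first = (first << 8) | buf[i];
            second = (second << 8) | buf[4 + i];
        }
        else {
            first |= (uint32_t)buf[i] << (8 * i);
            second |= (uint32_t)buf[4 + i] << (8 * i);
        }
    }

    /* the low word is the second word for big-endian storage,
     * the first word for little-endian storage */
    high = file_is_big_endian ? first : second;
    low = file_is_big_endian ? second : first;

    /* a non-zero high word, or a low word above INT32_MAX, means the
     * offset cannot be represented with a 32-bit signed off_t */
    if (high != 0 || low > (uint32_t)INT32_MAX)
        return -1;

    *out = (int32_t)low;
    return 0;
}
```

A caller in the libs would turn the -1 into the ERROR message, since the
file is then simply too large for a build with sizeof(off_t) == 4.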
>> When writing offsets, it would be easiest (also safest?) to always use
>> sizeof(off_t) of the libs. There will be no mix of different offset
>> sizes because topo and cidx are currently written anew when the vector
>> was updated.
>>
>
> It would be both easiest and safest. Although it would be preferable
> to use 32 bits if that is known to be sufficient, I don't know whether
> this is feasible.
>
I don't think so. With v.in.ogr, you have no chance to estimate the coor
file size. Coming back to my test shapefile for v.in.ogr with a total
size below 5MB, it results in a coor file > 8GB with cleaning
and > 4GB without cleaning. When working on a GRASS vector, each module
would have to estimate the increase of the coor file. Most modules copy
the input vector to the output vector, do the requested modifications on
the output vector and write out the output vector. You would have to do
some very educated guessing on the size of the final coor file,
considering the expected number of dead lines and the expected number of
additional vertices, to decide whether a 32bit off_t would be sufficient.
Instead, I would prefer to use 64 bits whenever possible. Personally, I
would regard 32bit support as a courtesy, but please don't start a
discussion about that.