[GRASS-dev] vector large file support

Markus Metz markus.metz.giswork at googlemail.com
Sun Feb 8 14:45:41 EST 2009


Glynn Clements wrote:
> Markus Metz wrote:
>   
>> I like the point of Ivan that off_t is the native type for file offsets. 
>> Could G_fseek then use fseeko whenever fseeko is available (ditto for 
>> ftello)?
>>     
>
> Well, that's the general idea. The only advantage of fseek/ftell is
> that they are always available.
>   
I think grass7 should get LFS support as far as possible, since today's 
datasets can easily exceed the 2GB limit. Before modules can get LFS, 
the underlying libraries must be enabled. According to the LFS wish list 
in the wiki, these are the vector libs and the DB libs. For that, 
this fseeko/ftello problem needs to be solved for 32bit systems. I have 
read "The issues" and understand the problem, but some sort of 
implementation of G_fseek and G_ftell is needed; otherwise modules and 
libraries need a workaround like the one the iostream library uses now. 
Instead of having many (potentially different) workarounds, one proper 
solution is preferable. This may not be easy, and as much as I like 
tackling difficult problems, here I can only say: Please do it!
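
To make the request concrete, here is a minimal sketch of what such a
pair of wrappers could look like. It is not an existing GRASS
implementation: the signatures are assumptions, and HAVE_FSEEKO is
assumed to be a macro that configure defines when fseeko()/ftello() are
available. On a 32bit system, off_t is only 64 bits wide if the build
also enables it (for glibc, typically via _FILE_OFFSET_BITS=64).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>          /* off_t */

/* Sketch only: HAVE_FSEEKO is assumed to come from configure. */
int G_fseek(FILE *fp, off_t offset, int whence)
{
    assert(fp != NULL);
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    /* fseek() takes a long, so refuse offsets it cannot represent
     * instead of silently truncating them. */
    if (offset > (off_t)LONG_MAX || offset < (off_t)LONG_MIN) {
        errno = EOVERFLOW;
        return -1;
    }
    return fseek(fp, (long)offset, whence);
#endif
}

off_t G_ftell(FILE *fp)
{
    assert(fp != NULL);
#ifdef HAVE_FSEEKO
    return ftello(fp);
#else
    return (off_t)ftell(fp);
#endif
}
```

With wrappers like these, modules and libraries would call one function
everywhere, and the fallback to fseek/ftell would fail with an error
rather than wrap around on files it cannot address.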
>   
>>> Bear in mind that a GRASS database may be on a networked file system,
>>> and accessed by both 32- and 64-bit systems, and by both big- and
>>> little-endian systems.
>>>
>>> Also, the user shouldn't need write permission in order to read a map. 
>>> Or, rather, don't assume that the user has write permission for a map
>>> which they are reading.
>>>       
>> OK, the biggest problem is to support reading a vector written with 
>> sizeof(off_t) == 8 when the libs use sizeof(off_t) == 4, without 
>> rebuilding topology.
>>     
>
> The biggest problem is when the compiler doesn't provide a 64-bit
> integral type (off_t doesn't necessarily have to be 64 bits).
>   
There is a handy function called buf_alloc() in the vector libs that 
allocates a temporary buffer of whatever size is needed for reading the 
content of any of the vector files. You could then read this 
temporary buffer in chunks of the size supported by the current vector 
libs. The code is essentially there and would need only minor adjustment.
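
The idea can be sketched roughly as follows. This is not the code from
the vector libs: scratch_alloc() only imitates what buf_alloc() is
described as doing, and read_raw_offsets() with its parameters is a
hypothetical helper. The on-disk offset size (disk_size) is assumed to
be known from the file header.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Scratch buffer that grows on demand and is reused between reads. */
static unsigned char *scratch = NULL;
static size_t scratch_size = 0;

/* Make sure the scratch buffer holds at least `needed` bytes.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int scratch_alloc(size_t needed)
{
    unsigned char *p;

    if (needed <= scratch_size)
        return 0;
    p = realloc(scratch, needed);
    if (p == NULL)
        return -1;
    scratch = p;
    scratch_size = needed;
    return 0;
}

/* Read `count` offsets of `disk_size` bytes each, without decoding them.
 * Returns a pointer to the raw bytes, or NULL on error or short read.
 * The caller walks the result in steps of `disk_size`. */
const unsigned char *read_raw_offsets(FILE *fp, size_t count,
                                      size_t disk_size)
{
    assert(fp != NULL && count > 0 && disk_size > 0);

    if (count > (size_t)-1 / disk_size)   /* count * disk_size overflows */
        return NULL;
    if (scratch_alloc(count * disk_size) != 0)
        return NULL;
    if (fread(scratch, disk_size, count, fp) != count)
        return NULL;
    return scratch;
}
```

The raw bytes are independent of sizeof(off_t) in the libs, so the
decoding step can pick out whatever part of each chunk it supports.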
>   
>> As you suggested, 2 32bit reads can be done, and 
>> depending on the endian-ness of the host system either the high word 
>> value or the low word value used.
>>     
>
> The low word is always used. That might be the first word or the
> second word, but it's always the low word.
>   
I got confused by the endianness and mixed up low/high word with 
first/second word. With the current code, the low word would be the 
second word when doing two 32bit reads on a 64bit-sized buffer, 
independent of an endianness mismatch. In this case, the libs would have 
to check whether the high word is != 0 and, if so, exit with an ERROR 
message, right?
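
As a sketch of that check, assuming the byte order of the file is known
from its header and using no 64bit integer type at all (the function
names and the file_big_endian flag are hypothetical, not part of the
vector libs):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 32-bit word from 4 bytes stored in the given byte order. */
static uint32_t word32(const unsigned char *p, int big_endian)
{
    if (big_endian)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}

/* Decode an 8-byte on-disk offset for a library with a 32-bit off_t.
 * Returns 0 and stores the value in *out if it fits, -1 otherwise. */
int read_offset64_as32(const unsigned char *buf, int file_big_endian,
                       uint32_t *out)
{
    const unsigned char *hi, *lo;
    uint32_t low;

    assert(buf != NULL && out != NULL);

    /* Big-endian file: high word comes first, low word second.
     * Little-endian file: low word first, high word second. */
    hi = file_big_endian ? buf : buf + 4;
    lo = file_big_endian ? buf + 4 : buf;

    low = word32(lo, file_big_endian);

    /* A 32-bit off_t is signed, so the high word must be zero and the
     * low word must not exceed 2^31 - 1. */
    if (word32(hi, file_big_endian) != 0 || low > 0x7FFFFFFFu)
        return -1;

    *out = low;
    return 0;
}
```

A return value of -1 is the case where the libs would stop with an
ERROR message, because the offset points beyond what a 32bit off_t can
address.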
>> When writing offsets, it would be easiest (also safest?) to always use 
>> sizeof(off_t) of the libs. There will be no mix of different offset 
>> sizes because topo and cidx are currently written anew whenever the 
>> vector is updated.
>>     
>
> It would be both easiest and safest. Although it would be preferable
> to use 32 bits if that is known to be sufficient, I don't know whether
> this is feasible.
>   
I don't think so. With v.in.ogr, there is no way to estimate the coor 
file size. Coming back to my test shapefile for v.in.ogr with a total 
size below 5MB, it results in a coor file > 8GB with cleaning 
and > 4GB without cleaning. When working on a grass vector, each module 
would have to estimate the increase of the coor file. Most modules copy 
the input vector to the output vector, do the requested modifications on 
the output vector and write out the output vector. You would have to do 
some very educated guessing on the size of the final coor file, 
considering the expected number of dead lines and the expected number of 
additional vertices, to decide whether a 32bit off_t would be sufficient. 
Instead I would prefer to use 64 bits whenever possible. Personally, I 
would regard 32bit support as a courtesy, but please don't start a 
discussion about that.


