[GRASS5] Raster lib and CELL files > 2GB

Glynn Clements glynn.clements at virgin.net
Mon Aug 2 21:16:12 EDT 2004


Glynn Clements wrote:

> > for a remote sensing project we face the problem to arrive
> > at file sizes > 2GB. I see two options:
> > 
> > a) change the CELL file compression from RLE to DEFLATE 
> >   -> how to do that? I just want to change locally for me
> 
> That's far from simple. The existing low-level raster I/O
> implementation is a total mess; learning how it works may well take
> longer than re-writing it from scratch.

I have created some flowcharts (in .dia or EPS formats) for get_row.c
and put_row.c if anyone is interested. OTOH, I'm planning on cleaning
this code up, so the flowcharts will become outdated quite quickly.

> Also, note that the raster files begin with an array of offsets to the
> start of each row. These would have to be changed to use off_t
> (format.c, G.h, maybe other places).

Actually, the row pointers are only used for compressed maps. For
uncompressed maps, it just seeks to row * bytes_per_row. 

It probably wouldn't be that hard to support uncompressed maps larger
than 2GB. Mostly[1] it should just be a matter of changing long to
off_t in:

	read_data_uncompressed()
	G__read_null_bits()
	G__write_null_bits()
	put_data()
	seek_random()

then compiling with -D_FILE_OFFSET_BITS=64. [The last two are only
necessary to support G_open_cell_new_random().]
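
As a rough illustration of the kind of change meant here, this is a
minimal sketch of an off_t-based row seek for an uncompressed map. The
helper names row_offset() and seek_to_row() are hypothetical and are not
taken from the GRASS sources; the sketch assumes it is compiled with
-D_FILE_OFFSET_BITS=64 so that off_t is 64 bits on 32-bit platforms.

```c
/* Build with: cc -D_FILE_OFFSET_BITS=64 ... */
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: byte offset of a row in an uncompressed map. */
off_t row_offset(int row, int bytes_per_row)
{
    assert(row >= 0 && bytes_per_row > 0);
    /* Cast before multiplying so the product is computed as an off_t. */
    return (off_t) row * bytes_per_row;
}

/* Hypothetical helper: position fd at the start of a row.
   Returns 0 on success, -1 on failure. */
int seek_to_row(int fd, int row, int bytes_per_row)
{
    off_t offset = row_offset(row, bytes_per_row);

    return lseek(fd, offset, SEEK_SET) == offset ? 0 : -1;
}
```

With a 64-bit off_t, row 50000 of a map with 50000 bytes per row lands
at byte 2500000000, which a 32-bit long cannot represent.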

[1] The position of some of the type casts needs to change; e.g. code
such as:

   offset = (long) (size * R * sizeof(unsigned char)) ;

should actually be:

   offset = (long) size * R * sizeof(unsigned char) ;

In the first case, size and R are both ints, so size * R is computed as
an int (sizeof yields a size_t, but it is only applied to that
already-computed product). The cast to long happens at the end, by
which point the value has already been truncated to the size of an int.

In the second case, size is converted to a long first, so the
multiplications are all performed at the width of a long.

So, the existing code won't actually cope with files > 2GB even on
platforms where long is 64 bits, because the intermediate values are
calculated as ints.
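
The effect of the cast position can be reproduced in isolation. The
sketch below is only an illustration, not code from get_row.c: the
function names are made up, and it uses uint32_t and int64_t in place
of int and long so that the 32-bit wrap-around is well defined (signed
int overflow is undefined behaviour in C). It assumes a platform where
int is 32 bits.

```c
#include <stdint.h>

/* Cast applied after the multiplication: the product has already
   wrapped to 32 bits by the time it is widened. */
int64_t offset_cast_late(uint32_t size, uint32_t rows)
{
    return (int64_t) (size * rows);
}

/* Cast applied to the first operand: the multiplication itself is
   carried out in 64 bits. */
int64_t offset_cast_early(uint32_t size, uint32_t rows)
{
    return (int64_t) size * rows;
}
```

For size = 100000 and rows = 100000, the late cast yields 1410065408
(10^10 modulo 2^32), while the early cast yields the intended
10000000000.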

I've already changed this in the code which I've been working on,
which is currently all of get_row.c, although I've mostly left the null
handling alone. I haven't started on put_row.c yet.

The null handling is an even bigger mess than the rest of it. It's
also very inefficient.

Essentially, it reads blocks of 8 rows (NULL_ROWS_INMEM, from G.h) at a
time, into the NULL_ROWS array of the fileinfo structure. For each
block, it locates the null file using G_find_file(), opens it with
G_open_old() (which also locates it using G_find_file()), reads the
data, then closes the file.

This saves having to keep the descriptor open, but it's likely to have
a significant performance impact. OTOH, keeping the null descriptor
open could halve the maximum number of raster maps which could be open
at a time (assuming that the limiting factor is the OS's open-file
limit).
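
To give a feel for the cost of the per-block reopen, here is a small
model of the block cache described above. It is a sketch only: the
struct and function names are hypothetical, it assumes blocks are
aligned to multiples of 8 rows, and it merely counts reloads where the
real code would locate, open, read and close the null file.

```c
#define ROWS_PER_BLOCK 8   /* mirrors NULL_ROWS_INMEM as described above */

/* Hypothetical stand-in for the cached-block state kept per open map. */
struct null_cache {
    int first_row;   /* first row of the cached block, or -1 if empty */
    int reloads;     /* number of find/open/read/close cycles so far */
};

/* Make sure the block holding `row` is cached; reload it if it is not. */
void null_cache_fetch(struct null_cache *c, int row)
{
    int block_start = row - row % ROWS_PER_BLOCK;

    if (c->first_row != block_start) {
        /* The real code would find, open, read and close the null file here. */
        c->first_row = block_start;
        c->reloads++;
    }
}

/* Count the reloads needed to read rows 0..nrows-1 in order. */
int count_sequential_reloads(int nrows)
{
    struct null_cache c = { -1, 0 };
    int row;

    for (row = 0; row < nrows; row++)
        null_cache_fetch(&c, row);
    return c.reloads;
}
```

Under these assumptions, reading a 1000-row map from top to bottom
costs 125 open/close cycles on the null file, each with two
G_find_file() lookups.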

-- 
Glynn Clements <glynn.clements at virgin.net>



