[gdal-dev] Raster size and ReadAsArray()

Wed Aug 3 15:02:42 EDT 2011

On Wednesday 03 August 2011 20:28:35, Jay L. wrote:
> Evan,
> 
> To ensure that I understand, here is an example:
> 
> If I have a GTiff, where the block size is one row by all of the columns 
> (a single scanline), I should try to read in either one scanline at a
> time, or multiple entire scanlines. 

Yes

> It is inefficient to take say 10 rows
> and only half of the columns.

Yes. You will end up reading the data twice from disk if the file size is 
bigger than the block cache size you are using (the default is 40 MB and it is 
shared by all I/O, both reading and writing, across all datasets).
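
To illustrate the "whole scanlines" advice, here is a minimal Python sketch
that splits a raster into windows made of complete rows. The helper name
scanline_windows and its rows_hint parameter are made up for this example and
are not part of GDAL; the calls shown in the trailing comment (GetRasterBand(),
GetBlockSize(), ReadAsArray() with an offset and a window size) are the usual
Python bindings, and the file name is hypothetical.

```python
def scanline_windows(raster_xsize, raster_ysize, block_ysize, rows_hint):
    """Yield (xoff, yoff, xcount, ycount) windows made of whole scanlines.

    rows_hint is rounded down to a multiple of block_ysize (but never below
    one block), so every window starts on a block boundary.
    """
    rows = max(block_ysize, (rows_hint // block_ysize) * block_ysize)
    for yoff in range(0, raster_ysize, rows):
        # The last window may be shorter than the others.
        yield 0, yoff, raster_xsize, min(rows, raster_ysize - yoff)


# Hypothetical usage with the GDAL Python bindings:
#   from osgeo import gdal
#   ds = gdal.Open("input.tif")
#   band = ds.GetRasterBand(1)
#   block_xsize, block_ysize = band.GetBlockSize()
#   for xoff, yoff, xcount, ycount in scanline_windows(
#           ds.RasterXSize, ds.RasterYSize, block_ysize, 1000):
#       data = band.ReadAsArray(xoff, yoff, xcount, ycount)

# Window layout for the 12567 x 12764 raster discussed in this thread,
# assuming a block height of one row and about 1000 rows per read.
windows = list(scanline_windows(12567, 12764, 1, 1000))
print(len(windows), windows[-1])
```

Each window spans the full raster width, so every block is read exactly once
however small the block cache is.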

> 
> What if my application requires that I read one entire column by an
> arbitrary number of scanlines?  Essentially reading at a 90 degree angle to
> the block size.

That's the worst situation you can imagine. Essentially, for each column you 
read, you will end up reading the entire image. It might be reasonable to 
consider rewriting your algorithm to fit the file's data organization. Or, if 
you can't afford to do that, then try to read as many columns as you can at 
once, in order to minimize the number of passes in which the entire image is 
read again.
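
A rough way to size those passes, as a sketch only: pick how much memory you
can spend on the array returned by one read, derive how many full-height
columns fit in it, and count how many times the whole image will be traversed.
The function names, the 16 MiB budget and the one-byte pixel size below are
assumptions chosen for the example; the budget covers only the returned array,
not GDAL's own block cache.

```python
import math


def columns_per_pass(raster_xsize, raster_ysize, bytes_per_pixel, budget_bytes):
    """Number of full-height columns whose pixels fit in budget_bytes."""
    cols = budget_bytes // (raster_ysize * bytes_per_pixel)
    # Always read at least one column, never more than the raster has.
    return max(1, min(raster_xsize, cols))


def column_windows(raster_xsize, raster_ysize, cols):
    """Yield (xoff, yoff, xcount, ycount) windows of `cols` full columns."""
    for xoff in range(0, raster_xsize, cols):
        yield xoff, 0, min(cols, raster_xsize - xoff), raster_ysize


# 12567 x 12764 raster from this thread, assuming 1 byte per pixel
# and a 16 MiB budget for each array read.
cols = columns_per_pass(12567, 12764, 1, 16 * 1024 * 1024)
passes = math.ceil(12567 / cols)
print(cols, passes)
```

On a scanline-organized file each of these windows still touches every block,
so the number of windows is the number of times the image is read in full;
a larger budget means fewer passes.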

> Other than increasing the cache size and flushing the cache,
> are there other techniques to reduce thrashing (and therefore processing
> time)?

Not really, apart from the above suggestion. Or first translate your dataset 
to another one with a block size compatible with your algorithm, or at least 
make it square tiled (128x128 for example). That might be the easiest solution 
if you know you will process the data several times. If it is just once, then 
it won't give much benefit, of course...
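
For the re-tiling step, one possible approach is the gdal_translate utility
with the GTiff creation options TILED, BLOCKXSIZE and BLOCKYSIZE. The sketch
below only builds the command line in Python; the helper name and the file
names are placeholders, and actually running it requires the GDAL command-line
tools to be installed.

```python
def tiled_translate_command(src, dst, tile_size=128):
    """Build a gdal_translate command that rewrites src as a tiled GeoTIFF."""
    return [
        "gdal_translate",
        "-of", "GTiff",
        "-co", "TILED=YES",
        "-co", f"BLOCKXSIZE={tile_size}",
        "-co", f"BLOCKYSIZE={tile_size}",
        src,
        dst,
    ]


# Placeholder file names, for illustration only.
cmd = tiled_translate_command("scanline.tif", "tiled.tif")
print(" ".join(cmd))

# To actually run the conversion:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

With square tiles, a column read only touches the tiles that the column
crosses rather than every scanline strip in the file.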

Well, I have another idea, but it is definitely an involved one and I'm not 
sure at all of the result (performance-wise). If your GeoTIFF is not 
compressed, then you could hack into the GTiff driver (not the one I would 
recommend to a newcomer to hack into...) to report a block dimension 
compatible with your reading scheme and short-circuit libtiff to read just the 
bytes you need instead of the whole strip. However, chances are that the 
performance will not increase substantially because you will lose a lot of 
time in disk seeking. I repeat that it would only be doable for uncompressed 
GeoTIFFs.

> 
> J
> 
> On Wed, Aug 3, 2011 at 11:19 AM, Even Rouault
> <even.rouault at mines-paris.org> wrote:
> > On Wednesday 03 August 2011 17:32:53, Antonio Valentino wrote:
> > > Hi Jay,
> > > 
> > > On 03/08/2011 16:53, Jay L. wrote:
> > > > I have been working on this problem as well.  Initially, the attempt
> > > > was to ReadAsArray small chunks.  Unfortunately this is quite
> > > > inefficient. Someone more knowledgeable will know why, but I suspect
> > > > it has to do with either thrashing or the fact that full blocks are
> > > > not being read in (as is the case when a 5000x5000 pixel block is
> > > > read in on a 12567, 12764 GTiff).
> > > 
> > > Yes, using chunks that are too small can cause inefficiency, and yes,
> > > using chunks that are aligned to the I/O blocks (the exact block size
> > > or a multiple of it) is a good idea whenever it is possible.
> > 
> > Yes, I strongly concur with that. Reading 5000x5000 in a 12567x12764
> > raster is likely to be inefficient if the raster is scanline oriented,
> > that is to say if the dimension of a block reported by gdalinfo or
> > GetBlockSize() is 12567 columns x Y rows. In such a situation you should
> > try to read chunks of Y (or a multiple of Y) whole lines.
> > 
> > Another point to take into consideration is when you read a multiband
> > dataset. If the data in the dataset is pixel interleaved, then you should
> > try to read all the bands at a time with DatasetRasterIO() so that GDAL
> > avoids re-reading the same blocks from disk for each band. On the
> > contrary, if the data is band interleaved, reading band by band is OK
> > (you can use DatasetRasterIO() there too, because it will detect and
> > adapt itself to the data organization to select the best algorithm).
> > 
> > There are other possible caveats depending on the file format itself. For
> > example, if you read a JPEG, PNG or GIF image, you must know that you
> > cannot go back to earlier lines without causing decompression to be
> > restarted from the top line. But such formats are rarely used for images
> > that big. I vaguely remember that it is also the case for some
> > formulations of HDF4 ( http://trac.osgeo.org/gdal/ticket/3386 ).
> > 
> > You can check whether your way of reading is efficient by defining
> > CPL_DEBUG=ON and looking at the warnings. If you see something about
> > "Potential thrashing on band XXX of YYY", it is a hint that you didn't
> > employ the most efficient reading scheme.
> > 
> > > I don't know the internals of the Python binding implementation very
> > > well. Looking at the release notes, it seems that some important
> > > changes in this area were made in release 1.8.0
> > > 
> > > http://trac.osgeo.org/gdal/wiki/Release/1.8.0-News#SWIGLanguageBindings
> > 
> > Yes, there have been a few optimizations to avoid some useless temporary
> > buffer copies, and a few fixes as well. One of them allows reading more
> > than 2 GB in 64-bit builds of GDAL.
> > 
> > Regards,
> > 
> > Even