[gdal-dev] Raster size and ReadAsArray()

Wed Aug 3 14:28:35 EDT 2011

Evan,

To make sure I understand, here is an example:

If I have a GTiff where the block size is one row by all of the columns (a
single scanline), I should try to read in either one scanline at a time or
several entire scanlines.  It is inefficient to take, say, 10 rows and only
half of the columns.
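
A minimal sketch of reading whole scanlines in block-aligned chunks, assuming
the osgeo Python bindings.  scanline_windows and read_in_scanline_chunks are
hypothetical helper names, not GDAL functions, and the default of 256 blocks
per read is an arbitrary choice:

```python
def scanline_windows(raster_xsize, raster_ysize, block_ysize, blocks_per_read=1):
    """Yield (xoff, yoff, xsize, ysize) windows made of whole scanline blocks.

    Every window spans the full raster width and a multiple of block_ysize
    rows; only the last window may be shorter, at the bottom edge.
    """
    rows_per_read = block_ysize * blocks_per_read
    for yoff in range(0, raster_ysize, rows_per_read):
        ysize = min(rows_per_read, raster_ysize - yoff)
        yield (0, yoff, raster_xsize, ysize)


def read_in_scanline_chunks(path, blocks_per_read=256):
    """Yield numpy arrays for band 1, one block-aligned window at a time."""
    # Imported here so that scanline_windows() can be used without GDAL.
    from osgeo import gdal

    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    # For a scanline-oriented GTiff this is (RasterXSize, Y).
    block_xsize, block_ysize = band.GetBlockSize()
    for xoff, yoff, xsize, ysize in scanline_windows(
            ds.RasterXSize, ds.RasterYSize, block_ysize, blocks_per_read):
        yield band.ReadAsArray(xoff, yoff, xsize, ysize)
```

For a 12567x12764 raster with one-row blocks, scanline_windows(12567, 12764,
1, 5000) gives three full-width windows of 5000, 5000 and 2764 rows.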

What if my application requires that I read one entire column by an
arbitrary number of scanlines?  Essentially reading at a 90 degree angle to
the block orientation.  Other than increasing the cache size and flushing
the cache, are there other techniques to reduce thrashing (and therefore
processing time)?

J

On Wed, Aug 3, 2011 at 11:19 AM, Even Rouault
<even.rouault at mines-paris.org> wrote:

> On Wednesday 03 August 2011 17:32:53, Antonio Valentino wrote:
> > Hi Jay,
> >
> > On 03/08/2011 16:53, Jay L. wrote:
> > > I have been working on this problem as well.  Initially, the attempt
> > > was to ReadAsArray small chunks.  Unfortunately this is quite
> > > inefficient.  Someone more knowledgeable will know why, but I suspect
> > > it has to do with either thrashing or the fact that full blocks are
> > > not being read in (as is the case when a 5000x5000 pixel block is
> > > read in on a 12567x12764 GTiff).
> >
> > Yes, using chunks that are too small can cause inefficiency, and yes,
> > using chunks that are aligned to the I/O blocks (the exact block size
> > or a multiple of it) is a good idea whenever it is possible.
>
> Yes, I strongly concur with that. Reading 5000x5000 in a 12567x12764
> raster is likely to be inefficient if the raster is scanline oriented,
> that is to say if the dimension of a block reported by gdalinfo or
> GetBlockSize() is 12567 x Y rows. In such a situation you should try to
> read chunks of Y (or a multiple of Y) whole lines.
>
> Another point to take into consideration is when you read a multiband
> dataset. If the data in the dataset is pixel interleaved, then you should
> try to read all the bands at a time with DatasetRasterIO() so that GDAL
> avoids re-reading the same blocks from disk for each band. Conversely, if
> the data is band interleaved, reading band by band is OK (using
> DatasetRasterIO() too, because it will detect the data organization and
> adapt itself to select the best algorithm).
>
> There are other possible caveats depending on the file format itself. For
> example, if you read a JPEG, PNG or GIF image, you must know that you
> cannot go back to earlier lines without causing decompression to be
> restarted from the top line. But such formats are rarely used for images
> that big. I vaguely remember that it is also the case for some
> formulations of HDF4 ( http://trac.osgeo.org/gdal/ticket/3386 ).
>
> You can check whether your way of reading is efficient by defining
> CPL_DEBUG=ON and looking at the warnings. If you see something about
> "Potential thrashing on band XXX of YYY", it is a hint that you didn't
> employ the most efficient reading scheme.
>
> >
> > I don't know the internals of the Python binding implementation very
> > well. Looking at the release notes, it seems that some important changes
> > in this area were made in release 1.8.0
> >
> > http://trac.osgeo.org/gdal/wiki/Release/1.8.0-News#SWIGLanguageBindings
>
> Yes, there have been a few optimizations to save some useless temporary
> buffer copies, and a few fixes as well. One of them makes it possible to
> read more than 2 GB in 64-bit builds of GDAL.
>
> Regards,
>
> Even
>