[gdal-dev] Raster size and ReadAsArray()

Antonio Valentino antonio.valentino at tiscali.it
Wed Aug 3 11:32:53 EDT 2011


Hi Jay,

On 03/08/2011 16:53, Jay L. wrote:
> I have been working on this problem as well.  Initially, the attempt was to
> ReadAsArray small chunks.  Unfortunately this is quite inefficient.  Someone
> more knowledgeable will know why, but I suspect it has to do with either
> thrashing or the fact that full blocks are not being read in (as is the case
> when a 5000x5000 pixel block is read from a 12567x12764 GTiff).

Yes, using chunks that are too small can cause inefficiency, and yes,
using read windows that are aligned to the file's I/O blocks (exactly
one block, or an integer multiple of the block size) is a good idea
whenever possible.
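
For example, something like the following sketch reads a band one
native block at a time (the file name is just a placeholder):

    from osgeo import gdal

    ds = gdal.Open('huge_raster.tif')  # placeholder file name
    band = ds.GetRasterBand(1)

    # Native block size of the file (tiles, or whole scanlines).
    block_xsize, block_ysize = band.GetBlockSize()

    for yoff in range(0, band.YSize, block_ysize):
        ysize = min(block_ysize, band.YSize - yoff)
        for xoff in range(0, band.XSize, block_xsize):
            xsize = min(block_xsize, band.XSize - xoff)
            # Each read matches exactly one native block (except at
            # the right/bottom borders), so GDAL never has to
            # assemble partial blocks.
            data = band.ReadAsArray(xoff, yoff, xsize, ysize)
            # ... process data here ...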

> My intention this morning is to try to implement an if statement which
> checks total file size and breaks it down into manageable (500MB maybe)
> chunks.  Then using numpy slices, I can read, slice, mask, and manipulate as
> needed.

This is the point I would like to stress.

Depending on your hardware and on the size of the memory chunks you
use, performing even a trivial 2D boxcar filtering can cause a lot of
cache misses and become a very slow operation.

Please note that using numpy slicing does not fix the issue.

The same operation performed on the same total amount of data, split
into blocks of reasonable size, can be significantly faster.

Nowadays even a medium PC has a lot of memory, fast data buses and
large processor caches, so the issue I'm talking about is not always
triggered.

But data files are also getting larger, so the bottleneck pops up again.

The problem is memory locality in CPU-bound operations.
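
To make the idea concrete, here is a rough sketch of block-wise boxcar
filtering with a halo around each block.  It assumes scipy is
available (uniform_filter stands in for the boxcar), the file names
are placeholders, and the block size of 2048 is just a guess to be
tuned for your hardware:

    import numpy as np
    from osgeo import gdal
    from scipy import ndimage

    KERNEL = 5           # boxcar width
    HALO = KERNEL // 2   # extra pixels needed on each side of a block
    BLOCK = 2048         # just a guess, tune for your cache/RAM

    ds = gdal.Open('huge_raster.tif')  # placeholder file name
    band = ds.GetRasterBand(1)
    out = gdal.GetDriverByName('GTiff').Create(
        'filtered.tif', band.XSize, band.YSize, 1, gdal.GDT_Float32)
    out_band = out.GetRasterBand(1)

    for yoff in range(0, band.YSize, BLOCK):
        for xoff in range(0, band.XSize, BLOCK):
            # Enlarge the read window by the halo, clipped to the
            # raster, so the filter is correct at the block edges.
            x0 = max(xoff - HALO, 0)
            y0 = max(yoff - HALO, 0)
            x1 = min(xoff + BLOCK + HALO, band.XSize)
            y1 = min(yoff + BLOCK + HALO, band.YSize)
            data = band.ReadAsArray(x0, y0, x1 - x0, y1 - y0)
            filtered = ndimage.uniform_filter(data.astype(np.float32),
                                              KERNEL)
            # Strip the halo again before writing the block back.
            h = min(BLOCK, band.YSize - yoff)
            w = min(BLOCK, band.XSize - xoff)
            out_band.WriteArray(
                filtered[yoff - y0:yoff - y0 + h,
                         xoff - x0:xoff - x0 + w],
                xoff, yoff)

Each block plus its halo stays small enough to fit in cache, which is
exactly the locality the whole-array version loses.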

> Again, the biggest issue I have had here is performance.
> Profiling shows that _gdal.IORaster is utilizing a ton of CPU time.
> 
> Hope that helps,
> Jay
> 

I don't know the internals of the Python bindings very well.
Looking at the release notes, it seems that some important changes in
this area were made in release 1.8.0:

http://trac.osgeo.org/gdal/wiki/Release/1.8.0-News#SWIGLanguageBindings


regards

> 
> On Wed, Aug 3, 2011 at 7:39 AM, Antonio Valentino <
> antonio.valentino at tiscali.it> wrote:
> 
>> Hi Alexander,
>>
>> On 03/08/2011 15:35, Alexander Bruy wrote:
>>> Hi,
>>>
>>> There is a well-known "problem": reading really large rasters or bands
>>> into memory with the DataSource.ReadAsArray() method is impossible due to
>>> memory limitations. For example, when I try to read one band of
>>> size 53109x29049 I get an error:
>>
>> [CUT]
>>
>>> I want to know whether it is possible to get the maximum raster size that
>>> can be handled by ReadAsArray() without errors, because I want to implement
>>> a fallback algorithm for large rasters in my tool.
>>
>> In my experience, using chunks of memory that are too large can,
>> paradoxically, cause slowdowns.
>> My suggestion is to define a reasonable maximum size for arrays in your
>> application/library and switch to the "fallback algorithm for large
>> rasters" every time that MAX_SIZE is exceeded, even if ReadAsArray
>> still works.
>>
>>> Currently I am trying to implement it with a try-except statement, but
>>> maybe there is a more elegant solution?
>>>
>>>
>>> Thanks
>>
>> IMHO using try-except *is* elegant and perfectly in line with Python's
>> philosophy.
>>
>> best regards
>>
>> --
>> Antonio Valentino
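
For completeness, a minimal sketch of the MAX_SIZE idea discussed
above; the threshold value and the read_in_chunks fallback are
placeholders, not part of GDAL's API:

    from osgeo import gdal

    MAX_PIXELS = 50 * 1024 * 1024   # arbitrary threshold, tune for your HW

    def read_band(band):
        if band.XSize * band.YSize > MAX_PIXELS:
            return read_in_chunks(band)   # hypothetical fallback, not shown
        try:
            return band.ReadAsArray()
        except MemoryError:
            # numpy raises MemoryError when the allocation fails;
            # other failure modes may raise different errors.
            return read_in_chunks(band)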


-- 
Antonio Valentino

