[postgis-devel] [raster] Memory management and IO concerns

Thu Jun 23 15:20:56 PDT 2011

Hey Bryce,

On 06/23/2011 02:20 PM, Bryce L Nordgren wrote:
> A nasty thought was nagging at me the entire time I was writing raster code.
> It mostly has to do with "how many copies of this raster am I holding in
> memory at the same time". This arises from the basic flow:
>
> 1] Get a (rt_pgraster*) from the PG_FUNCTION_ARGS
> 2] Get a rt_raster by deserializing the (rt_pgraster*)
> 3] Get a GDALDatasetH by opening the rt_raster using the "mem" driver.
>
> So I poked around and #1 is a hollow shell with just a few metadata fields.
> #2 is "all or nothing": either you get just the header metadata or you load
> all the data from all the bands into memory.  I hope (but don't know) that
> #3 simply manipulates the data/metadata in the rt_raster/rt_band buffer
> (e.g., hope that it doesn't make ANOTHER COPY.)
>

Assuming you're using rt_raster_to_gdal_mem, the band data in the
GDALDatasetH just points to the address of the corresponding band data
in the rt_raster, so no extra copy of the pixel data is made.
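
To illustrate what "just points to" means in practice, here is a minimal
sketch in Python. It is not the rt_raster_to_gdal_mem code; the names
band_data and dataset_view are hypothetical stand-ins for the rt_raster
band buffer and the GDAL handle. GDAL's MEM driver documents a
DATAPOINTER option for wrapping an existing buffer, but whether
rt_raster_to_gdal_mem uses exactly that is assumed here, not confirmed
by this thread.

```python
# Hypothetical stand-in for the band data owned by an rt_raster
# (a 2x2, 8-bit band stored row by row).
band_data = bytearray([10, 20, 30, 40])

# Hypothetical stand-in for the dataset handle: a view onto the same
# buffer. Creating it does not copy any pixel data.
dataset_view = memoryview(band_data)

# A write through the view lands in the original buffer.
dataset_view[0] = 99

# The view reports the original object as its backing store.
shares_buffer = dataset_view.obj is band_data

# By contrast, bytes(...) makes an independent copy, which later
# writes to the original buffer do not affect.
copied = bytes(band_data)
band_data[1] = 77
```

With a shared buffer, holding the raster and the dataset handle at the
same time costs one copy of the pixels, not two.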

I've been thinking about expanding rt_raster_deserialize to allow more
than the "all or nothing" approach by taking a selective list of the
bands to deserialize, but it isn't at the top of my todo list right now.
Adding it wouldn't take more than an hour or two of work, though.
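
A rough sketch of what a selective deserialize could look like, written
in Python for brevity. Everything here is an assumption for
illustration: the byte layout is invented and is NOT the real
rt_pgraster serialized format, and serialize/deserialize and the
"wanted" parameter are hypothetical names, not the rt_raster_deserialize
signature.

```python
import struct

# Invented layout for this sketch only (not the PostGIS raster format):
# a header of three little-endian uint16 values (nbands, width, height),
# followed by nbands blocks of width*height one-byte pixels.
HEADER = struct.Struct("<HHH")


def serialize(width, height, bands):
    """Pack a list of per-band pixel lists into one byte string."""
    blob = bytearray(HEADER.pack(len(bands), width, height))
    for pixels in bands:
        if len(pixels) != width * height:
            raise ValueError("band size does not match width*height")
        blob += bytes(pixels)
    return bytes(blob)


def deserialize(blob, wanted=None):
    """Read the header, then copy out only the requested bands.

    wanted=None loads every band ("all"), wanted=[] loads the header
    only ("nothing"), and a list of band indexes loads just those.
    """
    nbands, width, height = HEADER.unpack_from(blob, 0)
    band_size = width * height
    if wanted is None:
        wanted = range(nbands)
    view = memoryview(blob)  # slicing a memoryview does not copy
    bands = {}
    for index in wanted:
        if not 0 <= index < nbands:
            raise IndexError("no such band: %d" % index)
        start = HEADER.size + index * band_size
        bands[index] = bytes(view[start:start + band_size])
    return {"nbands": nbands, "width": width, "height": height,
            "bands": bands}


blob = serialize(2, 2, [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
only_second = deserialize(blob, wanted=[1])
header_only = deserialize(blob, wanted=[])
```

The point of the design is that the header is always read, while pixel
data is copied only for the bands the caller names.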

> I was worried because I have three rasters active for the duration of my
> call (two inputs and one output), two of which have open GDALDatasetH
> handles, and I could foresee running out of memory pretty quickly. However,
> this may have serious performance implications for anything written in
> SQL...for IO reasons instead of memory reasons.  If you take the core loop
> of the mapalgebra code as typical (and it should be, for anything that loops
> over all the cells), you have:
>
>>          FOR x IN 1..newwidth LOOP
>>              FOR y IN 1..newheight LOOP
>>                  r1 := ST_Value(rast1, band1, x - rast1offsetx, y - rast1offsety);
>>                  r2 := ST_Value(rast2, band2, x - rast2offsetx1, y - rast2offsety1);
>>                  ---- insert code here
>>
>>                  newrast = ST_SetValue(newrast, 1, x, y, newval);
>>              END LOOP;
>>          END LOOP;
>>
> Each call to ST_Value equals "loading all data from all bands into memory".
> Each call to ST_SetValue equals "loading all data from all bands into
> memory, then saving everything back out to the postgres backend". That's a
> LOT of I/O to read two pixels and set one. Also, if you assume a square
> raster (or any fixed aspect ratio), this is an O(N^2) situation, where N is
> the length of one of the dimensions. When things start getting slow, they'll
> screech to a halt. Worse: the PostGIS backend will act as a bottleneck, so
> two separate processes operating on different rasters will likely be waiting
> on the same disk for their IO.
>
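
A back-of-envelope sketch of the scaling described above, in Python. It
takes the quoted premise at face value (every ST_Value or ST_SetValue
call touches the whole raster once) and assumes a square, single-band,
one-byte-per-pixel raster; the function name and its parameters are
hypothetical. Under those assumptions the number of calls grows as N^2,
and the bytes touched grow as N^4.

```python
def bytes_touched(n, bytes_per_pixel=1, bands=1, reads=2, writes=1):
    """Estimate bytes touched by the per-pixel loop on an n x n raster.

    Assumes each of the (reads + writes) calls per pixel makes one pass
    over the full raster, as the quoted description supposes.
    """
    raster_bytes = n * n * bytes_per_pixel * bands
    calls = n * n * (reads + writes)
    return calls * raster_bytes


# Doubling the side length multiplies the estimate by 2**4 = 16.
for side in (100, 200, 400):
    print(side, bytes_touched(side))
```

Whether those bytes mean disk IO or only memory traffic depends on where
the serialized raster already lives when the function is called.
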
> It would seem that adding the ability to "deserialize" a partial raster has
> the potential to vastly improve performance from SQL (by fixing an IO
> issue), and to a lesser extent, C (by fixing a memory issue). It would also
> allow larger rasters to be manipulated/operated on.
>
> Is it possible to implement a server-side GDAL driver for rasters stored in
> PostGIS? GDAL seems to have the ability to load/store blocks of image data
> (perhaps even a "block cache", if I saw correctly?) What is the current
> thinking on this issue?
>

I can't be certain, but the rt_pgraster object received in the backend
is already in memory from PostgreSQL, so there isn't much memory to be
saved.  GDAL's ability to work with blocks of data is a benefit when
you're working against raster files on the filesystem that aren't
loaded into memory and can be accessed as a stream.

-bborie