[postgis-devel] [raster] Memory management and IO concerns

Thu Jun 23 15:54:46 PDT 2011

Paul, does this mean that the serialized raster stored in the database 
isn't loaded into PostgreSQL's memory until you call PG_DETOAST_DATUM?

So in the typical usage:

pgraster = (rt_pgraster *) PG_DETOAST_DATUM(PG_GETARG_DATUM(0));

The argument at index 0 isn't loaded into PostgreSQL's memory until you 
call PG_DETOAST_DATUM?

-bborie

On 06/23/2011 03:35 PM, Paul Ramsey wrote:
> Once you DETOAST_DATUM, it's loaded in. However you do have the option
> of just doing DETOAST_DATUM_SLICE. Obviously, this begins to
> complicate things :) however where performance is paramount, it's
> worth doing. The index code detoasts just the first slice of a
> gserialized that contains the bounding box, for example.
>
> P.
>
> On Thu, Jun 23, 2011 at 3:20 PM, dustymugs<dustymugs at gmail.com>  wrote:
>> Hey Bryce,
>>
>> On 06/23/2011 02:20 PM, Bryce L Nordgren wrote:
>>>
>>> A nasty thought was nagging at me the entire time I was writing raster
>>> code.
>>> It mostly has to do with "how many copies of this raster am I holding in
>>> memory at the same time". This arises from the basic flow:
>>>
>>> 1] Get a (rt_pgraster*) from the PG_FUNCTION_ARGS
>>> 2] Get a rt_raster by deserializing the (rt_pgraster*)
>>> 3] Get a GDALDatasetH by opening the rt_raster using the "mem" driver.
>>>
>>> So I poked around and #1 is a hollow shell with just a few metadata
>>> fields.
>>> #2 is "all or nothing": either you get just the header metadata or you
>>> load
>>> all the data from all the bands into memory.  I hope (but don't know) that
>>> #3 simply manipulates the data/metadata in the rt_raster/rt_band buffer
>>> (e.g., hope that it doesn't make ANOTHER COPY.)
>>>
>>
>> Assuming you're using rt_raster_to_gdal_mem, the band data in GDALDatasetH
>> just points to the appropriate address of the band data in rt_raster.
>>
>> I've been thinking about expanding rt_raster_deserialize to allow more than
>> the "all or nothing" approach with a selective list of the bands to
>> deserialize, but isn't at the top of my todo right now.  Adding it wouldn't
>> make more than an hour or two of work though.
>>
>>> I was worried because I have three rasters active for the duration of my
>>> call (two inputs and one output), two of which have open GDALDatasetH
>>> handles, and I could foresee running out of memory pretty quickly.
>>> However,
>>> this may have serious performance implications for anything written in
>>> SQL...for IO reasons instead of memory reasons.  If you take the core loop
>>> of the mapalgebra code as typical (and it should be, for anything that
>>> loops
>>> over all the cells), you have:
>>>
>>>          FOR x IN 1..newwidth LOOP
>>>>
>>>>              FOR y IN 1..newheight LOOP
>>>>                  r1 := ST_Value(rast1, band1, x - rast1offsetx, y -
>>>> rast1offsety);
>>>>                  r2 := ST_Value(rast2, band2, x - rast2offsetx1, y -
>>>> rast2offsety1);
>>>>                  ---- insert code here
>>>>
>>>>                  newrast = ST_SetValue(newrast, 1, x, y, newval);
>>>>              END LOOP;
>>>>          END LOOP;
>>>>
>>> Each call to ST_Value equals "loading all data from all bands into
>>> memory".
>>> Each call to ST_SetValue equals "loading all data from all bands into
>>> memory, then saving everything back out to the postgres backend". That's a
>>> LOT of I/O to read two pixels and set one. Also, if you assume a square
>>> raster (or any fixed aspect ratio), this is an O(N^2) situation, where N
>>> is
>>> the length of one of the dimensions. When things start getting slow,
>>> they'll
>>> screech to a halt. Worse: the PostGIS backend will act as a bottleneck, so
>>> two separate processes operating on different rasters will likely be
>>> waiting
>>> on the same disk for their IO.
>>>
>>> It would seem that adding the ability to "deserialize" a partial raster
>>> has
>>> the potential to vastly improve performance from SQL (by fixing an IO
>>> issue), and to a lesser extent, C (by fixing a memory issue). It would
>>> also
>>> allow larger rasters to be manipulated/operated on.
>>>
>>> Is it possible to implement a server-side GDAL driver for rasters stored
>>> in
>>> PostGIS? GDAL seems to have the ability to load/store blocks of image data
>>> (perhaps even a "block cache", if I saw correctly?) What is the current
>>> thinking on this issue?
>>>
>>
>> I can't be certain but the rt_pgraster object received in the backend is
>> already in memory from PostgreSQL, so there isn't much memory savings
>> possible.  The ability of GDAL for interacting with blocks of data is of
>> benefit when you're working against file rasters on the filesystem that
>> aren't loaded into memory and can be accessed as a stream.
>>
>> -bborie
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>