[postgis-devel] [raster] Memory management and IO concerns

Thu Jun 23 14:20:48 PDT 2011

A nasty thought was nagging at me the entire time I was writing raster code.
It mostly has to do with "how many copies of this raster am I holding in
memory at the same time". This arises from the basic flow:

1] Get an (rt_pgraster*) from the PG_FUNCTION_ARGS
2] Get an rt_raster by deserializing the (rt_pgraster*)
3] Get a GDALDatasetH by opening the rt_raster using the "mem" driver.

So I poked around and #1 is a hollow shell with just a few metadata fields.
#2 is "all or nothing": either you get just the header metadata or you load
all the data from all the bands into memory.  I hope (but don't know) that
#3 simply manipulates the data/metadata in the rt_raster/rt_band buffer
(i.e., that it doesn't make ANOTHER COPY).
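
To make the "another copy" worry concrete, here is a minimal sketch of the difference between wrapping an existing buffer and copying it. The bytearray is a toy stand-in for a band's pixel data, not the actual rt_band layout, and nothing here says which of the two the GDAL path really does.

```python
# Toy stand-in for a band's pixel buffer (hypothetical; not the rt_band layout).
band = bytearray(4 * 4)          # 4x4 raster, one byte per cell

view = memoryview(band)          # wraps the same memory: no second copy
copy = bytes(band)               # allocates and fills a second buffer

view[5] = 42                     # write through the view...
shared = band[5] == 42           # ...and it lands in the original buffer
isolated = copy[5] == 0          # the copy never sees the write

print(shared, isolated)
```

If step 3 behaves like the view, the open dataset costs almost nothing extra; if it behaves like the copy, every open handle doubles the footprint of that raster.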

I was worried because I have three rasters active for the duration of my
call (two inputs and one output), two of which have open GDALDatasetH
handles, and I could foresee running out of memory pretty quickly. However,
the all-or-nothing deserialization in #2 may also have serious performance
implications for anything written in SQL, for I/O reasons rather than memory
reasons.  If you take the core loop of the mapalgebra code as typical (and
it should be, for anything that loops over all the cells), you have:

>         FOR x IN 1..newwidth LOOP
>             FOR y IN 1..newheight LOOP
>                 r1 := ST_Value(rast1, band1, x - rast1offsetx, y - rast1offsety);
>                 r2 := ST_Value(rast2, band2, x - rast2offsetx1, y - rast2offsety1);
>                 ---- insert code here
>
>                 newrast = ST_SetValue(newrast, 1, x, y, newval);
>             END LOOP;
>         END LOOP;

Each call to ST_Value equals "loading all data from all bands into memory".
Each call to ST_SetValue equals "loading all data from all bands into
memory, then saving everything back out to the postgres backend". That's a
LOT of I/O to read two pixels and set one. Also, if you assume a square
raster (or any fixed aspect ratio), the loop makes O(N^2) calls, where N is
the length of one of the dimensions, and each call loads O(N^2) cells, so the
total I/O grows as roughly O(N^4). When things start getting slow, they'll
screech to a halt. Worse: the PostGIS backend will act as a bottleneck, so
two separate processes operating on different rasters will likely be waiting
on the same disk for their IO.
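
A rough back-of-the-envelope model of the loop above, as a sketch only. It assumes all three rasters are N x N with the same band count and cell size, that every ST_Value/ST_SetValue call deserializes one whole raster, and it ignores the write-back done by ST_SetValue. The function names are made up for illustration.

```python
def raster_bytes(n, bands=1, bytes_per_cell=1):
    """Size of the pixel data for an n x n raster."""
    return n * n * bands * bytes_per_cell


def per_pixel_loop_bytes(n, bands=1, bytes_per_cell=1):
    """Bytes deserialized by the loop if every call loads a whole raster."""
    calls_per_cell = 3  # two ST_Value reads plus one ST_SetValue
    return n * n * calls_per_cell * raster_bytes(n, bands, bytes_per_cell)


for n in (100, 200, 400):
    print(n, raster_bytes(n), per_pixel_loop_bytes(n))
```

Under these assumptions a 100 x 100 single-byte raster is 10,000 bytes, but the loop deserializes 300,000,000 bytes, and doubling N multiplies that by 16.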

It would seem that adding the ability to "deserialize" a partial raster has
the potential to vastly improve performance from SQL (by fixing an IO
issue), and to a lesser extent, C (by fixing a memory issue). It would also
allow larger rasters to be manipulated/operated on.
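
As a sketch of what a partial "deserialize" could look like, the following reads only a rectangular window out of a band stored row by row. It assumes a plain row-major layout with 0-based offsets and no header in front of the pixel data; the real serialized format is not described here, and read_window is a hypothetical helper.

```python
import io


def read_window(stream, width, x0, y0, win_w, win_h, bytes_per_cell=1):
    """Read a win_w x win_h window from a row-major band, one row at a time."""
    rows = []
    row_bytes = win_w * bytes_per_cell
    for y in range(y0, y0 + win_h):
        # Jump straight to the first requested cell of this row.
        stream.seek((y * width + x0) * bytes_per_cell)
        rows.append(stream.read(row_bytes))
    return rows


# Toy 8x8 single-band raster, one byte per cell, cell value = y * 8 + x.
band = io.BytesIO(bytes(range(64)))
window = read_window(band, width=8, x0=2, y0=3, win_w=3, win_h=2)
print([list(r) for r in window])  # [[26, 27, 28], [34, 35, 36]]
```

Only 6 of the 64 bytes are read. A single-pixel lookup is the win_w = win_h = 1 case, which is the situation the ST_Value loop is in.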

Is it possible to implement a server-side GDAL driver for rasters stored in
PostGIS? GDAL seems to have the ability to load/store blocks of image data
(perhaps even a "block cache", if I saw correctly?) What is the current
thinking on this issue?
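
To illustrate what I mean by a block cache, here is a minimal LRU sketch. It is not GDAL's implementation; BlockCache, fake_load and the 16 x 16 block size are all made up for the example.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny LRU cache keyed by (block_x, block_y)."""

    def __init__(self, load_block, capacity=4):
        self.load_block = load_block      # hypothetical loader callback
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.loads = 0                    # how often storage was actually hit

    def get(self, bx, by):
        key = (bx, by)
        if key in self.blocks:
            self.blocks.move_to_end(key)      # mark as most recently used
            return self.blocks[key]
        self.loads += 1
        block = self.load_block(bx, by)
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used
        return block


BLOCK = 16


def fake_load(bx, by):
    # Stand-in for fetching one block of pixels from storage.
    return bytes(BLOCK * BLOCK)


cache = BlockCache(fake_load)
# Visit every cell of a 32x32 raster; only 4 distinct blocks exist.
for y in range(32):
    for x in range(32):
        cache.get(x // BLOCK, y // BLOCK)
print(cache.loads)
```

Here 1024 cell lookups turn into 4 block loads, which is the kind of behaviour a per-pixel SQL loop would need.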

Bryce