[postgis-devel] [raster] Memory management and IO concerns

Wed Jun 29 13:01:07 PDT 2011

On Wed, Jun 29, 2011 at 5:45 PM, Pierre Racine
<Pierre.Racine at sbf.ulaval.ca>wrote:

> > There needs to be a way to go from the raster's coordinates to the (tile
> ID, tile coordinates)
>
> SELECT rid, ST_World2RasterCoordX(ras, ST_MakePoint(x, y)),
> ST_World2RasterCoordX(rast, ST_MakePoint(x, y))
> FROM myrastcoverage WHERE ST_Intersects(raster, ST_MakePoint(x, y))
>
> > and back;
>
> SELECT ST_Raster2WorldCoordX(ras, x, y), ST_Raster2WorldCoordXY (rast,x, y)
> FROM myrastcoverage WHERE rid = requestedtileid
>
>
These are O(N) operations, where N is the number of rows. If we have many
rows of tiny tiles (which appears to be the paradigm), we get slow. If the
extents have been indexed, maybe you get O(log N). Tile computations on
perfectly rectangular rasters (used in the "image" sense are O(1)
(constant), fairly trivial, and almost always a part of a per-pixel
operation. They always take the same negligible amount of time no matter how
big the image is.

> You can easily add a column specifying which tile (or row) is part of which
> image. You can never guarantee nobody will not brake your nice image.
>

You've hit on the point I was trying to make. A single row (or a single
raster) is guaranteed to be rectangular, has no gap, no overlap and it is
always guaranteed that interlopers will fail to break my nice image. The
standard way to provide efficient I/O and memory management for these
perfect rasters is to provide the tiling services that people from a raster
background have come to expect.

Think of it this way: adding "normal raster" tiling services to individual
rows increases performance for anything requiring data access. The size of
the rasters in individual rows can be larger, there can be fewer rows, and
all of the O(N) raster coverage queries complete quicker. Furthermore, after
the correct row has been located (faster), the raster has additional
internal structure which can be further exploited to accelerate access to
the exact pixel required. Deciding to ignore this internal structure in
favor of loading the whole tile/raster is probably the only reason
rows/tiles need to be small.

Providing these services is simply an acknowledgement that each row of data
may contain more information than we want to load into memory at once. For
that matter, the row may contain more information than we would typically
work with inside a single function call. Rasters simply have enough internal
structure to quickly locate data which are "in the neighborhood" we are
interested in.

I'm pretty sure we can all have our cake. I'm in a cake-eatin' mood.

>
> > To put it another way, the SQL version of MapAlgebra is currently O(N^4).
> (N is
> > the raster length along one side). Being able to tile individual rows
> could knock
> > that down to O(N^2). Real tiling is probably the single biggest
> performance
> > enhancement postgis raster could have.
>
> There is real tiling. It's just not implemented the traditional way. Please
> explain me this O(N^4). For me it is always O(N^2).
>

 From
http://postgis.refractions.net/pipermail/postgis-devel/2011-June/013878.html

If you take the core loop
of the mapalgebra code as typical (and it should be, for anything that loops
over all the cells), you have:

        FOR x IN 1..newwidth LOOP
>*             FOR y IN 1..newheight LOOP*>*                 r1 := ST_Value(rast1, band1, x - rast1offsetx, y -*>* rast1offsety);*>*                 r2 := ST_Value(rast2, band2, x - rast2offsetx1, y -*>* rast2offsety1);*>*                 ---- insert code here*>**>*                 newrast = ST_SetValue(newrast, 1, x, y, newval);*>*             END LOOP;*>*         END LOOP;*>**Each call to ST_Value equals "loading all data from all bands into memory".
Each call to ST_SetValue equals "loading all data from all bands into
memory, then saving everything back out to the postgres backend". That's a
LOT of I/O to read two pixels and set one. Also, if you assume a square
raster (or any fixed aspect ratio), this is an O(N^2) situation, where N is
the length of one of the dimensions.

The fact that these four O(N^2) operations are inside an O(N^2) loop
makes this O(N^4) by my count.

Using the "all-or-nothing" tile-access paradigm, the output raster (the
external loop) and the rasters loaded/saved by the pixel accessors are
likely to be all about the same size; hence O(N^4). If, however, each of the
pixel accessors loaded/saved a tiny fraction of the rasters they operate on,
the accessors become ~O(1) instead of O(N^2). In essense, the accessors
depend only on the (constant) size of a small I/O block instead of the size
of the overall raster.

Bryce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20110629/f15be7a6/attachment.html>