[postgis-devel] [raster] Memory management and IO concerns

Wed Jun 29 14:01:20 PDT 2011

> These are O(N) operations, where N is the number of rows. If we have many
> rows of tiny tiles (which appears to be the paradigm), we get slow. If the extents
> have been indexed, maybe you get O(log N). Tile computations on perfectly
> rectangular rasters (used in the "image" sense are O(1) (constant), fairly trivial,
> and almost always a part of a per-pixel operation. They always take the same
> negligible amount of time no matter how big the image is.

I have someone here who is extracting elevation from a 10 GB raster coverage for 500 000 lidar points. He loaded the 10GB as 10x10 tiles and he get the result in 340 seconds. ArcGIS 10 does the same in 111 second. We did not yet try to optimize his query. I think we can do better. Apparently the fact that ST_Value is also O(log N) does not lead to catastrophic result comparing with a classical tiled image architecture.

Keep in mind that PostGIS raster goal is not to brake performance records. Our goal is to provide nice raster/vector analysis functionalities (being able to do it in SQL, interacting natively with PostGIS geometries) at reasonable performance. Up to now, for all the test I have seen, the performance is VERY acceptable and for sure we are the first to provide interesting raster/vector analysis capacity in a relational database.

>> 	You can easily add a column specifying which tile (or row) is part of
>> which image. You can never guarantee nobody will not brake your nice image.
> 
> You've hit on the point I was trying to make. A single row (or a single raster) is
> guaranteed to be rectangular, has no gap, no overlap and it is always
> guaranteed that interlopers will fail to break my nice image. The standard way
> to provide efficient I/O and memory management for these perfect rasters is to
> provide the tiling services that people from a raster background have come to
> expect.

At the price of not being able to support other tile arrangement.

> Think of it this way: adding "normal raster" tiling services to individual rows
> increases performance for anything requiring data access. The size of the rasters
> in individual rows can be larger, there can be fewer rows, and all of the O(N)
> raster coverage queries complete quicker. Furthermore, after the correct row
> has been located (faster), the raster has additional internal structure which can
> be further exploited to accelerate access to the exact pixel required. Deciding to
> ignore this internal structure in favor of loading the whole tile/raster is probably
> the only reason rows/tiles need to be small.

It is true that we could have optimized everything (including the case where you want to store big images in one row) by tiling the band data inside rt_api.c. But still you could only store 1 GB images per row. Out goal is to work with 1 TB coverages... We decided to be efficient for that particular (more useful) case instead by relying on the spatial index for fast retrival of tiles. We could go further and provide fast subtile retrieving for big tiles but I'm not sure it is necessary to go that far.

> Providing these services is simply an acknowledgement that each row of data
> may contain more information than we want to load into memory at once. For
> that matter, the row may contain more information than we would typically
> work with inside a single function call. Rasters simply have enough internal
> structure to quickly locate data which are "in the neighborhood" we are
> interested in.

I would put this as a future refactoring/optimization of the rt_api internals. But you will first have to justify why storing small tiles does not fulfill your application needs.

> 	There is real tiling. It's just not implemented the traditional way. Please
> explain me this O(N^4). For me it is always O(N^2).
> 
> 
> 
>  From http://postgis.refractions.net/pipermail/postgis-devel/2011-
> June/013878.html
> 
> If you take the core loop
> of the mapalgebra code as typical (and it should be, for anything that loops over
> all the cells), you have:
> 
>         FOR x IN 1..newwidth LOOP
> >             FOR y IN 1..newheight LOOP
> >                 r1 := ST_Value(rast1, band1, x - rast1offsetx, y -
> > rast1offsety);
> >                 r2 := ST_Value(rast2, band2, x - rast2offsetx1, y -
> > rast2offsety1);
> >                 ---- insert code here
> >
> >                 newrast = ST_SetValue(newrast, 1, x, y, newval);
> >             END LOOP;
> >         END LOOP;
> >
> Each call to ST_Value equals "loading all data from all bands into memory".
> Each call to ST_SetValue equals "loading all data from all bands into memory,
> then saving everything back out to the postgres backend". That's a LOT of I/O to
> read two pixels and set one. Also, if you assume a square raster (or any fixed
> aspect ratio), this is an O(N^2) situation, where N is the length of one of the
> dimensions.
> 
> The fact that these four O(N^2) operations are inside an O(N^2) loop makes this
> O(N^4) by my count.
> Using the "all-or-nothing" tile-access paradigm, the output raster (the external
> loop) and the rasters loaded/saved by the pixel accessors are likely to be all
> about the same size; hence O(N^4). If, however, each of the pixel accessors
> loaded/saved a tiny fraction of the rasters they operate on, the accessors
> become ~O(1) instead of O(N^2). In essense, the accessors depend only on the
> (constant) size of a small I/O block instead of the size of the overall raster.

Hum no... Both ST_SetValue(newrast, 1, x, y, newval) are O(1) operations... Since every raster has its own georeference and you can easily derive the memory location of the requested pixel value in O(1). Take also into account the fact that those two rasters are already loaded.

I think a better count would be O(log(T))*O(N^2) where T is the number of indexed rows and N the width/height of one tile, which in the end means O(N^2). Tell me if I'm wrong...

Pierre