[postgis-devel] [raster] Memory management and IO concerns

Wed Jun 29 10:45:18 PDT 2011

> I agree. Tiling api's are typically an abstraction layer which do not necessarily
> depend on explicit tiling support from the backing store. 

The only abstraction we provide is functions working on whole coverages when necessary (see the stats functions) or aggregates. You have to think of PostGIS raster as a tiled raster coverage similar to a tiled vector coverage. (In a vector coverage every geometry is essentially a tile of the coverage; it is just not rectangular.) Just as you have to be aware of the arrangement of your geometries (topological or not) in a vector coverage before beginning any manipulation, you have to be aware of your raster tile arrangement in a raster coverage before beginning any manipulation. This may prevent the typical raster abstraction (which we partly provide through functions working on whole coverages or aggregates), but it allows much more flexibility in the arrangement of rasters (one raster per row, one tiled raster per table, overlaps, gaps) and allows easy back-and-forth conversion between a raster coverage and a vector coverage. It also avoids an Oracle-style two-type schema (GEORASTER and RASTER), even though you can still implement one yourself if the notion of image is important in your coverage.

All this is somewhat explained in http://trac.osgeo.org/postgis/wiki/WKTRaster/Documentation01

People coming from a raster background often have problems dealing with this because they are used to working with images. I also know that people used to working in software environments other than databases (ones allowing complex data structures) have problems understanding data access in a relational context (getting the tile id, for example). Here we try to work with coverages of information in a relational database context. I know it requires a little mind shift. I have often had this kind of discussion about PostGIS raster. You will also notice that most raster/vector packages do not offer great seamless raster/vector operations because the raster side is (IMHO) unable to express space the same way as the vector side. Here we can do so without much trouble, and this is (IMHO) the key to easy raster/vector analysis capabilities.

> Tiles are almost never explicitly managed by the end user, and are usually transparent. 

As I said, for this we provide, when necessary, functions working on a coverage (table) or aggregates.
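
As a rough sketch of what "working on a coverage through an aggregate" can look like, the query below merges only the tiles that touch an area of interest. It assumes the raster column is named rast, that a raster ST_Union aggregate is available in your build, and that xmin, ymin, xmax, ymax and srid are placeholders you supply.

```sql
-- Sketch only: merge the tiles intersecting an area of interest.
-- myrastcoverage and rast are assumed names; the envelope values are placeholders.
SELECT ST_Union(rast) AS merged
FROM myrastcoverage
WHERE ST_Intersects(rast, ST_MakeEnvelope(xmin, ymin, xmax, ymax, srid));
```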

> There needs to be a way to go from the raster's coordinates to the (tile ID, tile coordinates) 

SELECT rid, ST_World2RasterCoordX(rast, ST_MakePoint(x, y)), ST_World2RasterCoordY(rast, ST_MakePoint(x, y))
FROM myrastcoverage WHERE ST_Intersects(rast, ST_MakePoint(x, y))

> and back;

SELECT ST_Raster2WorldCoordX(rast, x, y), ST_Raster2WorldCoordY(rast, x, y)
FROM myrastcoverage WHERE rid = requestedtileid

> retrieve and store pixels based on the raster (not tile) coordinates; 

SELECT ST_Value(rast, ST_MakePoint(x, y))
FROM myrastcoverage
WHERE ST_Intersects(rast, ST_MakePoint(x, y))
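
The query above covers retrieval. For the "store" half, a minimal sketch could look like the following. It assumes the raster column is named rast, that the point is expressed in the coverage's SRID, and that newvalue is a placeholder; ST_SetValue writes to band 1 when no band is given.

```sql
-- Sketch only: write a new pixel value at world coordinates (x, y).
-- x, y and newvalue are placeholders; rast is the assumed raster column name.
UPDATE myrastcoverage
SET rast = ST_SetValue(rast, ST_MakePoint(x, y), newvalue)
WHERE ST_Intersects(rast, ST_MakePoint(x, y));
```

Note that in a coverage with overlapping tiles this updates every tile containing the point, which is consistent with the "know your tile arrangement" caveat.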

> a tile cache should exist, 

PostgreSQL caches everything.

> should not require explicit management, etc. 

Again, we should provide functions that facilitate management when necessary.

> Rows are not tiles because they are not related in any particular way. 

They are related in that they are part of the same table. And, as for a vector table in a well-designed GIS database, that should imply that they represent the same theme (temperature, elevation, etc.) in the same unit, SRID, pixel type, etc. You can also build something messy (it is very possible), as you can in a vector coverage. But don't expect everything to work smoothly.

> Rows may even store data from different images! Tiles need to offer a guarantee that they are tiles from a specific image. Rows don't do that.

You can easily add a column specifying which tile (or row) is part of which image. You can never guarantee that nobody will break your nice image.
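
For illustration only, such a column could be added as follows. The column name image_id, the value 1 and the rid range are all hypothetical.

```sql
-- Sketch only: tag each tile with the image it came from.
-- image_id and the rid range are hypothetical, for illustration.
ALTER TABLE myrastcoverage ADD COLUMN image_id integer;

UPDATE myrastcoverage SET image_id = 1 WHERE rid BETWEEN 1 AND 100;

-- Retrieve only the tiles belonging to one image.
SELECT rid, rast FROM myrastcoverage WHERE image_id = 1;
```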

> Using rows in place of tiles is somewhat analogous to dividing one big image into many little images with
> random filenames (unsorted rows), and dumping them all in a directory (table).

Right, this is the right analogy.

> In any case, even with a little 64Mb row ("tile") there needs to be a way to load
> only part of the tile.

If you load everything in the same row there are no tiles. Period.

> I'm not sure it makes sense to load (and save) all 64Mb just to set one band's Nodata value, for instance.

In this case you have to set the nodata value for each tile... This may seem silly, but it leads to other interesting advantages.
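
In practice "for each tile" is a single statement over the table. A minimal sketch, assuming the raster column is named rast and using band 1 and -9999 purely as example values:

```sql
-- Sketch only: set the nodata value of band 1 on every tile of the coverage.
-- The band number and -9999 are example values.
UPDATE myrastcoverage
SET rast = ST_SetBandNoDataValue(rast, 1, -9999);
```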

> To put it another way, the SQL version of MapAlgebra is currently O(N^4). (N is
> the raster length along one side). Being able to tile individual rows could knock
> that down to O(N^2). Real tiling is probably the single biggest performance
> enhancement postgis raster could have.

There is real tiling. It's just not implemented the traditional way. Please explain this O(N^4) to me. For me it is always O(N^2).

In brief, to achieve easy vector/raster conversion, seamless raster/vector operations (needed to implement many raster/vector analysis operations) and good vector integration, some design choices were made that differ slightly from traditional image packages. There is one less level of abstraction, which provides more flexibility over raster arrangement. This is why I think this design is superior to Oracle Georaster.

We are working in a relational database context, which makes the usual raster architecture a poor fit. Some examples of this are:

-in a strict raster/tile architecture people cannot delete/add tiles outside of the rectangular image area. In a relational database you can delete/add tiles anywhere you want, eventually breaking the nice rectangle. This is one reason why we need one georeference per tile, and this is why you cannot really rely on the raster_columns table (which could be seen as the actual image or Oracle GEORASTER type).

-in a strict raster/tile architecture there is a strict relation between images and tiles. In a relational database you can't guarantee the order of the tiles, nor that what is added to the table fits with what is already there. You must construct your API so it works with this.

To understand PostGIS raster architecture you have to stop thinking in terms of images and think in terms of tiled raster coverages. The perfect raster coverage is rectangular and has no gaps and no overlaps. A more realistic one has gaps or missing tiles, and a vector coverage converted to a raster coverage must support overlaps. This is what we are optimized for (even if we can deal with other arrangements). You can drop in a bunch of rasters, tile them into smaller tiles and you're ready to work. This is very practical for working with raster coverages available on the web, like SRTM. You don't have to make sure you have a nice rectangular image. Sometimes it might be impossible, or a nightmare, to try to create such an image if the raster coverage is too large (2TB). With PostGIS raster you just append tiles to the table (the -a option of raster2pgsql.py) and that's it.

The big question you have to answer is "How is it possible to convert a vector coverage to a raster coverage without losing information?" (You might lose precision but not information.) This is the first question you have to answer before being able to do things like ST_MapAlgebra(raster, geometry, expression). My answer is: convert the geometry to a raster, and then you must be able to convert a whole geometry table to raster, and then you have to support overlaps, and then you need one georeference per tile and then...
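
A hedged sketch of the first step in that chain, assuming a PostGIS release that provides ST_AsRaster and a hypothetical geometry table mygeomtable(gid, geom): each geometry becomes its own small raster, aligned to a reference tile, with its own georeference. Overlapping geometries simply produce overlapping rasters.

```sql
-- Sketch only: convert every geometry of a table to its own raster,
-- aligned on the grid of a reference tile taken from the coverage.
-- mygeomtable, gid and geom are hypothetical names; '32BF', 1 and 0 are
-- example values for the pixel type, burned value and nodata value.
SELECT g.gid,
       ST_AsRaster(g.geom, r.rast, '32BF', 1, 0) AS rast
FROM mygeomtable g,
     (SELECT rast FROM myrastcoverage LIMIT 1) r;
```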

People not happy with this architecture are very welcome to develop their own GEORASTER-like solutions, with all their limitations. I think they will face many problems and end up with something not very useful (like only storing rasters in the database, as with Rasterlite and most of Oracle Georaster, even though it provides many raster/vector analysis tools).

Pierre