[postgis-users] selected aligned tiles in pgraster between several large raster tables.

Wed Jun 12 05:23:56 PDT 2013

Hello everyone,

Short version:

I have some identically specified/tiled rasters that differ only in terms of the data values in the pixels and table name*. 

I want to find a quick, easy and 100% trustworthy test of tile position equivalence between the tiles of several raster tables, so that I can match up the pixels in corresponding tiles for mathematical operations. 

Is there a quick and easy way to do this?

* in theory. 

Painfully long version: 

I have some big rasters, around 18 to 72 giga-pixels each. The rasters have a single band containing 8-bit values. They are specified identically in terms of tiling, georeferencing, skew, scale, pixel size etc. The only difference is in the contents of the pixels. 

I've been processing these with GDAL and Python I'm getting very nice speeds, e.g.  <60 minutes to perform calculations using several rasters together.

I've imported my rasters (A,B,C,D) into PG raster using 100x100 tiles, and I'm trying some analysis in postgis. This is deliberate; I could  do the analysis in gdalcalc very easily, but I want to make it possible for database users to do this kind of work. 

I hit a stumbling block when it comes to carrying out aligned calculations between pixels. I'm basically carrying out an operation over every pixel in D and comparing it with the corresponding pixels in A, B, and C. 

Logically, what I want to say is:

- for each 'rast' tile in raster table D
  - for each pixel of that tile (x=1..100, y=1..100)
    - get the corresponding pixel value from the corresponding tile in rasters A,B and C and do some work

The problem arises around the word 'corresponding' and the tradeoff between accuracy, speed and query elegance/debuggability. 

To my mind, if the rasters are specified identically and the import routines are stable (e.g. always producing the same metadata for the same raster tiles), then it should be possible to pick out the identical tiles between rasters.  

Possible approaches: 

1. Bounding box intersection.  If you try bounding box intersection to identify corresponding tiles, it's suitably fast:

>    where D.rast && A.rast and D.rast && B.rast and D.rast && C.rast

Unfortunately, you find that the bounding boxes of the neighbouring tiles in all directions (8-neighbours) are also matched for && intersection.

The consquence is that you get 256 rows where you want to have 1. I've checked this is what is happening by looking at the returned rids. I expect this is happening because of the BBox of a tile is slightly larger than the tile, by definition. 

2. Make a multi-band raster.  One solution would be to merge the rasters outside postgis into a multi-band aligned raster and import that (or do the same inside postgis raster using e.g. addband). 

I don't want to do that because a) the data is big, copies are bad  and  b) the data is logically distinct, so I want to store and address it using logically distinct raster tables. 

3.  Address raster pixels indirectly using geography.  I could iterate across the rasters using geographical points on the landscape that I've calculated to correspond to the center of pixels on the rasters. However, for this problem, please assume that I want to index the polygon tiles/pixels *directly* between corresponding tiles, rather than *indirectly*. I suspect it would be rather slow to do things this way, too. 

4. Identify exactly identical bounding boxes. Checking for equality between st_envelopes of tiles works: 

>    where st_envelope(D.rast) = st_envelope(C.rast) and st_envelope(D.rast) = st_envelope(B.rast) and st_envelope(D.rast) = st_envelope(A.rast) 

but it is very slow - many many times slower than the && method. Also, I expect the st_envelope generation algorithm is stable and will always make exactly the same output for a given tile across each raster, but I can't be 100% certain without taking a close look at the source code of postgis. 

5. I can use matching rids (since the raster tiles get numbered identically in practice from identically specified source tiles), 

> where D.rid=A.rid and D.rid=B.rid and D.rid=C.rid

This is ideal - super fast, quick to write - but it seems there might be a theoretical risk that the rids might not be numbered in the order I expect when imported through raster2pgsql. How real is that risk?

6. I could compare ST_metadata equality between tiles. This also has the advantage that if the tiles aren't set up identically, it will fail horribly. It seems there is a lot of data being compared (skew etc) which I know for certain matches between the tiles. 

> where st_metadata(D.rast) = st_metadata(C.rast) and st_metadata(D.rast) = st_metadata(B.rast) and st_metadata(D.rast) = st_metadata(A.rast)

This is very slow. 

7. Knowing the tiles are aligned and equivalent, I could compare only the distinct components of the tile metadata, e.g. the topleft pixels (ST_Raster2WorldCoordX, ST_Raster2WorldCoordY).

> st_raster2worldcoordx(D.rast,1) = st_raster2worldcoordx(C.rast,1) and st_raster2worldcoordx(D.rast,1) = st_raster2worldcoordx(B.rast,1) and st_raster2worldcoordx(D.rast,1) = st_raster2worldcoordx(A.rast,1) and
st_raster2worldcoordy(D.rast,1) = st_raster2worldcoordy(C.rast,1) and st_raster2worldcoordy(D.rast,1) = st_raster2worldcoordy(B.rast,1) and st_raster2worldcoordy(D.rast,1) = st_raster2worldcoordy(A.rast,1)

This is fast and close to ideal, but it involves quite a bit of text compared to the cute && syntax. I don't mind typing, but it's harder to spot errors.

8. I could precalculate a key column over my raster tiles that allows me identify identically positioned tiles quickly - e.g.  the same purpose as 'rid', but without the worry that the sequence could be out of order somewhere because of unknown gremlins in the serial generation process of postgres/raster2pgsql. 

9. So.. is there another way of detecting tile equivalence that is:

- 'short' e.g. visibly concise, like   A.rast && D.rast
- computationally fast
- robust, i.e. not likely to generate 1 missing tile among millions.

I suppose what I am getting towards is:

>  where D.rast ~~ A.rast and D.rast ~~ B.rast and D.rast ~~ C.rast

where ~~ is some computationally fast and trustworthy operator to pick out tiles in matching positions in equivalent rasters.

Graeme.