[postgis-users] selected aligned tiles in pgraster between several large raster tables.

Graeme B. Bell grb at skogoglandskap.no
Mon Jun 17 07:22:18 PDT 2013


Hello bborie (and everyone else). 

To recap. I'm trying to do some work with 4 gigantic raster tables which have been set up in the same way (e.g. georeferenced/tiled identically). I want to compare equivalent tiles in each raster with one another. Each raster has 7 million rows now but I'll be using 28 million row versions shortly. 

I wanted to quickly and reliably match up corresponding raster tiles between the raster tables, but I wasn't confident that the geometry description for each rid key would always be *exactly* alike between different imports of rasters via raster2pgsql. 

bborie proposed: 

> A.rast::geometry = B.rast::geometry will check for spatially equal
> tiles. This check will be faster on out-db vs in-db as there's less data
> for postgres to load regardless of your SQL.
> 
> If you're asking for a canned SQL query...
> 
> SELECT
> 	SOMEWORK
> FROM A
> JOIN B
> 	ON A.rast::geometry = B.rast::geometry
> 
> -bborie

which is an elegant approach, but unfortunately does not seem practical. The problem is that the geometry has to be generated by the cast and it's not an indexed column. This makes it a very slow process to build the join. 

I tested, and regardless of the type of join I tried, the cost of the join grows as the square of the number of rows. 

n=100: 0.3s
n=1000: 26s
n=2000: 105s

At that rate, n=28 million is going to be a big problem. A possible workaround might be to precalculate and explicitly store some geometry for every raster tile, index it, and then do the comparison. That would be a bit slow and space-consuming, but is viable.

Preferably it would be nicer to join on the RID column, which is a simple int and is already indexed. But how can I be confident that the rid columns and geometries will line up between the raster rows of each table?


B0. Common sense - "they ought to", in principle; there is no reason why they shouldn't other than algorithm instability. 


B1. Pick out some rows and check visually in QGIS how the numbering works, and that the geometry appears to match.

> select a.rid, a.rast::geometry into temp.test2 from temp.a as a order by rid desc limit 1000;


B2. Check that the highest rid value matches the number of rows in the table (e.g. 'there is no room for gaps in the rid sequences')

> select rid from temp.b order by rid desc limit 10;
> select rid from temp.b order by rid asc limit 10;
...

B3. Select out the start and the end of the raster tables and ensure that when joined on RID, the geometries match. 

> select a.rid as arid, b.rid as brid from temp.a join temp.b using (rid) where st_equals(a.rast::geometry,b.rast::geometry) order by rid desc limit 10; 

> select a.rid as arid, b.rid as brid from temp.a join temp.b using (rid) where st_equals(a.rast::geometry,b.rast::geometry) order by rid asc limit 10; 

B4. You could also sample 1000's of random rids within the tables and check they line up between rasters, to gain confidence that the ordering is maintained throughout the data.


in short - if the rasters were given in the same coordinate system/scale/tiling, and if the start of the raster tables line up, and the ends line up, if the rid ordering looks simple/predictable, and if there is no gap in the number sequence along the way, then it seems reasonable to trust the rid as a proxy for the geometry.

This seems to work well. Using a join on 'rid' across the 4 tables is extremely fast because it is a simple integer and it is already implicitly indexed. I was able to complete the entire analysis (including checking st_value() across the raster) in less than an hour.  The results look as expected.

I would recommend using rid in this way as a proxy for geometry to anyone else doing analysis between large rasters that can be naturally placed into an identical coordinate system, tiling, position and so on. But if you are doing this, remember to run tests B0-B4 to gain some confidence that your rids really are a safe proxy. 

select ...
  from a join b using (rid) join c using (rid) join d using (rid) ... (join x,y sequence columns)
  where ... (st_value stuff on pixels)

Hope someone finds this interesting.

Graeme



More information about the postgis-users mailing list