[gdal-dev] Update on RFC 62: Raster algebra

Fri Sep 23 05:19:23 PDT 2016

>> Richard Barnes:
>>In building tile/block managers, I too have found it difficult to use iterators or design algorithms without specifically considering both tiles/blocks >>and cells. Without doing so, it is very easy to write code which is (extremely) cache inefficient.

I think that is also Ari's main objection and a valid one too. It is possible to qualify "extremely" though: for pixel-by-pixel operations it means having to cache a whole row of blocks for each band, rather than a single block (or do some really expensive swapping). So in the worst case, I have blocks that are one full column of the band, then I need to cache the complete dataset. In the best case I have blocks that are a single row and I need to cache only a single block.

My personal justification for going this route anyway is that I intend to do operations on multiple bands that may have different block sizes. Now imagine a design the works on the basis of blocks and that keeps blocks in cache until they are no longer needed.  In the worst case I would have a combination of bands where one has full rows as blocks and the other has full columns as blocks; I would then need to cache the complete dataset. In the best case I would have two datasets with identical block sizes and I would need to cache only a single block per dataset.

So, in both solutions there is a worst case that requires caching the whole dataset and a best case requiring only a single block per dataset. In both solutions the problem can be solved if you can control the block sizes. I can see that row-by-row will be problematic more often than block-by-block, but the extremes are the same.

I think another problem people see with the row-by-row solution is that it does not parallelize well. However, I do not think that is a real problem because it is possible to rewrite a large row-by-row operation as a number of row-by-row operations on subsets (possibly blocks, but not necessarily so) of the raster data sets.

>> I'm not sure if flow algebras have arisen in the discussion yet, but they come to mind when I think of raster algebras. 
>> They permit operations in
>> which the values of "downstream" cells are functions of upstream cells. In such a case, efficient calculations are then driven both by blocking and >> by the data itself. In recent work, I've found that a number of flow algebra functions can be written by considering only one block at a time (link, >> link). I'm working on generalizing the concept now and can imagine it forming an easy way to quickly add general terrain analysis functionality.

Ari has mentioned catchment delineation, so it certainly is an interest.  The fill algorithm has been a computational bottleneck for me, so I will read your paper with interest too. And I would be highly interested in your generalized flow algebra, IMO there is great potential for generalized spatial analysis that sits somewhere between the hard-baked tools in GIS software and the low level raster data interface of, say, GDAL.