[gdal-dev] GDAL raster processing: parallel computing

James Ramm jamessramm at gmail.com
Wed Aug 17 13:26:00 PDT 2016


Hi,
I wasn't talking about MapReduce per se (although this would also be very
interesting), but rather the 'classic' functional programming functions of
'map' and 'reduce'.

A 'map' function essentially takes some user-defined function and applies
it to every element of a list/vector or some other iterable. For a raster
dataset, I was proposing that it take a user-defined function or class
instance and apply it to every 'block' of the dataset - with plenty of
choices about how each block is supplied ('natural' blocks, user-defined
sizes, overlaps or 'pixel' buffers etc.).
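
To make that concrete, here is a purely illustrative Python sketch - the
names (iter_blocks, map_raster) are invented for this email, and it assumes
single-band input and fixed-size blocks, so treat it as a sketch rather
than a design:

    from osgeo import gdal

    def iter_blocks(band, xsize=256, ysize=256):
        """Yield (xoff, yoff, numpy array) for each block of a raster band."""
        for yoff in range(0, band.YSize, ysize):
            rows = min(ysize, band.YSize - yoff)
            for xoff in range(0, band.XSize, xsize):
                cols = min(xsize, band.XSize - xoff)
                yield xoff, yoff, band.ReadAsArray(xoff, yoff, cols, rows)

    def map_raster(func, src_path, dst_path):
        """Apply func to every block of band 1, writing results to a copy."""
        src = gdal.Open(src_path)
        dst = gdal.GetDriverByName('GTiff').CreateCopy(dst_path, src)
        src_band, dst_band = src.GetRasterBand(1), dst.GetRasterBand(1)
        for xoff, yoff, block in iter_blocks(src_band):
            dst_band.WriteArray(func(block), xoff, yoff)
        dst.FlushCache()

Ari's 0/1 example below then becomes a one-liner:

    map_raster(lambda a: (a == 48).astype('uint8'), 'in.tif', 'out.tif')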

A 'reduce' function aims to apply a user-defined function cumulatively to
the iterable, so as to reduce it to a single value. For raster dataset(s)
then, a 'reduce' function would take in multiple datasets and pass the
corresponding block of each dataset to the user-defined function together,
expecting a single block to be returned. I.e. it reduces multiple datasets
to a single dataset. Image mosaicking would be an example of this.
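
Again only a sketch - it assumes all inputs are aligned, equal-sized and
single-band, and it reuses iter_blocks from the sketch above. A
first-non-nodata mosaic callback might look like:

    import numpy as np
    from osgeo import gdal

    def reduce_rasters(func, src_paths, dst_path):
        """Pass corresponding blocks of every input to func together."""
        sources = [gdal.Open(p) for p in src_paths]
        bands = [ds.GetRasterBand(1) for ds in sources]
        dst = gdal.GetDriverByName('GTiff').CreateCopy(dst_path, sources[0])
        dst_band = dst.GetRasterBand(1)
        for xoff, yoff, first in iter_blocks(bands[0]):
            rest = [b.ReadAsArray(xoff, yoff, first.shape[1], first.shape[0])
                    for b in bands[1:]]
            dst_band.WriteArray(func([first] + rest), xoff, yoff)
        dst.FlushCache()

    def mosaic(blocks, nodata=0):
        """Example callback: first non-nodata value wins at each cell."""
        out = blocks[0].copy()
        for blk in blocks[1:]:
            out = np.where(out == nodata, blk, out)
        return out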

I think 'map' and 'reduce' functions are all that is needed to allow the
user to do pretty much any kind of processing they want, without actually
having to worry about how to apply it to a raster dataset of any size and
all that 'boring' boilerplate. Just define the algorithm and hand it over.

A MapReduce system would be, perhaps, a higher level 'organiser' of how
these tasks are run - perhaps distributing them across multiple machines.
Wikipedia does a much better job of explaining MapReduce vs map & reduce
than I do:
https://en.wikipedia.org/wiki/MapReduce

I started to gather my ideas and snippets of code in a repository here:
https://github.com/JamesRamm/GeoAlg

It is more of a place to help me form ideas than anything usable at the moment.

I would be open to merging relevant parts etc.

What I think is important is:

- Different kinds of block 'iterators', so you can support e.g. overlapping
neighbourhoods, mosaicking etc. Iterating in blocks is of course essential
for handling rasters of any size.
- Ability to save state in the callbacks (I decided to support this by
making the callback a class instance conforming to some interface rather
than a function... I guess there are other ways too - see the sketch after
this list).
- It would be good to support complex raster 'masks' as well (e.g.
different regions of the mask indicating different processing parameters).
I am going to clarify my ideas on this on a wiki page in the above repo.
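
On the second point, this is roughly what I mean by a class instance
conforming to an interface - the interface here is made up for
illustration; the state survives the run because it lives on the instance:

    import numpy as np

    class BlockHistogram:
        """Stateful callback: accumulates a histogram across all blocks."""

        def __init__(self, bins=256, value_range=(0, 256)):
            self.bins = bins
            self.range = value_range
            self.counts = np.zeros(bins, dtype='int64')

        def __call__(self, block):
            hist, _ = np.histogram(block, bins=self.bins, range=self.range)
            self.counts += hist
            return block  # pass the data through unchanged

    hist = BlockHistogram()
    map_raster(hist, 'in.tif', 'copy.tif')  # map_raster from the first sketch
    print(hist.counts)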

When I have tried parallel processing in the past, I have generally done it
on the block array, which neatly avoids having multiple GDALDataset
instances and possibly getting into a tangle with that. However, this limits
the usefulness of parallel processing to only the most computationally
expensive algorithms, as most of the time I find processing to be heavily
I/O-bound.
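
Roughly what I mean, again as a sketch (multiprocessing and the Float32
output are arbitrary choices here, and iter_blocks is the helper from the
first sketch): only the numpy blocks cross the process boundary, so no
GDALDataset handle is ever shared.

    import numpy as np
    from multiprocessing import Pool
    from osgeo import gdal

    def work(args):
        """Placeholder for a CPU-heavy per-block algorithm."""
        xoff, yoff, block = args
        return xoff, yoff, np.sqrt(block.astype('float64'))

    def parallel_map(src_path, dst_path, workers=4):
        src = gdal.Open(src_path)
        dst = gdal.GetDriverByName('GTiff').Create(
            dst_path, src.RasterXSize, src.RasterYSize, 1, gdal.GDT_Float32)
        dst.SetGeoTransform(src.GetGeoTransform())
        dst.SetProjection(src.GetProjection())
        src_band, dst_band = src.GetRasterBand(1), dst.GetRasterBand(1)
        with Pool(workers) as pool:
            # all reads and writes stay in the parent process
            for xoff, yoff, out in pool.imap_unordered(
                    work, iter_blocks(src_band)):
                dst_band.WriteArray(out, xoff, yoff)
        dst.FlushCache()
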
I was hoping at some point to come up with a parallel, block-based
alternative to gdal Polygonize, as it runs quite slowly for large, tiled
GeoTIFFs, where access by scanline is sub-optimal. A polygonize function
also seems sufficiently complex to benefit from parallelisation. OpenCV
probably has much of the necessary work already done - of course, work
relying on OpenCV might be better kept outside gdal, rather than introducing
such a large & specialised dependency.

I'll be in Bonn from late Tuesday afternoon for the conference...

On 17 August 2016 at 13:28, Ari Jolma <ari.jolma at gmail.com> wrote:

> Hi,
>
> This relates to RFC 62, raster algebra.
>
> I realized that parallel processing is really an essential element of
> this. I don't have a lot of experience with parallel processing and threads
> so please let me know if I'm writing silly or ignorant things.
>
> James, in your emails you write that map and reduce functions are
> essential. That seems to point to parallel processing - can you elaborate a
> bit more, what's your approach there and are you using some specific
> libraries etc.?
>
> Rutger mentioned Dask and Numba, which seem to be a high level solution.
>
> Anyway, I thought I'd make a try with OpenMP and the C++ code I have
> written so far. On a very very simple level it seems that it might be
> enough to add "#pragma omp parallel for" before the for loop, which
> iterates over the (cached) blocks. And then compile the code with -fopenmp.
> Of course this does not work (or it seems to work but not make the code use
> more than one CPU at the same time) since a single GDAL Dataset object
> should not be used by several threads (GDAL FAQ).
>
> There seems to be a solution in a book "Remote Sensing Raster
> Programming", which I found with Google, and Google Books shows the relevant
> page. The book suggests adding #pragma omp barrier before GDALRasterIO. To
> me it seems that would cause all the raster data to accumulate in
> RAM. I did not try it, though.
>
> It seems that I should somehow make the code spawn a new Dataset object
> for each thread. The function for that is GDALOpenShared. Now a simple
> question: What if the raster is created in the code? My test application
> for this is a simple one, which takes an existing raster and returns a 0/1
> raster, where the cell has 1 if the original raster has value 48 and 0
> elsewhere. Is the solution to create the dataset, and then open new
> connections to it using OpenShared?
>
> By the way, I'll be at the FOSS4G code sprint Tuesday afternoon and
> Saturday morning if anyone wants to discuss this.
>
> Best,
>
> Ari
>
>