[gdal-dev] GDAL raster processing: parallel computing
Tim Keitt
tkeitt at utexas.edu
Wed Aug 17 15:40:31 PDT 2016
http://www.keittlab.org/
On Wed, Aug 17, 2016 at 3:26 PM, James Ramm <jamessramm at gmail.com> wrote:
> Hi,
> I wasn't talking about MapReduce per se (although this would also be very
> interesting), but rather the 'classic' functional programming functions of
> 'map' and 'reduce'.
>
> A 'map' function essentially takes some user-defined function and applies
> it to every element of a list/vector or some other iterable. For a raster
> dataset, I was proposing that it take a user-defined function or class
> instance and apply it to every 'block' of a raster dataset - with plenty
> of choices about how each block is supplied ('natural' blocks, user-defined
> sizes, overlaps or 'pixel' buffers etc.).
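>
> To make the idea concrete, here is a rough C++ sketch (not code from the
> repo below - the function name and callback signature are just placeholder
> illustrations of the GDAL block loop):
>
> #include <algorithm>
> #include <functional>
> #include <vector>
> #include "gdal_priv.h"
>
> // Rough sketch of a block-wise 'map': apply a user-defined callback to
> // every natural block of a single band, reading and writing with RasterIO.
> void MapBlocks(GDALRasterBand *poBand,
>                std::function<void(std::vector<double> &, int, int)> fn)
> {
>     int nbx, nby;
>     poBand->GetBlockSize(&nbx, &nby);
>     const int nx = poBand->GetXSize(), ny = poBand->GetYSize();
>     std::vector<double> buf;
>     for (int y = 0; y < ny; y += nby) {
>         const int h = std::min(nby, ny - y);   // edge blocks are smaller
>         for (int x = 0; x < nx; x += nbx) {
>             const int w = std::min(nbx, nx - x);
>             buf.resize(static_cast<size_t>(w) * h);
>             poBand->RasterIO(GF_Read, x, y, w, h, buf.data(),
>                              w, h, GDT_Float64, 0, 0);
>             fn(buf, w, h);                     // user-defined per-block work
>             // Write the block back (assumes the dataset was opened in
>             // update mode).
>             poBand->RasterIO(GF_Write, x, y, w, h, buf.data(),
>                              w, h, GDT_Float64, 0, 0);
>         }
>     }
> }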
>
> A 'reduce' function aims to apply a user-defined function cumulatively to
> the iterable, so as to reduce it to a single value. For raster dataset(s),
> then, a 'reduce' function would take in multiple datasets and pass the
> corresponding block of each dataset to the user-defined function together,
> expecting a single block to be returned, i.e. it reduces multiple datasets
> to a single dataset. Image mosaicking would be an example of this.
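>
> Again purely as an illustration (assuming all inputs share the same extent,
> resolution and block layout; the combiner callback is made up), a block-wise
> 'reduce' could look something like:
>
> #include <algorithm>
> #include <functional>
> #include <vector>
> #include "gdal_priv.h"
>
> // Rough sketch of a block-wise 'reduce': for each block position, hand the
> // matching block of every input band to a user-defined combiner that fills
> // a single output block (e.g. 'first non-nodata value wins' for a mosaic).
> void ReduceBlocks(const std::vector<GDALRasterBand *> &inputs,
>                   GDALRasterBand *poOut,
>                   std::function<void(const std::vector<std::vector<double>> &,
>                                      std::vector<double> &, int, int)> combine)
> {
>     int nbx, nby;
>     poOut->GetBlockSize(&nbx, &nby);
>     const int nx = poOut->GetXSize(), ny = poOut->GetYSize();
>     std::vector<std::vector<double>> in(inputs.size());
>     std::vector<double> out;
>     for (int y = 0; y < ny; y += nby) {
>         const int h = std::min(nby, ny - y);
>         for (int x = 0; x < nx; x += nbx) {
>             const int w = std::min(nbx, nx - x);
>             for (size_t i = 0; i < inputs.size(); ++i) {
>                 in[i].resize(static_cast<size_t>(w) * h);
>                 inputs[i]->RasterIO(GF_Read, x, y, w, h, in[i].data(),
>                                     w, h, GDT_Float64, 0, 0);
>             }
>             out.assign(static_cast<size_t>(w) * h, 0.0);
>             combine(in, out, w, h);            // user-defined combiner
>             poOut->RasterIO(GF_Write, x, y, w, h, out.data(),
>                             w, h, GDT_Float64, 0, 0);
>         }
>     }
> }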
>
> I think a 'map' and 'reduce' function is all that is needed to allow the
> user to do pretty much any kind of processing they want, without actually
> having to worry about how to apply it to a raster dataset of any size and
> all that 'boring' boilerplate. Just define the algorithm and hand it over.
>
> A MapReduce system would be, perhaps, a higher level 'organiser' of how
> these tasks are run - perhaps distributing them across multiple machines.
> Wikipedia does a much better job of explaining MapReduce vs map & reduce
> than me:
> https://en.wikipedia.org/wiki/MapReduce
>
> I started to gather my ideas and snippets of code in a repository here:
> https://github.com/JamesRamm/GeoAlg
>
> It is more of a place to help me form ideas than anything usable at the
> moment.
>
> I would be open to merging relevant parts etc.
>
> What I think is important is:
>
> - Different kinds of block 'iterators' so you can support e.g. overlapping
> neighbourhoods, mosaicking etc. (see the sketch after this list). Iterating
> in blocks is of course essential for handling rasters of any size
> - Ability to save state in the callbacks (I decided to support this by
> making the callback a class instance conforming to some interface rather
> than a function... I guess there are other ways too)
> - It would be good to support complex raster 'masks' as well (e.g.
> different regions of the mask indicating different processing parameters or
> something). I am going to clarify my ideas on this on a wiki page in the
> above repo.
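>
> For the overlapping-neighbourhood iterator mentioned in the first point, the
> only extra ingredient is expanding the requested window by a pixel buffer and
> clamping it at the raster edges. A hypothetical helper (names made up):
>
> #include <algorithm>
> #include <vector>
> #include "gdal_priv.h"
>
> // Read block (xOff, yOff, xSize, ySize) expanded by nBuf pixels on every
> // side, clamped to the raster extent; the window actually read is returned
> // through winX/winY/winW/winH so the caller knows where the core block sits.
> CPLErr ReadBufferedBlock(GDALRasterBand *poBand, int xOff, int yOff,
>                          int xSize, int ySize, int nBuf,
>                          std::vector<double> &buf,
>                          int &winX, int &winY, int &winW, int &winH)
> {
>     const int nx = poBand->GetXSize(), ny = poBand->GetYSize();
>     winX = std::max(0, xOff - nBuf);
>     winY = std::max(0, yOff - nBuf);
>     winW = std::min(nx, xOff + xSize + nBuf) - winX;
>     winH = std::min(ny, yOff + ySize + nBuf) - winY;
>     buf.resize(static_cast<size_t>(winW) * winH);
>     return poBand->RasterIO(GF_Read, winX, winY, winW, winH, buf.data(),
>                             winW, winH, GDT_Float64, 0, 0);
> }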
>
> When I have tried parallel processing in the past, I have generally done
> it on the block array, which neatly avoids having to have multiple
> GDALDataset instances and possibly getting into a tangle with that. However,
> this limits the usefulness of parallel processing to only the most
> complicated of algorithms, as most of the time I find processing to be
> heavily I/O bound.
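>
> Concretely, the pattern I mean is something like the sketch below (assuming
> OpenMP and compiling with -fopenmp; the per-pixel operation is just a
> placeholder): all GDAL I/O stays single-threaded and only the work on the
> in-memory block array is parallelised.
>
> #include <vector>
> #include "gdal_priv.h"
>
> // Process one block: single-threaded RasterIO, parallel per-pixel loop.
> // No GDALDataset or GDALRasterBand is touched inside the parallel region.
> void ProcessBlockParallel(GDALRasterBand *poIn, GDALRasterBand *poOut,
>                           int x, int y, int w, int h)
> {
>     std::vector<double> buf(static_cast<size_t>(w) * h);
>     poIn->RasterIO(GF_Read, x, y, w, h, buf.data(), w, h, GDT_Float64, 0, 0);
> #pragma omp parallel for
>     for (long long i = 0; i < static_cast<long long>(buf.size()); ++i)
>         buf[i] = buf[i] * 2.0;   // placeholder per-pixel operation
>     poOut->RasterIO(GF_Write, x, y, w, h, buf.data(), w, h, GDT_Float64, 0, 0);
> }
>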
> I was hoping at some point to come up with a parallel, block-based
> alternative to GDALPolygonize, as it runs quite slowly for large, tiled
> GeoTIFFs, where access by scanline is sub-optimal. A polygonize function
> also seems sufficiently complex to benefit from parallelisation. OpenCV
> probably has much of the necessary work already done - of course, work
> relying on OpenCV might be better kept outside GDAL, rather than introducing
> such a large and specialised dependency.
>
I would tend to agree. GDAL is really good at translation, but not such a
great platform for implementing generic iterator-based algorithms. I would
suggest using GDAL to translate and then write algorithms separately, where
you could e.g. take advantage of newer C++ features. The Generic Image
Library is one possibility.
THK
>
> I'll be in Bonn from late Tues afternoon for the conference...
>
> I have t
>
> On 17 August 2016 at 13:28, Ari Jolma <ari.jolma at gmail.com> wrote:
>
>> Hi,
>>
>> This relates to RFC 62, raster algebra.
>>
>> I realized that parallel processing is really an essential element of
>> this. I don't have a lot of experience with parallel processing and threads
>> so please let me know if I'm writing silly or ignorant things.
>>
>> James, in your emails you write that map and reduce functions are
>> essential. That seems to point to parallel processing - can you elaborate a
>> bit more on what your approach is there, and are you using some specific
>> libraries etc.?
>>
>> Rutger mentioned Dask and Numba, which seem to be a high level solution.
>>
>> Anyway, I thought I'd make a try with OpenMP and the C++ code I have
>> written so far. On a very simple level it seems that it might be
>> enough to add "#pragma omp parallel for" before the for loop, which
>> iterates over the (cached) blocks, and then compile the code with -fopenmp.
>> Of course this does not work (or it seems to work but does not make the code
>> use more than one CPU at the same time), since a single GDAL Dataset object
>> should not be used by several threads (GDAL FAQ).
>>
>> There seems to be a solution in a book, "Remote Sensing Raster
>> Programming", which I found with Google, and Google Books shows the relevant
>> page. The book suggests adding #pragma omp barrier before GDALRasterIO. To
>> me it seems that that would cause all the raster data to accumulate in
>> RAM. I did not try it, though.
>>
>> It seems that I should somehow make the code spawn a new Dataset object
>> for each thread. The function for that is GDALOpenShared. Now a simple
>> question: what if the raster is created in the code? My test application
>> for this is a simple one: it takes an existing raster and returns a 0/1
>> raster, where a cell has the value 1 if the original raster has value 48,
>> and 0 elsewhere. Is the solution to create the dataset, and then open new
>> connections to it using OpenShared?
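>>
>> To make my question concrete, here is roughly what I imagine (an untested
>> sketch: each thread opens its own read-only handle on the input with
>> GDALOpenShared, and writes to the single output dataset are serialised
>> with an OpenMP critical section):
>>
>> #include <algorithm>
>> #include <vector>
>> #include "gdal_priv.h"
>>
>> // 0/1 test case: output is 1 where the input has value 48, else 0.
>> void Threshold48(const char *pszIn, const char *pszOut)
>> {
>>     GDALAllRegister();
>>     GDALDataset *poProbe =
>>         static_cast<GDALDataset *>(GDALOpen(pszIn, GA_ReadOnly));
>>     const int nx = poProbe->GetRasterXSize(), ny = poProbe->GetRasterYSize();
>>     int nbx, nby;
>>     poProbe->GetRasterBand(1)->GetBlockSize(&nbx, &nby);
>>     GDALClose(poProbe);
>>
>>     // Create the output first, with a single dataset object.
>>     GDALDataset *poOut = GetGDALDriverManager()->GetDriverByName("GTiff")
>>                              ->Create(pszOut, nx, ny, 1, GDT_Byte, nullptr);
>>
>>     const int nYBlocks = (ny + nby - 1) / nby;
>> #pragma omp parallel
>>     {
>>         // Each thread gets its own read-only handle on the input.
>>         GDALDataset *poIn =
>>             static_cast<GDALDataset *>(GDALOpenShared(pszIn, GA_ReadOnly));
>>         GDALRasterBand *poBand = poIn->GetRasterBand(1);
>>         std::vector<double> in;
>>         std::vector<GByte> out;
>> #pragma omp for
>>         for (int yb = 0; yb < nYBlocks; ++yb) {
>>             const int y = yb * nby, h = std::min(nby, ny - y);
>>             for (int x = 0; x < nx; x += nbx) {
>>                 const int w = std::min(nbx, nx - x);
>>                 in.resize(static_cast<size_t>(w) * h);
>>                 out.resize(static_cast<size_t>(w) * h);
>>                 poBand->RasterIO(GF_Read, x, y, w, h, in.data(),
>>                                  w, h, GDT_Float64, 0, 0);
>>                 for (size_t i = 0; i < in.size(); ++i)
>>                     out[i] = (in[i] == 48) ? 1 : 0;
>>                 // Serialise writes: only one thread touches poOut at a time.
>> #pragma omp critical
>>                 poOut->GetRasterBand(1)->RasterIO(
>>                     GF_Write, x, y, w, h, out.data(), w, h, GDT_Byte, 0, 0);
>>             }
>>         }
>>         GDALClose(poIn);
>>     }
>>     GDALClose(poOut);
>> }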
>>
>> By the way, I'll be at the FOSS4G code sprint Tuesday afternoon and
>> Saturday morning if anyone wants to discuss this.
>>
>> Best,
>>
>> Ari
>>
>>