[gdal-dev] GDAL raster processing: parallel computing
Even Rouault
even.rouault at spatialys.com
Wed Aug 17 06:09:42 PDT 2016
Hi Ari,
>
> This relates to RFC 62, raster algebra.
>
> I realized that parallel processing is really an essential element of
> this. I don't have a lot of experience with parallel processing and
> threads so please let me know if I'm writing silly or ignorant things.
>
> James, in your emails you write that map and reduce functions are
> essential. That seems to point to parallel processing - can you
> elaborate a bit more, what's you approach there and are you using some
> specific libraries etc?
>
> Rutger mentioned Dask and Numba, which seem to be a high level solution.
There's also Spark that is popular :
https://en.wikipedia.org/wiki/Apache_Spark
There's also MPI: https://en.wikipedia.org/wiki/Message_Passing_Interface
I've no direct experience with any of those though. Happy to hear about other
comments. But I'm not sure we would want to use directly one of those
solutions into GDAL itself
>
> Anyway, I thought I'd make a try with OpenMP and the C++ code I have
> written so far. On a very very simple level it seems that it might be
> enough to add "#pragma omp parallel for" before the for loop, which
> iterates over the (cached) blocks. And then compile the code with
> -fopenmp. Of course this does not work (or it seems to work but not make
> the code use more than one cpu at the same time) since a single GDAL
> Dataset object should not be used by several threads (GDAL FAQ).
You should see some multi-threaded use though, but indeed with a likely crash
at some point when the dataset object is used by several threads.
>
> There seems to be a solution in a book "Remote Sensing Raster
> Programming", which I found with google and google books shows the
> relevant page. The book suggests adding #pragma omp barrier before
> GDALRasterIO. To me it seems that that would cause all the raster data
> to accumulate into the RAM. I did not try it though.
This should just put a mutex around the GDALRasterIO call. Not sure why data
would accumulate into the RAM. I mean you should declare the buffer that
receives the raster data to be thread private. There's a OMP directive for
that.
Although OMP looks like it should be easy to use, I personnaly find it a bit
hard to master and make sure you do things right because it is easy to forget
to declare that some variable must be thread private. So until now, I've done
multi-threading at the old school way and here you can more easily see what is
shared vs thread-private. There's a port/cpl_worker_thread_pool.h class I've
used in a few places : warpkernel, gdalgrid, GTiff driver multi-threaded
compression. But this is mostly personal taste. Someone experienced with OMP
should manage to write correct code, and in a less verbose way than explicit
multithreading.
>
> It seems that I should somehow make the code spawn a new Dataset object
> for each thread. The function for that is GDALOpenShared.
Not really. GDALOpenShared("foo") use case is more for the single-threaded
usage where you could have sometimes the need to open the same file but from
different places and would want to share the same dataset so as to be able to
take advantage of the cache. This was the case for the old way the VRT driver
managed its sources (think of a RGB TIFF wrapped as a VRT: there are 3 sources
pointing to the same TIFF file).
In the multi-threaded case, GDALOpenShared() will return different objects if
called with the same dataset name from different threads. Like GDALOpen().
> Now a simple
> question: What if the raster is created in the code? My test application
> for this is a simple one, which takes an existing raster and returns a
> 0/1 raster, where the cell has 1 if the original raster has value 48 and
> 0 elsewhere. Is the solution to create the dataset, and then open new
> connections to it using OpenShared?
If it is a in-memory raster created with the MEM driver *and* if you do
RasterIO requests such as nXSize == nBufXSize && nYSize == nBufYSize (ie no
resampling), then you can share the same dataset object since the
implementation doesn't require accessing to blocks.
Otherwise you'll need different datasets objects, or put a mutex around
RasterIO calls.
>
> By the way, I'll be at the FOSS4G code sprint Tuesday afternoon and
> Saturday morning if anyone wants to discuss this.
I'll be there too.
Even
--
Spatialys - Geospatial professional services
http://www.spatialys.com
More information about the gdal-dev
mailing list