[gdal-dev] Experimental: concurrent update of a dataset
Even Rouault
even.rouault at spatialys.com
Thu Jun 18 10:15:30 PDT 2015
Hi,
For those interested in parallelizing algorithms that generate a big dataset
(let's say a GeoTIFF), I've just committed to trunk an improvement to the
existing, and I guess not very well known, GDAL "API proxy" mechanism
(http://www.gdal.org/gdal_api_proxy.html), which was initially designed to deal
with datasets in isolation from the main process.
The improvement is the addition of the "-nofork" option (for Unix builds only
for now, but it could be extended to Windows relatively easily if needed) to
the gdalserver utility. It causes all actions of the different client
connections to be run (sequentially) in the same thread, so clients that open
a dataset with the same name share the same dataset object. Consequently,
safe parallel (in fact serialized) update of a dataset is possible.
Demo:
1) Create a target dataset:
gdalwarp in.tif /tmp/out.tif -overwrite -co TILED=YES
(Ctrl+C almost immediately to just create the file, could be done more cleanly
but that's enough for the demo)
2) Launch the server:
gdalserver -unixserver /tmp/mysocket -nofork -v
(you could use "-tcpserver 8080" also, in which case you would set
localhost:8080 as the value of the below GDAL_API_PROXY_SERVER)
3) Launch in parallel in 2 terminals:
a) GDAL_API_PROXY_SERVER=/tmp/mysocket gdalwarp upper.vrt API_PROXY:/tmp/out.tif
b) GDAL_API_PROXY_SERVER=/tmp/mysocket gdalwarp lower.vrt API_PROXY:/tmp/out.tif
where upper.vrt and lower.vrt are 2 VRTs that cover the upper and lower parts
of in.tif.
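Step 3 can also be scripted instead of being run from two terminals. The
following is only a minimal Python sketch, not something shipped with GDAL: the
helper names client_env, client_cmd and launch_clients are hypothetical, and it
assumes gdalwarp is on the PATH and the server of step 2 is already listening.

```python
import os
import subprocess

def client_env(server, base_env=None):
    """Return a copy of the environment with GDAL_API_PROXY_SERVER set."""
    env = dict(os.environ if base_env is None else base_env)
    # Unix socket path (e.g. /tmp/mysocket) or host:port (e.g. localhost:8080)
    env["GDAL_API_PROXY_SERVER"] = server
    return env

def client_cmd(source, target):
    """Build the gdalwarp command line of one client writing through the proxy."""
    return ["gdalwarp", source, "API_PROXY:" + target]

def launch_clients(sources, target, server):
    """Start one gdalwarp client per source, then wait for all of them.

    Returns the exit codes in the same order as `sources`.
    """
    env = client_env(server)
    procs = [subprocess.Popen(client_cmd(src, target), env=env) for src in sources]
    return [proc.wait() for proc in procs]
```

With the names of the demo, the call would be
launch_clients(["upper.vrt", "lower.vrt"], "/tmp/out.tif", "/tmp/mysocket").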
A cool aspect is that you can violently interrupt any client at any time and
the integrity of the output dataset will still be preserved (but you can only
safely kill the server once all clients connected to the same output dataset
have terminated, which the server will tell you with the verbose -v flag). So
you can resume part of the processing later (assuming clients deal with
separate parts of the output raster).
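How the clients' parts are kept separate is left open here. One possible way,
shown as a sketch only, is to cut the raster into contiguous horizontal strips
and give one strip to each client. The function names split_rows and
srcwin_args are hypothetical, and building each VRT with gdal_translate and
its -srcwin option is an assumption, not necessarily how upper.vrt and
lower.vrt were made.

```python
def split_rows(height, parts):
    """Split `height` raster rows into `parts` contiguous, non-overlapping strips.

    Returns a list of (yoff, ysize) pairs that cover every row exactly once.
    """
    if parts < 1 or height < parts:
        raise ValueError("need 1 <= parts <= height")
    base, extra = divmod(height, parts)
    strips = []
    yoff = 0
    for i in range(parts):
        # The first `extra` strips get one additional row each.
        ysize = base + (1 if i < extra else 0)
        strips.append((yoff, ysize))
        yoff += ysize
    return strips

def srcwin_args(width, strip):
    """Return the `-srcwin xoff yoff xsize ysize` arguments for one strip."""
    yoff, ysize = strip
    return ["-srcwin", "0", str(yoff), str(width), str(ysize)]
```

For a 2000 x 1000 raster split in two, srcwin_args(2000, split_rows(1000, 2)[1])
gives the window of the lower part, which could be passed to something like
"gdal_translate -of VRT -srcwin 0 500 2000 500 in.tif lower.vrt".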
You can also display the result with QGIS while it is being processed (this
will slow things down of course, and QGIS should be launched AFTER a first
client so that it doesn't open the dataset in read-only mode):
$ ln -s API_PROXY:/tmp/out.tif proxied_out.tif
$ GDAL_API_PROXY_SERVER=/tmp/mysocket qgis proxied_out.tif
The server, and thus the output file, could also be on a completely different
machine, when using TCP mode of course. The clients could be on different
machines also. They could also be 2 threads of the same process (assuming
each of them uses a dedicated dataset handle obtained with a
GDALOpen("API_PROXY:/tmp/out.tif", GA_Update) call).
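The two-threads case can be sketched as follows, here in Python rather than C.
This is an illustration under assumptions: run_workers and open_proxied are
hypothetical helper names, the GDAL Python bindings (osgeo.gdal) are assumed to
be installed, and GDAL_API_PROXY_SERVER is assumed to be set in the
environment before the process starts. The point it shows is that each thread
opens its own handle and never shares it.

```python
import threading

def run_workers(open_dataset, name, jobs):
    """Run each job in its own thread, each with a private dataset handle.

    `open_dataset(name)` must return a new handle on every call;
    each job is a callable taking that handle.
    """
    results = [None] * len(jobs)

    def worker(index, job):
        ds = open_dataset(name)  # dedicated handle, opened inside the thread
        results[index] = job(ds)

    threads = [
        threading.Thread(target=worker, args=(i, job))
        for i, job in enumerate(jobs)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

def open_proxied(name):
    """Open a dataset in update mode (needs the GDAL Python bindings)."""
    from osgeo import gdal  # imported lazily so the rest works without GDAL
    return gdal.Open(name, gdal.GA_Update)
```

A call would then look like
run_workers(open_proxied, "API_PROXY:/tmp/out.tif", [job_upper, job_lower]),
where job_upper and job_lower are placeholders for functions that each write
their own part of the raster.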
This demo is probably not very exciting (you could use the -multi -wo
NUM_THREADS=ALL_CPUS options of gdalwarp and get better performance), but it
should give an idea of what this is about. Of course, as the communication of
all clients with the server in -nofork mode is serialized, this is only
interesting if writing the output dataset itself is not the bottleneck of the
processing. This also works for read/update scenarios (which is in fact what
gdalwarp does, since it asks for the content of the blocks it will update).
Enjoy,
Even
--
Spatialys - Geospatial professional services
http://www.spatialys.com