[gdal-dev] Experimental: concurrent update of a dataset

Even Rouault even.rouault at spatialys.com
Thu Jun 18 10:15:30 PDT 2015


Hi,

For those interested in parallelizing algorithms that generate a big dataset 
(let's say a GeoTIFF), I've just committed to trunk an improvement to the 
existing (and, I guess, not so well known) GDAL "API proxy" mechanism 
(http://www.gdal.org/gdal_api_proxy.html), which was initially designed to 
deal with datasets in isolation from the main process.

The improvement consists of a new "-nofork" option for the gdalserver utility 
(for Unix builds only for now, but it could relatively easily be extended to 
Windows if needed). It causes all actions of the different client connections 
to be run (sequentially) in the same thread, so that clients share the same 
dataset object if they open it with the same name. Consequently, safe parallel 
(in fact serialized) update of a dataset is possible.
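
To make the idea concrete, here is a toy model in Python of what "serialized 
in the same thread, sharing one dataset object per name" means. This is NOT 
gdalserver code and does not use GDAL at all: the ToyServer class, its 
write_block() method and the dict used as a "dataset" are all made up for the 
illustration.

```python
import queue
import threading


class ToyServer:
    """Conceptual stand-in for a -nofork server (hypothetical, not GDAL code)."""

    def __init__(self):
        self.requests = queue.Queue()
        self.datasets = {}   # dataset name -> shared dict of blocks
        self.log = []        # order in which writes were applied
        self._thread = threading.Thread(target=self._serve)
        self._thread.start()

    def _serve(self):
        # Single thread: requests from all clients are applied one at a time.
        while True:
            item = self.requests.get()
            if item is None:
                break
            name, block, value, done = item
            # Clients using the same name get the same dataset object.
            dataset = self.datasets.setdefault(name, {})
            dataset[block] = value
            self.log.append((name, block))
            done.set()

    def write_block(self, name, block, value):
        # Called from client threads; blocks until the server applied the write.
        done = threading.Event()
        self.requests.put((name, block, value, done))
        done.wait()

    def shutdown(self):
        self.requests.put(None)
        self._thread.join()


def demo():
    server = ToyServer()

    def client(blocks, value):
        for block in blocks:
            server.write_block("/tmp/out.tif", block, value)

    upper = threading.Thread(target=client, args=(range(0, 4), "upper"))
    lower = threading.Thread(target=client, args=(range(4, 8), "lower"))
    upper.start()
    lower.start()
    upper.join()
    lower.join()
    server.shutdown()
    return server
```

Both clients run concurrently, but every write goes through the single server 
thread, so the shared object never sees two updates at once.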

Demo:

1) Create a target dataset:

gdalwarp in.tif /tmp/out.tif -overwrite -co TILED=YES

 (Ctrl+C almost immediately to just create the file, could be done more cleanly 
but that's enough for the demo)

2) Launch the server:
gdalserver -unixserver /tmp/mysocket -nofork  -v

(you could use "-tcpserver 8080" also, in which case you would set 
localhost:8080 as the value of the below GDAL_API_PROXY_SERVER)

3) Launch in parallel in 2 terminals:
    a) GDAL_API_PROXY_SERVER=/tmp/mysocket gdalwarp upper.vrt API_PROXY:/tmp/out.tif
    b) GDAL_API_PROXY_SERVER=/tmp/mysocket gdalwarp lower.vrt API_PROXY:/tmp/out.tif

where upper.vrt and lower.vrt are two VRTs covering the upper and lower parts 
of in.tif.
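
For those who would rather script step 3, here is a minimal Python sketch. It 
assumes the two VRTs are cut out of in.tif with "gdal_translate -of VRT 
-srcwin", that the raster width and height are already known (for instance 
from gdalinfo), and that the server from step 2 is listening on /tmp/mysocket. 
The helper names (split_rows, vrt_command, client_command, client_env, 
run_demo) are made up for this example.

```python
import os
import subprocess


def split_rows(height, parts):
    """Return (yoff, ysize) row bands covering [0, height) without overlap."""
    base, extra = divmod(height, parts)
    bands, yoff = [], 0
    for i in range(parts):
        ysize = base + (1 if i < extra else 0)
        bands.append((yoff, ysize))
        yoff += ysize
    return bands


def vrt_command(src, dst, width, yoff, ysize):
    """gdal_translate call that exposes a full-width row band of src as a VRT."""
    return ["gdal_translate", "-of", "VRT",
            "-srcwin", "0", str(yoff), str(width), str(ysize),
            src, dst]


def client_command(vrt, target="/tmp/out.tif"):
    """gdalwarp client writing through the API proxy, as in step 3."""
    return ["gdalwarp", vrt, "API_PROXY:" + target]


def client_env(socket_path="/tmp/mysocket"):
    """Copy of the current environment pointing the client at the server."""
    env = dict(os.environ)
    env["GDAL_API_PROXY_SERVER"] = socket_path
    return env


def run_demo(src, width, height, names=("upper.vrt", "lower.vrt")):
    """Create the VRTs, then run one gdalwarp client per VRT in parallel."""
    for name, (yoff, ysize) in zip(names, split_rows(height, len(names))):
        subprocess.run(vrt_command(src, name, width, yoff, ysize), check=True)
    procs = [subprocess.Popen(client_command(name), env=client_env())
             for name in names]
    return [proc.wait() for proc in procs]
```

Calling run_demo("in.tif", width, height) with the actual raster size would 
then play the role of the two terminals; nothing is executed until that call.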

A cool aspect is that you can abruptly kill any client at any time and the 
integrity of the output dataset will still be preserved. The server itself, 
however, can only be killed safely once all clients connected to the same 
output dataset have terminated, which the server reports when run with the 
verbose -v flag. So you can resume part of the processing later (assuming 
clients deal with separate parts of the output raster).

You can also display the result with QGIS while it is being processed (this 
will slow things down of course, and QGIS should be launched AFTER a first 
client so that the dataset is not first opened in read-only mode):
$ ln -s API_PROXY:/tmp/out.tif proxied_out.tif
$ GDAL_API_PROXY_SERVER=/tmp/mysocket qgis proxied_out.tif

The server, and thus the output file, could also be on a completely different 
machine, when using TCP mode of course. The clients could be on different 
machines too. They could also be 2 threads of the same process (assuming each 
uses a dedicated dataset handle obtained with a 
GDALOpen("API_PROXY:/tmp/out.tif", GA_Update) call).
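
The "2 threads of the same process" case could look roughly like the Python 
sketch below. It assumes the GDAL Python bindings are installed and that 
GDAL_API_PROXY_SERVER is already set in the environment of the process. The 
run_workers() and gdal_open_update() helpers are made up for the example; the 
only point being illustrated is that each thread opens its own handle instead 
of sharing one.

```python
import threading

DATASET_NAME = "API_PROXY:/tmp/out.tif"


def gdal_open_update(name):
    """Open name in update mode (needs the GDAL Python bindings)."""
    from osgeo import gdal  # imported lazily so the sketch loads without GDAL
    ds = gdal.Open(name, gdal.GA_Update)
    if ds is None:
        raise RuntimeError("cannot open " + name)
    return ds


def run_workers(jobs, opener=gdal_open_update, name=DATASET_NAME):
    """Run each job in its own thread, with a dataset handle of its own.

    jobs is a list of callables taking a dataset handle. Returns the handles
    that were opened, one per thread.
    """
    handles = []
    lock = threading.Lock()

    def worker(job):
        ds = opener(name)       # dedicated handle, never shared across threads
        with lock:
            handles.append(ds)
        job(ds)

    threads = [threading.Thread(target=worker, args=(job,)) for job in jobs]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handles
```

Each job would then write only to its own part of the raster through the 
handle it receives.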

This demo is probably not very exciting (you could use the -multi -wo 
NUM_THREADS=ALL_CPUS options of gdalwarp and get better performance), but it 
should give an idea of what this is about. Of course, as the communication of 
all clients with the server in -nofork mode is serialized, this is only 
interesting if writing the output dataset itself is not the bottleneck of the 
processing. This also works for read/update scenarios (which is in fact what 
gdalwarp does, since it asks for the content of the blocks it will update).

Enjoy,

Even


-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

