[COG] Fwd: Cloud optimized GeoTIFF configuration specifics [SEC=UNOFFICIAL]

Even Rouault even.rouault at spatialys.com
Mon Jun 18 03:37:24 PDT 2018


> Full report was linked before in other threads:
> https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md#benchmark

Very interesting. The increase in the median time to open the first file as the 
number of threads grows is a bit surprising. In the /vsicurl/ layer, there are 
some shared structures protected by a mutex, for caching purposes, and one of 
them, the 'region cache', could have a linear performance pattern, but I'd 
expect to see observable contention with thousands of threads or more, not just 
50. Some profiling to see where the time is spent could be interesting (with a 
3 second delay, simply hitting Ctrl+C in a debugger and displaying the stack 
trace would do).

> This was using HTTP 1.1, I haven't looked whether HTTP/2 makes a difference.

For multi-threaded use, probably little / none. HTTP/2 is only beneficial if 
you can use the same CURL handle to serve several queries.

> 
> So concurrent reads (and opens) are important, but it doesn't have to be
> "within GDAL itself". Ideally I would like to have many concurrent file
> accesses, but without all the threads. I know GDAL has some async support,
> but I haven't had a chance to look into that properly yet. I understand it
> depends on the plugin.

You are probably referring to
https://trac.osgeo.org/gdal/wiki/rfc24_progressive_data_support
The only implementation of that API is for the JPIP JPEG2000 protocol.
One limitation I see with that API is that it lacks a select() approach where 
you could put several requests in a pool and be woken up when some result 
arrives. Currently you would need to run a busy loop calling 
GetNextUpdatedRegion() with a small timeout, which can burn CPU uselessly.
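
To make the difference between the two waiting styles concrete, here is a 
minimal Python sketch. It does not use the GDAL API at all: fake_request() and 
its delays are made up for the illustration, and a thread pool merely stands in 
for requests that are in flight. The first waiter polls with a small timeout, 
the second blocks until at least one request has completed, which is the 
select()-like behaviour described above.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fake_request(name, delay):
    """Hypothetical stand-in for a remote read: sleep, then return `name`."""
    time.sleep(delay)
    return name


def poll_until_done(futures, timeout=0.001):
    """Busy loop: wake up every `timeout` seconds and check each request."""
    pending, finished, wakeups = list(futures), [], 0
    while pending:
        wakeups += 1
        for fut in [f for f in pending if f.done()]:
            pending.remove(fut)
            finished.append(fut.result())
        if pending:
            time.sleep(timeout)
    return finished, wakeups


def select_until_done(futures):
    """select()-like: block until at least one pending request completes."""
    pending, finished, wakeups = set(futures), [], 0
    while pending:
        wakeups += 1
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        finished.extend(f.result() for f in done)
    return finished, wakeups


def run(waiter):
    """Start three fake requests and collect them with the given waiter."""
    delays = {"a": 0.05, "b": 0.15, "c": 0.25}
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(fake_request, n, d) for n, d in delays.items()]
        return waiter(futures)


if __name__ == "__main__":
    for waiter in (poll_until_done, select_until_done):
        names, wakeups = run(waiter)
        print(f"{waiter.__name__}: got {sorted(names)} after {wakeups} wakeups")
```

The polling variant wakes up many times without any result to process, while 
the blocking variant wakes up at most once per completed request.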

One limitation of a single-threaded asynchronous approach in your use case is 
that you use DEFLATE, which is CPU hungry. So you are really taking advantage 
of multiple vCPUs. Does the "total_cpu :   92.00 sec" figure really measure CPU 
activity? If so, I'd normally expect most of it to be spent in DEFLATE 
decompression (although 92 s to decompress ~ 500 MB seems too much, so 
there's some non-negligible CPU activity happening somewhere else).
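
A rough way to sanity-check that kind of figure is to time DEFLATE 
decompression on its own. The sketch below uses Python's zlib module on 
synthetic data, measures CPU time (time.process_time(), not wall-clock time), 
and scales the result linearly to 500 MB of decompressed output. The payload, 
its size, the helper names and the reading of "~ 500 MB" as decompressed size 
are all assumptions made for the illustration; real imagery and the actual 
GDAL/libtiff code path will behave differently.

```python
import os
import time
import zlib

# Map every byte value onto 16 symbols so the random payload compresses
# roughly 2:1. This is synthetic data, not representative of real imagery.
TABLE = bytes(i & 0x0F for i in range(256))


def measure_inflate(raw_size=4 * 1024 * 1024, repeats=5):
    """Return (cpu_seconds, bytes_inflated) for inflating synthetic data."""
    raw = os.urandom(raw_size).translate(TABLE)
    compressed = zlib.compress(raw, 6)
    start = time.process_time()  # CPU time of this process, not wall clock
    for _ in range(repeats):
        out = zlib.decompress(compressed)
    cpu_seconds = time.process_time() - start
    assert out == raw  # sanity check: the round trip is lossless
    return cpu_seconds, raw_size * repeats


def extrapolate(cpu_seconds, nbytes, target_bytes=500e6):
    """Scale the measured CPU time linearly to `target_bytes` of output."""
    return cpu_seconds * target_bytes / nbytes


if __name__ == "__main__":
    cpu, n = measure_inflate()
    print(f"inflated {n / 1e6:.0f} MB in {cpu:.3f} s of CPU time")
    print(f"linear estimate for 500 MB: {extrapolate(cpu, n):.1f} s")
```

Comparing such an estimate with the reported total gives an idea of how much 
CPU time is left to be explained by something other than decompression.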

> 
> I'd like to be able to
> 
> 1. Issue multiple open requests from the same thread

In an asynchronous way? Hard to do.

> 2. Issue multiple read requests from the same thread (limiting to 1
> concurrent request per file handle is ok, but supporting multiple would be
> better)

The GTiff driver would need to implement RFC24, possibly enhanced with a 
select() approach, and the /vsicurl/ layer (more generally the VSIL 
abstraction) would also need to expose a pooled asynchronous read interface.
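
Purely as an illustration of the shape such an interface could take from the 
caller's side, here is a small Python asyncio sketch. Nothing in it exists in 
GDAL: FakeRemoteFile, read_tile() and read_and_inflate() are hypothetical 
names, the network is simulated with a sleep, and the tiles are tiny synthetic 
buffers. It shows a single thread issuing several reads at once, with one 
request in flight per file handle, and with DEFLATE decompression handed to a 
worker pool so that the CPU work does not stall the other pending reads.

```python
import asyncio
import zlib


class FakeRemoteFile:
    """Hypothetical file handle allowing one read in flight at a time."""

    def __init__(self, name, tiles):
        self.name = name
        self._tiles = tiles          # tile index -> DEFLATE-compressed bytes
        self._lock = asyncio.Lock()  # at most 1 concurrent request per handle

    async def read_tile(self, index):
        async with self._lock:
            await asyncio.sleep(0.01)  # stands in for network latency
            return self._tiles[index]


async def read_and_inflate(handle, index):
    """Fetch one compressed tile, then inflate it in the default executor."""
    compressed = await handle.read_tile(index)
    loop = asyncio.get_running_loop()
    # Decompress off the event-loop thread so other reads keep progressing.
    return await loop.run_in_executor(None, zlib.decompress, compressed)


async def main():
    # Three fake files with two synthetic tiles each (made-up content).
    files = [
        FakeRemoteFile(
            f"file{i}",
            {t: zlib.compress(bytes([i, t]) * 500) for t in (0, 1)},
        )
        for i in range(3)
    ]
    jobs = [read_and_inflate(f, t) for f in files for t in (0, 1)]
    # gather() returns the results in the same order as `jobs`.
    return await asyncio.gather(*jobs)


if __name__ == "__main__":
    tiles = asyncio.run(main())
    print(f"read {len(tiles)} tiles, {sum(len(t) for t in tiles)} bytes total")
```

Reads on different handles overlap, while the two reads on the same handle are 
serialized by the per-handle lock.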

> 3. Possibly tune size/lifetime of the connection pool
> 
> If this currently works in GDAL for GeoTIFFs already, great, I will look
> into that in more detail when I get time, if not any comments on the amount
> of work involved and if anyone is looking into async support for GeoTIFF
> reading over curl back-end.
> 
> 
> Regards,
> 
> Kirill


-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

