[COG] Fwd: Cloud optimized GeoTIFF configuration specifics [SEC=UNOFFICIAL]
Even Rouault
even.rouault at spatialys.com
Mon Jun 18 03:37:24 PDT 2018
> Full report was linked before in other threads:
> https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md#benchmark
Very interesting. The increase in median time to open the first file as the
number of threads grows is a bit surprising. In the /vsicurl/ layer, there are
some shared structures protected by a mutex, for caching purposes, and one of
them, the 'region cache', could have a linear performance pattern, but I'd
expect to see observable contention with thousands of threads or more, not just
50. Some profiling to see where the time is spent would be interesting (with a
3 second delay, simply hitting Ctrl+C in a debugger and displaying the stack
trace would do).
> This was using HTTP 1.1, I haven't looked whether HTTP/2 makes a difference.
For multi-threaded use, probably little / none. HTTP/2 is only beneficial if
you can use the same CURL handle to serve several queries.
>
> So concurrent reads (and opens) are important, but it doesn't have to be
> "within GDAL itself". Ideally I would like to have many concurrent file
> accesses, but without all the threads. I know GDAL has some async support,
> but I haven't had a chance to look into that properly yet. I understand it
> depends on the plugin.
You are probably referring to
https://trac.osgeo.org/gdal/wiki/rfc24_progressive_data_support
The only implementation of that API is for the JPIP JPEG2000 protocol.
One limitation I see with that API is that it lacks a select() approach where
you could put several requests in a pool and be woken up when a result
arrives. Currently you would need a busy loop calling
GetNextUpdatedRegion() with a small timeout, which can burn CPU uselessly.
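
To make the difference concrete, here is a minimal sketch in plain Python. It
does not call GDAL at all: fake_request(), wait_by_polling() and
wait_select_style() are hypothetical names invented for illustration, and the
"requests" are just threads that sleep. It only contrasts the two waiting
patterns: checking every pending request with a small timeout, versus blocking
on a single completion queue.

```python
import queue
import threading
import time


def fake_request(req_id, delay, flags, done_queue):
    """Hypothetical stand-in for an asynchronous read: sleep, then report completion."""
    time.sleep(delay)
    flags[req_id] = True      # state that a polling loop has to keep checking
    done_queue.put(req_id)    # event that a select()-style waiter blocks on


def _start_requests(delays):
    """Launch one fake request per delay; return the shared state and the threads."""
    flags = {i: False for i in range(len(delays))}
    done_queue = queue.Queue()
    threads = [
        threading.Thread(target=fake_request, args=(i, d, flags, done_queue))
        for i, d in enumerate(delays)
    ]
    for t in threads:
        t.start()
    return flags, done_queue, threads


def wait_by_polling(delays, poll_timeout=0.001):
    """Busy loop: wake up every poll_timeout seconds and check each pending request."""
    flags, _, threads = _start_requests(delays)
    pending = set(flags)
    order, wakeups = [], 0
    while pending:
        wakeups += 1
        for req_id in sorted(pending):
            if flags[req_id]:
                pending.discard(req_id)
                order.append(req_id)
        time.sleep(poll_timeout)
    for t in threads:
        t.join()
    return order, wakeups


def wait_select_style(delays):
    """Block on one queue: the caller wakes up exactly once per completed request."""
    _, done_queue, threads = _start_requests(delays)
    order, wakeups = [], 0
    for _ in delays:
        order.append(done_queue.get())   # sleeps until some request finishes
        wakeups += 1
    for t in threads:
        t.join()
    return order, wakeups


if __name__ == "__main__":
    delays = [0.2, 0.05, 0.1]
    print("polling:", wait_by_polling(delays))
    print("select-style:", wait_select_style(delays))
```

The select-style waiter wakes up once per result, while the polling loop wakes
up many times with nothing to do. That wasted work is the cost I am pointing at.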
One limitation of a single-threaded asynchronous approach in your use case is
that you use DEFLATE, which is CPU hungry. So you are really taking advantage
of multiple vCPUs. Does "total_cpu : 92.00 sec" really measure CPU
activity? If so, I'd normally expect most of it to be spent in DEFLATE
decompression (although 92 s to decompress ~ 500 MB seems too much, so
there's some non-negligible CPU activity happening somewhere else).
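
As a rough sanity check of that remark: taken at face value, 92 s for ~500 MB
is about 5.4 MB/s. The sketch below computes that figure and measures zlib
inflate throughput on the local machine for comparison. The function names are
hypothetical, and the test data is synthetic (half random bytes, half zeros),
which is an assumption and not representative of the actual imagery, so treat
the measured number as an order of magnitude only.

```python
import os
import time
import zlib


def implied_throughput_mb_s(megabytes, cpu_seconds):
    """Throughput implied by decompressing `megabytes` in `cpu_seconds` of CPU time."""
    return megabytes / cpu_seconds


def measure_inflate_mb_s(size_mib=8):
    """Time zlib.decompress() on synthetic data; returns decompressed MB per second."""
    # Assumption: 1 KiB of random bytes followed by 1 KiB of zeros, repeated,
    # as a crude stand-in for partially compressible raster data.
    raw = b"".join(os.urandom(1024) + bytes(1024) for _ in range(size_mib * 512))
    packed = zlib.compress(raw, 6)
    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    elapsed = time.perf_counter() - start
    if unpacked != raw:
        raise RuntimeError("round trip mismatch")
    return len(raw) / 1e6 / elapsed


if __name__ == "__main__":
    print("implied by report: %.1f MB/s" % implied_throughput_mb_s(500, 92))
    print("measured locally:  %.1f MB/s" % measure_inflate_mb_s())
```

If the locally measured rate is far above the implied one, that supports the
suspicion that a good share of the 92 s goes somewhere other than DEFLATE.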
>
> I'd like to be able to
>
> 1. Issue multiple open requests from the same thread
In an asynchronous way? Hard to do.
> 2. Issue multiple read requests from the same thread (limiting to 1
> concurrent request per file handle is ok, but supporting multiple would be
> better)
The GTiff driver would need to implement RFC24, possibly enhanced with a
select() approach, and the /vsicurl/ layer (more generally the VSIL
abstraction) would also need to expose a pooled asynchronous read interface.
> 3. Possibly tune size/lifetime of the connection pool
>
> If this currently works in GDAL for GeoTIFFs already, great, I will look
> into that in more detail when I get time, if not any comments on the amount
> of work involved and if anyone is looking into async support for GeoTIFF
> reading over curl back-end.
>
>
> Regards,
>
> Kirill
--
Spatialys - Geospatial professional services
http://www.spatialys.com