[COG] Fwd: Cloud optimized GeoTIFF configuration specifics [SEC=UNOFFICIAL]

Kirill Kouzoubov kirill888 at gmail.com
Sun Jun 17 23:39:17 PDT 2018


I thought I'd reply to the concurrency and networking part of the
discussion in a separate sub-thread.

On Fri, Jun 15, 2018 at 9:18 PM Even Rouault <even.rouault at spatialys.com>
wrote:
...

> There's currently no optimization to issue parallel requests if the bands
> are separate (PLANARCONFIG=SEPARATE in TIFF parlance) instead of using
> pixel interleaving (PLANARCONFIG=CONTIG), but one could probably be added.
> And that doesn't require the bands to be in separate files.
>
> That said, I'm not completely convinced that this would result in
> (significant) performance wins. When doing the above optimization about
> parallel requests for several intersecting tiles, this was done for the
> Google Cloud Engine + Storage environment, and I found benchmarking this
> to be super tricky. Timings tend not to be repeatable (the variance of the
> timings is huge). For example, deciding which of HTTP 1.1 parallel
> connections (several TCP sockets) vs HTTP 2.0 multiplexing (a single TCP
> socket, but with multiplexing of requests and responses) is the better
> choice tended to be super difficult to assess (the difference in timing
> was not that huge), hence I only enabled HTTP 2 by default for the
> particular environment I tested.

... and also


> Parallelization to read non-contiguous sequences can help a bit since you
> can save the latency of serial requests to the server (with HTTP/2
> multiplexing in particular, at least in theory). Instead of doing, on the
> same socket: ask for range R1, wait for server initial processing, get
> data for range R1, ask for range R2, wait for server initial processing,
> get data for range R2. You can do: ask for range R1, ask for range R2
> without waiting for the server, wait for server initial processing, get
> data for R1 (or R2 depending on server optimizations), get data for R2
> (or R1). But sometimes establishing HTTP 1.1 parallel connections can
> give a small win (if you can reuse your HTTP 1.1 sockets, otherwise the
> TLS establishment time will be adverse).
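
To make the serial vs. overlapped distinction concrete, here is a rough
sketch of what I mean. It uses plain HTTP 1.1 connections and a thread pool
as a stand-in for true HTTP/2 multiplexing, and the URL and byte ranges are
made up:

    # Rough illustration only: overlap two range requests instead of issuing
    # them one after another.  The URL and byte ranges are hypothetical.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "https://example.com/some-cog.tif"
    RANGES = [(0, 16383), (1048576, 1064959)]  # e.g. header block + one tile

    def fetch_range(byte_range):
        start, end = byte_range
        r = requests.get(URL, headers={"Range": "bytes=%d-%d" % (start, end)})
        r.raise_for_status()
        return r.content

    # Serial: pay a full round-trip of latency for every range.
    serial = [fetch_range(br) for br in RANGES]

    # Overlapped: both requests are in flight at once, so the latency of the
    # second request is hidden behind the first.
    with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
        overlapped = list(pool.map(fetch_range, RANGES))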

Testing the effects of concurrency on performance is certainly non-trivial.
Things are so dependent on the context that it's not really possible to have
a generic set of guidelines. For GDAL tools like `gdal_translate` that
operate on a single file, any gains you might get from concurrent requests
will probably be offset by connection setup costs, particularly if HTTP 1.1
is used.
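
For what it's worth, comparing settings on a single file can be done with a
small timing harness along the lines of the sketch below. The URL is
hypothetical, the config options shown need recent enough GDAL/curl builds,
and, as Even says, single timings have huge variance, so each measurement
has to be repeated:

    # Minimal sketch of a timing harness for comparing HTTP settings on a
    # single remote file.  The source URL is hypothetical.
    import time

    from osgeo import gdal

    SRC = "/vsicurl/https://example.com/some-cog.tif"

    def timed_translate(http_version, multiplex):
        gdal.SetConfigOption("GDAL_HTTP_VERSION", http_version)  # "1.1" or "2"
        gdal.SetConfigOption("GDAL_HTTP_MULTIPLEX", multiplex)   # "YES" or "NO"
        gdal.SetConfigOption("GDAL_DISABLE_READDIR_ON_OPEN", "EMPTY_DIR")
        t0 = time.time()
        ds = gdal.Translate("/vsimem/out.tif", SRC)
        ds = None  # close/flush the output before the clock stops
        elapsed = time.time() - t0
        gdal.Unlink("/vsimem/out.tif")
        return elapsed

    for version, multiplex in [("1.1", "NO"), ("2", "YES")]:
        # Repeat each configuration and look at the spread, not a single number.
        timings = [timed_translate(version, multiplex) for _ in range(5)]
        print(version, multiplex, sorted(timings))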

There are so many parameters that affect the performance of network reads:

- Reading data in the same data center vs over the Internet
- Broadband vs 4G
- Transparent caching proxies
- Transparent virus scanning proxies that add huge latency
- Processing one file vs thousands (can we amortize the cost of connection
pool setup)
- Details of the remote server

My point is that there is no way to find a configuration that will fit well
in all cases.

I personally care about data access within the data center. It's an
important use-case, but I wouldn't suggest tuning GDAL's default
configuration to that domain. In this context you have fast local network
access with high bandwidth, backed by a highly scalable data-serving
back-end. Concurrency is an absolute must to get reasonable performance in
those circumstances: you are just not going to saturate the network pipe
with one request at a time.

I was able to go from just ~14 tiles per second (this is a pixel-gather
operation, reading 1 tile from each file) to ~370 using an ungodly number of
threads:

https://gist.github.com/Kirill888/ccebc72ba3d773191fb3dc0d225914ad

Full report was linked before in other threads:

https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md#benchmark

This was using HTTP 1.1; I haven't looked at whether HTTP/2 makes a
difference.
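
The shape of that benchmark is roughly the sketch below (the real code lives
in the repository linked above). The file list and window here are made up,
and each task opens its own dataset since GDAL handles shouldn't be shared
across threads:

    # Rough sketch of the "pixel gather" pattern: read one tile from each of
    # many files concurrently.  File list and window are hypothetical; the
    # real benchmark code is in the repository linked above.
    from concurrent.futures import ThreadPoolExecutor

    import rasterio
    from rasterio.windows import Window

    urls = ["s3://some-bucket/scene-%04d.tif" % i for i in range(400)]

    def read_one_tile(url):
        # One dataset handle per task: GDAL handles are not safe to share
        # across threads.
        with rasterio.open(url) as src:
            return src.read(1, window=Window(0, 0, 512, 512))

    with ThreadPoolExecutor(max_workers=64) as pool:
        tiles = list(pool.map(read_one_tile, urls))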

So concurrent reads (and opens) are important, but they don't have to happen
"within GDAL itself". Ideally I would like to have many concurrent file
accesses, but without all the threads. I know GDAL has some async support,
but I haven't had a chance to look into that properly yet; I understand it
depends on the plugin.

I'd like to be able to do the following (sketched after the list):

1. Issue multiple open requests from the same thread
2. Issue multiple read requests from the same thread (limiting to 1
concurrent request per file handle is ok, but supporting multiple would be
better)
3. Possibly tune size/lifetime of the connection pool
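
For concreteness, the kind of calling pattern I have in mind would look
something like the sketch below. This is purely hypothetical: neither
async_open nor an awaitable read() exist in GDAL or rasterio today; the
point is just that a single thread could keep many opens and reads in
flight at once:

    # Purely hypothetical calling pattern: async_open and the awaitable
    # read() below do not exist in GDAL or rasterio today.
    import asyncio

    async def gather_tiles(async_open, urls, window):
        # Open many files from one thread, with all the opens in flight at once.
        datasets = await asyncio.gather(*(async_open(url) for url in urls))
        # Then issue all the reads without waiting on each one serially.
        return await asyncio.gather(*(ds.read(1, window=window)
                                      for ds in datasets))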

If this already works in GDAL for GeoTIFFs, great; I will look into it in
more detail when I get time. If not, I'd welcome any comments on the amount
of work involved and on whether anyone is looking into async support for
GeoTIFF reading over the curl back-end.


Regards,

Kirill