[COG] Fwd: Cloud optimized GeoTIFF configuration specifics [SEC=UNOFFICIAL]
Even Rouault
even.rouault at spatialys.com
Fri Jun 15 04:18:14 PDT 2018
Hi,
replying here to a thread that started privately.
> ---------- Forwarded message ---------
> From: Kouzoubov Kirill <Kirill.Kouzoubov at ga.gov.au>
> Date: Wed, Jun 13, 2018, 8:26 PM
> Subject: RE: Cloud optimized GeoTIFF configuration specifics
> [SEC=UNOFFICIAL]
> To: Chris Holmes <cholmes at radiant.earth>, Seth Fitzsimmons
> <seth.fitzsimmons at radiant.earth>
>
> Hi Chris,
>
>
>
> I fully agree that a single file with everything in it (bands, overviews,
> stats) would be ideal from a usability and even a data management perspective.
> Particularly if overviews can be made lossy (JPEG) and not take too much
> space, so they would only be meant for visualisation, not computation. Purely
> from a format perspective I don't see it as inefficient or problematic. But
> when you start taking into account existing software in its current form, a
> number of problems come to light
>
>
>
> · A larger header means a much slower open
>
>
>
> This is purely a GDAL implementation issue: fetching more bytes when opening
> a file is not a problem as such; it's that GDAL will make many more
> requests (which is slow). And currently there is no way to hint GDAL to
> fetch more bytes on open, even if you know the characteristics of your
> collection. Storing detailed metadata about the file on a faster storage
> medium addresses this problem, but then you need a bespoke reader.
>
>
I just added the following, which was missing, to the /vsicurl/ doc at
http://gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsicurl
"""
Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN
configuration option can be set to impose the number of bytes read in one
GET call at file opening (can help performance to read Cloud optimized geotiff
with a large header).
"""
Related to that, there's a point I have in mind. The current COG definition asks
for the TileOffsets and TileByteCounts tag values to be located just after the
IFD they refer to, and before the next IFD. I'm not totally convinced this is
the appropriate organization for COGs that have very large dimensions and
thus very large TileOffsets and TileByteCounts arrays. If for your
processing you only need to access one tile, reading the whole TileOffsets and
TileByteCounts arrays could hurt performance. GDAL, when built against
its internal libtiff copy, can use an optimization to avoid reading the whole
arrays, fetching just the parts of them needed to get the offset and count
of the blocks it must access.
But this is mostly a concern for very large files. For example if you take a
10,000 x 10,000 file with 512 x 512 tiles, the combined size of both arrays is
3,200 bytes: 2 uint32 values for each of the ceil(10000 / 512)^2 = 400 tiles.
You have to go to about 190,000 x 190,000 pixels before the arrays reach the
megabyte size.
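The arithmetic above can be sketched quickly (the helper name is mine, just for illustration; classic TIFF stores one uint32 offset and one uint32 byte count per tile):

```python
import math

def tile_array_bytes(width, height, block, entry_size=4):
    """Size in bytes of the TileOffsets (or TileByteCounts) array
    for a tiled TIFF, assuming classic-TIFF uint32 entries."""
    tiles_x = math.ceil(width / block)
    tiles_y = math.ceil(height / block)
    return tiles_x * tiles_y * entry_size

# Both arrays together for a 10,000 x 10,000 image with 512 x 512 tiles:
both = 2 * tile_array_bytes(10_000, 10_000, 512)
print(both)  # 3200

# A 190,000 x 190,000 image pushes the pair past the megabyte mark:
print(2 * tile_array_bytes(190_000, 190_000, 512))  # 1107072
```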
In that case, the following organization would probably be better:
Minimum header:
- IFD1 (full resolution image)
- tag values of IFD1 except TileOffsets and TileByteCounts (essentially
GeoTIFF tag values)
- IFD2 (overview)
- tag values of IFD2 except TileOffsets and TileByteCounts
Extended header:
- Tile Offsets and TileByteCounts of IFD1
- Tile Offsets and TileByteCounts of IFD2 (this item and the previous one
could be in either order)
Imagery:
- Imagery of IFD2
- Imagery of IFD1
A reader would have to read at least the minimum header, whatever it will do
with the file. It could then decide to read the extended header completely, or
just part of it, depending on how much of the imagery it will process.
The only drawback of this organization is that it might require changes in
libtiff to generate it, and that's not necessarily trivial to do...
(libtiff would have no issue reading such a file, of course)
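As a sketch of the kind of partial read such an organization enables, here is some illustrative Python (the function name and the example offsets are hypothetical, not a GDAL or libtiff API):

```python
def tile_entry_range(tile_x, tile_y, tiles_across, array_offset, entry_size=4):
    """Byte range [start, end) covering just the TileOffsets (or
    TileByteCounts) entry for one tile, so a reader could issue a small
    ranged GET instead of fetching the whole array.  array_offset is the
    file position where the array's values start."""
    index = tile_y * tiles_across + tile_x  # tiles are stored row-major
    start = array_offset + index * entry_size
    return start, start + entry_size

# Tile (3, 2) in an image 20 tiles across, with the array at a made-up
# file offset of 4096:
print(tile_entry_range(3, 2, 20, 4096))  # (4268, 4272)
```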
>
> · Bands in separate files are easier to read concurrently
>
>
>
> Again, purely an implementation problem of GDAL. To have concurrent reads
> from the same file, you have to open that file multiple times. Caching should
> make subsequent opens cheaper, but it's not guaranteed. Nothing that cannot
> be fixed with a bespoke reader, but if you just want to use GDAL, or more
> likely a convenient wrapper for it in some dynamic language, this becomes a
> bit of a problem. And there is no storage saving from putting several bands
> into one file; even if they share the same geospatial metadata, it all gets
> repeated, from what I understand. I'm not sure if that's a requirement of the
> TIFF standard or just a limitation of TIFF writer libraries.
>
Starting with GDAL 2.3, the GDAL GTiff driver can issue parallel requests if a
pixel request intersects several tiles. With HTTP 1.1, this will create parallel
connections. If HTTP 2.0 is enabled (and the libcurl version and the server
support it), HTTP 2.0 multiplexing is used.
This can be controlled with the GDAL_HTTP_VERSION configuration option:
""""
GDAL_HTTP_VERSION=1.0/1.1/2/2TLS (GDAL >= 2.3). Specify HTTP version to use.
* Will default to 1.1 generally (except on some controlled environments,
* like Google Compute Engine VMs, where 2TLS will be the default).
* Support for HTTP/2 requires curl 7.33 or later, built against nghttp2.
* "2TLS" means that HTTP/2 will be attempted for HTTPS connections only.
Whereas
* "2" means that HTTP/2 will be attempted for HTTP or HTTPS.
"""
There's currently no optimization to issue parallel requests when the bands are
separate (PLANARCONFIG=SEPARATE in TIFF parlance) instead of pixel
interleaved (PLANARCONFIG=CONTIG), but one could probably be added. And that
doesn't require the bands to be in separate files.
That said, I'm not completely convinced that this would result in (significant)
performance wins. The above optimization about parallel requests for several
intersecting tiles was done for the Google Compute Engine + Storage
environment, and I found benchmarking it to be super tricky. Timings tend
not to be repeatable (the variance of the timings is huge). For example,
deciding whether HTTP 1.1 parallel connections (several TCP sockets) or HTTP
2.0 multiplexing (a single TCP socket, but with multiplexing of requests and
responses) is the better choice tended to be super difficult to assess (the
difference in timing was not that big), hence I only enabled HTTP 2 by
default for the particular environment I tested.
In fact the question is more general than parallelizing requests to get
different bands. Imagine that the data is not compressed, you have N
bands, and the number of bytes for one block of a band is B. Consider the
case of a single tile we want to read. With PLANARCONFIG=CONTIG, you
have a single block of size N*B. With PLANARCONFIG=SEPARATE, you have N
blocks of size B. So if you decide to do parallelized reads in the
PLANARCONFIG=SEPARATE case, why not also artificially split your single
request in the PLANARCONFIG=CONTIG case and do parallelized reads there too?
(The advantage of PLANARCONFIG=CONTIG is the reduced amount of metadata.)
In an ideal world, parallelizing reads of a contiguous sequence of ranges
shouldn't help at all: your single connection should deliver at maximum speed.
But perhaps splitting would, in practice, help performance a bit.
There is probably a minimum number of bytes below which splitting
the request into 2 GETs is going to be slower than doing a single big request.
There is probably also a maximum number of parallel channels beyond which
performance will decrease.
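A rough sketch of such a splitting policy, with both of those knobs as parameters (the function and its thresholds are illustrative, not measured values or an existing API):

```python
def split_range(start, size, max_parts, min_part_size):
    """Split a byte range into at most max_parts sub-ranges, never making
    a part smaller than min_part_size.  Returns (offset, length) pairs."""
    parts = min(max_parts, max(1, size // min_part_size))
    base, extra = divmod(size, parts)
    ranges, offset = [], start
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((offset, length))
        offset += length
    return ranges

print(split_range(0, 10, 4, 100))    # too small to split: [(0, 10)]
print(split_range(0, 1000, 4, 100))  # 4 parts of 250 bytes each
```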
Parallelization to read non-contiguous sequences can help a bit, since you can
save the latency of serial requests to the server (with HTTP/2 multiplexing in
particular, at least in theory). Instead of doing, on the same socket: ask for
range R1, wait for server initial processing, get data for range R1, ask for
range R2, wait for server initial processing, get data for range R2; you can
do: ask for range R1, ask for range R2 without waiting for the server, wait for
server initial processing, get data for R1 (or R2, depending on server
optimizations), get data for R2 (or R1). But sometimes establishing HTTP 1.1
parallel connections can give a small win (if you can reuse your HTTP 1.1
sockets; otherwise the TLS establishment time will be adverse).
I don't claim to be an expert on maximizing the throughput of HTTP connections,
so take the above as the result of my modest experiments.
>
> Also, how do multiple bands + overviews work together? Can you point me to
> a resource that explains embedded overviews? Just looking at generated
> files with embedded overviews, they look like multi-band TIFFs, so I'd like
> to understand that aspect properly.
>
Using tiffdump / tiffinfo plus some reading of the TIFF spec will help.
Overviews (different IFDs in TIFF parlance) and bands (Samples in TIFF
parlance) are completely different concepts.
Even
--
Spatialys - Geospatial professional services
http://www.spatialys.com