<div dir="ltr">Thank you for the informative response, Even. The GDAL_INGESTED_BYTES_AT_OPEN option is super useful. <div><br></div><div>Sorry for the confusion, but I meant something different by "multi-band": rather than a single image with multiple samples per pixel, I meant multiple independent images, each with its own IFD, stored together in one TIFF file. With something like Landsat 8 bands it's not really practical to create an image with 10+ samples per pixel. Each "band" might have a different bit depth and possibly slightly different geo-registration. Yet having them stored together in one file has certain advantages. So let me re-phrase the question:</div><div><br></div><div>Is it possible to have several independent full-resolution images in one COG and also have overviews for each?</div><div><br></div><div>I see that overviews are marked with the TIFF tag: </div><div>  `Subfile Type: reduced-resolution image (1 = 0x1)`<br></div><div><br></div><div>But that assumes that there is only one main resolution image (possibly with multiple samples per pixel).</div><div><br></div><div>Regards,</div><div><br></div><div>Kirill</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Jun 15, 2018 at 9:18 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com">even.rouault@spatialys.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

replying here to a thread that started privately.<br>

<br>

> ---------- Forwarded message ---------<br>

> From: Kouzoubov Kirill <<a href="mailto:Kirill.Kouzoubov@ga.gov.au" target="_blank">Kirill.Kouzoubov@ga.gov.au</a>><br>

> Date: Wed, Jun 13, 2018, 8:26 PM<br>

> Subject: RE: Cloud optimized GeoTIFF configuration specifics<br>

> [SEC=UNOFFICIAL]<br>

> To: Chris Holmes <cholmes@radiant.earth>, Seth Fitzsimmons<br>

> <seth.fitzsimmons@radiant.earth><br>

> <br>

> Hi Chris,<br>

> <br>

> <br>

> <br>

> I fully agree that a single file with everything in it (bands, overviews,<br>

> stats) would be ideal from a usability and even a data management perspective,<br>

> particularly if overviews can be made lossy (JPEG) and not take too much<br>

> space, so that they are only meant for visualisation, not computation. Purely<br>

> from a format perspective I don’t see it as inefficient or problematic. But<br>

> when you start taking into account existing software in its current form,<br>

> a number of problems come to light:<br>

> <br>

> <br>

> <br>

> ·         Larger header, means much slower open<br>

> <br>

> <br>

> <br>

> This is purely a GDAL implementation issue: fetching more bytes when opening<br>

> a file is not a problem as such, it’s that GDAL will make many more<br>

> requests (which is slow). And currently there is no way to hint GDAL to<br>

> fetch more bytes on open even if you know the characteristics of your<br>

> collection. Storing detailed metadata about the file on a faster storage<br>

> medium addresses this problem, but then you need a bespoke reader.<br>

> <br>

> <br>

<br>

I just added the following, which was missing from the /vsicurl/ doc at<br>

<a href="http://gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsicurl" rel="noreferrer" target="_blank">http://gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsicurl</a><br>

<br>

"""<br>

Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN<br>

configuration option can be set to impose the number of bytes read in one<br>

GET call at file opening (can help performance to read Cloud optimized geotiff<br>

with a large header).<br>

"""<br>

<br>

Related to that, there's a point I have in mind. The current COG definition asks<br>

for the TileOffsets and TileByteCounts tag values to be located just after the<br>

IFD they refer to, and before the next IFD. I'm not totally convinced this is<br>

the appropriate organization for COGs that have very large dimensions and<br>

thus very large TileOffsets and TileByteCounts arrays. If for your<br>

processing you only need to access one tile, reading the whole TileOffsets and<br>

TileByteCounts arrays could hurt performance. GDAL, when built against<br>

its internal libtiff copy, can use an optimization to avoid reading the whole<br>

arrays, and read just the parts of them that are needed to get the offset and count<br>

of the blocks it needs to access.<br>

<br>

But this is mostly a concern for very large files. For example, if you take a<br>

10,000 x 10,000 file with 512 x 512 blocks, the size of both arrays is 3200<br>

bytes: 2 uint32 values for each of the ceil(10000 / 512)^2 = 400 tiles.<br>

You have to go to about 190,000 x 190,000 pixels to reach the megabyte size.<br>

<br>
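As a rough check of the numbers above, here is a minimal Python sketch. It assumes classic TIFF with 4-byte (uint32) entries and square blocks; the function name tile_index_bytes is only illustrative and is not part of GDAL or libtiff.<br>
<pre>
```python
import math


def tile_index_bytes(width, height, block_size=512, bytes_per_entry=4):
    """Combined size in bytes of the TileOffsets and TileByteCounts arrays.

    Assumes one entry per tile in each array (single band, or pixel
    interleaving) and square blocks of block_size x block_size pixels.
    """
    tiles_across = math.ceil(width / block_size)
    tiles_down = math.ceil(height / block_size)
    # Two arrays (offsets and byte counts), one entry per tile in each.
    return 2 * bytes_per_entry * tiles_across * tiles_down


print(tile_index_bytes(10_000, 10_000))    # 3200
print(tile_index_bytes(190_000, 190_000))  # 1107072
```
</pre>
If the entries were 8 bytes wide (bytes_per_entry=8), the same image sizes would simply give twice these figures.<br>
<br>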

In that case, the following organization would probably be better:<br>

<br>

Minimum header:<br>

- IFD1 (full resolution image)<br>

- tag values of IFD1 except TileOffsets and TileByteCounts (essentially<br>

GeoTIFF tag values)<br>

- IFD2 (overview)<br>

- tag values of IFD2 except TileOffsets and TileByteCounts<br>

<br>

Extended header:<br>

- TileOffsets and TileByteCounts of IFD1<br>

- TileOffsets and TileByteCounts of IFD2 (this line and the<br>

previous one could be in either order)<br>

<br>

Imagery:<br>

- Imagery of IFD2<br>

- Imagery of IFD1<br>

<br>

A reader would have to read at least the minimum header, whatever it will do<br>

with the file. It could then decide to read the extended header completely, or<br>

just part of it, depending on how much of the imagery it will process.<br>

<br>
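To illustrate what reading "just part of" the extended header could look like, here is a small Python sketch. It is only an assumption-laden illustration, not how GDAL or libtiff implement it: it assumes 4-byte entries, square blocks, a single band, and tiles stored in the usual TIFF order (left to right, top to bottom). The function name tile_entry_range and the array offset of 1000 are hypothetical.<br>
<pre>
```python
import math


def tile_entry_range(array_offset, width, block_size, tile_col, tile_row,
                     bytes_per_entry=4):
    """Byte range (start, end exclusive) of one tile's entry inside a
    TileOffsets or TileByteCounts array that starts at array_offset."""
    tiles_across = math.ceil(width / block_size)
    # TIFF numbers tiles left to right, then top to bottom.
    tile_index = tile_row * tiles_across + tile_col
    start = array_offset + tile_index * bytes_per_entry
    return start, start + bytes_per_entry


# Hypothetical file: TileOffsets array starts at byte 1000,
# image is 10,000 pixels wide with 512 x 512 blocks.
print(tile_entry_range(1000, 10_000, 512, tile_col=3, tile_row=2))  # (1172, 1176)
```
</pre>
A reader that needs only this tile could fetch those 4 bytes (and the matching 4 bytes of TileByteCounts) instead of both whole arrays.<br>
<br>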

The only drawback of this organization is that it might require changes in<br>

libtiff to generate it, and those are not necessarily trivial to make...<br>

(libtiff would have no issue reading such a file, of course)<br>

<br>

<br>

> <br>

> ·         Bands in separate files are easier to read concurrently<br>

> <br>

> <br>

> <br>

> Again purely an implementation problem of GDAL. To have concurrent reads<br>

> from the same file you have to open that file multiple times. Cache should<br>

> make subsequent opens cheaper, but it’s not guaranteed. Nothing that cannot<br>

> be fixed with a bespoke reader, but if you just want to use GDAL, or more<br>

> likely a convenient wrapper for it in some dynamic language this becomes a<br>

> bit of a problem. And there is no storage saving from putting several bands<br>

> into one file: even if they share the same geospatial data, it all gets<br>

> repeated from what I understand. I’m not sure if that’s a requirement of the TIFF<br>

> standard or just a limitation of TIFF writer libraries.<br>

> <br>

<br>

Starting with GDAL 2.3, the GDAL GTiff driver can issue parallel requests if a<br>

pixel request intersects several tiles. In HTTP 1.1, this will create parallel<br>

connections. If HTTP 2.0 is enabled (and both the libcurl version and the server<br>

support it), HTTP 2.0 multiplexing is used instead.<br>

<br>

This can be controlled with the GDAL_HTTP_VERSION configuration option:<br>

"""<br>

GDAL_HTTP_VERSION=1.0/1.1/2/2TLS (GDAL >= 2.3). Specify HTTP version to use.<br>

 *     Will default to 1.1 generally (except on some controlled environments,<br>

 *     like Google Compute Engine VMs, where 2TLS will be the default).<br>

 *     Support for HTTP/2 requires curl 7.33 or later, built against nghttp2.<br>

 *     "2TLS" means that HTTP/2 will be attempted for HTTPS connections only,<br>

 *     whereas "2" means that HTTP/2 will be attempted for HTTP or HTTPS.<br>

"""<br>

<br>
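For what it's worth, both options mentioned in this message (GDAL_INGESTED_BYTES_AT_OPEN and GDAL_HTTP_VERSION) are GDAL configuration options, so they can be set as environment variables before GDAL is used. A minimal Python sketch follows; the helper name gdal_cloud_env is made up for illustration, and 32768 is just an example value, not a recommendation.<br>
<pre>
```python
import os

# Values listed in the GDAL_HTTP_VERSION documentation quoted above.
ALLOWED_HTTP_VERSIONS = ("1.0", "1.1", "2", "2TLS")


def gdal_cloud_env(ingested_bytes=32768, http_version="2TLS"):
    """Build environment variables for the two configuration options.

    ingested_bytes is an arbitrary example value; tune it to the header
    size of your own files.
    """
    if http_version not in ALLOWED_HTTP_VERSIONS:
        raise ValueError("unsupported GDAL_HTTP_VERSION: %r" % (http_version,))
    return {
        "GDAL_INGESTED_BYTES_AT_OPEN": str(ingested_bytes),
        "GDAL_HTTP_VERSION": http_version,
    }


# Apply to the current process before any GDAL call is made.
os.environ.update(gdal_cloud_env())
print(os.environ["GDAL_INGESTED_BYTES_AT_OPEN"], os.environ["GDAL_HTTP_VERSION"])
```
</pre>
<br>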

There's currently no optimization to issue parallel requests if the bands are<br>

separate (PLANARCONFIG=SEPARATE in TIFF parlance) instead of using pixel<br>

interleaving (PLANARCONFIG=CONTIG), but one could probably be added. And that<br>

doesn't require the bands to be in separate files.<br>

<br>

That said, I'm not completely convinced that this would result in (significant)<br>

performance wins. The above optimization about parallel requests<br>

for several intersecting tiles was done for a Google Compute Engine + Storage<br>

environment, and I found benchmarking it to be super tricky. Timings tend<br>

not to be repeatable (their variance is huge). For example, deciding<br>

which of HTTP 1.1 parallel connections (several TCP sockets) vs HTTP 2.0<br>

multiplexing (single TCP socket, but with multiplexing of requests and<br>

responses) is the best choice was super difficult to assess (the<br>

difference in timing was not that huge), hence I only enabled HTTP 2 by<br>

default for the particular environment I tested.<br>

<br>

In fact the question is more general than parallelizing requests to get<br>

different bands. Imagine that the data is not compressed, that you have N<br>

bands, and that the number of bytes for one block of a band is B. And consider the<br>

case of a single tile we want to read. If you have PLANARCONFIG=CONTIG, you<br>

have a single block of size N*B. If you have PLANARCONFIG=SEPARATE, you have N<br>

blocks of size B. So if you decide to do parallelized reads in the<br>

PLANARCONFIG=SEPARATE case, why not also artificially split your single<br>

request in the PLANARCONFIG=CONTIG case and do parallelized reads there too?<br>

(The advantage of PLANARCONFIG=CONTIG is a reduced amount of metadata.)<br>

In an ideal world, parallelizing the read of a contiguous sequence of ranges<br>

shouldn't help at all: your single connection should deliver at maximum speed.<br>

But perhaps splitting would in practice help performance a bit.<br>

There is probably a minimum number of bytes below which splitting<br>

the request into 2 GETs is going to be slower than doing a single big request.<br>

There is probably also a maximum number of parallel channels beyond which<br>

performance will decrease.<br>

<br>
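The two thresholds just mentioned (a minimum chunk size and a maximum number of channels) can be sketched in a few lines of Python. This is only an illustration of the idea, not what GDAL does; the names split_range and range_header, and the numbers in the example, are hypothetical.<br>
<pre>
```python
def split_range(start, length, max_parts, min_chunk):
    """Split [start, start + length) into contiguous (start, end) pieces.

    At most max_parts pieces are produced, and the request is not split
    into more pieces than would keep each one at least min_chunk bytes
    (min_chunk must be positive).
    """
    parts = max(1, min(max_parts, length // min_chunk))
    base, extra = divmod(length, parts)
    ranges = []
    pos = start
    for i in range(parts):
        # Spread the remainder over the first 'extra' pieces.
        size = base + (1 if extra > i else 0)
        ranges.append((pos, pos + size))
        pos += size
    return ranges


def range_header(start, end):
    """HTTP Range header value; HTTP byte ranges have an inclusive end."""
    return "bytes=%d-%d" % (start, end - 1)


print(split_range(0, 1000, max_parts=4, min_chunk=100))
# [(0, 250), (250, 500), (500, 750), (750, 1000)]
print(split_range(0, 150, max_parts=4, min_chunk=100))
# [(0, 150)]
print(range_header(0, 250))
# bytes=0-249
```
</pre>
Suitable values for min_chunk and max_parts would have to be found by benchmarking, which, as said above, is tricky.<br>
<br>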

Parallelization to read non-contiguous sequences can help a bit since you can<br>

save the latency of serial requests to the server (with HTTP/2 multiplexing in<br>

particular, at least in theory). Instead of doing this on the same socket: ask for<br>

range R1, wait for server initial processing, get data for range R1, ask for range<br>

R2, wait for server initial processing, get data for range R2, you can do:<br>

ask for range R1, ask for range R2 without waiting for the server, wait for server initial<br>

processing, get data for R1 (or R2, depending on server optimizations), get data<br>

for R2 (or R1). But sometimes establishing HTTP 1.1 parallel connections can<br>

give a small win (if you can reuse your HTTP 1.1 sockets; otherwise the TLS<br>

establishment time will work against you).<br>

<br>
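A deliberately naive timing model of the two sequences above, in Python. It assumes a fixed "server initial processing" latency per request, that the waits fully overlap when requests are sent back to back, and that the data transfers themselves still share one pipe; real servers and networks will deviate from this. The function names and the millisecond values are made up for illustration.<br>
<pre>
```python
def serial_time(latency, transfer_times):
    """Ask, wait, receive, then ask for the next range (same socket)."""
    return sum(latency + t for t in transfer_times)


def multiplexed_time(latency, transfer_times):
    """Ask for all ranges up front: the waits overlap, and the transfers
    still share the same connection so their times add up."""
    return latency + sum(transfer_times)


# Hypothetical figures in milliseconds: 50 ms latency, two 20 ms transfers.
print(serial_time(50, [20, 20]))       # 140
print(multiplexed_time(50, [20, 20]))  # 90
```
</pre>
In this model the saving is (number of requests - 1) times the latency, which is why the gain matters mostly for many small non-contiguous ranges.<br>
<br>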

I don't claim to be an expert on the maximum throughput of HTTP connections, so<br>

take the above as the result of my modest experiments.<br>

<br>

<br>

> <br>

> Also, how do multiple bands + overviews work together? Can you point me to<br>

> a resource that explains embedded overviews? Just looking at generated<br>

> files with embedded overviews, they look like multi-band TIFFs, so I’d like<br>

> to understand that aspect properly.<br>

> <br>

<br>

Using tiffdump / tiffinfo + some reading of the TIFF spec will help. Overviews<br>

(different IFDs in TIFF parlance) and bands (samples in TIFF parlance) are<br>

completely different concepts.<br>

<br>

<br>

Even<br>

<br>

-- <br>

Spatialys - Geospatial professional services<br>

<a href="http://www.spatialys.com" rel="noreferrer" target="_blank">http://www.spatialys.com</a><br>

_______________________________________________<br>

COG mailing list<br>

<a href="mailto:COG@lists.osgeo.org" target="_blank">COG@lists.osgeo.org</a><br>

<a href="https://lists.osgeo.org/mailman/listinfo/cog" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/cog</a><br>

</blockquote></div>