[COG] Fwd: Cloud optimized GeoTIFF configuration specifics [SEC=UNOFFICIAL]

Fri Jun 15 17:53:50 PDT 2018

Thank you for the informative response, Even. The GDAL_INGESTED_BYTES_AT_OPEN
option is super useful.

Sorry for the confusion, but I meant something different by "multi-band":
rather than a single image with multiple samples per pixel, I meant
multiple independent images, each with its own IFD, stored together in
one TIFF file. With something like Landsat 8 bands it's not really
practical to create an image with 10+ samples per pixel. Each "band" might
have a different bit depth and possibly slightly different geo-registration.
Yet having them stored together in one file has certain advantages. So let
me rephrase the question:

Is it possible to have several independent full-resolution images in one
COG and also have overviews for each?

I see that overviews are marked with the TIFF tag:
  `Subfile Type: reduced-resolution image (1 = 0x1)`

But that assumes that there is only one main resolution image (possibly
with multiple samples per pixel).
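
The IFD chain and the reduced-resolution flag mentioned above can be inspected with a few lines of code. This is a minimal sketch, assuming a classic (non-BigTIFF) file already read into memory; list_ifds is a hypothetical helper name, and only SHORT/LONG values stored inline in the IFD entry are decoded.

```python
import struct

NEW_SUBFILE_TYPE = 254  # TIFF tag; bit 0 set means reduced-resolution image
IMAGE_WIDTH = 256
IMAGE_LENGTH = 257


def list_ifds(data):
    """Return [(ifd_offset, width, height, is_reduced), ...] for a classic TIFF."""
    order = {b"II": "<", b"MM": ">"}[bytes(data[:2])]
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")
    result = []
    while offset:
        (n_entries,) = struct.unpack_from(order + "H", data, offset)
        tags = {}
        for i in range(n_entries):
            pos = offset + 2 + 12 * i
            tag, typ, count = struct.unpack_from(order + "HHI", data, pos)
            # Only single SHORT (3) or LONG (4) values, which are stored
            # left-justified in the 4-byte value field, are decoded.
            if count == 1 and typ in (3, 4):
                fmt = "H" if typ == 3 else "I"
                (tags[tag],) = struct.unpack_from(order + fmt, data, pos + 8)
        is_reduced = bool(tags.get(NEW_SUBFILE_TYPE, 0) & 1)
        result.append((offset, tags.get(IMAGE_WIDTH),
                       tags.get(IMAGE_LENGTH), is_reduced))
        # The 4 bytes after the last entry hold the next IFD offset (0 = end).
        (offset,) = struct.unpack_from(order + "I", data, offset + 2 + 12 * n_entries)
    return result
```

Calling list_ifds(open(path, "rb").read()) on a small file gives one tuple per IFD, which makes it easy to see which images carry the reduced-resolution flag and which do not.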

Regards,

Kirill

On Fri, Jun 15, 2018 at 9:18 PM Even Rouault <even.rouault at spatialys.com>
wrote:

> Hi,
>
> replying here to a thread that started privately.
>
> > ---------- Forwarded message ---------
> > From: Kouzoubov Kirill <Kirill.Kouzoubov at ga.gov.au>
> > Date: Wed, Jun 13, 2018, 8:26 PM
> > Subject: RE: Cloud optimized GeoTIFF configuration specifics
> > [SEC=UNOFFICIAL]
> > To: Chris Holmes <cholmes at radiant.earth>, Seth Fitzsimmons
> > <seth.fitzsimmons at radiant.earth>
> >
> > Hi Chris,
> >
> >
> >
> > I fully agree that a single file with everything in it (bands, overviews,
> > stats) would be ideal from a usability and even a data management
> > perspective. Particularly if overviews can be made lossy (JPEG) and not
> > take too much space, so that they would only be meant for visualisation,
> > not computation. Purely from a format perspective I don’t see it as
> > inefficient or problematic. But when you start taking into account
> > existing software in its current form, a number of problems come to light:
> >
> >
> >
> > - Larger header means much slower open
> >
> >
> >
> > This is purely a GDAL implementation issue: fetching more bytes when
> > opening a file is not a problem as such; it’s that GDAL will make many
> > more requests (which is slow). And currently there is no way to hint GDAL
> > to fetch more bytes on open, even if you know the characteristics of your
> > collection. Storing detailed metadata about the file on a faster storage
> > medium addresses this problem, but then you need a bespoke reader.
> >
> >
>
> Just added the following, which was missing in the /vsicurl/ doc at
>
> http://gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsicurl
>
> """
> Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN
> configuration option can be set to impose the number of bytes read in one
> GET call at file opening (can help performance to read Cloud optimized
> geotiff with a large header).
> """
>
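
As a rough illustration of the option described above, the sketch below prepares an environment for a GDAL command-line run with GDAL_INGESTED_BYTES_AT_OPEN set. It relies on GDAL picking up configuration options from environment variables. The helper names (gdal_env, vsicurl_path), the 32768-byte value and the example URL are assumptions made for illustration, not recommendations from this thread.

```python
import os


def gdal_env(header_bytes, base_env=None):
    """Return a copy of the environment with GDAL_INGESTED_BYTES_AT_OPEN set.

    header_bytes is the number of bytes to fetch in the first GET at open.
    Hypothetical helper, not part of GDAL.
    """
    if header_bytes <= 0:
        raise ValueError("header_bytes must be positive")
    env = dict(os.environ if base_env is None else base_env)
    env["GDAL_INGESTED_BYTES_AT_OPEN"] = str(int(header_bytes))
    return env


def vsicurl_path(url):
    """Prefix an HTTP(S) URL so that GDAL opens it through /vsicurl/."""
    return "/vsicurl/" + url


# Example values only: a 32 KiB initial read and a placeholder URL.
env = gdal_env(32768)
cmd = ["gdalinfo", vsicurl_path("https://example.com/scene.tif")]
```

On a machine with GDAL >= 2.3 installed, the pair could be handed to subprocess.run(cmd, env=env). A sensible value depends on how large the headers in a given collection actually are.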
> Related to that, there's a point I have in mind. The current COG definition
> asks for the TileOffsets and TileByteCounts tag values to be located just
> after the IFD they refer to, and before the next IFD. I'm not totally
> convinced this is the appropriate organization for COGs that have very
> large dimensions and thus very large TileOffsets and TileByteCounts arrays.
> If for your processing you only need to access one tile, reading the whole
> TileOffsets and TileByteCounts arrays could hurt performance. GDAL, when
> built against its internal libtiff copy, can use an optimization to avoid
> reading the whole arrays, and read just the parts that are needed to get
> the offset and count of the blocks it needs to access.
>
> But this is mostly a concern for very large files. For example, if you take
> a 10,000 x 10,000 file with 512 x 512 blocks, the size of both arrays is
> 3200 bytes: 2 uint32 values for each of the ceil(10000 / 512)^2 = 400 tiles.
> You have to go to 190,000 x 190,000 pixels to reach the megabyte size.
>
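
The arithmetic in the paragraph above can be checked with a short sketch. It assumes uint32 entries (4 bytes each) for both TileOffsets and TileByteCounts and rounds the tile count up in each dimension; tile_index_bytes is a hypothetical helper name.

```python
def tile_index_bytes(width, height, tile=512, entry_bytes=4):
    """Combined size in bytes of the TileOffsets and TileByteCounts arrays."""
    tiles_across = -(-width // tile)   # ceiling division
    tiles_down = -(-height // tile)
    return 2 * entry_bytes * tiles_across * tiles_down


small = tile_index_bytes(10_000, 10_000)    # 20 x 20 tiles
large = tile_index_bytes(190_000, 190_000)  # 372 x 372 tiles
print(small, large)
```

With these assumptions the 10,000 x 10,000 case gives 3200 bytes, and the 190,000 x 190,000 case gives 1,107,072 bytes, just over one mebibyte.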
> In that case, the following organization would probably be better:
>
> Minimum header:
> - IFD1 (full resolution image)
> - tag values of IFD1 except TileOffsets and TileByteCounts (essentially
> GeoTIFF tag values)
> - IFD2 (overview)
> - tag values of IFD2 except TileOffsets and TileByteCounts
>
> Extended header:
> - TileOffsets and TileByteCounts of IFD1
> - TileOffsets and TileByteCounts of IFD2 (the order of this line and the
> previous one does not matter)
>
> Imagery:
> - Imagery of IFD2
> - Imagery of IFD1
>
> A reader would have to read at least the minimum header, whatever it will
> do with the file. It could then decide to read the extended header
> completely, or just part of it, depending on how much of the imagery it
> will process.
>
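
Reading "just part of" the extended header amounts to computing which bytes of a TileOffsets (or TileByteCounts) array cover the tiles of interest. The sketch below is one way to do that, assuming uint32 entries, a single plane with tiles numbered left to right and then top to bottom, and an array offset already known from the IFD. All function names and the numbers in the example are hypothetical.

```python
def tile_index(tile_row, tile_col, tiles_across):
    """Index of a tile within one plane (row-major tile numbering)."""
    return tile_row * tiles_across + tile_col


def entries_byte_range(array_offset, first_tile, last_tile, entry_bytes=4):
    """Inclusive byte range holding entries first_tile..last_tile of an array."""
    if last_tile < first_tile:
        raise ValueError("last_tile must not be smaller than first_tile")
    first = array_offset + first_tile * entry_bytes
    last = array_offset + (last_tile + 1) * entry_bytes - 1
    return first, last


def range_header(first, last):
    """Value of an HTTP Range header for an inclusive byte range."""
    return "bytes=%d-%d" % (first, last)


# Example: tiles in row 3, columns 5..7 of an image that is 20 tiles wide,
# with the TileOffsets array assumed to start at byte 1000 of the file.
first_tile = tile_index(3, 5, 20)
last_tile = tile_index(3, 7, 20)
header = range_header(*entries_byte_range(1000, first_tile, last_tile))
```

A window spanning several tile rows would need one such range per row, or a single larger range covering everything in between.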
> The only drawback of this organization is that it might require changes in
> libtiff to generate it, and that's not necessarily trivial to do...
> (libtiff would have no issue reading it, of course)
>
>
> >
> > - Bands in separate files are easier to read concurrently
> >
> >
> >
> > Again, purely an implementation problem of GDAL. To have concurrent reads
> > from the same file you have to open that file multiple times. Caching
> > should make subsequent opens cheaper, but that’s not guaranteed. Nothing
> > that cannot be fixed with a bespoke reader, but if you just want to use
> > GDAL, or more likely a convenient wrapper for it in some dynamic
> > language, this becomes a bit of a problem. And there is no storage saving
> > from putting several bands into one file: even if they share the same
> > geospatial data, it all gets repeated from what I understand. I'm not sure
> > if that’s a requirement of the TIFF standard or just a limitation of TIFF
> > writer libraries.
> >
>
> Starting with GDAL 2.3, the GDAL GTiff driver can issue parallel requests
> if a pixel request intersects several tiles. In HTTP 1.1, this will create
> parallel connections. If HTTP 2.0 is enabled (and both the libcurl version
> and the server support it), HTTP 2.0 multiplexing is used.
>
> Can be controlled with the GDAL_HTTP_VERSION configuration option:
> """
> GDAL_HTTP_VERSION=1.0/1.1/2/2TLS (GDAL >= 2.3). Specify HTTP version to use.
> Will default to 1.1 generally (except on some controlled environments,
> like Google Compute Engine VMs, where 2TLS will be the default).
> Support for HTTP/2 requires curl 7.33 or later, built against nghttp2.
> "2TLS" means that HTTP/2 will be attempted for HTTPS connections only,
> whereas "2" means that HTTP/2 will be attempted for HTTP or HTTPS.
> """
>
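
The rule in the quoted documentation can be restated as a small function. This sketch only encodes the description above (which values exist, and when HTTP/2 is attempted); it does not check whether curl was built against nghttp2 or whether the server supports HTTP/2. The name http2_attempted is hypothetical.

```python
VALID_HTTP_VERSIONS = ("1.0", "1.1", "2", "2TLS")


def http2_attempted(setting, url):
    """Return True if HTTP/2 would be attempted for url under this setting."""
    if setting not in VALID_HTTP_VERSIONS:
        raise ValueError("unknown GDAL_HTTP_VERSION value: %r" % (setting,))
    if setting == "2":
        return True  # attempted for both HTTP and HTTPS
    if setting == "2TLS":
        return url.lower().startswith("https://")  # HTTPS connections only
    return False  # "1.0" and "1.1" never attempt HTTP/2
```

For example, http2_attempted("2TLS", "http://example.com/a.tif") is False, while the same call with an https:// URL is True.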
> There's currently no optimization to issue parallel requests if the bands
> are separate (PLANARCONFIG=SEPARATE in TIFF parlance) instead of using
> pixel interleaving (PLANARCONFIG=CONTIG), but it could probably be added.
> And that doesn't require the bands to be in separate files.
>
> That said, I'm not completely convinced that this would result in
> (significant) performance wins. The above optimization about parallel
> requests for several intersecting tiles was done for a Google Cloud
> Engine+Storage environment, and I found benchmarking it to be super tricky.
> Timings tend not to be repeatable (the variance of the timings is huge). For
> example, deciding which of HTTP 1.1 parallel connections (several TCP
> sockets) vs HTTP 2.0 multiplexing (single TCP socket, but with multiplexing
> of requests and responses) is the best choice tended to be super difficult
> to assess (the difference in timing was not that huge), hence I only
> enabled HTTP 2 by default for the particular environment I tested.
>
> In fact the question is more general than parallelizing requests to get
> different bands. Imagine that the data is not compressed, that you have N
> bands, and that the number of bytes for one block of a band is B. Consider
> the case of a single tile we want to read. If you have PLANARCONFIG=CONTIG,
> you have a single block of size N*B. If you have PLANARCONFIG=SEPARATE, you
> have N blocks of size B. So if you decide to do parallelized reads in the
> PLANARCONFIG=SEPARATE case, why not also artificially split your single
> request in the PLANARCONFIG=CONTIG case and do parallelized reads there as
> well? (The advantage of PLANARCONFIG=CONTIG is a reduced amount of metadata.)
> In an ideal world, parallelizing the read of a contiguous sequence of
> ranges shouldn't help at all: your single connection should deliver at
> maximum speed. But perhaps splitting would in practice help performance a
> bit. There is probably a minimum number of bytes below which splitting the
> request into 2 GETs is going to be slower than doing a single big request.
> There is probably also a maximum number of parallel channels beyond which
> performance will decrease.
>
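
The idea of artificially splitting one contiguous request can be sketched as follows. This is an illustration under assumptions, not what GDAL does: split_range is a hypothetical helper, and min_bytes stands in for the (unknown) threshold below which splitting stops paying off.

```python
def split_range(start, length, parts, min_bytes=1):
    """Split [start, start + length) into at most `parts` contiguous ranges.

    The number of ranges is reduced so that none is shorter than min_bytes
    (unless the whole range is). Returns inclusive (first, last) byte pairs,
    the form used in HTTP Range headers.
    """
    if length <= 0 or parts <= 0 or min_bytes <= 0:
        raise ValueError("length, parts and min_bytes must be positive")
    parts = max(1, min(parts, length // min_bytes))
    base, extra = divmod(length, parts)
    ranges = []
    pos = start
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((pos, pos + size - 1))
        pos += size
    return ranges


# N = 4 bands, B = 512 * 512 bytes per band block (uncompressed 8-bit data).
N, B = 4, 512 * 512
contig_pieces = split_range(0, N * B, N)
```

With these example numbers the single CONTIG block of N*B bytes splits into N pieces of B bytes each, the same request sizes the SEPARATE layout would produce for one tile.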
> Parallelization to read non-contiguous sequences can help a bit since you
> can save the latency of serial requests to the server (with HTTP/2
> multiplexing in particular, at least in theory). Instead of doing on the
> same socket: ask range R1, wait for server initial processing, get data for
> range R1, ask range R2, wait for server initial processing, get data for
> range R2, you can do: ask range R1, ask range R2 without waiting for the
> server, wait for server initial processing, get data for R1 (or R2,
> depending on server optimizations), get data for R2 (or R1). But sometimes
> establishing HTTP 1.1 parallel connections can give a small win (if you can
> reuse your HTTP 1.1 sockets; otherwise the TLS establishment time will
> work against you).
>
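
The two request sequences described above can be compared with a toy timing model. It assumes a fixed per-request server wait and transfers that share one connection so that their times simply add up. Real timings, as noted earlier in this message, vary a lot, so this is only a way to see where the saving would come from. Function names and numbers are hypothetical.

```python
def serial_time(latency, transfers):
    """Each request waits for the server before the next one is sent."""
    return sum(latency + t for t in transfers)


def multiplexed_time(latency, transfers):
    """All requests are sent up front, so the server wait is paid once."""
    if not transfers:
        return 0.0
    return latency + sum(transfers)


# Two ranges R1 and R2, 50 ms server wait, 100 ms transfer each (made-up values).
serial = serial_time(0.05, [0.1, 0.1])
multiplexed = multiplexed_time(0.05, [0.1, 0.1])
```

In this model the saving is one server wait per additional request, which matters most when the transfers themselves are short.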
> I don't claim to be an expert on maximizing the throughput of HTTP
> connections, so take the above as the result of my modest experiments.
>
>
> >
> > Also, how do multiple bands + overviews work together? Can you point me
> > to a resource that explains embedded overviews? Just looking at generated
> > files with embedded overviews, they look like multi-band TIFFs, so I’d
> > like to understand that aspect properly.
> >
>
> Using tiffdump / tiffinfo + some reading of the TIFF spec will help.
> Overviews (different IFDs in TIFF parlance) and bands (Samples in TIFF
> parlance) are completely different concepts.
>
>
> Even
>
> --
> Spatialys - Geospatial professional services
> http://www.spatialys.com
> _______________________________________________
> COG mailing list
> COG at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/cog
>