[gdal-dev] kerchunk
Michael Sumner
mdsumner at gmail.com
Wed Jul 24 14:37:46 PDT 2024
Amazing guidance, thanks so much Even.
You've answered a lot of hanging questions I had that I didn't know where
to ask. I'll be exploring all of this.
Cheers, Mike
On Wed, Jul 24, 2024 at 7:59 PM Even Rouault <even.rouault at spatialys.com>
wrote:
> Michael,
>
> I don't think this would be a frmts/raw driver, but rather a /vsikerchunk
> virtual file system that you would combine with the Zarr driver.
>
> So you would open a dataset with "/vsikerchunk/{path/to.json}", and the
> Zarr driver would then issue a ReadDir() operation on
> /vsikerchunk/{path/to.json}, which would return the top-level keys of the
> JSON. Then the Zarr driver would issue an Open() operation on
> "/vsikerchunk/{path/to.json}/.zmetadata", and so on. The Zarr driver could
> be essentially unmodified. This is, I believe, essentially how the Python
> implementation works when combining the Kerchunk-specific part with the
> Python Zarr module (except it passes file system objects and not strings).
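
To make the ReadDir()/Open() flow described above concrete, here is a minimal Python sketch of a reference-backed lookup. The class name `KerchunkRefs` and its methods `read_dir` and `read` are hypothetical stand-ins, not GDAL or kerchunk API. It assumes references are either inline strings (optionally prefixed with "base64:") or [path, offset, length] lists, and it only handles local files.

```python
import base64
import os
import tempfile


class KerchunkRefs:
    """Hypothetical in-memory view of a Kerchunk reference set (not a GDAL API)."""

    def __init__(self, refs):
        # Accept a flat key -> reference mapping, or a document with a "refs" member.
        self.refs = refs.get("refs", refs)

    def read_dir(self, prefix=""):
        """List the immediate children under prefix, in the spirit of ReadDir()."""
        prefix = prefix.strip("/")
        start = prefix + "/" if prefix else ""
        names = set()
        for key in self.refs:
            if key.startswith(start):
                names.add(key[len(start):].split("/", 1)[0])
        return sorted(names)

    def read(self, key):
        """Return the bytes behind key, in the spirit of Open() plus a full read."""
        ref = self.refs[key]
        if isinstance(ref, str):
            # Inline content, e.g. Zarr metadata stored directly in the JSON.
            if ref.startswith("base64:"):
                return base64.b64decode(ref[len("base64:"):])
            return ref.encode("utf-8")
        # Assumed form: [path, offset, length] pointing into another file.
        path, offset, length = ref
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    blob = os.path.join(tmp, "data.bin")
    with open(blob, "wb") as f:
        f.write(b"HEADERchunk-0.0chunk-0.1")

    refs = KerchunkRefs({
        ".zgroup": '{"zarr_format": 2}',
        "temp/.zarray": '{"shape": [1, 2], "chunks": [1, 1]}',
        "temp/0.0": [blob, 6, 9],
        "temp/0.1": [blob, 15, 9],
    })
    top_level = refs.read_dir()          # names a Zarr reader would see at the root
    children = refs.read_dir("temp")     # names inside the "temp" array
    first_chunk = refs.read("temp/0.0")  # bytes sliced out of data.bin
```

A real virtual file system would also need to handle remote URLs and whole-file references, which this sketch leaves out.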
>
> Where things don't get pretty is for big datasets, where that JSON file
> can become so big that parsing it and holding it in memory becomes an
> annoyance. They have apparently moved to a hierarchy of Parquet files
> to store the references to the blocks:
> https://fsspec.github.io/kerchunk/spec.html#parquet-references . That's
> becoming a bit messy. It should be implementable, though.
>
> There are also subtleties in Kerchunk v1 with Jinja substitution and
> generators of keys, all tricks to decrease the size of the JSON, that would
> complicate an implementation.
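
As an illustration of the template substitution mentioned above, here is a simplified Python sketch that expands "{{name}}" placeholders in the URL part of each reference. It assumes a version 1 document with "templates" and "refs" members; the function name `expand_templates` and the example bucket URL are hypothetical. It deliberately does not implement real Jinja rendering (expressions, templates with arguments) or the "gen" key generators, which is exactly where an implementation would get more complicated.

```python
import re

# Matches a bare placeholder such as {{u}} or {{ u }}; full Jinja syntax is not supported.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def expand_templates(doc):
    """Return a flat key -> reference mapping with URL placeholders substituted."""
    templates = doc.get("templates", {})

    def render(text):
        return PLACEHOLDER.sub(lambda m: templates[m.group(1)], text)

    expanded = {}
    for key, ref in doc["refs"].items():
        if isinstance(ref, list):
            # Only the URL (first element) is rendered; offset and length pass through.
            expanded[key] = [render(ref[0])] + list(ref[1:])
        else:
            # Inline data is kept as-is.
            expanded[key] = ref
    return expanded


doc = {
    "version": 1,
    "templates": {"u": "s3://example-bucket/file.nc"},  # hypothetical URL
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/0.0": ["{{u}}", 1000, 200],
        "temp/0.1": ["{{u}}", 1200, 200],
    },
}
flat = expand_templates(doc)
```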
>
> On Kerchunk itself, I don't have any experience, but I feel there might be
> limitations to what it can handle due to the underlying raster formats. For
> example, if you have a GeoTIFF file using JPEG compression, with the
> quantization tables being stored in the TIFF JpegTables tag (i.e. shared
> for all tiles), which is the formulation that GDAL would use by default on
> creation, then I don't see how Kerchunk can deal with that, since that
> would be 2 distinct chunks in the file, and the recombination is slightly
> more complicated than just appending them together before passing them to a
> JPEG codec. Similarly, if you wanted to Kerchunk a GeoPackage raster, you
> couldn't, because a single tile in SQLite3 generally spans multiple
> SQLite3 pages (of size 4096), with a few "header" bytes at the beginning of
> each tile. For GRIB2, there are certainly limitations to some formulations
> because some GRIB2 encodings for arrays are really particular. It presumably
> works only with the simplest raw encodings.
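
To show why the JPEG case is more than a concatenation, here is a small Python sketch of the usual splice. It assumes the JpegTables content is a complete SOI...EOI stream holding only tables, and that each tile is its own SOI...EOI stream. The function name `merge_jpeg_tables` is hypothetical and the payloads are synthetic placeholders, not decodable JPEG data.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def merge_jpeg_tables(tables, tile):
    """Splice shared tables in front of a tile to form one SOI...EOI stream."""
    if not (tables.startswith(SOI) and tables.endswith(EOI)):
        raise ValueError("tables is not an SOI...EOI stream")
    if not tile.startswith(SOI):
        raise ValueError("tile does not start with SOI")
    # Drop the trailing EOI of the tables and the leading SOI of the tile.
    return tables[:-len(EOI)] + tile[len(SOI):]


# Synthetic placeholder payloads standing in for real marker segments.
tables = SOI + b"<DQT><DHT>" + EOI
tile = SOI + b"<SOF><SOS>scan-data" + EOI

merged = merge_jpeg_tables(tables, tile)  # one stream with a single SOI
naive = tables + tile                     # plain appending keeps two SOI markers
```

A reference made only of (offset, size) pairs has no way to express the "drop two bytes here and there" step, which is the limitation described above.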
>
> Kerchunk can potentially do virtual tiling, but I believe that all tiles
> must have the same dimensions, and those dimensions must be a multiple of
> their internal tile size, so you can create a Zarr-compatible
> representation of them.
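
A rough Python sketch of that constraint, reading it as "every file has the same shape, and that shape is a whole number of internal blocks". The function name `can_mosaic_as_zarr` is hypothetical, and this is one interpretation of the condition, not a rule taken from the Kerchunk documentation.

```python
def can_mosaic_as_zarr(file_shapes, block_shape):
    """Check that files could line up on one regular chunk grid (assumed rule)."""
    first = file_shapes[0]
    # All files must share the same dimensions.
    same_shape = all(shape == first for shape in file_shapes)
    # Each dimension must be a whole number of internal blocks.
    aligned = all(dim % block == 0 for dim, block in zip(first, block_shape))
    return same_shape and aligned


ok = can_mosaic_as_zarr([(1024, 1024), (1024, 1024)], (256, 256))
mixed = can_mosaic_as_zarr([(1024, 1024), (1000, 1024)], (256, 256))
unaligned = can_mosaic_as_zarr([(1000, 1000), (1000, 1000)], (256, 256))
```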
>
> And obviously one strong assumption of Kerchunk is that the files
> referenced by a Kerchunk index are immutable. If for some reason, tiles are
> moved internally because of updates, chaos will arise due to (offset, size)
> tuples being out of sync.
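
The immutability assumption can be demonstrated with a few lines of Python. This sketch writes a tiny file, records (offset, size) pairs for two "tiles", then rewrites the file with a larger first tile and reads again with the stale index. The helper `read_range` and the file layout are made up for the example.

```python
import os
import tempfile


def read_range(path, offset, size):
    """Read size bytes starting at offset, as a reference-based reader would."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiles.bin")
    with open(path, "wb") as f:
        f.write(b"HDR" + b"AAAA" + b"BBBB")

    # Index built once, while the file had this layout.
    index = {"tile_a": (3, 4), "tile_b": (7, 4)}
    before = read_range(path, *index["tile_b"])

    # Simulate an update that grows the first tile and shifts the second one.
    with open(path, "wb") as f:
        f.write(b"HDR" + b"AAAAAA" + b"BBBB")

    # The index was not regenerated, so the same tuple now straddles both tiles.
    after = read_range(path, *index["tile_b"])
```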
>
> Even
>
>
> On 24/07/2024 at 00:37, Michael Sumner via gdal-dev wrote:
>
> Hi, is there any effort or thought into something like Python's kerchunk
> in GDAL? (my summary of kerchunk is below)
>
> https://github.com/fsspec/kerchunk
>
> I'll be exploring the python outputs in detail and looking for hooks into
> where we might bring some of this tighter into GDAL. This would work
> nicely inside the GTI driver, for example. But a *kerchunk-driver*? That
> would be in the family of raw/ drivers. My skillset won't have much to
> offer, but I'm going to explore with some simpler examples. It could even
> bring old HDF4 files into the fold, I think.
>
> It's a bit weird from a GDAL perspective to map the chunks in a format for
> which we have a driver, but there are definitely performance advantages and
> convenience in virtualizing huge disparate collections (even the simplest
> time-series-of-files in netcdf is nicely abstracted here for xarray, like a
> super-charged VRT for xarray).
>
> Interested in any thoughts, feedback, pointers to related efforts ...
> thanks!
>
> (my take on) A description of kerchunk:
>
> kerchunk replaces the actual binary blobs on file in a Zarr with JSON
> references to a file/uri/object and the byte offset and length. In this
> way kerchunk brings formats like hdf/netcdf/grib into the fold of "cloud
> readiness" by having a complete separation of metadata from the actual
> storage. The information about those chunks (compression, type,
> orientation, etc.) is also stored in JSON.
>
> (A Zarr is a multidimensional version of a single-zoom-level image
> tiling: imagine every image tile as a potentially n-dimensional child block
> of a larger array. The blobs are stored like one zoom level of a z/y/x tile
> server, in a [[[v/]w/]y/]x way, with a position for each dimension of the
> array (1, 2, 3, 4, or n), where z is not special, and with more general
> encoding possibilities than tif/png/jpeg provide.) This scheme is extremely
> general, literally a virtualized array-like abstraction on any storage,
> and with kerchunk you can transcend many legacy issues with actual formats.
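
As a small illustration of that addressing scheme, this Python sketch enumerates the chunk keys of a regularly chunked n-dimensional array, one index per dimension. The function name `chunk_keys` is hypothetical. It assumes a "." separator between indices, with "/" shown as an alternative that gives the tile-server-like nesting.

```python
import itertools
import math


def chunk_keys(shape, chunks, separator="."):
    """List chunk keys for an array of the given shape split into regular chunks."""
    # Number of chunks along each dimension (edge chunks may be partial).
    counts = [math.ceil(size / chunk) for size, chunk in zip(shape, chunks)]
    return [
        separator.join(str(i) for i in index)
        for index in itertools.product(*(range(n) for n in counts))
    ]


keys_2d = chunk_keys(shape=(4, 6), chunks=(2, 3))
keys_3d = chunk_keys(shape=(2, 4, 6), chunks=(1, 2, 3))
nested = chunk_keys(shape=(4, 6), chunks=(2, 3), separator="/")
```

With kerchunk, each of these keys would map to a reference into an existing file instead of a stored blob.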
>
> Cheers, Mike
>
>
> --
> Michael Sumner
> Research Software Engineer
> Australian Antarctic Division
> Hobart, Australia
> e-mail: mdsumner at gmail.com
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
> -- http://www.spatialys.com
> My software is free, but my time generally not.
>
>
--
Michael Sumner
Research Software Engineer
Australian Antarctic Division
Hobart, Australia
e-mail: mdsumner at gmail.com