[gdal-dev] kerchunk

Even Rouault even.rouault at spatialys.com
Wed Jul 24 02:59:51 PDT 2024


Michael,

I don't think this would be a frmts/raw driver, but rather a 
/vsikerchunk virtual file system that you would combine with the Zarr 
driver.

So you would open a dataset with "/vsikerchunk/{path/to.json}", and the 
Zarr driver would then issue a ReadDir() operation on 
/vsikerchunk/{path/to.json}, which would return the top-level keys of 
the JSON. The Zarr driver would then issue an Open() operation on 
"/vsikerchunk/{path/to.json}/.zmetadata", and so on. The Zarr driver 
could remain essentially unmodified. I believe this is essentially how 
the Python implementation works when combining the Kerchunk-specific 
part with the Python Zarr module (except that it passes file system 
objects rather than strings).
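To make that interaction concrete, here is a minimal sketch in plain 
Python (not an actual GDAL API; the index content and the 
vsikerchunk_read helper are purely illustrative) of how such a handler 
could serve Zarr store keys straight out of a kerchunk v0-style JSON, 
where each value is either inline data or a [url, offset, length] 
reference:

```python
# Hypothetical kerchunk v0-style index: top-level keys are Zarr store
# keys; values are inline data or [url, offset, length] references.
index = {
    ".zgroup": '{"zarr_format": 2}',
    "temp/.zarray": '{"shape": [100, 100], "chunks": [50, 50]}',
    "temp/0.0": ["s3://bucket/file.nc", 10000, 5000],
}

def vsikerchunk_read(key):
    """Resolve one Zarr store key the way a /vsikerchunk handler might:
    inline strings are returned directly; [url, offset, length] triplets
    would trigger a ranged read on the target file (simulated here)."""
    ref = index[key]
    if isinstance(ref, str):
        return ref.encode()
    url, offset, length = ref
    return f"RANGE-READ {url} bytes {offset}-{offset + length - 1}".encode()

# The ReadDir() on /vsikerchunk/{path/to.json} is just a listing of the
# top-level key components:
print(sorted({k.split("/")[0] for k in index}))
# An Open()/read on ".zgroup" is a plain dictionary lookup:
print(vsikerchunk_read(".zgroup"))
```

The point being that the Zarr driver never needs to know the bytes come 
from a reference index rather than a real directory tree.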

Where things get less pretty is for big datasets, where that JSON file 
can become so big that parsing it and holding it in memory becomes an 
annoyance. They have apparently come to using a hierarchy of Parquet 
files to store the references to the blocks: 
https://fsspec.github.io/kerchunk/spec.html#parquet-references . That's 
becoming a bit messy, but should be implementable.
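Assuming the column layout sketched in that spec (roughly one Parquet 
file per variable, with per-chunk path/offset/size columns plus 
optional inline raw data), chunk resolution would look something like 
the following; the rows content and the resolve helper are invented 
for illustration:

```python
# Illustrative stand-in for one variable's Parquet reference file:
# rows are ordered by flattened chunk index, with columns roughly
# equivalent to (path, offset, size, raw).
rows = [
    ("s3://bucket/a.nc", 10000, 5000, None),
    ("s3://bucket/a.nc", 15000, 5000, None),
    (None, 0, 0, b"\x00" * 16),  # small chunk stored inline
]

def resolve(chunk_index):
    """Return inline bytes, or a (path, offset, size) range request."""
    path, offset, size, raw = rows[chunk_index]
    if raw is not None:
        return raw
    return (path, offset, size)

print(resolve(0))  # a range request
print(resolve(2))  # inline bytes
```

Only the rows for the chunks actually being read have to be loaded, 
which is what makes this bearable for huge datasets.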

There are also subtleties in Kerchunk v1, with jinja substitution and 
key generators, all tricks to decrease the size of the JSON, that would 
complicate an implementation.
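For reference, an abridged v1 index using both features could look like 
the following (structure per the fsspec spec; the concrete values and 
the expand helper are toy examples, and real kerchunk evaluates the 
{{...}} expressions with jinja2):

```python
index_v1 = {
    "version": 1,
    # "templates" are named jinja substitutions usable in URLs:
    "templates": {"u": "s3://bucket/file.nc"},
    # "gen" entries describe whole families of keys without listing them:
    "gen": [{
        "key": "temp/{{i}}.0",
        "url": "{{u}}",
        "offset": "{{1000 * i}}",
        "length": "1000",
        "dimensions": {"i": {"stop": 3}},
    }],
    "refs": {".zgroup": '{"zarr_format": 2}'},
}

def expand(index):
    """Toy expansion of the generators above (this example only)."""
    refs = dict(index["refs"])
    url = index["templates"]["u"]
    for gen in index["gen"]:
        for i in range(gen["dimensions"]["i"]["stop"]):
            key = gen["key"].replace("{{i}}", str(i))
            refs[key] = [url, 1000 * i, 1000]
    return refs

print(sorted(expand(index_v1)))
```

A reader has to perform an expansion like this before (or lazily while) 
serving keys, which is where the implementation complication comes from.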

On Kerchunk itself, I don't have any experience, but I feel there might 
be limitations to what it can handle due to the underlying raster 
formats. For example, if you have a GeoTIFF file using JPEG compression, 
with the quantization tables stored in the TIFF JpegTables tag (i.e. 
shared by all tiles), which is the formulation GDAL uses by default on 
creation, then I don't see how Kerchunk can deal with that, since those 
are two distinct chunks in the file, and the recombination is slightly 
more complicated than just appending them together before passing them 
to a JPEG codec. Similarly, if you wanted to Kerchunk a GeoPackage 
raster, you couldn't, because a single tile in SQLite3 generally spans 
multiple SQLite3 pages (of size 4096), with a few "header" bytes at the 
beginning of each tile. For GRIB2, there are certainly limitations with 
some formulations, because some GRIB2 array encodings are really 
particular; it probably works only with the simplest raw encoding.
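To illustrate why plain appending is not enough in the JpegTables case: 
per TIFF Technical Note 2, the tag holds an abbreviated tables-only 
JPEG stream (SOI ... EOI) and each tile holds an abbreviated image 
stream, so a reader has to splice the two, e.g. by dropping the tables' 
EOI marker and the tile's SOI marker. A simplified sketch (the marker 
handling is real, the payloads are placeholders):

```python
SOI = b"\xff\xd8"  # start-of-image marker
EOI = b"\xff\xd9"  # end-of-image marker

def merge_jpeg_tables(jpeg_tables, tile_stream):
    """Splice a TIFF JpegTables stream (SOI + tables + EOI) with an
    abbreviated tile stream (SOI + frame/scan + EOI) into one stream a
    standard JPEG codec can decode. Assumes well-formed inputs."""
    assert jpeg_tables.startswith(SOI) and jpeg_tables.endswith(EOI)
    assert tile_stream.startswith(SOI)
    # Keep the tables minus their EOI, then the tile minus its SOI:
    return jpeg_tables[:-len(EOI)] + tile_stream[len(SOI):]

tables = SOI + b"<DQT/DHT segments>" + EOI
tile = SOI + b"<SOF/SOS + entropy-coded data>" + EOI
merged = merge_jpeg_tables(tables, tile)
print(merged.startswith(SOI) and merged.endswith(EOI))
```

So a Kerchunk reference would need to express "these two byte ranges, 
spliced with marker surgery", not just "these two byte ranges appended".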

Kerchunk can potentially do virtual tiling, but I believe that all tiles 
must have the same dimensions, and their internal tiling must be a 
multiple of those dimensions, so that a Zarr-compatible representation 
of them can be created.

And obviously one strong assumption of Kerchunk is that the files 
referenced by a Kerchunk index are immutable. If, for some reason, tiles 
are moved internally because of updates, chaos will arise due to the 
(offset, size) tuples being out of sync.

Even


Le 24/07/2024 à 00:37, Michael Sumner via gdal-dev a écrit :
> Hi, is there any effort or thought into something like Python's 
> kerchunk in GDAL?   (my summary of kerchunk is below)
>
> https://github.com/fsspec/kerchunk
>
> I'll be exploring the python outputs in detail and looking for hooks 
> into where we might bring some of this tighter into GDAL.  This would 
> work nicely inside the GTI driver, for example. But,  a 
> *kerchunk-driver*? That would be in the family of raw/ drivers, my 
> skillset won't have much to offer but I'm going to explore with some 
> simpler examples.   It could even bring old HDF4 files into the fold, 
> I think.
>
> It's a bit weird from a GDAL perspective to map the chunks of a format 
> for which we have a driver, but there are definite performance 
> advantages and convenience for virtualizing huge disparate collections 
> (even the simplest time-series-of-files in netcdf is nicely abstracted 
> here for xarray, a super-charged VRT for xarray).
>
> Interested in any thoughts, feedback, pointers to related efforts ... 
> thanks!
>
> (my take on) A description of kerchunk:
>
> kerchunk replaces the actual binary blobs on file in a Zarr with JSON 
> references to a file/uri/object and the byte start and end values; in 
> this way kerchunk brings formats like hdf/netcdf/grib into the fold of 
> "cloud readiness" by having a complete separation of metadata from the 
> actual storage. The information about those chunks (compression, type, 
> orientation, etc.) is stored in JSON as well.
>
> (A Zarr is a multidimensional version of a single-zoom-level image 
> tiling: imagine every image tile as a potentially n-dimensional child 
> block of a larger array. The blobs are stored like one zoom of a 
> z/y/x tile server, in a [[[v/]w/]y/]x way, with a position for each 
> dimension of the array, 1, 2, 3, 4, or n, where z is not special, and 
> with more general encoding possibilities than tif/png/jpeg provide. 
> This scheme is extremely general, literally a virtualized array-like 
> abstraction on any storage, and with kerchunk you can transcend many 
> legacy issues with actual formats.)
>
> Cheers, Mike
>
>
> -- 
> Michael Sumner
> Research Software Engineer
> Australian Antarctic Division
> Hobart, Australia
> e-mail: mdsumner at gmail.com
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
http://www.spatialys.com
My software is free, but my time generally not.

