[gdal-dev] Announcing SOZip: Seek-Optimized profile for the .zip format

Even Rouault even.rouault at spatialys.com
Thu Jan 26 13:27:21 PST 2023


Hi,

The implementation has just been merged into GDAL master.

You can for example test it with gdal-master and QGIS Conda packages:
conda create --name sozip_test
conda activate sozip_test
conda install -c gdal-master gdal
conda install -c conda-forge qgis

And for example with a large GeoPackage file (adapt the filenames, of course):
sozip -j output.gpkg.zip /path/to/input.gpkg
qgis output.gpkg.zip

Even

Le 09/01/2023 à 15:19, Even Rouault a écrit :
> Hi,
>
> It is my pleasure to announce ( 
> https://github.com/sozip/sozip-spec/blob/master/blog/01-announcement.md 
> ) the initial release of the specification ( 
> https://github.com/sozip/sozip-spec/blob/master/sozip_specification.md 
> ) for the SOZip (Seek-Optimized Zip) profile to the ZIP file format, 
> as well as its GDAL implementation.
>
> What is SOZip ?
> ----------------------
>
> A Seek-Optimized ZIP file (SOZip) is a ZIP file that contains one or 
> several Deflate-compressed files that are organized and annotated such 
> that a SOZip-aware reader can perform very fast random access (seek) 
> within a compressed file.
>
> SOZip makes it possible to access large compressed files directly from 
> a .zip file without prior decompression. It is not a new file format, 
> but a profile of the existing ZIP format, done in a fully backward 
> compatible way. ZIP readers that are non-SOZip aware can read a 
> SOZip-enabled file normally and ignore the extended features that 
> support efficient seek capability.
>
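
To illustrate the idea described above, here is a minimal Python sketch that
uses only the standard zlib module. It is a simplification under stated
assumptions, not the SOZip on-disk layout: the chunk size, the function names
(compress_chunked, read_range) and the in-memory list of offsets are all
hypothetical, and the real index format is defined in the specification. The
sketch writes one raw Deflate stream but flushes the compressor at every chunk
boundary, so that any chunk can later be decompressed on its own.

```python
import zlib

CHUNK_SIZE = 32 * 1024  # hypothetical chunk size, chosen only for illustration


def compress_chunked(data, chunk_size=CHUNK_SIZE):
    """Return (raw_deflate_stream, offsets), one offset per chunk start."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw Deflate
    out = bytearray()
    offsets = []
    for start in range(0, len(data), chunk_size):
        offsets.append(len(out))
        out += comp.compress(data[start:start + chunk_size])
        # A full flush ends on a byte boundary and resets the compressor
        # history, so the next chunk does not depend on earlier data.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets


def read_range(stream, offsets, offset, size, chunk_size=CHUNK_SIZE):
    """Read size bytes at uncompressed offset, inflating only needed chunks."""
    if size <= 0:
        return b""
    first = offset // chunk_size
    last = min((offset + size - 1) // chunk_size, len(offsets) - 1)
    result = bytearray()
    for i in range(first, last + 1):
        end = offsets[i + 1] if i + 1 < len(offsets) else len(stream)
        # A fresh decompressor per chunk: no state from earlier chunks needed.
        result += zlib.decompressobj(-15).decompress(stream[offsets[i]:end])
    skip = offset - first * chunk_size
    return bytes(result[skip:skip + size])
```

Because the output is still a single valid Deflate stream, a reader that knows
nothing about the offsets can decompress it from start to end, which mirrors
the backward compatibility described above.
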
> Use cases
> --------------
>
> The SOZip specification is intended to be general purpose / not domain 
> specific. It was first developed to serve geospatial use cases, which 
> commonly have large compressed files inside of ZIP archives. In 
> particular, it makes it possible for users to read large GIS files 
> using the Shapefile, GeoPackage or FlatGeobuf formats (which have no 
> native provision for compression) compressed in .zip files without 
> prior decompression.
>
> Efficient random access and selective decompression are a requirement 
> to provide acceptable performance in many usage scenarios: spatial 
> index filtering, access to a feature by its identifier, etc.
>
> Performance
> ------------------
>
> SOZip is efficient:
>
> * The overhead of using a file from a SOZip archive, compared to using 
> it uncompressed, is of the order of 10% for common read operations.
> * Generation of a SOZip file can be much faster than regular ZIP 
> generation when using multithreading.
> * SOZip files are typically only ~5% larger than regular ZIPs 
> (dependent on content and chunk size).
>
> Have a look at benchmarking results: 
> https://github.com/sozip/sozip-spec/blob/master/README.md#benchmarking
>
> Other ZIP related specification
> ------------------------------------------
>
> The SOZip GitHub organization also hosts the KeyValuePairs extra-field 
> specification ( 
> https://github.com/sozip/keyvaluepairs-spec/blob/master/zip_keyvalue_extra_field_specification.md 
> ), which makes it possible to encode arbitrary key-value pairs of 
> metadata associated with a file within a ZIP, for example to store 
> the Content-Type of a file.
>
> How does this relate to GDAL ?
> -------------------------------------------
>
> Pull request https://github.com/OSGeo/gdal/pull/7042 has been 
> submitted with the following enhancements:
>
> *  The /vsizip/ virtual file system uses the SOZip index to perform fast
>     random access within a compressed SOZip-enabled file.
>
> * The Shapefile and GPKG drivers can directly generate SOZip-enabled 
> .shz/.shp.zip or .gpkg.zip files.
>
> *  Addition of the CPLAddFileInZip() C function, which can compress a
>     file and add it to a new or existing ZIP file, and enable the SOZip
>     optimization when relevant.
>
> *  The existing VSIGetFileMetadata() function can be called on a filename of
>     the form /vsizip/path/to/the/file.zip/path/inside/the/zip/file and
>     with domain = "ZIP" to find out whether a SOZip index is available
>     for that file.
>
> *  The new sozip command line utility
>     (https://github.com/rouault/gdal/blob/sozip/doc/source/programs/sozip.rst)
>     can be used to create a seek-optimized ZIP file, append files to an
>     existing ZIP file, list the contents of a ZIP file and display the
>     SOZip optimization status, or validate a SOZip file.
>
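
As a rough sketch of the VSIGetFileMetadata() item above, the following Python
snippet queries the "ZIP" metadata domain for a file inside an archive. It
assumes that the GDAL Python bindings expose this function as
gdal.GetFileMetadata() and that the installed GDAL build includes the work
from the pull request. The helper names (vsizip_path, sozip_metadata) are
hypothetical, and since the returned keys are not listed in this message, the
sketch returns whatever GDAL reports without interpreting it.

```python
def vsizip_path(zip_path, inner_path):
    """Build a /vsizip/ filename of the form described in the list above."""
    return "/vsizip/" + zip_path + "/" + inner_path.lstrip("/")


def sozip_metadata(zip_path, inner_path):
    """Return the "ZIP" domain metadata for one file inside a ZIP archive."""
    # Imported here so that the path helper stays usable without GDAL.
    from osgeo import gdal  # assumption: GDAL Python bindings are installed

    metadata = gdal.GetFileMetadata(vsizip_path(zip_path, inner_path), "ZIP")
    return metadata or {}


if __name__ == "__main__":
    # File names are placeholders to adapt, as in the sozip example earlier.
    for key, value in sozip_metadata("output.gpkg.zip", "input.gpkg").items():
        print(key, "=", value)
```
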
> Best regards,
>
> Even
>
-- 
http://www.spatialys.com
My software is free, but my time generally not.


