[gdal-dev] Announcing SOZip: Seek-Optimized profile for the .zip format

Even Rouault even.rouault at spatialys.com
Mon Jan 9 06:19:07 PST 2023


Hi,

It is my pleasure to announce
( https://github.com/sozip/sozip-spec/blob/master/blog/01-announcement.md )
the initial release of the specification
( https://github.com/sozip/sozip-spec/blob/master/sozip_specification.md )
for the SOZip (Seek-Optimized ZIP) profile of the ZIP file format, as
well as its GDAL implementation.

What is SOZip?
----------------------

A Seek-Optimized ZIP file (SOZip) is a ZIP file that contains one or 
several Deflate-compressed files that are organized and annotated such 
that a SOZip-aware reader can perform very fast random access (seek) 
within a compressed file.

SOZip makes it possible to access large compressed files directly from a 
.zip file without prior decompression. It is not a new file format, but 
a profile of the existing ZIP format, done in a fully backward 
compatible way. ZIP readers that are non-SOZip aware can read a 
SOZip-enabled file normally and ignore the extended features that 
support efficient seek capability.
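
The general idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the on-disk layout defined by the specification: the function names, the 32 KB chunk size and the in-memory list of offsets are hypothetical. The sketch flushes the Deflate stream at every chunk boundary so that decompression can restart there, and records the compressed offset of each chunk.

```python
import zlib

CHUNK_SIZE = 32 * 1024  # hypothetical chunk size, chosen for illustration


def compress_with_index(data, chunk_size=CHUNK_SIZE):
    """Return (raw Deflate stream, compressed offset of each chunk)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw Deflate
    out = bytearray()
    offsets = []
    for start in range(0, len(data), chunk_size):
        offsets.append(len(out))  # compressed position where this chunk begins
        out += comp.compress(data[start:start + chunk_size])
        # A full flush byte-aligns the output and resets the compressor
        # state, so a decoder can start reading at the next offset.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets


def read_at(stream, offsets, chunk_size, pos, size):
    """Read `size` uncompressed bytes at `pos` without decoding earlier chunks."""
    if size <= 0:
        return b""
    first = pos // chunk_size          # chunk that contains `pos`
    skip = pos - first * chunk_size    # bytes to discard inside that chunk
    decomp = zlib.decompressobj(-15)
    raw = decomp.decompress(stream[offsets[first]:], skip + size)
    return raw[skip:skip + size]
```

The stream produced this way is still an ordinary Deflate stream: `zlib.decompress(stream, -15)` returns the whole payload, which mirrors how readers that are not SOZip-aware keep working. A reader that knows the offsets, for example `read_at(stream, offsets, CHUNK_SIZE, 100000, 50)`, only has to decode from the start of the chunk containing the requested position.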

Use cases
--------------

The SOZip specification is intended to be general-purpose rather than
domain-specific. It was first developed to serve geospatial use cases, which
commonly have large compressed files inside of ZIP archives. In 
particular, it makes it possible for users to read large GIS files using 
the Shapefile, GeoPackage or FlatGeobuf formats (which have no native 
provision for compression) compressed in .zip files without prior 
decompression.

Efficient random access and selective decompression are a requirement to 
provide acceptable performance in many usage scenarios: spatial index 
filtering, access to a feature by its identifier, etc.

Performance
------------------

SOZip is efficient:

* The overhead of using a file from a SOZip archive, compared to using
it uncompressed, is on the order of 10% for common read operations.
* Generation of a SOZip file can be much faster than regular ZIP
generation when using multithreading.
* SOZip files are typically only ~5% larger than regular ZIPs
(depending on content and chunk size).

Have a look at benchmarking results: 
https://github.com/sozip/sozip-spec/blob/master/README.md#benchmarking

Other ZIP-related specification
------------------------------------------

The SOZip GitHub organization also hosts the KeyValuePairs extra-field
specification
( https://github.com/sozip/keyvaluepairs-spec/blob/master/zip_keyvalue_extra_field_specification.md ),
which makes it possible to encode arbitrary key-value pairs of metadata
associated with a file within a ZIP, for example to store the Content-Type
of a file.

How does this relate to GDAL?
-------------------------------------------

Pull request https://github.com/OSGeo/gdal/pull/7042 has been submitted 
with the following enhancements:

* The /vsizip/ virtual file system uses the SOZip index to perform fast
  random access within a compressed SOZip-enabled file.

* The Shapefile and GPKG drivers can directly generate SOZip-enabled
  .shz/.shp.zip or .gpkg.zip files.

* Addition of the CPLAddFileInZip() C function, which can compress a file
  and add it to a new or existing ZIP file, and enable the SOZip
  optimization when relevant.

* The existing VSIGetFileMetadata() method can be called on a filename of
  the form /vsizip/path/to/the/file.zip/path/inside/the/zip/file
  with domain = "ZIP" to find out whether a SOZip index is available
  for that file.

* The new sozip command line utility
  ( https://github.com/rouault/gdal/blob/sozip/doc/source/programs/sozip.rst )
  can be used to create a seek-optimized ZIP file, append files to an
  existing ZIP file, list the contents of a ZIP file and display the
  SOZip optimization status, or validate a SOZip file.

Best regards,

Even

-- 

http://www.spatialys.com
My software is free, but my time generally not.


