[gdal-dev] /vsizip/ improvements: multi-threaded compression, ZIP64 creation

Even Rouault even.rouault at spatialys.com
Mon Jul 2 08:08:03 PDT 2018


Hi,

Following a discussion with 'velix' on IRC, who pointed me to the pigz utility
( https://zlib.net/pigz/ ) that does multi-threaded compression of gzip files,
I've committed in GDAL master a similar mechanism in the /vsigzip/ and
/vsizip/ virtual file systems. If you set GDAL_NUM_THREADS to a value greater
than 1 or to ALL_CPUS, multi-threaded DEFLATE compression will be done. This
uses the equivalent of the pigz independent mode, where uncompressed chunks
(of size 1 megabyte by default) are compressed in an independent way [1],
and compressed chunks are simply appended. The resulting codestream is
perfectly standard. If the reading of the input data is not the limiting
factor, this scales quite well with the number of threads.
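
As an illustration only (this is not the GDAL code, which is C++), the
independent-mode idea can be sketched in Python with the standard zlib module.
The function name parallel_deflate, the 4-thread default and the compression
level 6 are assumptions made for this example:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024 * 1024  # 1 MB of uncompressed data per chunk


def _deflate_chunk(job):
    """Compress one chunk with a fresh compressor (empty dictionary)."""
    chunk, is_last = job
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15: raw DEFLATE
    out = comp.compress(chunk)
    # Non-final chunks end with a full flush (byte-aligned sync marker);
    # only the last chunk terminates the DEFLATE stream.
    out += comp.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)
    return out


def parallel_deflate(data, num_threads=4, chunk_size=CHUNK_SIZE):
    """Return a single raw DEFLATE stream made of independent chunks."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    jobs = [(c, i == len(chunks) - 1) for i, c in enumerate(chunks)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() preserves order, so the pieces can simply be appended.
        return b"".join(pool.map(_deflate_chunk, jobs))
```

zlib.decompress(parallel_deflate(data), -15) should give back data, since the
concatenation is an ordinary raw DEFLATE stream.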

You can use the following small Python script to test creation of zip files
(it enables GDAL_NUM_THREADS=ALL_CPUS by default).

https://raw.githubusercontent.com/OSGeo/gdal/master/gdal/swig/python/samples/gdal_zip.py

$ python gdal_zip.py my.zip srcfile1 srcfile2 ...

(Note that the multi-threaded compression is applied within each file;
several files are not compressed in parallel.)
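
For those who prefer to call the API directly, here is a minimal sketch of the
same idea. It is not copied from gdal_zip.py; the helper names vsizip_path and
zip_files are made up for this example, and it assumes the GDAL Python
bindings (osgeo.gdal) are installed:

```python
import os


def vsizip_path(zip_name, member):
    """Build the /vsizip/ path of a member inside an archive."""
    return "/vsizip/" + zip_name + "/" + member


def zip_files(zip_name, src_files, num_threads="ALL_CPUS"):
    """Write each source file into zip_name through /vsizip/."""
    # Imported here so that vsizip_path() is usable without GDAL installed.
    from osgeo import gdal

    # A value > 1 or ALL_CPUS enables multi-threaded DEFLATE compression.
    gdal.SetConfigOption("GDAL_NUM_THREADS", num_threads)

    # Keep the archive handle open while members are added.
    archive = gdal.VSIFOpenL("/vsizip/" + zip_name, "wb")
    if archive is None:
        raise IOError("cannot create " + zip_name)
    try:
        for src in src_files:
            path = vsizip_path(zip_name, os.path.basename(src))
            member = gdal.VSIFOpenL(path, "wb")
            if member is None:
                raise IOError("cannot create " + path)
            try:
                with open(src, "rb") as f:
                    while True:
                        buf = f.read(1024 * 1024)
                        if not buf:
                            break
                        gdal.VSIFWriteL(buf, 1, len(buf), member)
            finally:
                gdal.VSIFCloseL(member)
    finally:
        gdal.VSIFCloseL(archive)
```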

Given that I've opted for the equivalent of pigz independent mode, one
drawback is a slight decrease in the compression ratio, because the
dictionary is cleared at each chunk boundary, but given the large chunk
size, this is normally barely noticeable.

But the main advantage of independent mode is that independent decompression
could potentially be implemented. If we serialized the offset of each
independent chunk, we could implement efficient seeking in the file, whereas
currently, if you want to read a byte at the end of a deflate stream, you need
to decompress the whole stream. Here you would need to decompress at most 1 MB.
This could be done by writing a special file with the offsets of the independent
chunks inside the .zip archive, potentially hidden from other applications (you
can have holes in zip). It would also be possible to do multi-threaded
decompression (if enough uncompressed data is requested at once, or if we detect
a read pattern that seems to imply the whole file will be read). If people are
interested in funding the implementation of such functionality, feel free to
contact me.
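
To make the seeking idea concrete, here is a rough Python sketch using the
standard zlib module. Nothing like this exists in GDAL at this point: the
function names are hypothetical, and the offset index is simply kept in a
Python list rather than in a hidden file inside the archive.

```python
import zlib


def compress_with_index(data, chunk_size=1024 * 1024):
    """Compress independent chunks and record where each one starts."""
    parts, offsets, pos = [], [], 0
    n_chunks = max(1, -(-len(data) // chunk_size))  # ceiling division
    for i in range(n_chunks):
        chunk = data[i * chunk_size:(i + 1) * chunk_size]
        comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw DEFLATE
        is_last = (i == n_chunks - 1)
        part = comp.compress(chunk)
        part += comp.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)
        offsets.append(pos)
        parts.append(part)
        pos += len(part)
    offsets.append(pos)  # end offset of the last chunk
    return b"".join(parts), offsets


def read_at(stream, offsets, chunk_size, pos, length):
    """Read `length` bytes at uncompressed offset `pos`, decompressing
    only the chunks that overlap the requested range."""
    first = pos // chunk_size
    last = min((pos + length - 1) // chunk_size, len(offsets) - 2)
    out = bytearray()
    for i in range(first, last + 1):
        segment = stream[offsets[i]:offsets[i + 1]]
        # Each chunk starts byte-aligned with an empty dictionary, so a
        # fresh decompressor can start right there.
        out += zlib.decompressobj(-15).decompress(segment)
    start = pos - first * chunk_size
    return bytes(out[start:start + length])
```

Reading the last byte of the file then only costs the decompression of the
last chunk. The loop over chunks in read_at is also where multi-threaded
decompression could be plugged in.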

Another fix/improvement I did is ZIP64 ([2]) creation. ZIP64 reading was
supported, but up to now, if using /vsizip/ in write mode and the uncompressed
or compressed size of a file was greater than 4 GB, or the whole .zip was > 4 GB,
the resulting .zip would be corrupted (file sizes and internal offsets
truncated to their lower 32 bits), due to ZIP64 not being used. I've thus
resynchronized with zlib's minizip to fix that.
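
To show what that truncation means in practice, a tiny Python example (the
function name is just for illustration): a 5 GiB size stored in a 32-bit
field ends up being read back as 1 GiB.

```python
def truncate_to_32_bits(value):
    """Keep only the lower 32 bits, as a 32-bit zip header field would."""
    return value & 0xFFFFFFFF


five_gib = 5 * 1024 ** 3
print(truncate_to_32_bits(five_gib))  # 1073741824, i.e. 1 GiB
```
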
One potential downside is that, given how GDAL creates zip files and the
structural constraints of the zip format, GDAL must add a ZIP64 extra field in
the "local file header", even if it is eventually unused. The unzip utility on
Linux and the Windows 7 file manager are happy with that, but I'd appreciate
it if users could do some testing with other zip readers (particularly on Mac,
which apparently may have issues with ZIP64).
You can try opening this smallish zip archive:

https://github.com/OSGeo/gdal/blob/master/autotest/gcore/data/byte_zip64_local_header_zeroed.zip?raw=true
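
For reference, here is a small Python sketch of what a ZIP64 extra field in a
local file header looks like, as I read the ZIP application note: a 2-byte
header ID 0x0001, a 2-byte data size, then the 8-byte uncompressed and
compressed sizes, all little-endian. This is not the GDAL/minizip code, and
the function names are made up for the example.

```python
import struct

ZIP64_EXTRA_ID = 0x0001  # header ID of the ZIP64 extended information field


def zip64_local_extra(uncompressed_size=0, compressed_size=0):
    """Build the 20-byte ZIP64 extra field of a local file header."""
    return struct.pack("<HHQQ", ZIP64_EXTRA_ID, 16,
                       uncompressed_size, compressed_size)


def parse_zip64_local_extra(extra):
    """Return (uncompressed_size, compressed_size) from such a field."""
    header_id, data_size, usize, csize = struct.unpack("<HHQQ", extra[:20])
    if header_id != ZIP64_EXTRA_ID or data_size != 16:
        raise ValueError("not a ZIP64 local extra field")
    return usize, csize
```

With both sizes left at zero, the field costs 20 bytes per file, which is the
overhead of reserving it even when it turns out not to be needed.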

Even

[1] pigz in standard mode can also compress in a multithreaded way, with
non-initial chunks depending on the history of the last 32 KB of the
preceding uncompressed chunk. The resulting stream is thus nearly as small
as classical compression, but independent decompression of the chunks is
not possible. In independent mode, a full synchronization marker terminates
each compressed chunk and the decompressor clears its dictionary.
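
For comparison, the dependent (standard) mode can be approximated in Python by
priming each compressor with the last 32 KB of the preceding uncompressed
chunk through the zdict argument of zlib.compressobj. This is only a sketch
of the principle, not what pigz or GDAL actually run, and the function name
is hypothetical:

```python
import zlib


def deflate_with_history(data, chunk_size=1024 * 1024, history=32 * 1024):
    """Compress chunks separately, each primed with the preceding 32 KB."""
    parts = []
    n_chunks = max(1, -(-len(data) // chunk_size))  # ceiling division
    for i in range(n_chunks):
        start = i * chunk_size
        chunk = data[start:start + chunk_size]
        prior = data[max(0, start - history):start]
        if prior:
            # The dictionary comes from the uncompressed input, so chunks
            # could still be compressed by different threads.
            comp = zlib.compressobj(level=6, wbits=-15, zdict=prior)
        else:
            comp = zlib.compressobj(level=6, wbits=-15)
        is_last = (i == n_chunks - 1)
        part = comp.compress(chunk)
        part += comp.flush(zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH)
        parts.append(part)
    return b"".join(parts)
```

The result still decompresses as one raw DEFLATE stream, but a chunk can no
longer be decoded on its own, since it may reference data from the previous
one.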

[2] not to be confused with Deflate64, a proprietary variant of Deflate that
some Windows versions unfortunately and unnecessarily use for files > 4 GB,
and which is unsupported by zlib.

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

