[gdal-dev] VSIReadDirRecursive is O(n^2) with large nested hierarchies
Craig de Stigter
craig.destigter at koordinates.com
Mon Aug 18 22:26:39 PDT 2025
Hi folks
I've stumbled across VSIReadDirRecursive being really slow when I give it a
ridiculously large ZIP file (containing 5 million files across ~1500
subdirectories)
I spent a while poking round the source code. It looks like
VSIArchiveFilesystemHandler::ReadDirEx() performs repeated linear scans
through the flat VSIArchiveContent::entries array during recursive
directory traversal. For each directory level, it scans all entries from
the beginning, resulting in O(n²) time complexity.
Performance degrades from ~1.3s for the first 5,000 files to ~6.7s for 5000
files once I get 100K files into a 5-million-file ZIP archive, and keeps
getting worse from there. I haven't managed to list the whole 5M-file
archive yet...
A couple of possible solutions:
1. Add a directory index to VSIArchiveContent (add a map of string
directory paths to index in the entries array) so we can jumpstart the
ReadDirEx implementation at the right place
2. make a VSIDIRArchive class (subclass of VSIDIR) and override
OpenDir/NextDirEntry, so that it doens't call ReadDirEx repeatedly but
instead just returns entries from the VSIArchiveContent::entries array.
I'm leaning towards (1) because it would presumably improve random lookups
by file path also (not just ReadDirRecursive). Is this something that would
be accepted as a PR?
Thanks
--
Regards,
Craig
Platform Engineer
Koordinates
koordinates.com / @koordinates <https://twitter.com/koordinates>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20250819/76738f1f/attachment.htm>
More information about the gdal-dev
mailing list