[gdal-dev] VSIReadDirRecursive is O(n^2) with large nested hierarchies

Even Rouault even.rouault at spatialys.com
Tue Aug 19 02:59:42 PDT 2025


Hi Craig,

your option (1) sounds good to me. However, there is no requirement that 
all file entries in a directory are consecutive, so the raw entry list 
could potentially be

dirA/file1
dirB/file1
dirA/file2
dirB/file2

depending on how files are inserted in the ZIP. So you likely need to 
sort things before creating your index.

Even

Le 19/08/2025 à 07:26, Craig de Stigter via gdal-dev a écrit :
> Hi folks
>
> I've stumbled across VSIReadDirRecursive being really slow when I give 
> it a ridiculously large ZIP file (containing 5 million files across 
> ~1500 subdirectories)
>
> I spent a while poking round the source code. It looks like 
> VSIArchiveFilesystemHandler::ReadDirEx() performs repeated linear 
> scans through the flat VSIArchiveContent::entries array during 
> recursive directory traversal. For each directory level, it scans all 
> entries from the beginning, resulting in O(n²) time complexity.
>
> Performance degrades from ~1.3s for the first 5,000 files to ~6.7s for 
> 5000 files once I get 100K files into a 5-million-file ZIP archive, 
> and keeps getting worse from there. I haven't managed to list the 
> whole 5M-file archive yet...
>
> A couple of possible solutions:
>
> 1. Add a directory index to VSIArchiveContent (add a map of string 
> directory paths to index in the entries array) so we can jumpstart the 
> ReadDirEx implementation at the right place
> 2. make a VSIDIRArchive class (subclass of VSIDIR) and override 
> OpenDir/NextDirEntry, so that it doens't call ReadDirEx repeatedly but 
> instead just returns entries from the VSIArchiveContent::entries array.
>
> I'm leaning towards (1) because it would presumably improve random 
> lookups by file path also (not just ReadDirRecursive). Is this 
> something that would be accepted as a PR?
>
> Thanks
>
>
> -- 
> Regards,
> Craig
>
> Platform Engineer
> Koordinates
> koordinates.com <http://koordinates.com/> / @koordinates 
> <https://twitter.com/koordinates>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
http://www.spatialys.com
My software is free, but my time generally not.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20250819/c144e286/attachment.htm>


More information about the gdal-dev mailing list