<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi Craig,</p>
<p>Your option (1) sounds good to me. However, there is no
requirement that all file entries of a directory be consecutive,
so the raw entry list could potentially be:</p>
<p>dirA/file1<br>
dirB/file1<br>
dirA/file2<br>
dirB/file2<br>
<br>
depending on the order in which files were added to the ZIP. So you
will likely need to sort the entries before creating your index.</p>
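<p>To illustrate the sorting step, here is a minimal sketch, in Python
for brevity rather than GDAL's C++. The names build_dir_index, ordered
and index are hypothetical and not part of GDAL. It assumes that every
entry is a file path using "/" as separator, and it ignores directories
that have no direct file children.</p>

```python
import posixpath


def build_dir_index(entries):
    """Sort entries by parent directory and record each directory's range.

    Hypothetical helper, not GDAL API. Returns the sorted entry list and a
    dict mapping a directory path to a (start, end) slice of that list.
    """
    # Sorting on (parent, name) makes all entries of a directory contiguous,
    # whatever order they were stored in the archive.
    ordered = sorted(entries, key=lambda p: posixpath.split(p.strip("/")))
    index = {}
    for pos, path in enumerate(ordered):
        parent = posixpath.dirname(path.strip("/"))
        # Keep the first position seen for this directory, extend the end.
        start, _ = index.get(parent, (pos, pos))
        index[parent] = (start, pos + 1)
    return ordered, index


# Interleaved order, as in the entry list above.
raw_entries = ["dirA/file1", "dirB/file1", "dirA/file2", "dirB/file2"]
ordered, index = build_dir_index(raw_entries)
start, end = index["dirA"]
print(ordered[start:end])
```

<p>With such an index, listing one directory becomes a single dictionary
lookup plus a slice, instead of a scan over all entries.</p>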
<p>Even</p>
<div class="moz-cite-prefix">On 19/08/2025 at 07:26, Craig de Stigter
via gdal-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAF1M8pcYuUEavzLgNtF_6opJsyVtEiNOxs5=EcQjt=SF+iJprw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div>Hi folks<br>
<br>
I've stumbled across VSIReadDirRecursive being really slow
when I give it a ridiculously large ZIP file (containing 5
million files across ~1,500 subdirectories).<br>
<br>
I spent a while poking round the source code. It looks like
VSIArchiveFilesystemHandler::ReadDirEx() performs repeated
linear scans through the flat VSIArchiveContent::entries array
during recursive directory traversal. For each directory
level, it scans all entries from the beginning, resulting in
O(n²) time complexity.<br>
<br>
Performance degrades from ~1.3s for the first 5,000 files to
~6.7s per 5,000 files once I am 100K files into a
5-million-file ZIP archive, and keeps getting worse from
there. I haven't managed to list the whole 5M-file archive
yet...<br>
<br>
A couple of possible solutions:<br>
<br>
1. Add a directory index to VSIArchiveContent (add a map of
string directory paths to index in the entries array) so we
can jumpstart the ReadDirEx implementation at the right place<br>
2. Make a VSIDIRArchive class (subclass of VSIDIR) and
override OpenDir/NextDirEntry, so that it doesn't call
ReadDirEx repeatedly but instead just returns entries from the
VSIArchiveContent::entries array.<br>
<br>
I'm leaning towards (1) because it would presumably improve
random lookups by file path also (not just ReadDirRecursive).
Is this something that would be accepted as a PR?<br>
<br>
Thanks<br>
<br>
</div>
<br>
<span class="gmail_signature_prefix">-- </span><br>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">
<div dir="ltr">
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px">Regards,</div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px">Craig</div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px"><br>
</div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px">Platform
Engineer<br>
</div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px">Koordinates</div>
<div
style="color:rgb(0,0,0);font-family:Helvetica;font-size:12px"><a
href="http://koordinates.com/"
style="color:rgb(17,85,204)" target="_blank"
moz-do-not-send="true">koordinates.com</a> / <a
href="https://twitter.com/koordinates"
style="color:rgb(17,85,204)" target="_blank"
moz-do-not-send="true">@koordinates</a></div>
</div>
</div>
</div>
<br>
</blockquote>
<pre class="moz-signature" cols="72">--
<a class="moz-txt-link-freetext" href="http://www.spatialys.com">http://www.spatialys.com</a>
My software is free, but my time generally not.</pre>
</body>
</html>