[gdal-dev] Performance and sibling files

Daniel daniel112b at gmail.com
Tue Jan 29 10:19:34 EST 2008


Hello,

We have identified a serious performance problem with the reading of sibling
files performed in the GDALOpenInfo constructor.

When we commented out lines 123-127 in gdalopeninfo.cpp (the VSIReadDir
call), the runtime of our application went down from 150 days to 15! The
application is 100% I/O-bound (it uses no CPU time according to the task
manager).

This is our setup:

7.5 million small (~20 KB) jpeg files with corresponding world files for a
total of 15 million files, distributed in 50000 directories (approximately
300 files per directory).

The files reside on a fast 15K SAS disk in a Windows 2003 server with
8 cores and 4 GB RAM. The filesystem is NTFS (no compression / indexing).

Due to the way the files are organized, neighboring jpeg files are located
in different directories. Because of the sibling reading, this means that
we always have to read an entire directory in order to open just one file.

Our app needs to read the entire dataset in geographic order.
Unfortunately, changing the directory layout is not an option.

Reading one complete directory means reading ~1.5 MB data from disk. The
data is read non-sequentially, since the NTFS directory structure is a
B-Tree and FindNextFile returns the contents sorted alphabetically.
The disk cache gets exhausted after reading 2700 directories. This means
that we never re-use the previously read directory data.
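
As a rough cross-check of these figures (a back-of-envelope sketch, not a
measurement): 2700 directory listings at ~1.5 MB each come to about 4 GB,
which is close to the RAM in the server, and one directory scan per jpeg
open adds up to roughly 11 TB of directory reads against about 150 GB of
actual image data. The constant and function names below are made up for
illustration, and decimal units (1 MB = 1000 KB) are assumed.

```cpp
// Approximate figures quoted in this message.
constexpr double kJpegFiles          = 7500000.0;  // one open per jpeg file
constexpr double kDirListingMB       = 1.5;        // data read per directory scan
constexpr double kJpegSizeKB         = 20.0;       // typical image size
constexpr double kDirsUntilCacheFull = 2700.0;     // directories read before the cache is exhausted

// Memory taken by cached directory data when the cache runs out, in MB.
double CacheFootprintMB() { return kDirsUntilCacheFull * kDirListingMB; }

// Directory data read over the whole run, assuming one uncached scan per open, in TB.
double DirectoryTrafficTB() { return kJpegFiles * kDirListingMB / 1e6; }

// Image data read over the whole run (world files ignored), in GB.
double ImageTrafficGB() { return kJpegFiles * kJpegSizeKB / 1e6; }

// Directory bytes read per image byte read.
double OverheadRatio() { return kDirListingMB * 1000.0 / kJpegSizeKB; }
```

Under these assumptions each 20 KB image costs about 75 times its own size
in directory reads, which is consistent with the run being I/O-bound.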

I realise that this might be quite an unusual case, but it would be very
nice if the sibling reading in GDALOpenInfo was optional.
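
To illustrate what "optional" could look like, here is a minimal sketch. It
is not the GDAL code: OpenInfoSketch, MakeOpenInfo, DirName and the
read_siblings flag are hypothetical names, and the directory listing is
passed in as a callback standing in for the VSIReadDir call.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the directory listing step (VSIReadDir in GDAL).
using DirLister = std::function<std::vector<std::string>(const std::string&)>;

// Hypothetical, stripped-down analogue of the open-info object.
struct OpenInfoSketch {
    std::string filename;
    std::vector<std::string> siblings;  // stays empty when the scan is skipped
    bool siblings_loaded = false;
};

// Directory part of a path; accepts both '/' and '\\' separators.
std::string DirName(const std::string& path) {
    const std::string::size_type pos = path.find_last_of("/\\");
    if (pos == std::string::npos) return ".";
    if (pos == 0) return path.substr(0, 1);  // file directly under the root
    return path.substr(0, pos);
}

// Builds the open-info object; the directory is only listed on request.
OpenInfoSketch MakeOpenInfo(const std::string& filename,
                            bool read_siblings,
                            const DirLister& list_dir) {
    OpenInfoSketch info;
    info.filename = filename;
    if (read_siblings) {
        info.siblings = list_dir(DirName(filename));
        info.siblings_loaded = true;
    }
    return info;
}
```

With read_siblings set to false the callback is never invoked, so opening
one file no longer touches the rest of its directory. Code that relies on
the sibling list would then have to check siblings_loaded and probe for
side files (such as the world file) by name instead.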

I don't think that the changes made in ticket #2158 (
http://trac.osgeo.org/gdal/ticket/2158) would help in this case since there
was almost no CPU utilization.

Regards,
  Daniel Bäck