[gdal-dev] /vsicurl caching behavior

Patrick Valsecchi patrick.valsecchi at camptocamp.com
Thu Jan 4 23:24:16 PST 2018


Hi,

I've been looking at the performance of GDAL in the context of MapServer
and QGIS accessing GeoTIFF files on an HTTP server (S3, mostly).

While doing that, I had a good look at cpl_vsil_curl.cpp and the
surrounding code, which raised a few questions/concerns:

1) There are a lot of undocumented knobs for tuning this code, for example
CPL_VSIL_CURL_USE_CACHE. Is that on purpose?

2) The implementation of AddRegionToCacheDisk/GetRegionFromCacheDisk is
quite crude. Scanning the file sequentially to find a region is not very
efficient, I guess. Are there any plans to improve that? A slightly less
crude approach would be to split the file in two: one holding the index
(hash+offset+size) and the other holding the content. That way, scanning
the index would be faster (contiguous on disk and in cache). But that
requires flock or its equivalent on other OSes.

3) The implementation of GetRegionFromCacheDisk has some efficiency
problems. If the region is found, it calls AddRegion, which in turn calls
AddRegionToCacheDisk. That re-scans the file, only to find the region
GetRegionFromCacheDisk just located and so not add it a second time. So
we scan the file sequentially twice.

4) There is no limit to the gdal_vsicurl_cache.bin file size. This makes
the disk cache hard to use in practice: risk of running out of disk space,
lookups that get slower as the file grows, and no refresh of the data
after some time.

5) There is no way to specify the location of gdal_vsicurl_cache.bin unless
one does a chdir before calling GDAL.

6) If VSI_CACHE is enabled, the data is cached twice in memory (papsRegions
and VSICachedFile). Is that intended?

7) If the file's content is modified, it's a total mess. We end up with
some portions of the file holding the old data while the rest holds the new
data. I'm quite sure the GeoTIFF we end up with won't be valid.

8) In the case discussed in 7), CPL_VSIL_CURL_NON_CACHED will purge the
data from only 1 of the 3 caches: papsRegions. The VSI_CACHE layer and the
disk cache will still hold the old content.
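
To make the idea in 2) a bit more concrete, here is a minimal Python sketch
(not GDAL code) of the index/content split. The class name, file names and
record layout are all made up for illustration. In particular, I assume the
hash is taken over the URL and that the index also records where the bytes
sit in the content file. Locking (flock or equivalent) is left out. Note
that a hit in get() never writes anything, which would also avoid the
double scan described in 3).

```python
import hashlib
import os
import struct

# Hypothetical index record (not GDAL's format): 16-byte URL hash, offset of
# the region in the remote file, region size, position in the content file.
RECORD = struct.Struct("<16sQIQ")


class SplitRegionCache:
    """Illustrative two-file cache: a compact index plus a content file."""

    def __init__(self, directory):
        self.index_path = os.path.join(directory, "cache.idx")
        self.data_path = os.path.join(directory, "cache.dat")
        # Create both files if they do not exist yet.
        for path in (self.index_path, self.data_path):
            open(path, "ab").close()

    @staticmethod
    def _key(url):
        # Truncated SHA-256 of the URL; any stable hash would do here.
        return hashlib.sha256(url.encode("utf-8")).digest()[:16]

    def _find(self, url, offset):
        """Scan only the index; return (data_pos, size) or None."""
        key = self._key(url)
        with open(self.index_path, "rb") as idx:
            buf = idx.read()
        for rec_key, rec_off, size, data_pos in RECORD.iter_unpack(buf):
            if rec_key == key and rec_off == offset:
                return data_pos, size
        return None

    def get(self, url, offset):
        """Return the cached bytes or None. A hit never writes to disk."""
        hit = self._find(url, offset)
        if hit is None:
            return None
        data_pos, size = hit
        with open(self.data_path, "rb") as data:
            data.seek(data_pos)
            return data.read(size)

    def add(self, url, offset, payload):
        """Append a region unless it is already indexed."""
        if self._find(url, offset) is not None:
            return False
        with open(self.data_path, "ab") as data:
            data_pos = data.seek(0, os.SEEK_END)
            data.write(payload)
        with open(self.index_path, "ab") as idx:
            idx.write(RECORD.pack(self._key(url), offset, len(payload), data_pos))
        return True
```

With this layout a lookup reads only the small index and then does a
single seek into the content file, instead of walking over all the cached
bytes.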

Apart from that, I'm very impressed by the performance GDAL can get when
accessing the data through HTTP and how easy it is to understand the code.
Kudos!

What do you guys think?

Thanks and CU