[gdal-dev] Cannot open S3 files after upload

Matt Hanson matt.a.hanson at gmail.com
Tue Jun 20 21:58:46 PDT 2017


Hello everyone,

My actual problem is a bit more specific than being unable to open S3 files
after upload. Within the same Python session, I can open a file off S3 with
the vsis3 driver, but if I then upload a new file that previously did not
exist (using boto3), GDAL does not see it as a valid file. I originally
encountered this problem in rasterio, and with gippy, but got the same
problem when using GDAL directly.

I have an app that generates time series by calculating values from images
off S3; it also uploads files to S3 if they did not previously exist for
that particular date. If all the files already exist, there is no problem
and they can be read fine. However, if a file is missing *and* the app has
already read a file from S3, then it is unable to see the newly uploaded
file as existing.

What appears to be happening is that once an S3 file is read, the contents
of that bucket are read into a cache, and if a new file is uploaded in the
meantime, the attempt to read it consults the cache, doesn't find the file,
and throws an error. If I recall correctly, GDAL reads other contents of
that bucket/key-prefix because it looks for accompanying metadata files, so
is that listing cached in some way? It seemed like a plausible explanation,
but I've been unable to find reference to such a cache other than
potentially VSI_CACHE; setting that to FALSE did nothing, and my
understanding is that it applies to specific datasets, not bucket contents.
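
For what it's worth, one thing I found in the /vsicurl documentation is the
CPL_VSIL_CURL_NON_CACHED config option, which takes a comma-separated list
of path prefixes to exclude from caching. A minimal sketch of setting it
for my bucket is below; whether it bypasses the listing cache as well as
cached file contents is an assumption on my part:

########################
#!/usr/bin/env python3
# Sketch: exclude this bucket from the vsicurl/vsis3 read cache via
# CPL_VSIL_CURL_NON_CACHED (comma-separated path prefixes). Assumption:
# this covers directory listings as well, not just file contents.
from osgeo import gdal

gdal.SetConfigOption('CPL_VSIL_CURL_NON_CACHED', '/vsis3/pail-of-images')
ds = gdal.Open('/vsis3/pail-of-images/file2.tif')
##########################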

I've managed to replicate the problem in the very simple Python program
below. While both files are uploaded without error (you can run gdalinfo
on both remotely), the attempt to open the second file throws:
ERROR 4: `/vsis3/pail-of-images/file2.tif' not recognized as a supported
file format.

Running the script a second time works, presumably because even though it
uploads and overwrites both images again, both already exist from the start
of the session.

Either this is a bug, or it's intended behavior, in which case there's
hopefully some way to force GDAL to reread a bucket when trying to open a
file. My current workaround is to change the app to upload all images
before accessing any of them (sketched after the script below), but this
seems unsatisfactory, not to mention it wreaks havoc with my tests, which
don't assume such behavior.
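
A cache-flushing call between the upload and the open would also solve this
for me. I've seen VSICurlClearCache() mentioned in GDAL's C API; I don't
know whether it's exposed in the Python bindings for 2.1.3/2.2.0, so the
sketch below treats that as an assumption and guards for it:

########################
#!/usr/bin/env python3
# Sketch: flush GDAL's /vsicurl//vsis3 caches before re-opening. The
# availability of gdal.VSICurlClearCache() in these GDAL versions is an
# assumption, hence the hasattr() guard.
from osgeo import gdal

if hasattr(gdal, 'VSICurlClearCache'):
    gdal.VSICurlClearCache()  # drop cached listings/contents, force re-read
ds = gdal.Open('/vsis3/pail-of-images/file2.tif')
##########################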

Suggestions very welcome; I've been banging my head on this for a couple of
days.

Tested with both Python 2.7 and 3.5, with GDAL 2.1.3 and GDAL 2.2.0, with
and without Docker, and on both Ubuntu and OS X.

########################
#!/usr/bin/env python3

from osgeo import gdal
import boto3

filenames = [
    'file1.tif',
    'file2.tif'
]

bucket = 'pail-of-images'

s3 = boto3.resource('s3')
for f in filenames:
    # Upload the local file, then immediately try to open it with GDAL.
    print('Uploading %s to %s' % (f, bucket))
    s3.meta.client.upload_file(f, bucket, f)
    uri = '/vsis3/%s/%s' % (bucket, f)
    print('Opening %s' % uri)
    # On the first run, this fails with ERROR 4 for file2.tif even though
    # the upload above succeeded.
    ds = gdal.Open(uri)
    print(ds.GetMetadata())
    ds = None
##########################
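
For completeness, the upload-all-first workaround is roughly the same
script split into two passes, so every object already exists before the
first open:

########################
#!/usr/bin/env python3
# Workaround sketch: upload everything first, then read, so no object is
# missing by the time GDAL first lists the bucket.
from osgeo import gdal
import boto3

filenames = ['file1.tif', 'file2.tif']
bucket = 'pail-of-images'

s3 = boto3.resource('s3')
for f in filenames:
    s3.meta.client.upload_file(f, bucket, f)

for f in filenames:
    ds = gdal.Open('/vsis3/%s/%s' % (bucket, f))
    print(ds.GetMetadata())
    ds = None
##########################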

Matthew Hanson
Development Seed
matthew at developmentseed.org