[gdal-dev] vsicurl configuration design decisions

Sean Gillies sean at mapbox.com
Thu Oct 12 09:18:22 PDT 2017


Hi Even,

On Tue, Oct 10, 2017 at 4:02 AM, Even Rouault <even.rouault at spatialys.com>
wrote:

> Hi Sean,
>
> > It's written in
> > http://gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsicurl
> > > Starting with GDAL 2.3, options can be passed in the filename with the
> > > following syntax: /vsicurl/option1=val1[,optionN=valN]*,url=http://...
> >
> > I'd like to discuss the design decisions that are being made here before
> > this gets out into the world.
> >
> > I'm uncomfortable with the way configuration is spread between environment
> > variables, config options that surface in the API,
>
> Just a clarification: GDAL only reads configuration options with
> CPLGetConfigOption(key). Those can be implicitly set through environment
> variables of the same name or with CPLSetConfigOption(key, value).
>
> > and also in identifiers.
> > I don't think it's a great idea to expand the amount of configuration
> > in dataset identifiers. It's redundant, the syntax is complicated,
>
> Frank answered on the main motivations.

Yes, I understand that adding syntax tied to new core GDAL functionality
can turn already-deployed software into full-fledged cloud data consumers.
For cloud data providers and customers this is a big win.


> > and it
> > dilutes the network effects of reusing identifiers in our applications.
>
> I didn't understand what you meant by the above sentence.

I mean that having multiple names for datasets in our domain,
https://example.com/foo.tif vs /vsicurl/https://example.com/foo.tif vs
/vsicurl/option1=val,url=https://example.com/foo.tif, dilutes the power of
the names and potentially reduces the network effects we could get by using
fewer names. This is an abstract concern, however, and I don't want it to
distract from talking about the design decisions.


> > Are there specific advantages to this
> >
> > ogrinfo -so /vsicurl/max_retry=10,url=https://example.com/poly.shp
> >
> > that we can't also have with a curl-style
> >
> > ogrinfo -so --max-retry=10 /vsicurl/https://example.com/poly.shp
> >
> > or, better yet, in my opinion
> >
> > ogrinfo -so --max-retry=10 https://example.com/poly.shp
> >
> > on the command line?
>
> One issue with your proposal is that it would require ogrinfo (or any
> utility) to go from the highest level abstraction layers of GDAL to the
> lowest ones.
>
> When ogrinfo is provided
> "/vsicurl/max_retry=10,url=https://example.com/poly.shp",
> this is just a string used as a dataset name.
>
> It happily feeds it into GDALOpenEx(), which in turn proposes it
> sequentially to all drivers.
>
> The shapefile driver tries this string with VSIFOpenL(), which in turn
> iterates over all virtual file systems. The /vsicurl/ VFS happens to
> recognize it and manages to open the file. The shapefile driver can read the
> first few bytes from it and recognizes that it is the header of a shapefile,
> etc.
>
> So in the current design neither the utility, nor GDALOpenEx(), nor the
> drivers themselves really make sense of that string. This is quite a
> strength at the architectural level. This also makes it possible to pass
> such a string in a VRT file, for example.
>
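
As I understand the chain described above, it could be modelled roughly like
this in Python. This is only a toy sketch: CurlFS, vsi_open, shapefile_driver
and open_dataset are hypothetical names of mine, not GDAL APIs; the option
parsing is my guess at the comma syntax; and the real shapefile driver sniffs
header bytes rather than checking a file extension.

```python
# Toy model of the dispatch chain. Every name here is hypothetical, not GDAL API.

class CurlFS:
    """Stands in for the /vsicurl/ handler: the only layer that parses options."""
    prefix = "/vsicurl/"

    def open(self, name):
        rest = name[len(self.prefix):]
        # Plain form: /vsicurl/https://...
        if rest.startswith(("http://", "https://")):
            return {"url": rest, "options": {}}
        # Option form (assumed): option1=val1,...,url=https://...
        head, _, url = rest.partition("url=")
        options = dict(item.split("=", 1) for item in head.split(",") if item)
        return {"url": url, "options": options}


FILE_SYSTEMS = [CurlFS()]


def vsi_open(name):
    """Stands in for VSIFOpenL(): offer the name to each virtual file system."""
    for fs in FILE_SYSTEMS:
        if name.startswith(fs.prefix):
            return fs.open(name)
    return None


def shapefile_driver(name):
    """The driver passes the name through untouched; it never parses options."""
    handle = vsi_open(name)
    # Simplification: a real driver would inspect the file header instead.
    if handle is not None and handle["url"].endswith(".shp"):
        return ("ESRI Shapefile", handle)
    return None


DRIVERS = [shapefile_driver]


def open_dataset(name):
    """Stands in for GDALOpenEx(): the name is an opaque string at this level."""
    for driver in DRIVERS:
        result = driver(name)
        if result is not None:
            return result
    return None


result = open_dataset("/vsicurl/max_retry=10,url=https://example.com/poly.shp")
print(result)
```

In this model only CurlFS.open() knows that max_retry exists, which is the
architectural property being defended.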

Is this the future of open and creation options? Do you imagine this being
extended to, say, block size, compression, number of threads? An RFC that
discusses the scope of this, and the level of abstraction at which it is
implemented, might be warranted. I'd be happy to participate.


> Regarding the direct use of http:// https:// , I also find it a bit
> unfortunate that we can't use them directly and have the vsicurl machinery
> used implicitly. It turns out that historically we have the HTTP driver that
> triggers on such dataset names (ingesting the whole file into /vsimem/, and
> proposing it in turn to other drivers). There are also a few other drivers
> (DODS, etc.) that trigger on such names.
>
> Even

On the other hand, https://example.com/foo.tif identifies only a single
resource, whereas /vsicurl/url=https://example.com/foo.tif can identify a
GeoTIFF along with all of its sidecars. I presume that the new GDAL cloud
utilities like gdal_cp.py take care of the auxiliary files, yes?

My final concern about the virtual file opening options is the syntax.
These /vsicurl/option1=val1[,optionN=valN]*,url=http://example.com/foo.tif
identifiers (or filenames or whatever we call them) may spread from GDAL
into the wider geospatial programming domain. Speaking from my experience
with Rasterio, open source Python GIS developers expect the /vsi* filenames
to "just work" in all software. Can we consider using a more standard
syntax? One that has parsers already deployed everywhere?

For example, /vsicurl?option1=foo&option2=bar&url=https://example.com/foo.tif
can be parsed by standard URL parsers such as Python's.

>>> from urllib.parse import urlparse, parse_qs
>>> urlparse('/vsicurl?option1=foo&option2=bar&url=https://example.com/foo.tif')
ParseResult(scheme='', netloc='', path='/vsicurl', params='', query='option1=foo&option2=bar&url=https://example.com/foo.tif', fragment='')
>>> parse_qs(_.query)
{'option1': ['foo'], 'option2': ['bar'], 'url': ['https://example.com/foo.tif']}

That syntax gives the /vsi* filenames the form of a "reflector" URL such as
we see in Google searches (for example:
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0ahUKEwjC6e7hvevWAhXmjFQKHWsHDyMQFggmMAA&url=http%3A%2F%2Fwww.gdal.org%2F&usg=AOvVaw3fbRv5TusYwkXgz2Acf2kt)
and there are abundant tools and a body of knowledge about how to parse and
work with these.
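
Here is a minimal sketch of what working with such names could look like,
using only the Python standard library. The /vsicurl?... form is the syntax
proposed here, not one GDAL accepts, and build_name and parse_name are
hypothetical helpers. Percent-encoding the embedded URL, as the reflector URLs
do, keeps its own query string from colliding with the options.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_name(url, **options):
    """Build a '/vsicurl?...' name in the proposed (hypothetical) query syntax."""
    # urlencode percent-encodes the target URL, including any '&' or '=' in it.
    return "/vsicurl?" + urlencode({**options, "url": url})


def parse_name(name):
    """Split a name into (handler path, target URL, options dict)."""
    parts = urlsplit(name)
    # parse_qs returns lists of values; this sketch keeps the first of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    url = params.pop("url")
    return parts.path, url, params


# A target URL with its own query string survives the round trip.
name = build_name("https://example.com/foo.tif?a=1&b=2", max_retry="10")
print(name)
print(parse_name(name))
```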

-- 
Sean Gillies