[gdal-dev] UTF-8 String Support in GDALOpen() and OGRSFDriverRegistrar::Open()

Mon Sep 7 14:56:00 EDT 2009

Quoting Ivan <ivan.lucena at pmldnet.com>:

> >
> > Folks,
> >
> > I wonder if we should implement some mechanism to support UTF-8 filenames
> > on windows (and generally) before GDAL 1.7 release?

That would definitely be a cool idea. Apart from Windows, I'm not sure whether
other supported OSes need work. As far as Linux is concerned, I believe that we
can reasonably assume that UTF-8 strings are already passed to GDAL/OGR, as
nowadays all distros have switched their locales to UTF-8 (the move started with
RedHat 8.0 in 2002), although my reading shows that the filesystem encoding is
not necessarily the same. I've looked a bit at the GLib 2.0 documentation and
they have invented G_FILENAME_ENCODING and G_BROKEN_FILENAMES to deal with those
rare situations (http://library.gnome.org/devel/glib/stable/glib-running.html,
http://library.gnome.org/devel/glib/stable/glib-Character-Set-Conversion.html).
All of this is rather confusing, but I don't think we need to go into that level
of complexity. As far as MacOSX is concerned, I can't say.

> >
> > How dangerous would it be for us to always assume filenames are UTF-8 and
> > act accordingly?
> >
> > One theoretical downside to treating filenames as UTF8 is that we do a lot
> > of filename parsing that has no concept that some bytes in the name might
> > be part of a multi-byte sequence.  So if there was a UTF8 multibyte
> > character that happened to include ASCII 92 '\' or ASCII 47 '/' it would
> > confuse the path parsers.  Also for subdatasets, database connections and
> > other esoteric datasource names we do a lot of parsing - splitting on
> > spaces, commas, quotes and other special characters.  Some of this could be
> > confused by unfortunate UTF-8 characters.  I suppose we really ought to
> > be migrating to doing these manipulations on wchar_t's or perhaps UCS-32
> > arrays.
> >
> > Hmm, this is getting rather complicated to address fully.

On the contrary, UTF-8 guarantees that a byte in the ASCII range (0-127) never
appears inside a multi-byte UTF-8 character: every byte of a multi-byte
sequence has its most significant bit set to 1. Quoting Wikipedia: "The ASCII
characters are represented by themselves as single bytes that do not appear
anywhere else, which makes UTF-8 work with the majority of existing APIs that
take bytes strings but only treat a small number of ASCII codes specially". So
the existing byte-wise parsing on '\', '/', spaces, commas and quotes cannot be
confused by UTF-8 input, and UTF-8 would definitely be a good choice as a
Unicode encoding.
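
To illustrate the point, here is a minimal sketch of a byte-wise path split
that stays correct on UTF-8 input. Utf8Basename() and HasMultiByte() are
hypothetical names made up for this mail, not existing CPL functions.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the part of a UTF-8 path after the last '/'
// or '\\'. The scan is purely byte-wise. Every byte of a multi-byte UTF-8
// sequence is >= 0x80, so it can never compare equal to an ASCII separator.
std::string Utf8Basename(const char *pszUtf8Path)
{
    assert(pszUtf8Path != nullptr);
    const char *pszBase = pszUtf8Path;
    for (const char *p = pszUtf8Path; *p != '\0'; ++p)
    {
        if (*p == '/' || *p == '\\')
            pszBase = p + 1;
    }
    return std::string(pszBase);
}

// Hypothetical helper: returns true if the string holds at least one byte
// outside the ASCII range, that is, at least one multi-byte UTF-8 sequence.
bool HasMultiByte(const char *pszUtf8)
{
    assert(pszUtf8 != nullptr);
    const unsigned char *p = reinterpret_cast<const unsigned char *>(pszUtf8);
    for (; *p != 0; ++p)
    {
        if (*p >= 0x80)
            return true;
    }
    return false;
}
```

The same reasoning applies to splitting on spaces, commas or quotes in
subdataset and connection strings, since those delimiters are all ASCII.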

> >
> > But at least as a hack we could provide a build or runtime mechanism to
> > tell cpl_vsil_win32.cpp code that the passed in filename should be
> > handled as UTF-8 instead of local code page characters on windows.  Would
> > that be worth implementing?

Like Ivan, I think we must aim for the cleanest solution (at least at the
API level) to minimize the need for users to port their apps.

I have hardly any experience with this on Windows, but I think we should
target the wide-character (UTF-16) variants of the Windows API functions
rather than the local code page ones, since converting UTF-8/UTF-16 to the
local code page encoding can fail. Andrey mentioned CreateFileW in RFC5.
_findfirst would likely need to be changed into _wfindfirst. You mention
cpl_vsil_win32.cpp, but cpl_vsil_simple.cpp would probably need changes too.
http://msdn.microsoft.com/en-us/library/yeby3zcb(VS.71).aspx mentions a
_wfopen() wide-character version of fopen.
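
As a rough sketch of what the fopen side could look like (not a patch, and
untested inside GDAL): convert the UTF-8 name to UTF-16 with
MultiByteToWideChar() and hand it to _wfopen(). OpenUtf8() and Utf8ToWide()
are hypothetical names, and on other platforms the sketch simply assumes the
name can be passed to fopen() unchanged.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert UTF-8 to UTF-16 with the Win32 API. Returns an empty string if the
// input is empty or is not valid UTF-8.
static std::wstring Utf8ToWide(const std::string &utf8)
{
    if (utf8.empty())
        return std::wstring();
    const int nBytes = static_cast<int>(utf8.size());
    // First call: ask for the required number of UTF-16 code units.
    const int nWide = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                          utf8.data(), nBytes, nullptr, 0);
    if (nWide <= 0)
        return std::wstring();
    std::wstring wide(static_cast<size_t>(nWide), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), nBytes, &wide[0], nWide);
    return wide;
}
#endif

// Hypothetical helper: open a file whose name is given in UTF-8.
// Returns nullptr on failure, like fopen().
std::FILE *OpenUtf8(const std::string &utf8Path, const char *pszMode)
{
    assert(pszMode != nullptr);
#ifdef _WIN32
    const std::wstring widePath = Utf8ToWide(utf8Path);
    const std::wstring wideMode = Utf8ToWide(pszMode);
    if (widePath.empty() || wideMode.empty())
        return nullptr;
    return _wfopen(widePath.c_str(), wideMode.c_str());
#else
    // Assumption: the filesystem accepts the UTF-8 bytes as they are.
    return std::fopen(utf8Path.c_str(), pszMode);
#endif
}
```

The UTF-8 to UTF-16 step only fails on malformed input, which is the
advantage over going through the local code page.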

On Windows, GDAL/OGR applications would also need some changes to get their
command line options as UTF-8 arguments. I see the GetCommandLineW()/
CommandLineToArgvW() functions for that
(http://msdn.microsoft.com/en-us/library/ms683156(VS.85).aspx).
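
For the utilities, something along these lines might do. It is only a sketch:
GetUtf8Args() and WideToUtf8() are hypothetical names, and on non-Windows
platforms it assumes argv is already UTF-8 and copies it as is.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW(), link with Shell32.lib

// Convert a NUL-terminated UTF-16 string to UTF-8.
static std::string WideToUtf8(const wchar_t *pwszWide)
{
    // With a length of -1 the returned size includes the terminating NUL.
    const int nLen = WideCharToMultiByte(CP_UTF8, 0, pwszWide, -1,
                                         nullptr, 0, nullptr, nullptr);
    if (nLen <= 1)
        return std::string();
    std::string utf8(static_cast<size_t>(nLen), '\0');
    WideCharToMultiByte(CP_UTF8, 0, pwszWide, -1,
                        &utf8[0], nLen, nullptr, nullptr);
    utf8.resize(static_cast<size_t>(nLen) - 1);  // drop the NUL
    return utf8;
}
#endif

// Hypothetical helper: return the program arguments as UTF-8 strings.
std::vector<std::string> GetUtf8Args(int argc, char **argv)
{
    std::vector<std::string> args;
#ifdef _WIN32
    // Ignore the code-page encoded argv and re-read the wide command line.
    (void)argc;
    (void)argv;
    int nWideArgc = 0;
    LPWSTR *papwszArgv = CommandLineToArgvW(GetCommandLineW(), &nWideArgc);
    if (papwszArgv == nullptr)
        return args;
    for (int i = 0; i < nWideArgc; ++i)
        args.push_back(WideToUtf8(papwszArgv[i]));
    LocalFree(papwszArgv);
#else
    // Assumption: the locale is UTF-8, so argv can be used unchanged.
    for (int i = 0; i < argc; ++i)
        args.push_back(argv[i]);
#endif
    return args;
}
```

An application would call GetUtf8Args(argc, argv) at the top of main() and
use the returned strings instead of argv from then on.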

A remaining question is: should we provide a 'compatibility mode' for users
who only deal with non-ASCII characters in the ANSI range of their local code
page and can currently use them successfully? This could be controlled by an
environment variable (CPL_ANSI_FILENAMES=ON) that would revert to the A variants
without any string conversions. Or maybe we can assume that the
behaviour of current GDAL was undefined for any non-ASCII filename, so we can
freely define it without dealing too much with backward compatibility issues.
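
If we went the environment variable way, the check itself would be trivial.
A minimal sketch, assuming the CPL_ANSI_FILENAMES name proposed above and
assuming that YES, TRUE and 1 are accepted alongside ON (that list is my
guess, nothing is decided). IsTruthy() and UseAnsiFilenames() are
hypothetical names, and real GDAL code would more likely read the value
through CPLGetConfigOption() than through getenv().

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: interpret an option value as a boolean.
// An unset (null) value, an empty value or anything unrecognised is false.
bool IsTruthy(const char *pszValue)
{
    if (pszValue == nullptr)
        return false;
    std::string v(pszValue);
    for (char &c : v)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return v == "ON" || v == "YES" || v == "TRUE" || v == "1";
}

// Hypothetical switch: true means "keep calling the A variants and do not
// convert filenames", false means "treat filenames as UTF-8".
bool UseAnsiFilenames()
{
    return IsTruthy(std::getenv("CPL_ANSI_FILENAMES"));
}
```

The Windows file code would then consult UseAnsiFilenames() once and pick
either the A or the W code path.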

Best regards,

Even