[gdal-dev] UTF-8 String Support in GDALOpen() and OGRSFDriverRegistrar::Open()

Mon Sep 7 16:18:50 EDT 2009

> A remaining question is: should we provide a 'compatibility mode' for users
> that only deal with non-ASCII characters in the ANSI range of their local
> code page and can currently use them successfully? This could be controlled
> by an environment variable (CPL_ANSI_FILENAMES=ON) that would revert to the A
> variants without any string conversions. Or maybe we can assume that the
> behaviour of current GDAL was undefined for any non-ASCII filename, so we
> can freely define it without dealing too much with backward compatibility
> issues.

That is an important question.

On localized versions of Windows, important paths can contain non-ASCII characters.
For example, if I recall correctly, on the Czech version of Windows XP the
"C:\Documents and Settings\username\Application Data" directory is localized
to "C:\Documents and Settings\username\Data aplikací". The last character in
that string is character 0xED in the Windows-1252 code page. (I guess
Microsoft was too scared to localize "Documents and Settings" but ok with
localizing "Application Data".) Also, I believe users can include non-ASCII
characters in their user name. This means that the path to the "My
Documents" folder will have non-ASCII characters in it. Because that is the
default location to save things for many Windows applications, it seems
important that GDAL would allow programs to read from there.

I am not 100% certain, but I believe the behavior of the current GDAL API is
to accept 8-bit strings and pass them through to the 8-bit operating system
functions. There are probably many programs that rely on this, allowing them
to work on Windows with non-ASCII characters in their paths. If you now
require UTF-8 instead, those programs would all have to be changed unless
you supplied the CPL_ANSI_FILENAMES=ON option, or something similar.
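
To make the concern concrete, here is a minimal sketch in Python (used only
for brevity; GDAL itself is C/C++). It assumes the Windows-1252 code page
mentioned above. The helper interpret_filename() and its ansi_filenames flag
are hypothetical stand-ins for the proposed CPL_ANSI_FILENAMES switch, not an
existing GDAL API.

```python
def interpret_filename(raw, ansi_filenames=False, codepage="cp1252"):
    """Decode an 8-bit filename as a UTF-8-only API or a compatibility mode would.

    Hypothetical helper: 'ansi_filenames' mimics the proposed
    CPL_ANSI_FILENAMES=ON switch; 'codepage' is an assumed local code page.
    """
    if ansi_filenames:
        # Compatibility mode: treat the bytes as local code page text.
        return raw.decode(codepage)
    # Proposed new behaviour: the bytes must be valid UTF-8.
    return raw.decode("utf-8")


# What a program built on the 8-bit ("A") functions would pass today.
legacy_bytes = "Data aplikací".encode("cp1252")

try:
    interpret_filename(legacy_bytes)
    utf8_ok = True
except UnicodeDecodeError:
    # 0xED starts a three-byte UTF-8 sequence, but no continuation bytes follow.
    utf8_ok = False

compat_name = interpret_filename(legacy_bytes, ansi_filenames=True)
```

In this sketch the legacy bytes are rejected as UTF-8 and are only readable
again once the compatibility flag is set, which is the situation existing
programs would be in.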

Best,

Jason

-----Original Message-----
From: gdal-dev-bounces at lists.osgeo.org
[mailto:gdal-dev-bounces at lists.osgeo.org] On Behalf Of Even Rouault
Sent: Monday, September 07, 2009 2:56 PM
To: Ivan
Cc: gdal-dev at lists.osgeo.org
Subject: Re: [gdal-dev] UTF-8 String Support in GDALOpen() and
OGRSFDriverRegistrar::Open()

Ivan <ivan.lucena at pmldnet.com> wrote:

> >
> > Folks,
> >
> > I wonder if we should implement some mechanism to support UTF-8 filenames
> > on windows (and generally) before GDAL 1.7 release?

That would definitely be a cool idea. Apart from Windows, I'm not sure whether
other supported OSes need work. As far as Linux is concerned, I believe we can
reasonably assume that UTF-8 strings are already passed to GDAL/OGR, as
nowadays all distros have switched their locales to UTF-8 (the move started
with RedHat 8.0 in 2002), although my readings show that the filesystem
encoding is not necessarily the same. I've looked a bit at the GLib 2.0
documentation, and they have invented the G_FILENAME_ENCODING and
G_BROKEN_FILENAMES environment variables to deal with those rare situations
(http://library.gnome.org/devel/glib/stable/glib-running.html,
http://library.gnome.org/devel/glib/stable/glib-Character-Set-Conversion.html).
All of this is rather confusing, but I don't think we need to go into that
level of complexity. As far as MacOSX is concerned, I can't say.

> >
> > How dangerous would it be for us to always assume filenames are UTF-8 and
> > act accordingly?
> >
> > One theoretical downside to treating filenames as UTF8 is that we do a lot
> > of filename parsing that has no concept that some bytes in the name might
> > be part of a multi-byte sequence.  So if there was a UTF8 multibyte
> > character that happened to include ASCII 92 '\' or ASCII 47 '/' it would
> > confuse the path parsers.  Also for subdatasets, database connections and
> > other esoteric datasource names we do a lot of parsing - splitting on
> > spaces, commas, quotes and other special characters.  Some of this could be
> > confused by unfortunate UTF-8 characters.  I suppose we really ought to
> > be migrating to doing these manipulations on wchar_t's or perhaps UCS-32
> > arrays.
> >
> > Hmm, this is getting rather complicated to address fully.

On the contrary, UTF-8 guarantees that you can't find a byte within the ASCII
range (0-127) inside a multi-byte UTF-8 character: every byte of a multi-byte
sequence has its most significant bit set to 1. Quoting Wikipedia: "The ASCII
characters are represented by themselves as single bytes that do not appear
anywhere else, which makes UTF-8 work with the majority of existing APIs that
take bytes strings but only treat a small number of ASCII codes specially".
So the existing byte-oriented parsing on '\', '/', spaces, commas and quotes
stays safe, and UTF-8 would definitely be a good choice as a Unicode encoding.
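
The guarantee above can be checked with a short Python sketch (Python only
for brevity). The function names are illustrative, the byte-level splitter is
a simplified stand-in for GDAL's path parsing rather than its real code, and
the file name "soubor.tif" is invented.

```python
def ascii_bytes_in_multibyte(text):
    """Return any ASCII-range bytes found inside multi-byte UTF-8 sequences."""
    found = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            found.extend(b for b in encoded if b < 0x80)
    return found


def split_path_bytes(raw):
    """Split raw path bytes on '/' and '\\' with no knowledge of UTF-8."""
    return [part for part in raw.replace(b"\\", b"/").split(b"/") if part]


# Hypothetical path; "soubor.tif" is an invented file name.
path = "C:\\Documents and Settings\\username\\Data aplikací\\soubor.tif"
parts = [p.decode("utf-8") for p in split_path_bytes(path.encode("utf-8"))]

# Check every encodable code point above ASCII (surrogates cannot be encoded).
all_high = all(
    min(chr(cp).encode("utf-8")) >= 0x80
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)
```

Splitting on the raw separator bytes never cuts through a multi-byte
character, so each component decodes cleanly back to text.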

> >
> > But at least as a hack we could provide a build or runtime mechanism to
> > tell cpl_vsil_win32.cpp code that the passed in filename should be
> > handled as UTF-8 instead of local code page characters on windows. Would
> > that be worth implementing?

Like Ivan, I think we must aim for the cleanest solution (at least at the
API level) to minimize the need for users to port their apps.

I have hardly any experience with this on Windows, but I think we should
target the wide-character (UTF-16) variants of the Windows API functions
rather than the local code page variants, since converting UTF-8/UTF-16 to
the local code page encoding can fail. Andrey mentioned CreateFileW in RFC 5.
_findfirst would likely need to be changed into _wfindfirst. You mention
cpl_vsil_win32.cpp, but cpl_vsil_simple.cpp would probably need changes too.
http://msdn.microsoft.com/en-us/library/yeby3zcb(VS.71).aspx mentions
_wfopen(), a wide-character version of fopen.
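
As a rough illustration of why the local code page route can fail while the
UTF-16 route does not, here is a small Python sketch. It assumes Windows-1252
as the local code page; the helper names and the second file name are invented
for the example, and the real conversion in GDAL would of course be done in
C/C++.

```python
def can_encode(name, encoding):
    """Return True if 'name' can be represented in 'encoding' without loss."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def to_wide(utf8_bytes):
    """Convert UTF-8 bytes to UTF-16LE bytes, as used by the W functions.

    The terminating NUL that a C wide string needs is omitted here.
    """
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


names = ["Data aplikací", "Česká data"]  # second name is an invented example
report = {n: (can_encode(n, "cp1252"), can_encode(n, "utf-16-le")) for n in names}
wide = to_wide("Česká data".encode("utf-8"))
```

The first name fits in Windows-1252, but the second does not, because "Č" has
no Windows-1252 code; both convert to UTF-16 without trouble.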

On Windows, GDAL/OGR applications would also need some changes to get their
command line options as UTF-8 arguments. I see that the GetCommandLineW() and
CommandLineToArgvW() functions exist for retrieving them as wide strings
(http://msdn.microsoft.com/en-us/library/ms683156(VS.85).aspx).

A remaining question is: should we provide a 'compatibility mode' for users
that only deal with non-ASCII characters in the ANSI range of their local
code page and can currently use them successfully? This could be controlled
by an environment variable (CPL_ANSI_FILENAMES=ON) that would revert to the A
variants without any string conversions. Or maybe we can assume that the
behaviour of current GDAL was undefined for any non-ASCII filename, so we
can freely define it without dealing too much with backward compatibility
issues.

Best regards,

Even
