[gdal-dev] RFC 30: Unicode Filenames - call for discussion

Tue Sep 21 16:48:39 EDT 2010

Even Rouault wrote:
> Frank,
> 
> About the question "Do we need to convert to UCS-16 to do parsing or can we 
> safely assume that special characters like '/', '.', '\' and ':' never occur 
> as part of UTF-8 multi-byte sequences?", I was unclear what you really meant, 
> but here are my findings/beliefs :
> * In a UTF-8 string, any (unsigned) byte whose value is <= 127 is guaranteed 
> to be a single-character that is the one defined in the ASCII set (so any 
> multibyte character has bytes whose value is >= 128)
> * Now, if we look at a UTF-16 byte stream, I had a hard time finding the 
> answer. But basically any Unicode character >= 0x0000 && <= 0xFFFF is directly 
> "converted" into a single 16-bit code unit in UTF-16. If we consider the 
> point character, its ASCII value is 0x2E, and in Unicode/UTF-16 it is thus 
> "0x002E". But "0x2E2E" is a valid Unicode character ('⸮', the reversed 
> question mark), so we cannot trust a "0x2E" byte found in the 
> UTF-16 byte stream to always be a point character.

Even,

Sorry, I was not clear.  I was wondering if it was safe to assume that a
'/', '\', ':' or '.' byte in a UTF-8 string would always represent
the corresponding ASCII character.  Your response seems to indicate this
is true, so we do not need to convert UTF-8 strings to UTF-16 before doing
path/extension parsing.  This means we can just leave the parsing functions
as they are.
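
To illustrate the point, here is a minimal sketch in C. Both function
names are hypothetical and are not part of GDAL. The first helper finds a
filename extension by scanning raw bytes without ever decoding UTF-8; the
second counts 0x2E bytes in an arbitrary buffer, which shows why the same
byte-wise approach would not be safe on a UTF-16 byte stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a GDAL function): returns a pointer to the
 * extension, i.e. the text after the last '.' in the final path
 * component, or a pointer to the terminating NUL when there is none.
 * It inspects raw bytes only. This is safe for UTF-8 because every byte
 * of a multi-byte sequence is >= 0x80, so 0x2E, 0x2F, 0x5C and 0x3A can
 * only ever be the ASCII characters '.', '/', '\' and ':'. */
const char *utf8_path_extension(const char *path)
{
    assert(path != NULL);
    size_t len = strlen(path);
    for (size_t i = len; i > 0; i--) {
        unsigned char c = (unsigned char)path[i - 1];
        if (c == '/' || c == '\\' || c == ':')
            break;              /* reached a directory or drive separator */
        if (c == '.')
            return path + i;    /* extension starts after the dot */
    }
    return path + len;          /* no extension: empty string */
}

/* Hypothetical helper: counts bytes equal to 0x2E in a buffer,
 * regardless of the encoding the buffer holds. */
size_t count_dot_bytes(const unsigned char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0x2E)
            count++;
    }
    return count;
}
```

For U+2E2E, the UTF-8 encoding is the three bytes E2 B8 AE, so
count_dot_bytes() finds no 0x2E byte, while the UTF-16 encoding is the two
bytes 2E 2E, which a byte-wise scan would mistake for two dots.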

> About command line applications, I was a bit concerned about backward 
> compatibility because there are existing working applications that pass ANSI 
> non-ascii filenames to gdal command line utilities. I've tested your 
> preliminary implementation with a file containing a 'é' (e-acute) character on 
> a Windows platform that I think uses CP1252 (~ ISO-LATIN-1) as the codepage. I 
> expected a failure in the UTF-8 -> UTF-16 translation since the passed filename 
> wasn't UTF-8. It turns out that it actually works, since the UTF-8 -> wide 
> char conversion routine in cpl_recode_stub.cpp has a special case: when it 
> doesn't manage to translate an (apparently) multibyte UTF-8 character, it 
> assumes the bytes are CP1252 and converts them to UTF-16 correctly. So this is 
> good news for people who have CP1252 as their current code page!
> 
> But apparently, there's a way for command line Windows applications to get 
> their arguments as UTF-16 strings. These could then be converted reliably 
> to UTF-8 and fed into GDAL. Here's what I found in MSDN:
> * GetCommandLineW : http://msdn.microsoft.com/en-us/library/ms683156%28VS.85%29.aspx
> * CommandLineToArgvW : http://msdn.microsoft.com/en-us/library/bb776391%28VS.85%29.aspx

I have confirmed that the above functions can be used to get UCS-2
(wide character) filename arguments that can be successfully converted
to UTF-8 and passed to GDALOpen().  I am hesitant to change all the
GDAL utilities to use this functionality inline; however, it would be
pretty easy to modify GDALGeneralCmdLineProcessor() and
OGRGeneralCmdLineProcessor() to ignore the passed-in
list of options and instead refetch and reparse them using wide
strings on Windows.  In normal use of these functions that would
do the trick nicely, but it will defeat any efforts to do tricky stuff
with these functions (like application-injected options).

I will tentatively write up the RFC to do so despite the possible
downside risks.
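
As a rough sketch only (this is not the code in my preliminary
implementation), refetching the arguments as wide strings and converting
them to UTF-8 could look something like the following. The helper names
utf16_to_utf8() and fetch_utf8_args() are hypothetical; only
GetCommandLineW(), CommandLineToArgvW() and LocalFree() are real Win32
calls. The conversion is written by hand so the example stays portable,
and it assumes wchar_t is 16 bits wide, as it is on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: encodes n UTF-16 code units as a newly allocated,
 * NUL-terminated UTF-8 string (caller frees). Unpaired surrogates are
 * replaced with U+FFFD. */
char *utf16_to_utf8(const uint16_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    char *out = malloc(n * 3 + 1);   /* at most 3 bytes per code unit */
    if (out == NULL)
        return NULL;
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* combine a surrogate pair into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;             /* unpaired surrogate */
        }
        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW; link with Shell32 */
#include <wchar.h>

/* Hypothetical helper: ignores the argv passed to main() and refetches
 * the arguments as wide strings, returning a NULL-terminated array of
 * UTF-8 strings (caller frees each entry and the array). */
char **fetch_utf8_args(int *argc_out)
{
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), argc_out);
    if (wargv == NULL)
        return NULL;
    char **argv8 = calloc((size_t)*argc_out + 1, sizeof(char *));
    for (int i = 0; argv8 != NULL && i < *argc_out; i++)
        argv8[i] = utf16_to_utf8((const uint16_t *)wargv[i],
                                 wcslen(wargv[i]));
    LocalFree(wargv);
    return argv8;
}
#endif
```

A general command line processor could call something like
fetch_utf8_args() on Windows in place of using the argument list handed
to it, which is exactly why application-injected options would be lost.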

> The issue is that, if we go down this route, drivers that still use the old VSI 
> API (the POSIX one) won't work anymore with non-ASCII filenames... So I'm not sure 
> it's worth the pain: it would only be worthwhile for people who currently use 
> the command line utilities successfully with a non-CP1252 code page.

While it is true that exotic characters converted to UTF-8 will not
work with fopen() (which underlies VSIFOpen()), I don't see how we are any
worse off than if we didn't do this special processing?

> About Java bindings, nothing to change. Java strings are encoded in Unicode 
> (UTF-16) and the typemaps we use already automatically convert to/from UTF-8 
> on the C side (with GetStringUTFChars() from Java to C, and NewStringUTF() 
> from C to Java)

Excellent, I have noted as much in the RFC.

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | Geospatial Programmer for Rent