[gdal-dev] RFC 30: Unicode Filenames - call for discussion

Wed Sep 15 16:16:51 EDT 2010

Frank,

About the question "Do we need to convert to UCS-16 to do parsing or can we 
safely assume that special characters like '/', '.', '\' and ':' never occur 
as part of UTF-8 multi-byte sequences?", I wasn't sure exactly what you meant 
(I assume "UCS-16" means UTF-16, the wide-character encoding used on Windows), 
but here are my findings:
* In a UTF-8 string, any (unsigned) byte whose value is <= 127 is guaranteed 
to be a single-byte character, namely the one defined in the ASCII set (every 
byte of a multibyte character has a value >= 128). So '/', '.', '\' and ':' 
can never occur inside a UTF-8 multi-byte sequence.
* For a UTF-16 byte stream, I had a hard time finding the answer. Basically, 
any Unicode character in the range 0x0000 to 0xFFFF (surrogates excluded) is 
encoded directly as a single 16-bit code unit in UTF-16. Take the dot 
character: its ASCII value is 0x2E, so in UTF-16 it is the code unit 0x002E. 
But 0x2E2E is also a valid Unicode character ('⸮', the reversed question 
mark), so we cannot trust a 0x2E byte found in a UTF-16 byte stream to always 
be a dot. The stream has to be examined one 16-bit code unit at a time.
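
To illustrate the two points above, here is a small Python sketch (not GDAL 
code; the function names and the sample filename are made up for the 
example). It searches for the 0x2E byte in the UTF-8 and UTF-16 
(little-endian) encodings of a name that contains both 'é' and U+2E2E.

```python
def byte_offsets(data: bytes, ascii_char: str) -> list[int]:
    """Offsets of every byte equal to the code of ascii_char."""
    code = ord(ascii_char)
    return [i for i, b in enumerate(data) if b == code]


def utf16_offsets(data: bytes, ascii_char: str) -> list[int]:
    """Byte offsets of ascii_char when data is read as 16-bit LE code units."""
    code = ord(ascii_char)
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    return [2 * i for i, u in enumerate(units) if u == code]


# Hypothetical filename: "données/⸮.tif"
name = "donn\u00e9es/\u2e2e.tif"
utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

print(byte_offsets(utf8, "."))    # [12]: the only 0x2E byte is the real dot
print(byte_offsets(utf16, "."))   # [16, 17, 18]: two of them belong to U+2E2E
print(utf16_offsets(utf16, "."))  # [18]: scanning by code unit finds the dot
```

In other words, a byte-wise search for '.' or '/' is safe on UTF-8 but not 
on UTF-16, where the search has to work on whole code units.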

About command line applications, I was a bit concerned about backward 
compatibility, because there are existing working applications that pass 
non-ASCII filenames in the ANSI code page to the GDAL command line utilities. 
I've tested your preliminary implementation with a filename containing an 'é' 
(e-acute) character on a Windows platform that, I think, uses CP1252 
(~ ISO-LATIN-1) as its code page. I expected a failure in the UTF-8 -> UTF-16 
translation, since the passed filename wasn't UTF-8. It turns out that it 
actually works: the UTF-8 -> wide char conversion routine in 
cpl_recode_stub.cpp has a special case. When it fails to decode an 
(apparently) multibyte UTF-8 character, it assumes the bytes are CP1252 and 
converts them to UTF-16 correctly. So this is good news for people who have 
CP1252 as their current code page!
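
As an illustration of that kind of fallback, here is a minimal Python sketch. 
It is not the code from cpl_recode_stub.cpp, only the general idea, and the 
names `decode_filename` and `cp1252_fallback` are hypothetical: decode as 
UTF-8, and reinterpret any byte that is not valid UTF-8 as CP1252.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: reinterpret the offending bytes as CP1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # A few byte values are undefined in CP1252; those become U+FFFD.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_filename(raw: bytes) -> str:
    """Decode raw as UTF-8, falling back to CP1252 for invalid bytes."""
    return raw.decode("utf-8", errors="cp1252_fallback")


# 'é' is the single byte 0xE9 in CP1252 and the two bytes 0xC3 0xA9 in UTF-8.
print(decode_filename(b"caf\xe9.tif"))
print(decode_filename("caf\u00e9.tif".encode("utf-8")))
```

Both calls give the same name. The heuristic is not airtight, though: a 
CP1252 byte sequence that happens to be valid UTF-8 (for example 0xC3 0xA9, 
which is "Ã©" in CP1252) is decoded as UTF-8.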

But apparently, there's a way for Windows command line applications to get 
their arguments as UTF-16 strings. These could then be converted reliably to 
UTF-8 before being fed into GDAL. Here's what I found in MSDN:
* GetCommandLineW : 
http://msdn.microsoft.com/en-us/library/ms683156%28VS.85%29.aspx
* CommandLineToArgvW : 
http://msdn.microsoft.com/en-us/library/bb776391%28VS.85%29.aspx

The issue is that, if we go down this route, drivers that still use the old 
VSI API (the POSIX one) won't work anymore with non-ASCII filenames... So I'm 
not sure it's worth the pain: it would only benefit people who currently use 
the command line utilities successfully with a non-CP1252 code page.

About the Java bindings, nothing needs to change. Java strings are encoded in 
Unicode (UTF-16), and the typemaps we use already convert automatically 
to/from UTF-8 on the C side (with GetStringUTFChars() from Java to C, and 
NewStringUTF() from C to Java).

Best regards,

Even