[gdal-dev] RFC 30: Unicode Filenames - call for discussion
Even Rouault
even.rouault at mines-paris.org
Wed Sep 15 16:16:51 EDT 2010
Frank,
About the question "Do we need to convert to UCS-16 to do parsing or can we
safely assume that special characters like '/', '.', '\' and ':' never occur
as part of UTF-8 multi-byte sequences?", I wasn't sure exactly what you meant,
but here are my findings/beliefs:
* In a UTF-8 string, any (unsigned) byte whose value is <= 127 is guaranteed
to be a complete single-byte character, namely the one ASCII defines for that
value (every byte of a multibyte character has a value >= 128).
* Now, if we look at a UTF-16 byte stream, I had a hard time finding the
answer. But basically any Unicode character >= 0x0000 && <= 0xFFFF (outside
the surrogate range 0xD800-0xDFFF) is directly "converted" into a single
16-bit code unit in UTF-16. If we consider the point character, its ASCII
value is 0x2E, so in Unicode/UTF-16 it is "0x002E". But "0x2E2E" is also a
valid Unicode character ('⸮', the reversed question mark), so we cannot trust
a "0x2E" byte found in a UTF-16 byte stream to always be a point character.
Unlike UTF-8, such a stream cannot be scanned byte by byte for these
characters.
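
To make the two points above concrete, here is a small Python sketch. The
helper name dot_byte_offsets() and the sample filename are made up for
illustration; nothing here comes from GDAL or CPL. It looks for the raw 0x2E
byte in the UTF-8 and UTF-16 encodings of a name that contains '⸮' (U+2E2E):

```python
def dot_byte_offsets(data):
    """Return the offsets of every raw 0x2E byte in a byte string."""
    return [i for i, b in enumerate(data) if b == 0x2E]


# Hypothetical filename: 'a', U+2E2E (reversed question mark), 'b', '.tif'
name = "a\u2e2eb.tif"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

# In UTF-8 every byte of the multibyte character is >= 0x80,
# so the only 0x2E byte is the real '.' before the extension.
assert all(b >= 0x80 for b in "\u2e2e".encode("utf-8"))
print(dot_byte_offsets(utf8))

# In UTF-16 (little endian here) both bytes of U+2E2E are 0x2E,
# so a byte-wise search reports dots that are not there.
print(dot_byte_offsets(utf16))
```

The UTF-8 search finds a single offset (the real dot), while the UTF-16
search also hits the two bytes of U+2E2E. Scanning UTF-16 by 16-bit code
units rather than by bytes would avoid the false hits.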
About command line applications, I was a bit concerned about backward
compatibility because there are existing, working applications that pass
non-ASCII filenames in the ANSI code page to the gdal command line utilities.
I've tested your preliminary implementation with a filename containing an 'é'
(e-acute) character on a Windows platform that I think uses CP1252
(~ ISO-LATIN-1) as the code page. I expected a failure in the UTF-8 -> UTF-16
translation since the passed filename wasn't UTF-8. It turns out that it
actually works, since the UTF-8 -> wide char conversion routine in
cpl_recode_stub.cpp has a special case: when it doesn't manage to translate an
(apparently) multibyte UTF-8 character, it assumes the bytes are CP1252 and
converts them into UTF-16 correctly. So this is good news for people having
CP1252 as their current code page!
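
For readers who want to see the idea, here is a rough Python sketch of that
kind of fallback. It is not the code from cpl_recode_stub.cpp, only an
imitation of the behaviour described above; the handler name
"cp1252_fallback" and the function utf8_or_cp1252() are invented for the
example.

```python
import codecs


def _cp1252_fallback(err):
    """Decode the bytes that are not valid UTF-8 as CP1252 instead."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # A few byte values are undefined in CP1252; replace those with U+FFFD.
    return bad.decode("cp1252", errors="replace"), err.end


# Register the handler under a made-up name so bytes.decode() can use it.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def utf8_or_cp1252(raw):
    """Decode as UTF-8, treating invalid sequences as CP1252 bytes."""
    return raw.decode("utf-8", errors="cp1252_fallback")


# A CP1252 filename and its UTF-8 equivalent decode to the same string.
print(utf8_or_cp1252("caf\u00e9.tif".encode("cp1252")))
print(utf8_or_cp1252("caf\u00e9.tif".encode("utf-8")))
```

Note the limitation of any such heuristic: a CP1252 byte sequence that
happens to be valid UTF-8 will be taken as UTF-8.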
But apparently, there's a way for Windows command line applications to get
their arguments as UTF-16 strings. These could then be reliably converted
to UTF-8 before being fed into GDAL. Here's what I found in MSDN:
* GetCommandLineW: http://msdn.microsoft.com/en-us/library/ms683156%28VS.85%29.aspx
* CommandLineToArgvW: http://msdn.microsoft.com/en-us/library/bb776391%28VS.85%29.aspx
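
A minimal sketch of how these two calls fit together, written in Python with
ctypes for brevity (the utilities themselves would do this in C/C++). The
function name command_line_as_utf8() is hypothetical, and the non-Windows
branch is only an assumption added so that the example runs anywhere.

```python
import sys


def command_line_as_utf8():
    """Return the process arguments as a list of UTF-8 encoded bytes."""
    if sys.platform != "win32":
        # Assumption for this sketch: elsewhere, just re-encode sys.argv.
        return [arg.encode("utf-8", "surrogateescape") for arg in sys.argv]

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.windll.kernel32
    shell32 = ctypes.windll.shell32
    kernel32.GetCommandLineW.argtypes = []
    kernel32.GetCommandLineW.restype = wintypes.LPCWSTR
    shell32.CommandLineToArgvW.argtypes = [
        wintypes.LPCWSTR, ctypes.POINTER(ctypes.c_int)]
    shell32.CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)
    kernel32.LocalFree.argtypes = [ctypes.c_void_p]
    kernel32.LocalFree.restype = ctypes.c_void_p

    # Split the wide command line into a wide-character argv array.
    argc = ctypes.c_int(0)
    argv = shell32.CommandLineToArgvW(
        kernel32.GetCommandLineW(), ctypes.byref(argc))
    if not argv:
        raise ctypes.WinError()
    try:
        # Each entry is a UTF-16 string; convert it to UTF-8 bytes.
        return [argv[i].encode("utf-8", "surrogatepass")
                for i in range(argc.value)]
    finally:
        # The array returned by CommandLineToArgvW is freed with LocalFree.
        kernel32.LocalFree(argv)


print(command_line_as_utf8())
```

When run through the Python interpreter on Windows, the list also contains
the interpreter itself and its options; in a C program it would correspond to
the usual argv.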
The issue is that, if we go down this route, drivers that still use the old
VSI API (the POSIX one) won't work anymore with non-ASCII filenames... So I'm
not sure it's worth the pain: it would only be worthwhile for people who
currently use the command line utilities successfully with a non-CP1252 code
page.
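About the Java bindings, nothing needs to change. Java strings are encoded in
Unicode (UTF-16) and the typemaps we use already convert automatically to/from
UTF-8 on the C side (with GetStringUTFChars() from Java to C, and
NewStringUTF() from C to Java). Strictly speaking, these JNI functions use
"modified UTF-8", which differs from standard UTF-8 only for U+0000 and for
characters outside the BMP.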
Best regards,
Even