[Gdal-dev] RFC DRAFT: Unicode support in GDAL
Andrey Kiselev
andrey.kiselev at gmail.com
Thu Sep 21 09:00:52 EDT 2006
On 9/21/06, Ben Discoe <ben at vterrain.org> wrote:
> I am happy to do this work and submit it, i am just waiting to hear from you
> (or Frank, or..? Confused by GDAL's new democracy) that i should proceed.
We discussed the issue on IRC today, and I would like to put forward the
following proposal.
Unicode support in GDAL.
There are three basic statements:
1. Users work in localized environments using their native languages. That
means we cannot assume the ASCII character set when working with string data
passed to GDAL.
2. GDAL uses the UTF-8 encoding internally when working with strings.
3. GDAL uses the Unicode version of a third-party API whenever possible.
So all strings used in GDAL are in UTF-8, not in plain ASCII. That means we
should convert the user's input from the local encoding to UTF-8 during
interactive sessions. The opposite should be done for GDAL output. For
example, when a user passes a filename as a command-line parameter to the GDAL
utilities, that filename should be immediately converted to UTF-8 and only
afterwards passed to functions like GDALOpen(). All functions which take
character strings as parameters assume UTF-8 (with the exception of several
that do the conversion between different encodings, see below). The same
is valid for output functions. The output functions embedded in GDAL
(CPLError/CPLDebug) should convert all strings from UTF-8 to the local
encoding before printing them. Custom error handlers should be aware of the
UTF-8 issue and do the proper transformation of strings passed to them.
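The local-to-UTF-8 step described above could be sketched on a POSIX system
with iconv(), roughly as follows. This is only an illustration under stated
assumptions: sketch_local_to_utf8() is a hypothetical name (not an existing
GDAL function), it uses malloc()/free() rather than the CPL allocators, and
it assumes the program has already called setlocale(LC_ALL, "") so that
nl_langinfo(CODESET) reports the user's real encoding.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a GDAL function: converts a NUL-terminated
 * string from the current locale's encoding to UTF-8. Returns a malloc()ed
 * string that the caller must free(), or NULL on failure. */
char *sketch_local_to_utf8(const char *local)
{
    if (local == NULL)
        return NULL;

    /* nl_langinfo(CODESET) names the locale's character encoding. It
     * reflects the user's environment only if setlocale(LC_ALL, "") has
     * been called earlier; otherwise the default "C" locale is used. */
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(local);
    size_t out_size = in_left * 4 + 1;   /* generous upper bound */
    size_t out_left = out_size - 1;      /* keep room for the NUL */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)local;        /* iconv() takes non-const char ** */
    char *out_ptr = out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {              /* invalid or unconvertible input */
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

The opposite direction (UTF-8 to local, for output) would swap the two
arguments of iconv_open(). A Windows build would need a different mechanism,
since this sketch relies on POSIX interfaces.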
The string encoding pops up again when GDAL needs to call a third-party API.
UTF-8 should be converted to the encoding suitable for that API. In particular,
that means we should convert UTF-8 to UTF-16 before calling the CreateFile()
function in the Windows implementation of VSIFOpenL().
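To make the UTF-8 to UTF-16 step concrete, here is a minimal portable sketch
of such a conversion. The names decode_utf8() and sketch_utf8_to_utf16() are
hypothetical and are not part of GDAL. The sketch uses uint16_t instead of
wchar_t so that it behaves the same on every platform, allocates with
malloc(), and replaces malformed input with U+FFFD, which is a design
assumption of this example rather than something the proposal specifies.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode one code point from the NUL-terminated string s into *cp and
 * return the number of bytes consumed. Malformed input yields U+FFFD
 * and consumes a single byte. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
    size_t need;     /* number of continuation bytes expected */
    uint32_t min;    /* smallest code point valid for this length */

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 1; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 2; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 3; min = 0x10000; }
    else { *cp = 0xFFFD; return 1; }

    for (size_t i = 1; i <= need; i++) {
        /* The terminating NUL is not a continuation byte, so this
         * check also stops us from reading past the end. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF)) {
        *cp = 0xFFFD;
        return 1;
    }
    return need + 1;
}

/* Hypothetical helper: returns a malloc()ed, zero-terminated array of
 * UTF-16 code units, or NULL on failure. The caller must free() it. */
uint16_t *sketch_utf8_to_utf16(const char *utf8)
{
    if (utf8 == NULL)
        return NULL;

    /* No UTF-8 byte produces more than one UTF-16 unit, so the byte
     * length plus one is a safe size for the output array. */
    uint16_t *out = malloc((strlen(utf8) + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    const unsigned char *s = (const unsigned char *)utf8;
    size_t o = 0;
    while (*s) {
        uint32_t cp;
        s += decode_utf8(s, &cp);
        if (cp >= 0x10000) {             /* encode as a surrogate pair */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[o++] = (uint16_t)cp;
        }
    }
    out[o] = 0;
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, an array like this is in principle
what the wide-character file API expects. A real implementation might prefer
the platform's own conversion routines over a hand-written decoder.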
For file format drivers the string representation should be worked out on a
per-driver basis. If a driver needs to parse ASCII text, there is no need to
convert strings to UTF-8 until they are passed to GDAL functions.
Notes on implementation:
1. New CPL functions:
// Convert UTF-8 encoded string into ASCII. Out-of-range characters
// are replaced with '?' in the output string.
char* CPLUTF8ToASCII(const char*);
// Convert UTF-8 encoded string into local encoding.
char* CPLUTF8ToLocal(const char*);
// Convert string from local encoding to UTF-8.
char* CPLLocalToUTF8(const char*);
// Convert string from UTF-8 encoding into array of wchar_t elements.
// Destination encoding is system specific.
wchar_t* CPLUTF8ToWide(const char*);
// Convert array of wchar_t elements into UTF-8 encoded string.
// Source encoding is system specific.
char* CPLWideToUTF8(wchar_t*);
2. To use non-ASCII characters in user input, every application should
call the setlocale(LC_ALL, "") function right after the entry point.
3. Code example. Let's look at how the GDAL utilities and core code should
be changed with regard to Unicode.
For input, instead of
    pszFilename = argv[i];
    if( pszFilename )
        hDataset = GDALOpen( pszFilename, GA_ReadOnly );
we should do
    pszFilename = CPLLocalToUTF8(argv[i]);
    if ( pszFilename )
    {
        hDataset = GDALOpen( pszFilename, GA_ReadOnly );
        CPLFree( pszFilename );
    }
For output, instead of
    printf( "Description = %s\n", GDALGetDescription(hBand) );
we should do
    char *pszDescription = CPLUTF8ToLocal( GDALGetDescription(hBand) );
    if ( pszDescription )
    {
        printf( "Description = %s\n", pszDescription );
        CPLFree( pszDescription );
    }
The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
above will be further processed in the GDAL core. On Windows, instead of
    hFile = CreateFile( pszFilename, dwDesiredAccess,
                        FILE_SHARE_READ | FILE_SHARE_WRITE,
                        NULL, dwCreationDisposition, dwFlagsAndAttributes, NULL );
we do
    wchar_t *pawcFilename = CPLUTF8ToWide( pszFilename );
    if ( pawcFilename )
    {
        // I prefer to call the wide-character version explicitly
        // rather than specify the _UNICODE switch.
        hFile = CreateFileW( pawcFilename, dwDesiredAccess,
                             FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                             dwCreationDisposition, dwFlagsAndAttributes, NULL );
        CPLFree( pawcFilename );
    }
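As an illustration of the simplest function listed in note 1, the UTF-8 to
ASCII conversion with '?' substitution could look roughly like the sketch
below. The name sketch_utf8_to_ascii() is hypothetical, the code uses
malloc() instead of the CPL allocators, and emitting one '?' per non-ASCII
character (rather than one per byte) is an assumption of this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the proposed CPL function itself: returns a
 * malloc()ed copy of utf8 in which every non-ASCII character is replaced
 * by a single '?'. The caller must free() the result. */
char *sketch_utf8_to_ascii(const char *utf8)
{
    if (utf8 == NULL)
        return NULL;

    size_t len = strlen(utf8);
    char *out = malloc(len + 1);   /* the output is never longer than the input */
    if (out == NULL)
        return NULL;

    size_t i = 0, o = 0;
    while (i < len) {
        unsigned char c = (unsigned char)utf8[i];
        if (c < 0x80) {
            /* Plain ASCII byte: copy it unchanged. */
            out[o++] = (char)c;
            i++;
        } else {
            /* Start of a multi-byte character (or a stray byte):
             * emit one '?' and skip any continuation bytes (10xxxxxx)
             * that follow it. */
            out[o++] = '?';
            i++;
            while (i < len && ((unsigned char)utf8[i] & 0xC0) == 0x80)
                i++;
        }
    }
    out[o] = '\0';
    return out;
}
```

With this sketch, the UTF-8 bytes for "café" would come back as "caf?".
The function does not validate its input, so a malformed byte sequence is
also collapsed into '?'.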
References:
1. FAQ on how to use Unicode in software:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
2. FLTK implementation of string conversion functions:
http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
--
Andrey V. Kiselev
ICQ# 26871517