[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Andrey Kiselev andrey.kiselev at gmail.com
Thu Sep 21 09:00:52 EDT 2006


On 9/21/06, Ben Discoe <ben at vterrain.org> wrote:
> I am happy to do this work and submit it, i am just waiting to hear from you
> (or Frank, or..? Confused by GDAL's new democracy) that i should proceed.

We discussed the issue on IRC today, and I would like to put forward the
following proposal.

Unicode support in GDAL.

There are three basic statements:

1. Users work in a localized environment using their native languages. That
   means we cannot assume the ASCII character set when working with string
   data passed to GDAL.

2. GDAL uses UTF-8 encoding internally when working with strings.

3. GDAL uses the Unicode version of a third-party API whenever possible.

So all strings used in GDAL are in UTF-8, not in plain ASCII. That means we
should convert the user's input from the local encoding to UTF-8 during
interactive sessions, and do the opposite for GDAL output. For example, when
a user passes a filename as a command-line parameter to the GDAL utilities,
that filename should be converted to UTF-8 immediately and only afterwards
passed to functions like GDALOpen(). All functions which take character
strings as parameters assume UTF-8 (with the exception of a few which do the
conversion between different encodings, see below). The same is valid for
output functions. The output functions embedded in GDAL (CPLError/CPLDebug)
should convert all strings from UTF-8 to the local encoding before printing
them. Custom error handlers should be aware of the UTF-8 issue and do the
proper transformation of the strings passed to them.
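
To make the input side concrete, here is a minimal sketch of a local-to-UTF-8
conversion built on the standard C mbrtowc() function. It is only an
illustration, not the proposed GDAL implementation: the names
local_to_utf8_sketch() and encode_utf8() are hypothetical, and the code
assumes that wchar_t holds Unicode code points (true where
__STDC_ISO_10646__ is defined, e.g. glibc, but not on Windows, where wchar_t
is UTF-16). The result is allocated with malloc() and released with free().

```c
#include <assert.h>
#include <locale.h>   /* setlocale() for the caller, see the note at the end */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode one code point as UTF-8, return bytes written. */
static size_t encode_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical stand-in for the idea behind CPLLocalToUTF8().
 * Assumes wchar_t values are Unicode code points.
 * Returns NULL on allocation failure or undecodable input. */
static char *local_to_utf8_sketch(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    /* Every input character uses at least 1 byte and yields at most 4. */
    char *out = malloc(4 * len + 1);
    if (out == NULL)
        return NULL;

    mbstate_t state;
    memset(&state, 0, sizeof state);
    size_t n = 0;
    while (len > 0) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, src, len, &state);
        if (used == (size_t)-1 || used == (size_t)-2) {
            free(out);              /* invalid or truncated sequence */
            return NULL;
        }
        if (used == 0)
            break;                  /* defensive: embedded NUL */
        n += encode_utf8((unsigned long)wc, out + n);
        src += used;
        len -= used;
    }
    out[n] = '\0';
    return out;
}

/* Typical use, right after the entry point:
 *     setlocale(LC_ALL, "");
 *     char *utf8 = local_to_utf8_sketch(argv[1]);
 */
```

mbrtowc() interprets its input according to the LC_CTYPE category of the
current locale, so without the setlocale() call the program stays in the "C"
locale and non-ASCII input may not be decoded.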

The string encoding issue pops up again when GDAL needs to call a third-party
API. UTF-8 should be converted to an encoding suitable for that API. In
particular, that means we should convert UTF-8 to UTF-16 before calling the
CreateFile() function in the Windows implementation of VSIFOpenL().
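
As an illustration of that conversion step, the following sketch turns a
UTF-8 string into UTF-16 code units in portable C. The names utf8_decode()
and utf8_to_utf16() are hypothetical, and replacing malformed bytes with
U+FFFD is an assumption rather than something decided in this proposal. On
Windows itself, the MultiByteToWideChar() API with the CP_UTF8 code page is
an existing alternative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is malformed, overlong, or a surrogate. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len;
    uint32_t min;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F; len = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F; len = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07; len = 4; min = 0x10000;
    } else {
        return 0;
    }
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* also stops at the NUL terminator */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Convert a NUL-terminated UTF-8 string into a NUL-terminated array of
 * UTF-16 code units. Malformed bytes become U+FFFD (an assumption).
 * The caller releases the result with free(). */
static uint16_t *utf8_to_utf16(const char *src)
{
    assert(src != NULL);
    const unsigned char *s = (const unsigned char *)src;
    /* No input byte produces more than one UTF-16 unit, so this suffices. */
    uint16_t *out = malloc((strlen(src) + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    size_t n = 0;
    while (*s != 0) {
        uint32_t cp;
        size_t len = utf8_decode(s, &cp);
        if (len == 0) {
            cp = 0xFFFD;            /* replacement character */
            len = 1;
        }
        if (cp >= 0x10000) {
            /* Outside the BMP: emit a surrogate pair. */
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
        s += len;
    }
    out[n] = 0;
    return out;
}
```

Characters above U+FFFF come out as two code units (a surrogate pair), which
is why the result is an array of UTF-16 units rather than of characters.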

For file format drivers the string representation should be worked out on a
per-driver basis. If a driver needs to parse ASCII text, there is no need to
convert strings to UTF-8 until they are passed to GDAL functions.

Notes on implementation:

1. New CPL functions:

   // Convert a UTF-8 encoded string into ASCII. Out-of-range characters
   // are replaced with '?' in the output string.
   char* CPLUTF8ToASCII(const char*);

   // Convert UTF-8 encoded string into local encoding.
   char* CPLUTF8ToLocal(const char*);

   // Convert string from local encoding to UTF-8.
   char* CPLLocalToUTF8(const char*);

   // Convert string from UTF-8 encoding into array of wchar_t elements.
   // Destination encoding is system specific.
   wchar_t* CPLUTF8ToWide(const char*);

   // Convert array of wchar_t elements into UTF-8 encoded string.
   // Source encoding is system specific.
   char* CPLWideToUTF8(const wchar_t*);

2. In order to use non-ASCII characters in user input, every application
   should call setlocale(LC_ALL, "") right after the entry point.

3. Code example. Let's look at how the GDAL utilities and core code should
   be changed with regard to Unicode.

   For input instead of

	pszFilename = argv[i];
	if( pszFilename )
		hDataset = GDALOpen( pszFilename, GA_ReadOnly );

   we should do

	pszFilename = CPLLocalToUTF8(argv[i]);
	if ( pszFilename )
	{
		hDataset = GDALOpen( pszFilename, GA_ReadOnly );
		CPLFree( pszFilename );
	}

   For output instead of

	printf( "Description = %s\n", GDALGetDescription(hBand) );

   we should do

	char *pszDescription = CPLUTF8ToLocal( GDALGetDescription(hBand) );
	if ( pszDescription )
	{
		printf( "Description = %s\n", pszDescription );
		CPLFree( pszDescription );
	}

   The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
   above will be further processed in the GDAL core. On Windows instead of

	hFile = CreateFile( pszFilename, dwDesiredAccess,
		FILE_SHARE_READ | FILE_SHARE_WRITE,
		NULL, dwCreationDisposition, dwFlagsAndAttributes, NULL );

   we do

	wchar_t *pawcFilename = CPLUTF8ToWide( pszFilename );
	if ( pawcFilename )
	{
		// I prefer to call the wide-character version explicitly
		// rather than define the UNICODE/_UNICODE switch.
		hFile = CreateFileW( pawcFilename, dwDesiredAccess,
			FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
			dwCreationDisposition, dwFlagsAndAttributes, NULL );
		CPLFree( pawcFilename );
	}
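
For the first function in note 1, a lossy UTF-8 to ASCII conversion could
look roughly like the sketch below. The name utf8_to_ascii_sketch() is
hypothetical, and emitting one '?' per non-ASCII character (rather than one
per byte) is an assumption; the proposal only says that out-of-range
characters are replaced with '?'. The sketch does not validate the input, and
it uses malloc()/free() where GDAL code would presumably use CPLMalloc() and
CPLFree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed CPLUTF8ToASCII().
 * ASCII bytes are copied; every other character becomes a single '?'. */
static char *utf8_to_ascii_sketch(const char *src)
{
    assert(src != NULL);
    /* The output is never longer than the input. */
    char *out = malloc(strlen(src) + 1);
    if (out == NULL)
        return NULL;

    size_t n = 0;
    for (const unsigned char *s = (const unsigned char *)src; *s != 0; s++) {
        if (*s < 0x80)
            out[n++] = (char)*s;    /* plain ASCII byte */
        else if ((*s & 0xC0) != 0x80)
            out[n++] = '?';         /* lead (or invalid) byte: one '?' */
        /* continuation bytes (10xxxxxx) are skipped */
    }
    out[n] = '\0';
    return out;
}
```

Because the result is lossy, it would suit fallback output such as log
messages on a terminal that cannot show the original text, not filenames that
are passed back to the operating system.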

References:

1. FAQ on how to use Unicode in software:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html

2. FLTK implementation of string conversion functions:

    http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c


-- 
Andrey V. Kiselev
ICQ# 26871517


