[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Akio Takubo takubo at saruga-tondara.net
Thu Sep 21 16:24:07 EDT 2006


Dear Andrey and all developers.

 I have been following the "Wide-character filenames" thread with great 
interest, and I have read the draft document on Unicode support. It is great 
to see non-ASCII character support being improved! I have some comments on the draft.

 First, as Ben pointed out, I think the filename (or filesystem-related) issue 
and the data issue (attribute names, attribute data, etc.) are somewhat different topics.

 In this draft, the supported encodings seem to be the following three types:
+ ASCII
+ UTF-8 (and UTF-16)
+ local encoding
I think this framework has a limitation. If GDAL/OGR converts to 
UTF-8 internally, the conversion may corrupt the data 
when a shp file uses an encoding other than the local one.
 For example, can a user on Windows with Latin1 (CP1252) correctly read a 
shapefile encoded in Shift JIS (CP932)? (Most Japanese users use shp files in CP932.)
I do not think we can automatically detect which encoding each shp file uses.

With the current GDAL/OGR we can at least handle non-ASCII content in general, 
as long as the client that uses GDAL/OGR takes the content's encoding into account.
Because we currently receive the raw byte sequence from GDAL/OGR, 
we can convert it to a string using a specified encoding. 
In QGIS, when a user opens a shp file (or another supported file), he needs to 
select the encoding the file uses. QGIS then reads the attributes via OGR 
and converts them from the selected encoding to Unicode (QGIS's internal encoding). 
This way we can read content in various encodings.

If GDAL/OGR is going to use UTF-8 internally, I think an API for specifying 
the encoding of each datasource (not per driver and not per system) 
is needed (perhaps defaulting to the local encoding). 
Depending on the file format, this setting can be ignored, of course.

 Client <------------------> GDAL/OGR <-------------------> datasource
 (local encoding)                (UTF-8)                           (user setting or driver specific)
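
 As a rough illustration of the right-hand side of this diagram, here is a 
minimal sketch in C of converting raw attribute bytes from a per-datasource 
encoding to UTF-8. It assumes POSIX iconv() is available. The name 
SketchRecodeToUTF8 and its parameters are hypothetical; this is not an 
existing GDAL/OGR function, only an idea of what such a setting could drive.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of GDAL/OGR): convert a raw attribute
 * value from the encoding selected for one datasource to UTF-8.
 * Returns a malloc()ed string that the caller must free(), or NULL if
 * the encoding name is unknown or the bytes are invalid in it. */
char *SketchRecodeToUTF8(const char *pszRaw, const char *pszSrcEncoding)
{
    assert(pszRaw != NULL && pszSrcEncoding != NULL);

    iconv_t cd = iconv_open("UTF-8", pszSrcEncoding);
    if (cd == (iconv_t)-1)
        return NULL;                    /* encoding not supported here */

    size_t nInLeft = strlen(pszRaw);
    /* One input byte expands to at most 4 UTF-8 bytes. */
    size_t nOutSize = nInLeft * 4 + 1;
    char *pszOut = malloc(nOutSize);
    if (pszOut == NULL)
    {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() wants a non-const pointer; it does not modify the input. */
    char *pIn = (char *)pszRaw;
    char *pOut = pszOut;
    size_t nOutLeft = nOutSize - 1;

    size_t rc = iconv(cd, &pIn, &nInLeft, &pOut, &nOutLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
    {
        free(pszOut);                   /* invalid or incomplete input */
        return NULL;
    }

    *pOut = '\0';
    return pszOut;
}
```

 With something like this, a user on a CP1252 system could pass "CP932" for a 
Japanese shp file, and the local encoding would only be the default. Which 
encoding names are accepted depends on the iconv implementation.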


 I hope my comments will help you.

Thanks, 

 Akio Takubo
  From Tokyo, Japan

On Thu, 21 Sep 2006 17:00:52 +0400
"Andrey Kiselev" <andrey.kiselev at gmail.com> wrote:

> On 9/21/06, Ben Discoe <ben at vterrain.org> wrote:
> > I am happy to do this work and submit it, i am just waiting to hear from you
> > (or Frank, or..? Confused by GDAL's new democracy) that i should proceed.
> 
> We have discussed the issue today on IRC and now I want to come up with the
> following proposal.
> 
> Unicode support in GDAL.
> 
> There are three basic statements:
> 
> 1. Users work in localized environment using their native languages. That
>    means we can not assume ASCII character set when working with string data
>    passed to GDAL.
> 
> 2. GDAL uses UTF-8 encoding internally when working with strings.
> 
> 3. GDAL uses Unicode version of third-party API when it is possible.
> 
> So all strings, used in GDAL, are in UTF-8, not in plain ASCII. That means we
> should convert user's input from the local encoding to UTF-8 during
> interactive sessions. The opposite should be done for GDAL output. For
> example, when user passes a filename as a command-line parameter to GDAL
> utilities, that filename should be immediately converted to UTF-8 and only
> afterwards passed to functions like GDALOpen(). All functions which take
> character strings as parameters assume UTF-8 (with the exception of several,
> which will do the conversion between different encodings, see below). The same
> is valid for output functions. Output functions (CPLError/CPLDebug), embedded
> in GDAL, should convert all strings from UTF-8 to local encoding before
> printing them. Custom error handlers should be aware of UTF-8 issue and do the
> proper transformation of strings passed to them.
> 
> The string encoding pops up again when GDAL needs to call the third-party API.
> UTF-8 should be converted to encoding suitable for that API. In particular,
> that means we should convert UTF-8 to UTF-16 before calling CreateFile()
> function in Windows implementation of VSIFOpenL().
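
 To make this conversion step concrete, here is a minimal sketch of a UTF-8 
to wchar_t decoder. It is only an illustration: SketchUTF8ToWide is a 
hypothetical name standing in for the proposed CPLUTF8ToWide(), and the 
sketch does not reject overlong forms or encoded surrogates.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for the proposed CPLUTF8ToWide(). Decodes UTF-8
 * into a malloc()ed, zero-terminated wchar_t array. Where wchar_t is
 * 16 bits wide (Windows) it emits UTF-16 surrogate pairs, elsewhere one
 * element per code point. Malformed input becomes U+FFFD. */
wchar_t *SketchUTF8ToWide(const char *pszUTF8)
{
    assert(pszUTF8 != NULL);

    const unsigned char *p = (const unsigned char *)pszUTF8;
    /* Every output element consumes at least one input byte. */
    wchar_t *pwszOut = malloc((strlen(pszUTF8) + 1) * sizeof(wchar_t));
    if (pwszOut == NULL)
        return NULL;

    size_t n = 0;
    while (*p != 0)
    {
        unsigned long cp;
        int nExtra;                     /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        nExtra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; nExtra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; nExtra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; nExtra = 3; }
        else                          { cp = 0xFFFD;    nExtra = 0; }
        p++;

        for (; nExtra > 0; nExtra--)
        {
            if ((*p & 0xC0) != 0x80)    /* truncated sequence */
            {
                cp = 0xFFFD;
                break;
            }
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
            p++;
        }
        if (cp > 0x10FFFF)
            cp = 0xFFFD;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF)
        {
            /* Split into a UTF-16 surrogate pair. */
            cp -= 0x10000;
            pwszOut[n++] = (wchar_t)(0xD800 + (cp >> 10));
            pwszOut[n++] = (wchar_t)(0xDC00 + (cp & 0x3FF));
        }
        else
        {
            pwszOut[n++] = (wchar_t)cp;
        }
    }
    pwszOut[n] = L'\0';
    return pwszOut;
}
```

 On Windows the result could then be handed to CreateFileW() as the proposal 
describes. A real implementation there might prefer the Win32 function 
MultiByteToWideChar() with CP_UTF8 over a hand-written decoder.
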
> 
> For file format drivers the string representation should be worked out on a
> per-driver basis. If a driver needs to parse ASCII text there is no need to
> convert strings to UTF-8 until they are passed to GDAL functions.
> 
> Notes on implementation:
> 
> 1. New CPL functions:
> 
>    // Convert UTF-8 encoded string into ASCII. Out-of-range characters
>    // replaced with '?' in output string.
>    char* CPLUTF8ToASCII(const char*);
> 
>    // Convert UTF-8 encoded string into local encoding.
>    char* CPLUTF8ToLocal(const char*);
> 
>    // Convert string from local encoding to UTF-8.
>    char* CPLLocalToUTF8(const char*);
> 
>    // Convert string from UTF-8 encoding into array of wchar_t elements.
>    // Destination encoding is system specific.
>    wchar_t* CPLUTF8ToWide(const char*);
> 
>    // Convert array of wchar_t elements into UTF-8 encoded string.
>    // Source encoding is system specific.
>    char* CPLWideToUTF8(wchar_t*);
> 
> 2. In order to use non-ASCII characters in user input every application should
>    call setlocale(LC_ALL,  "") function right after the entry point.
> 
> 3. Code example. Let's look at how the gdal utilities and core code should
>    be changed in regard to Unicode.
> 
>    For input instead of
> 
> 	pszFilename = argv[i];
> 	if( pszFilename )
> 		hDataset = GDALOpen( pszFilename, GA_ReadOnly );
> 
>    we should do
> 
> 	pszFilename = CPLLocalToUTF8(argv[i]);
> 	if ( pszFilename )
> 	{
> 		hDataset = GDALOpen( pszFilename, GA_ReadOnly );
> 		CPLFree( pszFilename );
> 	}
> 
>    For output instead of
> 
>    	printf( "Description = %s\n", GDALGetDescription(hBand) );
> 
>    we should do
> 
> 	char *pszDescription = CPLUTF8ToLocal( GDALGetDescription(hBand) );
> 	if ( pszDescription )
> 	{
>    		printf( "Description = %s\n", pszDescription );
> 		CPLFree( pszDescription );
> 	}
> 
>    The filename passed to GDALOpen() in UTF-8 encoding in the code snippet
>    above will be further processed in the GDAL core. On Windows instead of
> 
>    	hFile = CreateFile( pszFilename, dwDesiredAccess,
> 		FILE_SHARE_READ | FILE_SHARE_WRITE,
>                 NULL, dwCreationDisposition,  dwFlagsAndAttributes, NULL );
> 
>    we do
> 
> 	wchar_t *pawcFilename = CPLUTF8ToWide( pszFilename );
> 	if ( pawcFilename )
> 	{
> 		// I prefer to call the wide-character version explicitly
> 		// rather than specify the _UNICODE switch.
>    		hFile = CreateFileW( pawcFilename, dwDesiredAccess,
> 			FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
> 			dwCreationDisposition,  dwFlagsAndAttributes, NULL );
> 		CPLFree( pawcFilename );
> 	}
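
 Of the new CPL functions listed in note 1, the UTF-8 to ASCII one is simple 
enough to sketch in full. The following is a minimal illustration of the 
"replace out-of-range characters with '?'" behaviour described there. 
SketchUTF8ToASCII is a hypothetical name, and I am assuming one '?' per 
character rather than one per byte, which the draft does not specify.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the proposed CPLUTF8ToASCII(). Returns a
 * malloc()ed copy of the input in which every non-ASCII character is
 * replaced with a single '?'. The caller must free() the result. */
char *SketchUTF8ToASCII(const char *pszUTF8)
{
    assert(pszUTF8 != NULL);

    /* The output is never longer than the input. */
    char *pszOut = malloc(strlen(pszUTF8) + 1);
    if (pszOut == NULL)
        return NULL;

    size_t n = 0;
    const unsigned char *p;
    for (p = (const unsigned char *)pszUTF8; *p != 0; p++)
    {
        if (*p < 0x80)
            pszOut[n++] = (char)*p;     /* plain ASCII, copy as is */
        else if ((*p & 0xC0) != 0x80)
            pszOut[n++] = '?';          /* lead byte: one '?' per character */
        /* continuation bytes (10xxxxxx) are skipped */
    }
    pszOut[n] = '\0';
    return pszOut;
}
```

 This conversion loses information, so it seems suitable only for output that 
must be plain ASCII, not for filenames that are later opened again.
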
> 
> References:
> 
> 1. FAQ on how to use Unicode in software:
> 
>     http://www.cl.cam.ac.uk/~mgk25/unicode.html
> 
> 2. FLTK implementation of string conversion functions:
> 
>     http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
> 
> 
> -- 
> Andrey V. Kiselev
> ICQ# 26871517
> _______________________________________________
> Gdal-dev mailing list
> Gdal-dev at lists.maptools.org
> http://lists.maptools.org/mailman/listinfo/gdal-dev
> 





