[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Mateusz Loskot mateusz at loskot.net
Thu Sep 21 11:44:28 EDT 2006


Andrey Kiselev wrote:
> We have discussed the issue today on IRC and now I want to come up with the
> following proposal.
> 
> Unicode support in GDAL.
> 
> There are three basic statements:
> 
> 1. Users work in localized environments using their native languages. That
>   means we cannot assume the ASCII character set when working with string
>   data passed to GDAL.

+1

> 2. GDAL uses UTF-8 encoding internally when working with strings.

+1

> 3. GDAL uses the Unicode version of a third-party API when possible.

+1

> So all strings used in GDAL are in UTF-8, not in plain ASCII. That
> means we should convert the user's input from the local encoding to UTF-8
> during interactive sessions. The opposite should be done for GDAL output.

Is my understanding correct that we won't reimplement GDAL drivers,
for example Shape, to accept UTF-8?
So strings will be converted between ASCII and UTF-8 as they move
in and out of GDAL's internal buffers?

As far as I understand, the data flow for sample OGR data looks as follows:

user's input -> convert to UTF-8 -> manipulate -> convert to ASCII
   -> send to OGR driver -> write to dataset

However, that is still not full Unicode support, because the drivers
themselves still accept non-Unicode strings.

I'm a bit confused :-)
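
To make the first step of the flow above concrete (user's input ->
convert to UTF-8), here is a minimal sketch in C. It assumes the local
encoding is ISO-8859-1, which is only one of many possible local
encodings, and latin1_to_utf8() is a hypothetical helper name, not an
existing GDAL function. A real implementation would have to handle
arbitrary code pages.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GDAL: converts a NUL-terminated
 * ISO-8859-1 string to UTF-8. Returns the number of bytes written,
 * excluding the terminating NUL, or (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);

    size_t out = 0;
    for (const unsigned char *p = (const unsigned char *)src; *p != 0; ++p) {
        /* Bytes below 0x80 are plain ASCII and map to one UTF-8 byte;
         * every other Latin-1 byte maps to a two-byte sequence. */
        size_t need = (*p < 0x80) ? 1 : 2;
        if (out + need >= dst_size)      /* keep room for the NUL */
            return (size_t)-1;

        if (*p < 0x80) {
            dst[out++] = (char)*p;
        } else {
            dst[out++] = (char)(0xC0 | (*p >> 6));
            dst[out++] = (char)(0x80 | (*p & 0x3F));
        }
    }

    if (out >= dst_size)                 /* no room even for the NUL */
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

For example, the four Latin-1 bytes 63 61 66 E9 ("café") become the
five UTF-8 bytes 63 61 66 C3 A9. Pure ASCII input passes through
unchanged, which is why ASCII-only drivers need no conversion.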

> All functions which take
> character strings as parameters assume UTF-8 (with the exception of several
> which will do the conversion between different encodings, see below).

OK.

> The same is valid for output functions.
> Output functions (CPLError/CPLDebug), embedded
> in GDAL, should convert all strings from UTF-8 to local encoding before
> printing them. Custom error handlers should be aware of the UTF-8 issue and
> do the proper transformation of strings passed to them.

OK.

> The string encoding pops up again when GDAL needs to call a
> third-party API.
> UTF-8 should be converted to an encoding suitable for that API. In
> particular, that means we should convert UTF-8 to UTF-16 before calling the
> CreateFile() function in the Windows implementation of VSIFOpenL().

OK
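
As an illustration of what the UTF-8 to UTF-16 step involves, here is a
minimal portable sketch in C. utf8_to_utf16() is a hypothetical helper
name, not a GDAL or Win32 function. On Windows itself one would more
likely call MultiByteToWideChar() with CP_UTF8 and pass the result to
CreateFileW(); the code below only shows the conversion logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GDAL or Win32: converts a
 * NUL-terminated UTF-8 string to NUL-terminated UTF-16 code units.
 * Returns the number of code units written, excluding the terminator,
 * or (size_t)-1 on malformed input or insufficient space. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_len)
{
    assert(src != NULL && dst != NULL);

    const unsigned char *p = (const unsigned char *)src;
    size_t out = 0;

    while (*p != 0) {
        uint32_t cp;      /* decoded code point */
        uint32_t min_cp;  /* smallest value allowed for this length */
        int extra;        /* number of continuation bytes expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; min_cp = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min_cp = 0x10000; }
        else return (size_t)-1;       /* invalid lead byte */
        ++p;

        for (int i = 0; i < extra; ++i, ++p) {
            /* A truncated sequence hits the NUL here and is rejected. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        size_t need = (cp >= 0x10000) ? 2 : 1;
        if (out + need >= dst_len)    /* keep room for the terminator */
            return (size_t)-1;

        if (cp >= 0x10000) {
            /* Code points above the BMP become a surrogate pair. */
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            dst[out++] = (uint16_t)cp;
        }
    }

    if (out >= dst_len)               /* no room even for the terminator */
        return (size_t)-1;
    dst[out] = 0;
    return out;
}
```

A file name containing U+00E9 (UTF-8 bytes C3 A9) yields the single
code unit 0x00E9, while a character outside the BMP such as U+1F600
yields the surrogate pair 0xD83D 0xDE00.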

> For file format drivers the string representation should be worked out on
> a per-driver basis. If a driver needs to parse ASCII text, there is no need
> to convert strings to UTF-8 until they are passed to GDAL functions.

I see, now my questions from above have been answered.
Though I still think drivers should also support Unicode, at least
the OGR drivers, to be able to deal with internationalized
strings in feature attributes.

> Notes on implementation:
> [...]

OK.

> 2. FLTK implementation of string conversion functions:
> 
>    http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c

And functions reference:

http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html

Cheers
-- 
Mateusz Loskot
http://mateusz.loskot.net


