[Gdal-dev] RFC DRAFT: Unicode support in GDAL
Mateusz Loskot
mateusz at loskot.net
Thu Sep 21 11:44:28 EDT 2006
Andrey Kiselev wrote:
> We have discussed the issue today on IRC and now I want to come up with the
> following proposal.
>
> Unicode support in GDAL.
>
> There are three basic statements:
>
> 1. Users work in a localized environment using their native languages. That
> means we cannot assume the ASCII character set when working with string data
> passed to GDAL.
+1
> 2. GDAL uses UTF-8 encoding internally when working with strings.
+1
> 3. GDAL uses the Unicode versions of third-party APIs whenever possible.
+1
> So all strings used in GDAL are in UTF-8, not in plain ASCII. That
> means we should convert user's input from the local encoding to UTF-8 during
> interactive sessions. The opposite should be done for GDAL output.
Is my understanding correct that we won't reimplement GDAL drivers,
for example Shape, to accept UTF-8?
So strings will be converted between ASCII and UTF-8 as they move
into and out of GDAL's internal buffers?
As far as I understand, the data flow for sample OGR data looks as follows:
user's input -> convert to UTF-8 -> manipulate -> convert to ASCII
-> send to OGR driver -> write to dataset
However, it's still not full Unicode support, because drivers
still accept non-Unicode.
I'm a bit confused :-)
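The data flow described above can be sketched in a few lines. This is only an illustration of the encoding steps, not GDAL code: the function names `to_internal` and `to_driver` are hypothetical, and the choice of cp1250 as the user's local encoding is an assumption made for the example.

```python
# Illustrative sketch only: these helpers are hypothetical, not GDAL API.

def to_internal(raw: bytes, local_encoding: str) -> bytes:
    """Convert user input from the local encoding to UTF-8."""
    return raw.decode(local_encoding).encode("utf-8")


def to_driver(internal_utf8: bytes, driver_encoding: str = "ascii") -> bytes:
    """Convert an internal UTF-8 string to the encoding a driver accepts.

    Characters the driver encoding cannot represent are replaced with '?',
    which is where information is lost if the driver is ASCII-only.
    """
    return internal_utf8.decode("utf-8").encode(driver_encoding, errors="replace")


# Assumed example: a Polish city name typed on a cp1250 system.
user_input = "Łódź".encode("cp1250")
internal = to_internal(user_input, "cp1250")
for_driver = to_driver(internal)
```

With an ASCII-only driver the three non-ASCII letters are replaced at the last step, which is the reason the flow above is not yet full Unicode support.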
> All functions which take
> character strings as parameters assume UTF-8 (with the exception of several
> which will do the conversion between different encodings, see below).
OK.
> The same is valid for output functions.
> Output functions (CPLError/CPLDebug), embedded
> in GDAL, should convert all strings from UTF-8 to the local encoding before
> printing them. Custom error handlers should be aware of the UTF-8 issue and
> do the proper transformation of strings passed to them.
OK.
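As a rough sketch of what such a conversion in an error handler might look like: the helper name `format_for_console` is hypothetical, and using the locale's preferred encoding as the "local encoding" is an assumption of this example, not something the proposal specifies.

```python
import locale
from typing import Optional

# Illustrative sketch only: format_for_console is a hypothetical helper.


def format_for_console(msg_utf8: bytes, encoding: Optional[str] = None) -> bytes:
    """Re-encode a UTF-8 message for output in the local encoding.

    If no encoding is given, fall back to the locale's preferred one
    (an assumption about what "local encoding" means here).
    Unrepresentable characters are replaced with '?' instead of raising.
    """
    target = encoding or locale.getpreferredencoding(False)
    return msg_utf8.decode("utf-8").encode(target, errors="replace")


# Assumed example: a Polish message shown on an ISO-8859-2 terminal.
message = "Błąd".encode("utf-8")
latin2_bytes = format_for_console(message, "iso-8859-2")
```

A custom handler would apply the same step to the message it receives before writing it anywhere that expects the local encoding.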
> The string encoding pops up again when GDAL needs to call the
> third-party API.
> UTF-8 should be converted to an encoding suitable for that API. In particular,
> that means we should convert UTF-8 to UTF-16 before calling the CreateFile()
> function in the Windows implementation of VSIFOpenL().
OK
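To make the UTF-8 to UTF-16 step concrete, here is a small sketch of the byte-level conversion. It does not call any Windows API; the helper name `utf8_to_utf16_path` is hypothetical, and the little-endian, NUL-terminated layout is what the wide-character Windows functions expect.

```python
# Illustrative sketch only: utf8_to_utf16_path is a hypothetical helper.


def utf8_to_utf16_path(path_utf8: bytes) -> bytes:
    """Convert a UTF-8 path to NUL-terminated UTF-16LE bytes.

    Characters outside the Basic Multilingual Plane become surrogate
    pairs, so the result is not always two bytes per character.
    """
    text = path_utf8.decode("utf-8")
    # "utf-16-le" emits no byte order mark; append the 2-byte terminator.
    return text.encode("utf-16-le") + b"\x00\x00"


# Assumed example file name containing a non-ASCII letter.
wide = utf8_to_utf16_path("dane_ł.shp".encode("utf-8"))
```

In C the same conversion would be done by a routine such as the FLTK functions referenced below or by the platform's own conversion call.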
> For file format drivers the string representation should be worked out on
> a per-driver basis. If a driver needs to parse ASCII text there is no need to
> convert strings to UTF-8 until they are passed to GDAL functions.
I see, now my questions from above have been answered.
Though, I still think drivers should also support Unicode, at least
OGR drivers, to be able to deal with i18n'ized
strings in feature attributes.
> Notes on implementation:
> [...]
OK.
> 2. FLTK implementation of string conversion functions:
>
> http://svn.easysw.com/public/fltk/fltk/trunk/src/utf.c
And functions reference:
http://www.easysw.com/~mike/fltk/doc-2.0/html/utf_8h.html
Cheers
--
Mateusz Loskot
http://mateusz.loskot.net