[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Fri Sep 22 08:55:22 EDT 2006

Akio Takubo wrote:
> In this draft, it seems that the supported encodings are the following three types:
> + ASCII
> + UTF-8(and UTF-16)
> + local encoding
> So this framework has some limitations, I think. If GDAL/OGR converts to
> UTF-8 internally, this conversion may lose information
> when a shp file has an encoding other than the local one.
> For example, can a user on Windows with Latin1 (CP1252) correctly read a
> shapefile encoded in Shift JIS (CP932)? (Most Japanese users use shp files with CP932.)
> I think that we cannot automatically know what encoding is used in each shp file.

Generally, I agree. But I think there are some exceptions to this rule.
I mean, if a particular data format supports storing encoding information
in the dataset, then we will be able to use it to provide correct
conversion.
See my latest posts about how the encoding can be stored in a Shapefile.
This information can be used to select/control the appropriate
encoder/decoder (on the internationalization layer side) and do the job
well.

To summarize, the conversion flow will look like this:

                   (1)             (2)
    Shapefile       |      GDAL     |      Local host
                    |               |
Shift JIS (CP932) ->|->    UTF-8  ->|->    Latin1 (CP1252)
                    |               |

Conversion (1) will be controlled by the codepage stored in
the Language driver ID (dbf) or the .cpg file, if one of them is present in
the Shapefile.
Otherwise we can decide to treat the Shapefile as UTF-8 encoded (within the
ASCII subset) by default.
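
As a minimal sketch of how conversion (1) could pick its decoder, the Python
below looks at the .cpg content first, then at the Language driver ID, and
falls back to UTF-8. The function names, the priority order and the three-entry
LDID table are assumptions made for illustration only; this is not GDAL/OGR
code, and a real table would have many more entries.

```python
import codecs

# Partial, illustrative LDID table (assumption: only a few well-known values).
LDID_TO_CODEC = {
    0x01: "cp437",   # DOS USA
    0x03: "cp1252",  # Windows ANSI
    0x13: "cp932",   # Japanese Shift JIS
}

DEFAULT_CODEC = "utf-8"  # the fallback suggested above; safe for the ASCII subset


def pick_encoding(cpg_text=None, ldid=None):
    """Return a Python codec name to use for conversion (1)."""
    if cpg_text:
        name = cpg_text.strip()
        if name.isdigit():              # e.g. "932" -> "cp932"
            name = "cp" + name
        try:
            return codecs.lookup(name).name
        except LookupError:
            pass                        # unknown name: fall through to the LDID
    if ldid in LDID_TO_CODEC:
        return LDID_TO_CODEC[ldid]
    return DEFAULT_CODEC


def read_ldid(dbf_path):
    """Read the Language driver ID, stored at byte offset 29 of the .dbf header."""
    with open(dbf_path, "rb") as f:
        header = f.read(32)
    return header[29] if len(header) >= 30 else None
```

For instance, pick_encoding("932") and pick_encoding(None, 0x13) both select
the cp932 codec, and a Shapefile with neither hint gets the UTF-8 default.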

Certainly, this approach may not be usable for every
data format/driver, but I think it's a good idea to use it wherever
possible.
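
The two conversions in the diagram above can be sketched in Python as follows.
The function names are hypothetical, and replacing unmappable characters with
"?" in step (2) is just one possible policy, chosen here to make the loss
visible.

```python
def shapefile_to_utf8(raw, source_encoding):
    """Conversion (1): bytes in the file's encoding -> UTF-8 bytes."""
    return raw.decode(source_encoding).encode("utf-8")


def utf8_to_local(utf8_bytes, local_encoding):
    """Conversion (2): UTF-8 bytes -> the host's local encoding.

    Characters missing from the local encoding are replaced with "?",
    so this step can be lossy.
    """
    return utf8_bytes.decode("utf-8").encode(local_encoding, errors="replace")


raw = "東京".encode("cp932")             # stand-in for bytes read from a .dbf
internal = shapefile_to_utf8(raw, "cp932")
local = utf8_to_local(internal, "cp1252")
```

Step (1) is lossless here: internal holds the UTF-8 form of the text. Step (2)
gives b"??" because CP1252 has no Japanese characters, so a client on the
Latin1 host would have to consume the UTF-8 form directly to see the text.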

> At least, we can manipulate non-ASCII content with the current GDAL/OGR in general,
> as long as a client that uses GDAL/OGR takes the content's encoding into account.

Also, GDAL/OGR currently uses ASCII, which is a subset of UTF-8; in other
words, UTF-8 is a superset of ASCII, so ASCII is compatible with UTF-8
for code points in the range 0-127.
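
A quick way to check this compatibility claim is the short Python sketch
below; the variable names are only illustrative.

```python
ascii_bytes = bytes(range(128))          # every ASCII code point, 0 through 127

# Decoding as ASCII or as UTF-8 gives the same text...
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# ...and that text encodes back to the same single bytes in UTF-8.
text = ascii_bytes.decode("ascii")
same_bytes = text.encode("utf-8") == ascii_bytes

# A lone byte outside that range is not valid UTF-8.
try:
    b"\xe9".decode("utf-8")              # this byte is "é" in Latin1
    high_byte_ok = True
except UnicodeDecodeError:
    high_byte_ok = False
```

So pure-ASCII data passes through a UTF-8 layer unchanged, while anything
above 127 needs to be decoded with the encoding it was actually written in.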

> Now we can receive a raw byte sequence from GDAL/OGR,
> and we can convert it to strings in a specified encoding.
> In QGIS, when a user opens a shp (or other supported file), he needs to
> select the encoding that the file uses.

Or, for the Shapefile in this example, we can detect the encoding if a .cpg
file or the Language driver ID is available.

> The attributes are then read via OGR
> and converted from the selected encoding to Unicode (QGIS's internal encoding).
> So we can read content in various encodings.

Yes, exactly, that's what I'm trying to express, though maybe not in
a very clear way.

> If GDAL/OGR will use UTF-8 internally, I think it would be better to add an API
> for specifying the encoding of each datasource (not per driver and not per system);
> the default could be the local encoding.

In both cases, we need to be able to use various encoders/decoders to
transform encoding X to UTF-8.

> Depends on file format, this setting can be ignored, of course.
> 
>  Client <------------------> GDAL/OGR <-------------------> datasource
>  (local encoding)                (UTF-8)                           (user setting/ or driver specific)

Yes, exactly.
However, when I'm talking about a "Unicode-aware driver" I'm trying to say
that we should support various encodings of the same kind of datasource.
For example, the Shape driver should be able to manipulate Shapefiles
encoded in Latin1 or UTF-8 or whatever encoding the user wants to use.
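
To illustrate what I mean, here is a rough Python sketch in which each
datasource carries its own encoding and always hands out UTF-8. The
AttributeSource class and its read_utf8 method are hypothetical names invented
for this example; they are not part of the OGR API.

```python
class AttributeSource:
    """Hypothetical stand-in for a datasource with its own text encoding."""

    def __init__(self, records, encoding):
        self._records = records      # raw attribute bytes as stored on disk
        self.encoding = encoding     # set per datasource, not per driver

    def read_utf8(self, index):
        """Return one attribute value as UTF-8 bytes (the internal form)."""
        return self._records[index].decode(self.encoding).encode("utf-8")


# Three datasources of the same kind, each written in a different encoding.
latin1_src = AttributeSource(["Zürich".encode("latin-1")], "latin-1")
sjis_src = AttributeSource(["大阪".encode("cp932")], "cp932")
utf8_src = AttributeSource(["Kraków".encode("utf-8")], "utf-8")

values = [src.read_utf8(0).decode("utf-8")
          for src in (latin1_src, sjis_src, utf8_src)]
```

The same reading code serves all three sources; only the encoding attached to
each datasource differs, whether it comes from the user or from the file.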

That's what I understand, generally, as internationalization of OGR
drivers.
If I'm confusing this subject too much, I apologize :-)

Cheers
-- 
Mateusz Loskot
http://mateusz.loskot.net