[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Akio Takubo takubo at saruga-tondara.net
Sat Sep 23 23:58:05 EDT 2006


Thank you for replying, Mateusz,

On Fri, 22 Sep 2006 14:55:22 +0200
Mateusz Loskot <mateusz at loskot.net> wrote:

> Akio Takubo wrote:
> >  In this draft, it seems that the supported encodings are the following three types:
> > + ASCII
> > + UTF-8 (and UTF-16)
> > + local encoding
> > So this framework has some limitations, I think. If GDAL/OGR converts to
> > UTF-8 internally, this conversion may break information
> > when a shp file has an encoding other than the local one.
> >  For example, can a user on Windows with Latin1 (CP1252) correctly read
> > a shapefile in Shift JIS (CP932)? (Most Japanese users use shp files with CP932.)
> > I think that we cannot automatically know which encoding is used in each shp file.
> 
> Generally, I agree. But I think there are some exceptions to this rule.
> I mean, if a particular data format supports encoding information to be
> stored in the dataset, then we will be able to use it to provide correct
> conversion.
>
> See my latest posts about how the encoding can be stored in a Shapefile.
> This information can be used to select/control the appropriate
> encoder/decoder (on the internationalization layer side) and do the job
> well.
>
> To summarize, the conversion flow will look like this:
> 
>                    (1)             (2)
>     Shapefile       |      GDAL     |      Local host
>                     |               |
> Shift JIS (CP932) ->|->    UTF-8  ->|->    Latin1 (CP1252)
>                     |               |
> 
> 
> Conversion (1) will be controlled by the codepage stored in
> the Language driver ID (dbf) or .cpg file, if one of them is present in
> the Shapefile.
> Otherwise we can decide to treat the Shapefile as UTF-8 (in the range of
> the ASCII subset) by default.

 I've read your post. Considering the LDID is interesting.
About a month ago I talked with a Japanese developer
about the dbf LDID; he had written about it in his blog.
He and I agreed that the LDID is convenient if it is set correctly,
but that providing a mechanism for setting the encoding a user
wants is also important, because there are some apps which
don't handle the LDID correctly.
 As for the shapefile format, SharpMap supports LDID handling
and provides an interface for setting the encoding manually.
It seems to be a good interface.
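
As a rough sketch of that idea (a manual setting that wins over detected
metadata), something like the following could work. The function names,
the precedence of .cpg over LDID, and the default are my assumptions, and
the LDID table is deliberately partial. The only format detail relied on
is that the LDID is the byte at offset 29 of the dbf header.

```python
import codecs

# Partial, illustrative LDID table (assumption: a few common values only).
LDID_TO_CODEC = {0x01: "cp437", 0x02: "cp850", 0x03: "cp1252", 0x13: "cp932"}

def normalize(name):
    """Return Python's canonical codec name, or None if it is unknown."""
    name = name.strip()
    if not name:
        return None
    # Also try a "cp" prefix, for files that store a bare codepage number.
    for candidate in (name, "cp" + name):
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None

def pick_encoding(dbf_header, cpg_text=None, user_encoding=None,
                  default="ascii"):
    """Hypothetical helper: user setting first, then .cpg, then LDID."""
    for name in (user_encoding, cpg_text):
        codec = normalize(name) if name else None
        if codec:
            return codec
    if len(dbf_header) > 29:
        codec = LDID_TO_CODEC.get(dbf_header[29])
        if codec:
            return codec
    return default
```

With this ordering, a user who knows that the LDID in a file is wrong can
still force the encoding he wants.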

> Certainly, this approach may not be possible to use for every
> dataformat/driver, but I think it's a good idea to use wherever it's
> possible.

 I agree. The PG driver is a good example, I think.
If the client encoding is set to UNICODE, we can read and write
data completely in UTF-8 regardless of the database encoding.
Another example is the GML driver: an XML document instance declares
its own encoding.
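
The XML case can be illustrated with Python's standard XML parser (not
the OGR GML driver). The element name and the value are invented for the
example. The parser reads the encoding from the XML declaration, so the
caller does not have to supply it.

```python
import xml.etree.ElementTree as ET

# A tiny document stored as ISO-8859-1 bytes; the declaration says so.
doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
       '<name>Zürich</name>').encode("iso-8859-1")

# The parser honours the declared encoding and returns Unicode text.
name = ET.fromstring(doc).text

# From here the value can be handed on as UTF-8.
utf8 = name.encode("utf-8")
```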

> > At least, we can generally manipulate non-ASCII contents with the current GDAL/OGR,
> > as long as a client which uses GDAL/OGR takes the contents' encoding into account.
> 
> Also, currently GDAL/OGR uses ASCII which is a subset of UTF-8, in other
> words, UTF-8 is a superset of ASCII, so ASCII is compatible with UTF-8
> in the range of characters 0-127.

Yes, ASCII is compatible with UTF-8. But as long as the raw byte sequence is not broken,
a client (outside GDAL/OGR) which uses GDAL/OGR can also handle data in
multibyte encodings.
Currently the client must know which encoding is used, but after GDAL/OGR supports
Unicode, the client will always use UTF-8 and GDAL/OGR must know which encoding is used.

> > Now we can receive a raw byte sequence from GDAL/OGR,
> > and we can convert it to strings in a specified encoding.
> > In QGIS, when a user opens a shp (or other supported file), he needs to
> > select the encoding which this file has.
> 
> Or, for Shapefile in this example, we can detect encoding if .cpg or
> Language driver ID is available.
>
> > Then QGIS reads attributes via OGR
> > and converts from the selected encoding to Unicode (QGIS's internal encoding).
> > So we can read contents in various encodings.
> 
> Yes, exactly, that's what I'm trying to express, though maybe not in
> a very clear way.
>
> > If GDAL/OGR will use UTF-8 internally, I think an API
> > for specifying the encoding for each datasource (not per driver or per system)
> > is needed (maybe the default is the local encoding).
> 
> In both cases, we need to be able to use various encoders/decoders to be
> able to transform X encoding to UTF-8.

 Yes. To convert encodings inside GDAL/OGR, some encoding
converter is needed. Currently, some apps (QGIS, MapServer...)
convert encodings outside GDAL/OGR.

> > Depending on the file format, this setting can be ignored, of course.
> > 
> >  Client <------------------> GDAL/OGR <-------------------> datasource
> >  (local encoding)                (UTF-8)                           (user setting/ or driver specific)
> 
> Yes, exactly.
> However, when I'm talking about "Unicode-aware driver" I'm trying to say
> that we should support various encodings of the same kind of datasources.
> For example, Shape driver should be able to manipulate Shapefiles
> encoded in Latin1 or UTF-8 or .... whatever encoding the user wants to use.
>
> That's what I understand, generally, as an internationalization of OGR
> drivers.
> If I'm messing this subject too much, I apologize :-)

Sorry for a mistake in my previous post. I intended the following for after GDAL/OGR
supports Unicode. It may be what you said.
Client <-----(UTF-8)--------> GDAL/OGR <---(UTF-8 <-> datasource encoding*)---> datasource
* driver specific / auto-detected (e.g. shp with LDID) / user selection
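
To make the picture concrete, here is a minimal sketch of a datasource
that carries its own encoding and always talks UTF-8 to the client. The
class and method names are hypothetical and are not an existing or
proposed GDAL/OGR API.

```python
class RecodingSource:
    """Hypothetical datasource wrapper with a per-datasource encoding."""

    def __init__(self, encoding="ascii"):
        # Set by the driver, by auto-detection (e.g. LDID), or by the user.
        self.encoding = encoding

    def to_client(self, raw):
        """Datasource bytes -> UTF-8 bytes for the client."""
        return raw.decode(self.encoding).encode("utf-8")

    def to_storage(self, utf8):
        """UTF-8 bytes from the client -> datasource bytes."""
        return utf8.decode("utf-8").encode(self.encoding)


# Usage: a CP932 shapefile attribute survives the round trip.
src = RecodingSource("cp932")
stored = "東京".encode("cp932")
seen_by_client = src.to_client(stored)
written_back = src.to_storage(seen_by_client)
```

The client never needs to know that the file is CP932; only the
datasource object does.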


Best regards, 

 Akio Takubo
  From Tokyo, Japan


