[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Mateusz Loskot mateusz at loskot.net
Sun Sep 24 10:23:31 EDT 2006


Hi Akio,

Akio Takubo wrote:
> On Fri, 22 Sep 2006 14:55:22 +0200
> Mateusz Loskot <mateusz at loskot.net> wrote:
>>
>> Conversion (1) will be controlled by the codepage stored in
>> the Language Driver ID (dbf) or the .cpg file, if one of them is present
>> in the Shapefile.
>> Otherwise we can decide to treat the Shapefile as UTF-8 (within its
>> ASCII subset) by default.
> 
>  I've read your post. Considering the LDID is interesting.

Yes, I also see it as a pretty nice solution.

> About a month ago I talked with a Japanese developer
> about the dbf's LDID; he had written about it in his blog,

I'm curious about it, could you give the blog URL?

> He and I agreed that the LDID is convenient if it is set correctly,
> but providing a mechanism for setting the encoding a user
> wants is also important, because there are some apps which
> don't handle the LDID correctly.

Yes, I agree. I'm also aware it's not a trivial task.
Generally, implementing fully-featured i18n scares me a lot, but
it's a very interesting subject for me :)

> Regarding the shapefile format, SharpMap supports LDID handling
> and provides an interface for setting the encoding manually.
> It seems a good interface.

I tried to look at the code, but I couldn't figure out how to access
SharpMap's repository from its website.

I think the most important thing we need in order to support the LDID
is a valid list of language driver IDs.
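
Just to illustrate the kind of table I mean, here is a tiny C++ sketch.
The values below are only a few commonly cited dBase LDID/codepage pairs
and would need to be verified against the dBase/ESRI documentation
before being relied on:

/* Illustrative subset only -- not a verified or complete LDID table. */
struct LDIDEntry
{
    int         nLDID;       /* language driver ID from the .dbf header */
    const char *pszCodePage; /* corresponding codepage name             */
};

static const LDIDEntry asLDIDTable[] =
{
    { 0x01, "CP437"  },  /* U.S. MS-DOS                          */
    { 0x03, "CP1252" },  /* Windows ANSI (Latin-1)               */
    { 0x13, "CP932"  },  /* Japanese Shift-JIS (Windows)         */
    { 0x57, "CP1252" }   /* ANSI, often written by ESRI software */
};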


The Shapefile provider for FDO uses a .cpg file.
The second ctor of the ShapeCPG class uses the locale to set the
appropriate codepage in the .cpg file:

ShapeCPG::ShapeCPG (const WCHAR* name, char *locale)

https://fdoshp.osgeo.org/source/browse/fdoshp/trunk/Providers/SHP/Src/ShpRead/ShapeCPG.cpp
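
For reference, a .cpg side-car file is just a short text file holding
the codepage name (e.g. "UTF-8" or "ISO-8859-1"), so reading it is
trivial. A minimal sketch (ReadCpgFile is a hypothetical helper of mine,
not FDO or GDAL code):

#include <stdio.h>
#include <string>

/* Hypothetical helper: return the codepage name stored in a .cpg file,
 * or an empty string if the file is missing or empty. */
static std::string ReadCpgFile(const char *pszCpgPath)
{
    std::string osCodePage;
    FILE *fp = fopen(pszCpgPath, "rb");
    if (fp == NULL)
        return osCodePage;   /* no .cpg -> caller falls back to LDID */

    char szBuf[64] = { 0 };
    if (fgets(szBuf, sizeof(szBuf), fp) != NULL)
    {
        osCodePage = szBuf;
        /* strip trailing CR/LF and spaces */
        while (!osCodePage.empty() &&
               (osCodePage[osCodePage.size() - 1] == '\n' ||
                osCodePage[osCodePage.size() - 1] == '\r' ||
                osCodePage[osCodePage.size() - 1] == ' '))
            osCodePage.erase(osCodePage.size() - 1);
    }
    fclose(fp);
    return osCodePage;
}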

Next, in the RowData class

https://fdoshp.osgeo.org/source/browse/fdoshp/trunk/Providers/SHP/Src/ShpRead/RowData.cpp

there are some functions to convert between codepage codes:

RowData::ConvertCodePageWin()
RowData::ConvertCodePageLinux()

and the data itself:

RowData::GetData()

and the final data conversion (i.e. the multibyte_to_wide_cpg macro and
friends) is done using iconv on Unix and the Windows API conversion
functions on Windows:

https://fdocore.osgeo.org/source/browse/fdocore/trunk/Utilities/Common/Inc/FdoCommonStringUtil.h

It can serve as an example of .cpg handling and character encoding
conversions.
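
For completeness, the same kind of conversion can be sketched directly
on top of iconv (this is my own minimal sketch, not the FDO code;
ToUTF8 is a made-up name and error handling is deliberately simplified):

#include <iconv.h>
#include <string.h>
#include <string>

/* Convert pszSrc from pszSrcEncoding to UTF-8 using POSIX iconv. */
static std::string ToUTF8(const char *pszSrc, const char *pszSrcEncoding)
{
    iconv_t cd = iconv_open("UTF-8", pszSrcEncoding);
    if (cd == (iconv_t)-1)
        return std::string(pszSrc);        /* unknown encoding: pass through */

    size_t nInLeft  = strlen(pszSrc);
    size_t nOutLeft = nInLeft * 4 + 1;     /* generous worst case for UTF-8 */
    std::string osOut(nOutLeft, '\0');

    char *pszIn  = const_cast<char *>(pszSrc);
    char *pszOut = &osOut[0];
    iconv(cd, &pszIn, &nInLeft, &pszOut, &nOutLeft);
    iconv_close(cd);

    osOut.resize(osOut.size() - nOutLeft); /* trim unused output space */
    return osOut;
}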

>> Certainly, this approach may not be possible to use for every
>> dataformat/driver, but I think it's a good idea to use wherever it's
>> possible.
> 
>  I agree with it. The PG driver is a good example, I think.
> If the client encoding is set to UNICODE, we can read/write data
> entirely in UTF-8 regardless of the db encoding.
> Another example is the GML driver. An XML document instance declares
> its own encoding.

Yes, also KML is XML-based.
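
As a side note on the PG case, forcing the client encoding is a
one-liner in libpq. A minimal sketch, assuming an already opened PGconn
("UNICODE" matches the name used above; it is PostgreSQL's alias for
UTF8):

#include <libpq-fe.h>

/* Make libpq deliver and accept all text as UTF-8, independently of
 * the database encoding.  Assumes conn is already connected. */
void ForceUTF8ClientEncoding(PGconn *conn)
{
    /* PQsetClientEncoding() returns 0 on success, -1 on failure. */
    if (PQsetClientEncoding(conn, "UNICODE") != 0)
    {
        /* fall back to the SQL-level equivalent */
        PGresult *res = PQexec(conn, "SET client_encoding TO 'UNICODE'");
        PQclear(res);
    }
}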

>>> At least, we can generally manipulate non-ASCII content with the current
>>> GDAL/OGR, as long as the client which uses GDAL/OGR takes the content's
>>> encoding into account.
>> Also, currently GDAL/OGR uses ASCII, which is a subset of UTF-8; in other
>> words, UTF-8 is a superset of ASCII, so ASCII is compatible with UTF-8
>> in the range of the first 128 characters (0-127).
> 
> Yes, ASCII is compatible with UTF-8. But as long as the raw byte sequence
> is not broken, a client (outside GDAL/OGR) which uses GDAL/OGR can also
> handle data in a multibyte encoding.
> Currently the client must know which encoding is used, but once GDAL/OGR
> supports Unicode, the client always uses UTF-8 and GDAL/OGR must know
> which encoding the datasource uses.

This is how I understand it.
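
Maybe it helps to write that contract down as a (purely hypothetical)
interface; none of these names exist in OGR, they only illustrate who is
responsible for knowing which encoding:

/* Purely hypothetical -- an illustration of the contract, not OGR API. */
class UnicodeAwareSource
{
public:
    /* The driver (via LDID/.cpg auto-detection) or the application
     * declares how the underlying data is encoded.                   */
    virtual void        SetSourceEncoding(const char *pszEncoding) = 0;

    /* Whatever the source encoding is, string values handed back to
     * the application are always UTF-8.                              */
    virtual const char *GetFieldAsUTF8(int iField) const = 0;

    virtual ~UnicodeAwareSource() {}
};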

>>> If GDAL/OGR will use UTF-8 internally, it is better to add an API
>>> for specifying the encoding for each datasource (not per driver and not
>>> per system); maybe the default is the local encoding, I think.
>> In both cases, we need to be able to use various encoders/decoders to
>> transform encoding X to UTF-8.
> 
>  Yes. To convert encodings inside GDAL/OGR, some encoding converter
> is needed. Currently the conversion is done outside of GDAL/OGR, by
> some apps (QGIS, MapServer...).

I suppose we will likely need some external engine to achieve that,
like ICU or iconv, or something else.

Maybe it's reasonable to provide optional support for ICU.
Boost.Regex works that way:
http://www.boost.org/libs/regex/doc/install.html#unicode
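
As a sketch of what "optional ICU" could mean in practice (HAVE_ICU is a
hypothetical configure-time macro; ucnv_convert() from <unicode/ucnv.h>
is a real ICU function, but the wrapper below is only an illustration):

#include <string.h>

#ifdef HAVE_ICU
#  include <unicode/ucnv.h>
#endif

/* Convert pszSrc from pszSrcEncoding to UTF-8 into pszDst.  Returns the
 * number of bytes written, or -1 on failure. */
int RecodeToUTF8(const char *pszSrc, int nSrcLen,
                 const char *pszSrcEncoding,
                 char *pszDst, int nDstLen)
{
    if (nDstLen <= 0)
        return -1;
#ifdef HAVE_ICU
    UErrorCode err = U_ZERO_ERROR;
    int32_t nOut = ucnv_convert("UTF-8", pszSrcEncoding,
                                pszDst, nDstLen,
                                pszSrc, nSrcLen, &err);
    return U_FAILURE(err) ? -1 : (int)nOut;
#else
    /* Without ICU: pass the bytes through unchanged (ASCII-safe only). */
    int nCopy = (nSrcLen < nDstLen - 1) ? nSrcLen : nDstLen - 1;
    (void)pszSrcEncoding;
    memcpy(pszDst, pszSrc, nCopy);
    pszDst[nCopy] = '\0';
    return nCopy;
#endif
}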

>>> Depending on the file format, this setting can be ignored, of course.
>>>
>>>  Client <------------> GDAL/OGR <------------> datasource
>>>  (local encoding)      (UTF-8)                 (user setting / driver-specific)
>> Yes, exactly.
>> However, when I'm talking about a "Unicode-aware driver" I'm trying to say
>> that we should support various encodings of the same kind of datasource.
>> For example, the Shape driver should be able to manipulate Shapefiles
>> encoded in Latin1 or UTF-8 or ... whatever encoding the user wants to use.
>>
>> That's what I understand, generally, as internationalization of OGR
>> drivers.
>> If I'm muddling this subject too much, I apologize :-)
> 
> Sorry for some mistakes in my previous post. I intended the following for
> after GDAL/OGR supports Unicode. It may be what you said.
> Client <----(UTF-8)----> GDAL/OGR <----(UTF-8 <-> datasource encoding*)----> datasource
> * driver-specific / auto-detected (e.g. shp with LDID) / user selection

Yes, it's exactly what I have in mind.

Cheers
-- 
Mateusz Loskot
http://mateusz.loskot.net


