[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Andrey Kiselev dron at ak4719.spb.edu
Fri Sep 22 05:01:16 EDT 2006


On Fri, Sep 22, 2006 at 05:24:07AM +0900, Akio Takubo wrote:
>  At first, as Ben pointed out, I think that the filename (or
>  filesystem-related) issue and the data (attribute name, attribute data,
>  etc.) issue are somewhat different topics.

Akio,

These are not entirely different issues. The only difference is how the
filename reaches the GDAL core: it can come either from the user's input
or from the contents of a file.

>  In this draft, it seems that the supported encodings are the following three types.
> + ASCII
> + UTF-8(and UTF-16)
> + local encoding
> So this framework has some limitations, I think. If GDAL/OGR converts to
> UTF-8 internally, this conversion may corrupt information when a shp file
> has an encoding other than the local one.  For example, can a user on
> Windows with Latin1 (CP1252) correctly read a shapefile in Shift JIS
> (CP932)? (Most Japanese users use shp files with CP932.)  I think that we
> cannot automatically know what encoding is used in each shp file.
> 
> At least, we can generally manipulate non-ASCII contents with the current
> GDAL/OGR, as long as a client that uses GDAL/OGR takes the contents'
> encoding into account.  Since we receive a raw byte sequence from GDAL/OGR,
> we can convert it to strings in a specified encoding.  In QGIS, when a user
> opens a shp (or other supported file), he needs to select the encoding that
> the file uses. QGIS then reads attributes via OGR and converts them from the
> selected encoding to Unicode (QGIS's internal encoding).  So we can read
> contents in various encodings.
> 
> If GDAL/OGR will use UTF-8 internally, I think it would be better to add
> an API for specifying the encoding for each datasource (not per driver and
> not per system), perhaps defaulting to the local encoding.
> Depending on the file format, this setting can be ignored, of course.

Excellent point! I missed this problem completely, but it is very
important for my environment too. Let me restate the problem: what
should we do if the file encoding differs from the local system encoding
and we have no way to determine the file encoding other than asking the
user?
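
To make the problem concrete, here is a minimal Python sketch. It is only
an illustration, not GDAL code: the sample string and the helper name
to_utf8 are my own assumptions. It shows the same raw bytes decoded with
the wrong (local) encoding and with the encoding the user selected.

```python
# Illustration only: the sample text and the helper name are assumptions,
# not part of GDAL/OGR.

# Raw attribute bytes as a shapefile written on a Japanese system might
# store them (Shift JIS / CP932).
raw = "東京".encode("cp932")


def to_utf8(raw_bytes, source_encoding):
    """Convert raw bytes from a known source encoding to UTF-8 bytes."""
    return raw_bytes.decode(source_encoding).encode("utf-8")


# Decoding with the local encoding of a CP1252 system gives unrelated
# characters instead of the original text.
wrong = raw.decode("cp1252", errors="replace")

# Decoding with the encoding chosen by the user recovers the text.
right = to_utf8(raw, "cp932").decode("utf-8")

print(repr(wrong))
print(repr(right))
```

The conversion to UTF-8 is only safe when the source encoding is known,
which is exactly the information we are missing.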

Well, for now I do not know how to solve this. The natural solution is
to introduce an "ENCODING" configuration parameter for the
GDALOpen/OGROpen functions.  Unfortunately, those functions do not accept
configuration parameters.  That should be the next RFC, I think.
Fortunately, we do not need to add the encoding parameter immediately,
because it is independent of the general i18n process. We can add UTF-8
support now and add support for forcing an encoding later, when open
options are introduced.
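
A rough sketch of what I mean, again in Python and purely hypothetical:
the functions resolve_encoding and read_attribute and the options
dictionary do not exist in GDAL/OGR; they only show a per-datasource
"ENCODING" option that falls back to the local encoding, as Akio suggests.

```python
import codecs
import locale


def resolve_encoding(open_options=None):
    """Pick the datasource encoding (hypothetical helper, not a GDAL API).

    Uses the "ENCODING" open option when given, otherwise falls back to
    the local system encoding.
    """
    options = open_options or {}
    name = options.get("ENCODING") or locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for unknown encodings and returns
    # the normalized codec name for known ones.
    return codecs.lookup(name).name


def read_attribute(raw_bytes, open_options=None):
    """Return attribute bytes converted to UTF-8 (hypothetical helper)."""
    encoding = resolve_encoding(open_options)
    return raw_bytes.decode(encoding).encode("utf-8")


# The user forces the encoding for this one datasource.
value = read_attribute("東京".encode("cp932"), {"ENCODING": "CP932"})
print(value.decode("utf-8"))
```

A format that records its own encoding could simply ignore the option.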

Best regards,
Andrey

-- 
Andrey V. Kiselev
ICQ# 26871517


