[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Akio Takubo takubo at saruga-tondara.net
Sat Sep 23 20:20:37 EDT 2006


Dear Andrey,

On Fri, 22 Sep 2006 13:01:16 +0400
Andrey Kiselev <dron at ak4719.spb.edu> wrote:

> On Fri, Sep 22, 2006 at 05:24:07AM +0900, Akio Takubo wrote:
> >  First, as Ben pointed out, I think that the filename (filesystem-
> >  related) issue and the data (attribute name, attribute data, etc.)
> >  issue are somewhat different topics.
> 
> Akio,
> 
> These are not entirely different issues. The only difference is the
> way the filename reaches the GDAL core. It can be either read from the
> user's input or read from the file contents.

 I agree that they are not entirely different. What I wanted to say is
that the filesystem may limit which encodings can be used for filenames,
but the data itself may be read or written in any encoding. Sorry that my
explanation was not clear enough.
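
 As a small illustration of the data side (a sketch in plain Python, not
GDAL/OGR code; the helper name decode_attribute is only for this example):
the same attribute bytes give different text depending on which encoding
the reader assumes, whatever the filesystem or local encoding is.

```python
# Attribute bytes as they might be stored in a shapefile written with CP932.
raw = "東京".encode("cp932")


def decode_attribute(raw_bytes, encoding):
    """Convert raw attribute bytes to a Unicode string using the given encoding.

    Bytes that are invalid in that encoding are replaced, not raised as errors.
    """
    return raw_bytes.decode(encoding, errors="replace")


# Decoded with the encoding the file really uses (selected by the user).
right = decode_attribute(raw, "cp932")

# Decoded with the local encoding of a Latin1 (CP1252) Windows system.
wrong = decode_attribute(raw, "cp1252")

print(right)
print(wrong)
```

 Only the first result is the original text; the second is readable
characters but not what the file author wrote.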

> >  In this draft, it seems that the supported encodings are the
> >  following three types.
> > + ASCII
> > + UTF-8(and UTF-16)
> > + local encoding
> > So this framework has some limitations, I think. If GDAL/OGR converts
> > to UTF-8 internally, this conversion may corrupt information when a
> > shp file has an encoding other than the local one.  For example, can
> > a user who uses Windows with Latin1 (CP1252) correctly read a
> > shapefile in Shift JIS (CP932)? (Most Japanese users use shp files in
> > CP932.)  I think that we cannot automatically know what encoding is
> > used in each shp file.
> > 
> > At least, with the current GDAL/OGR we can generally handle non-ASCII
> > contents, as long as the client that uses GDAL/OGR takes the
> > contents' encoding into account.  Since we receive the raw byte
> > sequence from GDAL/OGR, we can convert it to strings using the
> > specified encoding.  In QGIS, when a user opens a shp (or other
> > supported) file, he needs to select the encoding of that file. QGIS
> > then reads the attributes via OGR and converts them from the selected
> > encoding to Unicode (QGIS's internal encoding).  So we can read
> > contents in various encodings.
> > 
> > If GDAL/OGR will use UTF-8 internally, I think it would be better to
> > add an API for specifying the encoding of each datasource (not per
> > driver and not per system), maybe with the local encoding as the
> > default.  Depending on the file format, this setting can be ignored,
> > of course.
> 
> Excellent point! I have missed this problem completely, but it is very
> important for my environment too. I will phrase the problem one more
> time: what should we do if the file encoding differs from the local
> system encoding and we have no way to know the file encoding other
> than asking the user?
>
> Well, for now I do not know how to solve this. The natural solution is
> to introduce a configuration parameter "ENCODING" to the
> GDALOpen/OGROpen functions.  Unfortunately, those functions do not
> accept configuration parameters.  That should be the next RFC, I think.
> Hopefully, we do not need to add an encoding parameter immediately,
> because it is independent from the general i18n process. We can add
> UTF-8 support now and add support for forcing the encoding later, when
> open options are introduced.

 It is a very complicated problem, I think. Web browsers and text editors
already let us choose among various encodings, so they have both metadata
for encoding selection and a user selection mechanism. I hope we can find
an excellent solution for GDAL/OGR that also handles various encodings
correctly. I'll post about your next RFC if I notice something in it.
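
 To make the per-datasource idea a little more concrete, here is a rough
sketch in plain Python. It is not GDAL/OGR API: the class name
DataSourceText and its encoding argument are hypothetical, chosen only for
this mail. The encoding is set per datasource, falls back to the local
encoding when the user gives none, and text is handed on as UTF-8.

```python
import locale


class DataSourceText:
    """Hypothetical per-datasource text converter (not part of GDAL/OGR)."""

    def __init__(self, encoding=None):
        # Default to the local (system) encoding when none is specified.
        self.encoding = encoding or locale.getpreferredencoding(False)

    def to_utf8(self, raw_bytes):
        """Convert bytes read from the file into UTF-8 bytes for internal use."""
        text = raw_bytes.decode(self.encoding, errors="replace")
        return text.encode("utf-8")

    def from_utf8(self, utf8_bytes):
        """Convert internal UTF-8 bytes back to the datasource's own encoding."""
        text = utf8_bytes.decode("utf-8")
        return text.encode(self.encoding, errors="replace")


# A user on any system opens a CP932 shapefile and says so explicitly.
ds = DataSourceText("cp932")
stored = "東京".encode("cp932")
internal = ds.to_utf8(stored)
print(internal.decode("utf-8"))
```

 A format that records its own encoding could simply ignore the setting.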

Best regards, 

 Akio Takubo
  From Tokyo, Japan


