[Gdal-dev] RFC DRAFT: Unicode support in GDAL
Akio Takubo
takubo at saruga-tondara.net
Sat Sep 23 20:20:37 EDT 2006
Dear Andrey,
On Fri, 22 Sep 2006 13:01:16 +0400
Andrey Kiselev <dron at ak4719.spb.edu> wrote:
> On Fri, Sep 22, 2006 at 05:24:07AM +0900, Akio Takubo wrote:
> > First, as Ben pointed out, I think the filename (filesystem-related)
> > issue and the data (attribute names, attribute values, etc.) issue
> > are somewhat different topics.
>
> Akio,
>
> These are not entirely different issues. The only difference is how
> the filename reaches the GDAL core: it can either come from the
> user's input or be read from the file contents.
I agree that they are not entirely different. What I wanted to say is that
the filesystem may restrict which encodings can be used for filenames, but
any encoding may appear in the data we read or write. Sorry for my
insufficient explanation.
> > In this draft, it seems that the following three kinds of encoding
> > are supported:
> > + ASCII
> > + UTF-8 (and UTF-16)
> > + local encoding
> > So this framework has some limitations, I think. If GDAL/OGR converts to
> > UTF-8 internally, the conversion may corrupt data when a shp file
> > uses an encoding other than the local one. For example, can a user on
> > Windows with Latin1 (CP1252) correctly read a shapefile in Shift JIS
> > (CP932)? (Most Japanese users have shp files in CP932.) I think that we
> > cannot automatically detect which encoding each shp file uses.
> >
> > At least, with the current GDAL/OGR we can generally handle non-ASCII
> > content, as long as the client using GDAL/OGR takes care of the
> > content's encoding. Since we receive the raw byte sequence from
> > GDAL/OGR, we can convert it to a string using the specified encoding.
> > In QGIS, when a user opens a shp (or other supported) file, he needs to
> > select the encoding the file uses. QGIS then reads the attributes via
> > OGR and converts them from the selected encoding to Unicode (QGIS's
> > internal encoding). So we can read content in various encodings.
> >
> > If GDAL/OGR is going to use UTF-8 internally, I think it would be better
> > to add an API for specifying the encoding of each datasource (not per
> > driver or per system), perhaps defaulting to the local encoding.
> > Depending on the file format, this setting can be ignored, of course.
>
> Excellent point! I have missed this problem completely, but it is very
> important for my environment too. To restate the problem: what should
> we do if the file encoding differs from the local system encoding and
> we have no way to learn the file encoding other than asking the
> user?
>
> Well, for now I do not know how to solve this. The natural solution is
> to introduce a configuration parameter "ENCODING" for the GDALOpen/OGROpen
> functions. Unfortunately, those functions do not accept configuration
> parameters. That should be the next RFC, I think. Hopefully, we do not
> need to add an encoding parameter immediately, because it is independent
> of the general i18n process. We can add UTF-8 support now and add
> support for forcing an encoding later, when open options are
> introduced.
It is a very complicated problem, I think. Web browsers and text editors
already let us choose among various encodings, so they combine encoding
metadata with a user-selection mechanism. I hope we can find an excellent
solution for GDAL/OGR that also handles various encodings correctly. I will
reply to your next RFC if I notice anything about it.
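To illustrate the user-selection approach discussed above (raw bytes come
from the library, and the client decodes them with an encoding chosen per
datasource), here is a minimal Python sketch. It does not use GDAL/OGR at
all: decode_attribute is a hypothetical helper, the byte string stands in
for a raw attribute value read from a shp file, and falling back to the
local encoding is only the default suggested in this thread.

```python
import locale


def decode_attribute(raw, encoding=None):
    """Decode a raw attribute value using a per-datasource encoding.

    Hypothetical helper: if no encoding is given, fall back to the
    local (system) encoding, as proposed in the discussion.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # errors="replace" keeps the sketch from raising on bytes that are
    # undefined in the chosen code page.
    return raw.decode(encoding, errors="replace")


# Stand-in for an attribute value stored in a Shift JIS (CP932) shapefile.
raw = "東京".encode("cp932")

# Decoding with the encoding the user selected recovers the text.
as_cp932 = decode_attribute(raw, "cp932")

# Decoding with a Western local encoding (CP1252) silently yields garbage.
as_cp1252 = decode_attribute(raw, "cp1252")

print(as_cp932)
print(as_cp1252)
```

The same bytes give readable text only when the right encoding is chosen,
which is why a per-datasource setting (rather than a per-system one) seems
necessary.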
Best regards,
Akio Takubo
From Tokyo, Japan