[gdal-dev] Motion: To adopt RFC 23: Unicode Support in OGR

Fri Apr 25 14:01:18 EDT 2008

On Fri, Apr 25, 2008 at 12:38:24PM -0400, Frank Warmerdam wrote:
> You also wrote "What I am finally suggesting is to assume that the
> internal encoding is either UTF-8, when you are constructing a string
> with an encoding specified, or unknown (as it is now), and then you
> are keeping the encoding of the string somewhere else."  But I disagree
> with assuming that values of CPLString are always UTF-8.  There
> are lots of places where I construct CPLStrings from text without
> having any idea what the encoding is.

OK, let me make this clear. A CPLString can contain a string in any
8-bit encoding if it is constructed without an encoding specified. So

  oStr = CPLString("abc");

is equivalent to

  oStr = CPLString("abc", NULL);

and equivalent to

  CPLString oStr = "abc";

and it works exactly as it does now. No recoding happens here, and the
string is stored internally in the native encoding, whatever that is.
You should not use the GetAs() method on this object, because it will
produce undefined results. That is what you have to keep track of: do
not use recoding on old CPLString objects.

But

  oStr = CPLString("abc", "iso-8859-3");

will recode from "iso-8859-3" to UTF-8. The string will be stored in
UTF-8 internally, and you should use the GetAs() method to get it in
some other encoding (other than UTF-8).

  oStr = CPLString("abc", "UTF-8");

will do nothing and store "abc" directly, because it already comes in
the right encoding.
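
To make the three cases above concrete, here is a minimal sketch of the
constructor behaviour. It is only an illustration under stated
assumptions: EncStringSketch, Latin1ToUTF8() and IsKnownUTF8() are
hypothetical names, not part of CPLString, and ISO-8859-1 stands in for
"iso-8859-3" because it can be converted without a lookup table.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the interface discussed above; this is NOT
// the real CPLString.
class EncStringSketch : public std::string
{
    // True only when the object was built with an explicit encoding.
    bool bKnownUTF8 = false;

    // ISO-8859-1 byte 0xXX is code point U+00XX, so the conversion to
    // UTF-8 needs no table (unlike iso-8859-3).
    static std::string Latin1ToUTF8(const char *pszIn)
    {
        std::string osOut;
        const unsigned char *p =
            reinterpret_cast<const unsigned char *>(pszIn);
        for (; *p != 0; ++p)
        {
            if (*p < 0x80)
            {
                osOut += static_cast<char>(*p);
            }
            else
            {
                osOut += static_cast<char>(0xC0 | (*p >> 6));
                osOut += static_cast<char>(0x80 | (*p & 0x3F));
            }
        }
        return osOut;
    }

public:
    // Non-explicit on purpose, so that `EncStringSketch s = "abc";` works.
    EncStringSketch(const char *pszText, const char *pszEncoding = nullptr)
    {
        if (pszEncoding == nullptr)
        {
            // Legacy case: bytes stored as-is, encoding unknown.
            assign(pszText);
        }
        else if (std::strcmp(pszEncoding, "UTF-8") == 0)
        {
            // Already in the internal encoding: store directly.
            assign(pszText);
            bKnownUTF8 = true;
        }
        else if (std::strcmp(pszEncoding, "ISO-8859-1") == 0)
        {
            // Recode to UTF-8 on construction.
            assign(Latin1ToUTF8(pszText));
            bKnownUTF8 = true;
        }
        else
        {
            throw std::invalid_argument("encoding not handled in sketch");
        }
    }

    bool IsKnownUTF8() const { return bKnownUTF8; }
};
```

With this sketch, an object built without an encoding keeps its bytes
untouched and reports that its encoding is unknown, which is exactly
the case where recoding must not be used.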

> As you note, normal assignment with "=" or CPLString constructors with
> no encoding argument will not result in any conversion to UTF-8. So
> how have we achieved your goal of having CPLString always be UTF-8?

It is not always in UTF-8. Actually, I am suggesting the same approach
as you suggested, but with a different API.

> As I mentioned before, I would not mind a CPLString subclass that is
> intentionally always UTF-8, or perhaps that even carries its encoding
> along.  I just don't think we can apply this logic to CPLString and
> ensure that CPLString contents are always UTF-8 without careful review
> of all existing code.

Thinking about the issue some more, it seems much easier and cleaner
to just define a derived class (CPLUString?) using the interface that I
proposed. The new string object will be UTF-8 internally and will not
support dangerous and meaningless operations. That way we can
precisely separate encoding-aware strings from the old ones, and there
will be no ambiguity. I am ready to write a specification for that!
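
A rough sketch of what such a separate class could look like follows.
Everything in it is an assumption made for illustration: the name
UStringSketch, the exact GetAs() signature and the error handling are
hypothetical, the class derives from std::string only as a stand-in for
CPLString (which is itself derived from std::string), and only UTF-8
and ISO-8859-1 are handled. Which operations count as "dangerous" is
not settled here, so the sketch only removes construction without an
encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical encoding-aware string: always UTF-8 internally.
class UStringSketch : public std::string
{
    static unsigned char UC(char c) { return static_cast<unsigned char>(c); }

    static std::string Latin1ToUTF8(const char *pszIn)
    {
        std::string osOut;
        for (; *pszIn != 0; ++pszIn)
        {
            const unsigned char c = UC(*pszIn);
            if (c < 0x80)
            {
                osOut += static_cast<char>(c);
            }
            else
            {
                osOut += static_cast<char>(0xC0 | (c >> 6));
                osOut += static_cast<char>(0x80 | (c & 0x3F));
            }
        }
        return osOut;
    }

    // Characters outside ISO-8859-1 are replaced with '?'.
    static std::string UTF8ToLatin1(const std::string &osIn)
    {
        std::string osOut;
        for (std::size_t i = 0; i < osIn.size(); ++i)
        {
            const unsigned char c = UC(osIn[i]);
            if (c < 0x80)
            {
                osOut += static_cast<char>(c);
            }
            else if ((c == 0xC2 || c == 0xC3) && i + 1 < osIn.size() &&
                     (UC(osIn[i + 1]) & 0xC0) == 0x80)
            {
                // Two-byte sequence for U+0080..U+00FF.
                osOut += static_cast<char>(((c & 0x03) << 6) |
                                           (UC(osIn[i + 1]) & 0x3F));
                ++i;
            }
            else if ((c & 0xC0) != 0x80)
            {
                // Lead byte of a character that Latin-1 cannot hold.
                osOut += '?';
            }
            // Remaining continuation bytes are skipped.
        }
        return osOut;
    }

public:
    // The encoding argument is mandatory, so no object of this type can
    // hold bytes of unknown encoding.
    explicit UStringSketch(const char *pszText, const char *pszEncoding)
    {
        if (std::strcmp(pszEncoding, "UTF-8") == 0)
            assign(pszText);
        else if (std::strcmp(pszEncoding, "ISO-8859-1") == 0)
            assign(Latin1ToUTF8(pszText));
        else
            throw std::invalid_argument("encoding not handled in sketch");
    }

    // Return a copy of the contents recoded to the requested encoding.
    std::string GetAs(const char *pszEncoding) const
    {
        if (std::strcmp(pszEncoding, "UTF-8") == 0)
            return std::string(*this);
        if (std::strcmp(pszEncoding, "ISO-8859-1") == 0)
            return UTF8ToLatin1(*this);
        throw std::invalid_argument("encoding not handled in sketch");
    }
};
```

Because the constructor is explicit and always takes an encoding, a
function that accepts a UStringSketch cannot silently receive an old
string of unknown encoding; the caller has to state the encoding at the
point of conversion.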

> If you feel strongly enough about this you can -1 the RFC, but I am
> not willing to modify it to be based on the assumption that CPLString
> is always in UTF-8.

I don't want to veto something without being sure that we have
understood each other's proposals. I now agree with you that CPLString
is not a good place for this new functionality. A new class, then?

Best regards,
Andrey

-- 
Andrey V. Kiselev
ICQ# 26871517