[gdal-dev] Motion: To adopt RFC 23: Unicode Support in OGR

Andrey Kiselev dron at ak4719.spb.edu
Fri Apr 25 08:27:46 EDT 2008

On Thu, Apr 24, 2008 at 12:21:48PM -0400, Frank Warmerdam wrote:
> Motion: To approve RFC 23: Unicode Support in OGR
>  http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode


I have read the proposal once again and I have to say that I dislike the
suggested interface. With this API the character encoding is not a
property of the string object but external knowledge. Imagine that you
initialize two CPLString objects from the same const char* string, and
later call the recode() method on only one of them. How do you know
which object was changed? You have to keep track of the encodings
yourself. So the encoding should be a property of the string object, but
then the recode() method should take a single argument, and a
constructor taking an encoding value is also required.
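
To make the problem concrete, here is a minimal sketch. It uses
std::string as a stand-in for CPLString, and Latin1ToUtf8() and
MakeAmbiguousPair() are hypothetical helpers invented for this
illustration; they are not the recode() method from RFC 23.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper standing in for a recode step (ISO-8859-1 to UTF-8).
// Not part of CPL and not the API proposed in RFC 23.
std::string Latin1ToUtf8(const std::string &osInput)
{
    std::string osOutput;
    for (unsigned char ch : osInput)
    {
        if (ch < 0x80)
        {
            osOutput += static_cast<char>(ch);
        }
        else
        {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            osOutput += static_cast<char>(0xC0 | (ch >> 6));
            osOutput += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return osOutput;
}

// Two strings of the same type; nothing in the type says which was recoded.
struct AmbiguousPair
{
    std::string osUntouched;
    std::string osRecoded;
};

AmbiguousPair MakeAmbiguousPair()
{
    const char *pszInput = "caf\xE9";  // "cafe" with e-acute, ISO-8859-1
    std::string osA(pszInput);
    std::string osB(pszInput);
    assert(osA == osB);                // identical bytes before recoding

    osB = Latin1ToUtf8(osB);           // only osB is changed
    return {osA, osB};
}
```

After the call both members are plain strings, yet one holds ISO-8859-1
bytes and the other UTF-8 bytes. The caller has to remember which is
which, and that is the bookkeeping I would like to avoid.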

But modifying CPLString this way is also bad and adds unnecessary
complication. What I finally suggest is to assume that the internal
encoding is either UTF-8, when you construct the string with an
encoding specified, or unknown (as it is now), in which case you keep
track of the string's encoding somewhere else.

// Constructor taking an 8-bit string and its encoding. If the input
// encoding is set, convert the string from that encoding to UTF-8 for
// the internal representation. If pszEncoding is NULL, do nothing and
// store the 8-bit string as is (that is how it works now, so we neither
// break existing code nor introduce additional recoding). If
// pszEncoding is "", take the encoding from the current locale.
// Later we can declare that if pszEncoding is NULL the input string is
// in UTF-8 and, again, store it as is without recoding.
CPLString( const char *pszString, const char *pszEncoding = NULL );

// The same constructor for wchar_t, no different from the one suggested
// in RFC-23.
CPLString::CPLString( const wchar_t *pszInput, const char *pszEncoding =
"UCS-2" );

// Now the getter for 8-bit recoded strings. Take encoding from locale
// by default.
char *CPLString::GetAs( const char *pszDstEncoding = "" );

// Getter for wchar_t strings, the same as in RFC-23.
wchar_t *CPLString::GetAsWChar( const char *pszDstEncoding = "UCS-2" );

With this API I think we achieve pretty much the same as what RFC-23
suggests, but with much less ambiguity.
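
A rough sketch of how the 8-bit half of this could behave, under loud
assumptions: EncodedString is a hypothetical class (not the real
CPLString), only "UTF-8" and "ISO-8859-1" are handled, the "" (locale)
case and the wchar_t variants are left out, GetAs() returns std::string
instead of char* to keep ownership simple, and a string constructed with
a NULL encoding is treated as UTF-8 on output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: ISO-8859-1 to UTF-8.
std::string Latin1ToUtf8(const std::string &osInput)
{
    std::string osOutput;
    for (unsigned char ch : osInput)
    {
        if (ch < 0x80)
        {
            osOutput += static_cast<char>(ch);
        }
        else
        {
            osOutput += static_cast<char>(0xC0 | (ch >> 6));
            osOutput += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return osOutput;
}

// Hypothetical helper: UTF-8 to ISO-8859-1. Characters above U+00FF
// are replaced by a single '?'.
std::string Utf8ToLatin1(const std::string &osInput)
{
    std::string osOutput;
    for (std::size_t i = 0; i < osInput.size(); ++i)
    {
        const unsigned char ch = static_cast<unsigned char>(osInput[i]);
        if (ch < 0x80)
        {
            osOutput += static_cast<char>(ch);
        }
        else if ((ch == 0xC2 || ch == 0xC3) && i + 1 < osInput.size())
        {
            const unsigned char chNext =
                static_cast<unsigned char>(osInput[++i]);
            osOutput += static_cast<char>(((ch & 0x03) << 6) | (chNext & 0x3F));
        }
        else if ((ch & 0xC0) == 0x80)
        {
            continue;  // continuation byte of a sequence already replaced
        }
        else
        {
            osOutput += '?';
        }
    }
    return osOutput;
}

// Hypothetical stand-in for the proposed interface; not CPLString itself.
class EncodedString
{
  public:
    // NULL encoding: store the bytes as is. Otherwise recode to UTF-8.
    explicit EncodedString(const char *pszString,
                           const char *pszEncoding = nullptr)
    {
        assert(pszString != nullptr);
        if (pszEncoding == nullptr || std::strcmp(pszEncoding, "UTF-8") == 0)
            osData_ = pszString;
        else if (std::strcmp(pszEncoding, "ISO-8859-1") == 0)
            osData_ = Latin1ToUtf8(pszString);
        else
            throw std::invalid_argument("encoding not handled in this sketch");
    }

    // Return a copy recoded from the internal form to pszDstEncoding.
    std::string GetAs(const char *pszDstEncoding) const
    {
        assert(pszDstEncoding != nullptr);
        if (std::strcmp(pszDstEncoding, "UTF-8") == 0)
            return osData_;
        if (std::strcmp(pszDstEncoding, "ISO-8859-1") == 0)
            return Utf8ToLatin1(osData_);
        throw std::invalid_argument("encoding not handled in this sketch");
    }

  private:
    std::string osData_;  // UTF-8, or raw bytes if no encoding was given
};
```

The point of the design is that recoding happens only at the two
boundaries, in the constructor and in the getter, so there is never a
recode() call that silently changes what an existing object holds.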

Best regards,

Andrey V. Kiselev
ICQ# 26871517
