[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Tue Sep 26 15:45:09 EDT 2006

Hi,

Andrey Kiselev wrote:
> On Mon, Sep 25, 2006 at 06:45:13PM -0400, Frank Warmerdam wrote:
>>> - simplify GDAL interfaces and make them *explicit* with respect to i18n.
>>> - enable smooth transition for existing applications.
>>> - avoid yet another copy of strings, because many libraries already 
>>> provide wchar-driven interfaces.
>>>       
>> Can you provide some more background on why a wide character solution
>> is preferable to UTF8?
For GDAL, the three points above are the main ones.
>>  Personally, I think the whole effort will be a
>> non-starter if we have to re-engineer everything to use wide characters.
>>     
Rewriting GDAL/OGR to use wide chars is pointless: it is too much effort 
for GDAL developers as well as GDAL users. It is better to provide some 
additional interfaces for wide chars, e.g.:

class OGRSFDriverRegistrar
{
  public:
    // Existing narrow-character interfaces.
    static OGRDataSource *Open( const char *pszName, int bUpdate=FALSE,
                                OGRSFDriver ** ppoDriver = NULL );

    OGRDataSource *OpenShared( const char *pszName, int bUpdate=FALSE,
                               OGRSFDriver ** ppoDriver = NULL );

    // Proposed additional wide-character overloads.
    static OGRDataSource *Open( const wchar_t *pszName, int bUpdate=FALSE,
                                OGRSFDriver ** ppoDriver = NULL );

    OGRDataSource *OpenShared( const wchar_t *pszName, int bUpdate=FALSE,
                               OGRSFDriver ** ppoDriver = NULL );
};
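
To illustrate how such paired overloads behave from the caller's side, 
here is a minimal sketch. RegistrarSketch and its return values are 
hypothetical stand-ins invented for this example, not the real OGR 
classes; the only point is that the compiler picks the overload from the 
literal type, so existing narrow-string callers are not affected.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a registrar class; NOT the real OGR API.
struct RegistrarSketch
{
    // Existing-style entry point: the encoding of pszName is a convention.
    static std::string Open(const char *pszName)
    {
        return std::string("narrow:") + pszName;
    }

    // Additional wide-character entry point, chosen by overload resolution.
    static std::string Open(const wchar_t *pszName)
    {
        // Report only the length so no narrow/wide conversion is needed.
        return "wide:" + std::to_string(std::wstring(pszName).size());
    }
};
```

A call such as RegistrarSketch::Open("a.shp") resolves to the first 
overload and RegistrarSketch::Open(L"a.shp") to the second, without any 
change to the calling code other than the literal prefix.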

Please note how encodings were introduced in the C++ STL, which is in a 
sense a blueprint for C++ developers. STL streams for narrow and wide 
chars are distinct, and there is no assumption that a string passed, for 
example, to cout is encoded in some XYZ standard and thus requires 
recoding before being output to the terminal.
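
As a small sketch of that separation (FormatNarrow and FormatWide are 
hypothetical helper names, used only for this example): the narrow and 
wide stream and string types are separate instantiations of the same 
standard templates, and neither performs any recoding of what is passed 
to it.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <type_traits>

// Narrow and wide families are distinct instantiations of the same templates.
static_assert(std::is_same<std::ostringstream,
                           std::basic_ostringstream<char>>::value,
              "narrow stream is the char instantiation");
static_assert(std::is_same<std::wostringstream,
                           std::basic_ostringstream<wchar_t>>::value,
              "wide stream is the wchar_t instantiation");
static_assert(!std::is_same<std::string, std::wstring>::value,
              "narrow and wide strings are different types");

// Hypothetical helper: formats through the narrow stream, bytes pass through as-is.
std::string FormatNarrow(const std::string &name)
{
    std::ostringstream out;
    out << "layer=" << name;
    return out.str();
}

// Hypothetical helper: the same logic through the wide stream.
std::wstring FormatWide(const std::wstring &name)
{
    std::wostringstream out;
    out << L"layer=" << name;
    return out.str();
}
```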
>
> I second this. For me it looks like 8-bit UTF-8 is the same as
> wide chars in terms of multilingual support. 
Sure, one may encode everything in UTF-8 as well as in wide chars. But 
the convention to use UTF-8 encoding in plain strings is only 
*implicit* and may be violated at runtime. Wide chars are *explicit* 
and enable partial compile-time validation of i18n handling as well as 
a clean distinction between i18n-aware and non-aware interfaces.
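
A minimal sketch of what "partial compile-time validation" can mean in 
practice (WideNameLength is a hypothetical function made up for this 
example): the two pointer types do not convert into each other, so a 
narrow string handed to a wide interface is rejected by the compiler 
rather than misinterpreted at runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <type_traits>

// Neither pointer type converts implicitly to the other, so mixing them
// up is a compile-time error rather than a runtime encoding bug.
static_assert(!std::is_convertible<const char *, const wchar_t *>::value,
              "narrow strings must not silently become wide strings");
static_assert(!std::is_convertible<const wchar_t *, const char *>::value,
              "wide strings must not silently become narrow strings");

// Hypothetical i18n-aware function, for illustration only.
std::size_t WideNameLength(const wchar_t *pszName)
{
    return std::wcslen(pszName);
}

// WideNameLength("roads");  // would not compile: const char* is not const wchar_t*
```

With a plain char interface, by contrast, a UTF-8 string and a code-page 
string have the same type and the compiler cannot tell them apart.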
> Wide chars are typically used in Windows interfaces and we can implement the necessary conversion functions where they are needed. 
>   
Wide chars were introduced in the C/C++ standards to avoid unnecessary 
conversions and enable smooth transition of strings between various 
libraries. One should always remember that the ordinary C++ application 
usually wastes its time in string copying/conversion. Why introduce 
yet another copy?
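
To make the cost concrete, here is a rough sketch of what a UTF-8 to 
wide conversion layer has to do on every call. Utf8ToWideSketch is a 
hypothetical helper written for this example only; it assumes 
well-formed input limited to 1- to 3-byte sequences (the Basic 
Multilingual Plane), does no real validation, and is not how GDAL or 
any particular library implements it. The point is simply that a second 
buffer is allocated and the whole string is walked once more.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes 1- to 3-byte UTF-8 sequences into wchar_t.
// Anything else (4-byte sequences, truncated input) becomes U+FFFD.
std::wstring Utf8ToWideSketch(const std::string &utf8)
{
    std::wstring wide;
    wide.reserve(utf8.size());  // the extra buffer: one more allocation

    std::size_t i = 0;
    while (i < utf8.size())     // the extra pass over the whole string
    {
        const unsigned int b0 = static_cast<unsigned char>(utf8[i]);
        unsigned int cp = 0xFFFD;
        std::size_t len = 1;

        if (b0 < 0x80)
        {
            cp = b0;  // plain ASCII
        }
        else if ((b0 & 0xE0) == 0xC0 && i + 1 < utf8.size())
        {
            const unsigned int b1 = static_cast<unsigned char>(utf8[i + 1]);
            cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
            len = 2;
        }
        else if ((b0 & 0xF0) == 0xE0 && i + 2 < utf8.size())
        {
            const unsigned int b1 = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned int b2 = static_cast<unsigned char>(utf8[i + 2]);
            cp = ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
            len = 3;
        }

        wide.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return wide;
}
```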

> Most free toolkits and libraries use UTF-8 internally and it is no problem to interface with them using our
> approach.
>   
In our company we use *directly* 20-30 open source libraries and I 
cannot agree with the above statement. Moreover, for many languages with 
GDAL/OGR bindings, wide chars are the preferred way to deal with i18n 
issues.

BTW. On Windows, M$ code pages are typically employed. This is a horror 
for portable applications and an even worse solution than UTF-8.

BTW 2. OGR spends more time in strcpy than the average C++ library; 
please run cachegrind on some trivial examples to verify that.

Marek Brudka