[Gdal-dev] RFC DRAFT: Unicode support in GDAL
Marek Brudka
mbrudka at aster.pl
Tue Sep 26 15:45:09 EDT 2006
Hi,
Andrey Kiselev wrote:
> On Mon, Sep 25, 2006 at 06:45:13PM -0400, Frank Warmerdam wrote:
>>> - simplify GDAL interfaces and make them *explicit* with respect to i18n.
>>> - enable smooth transition for existing applications.
>>> - avoid yet another copy of strings, because many libraries already
>>> provide wchar-driven interfaces.
>>>
>> Can you provide some more background on why a wide character solution
>> is preferable to UTF8?
For GDAL, the three points above are the main ones.
>> Personally, I think the whole effort will be a
>> non-starter if we have to re-engineer everything to use wide characters.
>>
Rewriting GDAL/OGR to use wide chars is pointless: it is too much effort
for GDAL developers as well as GDAL users. It is better to provide some
additional interfaces for wide chars, e.g.
class OGRSFDriverRegistrar
{
  public:
    /* existing narrow-character interfaces */
    static OGRDataSource *Open( const char *pszName, int bUpdate=FALSE,
                                OGRSFDriver ** ppoDriver = NULL );
    OGRDataSource *OpenShared( const char *pszName, int bUpdate=FALSE,
                               OGRSFDriver ** ppoDriver = NULL );

    /* proposed wide-character overloads */
    static OGRDataSource *Open( const wchar_t *pszName, int bUpdate=FALSE,
                                OGRSFDriver ** ppoDriver = NULL );
    OGRDataSource *OpenShared( const wchar_t *pszName, int bUpdate=FALSE,
                               OGRSFDriver ** ppoDriver = NULL );
};
Please notice how encodings were introduced in the C++ STL, which is in a
sense a blueprint for C++ developers. STL streams for narrow and wide
chars are distinct, and there is no assumption that a string passed, for
example, to cout is encoded in some XYZ standard and thus requires
recoding before being output to the terminal.
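To illustrate the point about distinct stream families, here is a minimal
sketch. The helper WriteName is hypothetical and not part of GDAL or the
STL; it only shows that the character type selects the stream family at
compile time and that no encoding is assumed or converted.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

// The narrow and wide stream families are separate instantiations of the
// same class templates, so the compiler never confuses one for the other.
static_assert(std::is_same<std::ostringstream,
                           std::basic_ostringstream<char>>::value,
              "narrow stream is basic_ostringstream<char>");
static_assert(std::is_same<std::wostringstream,
                           std::basic_ostringstream<wchar_t>>::value,
              "wide stream is basic_ostringstream<wchar_t>");
static_assert(!std::is_same<std::ostringstream, std::wostringstream>::value,
              "the two stream types are distinct");

// Hypothetical helper: writes a name to whichever stream family the caller
// uses. CharT is deduced from the arguments; nothing is recoded.
template <typename CharT>
void WriteName(std::basic_ostream<CharT> &out,
               const std::basic_string<CharT> &name)
{
    out << name;
}
```

Calling WriteName with a narrow stream and a wide string (or the reverse)
fails template deduction, so the mismatch is reported by the compiler.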
>
> I am seconded on this. For me it looks like 8-bit UTF-8 is the same as
> wide chars in terms of multilingual support.
Sure, one may encode everything in UTF-8 as well as in wide chars. But
the convention of using UTF-8 encoding in plain strings is only
*implicit* and may be violated at runtime. Wide chars are
*explicit* and enable partial compile-time validation of i18n handling,
as well as a clean distinction between i18n-aware and non-aware interfaces.
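A small sketch of what "partial compile-time validation" could look like.
OpenPath and WideOnlyLength are hypothetical stand-ins, not GDAL
functions; the only claim is that const char* and const wchar_t* are
unrelated types, so overload resolution and type checking keep them apart.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class PathKind { Narrow, Wide };

// Hypothetical stand-ins for a narrow and a wide Open() overload.
// Each one reports which overload the compiler selected.
inline PathKind OpenPath(const char *) { return PathKind::Narrow; }
inline PathKind OpenPath(const wchar_t *) { return PathKind::Wide; }

// A hypothetical i18n-aware-only interface: it accepts wide strings and
// nothing else, so the type system documents the contract.
inline std::size_t WideOnlyLength(const wchar_t *name)
{
    return std::wstring(name).size();
}

// WideOnlyLength("abc");  // would not compile: there is no implicit
//                         // conversion from const char* to const wchar_t*
```

With a char-only interface and a UTF-8 convention, the equivalent mistake
(passing a Latin-2 string where UTF-8 is expected) compiles cleanly and is
only visible at runtime.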
> Wide chars typically used in Windows interfaces and we can implement necessary conversion functions where it is needed.
>
Wide chars were introduced in the C/C++ standards to avoid unnecessary
conversions and to enable smooth transfer of strings between various
libraries. One should always remember that an ordinary C++ application
usually wastes its time on string copying/conversion. Why introduce
yet another copy?
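To make the cost concrete, here is a rough sketch of what a conversion
layer does on every call. ToUtf8, OpenUtf8 and OpenWide are hypothetical
names, not GDAL API. The sketch assumes each wchar_t holds one whole code
point (true where wchar_t is 32 bits; UTF-16 surrogate pairs, as on
Windows, are not handled here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical converter: encodes a null-terminated wide string as UTF-8.
// It allocates a new buffer and copies every character into it.
inline std::string ToUtf8(const wchar_t *wide)
{
    std::string out;
    for (; *wide != L'\0'; ++wide) {
        const std::uint32_t cp = static_cast<std::uint32_t>(*wide);
        if (cp < 0x80) {                       // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical char-based entry point standing in for an existing API;
// here it just returns the byte length of the name it receives.
inline std::size_t OpenUtf8(const char *pszName)
{
    return std::strlen(pszName);
}

// Wide wrapper built on top of the narrow API: every call pays for one
// temporary allocation and one full pass over the string.
inline std::size_t OpenWide(const wchar_t *pszName)
{
    return OpenUtf8(ToUtf8(pszName).c_str());
}
```

A native wide-character overload would hand the caller's buffer straight
to the underlying library and skip the temporary string entirely.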
> Most free toolkits and libraries use UTF-8 internally and it is no problem to interface with them using our
> approach.
>
In our company we use 20-30 open source libraries *directly*, and I
cannot agree with the above statement. Moreover, for many languages with
GDAL/OGR bindings, wide chars are the preferred way to deal with i18n issues.
BTW. On Windows, M$ code pages are typically employed. This is a horror for
portable applications and an even worse solution than UTF-8.
BTW 2. OGR spends more time in strcpy than the average C++ library.
Please run cachegrind on some trivial examples to verify that.
Marek Brudka