[fdo-users] FDO OGR 3.6+3.7 and UTF-8 problem

Fri Jan 6 11:34:33 EST 2012

> -----Original Message-----
> From: fdo-users-bounces at lists.osgeo.org [mailto:fdo-users-
> bounces at lists.osgeo.org] On Behalf Of Frank Warmerdam
> Sent: Friday, January 06, 2012 10:22 AM
> To: fdo-users at lists.osgeo.org
> Subject: Re: [fdo-users] FDO OGR 3.6+3.7 and UTF-8 problem
> 
> On 12-01-06 04:57 AM, Hans Milling wrote:
> > Hi everyone
> >
> > I need some help. I have problems with non-ASCII characters with FDO
> > OGR and MGOS2.2.
> > The strings on the map like road names are all messed up if they
> > contain a Danish letter like Æ Ø Å.
> > A city name like "Farsø" is suddenly written "Fars؀".
> > I have created a small test program (see code below) to test the
> > problem, and FDO 3.3 does not have any issues, but FDO 3.6 and 3.7
> > seem to have this issue. To me it looks like the ISO-8859-1 string read from the
> > TAB file is converted to
> > UTF-8 at some point, and that messes up the text. See this image for
> > the
> > output:
> > http://osgeo-org.1803224.n2.nabble.com/file/n7158330/FDO.png
> > Road name: "Bakkegårdsvej", the å character (number 229 or 0xE5) is
> > treated as the start of a 3-byte UTF-8 sequence and thus the following
> > "rd" letters are included to create a Chinese character resulting in
> > "Bakkeg岤svej".
> > Does anyone have a fix for this, can I recompile FDO in some way to
> > not make this error?
> > I think FDO should know/detect the format of the strings from the
> > source, so that these are not destroyed.
> 
> Hans,
> 
> The relevant RFC in OGR is:
> 
>    http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode
> 
> It appears the FDO OGR provider should at the very least be checking the
> OLCStringsAsUTF8 capability on the layer.  If true, it should be assumed that
> string attributes from the layer are in UTF-8 and processed accordingly.
> 

Until 22.02.2010, the FDO OGR provider converted strings using the active code page (by calling mbstowcs). After that date I changed it to use UTF-8. This explains why Western European encodings worked before and do not work now.
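
To illustrate the failure mode, here is a minimal sketch of what happens when ISO-8859-1 bytes are read as UTF-8. In ISO-8859-1 the letter å is the single byte 0xE5, which in UTF-8 has the bit pattern of a 3-byte lead byte. The helper names below are hypothetical, and the assumption that the converter does not validate the trailing bytes is mine, not something confirmed from the provider source.

```cpp
#include <cstdint>

// Hypothetical helper: decodes three bytes as one UTF-8 sequence WITHOUT
// checking that the two trailing bytes have the 10xxxxxx form.
static std::uint32_t LenientDecode3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)
         |  static_cast<std::uint32_t>(b2 & 0x3F);
}

// True when the byte has the 10xxxxxx form required of UTF-8 trailing bytes.
static bool IsContinuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Under that assumption, LenientDecode3(0xE5, 'r', 'd') yields 0x5CA4, a code point in the CJK range, which would account for "å" plus "rd" collapsing into one Chinese character. A strict decoder would instead reject the sequence, since IsContinuation('r') is false.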

If you are in a position to recompile, the fastest fix would be to change the following functions in stdafx.h:

static std::string W2A_SLOW(const wchar_t* input)
static std::wstring A2W_SLOW(const char* input)

so that they use mbstowcs and wcstombs instead of the UTF-8 conversion they currently perform. Alternatively, uncomment the calls to MultiByteToWideChar/WideCharToMultiByte in the same functions and change the code page parameter from CP_UTF8 to CP_ACP. You can also create a ticket for this in Trac.
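
A minimal sketch of the first option, assuming only the two signatures shown above; the bodies are illustrative and are not the provider's actual code. Note that mbstowcs and wcstombs follow the C locale's LC_CTYPE setting, so the process would most likely need a call such as setlocale(LC_CTYPE, "") at startup for the active code page to be used.

```cpp
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>
#include <vector>

// Narrow -> wide using the current C locale (sketch, not the shipped code).
static std::wstring A2W_SLOW(const char* input)
{
    if (input == nullptr) return std::wstring();
    // A multibyte string never produces more wide characters than it has bytes.
    std::vector<wchar_t> buf(std::strlen(input) + 1);
    size_t n = std::mbstowcs(buf.data(), input, buf.size());
    if (n == static_cast<size_t>(-1)) return std::wstring(); // invalid in this locale
    return std::wstring(buf.data(), n);
}

// Wide -> narrow using the current C locale (sketch, not the shipped code).
static std::string W2A_SLOW(const wchar_t* input)
{
    if (input == nullptr) return std::string();
    // Each wide character needs at most MB_CUR_MAX bytes in the current locale.
    std::vector<char> buf(std::wcslen(input) * MB_CUR_MAX + 1);
    size_t n = std::wcstombs(buf.data(), input, buf.size());
    if (n == static_cast<size_t>(-1)) return std::string(); // not representable
    return std::string(buf.data(), n);
}
```

Returning an empty string on a conversion failure is a choice made for this sketch; the real functions may need to report the error differently.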

I had no idea that the capability Frank mentions existed, and checking it would clearly be the right fix. However, I would argue that OGR should return all strings in UTF-8 instead of making the user check the capability. ;-)
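
A rough sketch of what the capability-based fix could look like. To stay independent of the GDAL headers, it takes the result of the layer's TestCapability(OLCStringsAsUTF8) call as a plain bool; DecodeAttribute, Utf8ToWide, LocaleToWide and AppendCodePoint are hypothetical names, and the UTF-8 decoder is simplified (it does not reject overlong encodings).

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Appends one code point, using a surrogate pair where wchar_t is 16 bits.
static void AppendCodePoint(std::wstring& out, unsigned long cp)
{
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out += static_cast<wchar_t>(0xD800 + (cp >> 10));
        out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
    } else {
        out += static_cast<wchar_t>(cp);
    }
}

// Simplified UTF-8 decoder: malformed sequences become U+FFFD.
static std::wstring Utf8ToWide(const char* s)
{
    std::wstring out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    while (*p) {
        unsigned long cp = 0xFFFD;
        int extra = 0;
        bool ok = true;
        if (*p < 0x80)                { cp = *p; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { ok = false; } // stray or invalid lead byte
        ++p;
        for (int i = 0; ok && i < extra; ++i) {
            if ((*p & 0xC0) != 0x80) { ok = false; break; } // not a continuation byte
            cp = (cp << 6) | (*p & 0x3F);
            ++p;
        }
        AppendCodePoint(out, ok ? cp : 0xFFFD);
    }
    return out;
}

// Fallback: interpret the bytes in the current C locale via mbstowcs.
static std::wstring LocaleToWide(const char* s)
{
    std::vector<wchar_t> buf(std::strlen(s) + 1);
    size_t n = std::mbstowcs(buf.data(), s, buf.size());
    if (n == static_cast<size_t>(-1)) return std::wstring();
    return std::wstring(buf.data(), n);
}

// stringsAreUtf8 is assumed to hold the result of
// layer->TestCapability(OLCStringsAsUTF8) for the layer being read.
static std::wstring DecodeAttribute(const char* raw, bool stringsAreUtf8)
{
    if (raw == nullptr) return std::wstring();
    return stringsAreUtf8 ? Utf8ToWide(raw) : LocaleToWide(raw);
}
```

With this shape, a layer that does not advertise UTF-8 strings would go through the locale path and keep its ISO-8859-1 text intact, while a layer that does advertise it would be decoded as UTF-8.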

Traian

> Best regards,
> --
> ---------------------------------------+--------------------------------
> ---------------------------------------+------
> I set the clouds in motion - turn up   | Frank Warmerdam,
> warmerdam at pobox.com
> light and sound - activate the windows | http://pobox.com/warmerda
> and watch the world go round - Rush    | Geospatial Software Developer