[Qgis-user] [1.8.0] Broken UTF-8 support in shapefiles - and workaround

Fri Jun 29 01:36:03 PDT 2012

Hi,

(Disclaimer: I'm a GDAL contributor)

First I'd like to say that pointing the finger at GDAL will not help improve
the situation, and we should strive for more constructive cooperation. I think
there are various issues involved and I'll try to summarize my view of things:
- Before GDAL 1.9, the Shapefile driver didn't have any knowledge of shapefile
encoding: in both reading and writing operations, it passed raw bytes from/to
the .DBF file without any recoding
- Starting with GDAL 1.9, the Shapefile driver will :
   * for write operations : recode from UTF-8 to the encoding specified by the
ENCODING layer creation option (-lco option of ogr2ogr) or, for an existing
shapefile, to the encoding given by the LDID field of the .dbf header or the
.cpg file. If the value of ENCODING is of the form LDID/xx, then xx is written
as the LDID field in the .dbf header. If it is of another form, then it is
written as a plain string in the accompanying .cpg file. If no value for
ENCODING is specified, then LDID/87 is assumed. This value is supposed to be
the "Current ANSI codepage", a concept that doesn't actually make sense on all
platforms, and that doesn't make sense when transporting shapefiles from one
system to another. An assumption is then made that LDID/87 is actually
ISO-8859-1 (Latin1) and, indeed, this is strongly biased towards Western
European languages. As far as QGIS is concerned, when creating a shapefile, it
might be prudent to specify ENCODING=UTF-8 if the strings passed to OGR
CreateFeature() are in UTF-8. The consequence will be that no recoding will
occur, and a .cpg file with UTF-8 in it will be written.
    * for read operations : recode to UTF-8 from the encoding specified in the
LDID field in the .dbf header or in the .cpg file (the .cpg file has priority
over the LDID field). Several issues can then occur :
        - The actual content of the .dbf may not match the declared LDID
value or .cpg, in which case the recoding to UTF-8 will fail. This can be
worked around by setting the SHAPE_ENCODING environment variable to the
appropriate value, when it is known. You can also set SHAPE_ENCODING to the
empty string, in which case no recoding at all will occur. That might be the
solution for QGIS if QGIS wants to do the recoding on its side, based on user
input for example.
        - Even if the .dbf content and the LDID or .cpg are consistent, you can
have issues if the build of GDAL does not use the iconv library, which is what
does the recoding (without the iconv dependency, there is only built-in
conversion between Latin1 and UTF-8). Until recent fixes in GDAL (not yet
released, see http://trac.osgeo.org/gdal/ticket/4650), there was indeed a bug
in the TestCapability(OLCStringsAsUTF8) method, which returned TRUE as soon as
the shape encoding was found, without checking that the recoding services were
actually available.
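
To make the write-side rule above more concrete, here is a minimal Python
sketch of where an ENCODING value would end up. It is not GDAL code:
plan_encoding_output() is a hypothetical helper name, and the sketch only
mirrors the rule as described in this mail (LDID/xx goes to the .dbf header,
anything else goes to the .cpg file, LDID/87 is the default). The small demo at
the end uses Python's own codecs to show why the Latin1 assumption is limiting.

```python
import re

# Default assumed when no ENCODING option is given, per the description above.
DEFAULT_ENCODING = "LDID/87"


def plan_encoding_output(encoding=None):
    """Hypothetical helper (not a GDAL function): where an ENCODING value lands.

    Returns a dict holding either the LDID number destined for the .dbf
    header or the string destined for the .cpg file; the other entry is None.
    """
    value = encoding or DEFAULT_ENCODING
    match = re.fullmatch(r"LDID/(\d+)", value)
    if match:
        # "LDID/xx" form: xx is stored in the LDID field of the .dbf header.
        return {"ldid": int(match.group(1)), "cpg": None}
    # Any other form: stored as a plain string in the .cpg file.
    return {"ldid": None, "cpg": value}


if __name__ == "__main__":
    print(plan_encoding_output())         # default case
    print(plan_encoding_output("UTF-8"))  # value goes to the .cpg file

    # Illustration with Python codecs only: a Polish name cannot be
    # represented in Latin1, while UTF-8 handles it.
    text = "Łódź"
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        print("not representable in Latin1")
    print(text.encode("utf-8"))
```

With ogr2ogr, the UTF-8 case corresponds to passing -lco ENCODING=UTF-8 when
creating the shapefile.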

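For the read side, the lookup order can be sketched in Python as follows. This
is only an illustration under assumptions, not the driver's implementation:
declared_encoding() and decode_or_none() are hypothetical helper names, the
LDID is read from byte 29 of the .dbf header (the language driver ID position
in the dBASE header layout), and SHAPE_ENCODING is treated as overriding
whatever the files declare. The decoding step uses Python codecs as a stand-in
for the recoding done by GDAL.

```python
import os
from pathlib import Path


def declared_encoding(shp_path, env=None):
    """Hypothetical helper: return (source, value) for a shapefile's encoding.

    Order assumed here: SHAPE_ENCODING, then the .cpg file, then the LDID byte.
    """
    env = os.environ if env is None else env
    if "SHAPE_ENCODING" in env:
        # An empty string means "do not recode at all".
        return ("SHAPE_ENCODING", env["SHAPE_ENCODING"])
    cpg = Path(shp_path).with_suffix(".cpg")
    if cpg.is_file():
        # The .cpg file has priority over the LDID field.
        return (".cpg", cpg.read_text(encoding="ascii", errors="replace").strip())
    with open(Path(shp_path).with_suffix(".dbf"), "rb") as dbf:
        header = dbf.read(32)
    return ("LDID", header[29])


def decode_or_none(raw, encoding):
    """Try to decode raw .dbf bytes; return None if the declaration is wrong."""
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return None


if __name__ == "__main__":
    raw = b"caf\xe9"  # Latin1 bytes, as they might sit in a .dbf record
    print(decode_or_none(raw, "utf-8"))       # declaration does not match
    print(decode_or_none(raw, "iso-8859-1"))  # declaration matches
```

The first call in the demo corresponds to the mismatch case described above: a
.cpg file that says UTF-8 next to a .dbf that actually holds Latin1 bytes.
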
I hope this makes the workings of the shapefile driver clearer and that the
QGIS team can find the best solution on how to integrate with it.

Best regards,

Even