[Qgis-developer] Shapefile Encoding summary

Borys Jurgiel lists at borysjurgiel.pl
Fri Apr 26 09:15:36 PDT 2013

Most of the encoding issues are fixed, but one big problem is left: 
the unresolvable "System" encoding. It's a special Qt codec and unfortunately 
there is no way to figure out which particular codec is hidden behind it. 
It's chosen in different ways depending on the OS, iconv presence etc. So if 
you use this "System" encoding when saving a Shapefile, OGR/iconv doesn't 
know how to encode it. Well, that's probably still solvable, but a bigger 
problem is that the "System" name goes into the .cpg file. Such a .cpg is of 
course useless, but it also forces QGIS to enable the OGR conversion, blocking 
the possibility to adjust the encoding in the encoding combobox.
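
To illustrate why a .cpg containing "System" is useless: the string is not a 
codec name that a generic codec registry can resolve. The sketch below uses 
Python's own codec registry as a stand-in for iconv (an assumption for 
illustration only; it is not what OGR calls internally), and resolve_codec is 
a hypothetical helper name.

```python
import codecs


def resolve_codec(name):
    """Return the canonical codec name, or None if the name is not a real codec."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # "System" ends up here: it names no concrete character set.
        return None


if __name__ == "__main__":
    for declared in ("UTF-8", "cp1250", "System"):
        print(declared, "->", resolve_codec(declared))
```

A reader of the .cpg is in the same position as this lookup: it gets a label, 
but nothing that tells it which bytes map to which characters.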

After spending some days on it, I believe that above all we should rename 
the "Ignore Shapefile encoding" option to the more meaningful "Shapefile 
encoding autodetection (experimental)", set it to false by default and warn in 
the documentation never to use this option together with the "System" 
encoding. It fixes the problem in a coarse manner and is a basis for any 
further convenience. Does anybody agree to do that? Regardless of this flag, 
QGIS properly writes the encoding declaration to the .cpg file. In QGIS 2.1 we 
can easily add our own encoding autodetection, which will just set the 
encoding combobox instead of forcing any internal conversion, as OGR's does. 
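
A minimal sketch of what such an autodetection could look like, assuming it 
only reads the .cpg next to the Shapefile and returns a suggestion for the 
combobox. The function name suggest_encoding_from_cpg is hypothetical, and 
Python's codec registry stands in for whatever list of encodings QGIS offers; 
real .cpg files may also use notations that this sketch does not recognize.

```python
import codecs
from pathlib import Path


def suggest_encoding_from_cpg(shp_path):
    """Return an encoding name to preselect, or None if nothing usable is declared.

    The result is only a suggestion: the caller would set the combobox to it
    and the user stays free to change it. No conversion is forced here.
    """
    cpg_path = Path(shp_path).with_suffix(".cpg")
    if not cpg_path.is_file():
        return None
    declared = cpg_path.read_text(encoding="ascii", errors="replace").strip()
    if not declared:
        return None
    try:
        return codecs.lookup(declared).name
    except LookupError:
        # Unresolvable declarations such as "System" give no suggestion.
        return None
```

Returning None for "System" means a file saved that way simply falls back to 
the user's manual choice instead of triggering a conversion from an unknown 
encoding.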

An alternative solution would be, I guess, unacceptable: to remove the 
"System" encoding from the comboboxes completely and make UTF-8 the default 
encoding. The major advantage and disadvantage at once is that it forces users 
to be aware of what encoding they use. However, I assume most users use 
Windows and Shapefiles in local codepages, and it could be unacceptable for 
them to switch from the default UTF-8 to their codepage after every QGIS 
installation.

Ok, I assume we encourage users to turn off the autodetection (turn on 
ignoring), and tell those who decide to use it anyway not to mix autodetection 
with the "System" encoding. But regardless of the autodetection/ignore state, 
if a file is saved with the "System" encoding, QGIS puts this name into the 
.cpg. A non-empty .cpg forces OGR to turn on the conversion from an indefinite 
encoding (unless ignoring is on). Even if we make a workaround in QGIS, the 
file can be broken in other software. I see three possibilities:

1. Only include the "System" encoding in comboboxes used for opening an 
existing Shapefile and remove it from those used for saving one. So users can 
use System for opening files, but they have to specify a particular encoding 
when saving Shapefiles. The default output encoding (in a fresh installation) 
would be UTF-8 of course. The question is: do we only do this when 
autodetection is on (ignoring is off) or always? In the former case it doesn't 
prevent most users from creating .cpg files containing the useless "System" 
string, so "partially corrupted". In the latter we force all users to use real 
encodings on save, but the efficacy is very high (except layers created by some 
plugins, as authors will surely forget to disable/enable the "System" in 

2. Let "System" exist in all comboboxes, but if the user uses it for 
saving, just don't create the .cpg file.
- unclear behaviour, as it creates some layers auto-recognizable, some not
- I'm not sure if saving to "System" will always work properly

3. Let "System" exist in all comboboxes, but if the user uses it for
- annoying
- only works when QgsInterface is available
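
Options 1 and 2 above could be sketched roughly as follows. This is only an 
illustration in plain Python, not QGIS code: encodings_for_save and write_cpg 
are hypothetical names, and the list of encodings is assumed to be a simple 
list of strings as shown in a combobox.

```python
from pathlib import Path

# Names that do not identify a concrete character set.
UNRESOLVABLE = {"system"}


def encodings_for_save(all_encodings):
    """Option 1: offer every encoding except "System" when saving."""
    return [e for e in all_encodings if e.strip().lower() not in UNRESOLVABLE]


def write_cpg(shp_path, encoding):
    """Option 2: write the .cpg only for a concrete encoding name.

    Returns True if a .cpg file was written, False if it was skipped.
    """
    if encoding.strip().lower() in UNRESOLVABLE:
        return False
    Path(shp_path).with_suffix(".cpg").write_text(encoding, encoding="ascii")
    return True
```

With option 1 the second function never sees "System" from the GUI, but it 
would still guard against plugins that pass it directly.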
