[Qgis-developer] Shapefile Encoding summary
Borys Jurgiel
lists at borysjurgiel.pl
Fri Apr 26 09:15:36 PDT 2013
Most of the encoding issues are fixed, but one big problem is left:
the unresolvable "System" encoding. It's a special Qt codec and unfortunately
there is no way to figure out which particular codec is hidden behind it.
It's chosen in different ways depending on the OS, iconv presence etc. So if
you use this "System" encoding when saving a Shapefile, OGR/iconv doesn't
know how to encode it. Well, that's probably still solvable, but a bigger
problem is that the "System" name goes into the .cpg file. Such a .cpg is of
course useless, but it also forces QGIS to enable the OGR conversion, blocking
the possibility to adjust the encoding in the encoding combobox.
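To illustrate why a .cpg that only says "System" is useless, here is a minimal sketch in plain Python (not QGIS or OGR code). It assumes a file written on a machine whose locale codepage happened to be cp1250; the candidate codepages and the helper names are hypothetical and picked only for the example.

```python
import codecs

# Assumption for the example: the writer's "System" codec was cp1250.
TEXT = "Łódź"
RAW = TEXT.encode("cp1250")


def candidate_readings(raw, candidates=("cp1250", "cp1252", "iso8859_2")):
    """Decode the same bytes under several plausible "system" codepages."""
    return {name: raw.decode(name, errors="replace") for name in candidates}


def is_resolvable(name):
    """True if the name identifies a concrete codec (here: one Python knows)."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


readings = candidate_readings(RAW)
# Only the reader that guesses the writer's codepage gets the original text back.
matches = [name for name, value in readings.items() if value == TEXT]
```

The string "System" names no codec on its own, so a reader has nothing to pick a codepage by, and a wrong guess silently produces different text rather than an error.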
After spending some days on it, I believe that above all we should rename
"Ignore Shapefile encoding" to the more meaningful "Shapefile encoding
autodetection (experimental)", set it to false by default and warn in the
documentation never to use this option together with the "System" encoding. It
fixes the problem in a coarse manner and is a basis for any further
convenience. Does anybody agree to do that? Regardless of this flag, QGIS
properly writes the encoding declaration to the .cpg file. In QGIS 2.1 we can
easily add our own encoding autodetection, which will just set the encoding
combobox instead of forcing any internal conversion, as OGR's does.
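A rough sketch of what such a "suggest only" autodetection could look like, again in plain Python rather than the QGIS API. The function name `suggest_encoding` is hypothetical, and the sketch only recognizes names Python's `codecs` module can resolve; real .cpg files may use other spellings that it does not handle.

```python
import codecs
from pathlib import Path


def suggest_encoding(shp_path):
    """Return an encoding name to preselect in the combobox, or None.

    Nothing is converted here: the caller would only use the result as the
    initial combobox value, which the user can still change.
    """
    cpg = Path(shp_path).with_suffix(".cpg")
    if not cpg.is_file():
        return None
    declared = cpg.read_text(encoding="ascii", errors="ignore").strip()
    # An empty declaration or the literal "System" carries no information.
    if not declared or declared.lower() == "system":
        return None
    try:
        return codecs.lookup(declared).name
    except LookupError:
        return None
```

Returning None for "System" means a useless .cpg simply leaves the combobox at whatever the user had chosen, instead of locking it.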
An alternative solution would be, I guess, unacceptable: to completely remove
the "System" encoding from the comboboxes and make UTF-8 the default encoding.
Its major advantage and disadvantage at once is that it forces users to be
aware of which encoding they use. However, I assume most users use Windows and
Shapefiles in local codepages, and it could be unacceptable for them to switch
from the default UTF-8 to their codepage after every QGIS installation.
Ok, I assume we encourage users to turn off the autodetection (turn on
ignoring), and tell those who decide to use it anyway not to mix autodetection
with the "System" encoding. But regardless of the autodetection/ignore state,
if a file is saved with the "System" encoding, QGIS puts this name into the
.cpg. A non-empty .cpg forces OGR to turn on the conversion from an indefinite
encoding (unless ignoring is on). Even if we make a workaround in QGIS, the
file can be broken in other software. I see three possibilities:
1. Only include the "System" encoding in comboboxes used for opening an
existing Shapefile and take it out of those used for saving one. So users can
use System for opening files, but they have to specify a particular encoding
when saving Shapefiles. The default output encoding (in a fresh installation)
would be UTF-8 of course. The question is: do we only do this when
autodetection is on (ignoring is off) or always? In the former case it doesn't
prevent most users from creating .cpg files containing the useless "System"
string, so "partially corrupted". In the latter we force all users to use real
encodings on save, but the efficacy is very high (except for layers created by
some plugins, as authors will surely forget to disable/enable "System" in
QgsEncodingFileDialog).
2. Let "System" exist in all comboboxes, but if the user uses it for
saving, just don't create the .cpg file.
Disadvantages:
- unclear behaviour, as it makes some layers auto-recognizable and some not
- I'm not sure if saving to "System" will always work properly
3. Let "System" exist in all comboboxes, but if the user uses it for
Disadvantages:
- annoying
- only works when QgsInterface is available
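Possibilities 1 and 2 above can be sketched roughly as follows. This is plain Python for illustration only: it does not call QgsEncodingFileDialog or OGR, and the names `encodings_for_combobox` and `write_cpg` are hypothetical.

```python
from pathlib import Path

SYSTEM = "System"


def encodings_for_combobox(available, saving):
    """Possibility 1: offer "System" when opening, hide it when saving."""
    if not saving:
        return list(available)
    return [name for name in available if name != SYSTEM]


def write_cpg(shp_path, encoding):
    """Possibility 2: write the .cpg only for a concrete encoding name.

    Returns the path of the written .cpg, or None when it was skipped.
    """
    if encoding == SYSTEM:
        return None
    cpg = Path(shp_path).with_suffix(".cpg")
    cpg.write_text(encoding + "\n", encoding="ascii")
    return cpg
```

With the first function the save dialog can never yield "System", so the second guard only matters for code paths that bypass the dialog, such as the plugins mentioned in possibility 1.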