[GRASS-dev] Fwd: Re: [Qgis-developer] again on encoding problems

Glynn Clements glynn at gclements.plus.com
Thu Oct 25 14:32:45 PDT 2012


Paolo Cavallini wrote:

> I assume grass-dev are aware of the problem.

Yes.

> Has this been solved in wxPy GUI? How?

No.

> This looks like a serious issue, as it is keeping lots of people
> away from GRASS in QGIS at least.
> If you have a solution, we'll be happy to implement it in QGIS, workload
> permitting.

There are two issues for which there is no viable solution:

1. OEM encoding.
2. Shift-JIS.

Regarding #1: GRASS neither knows nor cares whether a string is in
ANSI or OEM encoding. Much of it doesn't care about encodings at all,
and just treats strings as sequences of bytes. Anything which needs to
care about the encoding (e.g. the GUI) will just use "the locale's
encoding", which on Windows means "the ANSI codepage". If you use the
OEM codepage for anything, you lose.
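
To illustrate the ANSI/OEM ambiguity, here is a minimal Python sketch. It
assumes a Western European Windows setup, where cp1252 is the ANSI codepage
and cp850 the OEM one; the sample character is arbitrary and the variable
names are only for illustration.

```python
# Hypothetical sample text; any non-ASCII Latin character would do.
text = "é"

ansi_bytes = text.encode("cp1252")  # assumed ANSI codepage (Western Europe)
oem_bytes = text.encode("cp850")    # assumed OEM (console) codepage

# The same character is stored as a different byte in each codepage.
print(ansi_bytes.hex(), oem_bytes.hex())

# Bytes written under the OEM codepage but read as ANSI come back as a
# different character, and nothing in the bytes says which was intended.
misread = oem_bytes.decode("cp1252")
print(misread == text)
```

Both byte strings are valid in both codepages, which is why code that only
sees the bytes cannot tell which one it was given.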

Suggestions as to how to determine whether a string uses the ANSI or
OEM page are welcome, if unlikely.

Regarding #2: On Windows, any byte within the range 0-127 is assumed
to represent the corresponding ASCII character. For encodings which
assign other characters to any byte within that range (either
individually or as part of a multi-byte sequence), that is likely to
cause problems.

The most obvious example is that any occurrence of the byte 0x5C
within a filename is assumed to be a directory separator. 
Unfortunately, Shift-JIS uses 0x5C as the second byte of a multi-byte
sequence, meaning that Japanese filenames may be parsed incorrectly.
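
A small Python sketch of the 0x5C problem. It uses cp932, Microsoft's
variant of Shift-JIS, and a made-up filename and directory; the byte-level
split stands in for any code that scans a pathname for separators without
knowing the encoding.

```python
# Hypothetical filename: the kanji encodes to the bytes 0x95 0x5C in cp932.
name = "表.txt"
raw = name.encode("cp932")

# Hypothetical directory prefix, built directly as bytes.
path = b"C:\\data\\" + raw

# Byte-oriented parsing treats the trail byte 0x5C as a separator
# and cuts the filename in two.
naive_parts = path.split(b"\\")
print(naive_parts)

# Decoding first, then splitting on the character, keeps the name intact.
decoded_parts = path.decode("cp932").split("\\")
print(decoded_parts)
```

The second variant only works because the code knows the encoding before it
looks for separators, which is exactly what byte-oriented pathname code
does not do.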

Neither EUC-JP nor UTF-8 has this problem (as these only re-purpose
codes 128 and above), but unfortunately Windows doesn't provide locales
which use either of these encodings.
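
The difference between the encodings can be checked with a short Python
sketch. The helper name below is made up for this example, and cp932 is
again used as a stand-in for Shift-JIS.

```python
def bytes_in_ascii_range(ch, encoding):
    """Return the bytes of one encoded character that fall in 0x00-0x7F."""
    return [b for b in ch.encode(encoding) if b < 0x80]

# The same kanji under three encodings: only cp932 yields a byte that
# collides with an ASCII character.
for enc in ("cp932", "euc_jp", "utf-8"):
    print(enc, [hex(b) for b in bytes_in_ascii_range("表", enc)])
```

An empty list means that no byte of the encoded character can be mistaken
for an ASCII character by byte-oriented code.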

And I can't think of any solution which doesn't involve re-writing all
code which handles pathnames.

Similar issues may exist with the other punctuation characters which
are "mingled" with the alphabetic characters, i.e. "[\]^_{|}~" (e.g. |
is commonly used as a field separator, so tabular data which includes
Japanese text may be parsed incorrectly).
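
The field-separator case can be sketched the same way in Python. The record
below is made up; it relies on the bytes 0x83 0x7C being a single katakana
character (PO) in cp932, so its trail byte is the same as "|".

```python
# 0x83 0x7C is one katakana character in cp932; its trail byte equals "|".
po = b"\x83\x7c".decode("cp932")

# Hypothetical three-field record using "|" as the separator.
record = "1|" + po + "|3"
raw = record.encode("cp932")

# Splitting the raw bytes sees an extra separator inside the character.
naive_fields = raw.split(b"|")
print(len(naive_fields))

# Splitting after decoding yields the intended three fields.
fields = raw.decode("cp932").split("|")
print(len(fields))
```

As with pathnames, the correct result depends on decoding before parsing,
which byte-oriented string handling never does.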

While such cases are probably less common than the pathname issue, a
fix is even less viable, as it would mean fixing all string-handling code.

-- 
Glynn Clements <glynn at gclements.plus.com>