[GRASS-dev] Moving GRASS Python parts to Unicode

Glynn Clements glynn at gclements.plus.com
Tue Feb 16 15:32:08 PST 2016


Maris Nartiss wrote:

> as you might already have noticed, there is a constant stream of
> issues containing keywords "encoding" or more often
> "UnicodeDecodeError". The main reason behind this is Python 2.x's two
> types of text string: byte sequences (what you get with str()) and
> Unicode (unicode()). Python 3.x will have only one, Unicode (a byte
> sequence is not a string any more), thus fixing this frustrating
> source of errors.

Both versions have both types of string. In 2.x, str() and "plain"
string literals create byte strings, while unicode() and u"..." create
unicode strings. In 3.x, str() and plain string literals create
unicode strings, while bytes() and b"..." create byte strings.
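
To make the distinction concrete, here is a minimal sketch under Python 3
semantics (the sample word is only an illustration; under 2.x the plain
literal would be the byte string and u"..." the unicode one):

```python
# Python 3: a plain literal is a unicode string (str).
text = "caf\u00e9"

# Encoding it produces a byte string (bytes).
data = text.encode("utf-8")

print(type(text).__name__, len(text))  # length counted in code points
print(type(data).__name__, len(data))  # length counted in bytes

# The two types are distinct, but the conversion round-trips
# as long as the same encoding is used in both directions.
assert data.decode("utf-8") == text
```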

The biggest differences between the two are:

a) 2.x allows implicit conversions. If you pass a byte string where a
unicode string is expected (or vice versa), the string is implicitly
converted using the default encoding (which can't be set by a script).
3.x doesn't do this; you get an exception.

b) 3.x tries quite hard to maintain the fiction that everything is
unicode. E.g. sys.argv contains unicode strings, os.environ uses
unicode strings for both keys and values, sys.stdin/stdout/stderr are
text streams which return Unicode data.
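
Both points can be checked with a short sketch, assuming Python 3. The
helper name try_mix is hypothetical and exists only for this illustration:

```python
import os
import sys

def try_mix(text, data):
    """Return the name of the exception raised when a unicode string
    and a byte string are concatenated, or None if none is raised."""
    try:
        text + data
    except TypeError as exc:
        return type(exc).__name__
    return None

# (a) No implicit conversion in 3.x: mixing the two types raises.
print(try_mix("abc", b"def"))

# (b) Command-line arguments and the environment are exposed as
# unicode strings, whatever bytes the OS actually supplied.
print(all(isinstance(arg, str) for arg in sys.argv))
print(all(isinstance(k, str) and isinstance(v, str)
          for k, v in os.environ.items()))
```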

> Moving GRASS Python code to use Unicode internally will make it closer
> to Python 3 ready and solve the largest part of the errors caused by
> implicit conversion from encoded text strings to Unicode text strings.

I don't particularly care what happens with wxGUI, and using unicode
consistently would make sense there, as wx itself uses Unicode. But if
you're planning on doing this to grass.script, I'm strongly opposed. 
It achieves nothing beyond making what should be wxGUI's problem into
everyone else's problem.

Pretending that everything is unicode only works so long as the rest
of the world makes sure not to dispel the illusion. Otherwise, it
fails hard. Something as simple as e.g. copying stdin to stdout fails
just because the data isn't in the assumed encoding.
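
A rough sketch of that failure, assuming Python 3 and using in-memory
streams in place of the real stdin/stdout. The function names and the
sample bytes are made up for the illustration, and the text stream is
explicitly set to strict UTF-8 decoding, which may differ from what a
given locale would select:

```python
import io
import shutil

# Latin-1 encoded input; the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9\n"

def copy_as_text(data, encoding="utf-8"):
    """Copy through text streams, decoding the input on the way."""
    src = io.TextIOWrapper(io.BytesIO(data), encoding=encoding)
    dst = io.StringIO()
    shutil.copyfileobj(src, dst)
    return dst.getvalue()

def copy_as_bytes(data):
    """Copy through binary streams; the bytes are never interpreted."""
    src = io.BytesIO(data)
    dst = io.BytesIO()
    shutil.copyfileobj(src, dst)
    return dst.getvalue()

try:
    copy_as_text(raw)
except UnicodeDecodeError as exc:
    print("text copy failed:", exc.reason)

# The byte-level copy passes the data through unchanged.
assert copy_as_bytes(raw) == raw
```

With the real streams, the byte-level equivalent would read from
sys.stdin.buffer and write to sys.stdout.buffer.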

Bear in mind that the C portion of GRASS (i.e. most of it) doesn't pay
any attention to encodings unless it has to. It just passes bytes
around. It doesn't care whether the bytes are in any particular
encoding, and certainly won't attempt to ensure that data written to
stdout or to files is in any particular encoding.

-- 
Glynn Clements <glynn at gclements.plus.com>

