[GRASS-dev] Python 3 porting and unicode

Glynn Clements glynn at gclements.plus.com
Tue Nov 28 00:18:51 PST 2017


Vaclav Petras wrote:

> * There is no way around unicode when using Python 3. Unicode is an
> inherent part of the language; even things such as os.environ or
> sys.stdout.write() work only with unicode. I'm not sure what exactly the
> rule is here, but it seems to be everywhere.

Python 3 has os.environb on Unix. You can use the .detach() method on
text streams to get the underlying binary stream.
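
A minimal sketch of both, assuming nothing beyond the standard
library. The in-memory stream stands in for a real text stream such as
sys.stdout, and the names raw_path, text_stream and binary_stream are
only for illustration:

```python
import io
import os

# os.environb exists only where os.supports_bytes_environ is true
# (Unix-like systems); elsewhere fall back to encoding os.environ.
if os.supports_bytes_environ:
    raw_path = os.environb.get(b"PATH", b"")
else:
    raw_path = os.fsencode(os.environ.get("PATH", ""))

# An in-memory text stream, used here instead of sys.stdout so the
# example has no side effects.
buffer = io.BytesIO()
text_stream = io.TextIOWrapper(buffer, encoding="utf-8")
text_stream.write("abc")

# detach() flushes pending text and returns the underlying binary
# stream; text_stream itself is unusable afterwards.
binary_stream = text_stream.detach()
binary_stream.write(b"\xff\xfe")  # arbitrary bytes, not valid UTF-8

result = buffer.getvalue()
```

Note that detaching sys.stdout itself leaves print() unusable for the
rest of the process, so sys.stdout.buffer is often the less drastic
choice when the text stream is still needed.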

> * In relation to the previous point, one of the reasons why unicode is used
> is that things like text[:10] actually return 10 characters to display.

Although some of those characters may be combining characters or
control codes. Unicode characters don't necessarily map 1:1 with
glyphs.
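
As a small illustration (the strings are made up for this example),
slicing by code point can split a base letter from its combining
accent:

```python
import unicodedata

# "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT,
# repeated three times: three glyphs, six code points.
decomposed = "e\u0301" * 3
code_points = len(decomposed)

# The first three code points are 'e', U+0301, 'e': one accented
# glyph plus one bare 'e', not three accented glyphs.
prefix = decomposed[:3]

# NFC normalisation composes each pair into U+00E9, but not every
# combining sequence has a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
```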

> * Users of the Python API who are using Python 3 will expect unicode
> strings to work, i.e. expect run_command('g.region', flags='p') to work
> (not just run_command(b'g.region', flags=b'p')).

Even if you automatically encode unicode strings, there's no guarantee
that it will work (e.g. if the string is a filename, then the encoded
string must produce the correct sequence of bytes).

I can't think of any significant cases where it's likely to be
necessary to pass "binary" data via arguments, although it should be
trivial to simply accept data which is already a byte string.
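
A rough sketch of what accepting both types could look like.
encode_arg is a hypothetical helper, not part of the GRASS Python API,
and the choice of os.fsencode() (the filesystem encoding, with
surrogateescape on Unix) is an assumption which may not suit every
argument:

```python
import os

def encode_arg(value):
    """Hypothetical helper: return a command-line argument as bytes."""
    if isinstance(value, bytes):
        # Already a byte string: pass it through untouched.
        return value
    # Assumption: the filesystem encoding is the right one to use.
    return os.fsencode(str(value))

from_text = encode_arg("g.region")
from_bytes = encode_arg(b"caf\xe9.tif")  # e.g. a Latin-1 filename
```

The pass-through branch is what lets a caller hand over a filename
whose bytes don't decode in the current locale.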

The bigger issue is with output: the output from GRASS commands isn't
guaranteed to be in the locale's encoding (if it's extracted from a
file, it's going to be in whatever encoding the file uses). Returning
bytes allows the user to deal with this; automatically decoding the
data will either raise an exception or return mojibake if the encoding
doesn't match.
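
Both failure modes are easy to reproduce with made-up data. Here the
bytes stand in for command output whose encoding differs from the one
the caller assumes:

```python
# Latin-1 output decoded as UTF-8: the lone 0xE9 byte is not valid
# UTF-8, so strict decoding raises.
latin1_output = "café".encode("latin-1")  # b"caf\xe9"
try:
    latin1_output.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# UTF-8 output decoded as Latin-1: every byte is valid Latin-1, so
# decoding succeeds silently and yields mojibake.
utf8_output = "café".encode("utf-8")  # b"caf\xc3\xa9"
mojibake = utf8_output.decode("latin-1")

# With the bytes in hand, the caller can apply the encoding they
# know to be correct.
recovered = latin1_output.decode("latin-1")
```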

> * It seems hard to predict when we will know the right encoding of the
> text.

Which is why byte-oriented interfaces still exist and still matter,
and will do so for the foreseeable future.

Python's solution is to accelerate standardisation on Unicode by
making the alternatives as painful as possible. Yet legacy encodings
remain widespread.

-- 
Glynn Clements <glynn at gclements.plus.com>

