[GRASS-dev] Python 3 porting and unicode

Vaclav Petras wenzeslaus at gmail.com
Sun Nov 26 19:21:32 PST 2017


Dear all,

after looking at different Python 2 to 3 porting issues, doing r71849, and
reading #3392, I understand the following:

* Several solutions for porting exist. The most recent one is the
python-future project, but only from __future__ import ... is part of the
standard library and thus guaranteed with recent Python 2.7. (We can
discuss concrete steps separately.)

* However, the most challenging part of the porting will be the unicode.

* There is no way around unicode when using Python 3. Unicode is an
inherent part of the language; even things such as os.environ or
sys.stdout.write() work only with unicode. I'm not sure what exactly the
rule is here, but it seems to be everywhere.

* I haven't seen any simple fix which would keep the changes in the code
as limited as, e.g., the fix for the print statement.

* GUI will always use unicode because that's how the libraries and
interfaces are set.

* In relation to the previous point, one of the reasons why unicode is used
is that things like text[:10] actually return 10 characters to display.

* C library will not use unicode for now.

* Users of the Python API who are using Python 3 will expect unicode
strings to work, i.e. expect run_command('g.region', flags='p') to work
(not just run_command(b'g.region', flags=b'p')).

* If Python libraries are unicode, there will need to be an interface to
work with ctypes which would add to existing code for transferring from C
world to Python and back.

* If Python libraries are bytes, there will need to be an interface to work
with GUI in unicode as well as with users of the API who will expect
unicode to work. In other words, internally it would use bytes, but
interface must be both bytes (for modules and internal use) and unicode
(for GUI and users).

* Having unicode-based library means encoding and decoding on any
"external" interface such as file reading or ctypes.

* Having bytes-based library means encoding and decoding on any Python 3
interface which requires unicode, such as os.environ, and additionally
rewriting all string literals ("abc") to bytes (b"abc").

* It seems hard to predict when we will know the right encoding of the
text. It seems that we will need it with any solution since the
garbage-in-garbage-out approach stops working when you need to use some
system interface function in Python 3 which requires unicode. Although e.g.
sys.stdout.write() has a (less generic) sys.stdout.buffer.write()
alternative, os.environb does not work on MS Windows.
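
To make two of the points above concrete, here is a small Python 3 sketch.
The sample text and the variable names are just illustrations picked for
this example, not anything from the GRASS code. It shows that slicing
unicode counts characters while slicing bytes counts bytes (and can cut a
multi-byte character in half), and that os.environ refuses bytes.

```python
import os

text = "kůň úpěl ďábelské ódy"   # unicode string (str in Python 3)
data = text.encode("utf-8")      # the same text as UTF-8 encoded bytes

# Slicing str counts characters; slicing bytes counts bytes.
first_chars = text[:10]          # 10 characters, safe to display
first_bytes = data[:10]          # 10 bytes, fewer than 10 characters here

# The byte slice ends in the middle of a two-byte character,
# so it can no longer be decoded.
try:
    first_bytes.decode("utf-8")
    byte_slice_decodes = True
except UnicodeDecodeError:
    byte_slice_decodes = False

# os.environ accepts only str keys and values in Python 3.
try:
    os.environ[b"abc"] = b"def"
    environ_accepts_bytes = True
except TypeError:
    environ_accepts_bytes = False

print(len(first_chars), len(first_bytes))
print(byte_slice_decodes, environ_accepts_bytes)
```

With a bytes-based library, every place that truncates text for display or
touches os.environ would need this kind of care.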

An example fix in r71849 is done using a (custom) decode function which
creates unicode (the standard string in Python 3) when file content is
read. An alternative to this change would be changing all the strings in
the file to bytes (b'abc' as opposed to 'abc').
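
For discussion, a minimal sketch of what such boundary helpers could look
like in Python 3. This is not the code from r71849; the function names, the
encoding parameter and the fallback to the locale's preferred encoding are
all assumptions made for this example.

```python
import locale


def _default_encoding():
    # Assumption: the locale's preferred encoding is a reasonable guess
    # when the caller does not know the encoding of the data.
    return locale.getpreferredencoding(False) or "utf-8"


def decode(value, encoding=None):
    """Return unicode (str): bytes are decoded, str is passed through."""
    if isinstance(value, bytes):
        return value.decode(encoding or _default_encoding())
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


def encode(value, encoding=None):
    """Return bytes: str is encoded, bytes are passed through."""
    if isinstance(value, str):
        return value.encode(encoding or _default_encoding())
    if isinstance(value, bytes):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


# Decode once when data enters Python (file content, ctypes results) ...
name = decode(b"elevation", encoding="utf-8")
# ... and encode once when it leaves towards a bytes-only interface.
raw = encode(name, encoding="utf-8")
```

Because both helpers pass through values which already have the target
type, an API function could call them on its arguments and accept unicode
as well as bytes.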

Please comment or link other related discussions.

Thanks,
Vaclav


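# Fails with TypeError: os.environ accepts only str keys and values.
python3 -c "import os; os.environ[b'abc'] = b'def'"
# Works on Unix-like systems; os.environb is not available on MS Windows.
python3 -c "import os; os.environb[b'abc'] = b'def'"
# Fails with TypeError: sys.stdout.write() accepts only str.
python3 -c "import sys; sys.stdout.write(b'abc\n')"
# Works: the underlying binary buffer accepts bytes.
python3 -c "import sys; sys.stdout.buffer.write(b'abc\n')"
# Prints <class 'str'>: even os.name is unicode.
python3 -c "import os; print(type(os.name))"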
https://trac.osgeo.org/grass/changeset/71849
https://trac.osgeo.org/grass/ticket/2708
https://trac.osgeo.org/grass/ticket/3392
https://trac.osgeo.org/grass/query?status=!closed&keywords=~python3
https://trac.osgeo.org/grass/query?status=!closed&keywords=~encoding
https://trac.osgeo.org/grass/query?status=!closed&keywords=~unicode