[GRASS-dev] Moving GRASS Python parts to Unicode

Moritz Lennert mlennert at club.worldonline.be
Wed Feb 10 06:42:33 PST 2016


Hi Maris,

On 07/02/16 11:56, Maris Nartiss wrote:
> Hello devs,
> as you might already have noticed, there is a constant stream of
> issues containing the keywords "encoding" or, more often,
> "UnicodeDecodeError". The main reason behind this is that Python 2.x
> has two types of text strings: byte sequences (what you get with
> str()) and Unicode (unicode()). Python 3.x has only one, Unicode (a
> byte sequence is not a string any more), thus fixing this frustrating
> source of errors.
> Moving GRASS Python code to use Unicode internally will bring it
> closer to being Python 3 ready and solve the largest part of the errors
> caused by implicit conversion from encoded text strings to Unicode
> text strings.

I would be very happy if we could find a structural solution to this 
which would avoid having to deal with so many individual errors all the 
time.

>
> The proposal is to make GRASS GIS Python code compliant with Unicode
> best practice [1], following the principle "decode early, encode late".
> Things to change:
> 1) Any text string entering the Python part of the code should be
> decoded at its entry point and encoded back to a byte sequence at its
> exit point.
> It also applies to all calls to GRASS modules passing around text;
> 2) Replace all text strings with Unicode literals (u'text'). No
> exceptions. Note - "text strings" - thus byte sequences should not be
> touched;
> 3) Ensure text file reading / writing is done via codecs.open;
> 4) Pass only Unicode to Python file handling calls (this is important
> for running on MS-Windows);
> 5) Use Unicode in tests to ensure correctness of code;
> 6) Introduce information on Unicode usage into Python submitting
> guidelines [2],[3].
>
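
A minimal sketch of the "decode early, encode late" principle from the list above, assuming UTF-8 as the external encoding. The helper names decode(), encode() and make_title() are hypothetical, chosen only for illustration, and are not taken from the GRASS Python code.

```python
def decode(raw, encoding="utf-8"):
    """Entry point: turn a byte sequence into Unicode text.

    Values that are already Unicode text are passed through unchanged.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def encode(text, encoding="utf-8"):
    """Exit point: turn Unicode text into a byte sequence.

    Values that are already byte sequences are passed through unchanged.
    """
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def make_title(raw_name, encoding="utf-8"):
    """Hypothetical function that receives bytes and must return bytes."""
    name = decode(raw_name, encoding)   # decode early
    title = u"Map: " + name             # work on Unicode only
    return encode(title, encoding)      # encode late


# u"R\u012bga" is four characters but five bytes in UTF-8.
raw = u"R\u012bga".encode("utf-8")
print(len(raw), len(decode(raw)))
```

Inside make_title() nothing but Unicode text is combined, so no implicit byte/Unicode conversion can occur there; the encoding is named explicitly only at the two boundaries.
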
> Things to change outside of Python code:
> 1) Store attribute table encoding information along with connection parameters;
> 2) Ensure storage of correct encoding information on data import and
> correct use on export (especially painful for ESRI Shapefiles);
> 3) Ensure correct encoding information in headers of all PO and XML files.
>
> Expected problems:
> 1) When moving to Python 3, all explicit Unicode literal definitions
> will need to be removed (u'text' -> 'text');
> 2) Introduction of the "decode early" principle will break all of the
> band-aids currently in place - a major breakage of code for a short
> time is expected;
> 3) Guessing the correct encoding can be a problem. One solution could
> be checking early for correctness of the system configuration and
> refusing to operate on improperly configured systems. A fatal error is
> better than silent data corruption (as is happening at the moment in
> certain scenarios).
>
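
The "fatal error is better than silent data corruption" idea from expected problem 3 could be sketched roughly as follows. This is only an illustration under assumptions: EncodingConfigError, require_usable_encoding() and strict_decode() are hypothetical names, and taking the encoding from locale.getpreferredencoding() is one possible choice, not something the proposal specifies.

```python
import codecs
import locale


class EncodingConfigError(Exception):
    """Hypothetical error for an unusable or mismatched encoding."""


def require_usable_encoding(name=None):
    """Check early that the configured encoding is known to Python.

    Returns the normalized codec name, or raises EncodingConfigError.
    """
    if name is None:
        # Assumption: the locale's preferred encoding is the one to trust.
        name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name).name
    except (LookupError, TypeError):
        raise EncodingConfigError("unusable encoding: %r" % (name,))


def strict_decode(raw, encoding):
    """Decode bytes, failing loudly instead of guessing or replacing."""
    try:
        return raw.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError as err:
        raise EncodingConfigError(
            "input is not valid %s: %s" % (encoding, err))
```

With this approach, Latin-1 bytes such as b"R\xeega" passed in as UTF-8 would raise an error at the entry point rather than ending up as mangled text further down.
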

I am no expert on this question, and thus do not have a clear opinion on 
your proposal, except for the fact that I'm very happy that it exists, 
but here are my intuitive ideas & questions on your topics:


> Topic to discuss:
> 1) Implementation plan:
> a) should it be done before 7.1?

I think the sooner, the better, so 7.1 should be our latest milestone 
(7.0.x should be in 'bugfix only' mode).

> b) should separate bugs be opened for parts of migration?

To what extent can the different parts be separated into more or less 
autonomous issues?

> c) how big / long breakage is acceptable?

How complete would the breakage be: for all encodings, or would LANG=C 
always work?

Is this something which could be done for the most part in a 
concentrated manner during a code sprint (e.g. FOSS4G 2016)?

> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
> pushing the encode/decode "boundary" further. Upside - most of
> existing data is UTF-8 ready (parts supporting only ASCII) [4].

What do you mean by "text in GRASS location"? How about files on the 
filesystem that some users might want to access via other tools? 
Shouldn't they be in the system-wide encoding?

Thank you very much for bringing up this discussion in such a structured 
manner. I hope that others will show some interest in the matter...

Moritz

