[GRASS-dev] Moving GRASS Python parts to Unicode

Wed Feb 10 07:03:25 PST 2016

On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert <
mlennert at club.worldonline.be> wrote:

> Hi Maris,
>
> On 07/02/16 11:56, Maris Nartiss wrote:
>
>> Hello devs,
>> as you might already have noticed, there is a constant stream of
>> issues containing keywords "encoding" or more often
>> "UnicodeDecodeError". The main reason behind this is that Python 2.x has
>> two types of text strings: byte sequence (the one you get with str()) and
>> Unicode (unicode()). Python 3.x has only one, Unicode (a byte sequence
>> is not a string any more), thus fixing this frustrating source of
>> errors.
>> Moving GRASS Python code to use Unicode internally will bring it closer
>> to being Python 3 ready and solve the largest part of the errors caused
>> by implicit conversion from encoded text strings to Unicode text strings.
>>
>
> I would be very happy if we could find a structural solution to this which
> would avoid having to deal with so many individual errors all the time.
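
For readers who have not hit this yet, here is a minimal sketch of where
the "UnicodeDecodeError" reports come from. When Python 2 mixes a byte
string with a Unicode string, it implicitly decodes the bytes as ASCII;
the sketch performs that step explicitly so it behaves the same on
Python 2 and 3. The place name is only an illustrative value.

```python
# UTF-8 bytes for a name containing one non-ASCII letter (illustrative value)
raw = b"R\xc4\x93zekne"

# Python 2 silently attempts raw.decode("ascii") when str and unicode meet;
# doing the same step by hand shows where the error is raised.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding explicitly with the encoding the bytes were written in works.
text = raw.decode("utf-8")

print(ascii_failed)          # the ASCII attempt failed
print(len(raw), len(text))   # bytes and characters are counted differently
```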

>
>
>> The proposal is to make GRASS GIS Python code compliant with Unicode
>> best practice [1], following the principle "decode early, encode late".
>> Things to change:
>> 1) Any text string entering the Python part of the code should be decoded
>> at its entry point and encoded back to a byte sequence at its exit point.
>> It also applies to all calls to GRASS modules passing around text;
>> 2) Replace all text strings with Unicode literals (u'text'). No
>> exceptions. Note - "text strings" - thus byte sequences should not be
>> touched;
>> 3) Ensure text file reading / writing is done via codecs.open;
>> 4) Pass only Unicode to Python file handling calls (this is important
>> for running on MS-Windows);
>> 5) Use Unicode in tests to ensure correctness of code;
>> 6) Introduce information on Unicode usage into Python submitting
>> guidelines [2],[3].
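
To make points 1) to 3) concrete, a rough sketch of "decode early, encode
late" might look like the following. All function names here (decode,
encode, make_label) are hypothetical and not part of the GRASS library,
and the fixed UTF-8 encoding is an assumption; in practice it would come
from the system or the data source.

```python
import codecs
import os
import tempfile

ENCODING = "utf-8"  # assumption: really taken from the system or data source


def decode(raw, encoding=ENCODING):
    """Entry point: turn bytes arriving from outside into Unicode text."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def encode(text, encoding=ENCODING):
    """Exit point: turn Unicode text back into bytes for the outside world."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


def make_label(raw_name):
    """Hypothetical function: Unicode inside, bytes only at the edges."""
    name = decode(raw_name)            # decode early
    label = u"Map: " + name.upper()    # Unicode literal, Unicode operations
    return encode(label)               # encode late


# Bytes as they might arrive from a module's standard output
raw = u"r\u012bga".encode(ENCODING)
out = make_label(raw)

# Text files go through codecs.open, so reads and writes deal in Unicode
path = os.path.join(tempfile.mkdtemp(), u"label.txt")
with codecs.open(path, "w", encoding=ENCODING) as f:
    f.write(decode(out))
with codecs.open(path, "r", encoding=ENCODING) as f:
    text = f.read()
```

The point of the design is that only decode() and encode() ever see byte
sequences; everything in between can assume Unicode.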
>>
>> Things to change outside of Python code:
>> 1) Store attribute table encoding information along with connection
>> parameters;
>> 2) Ensure storage of correct encoding information on data import and
>> correct use on export (especially painful for ESRI Shapefiles);
>> 3) Ensure correct encoding information in headers of all PO and XML files.
>>
>> Expected problems:
>> 1) When moving to Python 3, all explicit Unicode literal definitions
>> will need to be removed (u'text' -> 'text');
>> 2) Introduction of the "decode early" principle will break all of the
>> band-aids currently in place - a major breakage of code for a short
>> time is expected;
>> 3) Guessing the correct encoding can be a problem. One solution could
>> be to check early for correctness of the system configuration and to
>> refuse to operate on improperly configured systems. A fatal error is
>> better than silent data corruption (as is happening at the moment in
>> certain scenarios).
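
A possible shape for the "fail early instead of corrupting data" check in
point 3) is sketched below. The helper name require_encodable is
hypothetical, and what counts as an "improperly configured" system is an
assumption here: an encoding Python does not know, or one that cannot
represent the text at hand.

```python
import codecs
import locale


def require_encodable(text, encoding):
    """Return text encoded as bytes, or raise instead of losing data.

    Hypothetical helper; the policy (unknown codec or unencodable text
    is fatal) is an assumption, not something agreed on this list.
    """
    try:
        codecs.lookup(encoding)
    except LookupError:
        raise RuntimeError("Unknown encoding: %r" % (encoding,))
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        raise RuntimeError(
            "Encoding %r cannot represent this text; "
            "check the locale settings" % (encoding,)
        )


# The encoding the user's locale asks for; the value depends on the system
preferred = locale.getpreferredencoding()

ok = require_encodable(u"R\u012bga", "utf-8")

try:
    require_encodable(u"R\u012bga", "ascii")
    refused = False
except RuntimeError:
    refused = True
```

Note that no errors="replace" or errors="ignore" is used anywhere: those
are exactly the options that turn a visible failure into silent
corruption.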
>>
>>
> I am no expert on this question, and thus do not have a clear opinion on
> your proposal, except for the fact that I'm very happy that it exists, but
> here are my intuitive ideas & questions on your topics:

I don't have a clear opinion either, but I hoped Glynn could state his
opinion here, because I understood he has a different view on some of these
things. AFAIR, one of the problems is the possibly different needs of the
Python scripting library vs. the GUI.

Anna

>
>
>> Topic to discuss:
>> 1) Implementation plan:
>> a) should it be done before 7.1?
>>
>
> I think the sooner, the better, so 7.1 should be our latest milestone
> (7.0.x should be in 'bugfix only' mode).
>
>> b) should separate bugs be opened for parts of migration?
>>
>
> To what extent can the different issues be delimited into more or less
> autonomous issues?
>
>> c) how big / long breakage is acceptable?
>>
>
> How complete would breakage be: for all encodings, or would LANG=C always
> work ?
>
> Is this something which could be done for most part in a concentrated
> manner during a code sprint (e.g. FOSS4G 2016) ?
>
>> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
>> pushing the encode/decode "boundary" further. Upside - most of
>> existing data is UTF-8 ready (parts supporting only ASCII) [4].
>>
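
The "UTF-8 ready" remark rests on ASCII being a strict subset of UTF-8:
ASCII-only bytes decode to the same text under either codec, while bytes
written in a legacy 8-bit encoding generally do not decode as UTF-8. A
small sketch, with made-up sample values:

```python
# ASCII-only content (made-up map name) is already valid UTF-8
ascii_bytes = b"elevation_1m"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# A single accented letter stored as Latin-1 is not valid UTF-8
latin1_bytes = u"\u00e9".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    needs_conversion = False
except UnicodeDecodeError:
    needs_conversion = True
```

So only the non-ASCII parts of existing data would need an actual
conversion step.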
>
> What do you mean with "text in GRASS location" ? How about files on the
> filesystem that some users might want to access via other tools ? Shouldn't
> they be in the system-wide encoding ?
>
> Thank you very much for bringing up this discussion in such a structured
> manner. I hope that others will show some interest in the matter...
>
> Moritz