[GRASS-dev] Moving GRASS Python parts to Unicode

Maris Nartiss maris.gis at gmail.com
Sat Feb 13 02:30:30 PST 2016


2016-02-10 17:03 GMT+02:00 Anna Petrášová <kratochanna at gmail.com>:
>
>
> On Wed, Feb 10, 2016 at 9:42 AM, Moritz Lennert
> <mlennert at club.worldonline.be> wrote:
>>
>> Hi Maris,
>>
>> I would be very happy if we could find a structural solution to this which
>> would avoid having to deal with so many individual errors all the time.
That is my proposal: get it right, then add a policy to enforce it and
avoid breakage in the future.

>>
>> I am no expert on this question, and thus do not have a clear opinion on
>> your proposal, except for the fact that I'm very happy that it exists, but
>> here are my intuitive ideas & questions on your topics:
Neither am I. I just got fed up with UnicodeDecodeError.

>
> I don't have a clear opinion either but I hoped Glynn could state his
> opinion here, because I understood he has a different view on some of these
> things. AFAIR, one of the problems is possibly different needs of Python
> scripting library vs. GUI.
>
> Anna
Anna, there should be no "special" way of treating some parts of the
Python code. If it is Python, it should follow Python idioms. That's
the whole point of using Python in the first place: to provide
Pythonic access to the power of GRASS. I do not see the Python
community moving away from Unicode strings to raw byte strings for
text in the near future, so either we adopt the Pythonic approach or
we continue to fight an uphill battle against Python. So far we are
not doing too well at it.

>>
>>
>>
>>> Topic to discuss:
>>> 1) Implementation plan:
>>> a) should it be done before 7.1?
>>
>>
>> I think the sooner, the better, so 7.1 should be our latest milestone
>> (7.0.x should be in 'bugfix only' mode).
Depends on how far away 7.1 is. I would prefer to have GRASS releases
more often; in that case it should go into 7.2.


>>> b) should separate bugs be opened for parts of migration?
>>
>>
>> To what point can different issues be delimited into +/- autonomous issues
>> ?
Good question.

>>> c) how big / long breakage is acceptable?
>>
>>
>> How complete would breakage be: for all encodings, or would LANG=C always
>> work ?
Only partially. There are no UnicodeEncodeErrors for LANG=C, but
there will be a UnicodeWarning instead when comparing a Unicode
string to a byte string.
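
To illustrate the comparison problem, here is a minimal sketch. It uses
Python 3 semantics, where a byte string and a Unicode string never
compare equal; under Python 2 the same comparison with non-ASCII bytes
emits a UnicodeWarning and is treated as unequal. The place name is
just made-up sample data.

```python
# Sample data only: bytes as they might arrive from a module's output,
# and the same text as a Unicode string used elsewhere in the code.
raw = "Rīga".encode("utf-8")
text = "Rīga"

# Mixing the two types: the comparison silently fails.
mixed_equal = (raw == text)

# Decoding at the boundary first: both sides are Unicode strings.
decoded_equal = (raw.decode("utf-8") == text)

print(mixed_equal, decoded_equal)  # False True
```

The point is that the failure is silent, so the only reliable fix is to
make sure both sides are the same type before they meet.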

>>
>> Is this something which could be done for most part in a concentrated
>> manner during a code sprint (e.g. FOSS4G 2016) ?
I am not that familiar with the whole codebase.

>>> 2) Moving all text in GRASS location to UTF-8 encoding (GRASS 8) thus
>>> pushing the encode/decode "boundary" further. Upside - most of
>>> existing data is UTF-8 ready (parts supporting only ASCII) [4].
>>
>>
>> What do you mean with "text in GRASS location" ? How about files on the
>> filesystem that some users might want to access via other tools ? Shouldn't
>> they be in the system-wide encoding ?
I meant any text strings (raster categories, metadata entries, etc.).
A system-wide encoding makes a GRASS location non-portable: I cannot
just copy it to another system and expect it to work. UTF-8 would be a
natural choice, as it is backwards compatible with ASCII (existing
data does not need to be changed) and at the same time would allow us
to accept any characters in the future. Besides, it is used by 86% of
the Web [1].
If we introduce such a policy, the same principle would apply: decode
early, encode late. On the bright side, legacy systems are dying out,
MacOS uses UTF-8 for all locales by default, and Linux has nice UTF-8
support (my guess is that it is the most popular encoding after plain
ASCII).
The current situation, where data is in an unknown encoding, is the
worst. Either we adopt this approach, or we start to store metadata
about the encoding in use. I assume anyone who has played the game
"guess the encoding of this Shapefile" will agree on the downsides of
the latter.
Anyway, this is a discussion about GRASS 8.
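
A minimal sketch of "decode early, encode late" with UTF-8 as the
storage encoding. The helper names read_text and write_text and the
sample label are hypothetical, not existing GRASS API; in-memory
streams stand in for files in a location.

```python
import io


def read_text(stream, encoding="utf-8"):
    """Decode early: turn incoming bytes into str at the input boundary."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding="utf-8"):
    """Encode late: turn str back into bytes only at the output boundary."""
    stream.write(text.encode(encoding))


# Hypothetical category label as it would be stored on disk (UTF-8 bytes).
stored = io.BytesIO("mežs".encode("utf-8"))

label = read_text(stored)   # from here on the code only sees str
label = label.upper()       # text operations work on characters, not bytes

out = io.BytesIO()
write_text(out, label)      # bytes are produced again only on the way out

# Pure-ASCII text is byte-for-byte identical in UTF-8, which is why
# existing ASCII-only data would not need any conversion.
ascii_unchanged = "forest".encode("ascii") == "forest".encode("utf-8")
```

With a fixed storage encoding the two helpers need no guessing; with a
system-wide encoding they would need to know which system wrote the
data.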

>>
>> Thank you very much for bringing up this discussion in such a structured
>> manner. I hope that others will show some interest in the matter...
>>
>> Moritz
I hope so.

Thank you,
Māris.

1. http://w3techs.com/technologies/overview/character_encoding/all

