[GRASS-dev] Character encoding of module i.atcorr files

Mon Mar 3 07:01:57 PST 2014

On Sun, Mar 2, 2014 at 10:59 PM, Hamish <hamish_b at yahoo.com> wrote:

> Maris wrote:
>
> >>  The offending line is a reference in the comment section:
> >>
> >>  http://trac.osgeo.org/grass/browser/grass/trunk/imagery/i.atcorr/computations.cpp#L1365
> >>
> >>  I browsed SUBMITTING file and didn't find any rules about source
> >>  encoding.
> ...
>
> Glynn wrote
> > Most files are ASCII. Those which aren't are almost evenly split
> > between ISO-8859-1 and UTF-8:
> >
> > Files using ISO-8859-1:
> >
> > raster/r.sunmask/g_solposition.c    U+00B0    DEGREE SIGN
> > imagery/i.topo.corr/main.c        U+00F1    LATIN SMALL LETTER N WITH TILDE
> > imagery/i.landsat.toar/landsat.h    U+00B5    MICRO SIGN
> > imagery/i.evapo.pm/functions.c        U+00B0    DEGREE SIGN
> > imagery/i.atcorr/computations.cpp    U+00E9    LATIN SMALL LETTER E WITH ACUTE
> > lib/raster/color_look.c            U+00AD    SOFT HYPHEN
> > lib/raster/color_set.c            U+00AD    SOFT HYPHEN
> >
> > Files using UTF-8:
> >
> > raster/r.sunmask/main.c            U+00B0    DEGREE SIGN
> > raster/r.watershed/ram/do_flatarea.c    U+2013    EN DASH
> > vector/v.net.salesman/main.c        U+2013    EN DASH
> > gui/wxpython/lmgr/frame.py        U+00F6    LATIN SMALL LETTER O WITH DIAERESIS
> >                     U+2019    RIGHT SINGLE QUOTATION MARK
> > lib/python/pygrass/functions.py        U+00B0    DEGREE SIGN
> > lib/arraystats/class.c            U+00E9    LATIN SMALL LETTER E WITH ACUTE
> >
> > Many of these are gratuitous, e.g. use of a soft hyphen or
> > en-dash when an ASCII "-" (U+002D HYPHEN-MINUS) would suffice.
> >
> > Some are due to comments written in languages other than English
> > (i.topo.corr = Spanish, lib/arraystats = French); these should be
> > translated.
> >
> > All but one are in comments: the pygrass one is a string literal,
> > which should really use escape notation (assuming that the
> > is_clean_name() function is actually correct, and not a half-baked
> > attempt at re-implementing G_legal_filename()).
> >
> > So, if those are fixed, it boils down to whether we actually want to
> > have to deal with source-code encoding issues for the sake of comments
> > which include:
> >
> > a) °C for degrees Celsius,
> > b) µm for micrometres (microns), and
> > c) proper names using the Latin script with accents (names using any
> > other script will invariably be romanised).
>
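
A survey like the one Glynn lists above can be approximated with a few lines of Python. This is only a sketch under assumptions: the function names are made up for illustration, and the "UTF-8, else ISO-8859-1" guess is a heuristic, since any byte sequence decodes as ISO-8859-1.

```python
import unicodedata


def classify_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'iso-8859-1' for the raw bytes of a file."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this is a fallback
        # guess, not a proof of the real encoding.
        return "iso-8859-1"


def non_ascii_chars(data: bytes):
    """Return sorted (code point, Unicode name) pairs for non-ASCII chars."""
    text = data.decode(classify_encoding(data))
    found = {ch for ch in text if ord(ch) > 127}
    return sorted(
        ("U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in found
    )


# Example: a comment containing a degree sign stored as a single 0xB0 byte.
print(classify_encoding(b"/* 25 \xb0C */"))
print(non_ascii_chars(b"/* 25 \xb0C */"))
```

To scan a source tree, one would read each file in binary mode and pass the bytes to these functions.
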
> I've now removed most of these in trunk with r59172.
>
> remaining:
> imagery/i.atcorr/computations.cpp (someone's name)
> gui/wxpython/lmgr/frame.py (an example of something using UTF-8)
>
> https://trac.osgeo.org/grass/browser/grass/trunk/gui/wxpython/lmgr/frame.py#L978

I wanted this to be written without UTF-8 chars, but since the UTF-8 chars
are exactly what causes the problem there, I agree with MarkusN that it is
better to be explicit.

> and lib/python/pygrass/functions.py ...
>
> as for functions.py, hooking into G_legal_filename() would
> be best, but failing that, a white-list of allowed chars would
> seem much more robust than a small black-list of disallowed
> chars.
>
>
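
To illustrate the white-list versus black-list point, here is a minimal sketch. It does not reproduce the rules of G_legal_filename() or the actual pygrass code; both character sets and both function names are hypothetical and chosen only for the comparison.

```python
import string

# Hypothetical white-list for this sketch; NOT the rule set of
# G_legal_filename().
ALLOWED = frozenset(string.ascii_letters + string.digits + "_")

# Hypothetical black-list; the real list in pygrass may differ.
FORBIDDEN = frozenset(" @#/\"'")


def is_clean_name_whitelist(name: str) -> bool:
    """Accept only non-empty names built entirely from allowed chars."""
    return bool(name) and all(ch in ALLOWED for ch in name)


def is_clean_name_blacklist(name: str) -> bool:
    """Accept any non-empty name that avoids the forbidden chars."""
    return bool(name) and not any(ch in FORBIDDEN for ch in name)


# A tab or a degree sign slips through the black-list because nobody
# thought to forbid it, while the white-list rejects both.
for candidate in ("elevation_2014", "temp\tmap", "temp_\u00b0C"):
    print(repr(candidate),
          is_clean_name_whitelist(candidate),
          is_clean_name_blacklist(candidate))
```

The white-list fails closed: a character nobody considered is rejected rather than accepted.
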
> > Personally, I would prefer it if source code was 7-bit clean.
>
> Me too. Not sure how to deal with non-ASCII chars in people's names though.
>
The problem is that each language deals with this differently. For Czech
you write Petras instead of Petráš, while for German you write Soeren
instead of Sören if you want to avoid non-ASCII. For languages with a
non-Latin alphabet, it is even more complicated. Moreover, the contexts in
which this is appropriate or tolerated may differ.

However, it seems that languages usually have some way to write names in
ASCII or in an English transcription, so we can use that in source code.
The original names in UTF-8 can go in contributors.csv and in the (HTML)
documentation for modules, which may contain some UTF-8 chars anyway for
various reasons.
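
A rough sketch of why the ASCII form cannot be derived mechanically: stripping accents via Unicode decomposition gives the Czech convention but not the German one. The helper name and the GERMAN table are assumptions made up for this example.

```python
import unicodedata

# Hypothetical per-language overrides, applied before the generic fallback.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue",
          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def to_ascii(name: str, overrides=None) -> str:
    """Transliterate a name to ASCII by dropping combining accents.

    Characters without a decomposition (for example non-Latin scripts)
    are silently dropped, so this is lossy by design.
    """
    if overrides:
        name = "".join(overrides.get(ch, ch) for ch in name)
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Petráš"))         # Petras
print(to_ascii("Sören"))          # Soren (not the usual German form)
print(to_ascii("Sören", GERMAN))  # Soeren
```

So the ASCII spelling is best taken from the person, not computed.
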

But anyway, UTF-8 is now everywhere, and from time to time it is necessary
and much easier than various workarounds such as entities in HTML, Unicode
escape sequences, or rewriting the readable and standard °C as degC. So I
don't see 7-bit or any other simplification as advantageous, because the
problem is complex and you just cannot fit it into 7 bits (1).
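
For the string-literal case, the two options look like this in Python 3 (a small sketch; the variable names are arbitrary). The escaped form keeps the source file 7-bit clean, the literal form is easier to read, and both produce the same string at run time.

```python
# Escape notation: the source line itself contains only ASCII bytes.
escaped = "\u00b0C"

# Literal notation: the source file must be saved as UTF-8 (the Python 3
# default; Python 2 needs an explicit coding declaration, see PEP 263).
literal = "°C"

# The text of the escaped source line, kept as a raw string for inspection.
escaped_source_line = r'unit = "\u00b0C"'

print(escaped == literal)             # True
print(escaped.encode("utf-8"))        # b'\xc2\xb0C'
print(escaped_source_line.isascii())  # True
```
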

Are there any disadvantages of using UTF-8?

Vaclav (Václav Petráš)

(1) This reminded me of a comment somewhere in which the question "How
do I use this with a Latin2-encoded language?" was answered with "Use
Latin1.", which is of course absurd since Latin1 contains different
characters than Latin2 (that's why both exist). My point is that encoding
in something other than Unicode/UTF-8 is usually a huge simplification
which may destroy the original text.
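
The Latin1/Latin2 point in footnote (1) can be checked directly with Python's standard codecs. This is just a demonstration with my own name as test data; the variable names are arbitrary.

```python
name = "Petráš"

# ISO-8859-2 (Latin2) has both characters.
latin2_bytes = name.encode("iso8859_2")

# Reading those bytes as Latin1 gives a different character for 0xB9.
misread = latin2_bytes.decode("latin-1")

# Latin1 has no LATIN SMALL LETTER S WITH CARON (U+0161) at all.
try:
    name.encode("latin-1")
    latin1_fails = False
except UnicodeEncodeError:
    latin1_fails = True

# Forcing it replaces the character, destroying the original text.
lossy = name.encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 round-trips without loss.
roundtrip = name.encode("utf-8").decode("utf-8")

print(latin2_bytes, misread, latin1_fails, lossy, roundtrip == name)
```
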

> regards,
> Hamish