[GRASS-dev] Re: [bug #5195] (grass) ps.map sets encoding to iso-8859-1

Glynn Clements glynn at gclements.plus.com
Tue Oct 10 08:50:05 EDT 2006


Moritz Lennert wrote:

> >> this bug's URL: http://intevation.de/rt/webrt?serial_num=5195
> >> -------------------------------------------------------------------------
> >>
> >> Subject: ps.map sets encoding to iso-8859-1
> >>
> >> Platform: GNU/Linux/x86
> >> grass obtained from: CVS
> >> grass binary for platform: Compiled from Sources
> >> GRASS Version: cvs_head_20060921
> >>
> >> On line 92 of ps/ps.map/prolog.ps the encoding is set to ISOLatin1Encoding.
> >>
> >> If I understand correctly (and some testing confirms this), this
> >> means that the instructions file for ps.map has to be encoded in
> >> ISO-8859-1 (or similar) to work, i.e. to be able to print accented
> >> characters. If you are in a UTF-8 environment, ps.map creates a
> >> PostScript file which doesn't show accented characters correctly,
> >> be it in ISO or in UTF-8.
> >>
> >> Is there any reason why ps.map hardcodes the encoding? Is it
> >> possible to automatically use the user's encoding?
> > 
> > The reason why we force the font's encoding to ISOLatin1Encoding is
> > that the default encoding for most Latin fonts is StandardEncoding,
> > which (contrary to its name) is a completely non-standard encoding
> > which (AFAICT) is not used by anything except PostScript.
> > 
> > The value of the Encoding property is an array of 256 glyph names, so
> > you can use any unibyte encoding (e.g. ISO-646-*, ISO-8859-*,
> > windows-12?? etc).
> > 
> > If you want to support more complex encodings, you need to use
> > CID-keyed fonts. Apart from being rather complex, CID-keyed fonts may
> > not be supported by PostScript printers sold outside of South-East
> > Asia.
> 
> Does UTF-8 count as a 'complex encoding'?

Anything which isn't a unibyte encoding (where each byte maps to a
specific character) counts as a complex encoding. That includes UTF-8.
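E.g. an accented letter such as e-acute is the single byte \xE9 in
ISO-8859-1 but the two-byte sequence \xC3 \xA9 in UTF-8, so there is
no single byte to look up in a font's 256-entry Encoding array. A
trivial illustration (not code from ps.map):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* the same accented character in two encodings */
    const char latin1[] = "\xE9";      /* e-acute in ISO-8859-1 */
    const char utf8[]   = "\xC3\xA9";  /* e-acute in UTF-8      */

    printf("ISO-8859-1: %lu byte(s)\n", (unsigned long) strlen(latin1));
    printf("UTF-8:      %lu byte(s)\n", (unsigned long) strlen(utf8));
    return 0;
}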

> Most GNU/Linux distributions 
> come with UTF-8 as default system encoding nowadays and so users will 
> have that problem.

The default locale's encoding doesn't matter. What matters is the
encoding of the text in the ps.map input file.

If they have text in UTF-8, they'll need to convert it to ISO-8859-1
first. If the text contains characters outside of the ISO-8859-1
repertoire, they lose regardless of what ps.map does, because the
printer probably doesn't have those glyphs.

About the only thing which ps.map can do here is to convert UTF-8 to
ISO-8859-1 itself. But then it would need some way to determine that
the text is in UTF-8 (if it simply assumed UTF-8, users would first
have to convert any ISO-8859-1 text to UTF-8 just so that ps.map could
convert it back to ISO-8859-1).
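The conversion itself would be straightforward with iconv(3); roughly
something like this (untested sketch, the function doesn't exist in
ps.map):

#include <iconv.h>
#include <string.h>

/* Convert a NUL-terminated UTF-8 string to ISO-8859-1.  Returns 0 on
 * success, -1 if the input isn't valid UTF-8, contains characters
 * outside the Latin-1 repertoire, or doesn't fit into the buffer. */
static int utf8_to_latin1(const char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    char *inp = (char *) in;
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft = outsize - 1;
    size_t ret;

    if (cd == (iconv_t) -1)
        return -1;

    ret = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (ret == (size_t) -1)
        return -1;

    *outp = '\0';
    return 0;
}

That part is easy; deciding whether to call it at all is the problem.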

> I imagine there is no way of automatically identifying the encoding of a 
> file?

Correct. At least, not reliably. You can use various heuristics, e.g.
bytes \x80-\x9F don't occur in ISO-8859-* text, certain byte
combinations aren't valid in UTF-8, etc.
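A simplistic (and untested) check along those lines, which only asks
whether the bytes form well-formed UTF-8:

#include <stddef.h>

/* Heuristic only: returns 1 if buf looks like well-formed UTF-8,
 * 0 otherwise.  Pure ASCII passes, so a positive result doesn't
 * prove the file was meant to be UTF-8. */
static int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i++];
        size_t cont;

        if (c < 0x80)
            continue;               /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0)
            cont = 1;               /* lead byte of a 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            cont = 2;               /* lead byte of a 3-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            cont = 3;               /* lead byte of a 4-byte sequence */
        else
            return 0;               /* e.g. a stray \x80-\x9F byte */

        while (cont-- > 0) {
            if (i >= len || (buf[i++] & 0xC0) != 0x80)
                return 0;           /* truncated or invalid sequence */
        }
    }
    return 1;
}

The converse check (whether any byte falls in \x80-\x9F) is equally
cheap, but neither test is conclusive.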

But it's entirely possible to create a text file which is perfectly
valid in multiple encodings. E.g. if you have an ISO-8859-* file which
is almost entirely ASCII but with a small number of isolated non-ASCII
characters, it's almost impossible for a program to determine exactly
which ISO-8859-* encoding it's meant to be.

-- 
Glynn Clements <glynn at gclements.plus.com>