[GRASS5] Re: Freetype failure

Wed Mar 29 11:50:41 EST 2006

Roger Miller wrote:

> > > > The main reason it is used in the FreeType code is that it's the
> > > > simplest encoding to decode to an integer codepoint.
> > > 
> > > This is true for multibyte characters, but not single byte characters.  
> > 
> > I don't understand what you're saying here. Or maybe you're
> > misunderstanding something. UCS-2BE is just 16-bit unicode codepoints
> > stored in big-endian byte order.
> 
> UTF-8 can be 1, 2, 3, 4, 5 or 6 bytes.  The first byte corresponds to
> the old ascii standard.  Transactions with ascii are just one-byte
> transfers, most transactions with latin-1 and ISO-8859-1 characters are
> also just 1-byte transfers.  

To clarify: UCS-2/UCS-4 are the simplest Unicode encodings to decode
to integer Unicode codepoints. UCS-2/UCS-4 are just integer codepoints
stored in a specific byte order (technically, those names imply
big-endian ordering; the little-endian UCS-* encodings were invented
by Microsoft to avoid byte-swapping on import and export).
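
As a rough illustration of the difference (a sketch only, not code
from GRASS or FreeType; the function names are made up for this
example), decoding one UCS-4BE codepoint is a fixed shift+or, whereas
decoding UTF-8 means inspecting the lead byte to find the sequence
length. The UTF-8 decoder below assumes the modern 1-4 byte form and
does not reject overlong sequences or surrogates:

```c
#include <stddef.h>

/* Decode one UCS-4BE codepoint: always exactly 4 bytes. */
unsigned long decode_ucs4be(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
         | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
}

/*
 * Decode one UTF-8 sequence from p (at most len bytes available).
 * Stores the codepoint in *cp and returns the number of bytes
 * consumed, or 0 if the input is truncated or malformed.
 */
size_t decode_utf8(const unsigned char *p, size_t len, unsigned long *cp)
{
    unsigned long c;
    size_t n, i;

    if (len == 0)
        return 0;

    /* The lead byte determines how many bytes the sequence has. */
    if (p[0] < 0x80)                { c = p[0];        n = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; n = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; n = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; n = 4; }
    else
        return 0;   /* stray continuation byte or 5/6-byte form */

    if (n > len)
        return 0;   /* truncated sequence */

    /* Each continuation byte is 10xxxxxx and adds 6 payload bits. */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }

    *cp = c;
    return n;
}
```

The fixed-width case has no branches and no error paths, which is the
sense in which it is "simplest".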

> > > Sorry if I misled you.  My suggestion was that the code would retain 
> > > convert_str and convert_str would use iconv to convert all user-supplied 
> > > encodings to UTF-8 instead of to UCS-2BE as it does now.  Draw_text would 
> > > decode UTF-8 to FT_ULong.
> > 
> > Using UTF-8 as the intermediate encoding doesn't make sense.
> 
> It makes sense if you start with UTF-8 and there is no intermediate step
> at all.  Lots of us are using UTF-8 now (possibly without realizing it)
> and more of us will be using it in the future.

Certainly, forcing the user to supply UTF-8 simplifies matters for the
programmer, which is why it's so popular. But it's a major nuisance
for users who have been consistently using some other encoding for
the past 25 years (unless that encoding is ASCII).

The adoption of UTF-8 closely mirrors the use of ASCII.

It's most popular in English-speaking locales where almost everything
uses ASCII. It's reasonably popular in locales whose primary language
uses the Latin alphabet, i.e. where you can adequately approximate the
language using ASCII (it's common in "European" locales to simply
coerce filenames, usernames etc. to ASCII to sidestep any encoding
issues).

It's least popular in locales where the language uses some other
script, e.g. Cyrillic or Han, rather than the Latin alphabet.

In the latter case, you are likely to have decades' worth of data and
an installed base of software which use a specific encoding other than
ASCII, and where non-ASCII characters are commonplace in filenames,
usernames etc.

It doesn't help that UTF-8, unlike ISO-8859-*, EUC and others, isn't
compatible with the (older, and in many locales well-established)
ISO-2022 encoding.

> > Also, AFAIK, FT_ULong is in the host's byte order, so you either need
> > to convert char[4] to FT_ULong using shift+or (which is what happens
> > at present), or use either UCS-4BE or UCS-4LE depending upon the
> > host's byte order (Vax users are out of luck).
> 
> Does the existing code account for differences in byte order?  I don't
> see how it does.

The existing code converts to UCS-2BE then converts the result to
integer codepoints with:

		ch = (out[i]<<8) | out[i+1];

[display/drivers/lib/text3.c, line 194].

This doesn't rely upon the host's byte order.
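
To spell that out, here is a minimal sketch of the same shift+or
applied to a whole buffer. It is not the actual text3.c code: the
function name and signature are hypothetical, and it assumes the
buffer is unsigned char (with plain char on a platform where char is
signed, bytes >= 0x80 would sign-extend and corrupt the result).

```c
#include <stddef.h>

/*
 * Convert nbytes of UCS-2BE data in out[] to integer codepoints.
 * Writes at most max codepoints to cps[] and returns the number
 * written. A trailing odd byte, if any, is ignored.
 */
size_t ucs2be_to_codepoints(const unsigned char *out, size_t nbytes,
                            unsigned long *cps, size_t max)
{
    size_t i, n = 0;

    for (i = 0; i + 1 < nbytes && n < max; i += 2) {
        /* High byte first, then low byte: the arithmetic is on
         * values, so the host's memory layout never comes into it. */
        cps[n++] = ((unsigned long)out[i] << 8) | out[i + 1];
    }

    return n;
}
```

The bytes are combined arithmetically rather than by reinterpreting
memory as a 16-bit integer, so the result is the same on big-endian
and little-endian hosts.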

-- 
Glynn Clements <glynn at gclements.plus.com>