[GRASS5] Re: Freetype failure

Wed Mar 29 23:59:53 EST 2006

On composing a reply I found myself repeating things I've said before.
I think that by definition that means this discussion is going nowhere.
All I can really ask is that someone fix the features that caused my
original problems.

Roger Miller

On Wed, 2006-03-29 at 17:50 +0100, Glynn Clements wrote:
> Roger Miller wrote:
> 
> > > > > The main reason it is used in the FreeType code is that it's the
> > > > > simplest encoding to decode to an integer codepoint.
> > > > 
> > > > This is true for multibyte characters, but not single-byte characters.
> > > 
> > > I don't understand what you're saying here. Or maybe you're
> > > misunderstanding something. UCS-2BE is just 16-bit unicode codepoints
> > > stored in big-endian byte order.
> > 
> > UTF-8 can be 1, 2, 3, 4, 5 or 6 bytes.  The first byte corresponds to
> > the old ASCII standard.  Transactions with ASCII are just one-byte
> > transfers, most transactions with Latin-1 and ISO-8859-1 characters are
> > also just 1-byte transfers.
> 
> To clarify: UCS-2/UCS-4 are the simplest Unicode encodings to decode
> to integer Unicode codepoints. UCS-2/UCS-4 are just integer codepoints
> stored in a specific byte order (technically, those names imply
> big-endian ordering; the little-endian UCS-* encodings were invented
> by Microsoft to avoid byte-swapping on import and export).
> 
> > > > Sorry if I misled you.  My suggestion was that the code would retain
> > > > convert_str and convert_str would use iconv to convert all user-supplied 
> > > > encodings to UTF-8 instead of to UCS-2BE as it does now.  Draw_text would 
> > > > decode UTF-8 to FT_ULong.
> > > 
> > > Using UTF-8 as the intermediate encoding doesn't make sense.
> > 
> > It makes sense if you start with UTF-8 and there is no intermediate step
> > at all.  Lots of us are using UTF-8 now (possibly without realizing it)
> > and more of us will be using it in the future.
> 
> Certainly, forcing the user to supply UTF-8 simplifies matters for the
> programmer, which is why it's so popular. But it's a major nuisance
> for the user if you have been consistently using some other encoding
> for the past 25 years (unless that encoding is ASCII).
> 
> The adoption of UTF-8 closely mirrors the use of ASCII.
> 
> It's most popular in English-speaking locales where almost everything
> uses ASCII. It's reasonably popular in locales whose primary language
> uses the Roman alphabet, i.e. where you can adequately approximate the
> language using ASCII (it's common in "European" locales to simply
> coerce filenames, usernames etc to ASCII to sidestep any encoding
> issues).
> 
> It's least popular in locales where the language doesn't use the Latin
> alphabet but e.g. Cyrillic or Han instead.
> 
> In the latter case, you are likely to have decades' worth of data and
> an installed base of software which uses a specific encoding other than
> ASCII, and where non-ASCII characters are commonplace in filenames,
> usernames etc.
> 
> It doesn't help that the UTF-8 encoding isn't compatible with the
> (older, and in many locales, well-established) ISO-2022 encoding
> (unlike ISO-8859-*, EUC and others).
> 
> > > Also, AFAIK, FT_ULong is in the host's byte order, so you either need
> > > to convert char[4] to FT_ULong using shift+or (which is what happens
> > > at present), or use either UCS-4BE or UCS-4LE depending upon the
> > > host's byte order (Vax users are out of luck).
> > 
> > Does the existing code account for differences in byte order?  I don't
> > see how it does.
> 
> The existing code converts to UCS-2BE then converts the result to
> integer codepoints with:
> 
> 		ch = (out[i]<<8) | out[i+1];
> 
> [display/drivers/lib/text3.c, line 194].
> 
> This doesn't rely upon the host's byte order.
>
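
For reference, the shift+or decoding quoted above can be sketched as a
standalone function. This is only an illustration, not the code in
text3.c: the name decode_ucs2be and its parameters are hypothetical, and
it assumes the converted buffer is read as unsigned bytes (with a plain
signed char buffer, bytes of 0x80 and above would sign-extend before the
OR). Because the bytes are combined arithmetically rather than by
reinterpreting memory, the result should be the same on big-endian and
little-endian hosts.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of GRASS): decode a UCS-2BE byte buffer
 * into integer codepoints. Returns the number of codepoints written.
 * A trailing odd byte, if any, is ignored.
 */
size_t decode_ucs2be(const unsigned char *in, size_t nbytes,
                     unsigned long *out, size_t max_out)
{
    size_t i, n = 0;

    /* Each codepoint occupies two bytes, most significant byte first. */
    for (i = 0; i + 1 < nbytes && n < max_out; i += 2) {
        /* Shift+or works on values, so host byte order is irrelevant. */
        out[n++] = ((unsigned long)in[i] << 8) | (unsigned long)in[i + 1];
    }

    return n;
}
```

Given the bytes 00 41 04 16, this would yield the codepoints 0x41 and
0x416 on any host.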