[GRASS5] Re: Freetype failure

Glynn Clements glynn at gclements.plus.com
Tue Mar 28 14:39:10 EST 2006


roger at spinn.net wrote:

> >> I wrote a UTF-8 to FT_ULong converter to get a more direct solution and 
> >> eliminated convert_str from the code.  This is a working solution and 
> >> probably in most respects a better solution than the current text3.c. 
> > 
> > Except for the most important issue, namely that the input string is
> > not necessarily in UTF-8; the encoding is specified by the charset=
> > option to d.font.freetype. As the FreeType support in the display
> > drivers was originally written to support Japanese, I suspect that
> > most of the existing users of this functionality probably won't be
> > using UTF-8.
> 
> UTF-8 represents the entire range of UCS.  Existing Japanese, Korean, 
> Chinese (etc.) character encodings are incorporated in UCS and are 
> represented by UTF-8.  That does not mean that everyone's software is 
> delivering UTF-8 encoding, but the time when that happens is probably not 
> too far off. 

That's wishful thinking. Most of the CJK world is quite happy to stick
with their existing encodings regardless of how much western
programmers would like them all to switch to Unicode.

> > Whilst a hard-coded UTF-8 to UCS-2 or UCS-4 decoder might be a useful
> > fall-back for systems which don't have iconv, the iconv code needs to
> > stay to support other encodings.
> 
> That makes sense, but if everything is to be funneled into one encoding then 
> I don't think it should be through UCS-2.  There is the possibly academic 
> fact that UCS-2 doesn't represent all of UCS.  Also, UTF-8 is expected to be 
> the future standard encoding and many of us are already working with it.  
> UTF-8 has been the default encoding in all major Linux distributions for a 
> couple years now -- longer for some distros.  I haven't heard that UCS-2 is 
> that widely used. 

FWIW, Windows uses UCS-2LE quite extensively, but that isn't relevant
here.

The main reason UCS-2 is used in the FreeType code is that it's the
simplest encoding to decode to an integer codepoint. iconv only deals
with external encodings, so there's no way to decode directly to
Unicode codepoints in "host-endian" format (although you could decode
to either UCS-4LE or UCS-4BE according to the host's endianness, then
just cast the output buffer to "FT_ULong *").

> It makes more sense to translate anything that isn't already encoded in 
> UTF-8 into UTF-8, then decode UTF-8 to FreeType.  That way UTF-8 systems 
> would not have to go through an encode-decode cycle. 

That's easier to program (all conversions other than UTF-8 to
UCS-2/UCS-4 become the responsibility of the user), but it's a lot
less useful (because the user has to explicitly convert everything).

To be useful, d.text needs to be able to accept text in the encoding
which other programs generate. In locales where the dominant language
doesn't use the latin alphabet, that probably isn't going to be UTF-8
(on Windows, it definitely won't be UTF-8).

-- 
Glynn Clements <glynn at gclements.plus.com>
