[GRASS5] Re: Freetype failure

Glynn Clements glynn at gclements.plus.com
Tue Mar 28 18:45:14 EST 2006


roger at spinn.net wrote:

> >> UTF-8 represents the entire range of UCS.  Existing Japanese, Korean, 
> >> Chinese (etc.) character encodings are incorporated in UCS and are 
> >> represented by UTF-8.  That does not mean that everyone's software is 
> >> delivering UTF-8 encoding, but the time when that happens is probably not 
> >> too far off. 
> > 
> > That's wishful thinking. Most of the CJK world is quite happy to stick
> > with their existing encodings regardless of how much western
> > programmers would like them all to switch to Unicode.
> 
> Perhaps it is wishful thinking, but according to the document at 
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
> China, Korea and Japan already have national standards based on UCS.  
> Microsoft uses Unicode, which is similar. 

Having standards for something and actually using it are very
different matters.

Part of the problem is that Windows doesn't provide much choice when
it comes to encodings. You have 16-bit Unicode (i.e. UCS-2LE) and the
system's codepage, and that's it. For Japanese, the system codepage is
CP932 (Shift-JIS), and anything which doesn't use UCS-2LE (i.e. 
anything which needs to use an ASCII-compatible encoding, e.g. 
virtually every external data format except those which mandate UTF-8)
will be in Shift-JIS.

> > The main reason it is used in the FreeType code is that it's the
> > simplest encoding to decode to an integer codepoint.
> 
> This is true for multibyte characters, but not single byte characters.  

I don't understand what you're saying here. Or maybe you're
misunderstanding something. UCS-2BE is just 16-bit Unicode codepoints
stored in big-endian byte order. This encoding was chosen because it's
trivial to convert to an FT_ULong. Decoding UCS-2BE to Unicode
codepoints is just:

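	wchar_t *chars;
	unsigned char *bytes;	/* unsigned: bytes >= 0x80 must not sign-extend */
	int i;

	/* each codepoint is two bytes, high byte first */
	for (i = 0; i < num_chars; i++)
		chars[i] = (bytes[2*i] << 8) | bytes[2*i+1];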

[Not to be confused with UTF-16, which is almost the same as UCS-2,
except that UTF-16 supports codepoints above U+FFFF using surrogates
while UCS-2 is limited to the BMP.]
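
To illustrate the difference, here is a rough sketch of decoding one
codepoint from big-endian UTF-16, including surrogate pairs. The
function name decode_utf16be and its return convention are made up for
this example; nothing like it exists in the GRASS or FreeType code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not GRASS code: decode one codepoint from
 * big-endian UTF-16. Returns the number of bytes consumed (2 or 4),
 * or 0 if the input is truncated or contains an unpaired surrogate.
 */
size_t decode_utf16be(const unsigned char *bytes, size_t len,
		      unsigned long *out)
{
	unsigned long hi, lo;

	assert(bytes != NULL && out != NULL);

	if (len < 2)
		return 0;
	hi = ((unsigned long) bytes[0] << 8) | bytes[1];

	/* Outside the surrogate range: identical to UCS-2. */
	if (hi < 0xD800 || hi > 0xDFFF) {
		*out = hi;
		return 2;
	}

	/* A high surrogate must be followed by a low surrogate. */
	if (hi > 0xDBFF || len < 4)
		return 0;
	lo = ((unsigned long) bytes[2] << 8) | bytes[3];
	if (lo < 0xDC00 || lo > 0xDFFF)
		return 0;

	/* Combine the two 10-bit halves and add the 0x10000 offset. */
	*out = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
	return 4;
}
```

For anything in the BMP this reduces to the two-byte UCS-2 case; only
the surrogate branch is extra.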

> Besides, I'm offering the decoder, so it shouldn't make a lot of difference 
> whether it is more complex or not. 
> 
> >> It makes more sense to translate anything that isn't already encoded in 
> >> UTF-8 into UTF-8, then decode UTF-8 to FreeType.  That way UTF-8 systems 
> >> would not have to go through an encode-decode cycle. 
> > 
> > That's easier to program (all conversions other than UTF-8 to
> > UCS-2/UCS-4 become the responsibility of the user), but it's a lot
> > less useful (because the user has to explicitly convert everything).
> 
> Sorry if I misled you.  My suggestion was that the code would retain 
> convert_str and convert_str would use iconv to convert all user-supplied 
> encodings to UTF-8 instead of to UCS-2BE as it does now.  Draw_text would 
> decode UTF-8 to FT_ULong.

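Using UTF-8 as the intermediate encoding doesn't make sense. FreeType
wants integer codepoints, and a fixed-width encoding such as UCS-2 or
UCS-4 maps onto those directly, whereas UTF-8 would need a further
decoding step.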

> There would be no responsibility on the user that 
> isn't there now.  Anything coming in from a UTF-8 system could skip 
> convert_str. 
> 
> But now that you mention it, just using iconv to convert everything to 
> UCS-4BE and casting that to FT_ULong might be a simpler solution yet.

Yes; that's basically what happens now, except that it uses UCS-2
rather than UCS-4. Given the relatively small amounts of data
involved, the performance advantages of using UCS-2 are negligible.

Also, AFAIK, FT_ULong is in the host's byte order, so you either need
to convert char[4] to FT_ULong using shift+or (which is what happens
at present), or use either UCS-4BE or UCS-4LE depending upon the
host's byte order (Vax users are out of luck).
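
As a rough sketch of the shift+or approach applied to UCS-4BE (the
function name ucs4be_to_host is invented for this example, and
codepoint_t is a stand-in for FT_ULong so that it compiles without the
FreeType headers):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for FreeType's FT_ULong; an assumption for this sketch. */
typedef unsigned long codepoint_t;

/*
 * Hypothetical helper: assemble num_chars codepoints from UCS-4BE
 * bytes. The shifts operate on values rather than on memory layout,
 * so the result does not depend on the host's byte order.
 */
void ucs4be_to_host(const unsigned char *bytes, codepoint_t *chars,
		    size_t num_chars)
{
	size_t i;

	assert(bytes != NULL && chars != NULL);

	for (i = 0; i < num_chars; i++) {
		const unsigned char *p = bytes + 4 * i;

		chars[i] = ((codepoint_t) p[0] << 24) |
			   ((codepoint_t) p[1] << 16) |
			   ((codepoint_t) p[2] << 8) |
			    (codepoint_t) p[3];
	}
}
```

This costs a few shifts per character, but avoids having to pick
between UCS-4BE and UCS-4LE at build time.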

> That would leave iconv with the responsibility for checking the
> UTF-8 stream for malformed encodings. I'm not sure how much of that
> checking iconv actually does.

iconv (at least the GNU implementation) is very rigid; it won't accept
anything which doesn't strictly conform to the input encoding.
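
If you want to check this yourself, something along these lines
should do. It is only a sketch: the function name iconv_accepts_utf8
and the fixed 256-byte output buffer are my own choices for the
example, and it assumes an iconv which knows the "UCS-4BE" and "UTF-8"
encoding names (GNU iconv does).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical test helper: returns 1 if iconv converts the input
 * from UTF-8 to UCS-4BE without complaint, 0 if it rejects it as
 * invalid (EILSEQ) or incomplete (EINVAL), and -1 on any other
 * failure (e.g. unsupported encoding, or more than 64 characters
 * of output for the fixed buffer used here).
 */
int iconv_accepts_utf8(const char *input, size_t len)
{
	char outbuf[256];
	char *in = (char *) input;	/* iconv() takes a non-const pointer */
	char *out = outbuf;
	size_t inleft = len, outleft = sizeof(outbuf);
	iconv_t cd;
	size_t rc;
	int err;

	assert(input != NULL);

	cd = iconv_open("UCS-4BE", "UTF-8");	/* (tocode, fromcode) */
	if (cd == (iconv_t) -1)
		return -1;

	rc = iconv(cd, &in, &inleft, &out, &outleft);
	err = errno;		/* save before iconv_close() can change it */
	iconv_close(cd);

	if (rc == (size_t) -1)
		return (err == EILSEQ || err == EINVAL) ? 0 : -1;
	return 1;
}
```

Feeding it an overlong sequence such as "\xC0\x80", or a lead byte with
no continuation byte, should get a 0 back rather than a silently
"repaired" conversion.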

-- 
Glynn Clements <glynn at gclements.plus.com>



