[Proj] Unicode
Glynn Clements
glynn at gclements.plus.com
Mon Jun 8 17:02:04 PDT 2009
Gerald I. Evenden wrote:
> ...
> > printf("%ls\n", L"Schöne Grüße");
>
> For my edification I grabbed a portion of the above string and:
>
> gie@charon:~$ echo 'L"Schöne Grüße");' >foo
> gie@charon:~$ m foo
> L"Schöne Grüße");
> gie@charon:~$ hd foo
> 00000000  4c 22 53 63 68 c3 b6 6e  65 20 47 72 c3 bc c3 9f  |L"Sch..ne Gr....|
> 00000010  65 22 29 3b 0a                                    |e");.|
> 00000015
> gie@charon:~$
>
> I see that the "normal" text is taking up 1 byte per character, and when
> hitting a funky character it escapes with c3 and a code. So it seems that
> when everything is in ASCII we are in normal byte mode, and when an
> extended character comes along it is handled with a two-byte sequence.
>
> Fair enough. This *is not* the impression I got from various previous
> descriptions, as the 16-bit aspect kept coming up and made one think that
> the whole damn string was in 16-bit code.

Unix normally uses 32 bits for wide characters (wchar_t). But you don't
normally use that for storage or interchange (apart from anything else,
you have endianness issues).

But some forms of text processing are inconvenient on multi-byte
representations; e.g. you can't iterate over a char[] processing each
element independently. So it's quite common to convert to wide
characters for processing.

OTOH, life is still much simpler with the ISO-8859-* encodings where
one byte is one character, which is one reason why they're still
widely used.

> As an aside, I dropped the string into vim and it displayed it properly.
> Alas, how does one enter this stuff without dropping into a character map
> display and wearing your mouse out with drag-and-drop?

There are various options.

Most X keyboard layouts configure AltGr plus the punctuation keys on
the RHS of the keyboard as "dead" accents, so e.g. AltGr+semicolon
then e gives eacute; "xmodmap -pk | grep dead_" should list the
combinations.

But I have Shift+AltGr configured as a "Compose" key[1], which allows
mnemonic sequences, e.g. Shift+AltGr then e then single-quote gives
eacute, Shift+AltGr then o then c gives the copyright symbol, etc.

This would be tedious if you need to use accented characters a lot,
but it's adequate (and easier than remembering all of the dead keys)
for occasional non-ASCII characters.

[1] xmodmap -e 'keycode 113 = ISO_Level3_Shift Multi_key'

--
Glynn Clements <glynn at gclements.plus.com>