[Shapelib] Re: shapelib improvements

Sat Dec 29 08:40:15 PST 2007

ANDY CANFIELD wrote:
> If it's not too late and I can put in a vote for the type of Unicode
> to support I'd like to vote it be UTF-16. Most of the Windows 
> platforms like 2000, and XP have their internal character 
> representation as UTF-16. NT was UCS-2 though

Also, Windows 2000 was UCS-2.

> but UTF-16 is compatible with UCS-2 from U+0000 through U+FFFF; UCS-2
> just doesn't support surrogate pairs, so UTF-16 can support the entire
> BMP and the higher planes of Unicode while UCS-2 cannot. .NET also
> internally maintains its characters as UTF-16. The fact that Windows
> internally is UTF-16 may be why an earlier poster had problems with
> UTF-8 encoded paths.

The Windows API does *not* accept UTF-8 strings directly.
If one needs to pass a UTF-8 string to the Windows API, she has to
convert it to a Windows wide-character string (UTF-16).
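
To make the conversion step concrete, here is a minimal sketch in Python (chosen only for brevity; it is not Shapelib code). The helper name `utf8_to_utf16_units` is hypothetical. In a C program on Windows, the same step is usually done with `MultiByteToWideChar` and the `CP_UTF8` code page.

```python
def utf8_to_utf16_units(utf8_bytes):
    """Return the UTF-16 code units (as integers) for a UTF-8 byte string.

    Hypothetical helper for illustration only.
    """
    # Decode the UTF-8 bytes; invalid input raises UnicodeDecodeError.
    text = utf8_bytes.decode("utf-8")
    # Re-encode as little-endian UTF-16 without a BOM.
    utf16 = text.encode("utf-16-le")
    # Group the bytes into 16-bit code units.
    return [int.from_bytes(utf16[i:i + 2], "little")
            for i in range(0, len(utf16), 2)]


if __name__ == "__main__":
    # A path with one non-ASCII character (U+00E9).
    path = "donn\u00e9es.shp".encode("utf-8")
    print([hex(u) for u in utf8_to_utf16_units(path)])
```

A character outside the BMP, such as U+1D11E, comes out as two code units (a surrogate pair), which is exactly the part UCS-2 cannot represent.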

> I know shapelib isn't Windows specific but if we support UTF-16 it
> will make Windows development a whole lot easier and I don't think it
> would make *nix or other platform development any harder using UTF-16
> instead of UTF-8. Mac OS X's Cocoa and Core Foundation frameworks use
> UTF-16 internally

The Mac OS VFS uses UTF-8.
AFAIK, Cocoa and Core Foundation recommend using UTF-8 for file paths.

> as does the Java bytecode environment.

As you say, UTF-16 is used *internally*. Actually, Java's Unicode support
is a *mess*, exposing Unicode in 3 or 4 different ways, including its
own modified version of the UTF-8 encoding (brrr!).
So, actually, different components of Java use different standards,
for example Data{Input|Output}Stream uses modified UTF-8,
OutputStreamWriter and InputStreamReader can use *any* encoding,
String can use *any* encoding, etc.
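
For anyone who has not met Java's modified UTF-8, here is a rough sketch of how it differs from the standard encoding. It is in Python for brevity and `modified_utf8` is a hypothetical name of mine. It only models the byte encoding: U+0000 becomes the two bytes C0 80, and characters above U+FFFF are written as a surrogate pair with each half encoded separately in three bytes. The 2-byte length prefix that `DataOutputStream.writeUTF` also emits is left out.

```python
def modified_utf8(text):
    """Encode text the way Java's modified UTF-8 does (sketch, no length prefix)."""
    out = bytearray()
    # Work on UTF-16 code units, so supplementary characters arrive
    # as two surrogates that get encoded independently.
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            # U+0000 is written as an overlong two-byte form, never as 0x00.
            out += b"\xc0\x80"
        elif unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            out += bytes([0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)])
    return bytes(out)


if __name__ == "__main__":
    clef = "\U0001D11E"
    print(clef.encode("utf-8").hex())   # standard UTF-8, 4 bytes
    print(modified_utf8(clef).hex())    # modified UTF-8, 6 bytes
```

For ASCII and the rest of the BMP (apart from U+0000) the two encodings produce identical bytes, which is why the difference often goes unnoticed.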

For me, the Java and Windows arguments are irrelevant here because
Shapelib does not use the system-specific APIs of any of the systems
listed above. Shapelib is just a data storage/transfer layer and as such,
the only portable and IMHO reasonable choice is UTF-8.
UTF-16 and UTF-32 are more trouble than they're worth.
UTF-8 is the more natural choice because:
- UTF-8 works well with legacy platforms and clients that only
  support 8-bit characters
- UTF-8 is compatible with ASCII
- UTF-8 is more compact
- UTF-8 is byte oriented instead of word oriented
- UTF-8 is C-string friendly
- UTF-8 is more efficient (depending on the range of content)
- UTF-8 is compatible with all Unix systems and is recommended by
  standards bodies such as W3C, IETF, IMC, etc.
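
A few of the points above can be checked with a small sketch. It is in Python for brevity, and `compare_encodings` is a hypothetical helper, not part of any library. It compares the UTF-8 and UTF-16 byte forms of the same text: size, presence of zero bytes (which would terminate a C string early), and whether the UTF-8 form is byte-for-byte identical to ASCII.

```python
def compare_encodings(text):
    """Compare UTF-8 and UTF-16 (little-endian, no BOM) forms of text."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    try:
        ascii_identical = (utf8 == text.encode("ascii"))
    except UnicodeEncodeError:
        # Text contains non-ASCII characters, so there is no ASCII form.
        ascii_identical = False
    return {
        "utf8_len": len(utf8),
        "utf16_len": len(utf16),
        # A zero byte would cut the string short for strlen() and friends.
        "utf8_has_nul": b"\x00" in utf8,
        "utf16_has_nul": b"\x00" in utf16,
        "ascii_identical": ascii_identical,
    }


if __name__ == "__main__":
    for sample in ("roads.shp", "\u6771\u4eac"):
        print(repr(sample), compare_encodings(sample))
```

For an ASCII name like `roads.shp`, UTF-8 takes 9 bytes against 18 for UTF-16, and the UTF-16 form is full of zero bytes. For the two CJK characters in the second sample it is the other way round on size (6 bytes against 4), which is the "depends on the range of content" caveat.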

All of this suggests to me that UTF-8 support is easier to implement for
highly portable data storage software like Shapelib.

> I haven't worked on *nix platforms in almost two years now so I can't
> remember what they use internally. Anyway those are my thoughts and
> if it isn't too late to vote for the Unicode format we support I would
> like to throw my hat in UTF-16's corner.

I don't think we are actually voting, but rather brainstorming :-)

Cheers
-- 
Mateusz Loskot
http://mateusz.loskot.net