[Geotiff] Unicode String in a GeoTiff file

Thu Oct 18 08:24:58 PDT 2007

Niles,

Thank you for the detailed response.  The method you suggest is the
method I used to resolve this issue; that is, I used UTF-8 encoding to
support both standard ASCII and Japanese Kanji.

Best Regards,
Ken

-----Original Message-----
From: Niles Ritter [mailto:ritter at earthlink.net] 
Sent: Thursday, October 18, 2007 10:49 AM
To: Garner, Ken @ KLEIN
Cc: geotiff at lists.maptools.org
Subject: Re: [Geotiff] Unicode String in a GeoTiff file

On Mon, 2007-05-14 at 17:09 -0400, Ken Garner wrote:

>[...]
> I am creating a GeoTiff file for a customer who requires that Unicode
> text strings be stored in the file.  The text strings consist of
> Japanese Kanji characters.  Therefore, an ASCII string will not
> suffice.
> 

Actually, an ASCII string *will* suffice. The Unicode UTF-8 and UTF-7
encodings were designed to mesh cleanly with existing ASCII data
streams and networks, so that the full range of Unicode code points
(over a million of them) will survive a round-trip through the internet.
Most XML, for example, uses a UTF-8 encoding.

Here is a link to a general discussion of the various Unicode UTF
encodings, the BOM, and the advantages of each:

http://unicode.org/faq/utf_bom.html

This is really a more general issue about TIFF files, since all the
information in a GeoTIFF file relies on the TIFF data encoding
standard.

The "philosophy of TIFF" is to make your data as obvious and
accessible as possible, even to a recipient who knows nothing about
your implementation of data encoding and has only the TIFF format
spec on hand. We followed this philosophy as much as we could in
designing GeoTIFF.

At any rate, here is my own take on the issue of Unicode in TIFF:

UTF-8 encoding has the enormous advantage over "double-byte" UTF-16 or
raw UTF-32 encodings in that the encoding of standard low-ASCII
(seven-bit) text looks identical to standard ASCII. UTF-8 encodings of
non-ASCII data show up as variable-length byte substrings, which are
uniquely distinguishable from the ASCII data in which they are
embedded. UTF-8 also does not require the use of a NUL (0) byte, which
is always troublesome for TIFF data readers, even though the spec
allows NULs as delimiters.
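
To make those properties concrete, here is a minimal Python sketch. The
sample string and the helper names to_tiff_ascii / from_tiff_ascii are
hypothetical, made up for illustration; they are not part of libtiff or
libgeotiff. It assumes the text is stored in a TIFF ASCII field, whose
value ends with a single NUL byte that is included in the field count.

```python
def to_tiff_ascii(text: str) -> bytes:
    """Encode text as UTF-8 and append the NUL terminator of a TIFF ASCII field."""
    encoded = text.encode("utf-8")
    # UTF-8 only produces a zero byte for U+0000 itself, so ordinary text
    # never collides with the terminator.
    if 0 in encoded:
        raise ValueError("text contains U+0000, which would clash with the NUL terminator")
    return encoded + b"\x00"


def from_tiff_ascii(raw: bytes) -> str:
    """Strip trailing NUL bytes and decode the remaining bytes as UTF-8."""
    return raw.rstrip(b"\x00").decode("utf-8")


# Hypothetical sample: plain ASCII mixed with two Kanji characters.
sample = "Tokyo 東京"
field = to_tiff_ascii(sample)

# The ASCII part is byte-for-byte identical to plain ASCII.
ascii_part = field[:6]

# Each Kanji becomes three bytes, all with the high bit set, so they
# cannot be mistaken for ASCII bytes.
kanji_part = field[6:-1]
high_bit_only = all(b >= 0x80 for b in kanji_part)

round_trip = from_tiff_ascii(field)
```

A reader that treats the field as opaque bytes up to the NUL will carry
the Kanji through untouched; only the final decode step needs to know
that the bytes are UTF-8.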

About the only thing that may need discussion is whether or not to
prepend the Byte Order Mark (BOM) to the encoded string. This marker,
which has been used as the "indicator" of a Unicode string, is
generally considered optional, but Unicode-savvy data readers are
required to recognize and skip the marker when reading.
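
As a rough illustration of a reader that tolerates either choice, the
Python sketch below decodes a string with or without a leading UTF-8
BOM. The function name decode_tag_text is hypothetical; the "utf-8-sig"
codec and codecs.BOM_UTF8 are from the Python standard library.

```python
import codecs


def decode_tag_text(raw: bytes) -> str:
    """Decode UTF-8 tag bytes, skipping a leading BOM if one is present."""
    # Drop the trailing NUL terminator(s) of a TIFF ASCII field first.
    raw = raw.rstrip(b"\x00")
    # "utf-8-sig" removes a leading EF BB BF if present and otherwise
    # behaves like plain "utf-8".
    return raw.decode("utf-8-sig")


plain = "東京".encode("utf-8") + b"\x00"
with_bom = codecs.BOM_UTF8 + plain

text_plain = decode_tag_text(plain)
text_with_bom = decode_tag_text(with_bom)
```

Both inputs decode to the same text, so a writer can make either
decision about the BOM without breaking a reader built this way.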

--Niles Ritter (author, GeoTIFF 1.0 standard)