[Geotiff] Unicode String in a GeoTiff file
ritter at earthlink.net
Thu Oct 18 07:49:17 PDT 2007
On Mon, 2007-05-14 at 17:09 -0400, Ken Garner wrote:
> I am creating a GeoTiff file for a customer who requires that Unicode
> text strings be stored in the file. The text strings consist of
> Japanese Kanji characters. Therefore, an ASCII string will not
Actually, an ASCII string *will* suffice. The Unicode UTF-8 and UTF-7
encoding standards were designed to mesh cleanly with existing
ASCII data streams and networks, so that any Unicode code point (the
code space runs from U+0000 to U+10FFFF, a little over a million
values) will survive a round trip through the internet.
Most XML, for example, specifies a UTF-8 encoding.
Here is a link to a general discussion of the various UTF unicode
encodings, the BOM, and the advantages of each:
This is really a more general issue about TIFF files, since all the
information in a GeoTIFF file relies on the TIFF data encoding.
The "philosophy of TIFF" is, when possible, make your data as obvious
and accessible as possible, even if the recipient may know nothing
about your implementation of data encoding, and only has the TIFF
format spec on hand. We followed this philosophy as much as
we could in designing GeoTIFF.
At any rate, here is my own take on the issue of Unicode in TIFF:
UTF-8 encoding has the enormous advantage over "double-byte" UTF-16 or
raw UTF-32 encodings that standard low-ASCII (seven-bit) text encodes
to exactly the same bytes as standard ASCII. Non-ASCII characters are
encoded as variable-length byte sequences in which every byte has its
high bit set, so they are always distinguishable from the ASCII data in
which they are embedded. UTF-8 also never requires a NUL (0) byte inside
a string; NUL bytes are always troublesome for TIFF data readers, even
though the spec allows them as delimiters.
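
As a rough illustration of the points above, here is a minimal Python
sketch of packing a mixed ASCII/Kanji string into a NUL-terminated byte
string of the kind a TIFF ASCII-type field holds. The function names
and the sample string are hypothetical, chosen only for this example;
this is not code from libtiff or libgeotiff.

```python
# Sketch only: helper names below are made up for illustration.

def encode_for_tiff_ascii(text: str) -> bytes:
    """Encode text as UTF-8 and append the single terminating NUL byte."""
    if "\x00" in text:
        # U+0000 is the only code point whose UTF-8 form is a zero byte.
        raise ValueError("embedded U+0000 would be read as a terminator")
    return text.encode("utf-8") + b"\x00"


def decode_from_tiff_ascii(raw: bytes) -> str:
    """Cut at the first NUL and decode the bytes before it as UTF-8."""
    return raw.split(b"\x00", 1)[0].decode("utf-8")


sample = "Tokyo 東京"
payload = encode_for_tiff_ascii(sample)

# The ASCII prefix is byte-for-byte identical to plain ASCII.
assert payload[:6] == b"Tokyo "
# Every byte of the Kanji portion has its high bit set, so it cannot be
# confused with an ASCII character or with a NUL.
assert all(b >= 0x80 for b in payload[6:-1])
# The only zero byte is the terminator added at the end.
assert payload.count(b"\x00") == 1
# The string survives the round trip unchanged.
assert decode_from_tiff_ascii(payload) == sample
```

A reader that knows nothing about UTF-8 still sees "Tokyo " correctly
and simply gets a run of high-bit bytes for the rest.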
About the only thing that may need discussion is whether or not to
prepend the Byte Order Mark (BOM) at the beginning of the encoded string.
This marker, which has been used as the "indicator" of a Unicode string,
is generally considered optional, but Unicode-savvy data readers
are required to recognize the marker and skip it when reading.
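
For what it is worth, a reader can cope with both cases in a few lines.
In UTF-8 the BOM is the three bytes EF BB BF. The sketch below uses
Python's standard codecs module; strip_utf8_bom and read_text are
hypothetical helper names, not part of any TIFF library.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


def read_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, accepting input with or without a leading BOM."""
    return strip_utf8_bom(raw).decode("utf-8")


without_bom = "東京".encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

# Both forms decode to the same text.
assert read_text(with_bom) == read_text(without_bom) == "東京"
```

Python's built-in "utf-8-sig" codec does the same skipping on decode, so
raw.decode("utf-8-sig") would be an alternative to the helper above.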
--Niles Ritter (author, GeoTIFF 1.0 standard)