[postgis-devel] [Fwd: Bug #148: shp2pgsql has problems with certain codepoints]
Michael Fuhr
mike at fuhr.org
Sun Jun 3 08:09:45 PDT 2007
On Sun, Jun 03, 2007 at 03:08:06PM +0100, Mark Cave-Ayland wrote:
> After playing around this morning, I've discovered the cause of bug 148;
> it's related to the automatic trimming of spaces from strings by
> shapelib.
>
> One of the problem shape files I have contains the following data in a
> string field which is encoded in UTF8: (note that it is padded with
> spaces)
>
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20
>
> However, the resulting output in the SQL file looked like this:
>
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f
>
> So it's fairly easy to see what's going on: since TRIM_DBF_WHITESPACE is
> defined in shapefil.h, shapelib is attempting to trim trailing spaces
> from the fields. However, it is eating one too many characters from the
> end since the final character should read "e5 9f 20" - this is because
> it naively removes all 0x20 characters from the end of the string
> without realising the final 0x20 is part of a UTF8 character.
e5 9f 20 is an ill-formed UTF-8 sequence because 0x20 isn't a valid
byte in UTF-8 multibyte sequences. See Table 3-6 in the Conformance
chapter of The Unicode Standard:
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
(The same table appears as Table 3-7 in the 5.0 standard, which
should be available soon in softcopy.)
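One of the design features of UTF-8 is non-overlap: the byte ranges
for leading bytes (0xC2-0xF4), trailing bytes (0x80-0xBF), and
single-byte values (0x00-0x7F) are disjoint. A 0x20 byte is therefore
always a space in its own right, never part of a multibyte character.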
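To make that concrete, here is a minimal sketch in Python (not shapelib
or shp2pgsql code) that runs the bytes quoted above through a strict
UTF-8 decoder. The helper names classify and is_well_formed are made up
for illustration, and the byte classes are a simplification of Table 3-6,
which puts further limits on the second byte after some leading bytes.

```python
def classify(byte: int) -> str:
    """Simplified byte classes for well-formed UTF-8 (hypothetical helper)."""
    if byte <= 0x7F:
        return "single"        # ASCII, including 0x20
    if byte <= 0xBF:
        return "trailing"      # 0x80..0xBF
    if 0xC2 <= byte <= 0xF4:
        return "leading"
    return "never valid"       # 0xC0, 0xC1, 0xF5..0xFF


def is_well_formed(data: bytes) -> bool:
    """Rely on Python's strict UTF-8 decoder to reject ill-formed input."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The padded field exactly as quoted: 23 data bytes plus 39 spaces.
FIELD = bytes.fromhex(
    "e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5"
    " 8c 97 e6 9d 9c e5 9f" + " 20" * 39
)
trimmed = FIELD.rstrip(b" ")   # what trimming trailing spaces leaves

print(classify(0x20))                               # byte class of a space
print(is_well_formed(bytes.fromhex("e5 9f 20")))    # the proposed "character"
print(is_well_formed(trimmed))                      # field after trimming
print(is_well_formed(FIELD[:21]))                   # first seven 3-byte sequences
```

With these bytes, the trailing e5 9f is a truncated sequence whether or
not the spaces are removed, so the padded input is ill-formed before any
trimming takes place.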
--
Michael Fuhr