[postgis-devel] [Fwd: Bug #148: shp2pgsql has problems with certain codepoints]

Sun Jun 3 08:09:45 PDT 2007

On Sun, Jun 03, 2007 at 03:08:06PM +0100, Mark Cave-Ayland wrote:
> After playing around this morning, I've discovered the cause of bug 148;
> it's related to the automatic trimming of spaces from strings by
> shapelib.
> 
> One of the problem shape files I have contains the following data in a
> string field which is encoded in UTF8: (note that it is padded with
> spaces)
> 
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20
> 
> However, the resulting output in the SQL file looked like this:
> 
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f
> 
> So it's fairly easy to see what's going on: since TRIM_DBF_WHITESPACE is
> defined in shapefil.h, shapelib is attempting to trim trailing spaces
> from the fields. However, it is eating one too many characters from the
> end since the final character should read "e5 9f 20" - this is because
> it naively removes all 0x20 characters from the end of the string
> without realising the final 0x20 is part of a UTF8 character.

e5 9f 20 is an ill-formed UTF-8 sequence: 0x20 is never a valid byte
inside a UTF-8 multibyte sequence, where every byte after the lead byte
must be in the range 80..BF.  See Table 3-6 in the Conformance chapter
of The Unicode Standard:

http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf

(The same table appears as Table 3-7 in the 5.0 standard, which
should be available soon in softcopy.)
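
To make the table concrete, here is a rough sketch of a check for the
one row that applies to a lead byte of 0xE5 (lead byte E1..EC, then two
bytes in 80..BF).  The function name is made up for illustration and is
not part of shapelib or PostGIS, and the other rows of the table are
deliberately left out:

```c
#include <stddef.h>

/* Hypothetical helper, not part of shapelib or PostGIS.
 * Checks a single three-byte sequence against the E1..EC row of the
 * well-formed UTF-8 table: lead byte E1..EC, followed by two trailing
 * bytes that must each be in the range 80..BF.
 * Returns 1 if the sequence matches that row, 0 otherwise. */
static int is_wellformed_e1_ec(const unsigned char *s, size_t len)
{
    if (len != 3)
        return 0;
    if (s[0] < 0xE1 || s[0] > 0xEC)
        return 0;
    return (s[1] >= 0x80 && s[1] <= 0xBF) &&
           (s[2] >= 0x80 && s[2] <= 0xBF);
}
```

Under this check e5 ae 89 (the first character in the field above)
passes, while e5 9f 20 fails because 0x20 is below 0x80.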

One of the design features of UTF-8 is non-overlap: the byte ranges
for lead bytes, trail bytes, and single-byte characters are disjoint,
so a 0x20 byte is always a space and never part of a multibyte
character.
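
A small sketch of what non-overlap means in practice, assuming
well-formed UTF-8 input.  The names below are invented for this example
and this is not shapelib's actual trimming code; it only shows that a
byte can be classified by value alone and that stripping trailing 0x20
bytes cannot cut into a multibyte character:

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum utf8_byte_class { UTF8_SINGLE, UTF8_TRAIL, UTF8_LEAD, UTF8_INVALID };

/* Classify a byte by its value alone; the ranges do not overlap. */
static enum utf8_byte_class utf8_classify(unsigned char b)
{
    if (b <= 0x7F)
        return UTF8_SINGLE;   /* 00..7F: single-byte (ASCII) */
    if (b <= 0xBF)
        return UTF8_TRAIL;    /* 80..BF: trailing byte */
    if (b >= 0xC2 && b <= 0xF4)
        return UTF8_LEAD;     /* C2..F4: lead byte of a sequence */
    return UTF8_INVALID;      /* C0, C1, F5..FF never occur */
}

/* Return the length of s after dropping trailing 0x20 bytes.
 * Since 0x20 is always a single-byte value, this never removes part
 * of a multibyte character. */
static size_t trim_trailing_spaces(const unsigned char *s, size_t len)
{
    while (len > 0 && s[len - 1] == 0x20)
        len--;
    return len;
}
```

For example, trimming the four bytes e5 9f 20 20 leaves a length of 2:
only the spaces go, and e5 9f is left exactly as it was stored.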

-- 
Michael Fuhr