[postgis-devel] [Fwd: Bug #148: shp2pgsql has problems with certain codepoints]

Sun Jun 3 09:23:01 PDT 2007

On Sun, 2007-06-03 at 09:09 -0600, Michael Fuhr wrote:

> e5 9f 20 is an ill-formed UTF-8 sequence because 0x20 isn't a valid
> byte in UTF-8 multibyte sequences.  See Table 3-6 in the Conformance
> chapter of The Unicode Standard:
> 
> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
> 
> (The same table appears as Table 3-7 in the 5.0 standard, which
> should be available soon in softcopy.)
> 
> One of the design features of UTF-8 is non-overlap: the ranges of
> values for leading, trailing, and single-byte values are disjoint.

Hi Michael,

Thanks for the reference.

Ugh. So in that case the problem must be caused by a bad encoding
somewhere. Back to the drawing board, although it now looks more likely
that this is not shp2pgsql's bug...

(digs deeper)

Now this is interesting. The archive sent by Bruce also has a copy of
the dataset in CSV format which is different from the shapefile version:

Shapefile (padded with spaces):

e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
8c 97 e6 9d 9c e5 9f 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20

CSV:

e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5 
8c 97 e6 9d 9c e5 9f a0
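
A quick way to pin down where the two values diverge is to compare them
byte by byte. The following is only a sketch in Python: the names
shp_bytes and csv_bytes are made up for illustration, and the shapefile
value is cut off after its first padding byte so that both values have
the same length.

```python
# Bytes copied from the two dumps above. The shapefile value is truncated
# after its first 0x20 so both values are 24 bytes long (an assumption made
# purely to keep the comparison simple).
shp_bytes = bytes.fromhex(
    "e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5"
    "8c 97 e6 9d 9c e5 9f 20"
)
csv_bytes = bytes.fromhex(
    "e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5"
    "8c 97 e6 9d 9c e5 9f a0"
)

# Collect (offset, shapefile byte, CSV byte) for every position that differs.
diffs = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(shp_bytes, csv_bytes))
    if a != b
]

for offset, a, b in diffs:
    print(f"offset {offset}: shapefile {a:02x}, CSV {b:02x}")
```

The only difference is the final byte of the last three-byte sequence:
20 in the shapefile, a0 in the CSV.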

Reading through the spec that you posted, e5 9f a0 is likely to be the
correct version, since the top two bits of a0 are 10, which marks it as
a valid trailing byte. Bruce, could it be that the shapefiles were
encoded incorrectly from the source data? For reference, the data I am
looking at above is for ID 2405 (Wancheng Xian) in the PRES_LOC field.
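
The trailing-byte check can be sketched in a few lines of Python. This
is an illustration only: is_continuation and is_valid_utf8 are
hypothetical helper names, and Python's built-in UTF-8 decoder stands in
for a full check against the table in the standard.

```python
def is_continuation(byte: int) -> bool:
    """Return True if the top two bits are 10, i.e. the byte is 0x80..0xBF."""
    return (byte & 0xC0) == 0x80


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Final three bytes of the field in each version of the data.
shp_tail = bytes.fromhex("e5 9f 20")  # shapefile
csv_tail = bytes.fromhex("e5 9f a0")  # CSV

print(is_continuation(0x20), is_valid_utf8(shp_tail))  # False False
print(is_continuation(0xA0), is_valid_utf8(csv_tail))  # True True
```

With the a0 byte in place, the sequence decodes to the single code point
U+57E0; with 20 it is rejected as ill-formed.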

Kind regards,

Mark.

-- 
ILande - Open Source Consultancy
http://www.ilande.co.uk