[postgis-devel] [Fwd: Bug #148: shp2pgsql has problems with certain codepoints]

Michael Fuhr mike at fuhr.org
Sun Jun 3 14:50:57 PDT 2007


On Sun, Jun 03, 2007 at 05:23:01PM +0100, Mark Cave-Ayland wrote:
> Now this is interesting. The archive sent by Bruce also has a copy of
> the dataset in CSV format which is different from the shapefile version:
> 
> Shapefile (padded with spaces):
> 
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20
> 
> CSV:
> 
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5 
> 8c 97 e6 9d 9c e5 9f a0

Do you have a link to this data?  I'm finding only Bruce Rusk's
original message to postgis-users and Bug #148, which has no
attachments.

http://postgis.refractions.net/pipermail/postgis-users/2007-May/015518.html
http://postgis.refractions.net/bugs/bug.php?op=show&bugid=148

In the comments for Bug #148 somebody (Bruce?) observes that three
of the four troublesome code points end in E0:

U+65E0
U+57E0
U+5F20
U+7AE0

The UTF-8 encodings of all four of these code points end in 0xa0:

U+65E0  e6 97 a0
U+57E0  e5 9f a0
U+5F20  e5 bc a0
U+7AE0  e7 ab a0
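
The shared final byte follows from how UTF-8 packs a three-byte
sequence: the last byte is 10xxxxxx, where xxxxxx are the low six
bits of the code point.  Code points ending in E0 or 20 both have
100000 in those bits, which gives 0xa0.  A minimal Python sketch to
check the table above (the helper names are only illustrative):

```python
# The four code points reported in Bug #148.
CODE_POINTS = [0x65E0, 0x57E0, 0x5F20, 0x7AE0]


def utf8_hex(cp):
    """Return the UTF-8 encoding of a code point as spaced hex bytes."""
    return " ".join(f"{b:02x}" for b in chr(cp).encode("utf-8"))


def last_utf8_byte(cp):
    """Final byte of a multi-byte UTF-8 sequence: 10xxxxxx, where
    xxxxxx are the low six bits of the code point."""
    return 0x80 | (cp & 0x3F)


for cp in CODE_POINTS:
    print(f"U+{cp:04X}  {utf8_hex(cp)}  last byte: {last_utf8_byte(cp):#04x}")
```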

In some single-byte encodings such as ISO 8859-1 (Latin-1) the
byte 0xa0 represents NBSP (U+00A0 NO-BREAK SPACE).  I wonder
if something somewhere is treating 0xa0 as a space instead of as
the final byte of a UTF-8 sequence.
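
To illustrate that guess (and only the guess; I don't know where in
the toolchain this would actually happen), here is a Python sketch.
It takes the CSV bytes Mark quoted, pads them to the 62-byte field
width seen in his shapefile dump, and trims with a hypothetical
routine that counts 0xa0 as whitespace the way a Latin-1 aware trim
might.  The function name and the trim-then-repad sequence are
assumptions for the sake of the example:

```python
# Bytes of the CSV value quoted above (valid UTF-8, ends in e5 9f a0).
CSV_BYTES = bytes.fromhex(
    "e5ae89e5bebde6bd9ce5b1b1e58ebfe5"
    "8c97e69d9ce59fa0"
)

# Field width counted from the quoted shapefile dump (16 + 16 + 16 + 14).
FIELD_WIDTH = 62


def latin1_style_rstrip(raw):
    """Hypothetical trim that treats both 0x20 and 0xa0 as trailing
    whitespace, as a Latin-1 aware routine might."""
    return raw.rstrip(b" \xa0")


padded = CSV_BYTES.ljust(FIELD_WIDTH, b" ")   # space-padded field
trimmed = latin1_style_rstrip(padded)         # also eats the final 0xa0
repadded = trimmed.ljust(FIELD_WIDTH, b" ")   # written back at full width

print(trimmed.hex())

try:
    trimmed.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    # The last character is now missing its final byte.
    is_valid_utf8 = False

print("valid UTF-8 after trim:", is_valid_utf8)
```

Under these assumptions the re-padded result matches the shapefile
bytes in the quoted dump: the value ends in e5 9f followed by
spaces, which is no longer valid UTF-8.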

-- 
Michael Fuhr


