[postgis-devel] [Fwd: Bug #148: shp2pgsql has problems with certain codepoints]
Michael Fuhr
mike at fuhr.org
Sun Jun 3 14:50:57 PDT 2007
On Sun, Jun 03, 2007 at 05:23:01PM +0100, Mark Cave-Ayland wrote:
> Now this is interesting. The archive sent by Bruce also has a copy of
> the dataset in CSV format which is different from the shapefile version:
>
> Shapefile (padded with spaces):
>
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
> 20 20 20 20 20 20 20 20 20 20 20 20 20 20
>
> CSV:
>
> e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
> 8c 97 e6 9d 9c e5 9f a0
Do you have a link to this data? I'm finding only Bruce Rusk's
original message to postgis-users and Bug #148, which has no
attachments.
http://postgis.refractions.net/pipermail/postgis-users/2007-May/015518.html
http://postgis.refractions.net/bugs/bug.php?op=show&bugid=148
In the comments for Bug #148 somebody (Bruce?) observes that three
of the four troublesome code points end in E0:
U+65E0
U+57E0
U+5F20
U+7AE0
The UTF-8 encodings of all four of these code points end in 0xa0:
U+65E0 e6 97 a0
U+57E0 e5 9f a0
U+5F20 e5 bc a0
U+7AE0 e7 ab a0
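The common trailing byte follows from how UTF-8 packs bits: the last byte of a sequence is 0x80 plus the low six bits of the code point, and both xxE0 and xx20 have low six bits 100000 (0x20), giving 0x80 | 0x20 = 0xa0. A small sketch to reproduce the table above (Python is my own choice here, nothing in the bug report uses it):

```python
# Encode each troublesome code point as UTF-8 and look at the final byte.
code_points = [0x65E0, 0x57E0, 0x5F20, 0x7AE0]

encoded = {cp: chr(cp).encode("utf-8") for cp in code_points}

for cp, raw in encoded.items():
    # Print in the same "U+XXXX b1 b2 b3" layout as the table above.
    print(f"U+{cp:04X} {raw.hex(' ')}")

# Collect the distinct trailing bytes across all four encodings.
last_bytes = {raw[-1] for raw in encoded.values()}

# The trailing byte is 0x80 | (low six bits of the code point).
predicted = {0x80 | (cp & 0x3F) for cp in code_points}
```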
In some single-byte encodings such as ISO 8859-1 (Latin-1) the
character 0xa0 represents NBSP (U+00A0 NO-BREAK SPACE). I wonder
if something somewhere is treating 0xa0 as a space instead of as a
trailing byte in a UTF-8 sequence.
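To illustrate that guess, here is a rough Python sketch of how whitespace trimming that treats the bytes as Latin-1 could eat the final 0xa0. It is not shp2pgsql code and says nothing about where in the toolchain the byte is actually lost; the 62-byte width is simply the length of the padded dump quoted above, and the variable names are made up for the example.

```python
# Bytes from the CSV dump quoted above: eight characters, 24 bytes of UTF-8.
csv_bytes = bytes.fromhex(
    "e5ae89 e5bebd e6bd9c e5b1b1 e58ebf e58c97 e69d9c e59fa0"
)

FIELD_WIDTH = 62  # assumed: length of the space-padded dump in the quote

# A fixed-width field holding the intact string, padded with spaces.
padded = csv_bytes.ljust(FIELD_WIDTH, b" ")

# Trimming ASCII spaces only leaves the UTF-8 sequence intact.
ascii_trimmed = padded.rstrip(b" ")

# Trimming after interpreting the bytes as Latin-1 also removes 0xa0,
# because str.rstrip() counts U+00A0 NO-BREAK SPACE as whitespace.
latin1_trimmed = padded.decode("latin-1").rstrip().encode("latin-1")

# Re-padding the damaged value gives a field ending in "e5 9f 20 20 ...".
repadded = latin1_trimmed.ljust(FIELD_WIDTH, b" ")

# The damaged value is no longer valid UTF-8.
try:
    latin1_trimmed.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
```

With these inputs the Latin-1-aware trim leaves a value ending in e5 9f, the same truncated sequence that appears in the shapefile dump. A C program calling isspace() on raw bytes under a Latin-1 locale could plausibly behave the same way, but that is speculation on my part.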
--
Michael Fuhr