[postgis-devel] pgsql2shp, varchar(n), and UTF-8
Paul Ramsey
pramsey at refractions.net
Mon Jun 4 08:19:34 PDT 2007
Very interesting stuff! I think this goes beyond shp2pgsql to the
question of how DBF is supposed to handle unicode.
DBF has fixed-width columns. char(4) means, to some extent, a 4-byte
wide column. The problem seems to be our natural combination of the
idea of "four bytes wide" with "four characters wide".
Perhaps the answer is just to make much much fluffier DBF output in
shp2pgsql, allocating a full four bytes for every declared character
in the table definitions? That way we'll never get squeezed out,
even if the characters are from huge four-byte code pages.
P
On 3-Jun-07, at 10:57 PM, Michael Fuhr wrote:
> While investigating the recent discussion about UTF-8 and shapefiles
> I discovered what I think is a flaw in the way pgsql2shp handles
> varchar(n) and char(n) columns in UTF-8 (and probably any multi-byte
> encoding) databases. Example:
>
> create table foo (
> id integer primary key,
> val char(4) not null
> );
> select addgeometrycolumn('foo', 'geom', 4326, 'POINT', 2);
> insert into foo (id, val, geom) values (1, 'ñ', geomfromtext('POINT
> (0 0)', 4326));
>
> Then:
>
> pgsql2shp dbname foo
>
> A hex dump of foo.dbf shows the char(4) data as c3 b1 20 20, which
> is the two UTF-8 bytes for 'ñ' (c3 b1) plus two spaces to pad the
> data to four characters. But that seems wrong because c3 b1
> represents a single character so the padding should be three spaces,
> not two. psql's \copy command pads this value with three spaces
> as I would expect. pgsql2shp's behavior is a bigger problem if it
> truncates UTF-8 data: if we declare val to be varchar(1) or char(1)
> then pgsql2shp dumps only the first byte (c3) of the UTF-8 sequence.
> Another example: if val is varchar(7) or char(7) then 'español'
> gets dumped as 'españo', losing the final character because the
> UTF-8 encoding of this 7-character string is 8 bytes.
>
> --
> Michael Fuhr
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
More information about the postgis-devel
mailing list