[postgis-devel] pgsql2shp, varchar(n), and UTF-8

Michael Fuhr mike at fuhr.org
Sun Jun 3 22:57:22 PDT 2007


While investigating the recent discussion about UTF-8 and shapefiles
I discovered what I think is a flaw in the way pgsql2shp handles
varchar(n) and char(n) columns in UTF-8 (and probably any multi-byte
encoding) databases.  Example:

create table foo (
    id   integer primary key,
    val  char(4) not null
);
select addgeometrycolumn('foo', 'geom', 4326, 'POINT', 2);
insert into foo (id, val, geom) values (1, 'ñ', geomfromtext('POINT(0 0)', 4326));

Then:

pgsql2shp dbname foo

A hex dump of foo.dbf shows the char(4) data as c3 b1 20 20, which
is the two UTF-8 bytes for 'ñ' (c3 b1) plus two spaces to pad the
data to four characters.  But that seems wrong because c3 b1
represents a single character so the padding should be three spaces,
not two.  psql's \copy command pads this value with three spaces
as I would expect.  pgsql2shp's behavior is a bigger problem if it
truncates UTF-8 data: if we declare val to be varchar(1) or char(1)
then pgsql2shp dumps only the first byte (c3) of the UTF-8 sequence.
Another example: if val is varchar(7) or char(7) then 'español'
gets dumped as 'españo', losing the final character because the
UTF-8 encoding of this 7-character string is 8 bytes.

-- 
Michael Fuhr



More information about the postgis-devel mailing list