[MAPSERVER-USERS] ASCII -> UTF-8 convert problems for importing (GIS) data

rich.fromm nospam420 at yahoo.com
Mon Apr 21 13:51:15 EDT 2008



Stefan Schwarzer wrote:
> 
> 
>>> hmm.... I have a shapefile, which has some unorthodox characters (Ç,
>>> ì, ...). Now, when importing the file (via shp2pgsql) into postgres,
>>> it complains about it not being UTF-8 (my database has that format).
>>>
>>> So, how can I convert either the dbf file or then, at a later stage,
>>> the created text file from (I guess) ASCII into UTF-8?
> 
>> You have an option for shp2pgsql (-W I think) to tell shp2pgsql to
>> convert your data into this encoding:
> 
> Yep, tried that too. But I get this message:
> 
> shp2pgsql -s 4326 -I -W UTF-8 -D countries.shp gis.countries_new > countries_new.sql
> Shapefile type: Polygon
> Postgis type: MULTIPOLYGON[2]
> utf8: Invalid or incomplete multibyte or wide character
> 
> We didn't really understand if the "-W" is to specify what the format  
> is (which we assumed) or into which format it has to be transformed.
> 
> So, we would need something  like transform ASCII into UTF-8.
> 

-W describes the encoding of the input, not the target.  If you use it,
the output will always be UTF-8.  From the shp2pgsql(1) man page:

---
       -W <encoding>
              Specify the character encoding of Shapefile's attributes.
              If this option is used the output will be encoded in UTF-8.
---

So no, you don't want to transform it from ASCII, because you clearly don't
have ASCII input, as ASCII does not have the characters you describe.
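
To make that concrete, here is a small Python sketch (my own illustration,
not anything shp2pgsql does internally).  It assumes your two characters are
stored as the single bytes 0xC7 and 0xEC, and checks which encodings accept
them.  The helper name decodes_as is made up for this example.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if every byte sequence in data is valid in the encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Assumption: the two characters from the shapefile, one byte each.
sample = bytes([0xC7, 0xEC])

for enc in ("ascii", "utf-8", "iso-8859-1"):
    print(enc, decodes_as(sample, enc))
```

ASCII rejects any byte above 0x7F, and UTF-8 rejects these two because each
one starts a multibyte sequence that is never completed, which is consistent
with the "Invalid or incomplete multibyte" message you got with -W UTF-8.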

You need to find out what the input data is encoded in.  A very likely
candidate is ISO-8859-1 (aka Latin-1).

Take a look at the actual hex values of some of the non-ASCII characters.
(I use hexl-mode in emacs to do this, but there are plenty of other ways.)
Compare them to ISO-8859-1, for example at either of these:

http://en.wikipedia.org/wiki/ISO_8859-1
http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n411.pdf
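
If you would rather not use a hex editor, something like the following
Python sketch lists the byte values above 0x7F that occur in a file.  The
function names and the file name countries.dbf are assumptions for
illustration.  Note that a .dbf file also has a binary header, so expect
some noise from bytes that are not attribute text.

```python
from collections import Counter


def non_ascii_bytes(data: bytes) -> Counter:
    """Count every byte value outside the 7-bit ASCII range."""
    return Counter(b for b in data if b >= 0x80)


def report(path: str) -> None:
    """Print each non-ASCII byte value found in the file, with its count."""
    with open(path, "rb") as f:
        counts = non_ascii_bytes(f.read())
    for value, n in sorted(counts.items()):
        print(f"0x{value:02X} seen {n} times")


# Hypothetical usage:
# report("countries.dbf")
```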

For the two examples you cite, we have:

0xC7 LATIN CAPITAL LETTER C WITH CEDILLA
0xEC LATIN SMALL LETTER I WITH GRAVE
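
The same lookup can be done in Python instead of reading a code chart.  This
is only a sketch: describe is a name I made up, and ISO-8859-1 is still just
the guess being tested.

```python
import unicodedata


def describe(byte_value: int, encoding: str = "iso-8859-1"):
    """Decode a single byte and return the character with its Unicode name."""
    char = bytes([byte_value]).decode(encoding)
    return char, unicodedata.name(char, "<unnamed>")


for value in (0xC7, 0xEC):
    char, name = describe(value)
    # Also show the bytes the character would occupy once converted to UTF-8.
    print(f"0x{value:02X} {char} {name} -> UTF-8 {char.encode('utf-8').hex()}")
```

If the characters printed are the ones you expect to see in your attribute
data, the guess is at least plausible.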

Do the bytes in your file match these?  Even if they do, this is still a bit
of a guessing game, because a match does not prove the encoding: you could
find many matches and still not be right.  ISO-8859-15, for example, is
identical to ISO-8859-1 except at eight code points.  A better way would be
to look at the documentation for your input data, or ask the provider of
the data.
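
To show how little separates the two candidates, this Python sketch lists
the byte values that decode differently under ISO-8859-1 and ISO-8859-15,
using Python's built-in codecs for both.  The function name
differing_bytes is an assumption for illustration.

```python
def differing_bytes(enc_a: str, enc_b: str) -> list:
    """Return the byte values that the two single-byte encodings decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(enc_a) != bytes([value]).decode(enc_b)
    ]


diff = differing_bytes("iso-8859-1", "iso-8859-15")
print([f"0x{value:02X}" for value in diff])

# Neither of the two bytes discussed above is among the differences,
# so they cannot tell the two encodings apart.
print(0xC7 in diff, 0xEC in diff)
```

Only if your data contains one of the listed byte values can a hex dump
help you choose between the two.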

- Rich

-- 
View this message in context: http://www.nabble.com/ASCII--%3E-UTF-8-convert-problems-for-importing-%28GIS%29-data-tp16768968p16808302.html
Sent from the Mapserver - User mailing list archive at Nabble.com.


