[GRASS-dev] new v.in.geonames: problems with UTF-8 Unicode text

Markus Neteler neteler at osgeo.org
Mon Jun 30 14:51:40 EDT 2008


On Mon, Jun 30, 2008 at 5:26 PM, Glynn Clements
<glynn at gclements.plus.com> wrote:
> Markus Neteler wrote:
>
>> I am writing v.in.geonames to easily read in data from
>> http://download.geonames.org/export/dump/
>>
>> The script is essentially using v.in.ascii to read in the CSV file encoded
>> in UTF-8 Unicode text. There are placenames in various languages including
>> Japanese.
>> v.in.ascii isn't able to read them properly and fails on such lines, example:
>>
>> 3165456 Torre del Greco Torre del Greco Torre d%27%27o Grieco,Torre
>> d&#x0027;'o Grieco,Torre d''o Grieco,Torre del Greco,トッレ・デル・グレーコ
>> 40.7839209532791        14.3708038330078        P       PPL     IT
>>          04      NA      063084          90607           72
>> Europe/Rome     2008-06-28
>>
>> (I have slightly improved the v.in.ascii error message, not yet submitted):
>>
>> ERROR: Unparsable latitude value in column <4>: 'o Grieco,Torre d''o
>>        Grieco,Torre del Greco,トッレ・デル・グレーコ
>>
>> How to fix this problem?
>
> Can you please provide accurate and sufficient information about the
> problem?

As always, I try.

> I.e. the exact data being fed to v.in.ascii (*before* it has been
> mangled by the various components of the email chain), the v.in.ascii
> command which is failing, etc.

Attached is the original file reduced to the one offending line:
cd /tmp
wget http://download.geonames.org/export/dump/IT.zip
unzip IT.zip
grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv

Import into a LatLong location, replicating the script functionality:

v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
columns='geonameid integer, name varchar(200), asciiname varchar(200),
alternatename varchar(4000), latitude double precision, longitude
double precision, featureclass varchar(1), featurecode varchar(10),
countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
admin2code varchar(20), admin3code varchar(20), admin4code
varchar(20), population integer, elevation varchar(5), gtopo30
integer, timezone varchar(50), modification date' --o
Scanning input for column types...
Current row: '�,དགའ་དའ་རས,ཨ་ཊ་ལ།,ཨཊ་ལ,იტალია,ጣሊያን,ጣልያን,អតាល,'Itāria,イタリア,イタリア共和国,意大利,ꑴꄊꆺ,이탈리아
     42.8333333      12.8333333A       PCLI    IT              00
                        58145000                762     Europe/Rome
 2008-03-21'
ERROR: Unparsable latitude value in column <4>: PCLI
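
To narrow this down independently of v.in.ascii, one option is to inspect
the raw bytes of the offending record. The following is only a minimal
sketch: the helper name inspect_line is hypothetical, and the expected
count of 19 fields is taken from the columns= definition above. It reports
the byte length, the character count, the number of tab-separated fields,
and whether the line is valid UTF-8.

```python
import sys


def inspect_line(raw: bytes, expected_fields: int = 19) -> dict:
    """Report basic facts about one tab-separated record given as raw bytes.

    expected_fields defaults to 19, the number of columns listed in the
    v.in.ascii columns= definition (an assumption about the input file).
    """
    raw = raw.rstrip(b"\r\n")
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    fields = raw.split(b"\t")
    return {
        "bytes": len(raw),  # length as a byte-oriented reader sees it
        "chars": len(raw.decode("utf-8", errors="replace")),
        "fields": len(fields),
        "fields_ok": len(fields) == expected_fields,
        "valid_utf8": valid_utf8,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_line.py /tmp/IT_example.csv
    with open(sys.argv[1], "rb") as f:
        for number, line in enumerate(f, start=1):
            print(number, inspect_line(line))
```

If the field count and the UTF-8 check both come out fine, the input file
itself is probably not at fault, and the byte length of the record may be
the more interesting number to compare against whatever limits the reader
has.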

-----------
For a full example, just use the new scripts/v.in.geonames from SVN
(just fixed; the "tr" filter magic, which filtered away all non-Latin
characters, is now deactivated).
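
To illustrate why such a filter loses the non-Latin names: the exact "tr"
invocation is not shown in this message, so the sketch below simply assumes
a byte-wise filter that keeps only ASCII bytes (the function name
keep_ascii_bytes is hypothetical). Every UTF-8 multi-byte character
consists solely of bytes >= 0x80, so it disappears completely, while the
tab separators survive.

```python
def keep_ascii_bytes(raw: bytes) -> bytes:
    """Drop every byte >= 0x80, as an ASCII-only byte filter would."""
    return bytes(b for b in raw if b < 0x80)


# Shortened, made-up record: two alternate names and a latitude field.
line = "Torre del Greco,トッレ・デル・グレーコ\t40.78".encode("utf-8")
filtered = keep_ascii_bytes(line)

# The Japanese name is gone; the Latin name and the tab are untouched.
print(filtered.decode("ascii"))
```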

Markus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IT_example.csv.gz
Type: application/x-gzip
Size: 821 bytes
Desc: not available
Url : http://lists.osgeo.org/pipermail/grass-dev/attachments/20080630/fde99bc1/IT_example.csv-0001.gz

