[GRASS-dev] new v.in.geonames: problems with UTF-8 Unicode text
Markus Neteler
neteler at osgeo.org
Mon Jun 30 14:51:40 EDT 2008
On Mon, Jun 30, 2008 at 5:26 PM, Glynn Clements
<glynn at gclements.plus.com> wrote:
> Markus Neteler wrote:
>
>> I am writing v.in.geonames to easily read in data from
>> http://download.geonames.org/export/dump/
>>
>> The script is essentially using v.in.ascii to read in the CSV file encoded
>> in UTF-8 Unicode text. There are placenames in various languages including
>> Japanese.
>> v.in.ascii isn't able to read them properly and fails on such lines, example:
>>
>> 3165456 Torre del Greco Torre del Greco Torre d%27%27o Grieco,Torre
>> d''o Grieco,Torre d''o Grieco,Torre del Greco,トッレ・デル・グレーコ
>> 40.7839209532791 14.3708038330078 P PPL IT
>> 04 NA 063084 90607 72
>> Europe/Rome 2008-06-28
>>
>> (I have slightly improved the v.in.ascii error message, not yet submitted):
>>
>> ERROR: Unparsable latitude value in column <4>: 'o Grieco,Torre d''o
>> Grieco,Torre del Greco,トッレ・デル・グレーコ
>>
>> How to fix this problem?
>
> Can you please provide accurate and sufficient information about the
> problem?

As always, I try.

> I.e. the exact data being fed to v.in.ascii (*before* it has been
> mangled by the various components of the email chain), the v.in.ascii
> command which is failing, etc.

Attached is the original file, reduced to one offending line:

cd /tmp
wget http://download.geonames.org/export/dump/IT.zip
unzip IT.zip
grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv

Import into a LatLong location, replicating the script's functionality:

v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
columns='geonameid integer, name varchar(200), asciiname varchar(200),
alternatename varchar(4000), latitude double precision, longitude
double precision, featureclass varchar(1), featurecode varchar(10),
countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
admin2code varchar(20), admin3code varchar(20), admin4code
varchar(20), population integer, elevation varchar(5), gtopo30
integer, timezone varchar(50), modification date' --o
Scanning input for column types...
Current row: '�,དགའ་དའ་རས,ཨ་ཊ་ལ།,ཨཊ་ལ,იტალია,ጣሊያን,ጣልያን,អតាល,'Itāria,イタリア,イタリア共和国,意大利,ꑴꄊꆺ,이탈리아
42.8333333 12.8333333 A PCLI IT 00
58145000 762 Europe/Rome
2008-03-21'
ERROR: Unparsable latitude value in column <4>: PCLI
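
One way to check whether the input line itself is well-formed, independently of v.in.ascii, is to count its tab-separated fields and measure its length in bytes. The following is only a minimal Python sketch, not part of v.in.geonames or GRASS: the helper names `inspect_line` and `inspect_file` are hypothetical, and the expected count of 19 fields is simply the number of entries in the columns= list above.

```python
# Hypothetical diagnostic, not part of GRASS: inspect lines of a
# tab-separated geonames dump without involving v.in.ascii.

EXPECTED_FIELDS = 19  # assumption: number of entries in the columns= list


def inspect_line(raw: bytes) -> dict:
    """Return basic facts about one raw (undecoded) input line."""
    body = raw.rstrip(b"\r\n")
    fields = body.split(b"\t")
    try:
        body.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    has_coords = len(fields) > 5
    return {
        "n_fields": len(fields),
        "n_bytes": len(body),
        "valid_utf8": valid_utf8,
        # Fields 5 and 6 (1-based) are latitude and longitude,
        # matching y=5 and x=6 in the v.in.ascii call.
        "lat": fields[4].decode("ascii", "replace") if has_coords else None,
        "lon": fields[5].decode("ascii", "replace") if has_coords else None,
    }


def inspect_file(path: str) -> list:
    """Inspect every line of the file at `path`, read as raw bytes."""
    with open(path, "rb") as handle:
        return [inspect_line(line) for line in handle]
```

Calling `inspect_file("/tmp/IT_example.csv")` returns one dictionary per line. If a line reports 19 fields, valid UTF-8 and plausible lat/lon values, the data itself is probably intact and the problem more likely lies in how the line is read; the byte count is included because the alternatename column can be very long.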
-----------

For a full example, just use the new scripts/v.in.geonames from SVN
(I have just fixed it and deactivated the "tr" filter magic, which
had filtered away all non-Latin characters).

Markus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IT_example.csv.gz
Type: application/x-gzip
Size: 821 bytes
Desc: not available
Url : http://lists.osgeo.org/pipermail/grass-dev/attachments/20080630/fde99bc1/IT_example.csv-0001.gz