[GRASS-dev] new v.in.geonames: problems with UTF-8 Unicode text
Glynn Clements
glynn at gclements.plus.com
Mon Jun 30 17:16:05 EDT 2008
Markus Neteler wrote:
> >> I am writing v.in.geonames to easily read in data from
> >> http://download.geonames.org/export/dump/
> >>
> >> The script is essentially using v.in.ascii to read in the CSV file encoded
> >> in UTF-8 Unicode text. There are placenames in various languages including
> >> Japanese.
> >> v.in.ascii isn't able to read them properly and fails on such lines, example:
> >> How to fix this problem?
> >
> > Can you please provide accurate and sufficient information about the
> > problem?
>
> As always, I try.
As a general rule, if you're having problems with input which contains
non-ASCII text, use an attachment. The files appear to be UTF-8, but
your previous email used ISO-2022. They may seem "equivalent" to your
mail program, but they may not be equivalent so far as e.g. v.in.ascii
is concerned.
In this case, I don't think that encodings or non-ASCII characters are
actually the problem. However, if the data had actually been encoded
in ISO-2022, the encoding could have been the cause.
> > I.e. the exact data being fed to v.in.ascii (*before* it has been
> > mangled by the various components of the email chain), the v.in.ascii
> > command which is failing, etc.
>
> Attached the original file reduced to 1 offending line:
> wget http://download.geonames.org/export/dump/IT.zip
> cd /tmp
> unzip IT.zip
> grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv
>
> Import into LatLong location, replicating the script functionality
>
> v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
> columns='geonameid integer, name varchar(200), asciiname varchar(200),
> alternatename varchar(4000), latitude double precision, longitude
> double precision, featureclass varchar(1), featurecode varchar(10),
> countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
> admin2code varchar(20), admin3code varchar(20), admin4code
> varchar(20), population integer, elevation varchar(5), gtopo30
> integer, timezone varchar(50), modification date' --o
> Scanning input for column types...
> ERROR: Unparsable latitude value in column <4>: PCLI
I don't get this particular error, but I do have some other problems.
First, I had to increase the buffer size:
--- vector/v.in.ascii/points.c (revision 31901)
+++ vector/v.in.ascii/points.c (working copy)
@@ -74,7 +74,7 @@
char *coorbuf, *tmp_token, *sav_buf;
int skip = FALSE, skipped = 0;
- buflen = 1000;
+ buflen = 4000;
buf = (char *)G_malloc(buflen);
buf_raw = (char *)G_malloc(buflen);
coorbuf = (char *)G_malloc(256);
Otherwise, the input was truncated in the middle of the list of
translated names. This caused points_analyse[1] to see too few
columns, resulting in:
Scanning input for column types...
Maximum input row length: 999
Maximum number of columns: 11
Minimum number of columns: 4
ERROR: x column number > minimum last column number
(incorrect field separator?)
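To illustrate why truncation produces too few columns, here is a minimal
sketch. It assumes the reader keeps only the first buflen-1 bytes of a
record; columns_seen() is a hypothetical helper written for this example,
not code from v.in.ascii.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration (not v.in.ascii code): count the
 * sep-separated columns that are visible when only the first
 * buflen-1 bytes of `record` fit into the line buffer. */
static int columns_seen(const char *record, size_t buflen, char sep)
{
    size_t i, limit;
    int cols = 1;

    assert(record != NULL && buflen > 0);

    limit = strlen(record);
    if (limit > buflen - 1)
        limit = buflen - 1;     /* everything past this point is lost */

    for (i = 0; i < limit; i++)
        if (record[i] == sep)
            cols++;

    return cols;
}
```

If the fourth field alone is longer than the buffer, the coordinate columns
that follow it are never seen, so the column count for that record stops at
4 regardless of how many fields the line really has.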
With that fixed, v.in.ascii now complains:
Scanning input for column types...
Maximum input row length: 1309
Maximum number of columns: 14
Minimum number of columns: 14
WARNING: Table <test> linked to vector map <test> does not exist
ERROR: Number of columns defined (19) does not match number of columns (14)
This is caused by G_tokenize() skipping leading whitespace, including
tabs, even when the separator is a tab. Consequently, a run of
consecutive blank fields is interpreted as a single blank field.
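The difference can be sketched in plain C. This is not the G_tokenize()
code or the committed fix; split_keep_empty() and count_strtok() are
hypothetical helpers that only contrast the two behaviours, with strtok()
standing in for a tokenizer that collapses runs of separators.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not G_tokenize()): split `line` in place on the
 * single character `sep`, keeping empty fields. Stores at most `max`
 * field pointers in `fields` and returns the number of fields found. */
static int split_keep_empty(char *line, char sep, char **fields, int max)
{
    int n = 0;
    char *p = line;

    assert(line != NULL && fields != NULL);

    for (;;) {
        char *q = strchr(p, sep);

        if (n < max)
            fields[n] = p;
        n++;
        if (q == NULL)
            break;
        *q = '\0';      /* terminate this field */
        p = q + 1;      /* next field starts right after the separator */
    }
    return n;
}

/* For contrast: strtok() treats a run of separators as one separator,
 * so consecutive empty fields disappear from the count. */
static int count_strtok(char *line, const char *sep)
{
    int n = 0;
    char *t;

    for (t = strtok(line, sep); t != NULL; t = strtok(NULL, sep))
        n++;
    return n;
}
```

For the input "IT\t\t\t\t\t0", split_keep_empty() reports 6 fields, four of
them empty, while count_strtok() reports only 2. The second result is the
kind of mismatch behind the "19 defined vs. 14 found" error above.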
After fixing that, I get:
Scanning input for column types...
Maximum input row length: 1309
Maximum number of columns: 19
Minimum number of columns: 19
WARNING: Column number 11 <admin1code> defined as string has only integer
values
Importing points...
Segmentation fault (core dumped)
This is caused by overflowing another 1000-byte buffer in
points_to_bin():
--- vector/v.in.ascii/points.c (revision 31911)
+++ vector/v.in.ascii/points.c (working copy)
@@ -269,7 +269,7 @@
int *coltype, int xcol, int ycol, int zcol, int catcol,
int skip_lines)
{
- char *buf, buf2[1000];
+ char *buf, buf2[4000];
int cat = 0;
int row = 1;
struct line_pnts *Points;
After that change, the file appears to import without any problems.
I have committed a fix to G_tokenize(), and also enlarged the buffers
in v.in.ascii to 4000 bytes (although removing fixed limits altogether
would be better).
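As a rough idea of what removing the fixed limit could look like, here is a
minimal sketch of a line reader that grows its buffer on demand. It uses
plain realloc() and fgets() rather than the GRASS allocation and line-reading
routines, and read_line_grow() is a hypothetical name; it is not what was
committed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read one line of any length from `fp` into *buf,
 * growing the buffer with realloc() as needed. *buf / *cap must start as
 * NULL / 0 or hold the result of a previous call. Returns the line
 * length without the newline, or -1 at end of file (nothing read) or
 * on allocation failure. */
static long read_line_grow(FILE *fp, char **buf, size_t *cap)
{
    size_t len = 0;
    char *tmp;

    assert(fp != NULL && buf != NULL && cap != NULL);

    if (*buf == NULL || *cap < 2) {
        tmp = realloc(*buf, 128);
        if (tmp == NULL)
            return -1;
        *buf = tmp;
        *cap = 128;
    }

    while (fgets(*buf + len, (int)(*cap - len), fp) != NULL) {
        len += strlen(*buf + len);

        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';       /* complete line: strip newline */
            return (long)len;
        }
        if (len + 1 < *cap)
            return (long)len;           /* last line, no trailing newline */

        /* Buffer is full and the line continues: double the capacity. */
        tmp = realloc(*buf, *cap * 2);
        if (tmp == NULL)
            return -1;
        *buf = tmp;
        *cap *= 2;
    }

    return len > 0 ? (long)len : -1;
}
```

The caller keeps one buffer for the whole file and frees it once at the
end, so the cost of growing is paid only for the longest lines.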
[1] BTW, don't we normally use US-English spellings, i.e. "analyze"
instead of "analyse"?
--
Glynn Clements <glynn at gclements.plus.com>