[GRASS-dev] new v.in.geonames: problems with UTF-8 Unicode text

Mon Jun 30 17:16:05 EDT 2008

Markus Neteler wrote:

> >> I am writing v.in.geonames to easily read in data from
> >> http://download.geonames.org/export/dump/
> >>
> >> The script is essentially using v.in.ascii to read in the CSV file encoded
> >> in UTF-8 Unicode text. There are placenames in various languages including
> >> Japanese.
> >> v.in.ascii isn't able to read them properly and fails on such lines, example:

> >> How to fix this problem?
> >
> > Can you please provide accurate and sufficient information about the
> > problem?
> 
> As always, I try.

As a general rule, if you're having problems with input which contains
non-ASCII text, use an attachment. The files appear to be UTF-8, but
your previous email used ISO-2022. They may seem "equivalent" to your
mail program, but they may not be equivalent so far as e.g. v.in.ascii
is concerned.

In this case, I don't think that encodings or non-ASCII characters are
actually the problem. However, if the data had actually been encoded
in ISO-2022, it could have been related.

> > I.e. the exact data being fed to v.in.ascii (*before* it has been
> > mangled by the various components of the email chain), the v.in.ascii
> > command which is failing, etc.
> 
> Attached the original file reduced to 1 offending line:
> wget http://download.geonames.org/export/dump/IT.zip
> cd /tmp
> unzip IT.zip
> grep 'Italian Republic' /tmp/IT.txt > /tmp/IT_example.csv
> 
> Import into LatLong location, replicating the script functionality
> 
> v.in.ascii cat=0 x=6 y=5 fs=tab in=/tmp/IT_example.csv out=test
> columns='geonameid integer, name varchar(200), asciiname varchar(200),
> alternatename varchar(4000), latitude double precision, longitude
> double precision, featureclass varchar(1), featurecode varchar(10),
> countrycode varchar(2), cc2 varchar(60), admin1code varchar(20),
> admin2code varchar(20), admin3code varchar(20), admin4code
> varchar(20), population integer, elevation varchar(5), gtopo30
> integer, timezone varchar(50), modification date' --o
> Scanning input for column types...

> ERROR: Unparsable latitude value in column <4>: PCLI

I don't get this particular error, but I do have some other problems.

First, I had to increase the buffer size:

--- vector/v.in.ascii/points.c	(revision 31901)
+++ vector/v.in.ascii/points.c	(working copy)
@@ -74,7 +74,7 @@
     char *coorbuf, *tmp_token, *sav_buf;
     int skip = FALSE, skipped = 0;
 
-    buflen = 1000;
+    buflen = 4000;
     buf = (char *)G_malloc(buflen);
     buf_raw = (char *)G_malloc(buflen);
     coorbuf = (char *)G_malloc(256);

Otherwise, the input was truncated in the middle of the list of
translated names. This caused points_analyse[1] to see too few
columns, resulting in:

	Scanning input for column types...
	Maximum input row length: 999
	Maximum number of columns: 11
	Minimum number of columns: 4
	ERROR: x column number > minimum last column number
	       (incorrect field separator?)

With that fixed, it now complains:

	Scanning input for column types...
	Maximum input row length: 1309
	Maximum number of columns: 14
	Minimum number of columns: 14
	WARNING: Table <test> linked to vector map <test> does not exist
	ERROR: Number of columns defined (19) does not match number of columns (14)

This is caused by G_tokenize() skipping leading whitespace, including
tabs, even when the separator is a tab. Consequently, a run of
consecutive blank fields is interpreted as a single blank field.

After fixing that, I get:

	Scanning input for column types...
	Maximum input row length: 1309
	Maximum number of columns: 19
	Minimum number of columns: 19
	WARNING: Column number 11 <admin1code> defined as string has only integer
	         values
	Importing points...
	Segmentation fault (core dumped)

This is caused by overflowing another 1000-byte buffer in
points_to_bin():

--- vector/v.in.ascii/points.c	(revision 31911)
+++ vector/v.in.ascii/points.c	(working copy)
@@ -269,7 +269,7 @@
 		  int *coltype, int xcol, int ycol, int zcol, int catcol,
 		  int skip_lines)
 {
-    char *buf, buf2[1000];
+    char *buf, buf2[4000];
     int cat = 0;
     int row = 1;
     struct line_pnts *Points;

After that, the file appears to import without any problems.

I have committed a fix to G_tokenize(), and also enlarged the buffers
in v.in.ascii to 4000 bytes (although removing fixed limits altogether
would be better).
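
As a rough idea of what removing the fixed limits could look like,
here is a sketch of a line reader that grows its buffer on demand
using only standard C (fgets() plus realloc()). This is not code from
v.in.ascii; read_line() and its interface are assumptions made for
the example, and a real change would presumably use G_realloc() and
the existing GRASS line-reading routines instead.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the GRASS API.
 *
 * Reads one line of any length from `fp` into *buf, growing the
 * buffer with realloc() whenever it fills up. *buf may be NULL on the
 * first call; the caller frees it when done. The trailing newline is
 * stripped.
 *
 * Returns the line length, or -1 at end of file (nothing read) or on
 * allocation failure.
 */
static long read_line(FILE *fp, char **buf, size_t *cap)
{
    size_t len = 0;

    assert(fp != NULL && buf != NULL && cap != NULL);

    if (*buf == NULL || *cap < 2) {
        char *nbuf = realloc(*buf, 128);

        if (nbuf == NULL)
            return -1;
        *buf = nbuf;
        *cap = 128;
    }

    while (fgets(*buf + len, (int)(*cap - len), fp) != NULL) {
        len += strlen(*buf + len);

        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';       /* complete line: drop the newline */
            return (long)len;
        }

        if (len + 1 >= *cap) {          /* buffer full: double it and go on */
            char *nbuf = realloc(*buf, *cap * 2);

            if (nbuf == NULL)
                return -1;
            *buf = nbuf;
            *cap *= 2;
        }
        /* otherwise the file ended without a newline; the next
         * fgets() returns NULL and leaves the buffer untouched */
    }

    return len > 0 ? (long)len : -1;
}
```

With something along these lines, a 1309-byte row needs no special
treatment, and the caller can reuse the same buffer for every row.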

[1] BTW, don't we normally use US-English spellings, i.e. "analyze"
instead of "analyse"?

-- 
Glynn Clements <glynn at gclements.plus.com>