[GRASSLIST:9364] Re: More v.in.ascii problems

Roger Bivand Roger.Bivand at nhh.no
Wed Dec 7 10:29:24 EST 2005


On Wed, 7 Dec 2005, Patton, Eric wrote:

> I have a 945MB text file that contains x,y,z, and cats. I run the following:
> 
>  
> v.in.ascii -zt input=SiteB_0.5m_backscatter.txt output=TEST_backscatter x=3 y=2 z=4 cat=1 fs=' '
> 
> And receive the following error:
> 
> Maximum input row length: 34
> Maximum number of columns: 4
> Minimum number of columns: 4
> Building topology ...
> Registering lines:       6 [main] v.in.ascii 3388 fixup_mmaps_after_fork:
> WARNING: VirtualProtectEx to return to previous state in parent failed for
> MAP_PRIVATE address 
> 0x5BF0000, Win32 error 87
>  113738 [main] v.in.ascii 3388 fixup_mmaps_after_fork: WARNING:
> VirtualProtect to copy protection to child failed for MAP_PRIVATE address
> 0x5BF0000, Win32 error 487
>  212186 [main] v.in.ascii 3388 fixup_mmaps_after_fork: ReadProcessMemory
> (2nd try) failed for MAP_PRIVATE address 0x5BF0000, Win32 error 487
> C:\cygwin\usr\local\grass6.1.cvs\bin\v.in.ascii (3388): ***
> recreate_mmaps_after_fork_failed
>      76 [main] v.in.ascii 2884 fork_parent: child 3388 died waiting for dll
> loading
> 45286676 [main] v.in.ascii 1364 fixup_mmaps_after_fork: WARNING:
> VirtualProtect to copy protection to child failed for MAP_PRIVATE address
> 0x5BF0000, Win32 error 487
> 45347289 [main] v.in.ascii 1364 fixup_mmaps_after_fork: ReadProcessMemory
> (2nd try) failed for MAP_PRIVATE address 0x5BF0000, Win32 error 487
> C:\cygwin\usr\local\grass6.1.cvs\bin\v.in.ascii (1364): ***
> recreate_mmaps_after_fork_failed
> 47040634 [main] v.in.ascii 2884 fork_parent: child 1364 died waiting for dll
> loading
> ERROR: G_realloc: out of memory
> 
> Would the -b flag mentioned by Roger alleviate this problem? I'm working on
> Cygwin/XP with 6.1cvs (Sept2). I do have an Ubuntu Breezy installation up and
> running, but I can't use the latest 6.1 cvs on it until I get my tk and tcl
> links sorted out.

In principle, yes: with -b the topology is not built, so the command exits 
before you see the meltdown. The cat column is being thrown away by -t (as 
far as I understand), as the database table is not being written. I'd 
expect the coords file to be about the same size as the input file, 
roughly 30M points. The -b flag is only in very recent CVS; your 2 Sept. 
build predates it, so you'd need a more recent build to try it. 
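
The 30M figure can be checked with a back-of-envelope calculation. A minimal
sketch, assuming that the reported maximum row length of 34 excludes the
newline and that 945MB means 945 x 2^20 bytes. The function names are made up
for illustration, and storing each point as three 8-byte doubles is an
assumption that ignores any per-feature overhead in the real file:

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound on the number of rows in a text file: every row holds at
 * most max_row_len characters plus one newline, so there are at least
 * file_bytes / (max_row_len + 1) rows (integer division rounds down). */
static uint64_t min_rows(uint64_t file_bytes, uint64_t max_row_len)
{
    assert(max_row_len < UINT64_MAX);   /* avoid overflow in the +1 */
    return file_bytes / (max_row_len + 1);
}

/* Rough size in bytes of n_points 3D points, ASSUMING three 8-byte
 * doubles per point and no other per-point overhead. */
static uint64_t raw_xyz_bytes(uint64_t n_points)
{
    return n_points * 3u * sizeof(double);
}
```

With 945 x 2^20 bytes and a 34-character maximum row, min_rows() gives a
lower bound of about 28.3 million rows, which is consistent with the rough
30M estimate.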

Roger

> 
> ~ Eric.
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> Eric Patton
>  
> Technologist, Geo-Spatial Data Services
> Geological Survey of Canada (Atlantic)
> Natural Resources Canada
> Bedford Institute of Oceanography
> Dartmouth, Nova Scotia, Canada B2Y 4A2
>  
> Postal address: P.O. Box 1006
> Courier address: 1 Challenger Drive
>  
> Telephone: (902)426-7732
> Facsimile:  (902)426-4104
> E-mail:       epatton at NRCan.gc.ca
>  
> 
> -----Original Message-----
> From: owner-GRASSLIST at baylor.edu [mailto:owner-GRASSLIST at baylor.edu] On
> Behalf Of Roger Bivand
> Sent: Wednesday, December 07, 2005 10:04 AM
> To: Hamish
> Cc: jgomezdans at gmail.com; GRASSLIST at baylor.edu; grass5 at grass.itc.it
> Subject: [GRASSLIST:9358] Re: v.in.ascii problems
> 
> On Thu, 8 Dec 2005, Hamish wrote:
> 
> > > There is an on-going discussion about this on the GRASS development 
> > > > list. From a simple test I ran last night, v.in.ascii -b (the -b 
> > > flag is new in GRASS 6.1 CVS) does not build topology, and this 
> > > removes one of the two humps in memory consumption. The other hump 
> > > (> 200MB for a 1M point file with a single attribute) was associated 
> > > with writing the dbf file (the file is 60MB), and is where things 
> > > stick now. In addition, the -b flag leaves the vector data set at 
> > > level 1 topology (absent), and almost all vector commands need level 2.
> > > 
> > > > I do not know whether the use of a different database driver than 
> > > > the default would help. The dbf writing stage precedes the topology 
> > > > building, so the two memory-intensive humps are separate, with 
> > > > topology being a little larger. Reading 1M points on a 1.5GHz P4 
> > > > with topology took about 7 minutes; without, about half that time.
> > 
> > 
> > Use the -z and -t flags to avoid making the table. (and the z= option) 
> > If the input is just x,y,z data there is no need for a table.
> 
> For me in an x-y location my data with 1M points and -zbt now read in just
> under a minute; lidaratm2.txt in effectively the same time (64 rather than
> 57 seconds, z here is double not int) and v.in.ascii stays at a respectable
> 3.3MB size. d.vect works, but as you say prints a warning.
> 
> > 
> > At minimum, we need v.info, v.surf.rst, v.univar, v.out.ascii (points) 
> > and some sort of subsampling module (ie s.cellstats port) working with 
> > this data. d.vect works already (with a warning). maybe v.surf.idw too.
> > 
> > Probably not many more modules though? -- I think if Radim doesn't 
> > want this to be common-place use of the vector model then it probably 
> > shouldn't be. He knows it better than anyone.. So for now massive 
> > point datasets need to be treated as a special case to the vector 
> > model & only a work-around solution.
> > 
> > 
> > e.g. with the sample LIDAR data (GRASS downloads page)
> > 
> > G61> v.in.ascii -zbt in=lidaratm2.txt out=lidaratm2 x=1 y=2 z=3 fs=,
> > 
> > The first 250k points take about 20 seconds to load.
> > 
> > 
> > If I use the full million it gets stuck on the scanning step:
> > 
> > D3/3: row 374430 : 28 chars
> > 
> > Interesting, that line is the second value with elevation > 100.
> > 
> > changing the first z value to 500.054 it segfaults pretty quick:
> > 
> > D5/5: Vect_hist_write()
> > D4/5: G_getl2: ->-75.622346,35.949693,500.054<-
> > D3/5: row 1 : 28 chars
> > D4/5: token: -75.622346
> > D4/5: is_latlong north: -75.622346
> > D4/5: row 1 col 0: '-75.622346' is_int = 0 is_double = 1
> > D4/5: token: 35.949693
> > D4/5: is_latlong north: 35.949693
> > D4/5: row 1 col 1: '35.949693' is_int = 0 is_double = 1
> > D4/5: row 1 col 2: '500.054' is_int = 0 is_double = 1
> > D4/5: G_getl2: ->-75.629469,35.949693,11.962<-
> > D3/5: row 2 : 27 chars
> > D4/5: token: -75.629469
> > D4/5: is_latlong north: -75.629469
> > > D4/5: row 2 col 0: 'H629469' is_int = 0 is_double = 0
> > > Segmentation fault
> > 
> > where is 'H629469' coming from?
> 
> I was also seeing seg-faults with my data in a long-lat location, so
> switched to x-y (current CVS 6.1).
> 
> Roger
> 
> > 
> > v.in.ascii/points.c
> >  tmp_token is getting corrupted, cascades from there
> > 
> > int points_analyse (){
> > ...
> >     char **tokens;
> > ...
> >     tmp_token=(char *) G_malloc(256);
> > ...
> >     while (1) {
> > ...
> >         tokens = G_tokenize (buf, fs); ...
> >         for ( i = 0; i < ntokens; i++ ) { ...
> > [*]                 sprintf(tmp_token, "%f", northing);
> > ...
> > 		    /* replace current DMS token by decimal degree */
> >                     tokens[i]=tmp_token;
> > 
> > BOOM. pointer abuse. (bug is new lat/lon scanning code, only in 
> > 6.1CVS)
> > 
> > > [*] and if the northing column is longer than 256 without hitting the fs, 
> > >    buffer overflow??  add an int maxlength parameter to G_tokenize()?
> > >    or can %f never be more than 256 bytes long?
> > >    is the same %f effectively cutting down the precision of lat/lon
> > >    coords to 6 places after the decimal point? (though that is pretty
> > >    small on the ground)
> > 
> > 
> > improvements come one bug at a time...
> > 
> > Hamish
> > 
> 
> --
> Roger Bivand
> Economic Geography Section, Department of Economics, Norwegian School of
> Economics and Business Administration, Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; fax +47 55 95 95 43
> e-mail: Roger.Bivand at nhh.no
> 

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no



