[GRASSLIST:1045] Re: openstreetmap data import into GRASS

Hamish hamish_nospam at yahoo.com
Tue May 9 02:35:02 EDT 2006


> > problem:
> > 
> > ERRORCODE=0
> > while [ $ERRORCODE -eq 0 ] ; do
> >   read LINE
> >   ERRORCODE=$?
> >   test $ERRORCODE -ne 0 && continue
> >   if echo $LINE | grep -q ... ; then
> >     do_minimal_stuff
> >   fi
> > done < 100mb_file.txt
> > 
> > bash takes 800mb ram (that's ok, I've got lots, no swapping) but
> > runs *incredibly* slowly. Like 80486-SX slowly.
> > 
> > why is that? What's a better way of working through lines containing
> > spaces? Set the file as a fd to pass to `read` instead of via
> > redirect?
> 
> Spawning one grep process per line isn't particularly efficient.

No, it isn't. I'm doing this in a shell script for prototyping purposes.
So is creating/destroying processes the bottleneck?

top reports:
Cpu(s):   0.0% user,  82.7% system,  17.3% nice,   0.0% idle
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1949 hamish    17  18 89144  87m  86m S 13.7  8.6  83:50.53 bash
31434 hamish    19  18 89144  87m  86m R  0.3  8.6   0:00.01 bash

(currently running on a 41mb input file)
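
A rough way to check the per-process cost by itself would be to time a
builtins-only loop against the same loop spawning one pipeline per
iteration (untested sketch):

# ~1000 iterations using only shell builtins (no forks) ...
time ( i=0 ; while [ $i -lt 1000 ] ; do i=$((i + 1)) ; done )

# ... versus the same thing with one echo|grep pipeline per iteration
time ( i=0 ; while [ $i -lt 1000 ] ; do
          echo "$i" | grep -q "$i"
          i=$((i + 1))
       done )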


> What is that test meant to do?

The `echo $LINE | grep` test is really a placeholder for more than that :)
e.g. grepping the XML for the start of a <segment> record, or using cut
instead of grep to grab a value out of the line.
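
I suppose the "is this the start of a <segment> record" test and the cut
could be done with bash built-ins instead of spawning processes, roughly
(untested sketch; the '<segment' pattern and the quoted-value grab are
just placeholders for the real tests):

while read LINE ; do
   case "$LINE" in
      *'<segment'*)
         VAL=${LINE#*\"}     # drop everything up to the first "
         VAL=${VAL%%\"*}     # drop everything from the next " onwards
         do_minimal_stuff "$VAL"
         ;;
   esac
done < 100mb_file.txt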


an example:

#### create segment table fully populated with coordinates
num_segs=`wc -l planet_seg.dat | cut -f1 -d' '`
i=1
echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat

for LINE in `cat planet_seg.dat` ; do
   SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
   FROM_NODE=`echo "$LINE" | cut -f2 -d'|'`
   TO_NODE=`echo "$LINE" | cut -f3 -d'|'`
#   echo "seg $SEG_ID from $FROM_NODE   to $TO_NODE"
   # look up each node's coordinates in planet_pt.dat (one grep scan per lookup)
   FROM_COORD=`grep "^${FROM_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
   TO_COORD=`grep "^${TO_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`

   # progress report every 1000 segments
   if [ 0 -eq `echo $i | awk '{print $1 % 1000}'` ] ; then
      echo "seg $i of $num_segs"
   fi
   echo "$SEG_ID|$FROM_COORD|$TO_COORD" >> planet_seg_lines.dat

   i=`expr $i + 1`
done

(num_segs is ~500k)
Yes, the FROM_COORD/TO_COORD greps take a little time, but the loop still
runs slowly even if I skip them.

I was trying to avoid using awk*, but most of that loop could be done by
a single awk process I guess. Would perl or python be a better solution?
Or will nothing approach C for speed?

[*] A large part of this exercise is to demonstrate the method. Awk's
clarity/readability is so dismal that I might as well do it in C in that
case.
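
For the record, the whole loop above as a single awk process would look
roughly like this (untested sketch; it assumes planet_pt.dat is node|x|y
with unique node IDs, which is what the greps imply):

#### same join done by one awk process: load planet_pt.dat into an
#### array, then stream planet_seg.dat through it
echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat
awk -F'|' -v OFS='|' '
   NR == FNR { coord[$1] = $2 OFS $3 ; next }   # planet_pt.dat: node -> "x|y"
   { print $1, coord[$2], coord[$3] }           # planet_seg.dat: seg|from|to
' planet_pt.dat planet_seg.dat >> planet_seg_lines.dat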


I had the suggestion to change things to:
  while read LINE ; do
    ...
  done < inputfile.txt

instead of
  for LINE in `cat inputfile.txt` ; do
    ...
  done

so as not to load the entire 100mb file into memory, but I get the same
memory footprint and the same slow result that way.

The speed does seem to be roughly inversely proportional to the input file
size, so I wonder if that could be the problem, even if the above fix isn't
the right one.
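
(For the "lines containing spaces" part of the original question, I gather
the usual idiom is to stop read from word-splitting and from eating
backslashes, i.e.:

# read whole lines verbatim: -r keeps backslashes literal, and an empty
# IFS stops leading/trailing whitespace from being stripped
while IFS= read -r LINE ; do
   ...
done < inputfile.txt

though that is about correctness rather than speed.)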

I am talking about 1500 iterations of the "for" loop per minute, i.e.
roughly 40ms per iteration. That is fairly slow for a P4 2.8GHz with no
swapping and a 2.4.27 Debian kernel ... I wouldn't have thought it could
be _that_ bad.


thanks,
Hamish



