[GRASSLIST:1045] Re: openstreetmap data import into GRASS
Hamish
hamish_nospam at yahoo.com
Tue May 9 02:35:02 EDT 2006
> > problem:
> >
> > while [ $ERRORCODE -eq 0 ] ; do
> > read LINE
> > ERRORCODE=$?
> > test $ERRORCODE && continue
> > if `echo $LINE | grep` ; then
> > do_minimal_stuff()
> > fi
> > done < 100mb_file.txt
> >
> > bash takes 800mb ram (that's ok, I've got lots, no swapping) but
> > runs *incredibly* slowly. Like 80486-SX slowly.
> >
> > why is that? What's a better way of working through lines containing
> > spaces? Set the file as a fd to pass to `read` instead of via
> > redirect?
>
> Spawning one grep process per line isn't particularly efficient.
No, it isn't. I'm doing this in a shell script for prototyping purposes.
So is creating/destroying processes the bottleneck?
top reports:
Cpu(s): 0.0% user, 82.7% system, 17.3% nice, 0.0% idle
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1949 hamish 17 18 89144 87m 86m S 13.7 8.6 83:50.53 bash
31434 hamish 19 18 89144 87m 86m R 0.3 8.6 0:00.01 bash
(currently running on a 41mb input file)
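If fork/exec per line really is the killer, then in theory the substring
test itself could be done with bash's built-in pattern matching instead
of spawning `echo | grep` for every line, e.g. (untested sketch; the
"<segment" pattern and do_minimal_stuff are just the placeholders from
above):

  while read LINE ; do
      case "$LINE" in
          *"<segment"*) do_minimal_stuff ;;   # no external process spawned
      esac
  done < 100mb_file.txt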
> What is that test meant to do?
the `echo $LINE | grep` test is really something more involved than that :)
e.g. grepping the XML for the start of a <segment> record, or
using cut instead of grep to grab a value from the LINE.
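(For the "grab a value" case, plain parameter expansion can also pull a
pipe-delimited field out of $LINE without forking anything, e.g. for the
second field -- sketch only, REST/FIELD2 are made-up names:

  REST="${LINE#*|}"        # strip the first field and its '|'
  FIELD2="${REST%%|*}"     # keep everything up to the next '|'
)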
an example:
#### create segment table fully populated with coordinates
num_segs=`wc -l planet_seg.dat | cut -f1 -d' '`
i=1
echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat
for LINE in `cat planet_seg.dat` ; do
   # split the segment|from_node|to_node line into its fields
   SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
   FROM_NODE=`echo "$LINE" | cut -f2 -d'|'`
   TO_NODE=`echo "$LINE" | cut -f3 -d'|'`
   # echo "seg $SEG_ID from $FROM_NODE to $TO_NODE"

   # look up the coordinates of each end node in the points table
   FROM_COORD=`grep "^${FROM_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
   TO_COORD=`grep "^${TO_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`

   # progress report every 1000 segments
   if [ 0 -eq `echo $i | awk '{print $1 % 1000}'` ] ; then
      echo "seg $i of $num_segs"
   fi

   echo "$SEG_ID|$FROM_COORD|$TO_COORD" >> planet_seg_lines.dat
   i=`expr $i + 1`
done
(num_segs ~500k)
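At minimum, the three `echo | cut` calls per line could be replaced by
letting read do the '|' splitting itself, something like (sketch,
untested on the real data):

  while IFS='|' read SEG_ID FROM_NODE TO_NODE ; do
      ...
  done < planet_seg.dat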
Yes, the FROM_COORD/TO_COORD greps take a little time, but the loop
still runs slowly even if I skip them.
I was trying to avoid using awk*, but most of that loop could be done by
a single awk process I guess. Would perl or python be a better solution?
Or will nothing approach C for speed?
[*] A large part of this exercise is to demonstrate the method. Awk's
clarity/readability is so dismal that I might as well do it in C in that
case.
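For the record, the single-awk-process version of that loop would look
something like this (untested sketch; it assumes planet_pt.dat lines are
node|x|y and planet_seg.dat lines are segment|from_node|to_node, i.e.
the same fields the cuts above pull out):

  awk -F'|' '
      BEGIN { print "segment|x1|y1|x2|y2" }
      NR == FNR { coord[$1] = $2 "|" $3 ; next }   # 1st file: node -> "x|y"
      { print $1 "|" coord[$2] "|" coord[$3] }     # 2nd file: look up both ends
  ' planet_pt.dat planet_seg.dat > planet_seg_lines.dat

i.e. one process does the whole join instead of a handful of
grep/cut/expr processes per segment.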
I had the suggestion to change things to:
while read LINE ; do
...
done < inputfile.txt
instead of
for LINE in `cat inputfile.txt` ; do
...
done
in order not to load the entire 100mb file into memory, but I get the
same memory footprint and the same slow result that way.
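The fd variant mentioned at the top would presumably look something like
this (sketch; I haven't checked whether bash behaves any differently
with it):

  exec 3< inputfile.txt       # open the file on fd 3
  while read -u 3 LINE ; do
      ...
  done
  exec 3<&-                   # close fd 3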
The speed does seem to be roughly inversely proportional to the input
file size, so I wonder if that could be the problem, even if the above
fix isn't the right one.
I am talking about something like 1500 iterations of the "for" loop per
minute. That is fairly slow for a P4 2.8GHz with no swapping and a
2.4.27 Debian kernel ... I wouldn't have thought it could be _that_ bad.
thanks,
Hamish