[GRASSLIST:1046] Re: openstreetmap data import into GRASS

Glynn Clements glynn at gclements.plus.com
Tue May 9 04:47:27 EDT 2006


Hamish wrote:

> > > problem:
> > > 
> > > while [ $ERRORCODE -eq 0 ] ; do
> > >   read LINE
> > >   ERRORCODE=$?
> > >   [ $ERRORCODE -ne 0 ] && continue
> > >   if echo "$LINE" | grep -q 'pattern' ; then
> > >     do_minimal_stuff
> > >   fi
> > > done < 100mb_file.txt
> > > 
> > > bash takes 800mb ram (that's ok, I've got lots, no swapping) but
> > > runs *incredibly* slowly. Like 80486-SX slowly.
> > > 
> > > why is that? What's a better way of working through lines containing
> > > spaces? Set the file as a fd to pass to `read` instead of via
> > > redirect?
> > 
> > Spawning one grep process per line isn't particularly efficient.
> 
> No it isn't. Doing this in a shell script for prototyping purposes.
> So creating/destroying processes is the bottleneck?

Either that or do_minimal_stuff(), depending upon what the latter
does.

> > What is that test meant to do?
> 
> the `echo $LINE | grep` test is really more than that :)
> e.g. grepping the XML for the start of a <segment> record, or
> using cut instead of grep to grab a value from the LINE.
> 
> 
> an example:
> 
> #### create segment table fully populated with coordinates
> num_segs=`wc -l planet_seg.dat | cut -f1 -d' '`
> i=1
> echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat
> 
> for LINE in `cat planet_seg.dat` ; do
>    SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
>    FROM_NODE=`echo "$LINE" | cut -f2 -d'|'`
>    TO_NODE=`echo "$LINE" | cut -f3 -d'|'`
> #   echo "seg $SEG_ID from $FROM_NODE   to $TO_NODE"
>    FROM_COORD=`grep "^${FROM_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
>    TO_COORD=`grep "^${TO_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
> 
>    if [ 0 -eq `echo $i | awk '{print $1 % 1000}'` ] ; then
>       echo "seg $i of $num_segs"
>    fi
>    echo "$SEG_ID|$FROM_COORD|$TO_COORD" >> planet_seg_lines.dat
> 
>    i=`expr $i + 1`
> done

Ah. That's a different matter. I took "echo $LINE | grep" to imply
that you were feeding just the line itself to grep. If you're
grepping an entire other file for every line of input, you may have
bigger problems than the command-spawning overhead alone.

And the command-spawning overhead is pretty bad: 5 invocations of
cut, 2 of grep, plus an awk and an expr, per line of input; over
~500k lines, that adds up to several million processes.

[I dread to think how slow that would be on Cygwin, where spawning
commands is anything up to 100 times slower than Linux (or another
"real" Unix) on the same hardware.]

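To see how much of that is pure spawning overhead, time the same job
done once per line and once for the whole file. A throwaway sketch
(untested, and the test file here is made up):

	# generate a 10000-line pipe-delimited test file
	seq 10000 | sed 's/.*/&|x|y/' > /tmp/spawn_test.dat

	# one subshell-plus-cut pipeline per line:
	time while read -r LINE ; do
	    echo "$LINE" | cut -f1 -d'|' > /dev/null
	done < /tmp/spawn_test.dat

	# one cut for the whole file:
	time cut -f1 -d'|' /tmp/spawn_test.dat > /dev/null

The gap between the two is essentially all fork/exec cost.
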
> (num_segs ~500k)
> Yes, the FROM_COORD,TO_COORD greps take a little time but the loop still
> runs slow if I skip past them.

Ok, so the command-spawning overhead probably is significant.

> I was trying to avoid using awk*, but most of that loop could be done by
> a single awk process I guess. Would perl or python be a better solution?
> Or will nothing approach C for speed?

Anything which can do everything in a single process will be faster
than spawning multiple external commands per line of input.

Simple string operations equivalent to the cut or grep commands
require a handful of CPU cycles per byte. Spawning a command means
opening the executable, setting up its memory segments, mapping the
shared libraries which it uses, undoing all that on termination, etc.

Also, grep has to compile the regex into a state table every time it
runs; with the C regex functions, or a language with built-in regex
support, you would normally compile the regex only once. (The same
applies to a literal string; the same search machinery is used.)

Even an interpreted language would be very much faster than that. C
might not be all that much faster. Or even any faster; if the
processing is simple, awk/perl/etc may be able to process the data as
fast as the disk can supply it.
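
For example, the three per-line cut calls in your loop reduce to a
single awk process over the whole file, with the field separator set
up just once at startup (a sketch, equivalent to your commented-out
debugging echo; untested):

	awk -F'|' '{ print "seg " $1 " from " $2 "   to " $3 }' planet_seg.dat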

> [*] A large part of this exercise is to demonstrate the method. Awk's
> clarity/readability is so dismal that I might as well do it in C in that
> case.

Hmm. Pure C can be pretty dismal for string handling; awk has the
advantage that strings are first-class values, i.e. you don't have to
worry about allocation and the like.

It may be that the awk script could be written better. OTOH, a C
program would probably either leave most of the work to an XML library
or use lex/yacc for the parsing.

Obviously, any approach which doesn't use a dedicated XML library is
going to be assuming a very restricted subset of XML (e.g. no newlines
in awkward places). There's no way that you can parse arbitrary XML
with a simple script.
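
If you can live with that restriction, here is a sketch of the sort
of thing I mean (untested; the file name and the attribute layout
are assumptions on my part):

	# assumes one <segment .../> element per line, with
	# double-quoted from=/to= attributes
	awk '/<segment / {
	    if (match($0, /from="[0-9]*"/))
	        from = substr($0, RSTART + 6, RLENGTH - 7)
	    if (match($0, /to="[0-9]*"/))
	        to = substr($0, RSTART + 4, RLENGTH - 5)
	    print from "|" to
	}' planet.osm

The moment a tag wraps onto a second line, that breaks silently.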

> I had the suggestion to change things to:
>   while read LINE ; do
>     ...
>   done < inputfile.txt
> 
> instead of
>   for LINE in `cat inputfile.txt` ; do
>     ...
>   done
> 
> in order to not load the entire 100mb file into memory, but I get the
> same memory footprint, same slow result that way.

I hadn't even noticed that; yes, processing files should be done with
"while read ...", not a for loop.

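And while you're at it, read can do the field splitting itself,
which disposes of the three per-line cut calls as well (a sketch,
untested):

	# read splits on '|' itself; no cut processes needed
	while IFS='|' read -r SEG_ID FROM_NODE TO_NODE ; do
	    : # use $SEG_ID, $FROM_NODE, $TO_NODE here
	done < planet_seg.dat

(Setting IFS as a prefix to read applies to that one command only,
so the rest of the script is unaffected.)
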
> The speed does seem to be inversely proportional to input file size,
> so I wonder if this could be the problem, even if the above fix isn't
> the right one.

If the time taken to run grep on planet_pt.dat is significant, the
time per line would be proportional to the size of planet_pt.dat (plus
any fixed overhead), and so the total time taken would be proportional
to the product of the files' sizes.

That problem won't go away simply by rewriting the program in
another language. To get around it (and it could well dominate the
command-spawning overhead if planet_pt.dat is large), the lookup
needs to be more efficient: a hash table, btree, or bisection
search, rather than a linear scan.
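
In awk, for instance, an associative array is effectively a hash
table, so the whole join becomes one pass over each file. A sketch,
with the field layout taken from your script (untested):

	# pass 1 (planet_pt.dat):  node|x|y    -> coord[node] = "x|y"
	# pass 2 (planet_seg.dat): seg|from|to -> one output line each
	awk -F'|' '
	    NR == FNR { coord[$1] = $2 "|" $3 ; next }
	    { print $1 "|" coord[$2] "|" coord[$3] }
	' planet_pt.dat planet_seg.dat >> planet_seg_lines.dat

Each lookup is then roughly constant time, so the total cost is
proportional to the sum of the two files' sizes rather than their
product.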

> I am talking like 1500 iterations per minute of the "for" loop. That
> is fairly slow for a P4 2.8GHz, no swapping, 2.4.27 debian kernel ...
> I wouldn't have thought it could be _that_ bad.

Oh, I would. Spawning a command is a /lot/ of work. Consider:

	$ strace /usr/bin/true
	execve("/usr/bin/true", ["true"], [/* 44 vars */]) = 0
	uname({sys="Linux", node="cerise", ...}) = 0
	brk(0)                                  = 0x804c000
	access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
	open("/etc/ld.so.cache", O_RDONLY)      = 3
	fstat64(3, {st_mode=S_IFREG|0644, st_size=71395, ...}) = 0
	mmap2(NULL, 71395, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7fa9000
	close(3)                                = 0
	open("/lib/libc.so.6", O_RDONLY)        = 3
	read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0000V\1\000"..., 512) = 512
	fstat64(3, {st_mode=S_IFREG|0755, st_size=1236576, ...}) = 0
	mmap2(NULL, 1142068, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7e92000
	mmap2(0xb7fa3000, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x110) = 0xb7fa3000
	mmap2(0xb7fa7000, 7476, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7fa7000
	close(3)                                = 0
	mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7e91000
	mprotect(0xb7fa3000, 4096, PROT_READ)   = 0
	mprotect(0xb7fce000, 4096, PROT_READ)   = 0
	munmap(0xb7fa9000, 71395)               = 0
	open("/dev/urandom", O_RDONLY)          = 3
	read(3, "2\6\217\366", 4)               = 4
	close(3)                                = 0
	exit_group(0)                           = ?

That's a fork() and exec() (both of which are expensive), plus another
22 system calls in the startup, for a program which doesn't even do
anything; at least 48 context switches, plus any for page faults.

Equivalent string-processing functions such as strstr or strcpy are
trivial in comparison.
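
You can see the gap from the shell itself; e.g. the expr in your
loop against bash's builtin arithmetic (a rough sketch):

	i=0
	time for j in `seq 10000` ; do i=`expr $i + 1` ; done  # 10000 fork+execs
	i=0
	time for j in `seq 10000` ; do i=$((i + 1)) ; done     # stays in-process

The first form spawns a process per iteration and is dramatically
slower; the difference is almost entirely process startup.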

-- 
Glynn Clements <glynn at gclements.plus.com>
