[GRASS-dev] Re: parallelizing GRASS modules

Hamish hamish_b at yahoo.com
Mon Dec 5 05:30:54 EST 2011


Hamish:
> > wrt r.walk: I would think to start with parallelizing r.cost, then
> > porting the method over to r.walk. Both modules use the segment
> > library, so coding it to process each segment in its own thread
> > seems like the way to do that.

[I have since studied r.cost and the segment library more closely, and
now realize that the segment library is a custom page-swap virtual array
meant to keep RAM use low, as opposed to a more straightforward
divide-and-conquer method such as a quad-tree, a fixed rows=4096 tiling,
or the 'r.in.xyz percent=' style of splitting up the work]
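
(to check my understanding, here is a toy sketch of the page-swap idea;
this is NOT the real segment library API and all of the names are made
up. The real library reads its tiles from a temp file rather than from a
static array, which is where segment_pagein() in the profile below comes
from)

#include <stdio.h>
#include <string.h>

#define NROWS 8
#define NCOLS 8
#define SROWS 4		/* tile height */
#define SCOLS 4		/* tile width  */

static double disk[NROWS][NCOLS];   /* stands in for the on-disk swap file */
static double tile[SROWS][SCOLS];   /* the one tile held in RAM */
static int resident = -1;           /* id of the tile currently in RAM */

/* map (row,col) to a tile, "page it in" if needed, return the cell */
static double vget(int row, int col)
{
    int trow = row / SROWS, tcol = col / SCOLS;
    int id = trow * (NCOLS / SCOLS) + tcol;
    int r;

    if (id != resident) {
	for (r = 0; r < SROWS; r++)
	    memcpy(tile[r], &disk[trow * SROWS + r][tcol * SCOLS],
		   SCOLS * sizeof(double));
	resident = id;
    }
    return tile[row % SROWS][col % SCOLS];
}

int main(void)
{
    int row, col;

    for (row = 0; row < NROWS; row++)
	for (col = 0; col < NCOLS; col++)
	    disk[row][col] = row * NCOLS + col;

    printf("%g\n", vget(5, 6));	    /* pages in the lower-right tile */
    return 0;
}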

Markus M:
> The segment library in trunk is mostly IO bound, CPU overhead is very
> low. Parallelizing the segment library could actually slow down the
> system because several parts of a file are accessed in parallel that
> can not be cached by the system. That's why the iostream lib does
> strictly sequential reading. The segment library, although designed to
> allow random access, should be used such that access is not too random

fwiw, I ran trunk's r.cost through valgrind + kcachegrind profiling
 (see the grass mediawiki Bugs page for the method):
50% of the module time was taken in segment_get()
25% in segment_address()
18% in segment_address_fast()
16% in get_lowest()
8% in segment_pagein()
7% in memcpy()


[r.watershed]
> Alternatively, there may be potential to parallelize some of
> the inner loops for (MFD) flow accumulation, but I guess that the
> overhead introduced by paralellization is rather high and may not
> be leveled out with more cores because there are only 8
> neighbours to each cell.

threads are way more efficient than processes, and the compilers are
getting better all the time, but yes, there is a finite overhead hit
you take with each thread created & destroyed. So parallelizing outer
loops is much preferred over parallelizing simple inner loops, and
MarkusN's mapset-per-job approach is still the most efficient way to get
a huge batch job done, as long as it's not an iterative task like the
one Michael has; although I guess he could still run the different
starting-condition cases on different CPUs, each case serially.
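
To make the overhead point concrete, a minimal illustration (made up,
not taken from any GRASS module) of why the outer loop is the better
place for the pragma:

#include <omp.h>   /* only needed for omp_*() calls; the pragmas work without it */

void scale_grid(double **a, int nrows, int ncols, double f)
{
    int i, j;

    /* good: the parallel region is entered once for the whole grid */
#pragma omp parallel for private(j)
    for (i = 0; i < nrows; i++)
	for (j = 0; j < ncols; j++)
	    a[i][j] *= f;

    /* bad: the parallel region start/stop cost is paid once per row
    for (i = 0; i < nrows; i++) {
#pragma omp parallel for
	for (j = 0; j < ncols; j++)
	    a[i][j] *= f;
    }
    */
}

(note that j has to be made private by hand because it is declared
outside the loop, otherwise the threads race on it)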


> The easiest way to speed up things is to get a
> really fast harddrive and use that only for grass
> databases.

maybe I should add my poor-man's RAID setup & tuning notes to the grass
wiki :)


> Sorry for the lack of enthusiasm.

I do not expect everything to happen like magic; for me this is as much
a learning exercise as anything else. Of our 500 or so modules there
will be a few which will be easy wins, and many more which will not be.
But some wins are better than none.


> IMHO, code should be optimized first before it is parallelized,
> otherwise you run ineffective code on several cores.

shrug, that is still better than running ineffective code on a single
core, when the optimization may not happen for years...

The thing I like about OpenMP is that it is very non-invasive: playing
with it does not need a huge investment or a lot of refactoring, and if
it is not switched on at compile time it is as if no changes were made
at all. So it is easily ignored or dropped when the time comes to
optimize properly, and the more we look at the inner algorithms the more
ways we might see to optimize them better.
:-)
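
As a trivial made-up example of what I mean: build the file below with
'gcc -O2 demo.c' and the pragma is simply ignored, you get plain serial
C; build it with 'gcc -O2 -fopenmp demo.c' and the summing loop is
spread across all available cores, with no other change to the source.

#include <stdio.h>

#define N 10000000

static double v[N];

int main(void)
{
    long i;
    double sum = 0.0;

    for (i = 0; i < N; i++)
	v[i] = i * 0.001;

    /* without -fopenmp the next line is ignored by the compiler */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
	sum += v[i];

    printf("sum = %f\n", sum);
    return 0;
}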

Right now I'd be happy with just being able to efficiently parallelize
inefficient things, if it can be done with minimal effort and minimal
changes.


> I'm sure there is still some potential for optimization
> in the code base...

I'd say there is huge potential. [*cough* v.db.select]


wrt v.surf.rst, I've been playing with the LU decomposition function it
uses from lib/gmath/lu.c. Can anyone figure out a way to get the
following to work? afaict the problem is a race: iteration i reads
a[k][j] for every k < i, but with the i loop running in parallel the
thread that writes a[k][j] may not have finished yet.


/* corrupts data but speeds it up a lot: */
#pragma omp parallel for private(i, k, sum) shared(j, a)
	for (i = 0; i < j; i++) {
	    sum = a[i][j];

/* on its own, this pragma works but hugely slows things down: */
//#pragma omp parallel for schedule (static) private(k) shared(i, j) reduction(-:sum)
	    for (k = 0; k < i; k++)
		sum -= a[i][k] * a[k][j];
	    a[i][j] = sum;
	}


still need to investigate whether we can wrap an already optimized and
parallelized BLAS/LAPACK/ATLAS function to replace that, if one is
present at build time. (even an algorithm that is 20% slower in serial,
once well parallelized, should be <=60% quicker on a dual-core and
<=540% quicker on an 8-core cpu)
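
Something along those lines is what I have in mind; an untested sketch
only (dgetrf_() is LAPACK's LU factorization routine; the row-major vs.
column-major question and how it would plug into lib/gmath still need
working out):

/* LAPACK LU factorization with partial pivoting (Fortran calling
   convention, so everything is passed by pointer) */
extern void dgetrf_(int *m, int *n, double *a, int *lda,
		    int *ipiv, int *info);

/* factor an n x n matrix stored in one contiguous block; passing a
   row-major C array means LAPACK actually factors the transpose, which
   is fine as long as the back-substitution accounts for it (e.g. by
   calling dgetrs_() with trans = 'T') */
int lu_via_lapack(double *a, int n, int *ipiv)
{
    int info;

    dgetrf_(&n, &n, a, &n, ipiv, &info);
    return info;	/* 0 = ok, > 0 = a singular pivot was hit */
}

Linked against ATLAS or OpenBLAS instead of the reference LAPACK,
dgetrf() is already multi-threaded without any pragmas on our side.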


thanks,
Hamish

