[GRASS-dev] Re: parallelizing GRASS modules

Mon Dec 5 08:56:05 EST 2011

Hamish:
> > wrt v.surf.rst, I've been playing with its lib/gmath/lu.c LU
> > decomposition function.  can anyone figure a way to get the
> > following to work? 
[...]

Sören wrote:
> Hamish, please do not optimize the LU code as it should be removed (NR
> licence issue). Have a look at the other direct solver in gmath library
> and ccmath library.

I'm looking at this as a learning exercise as much as anything else, so
I don't really mind a bit of wasted effort. And if I can speed up
v.surf.rst today by adding a conditional one-liner somewhere, I think it
is worth my time. (In ~30 minutes of experimenting I already have it
completing ~25% faster; the just posted problem runs 50% faster but gives
garbage results)

> in solvers_direct.c:
> * G_math_solver_cholesky
> * G_math_solver_lu
> * G_math_solver_gauss

I peeked at G_math_lu_decomposition() but not G_math_solver_lu() yet.
It looks reasonably promising:
lu.c:int G_ludcmp(double **a, int n, int *indx, double *d)
solvers_direct.c:int G_math_solver_lu(double **A, double *x, double *b, int rows)

> The NR (Numerical Recipes) solver is designed to run very
> fast in serial, but i think is very hard to parallelize.

fwiw, long ago the Numerical Recipes authors gave permission for their
code to be used in GRASS, but yes, we should work to remove it anyway.

> I did not have had a look on the ccmath LU solver.
...
> Have a look at SuperLU to process dense matrices efficiently in parallel.

ok, more reading to do..

having looked at v.surf.rst a bit more, it seems to me the loop that
really wants to be parallelized is in lib/rst/interp_float/segmen2d.c
IL_interp_segments_2d().  the "cv" for loop starting on this line:

    for (skip_index = 0; skip_index < m_skip; skip_index++) {

calls matrix_create() ~256 times during the course of the module run, and
within that the LU decomposition function is already abstracted so easy
to swap out with something else.  If each one of those matrix_create()s
could be run in their own thread there would be no need to parallelize
the deep linear algebra innards, and so no huge overhead to worry about
and numeric code designed to be serial could remain that way.
it is a bit more complicated, so may be a job for pthreads..?

Hamish