[GRASS-dev] Re: parallelizing GRASS modules

Sören Gebbert soerengebbert at googlemail.com
Mon Dec 5 08:13:00 EST 2011


Hi,

>> IMHO, code should be optimized first before it is parallelized,
>> otherwise you run ineffective code on several cores.
>
> shrug, that is still better than running ineffective code on a single
> core, when the optimization may not happen for years...
>
> The thing I like about OpenMP is that it is very non-invasive, so no
> huge investment or refactoring is needed to play with it, and if it is
> not switched on it is as if no changes were made at all. So it is easily
> ignored/dumped when it comes time to optimize. And the more we look at
> the inner algorithms, the more we might see ways to optimize better.
> :-)
>
> Right now I'd be happy with just being able to efficiently parallelize
> inefficient things, if it can be done with minimal effort and minimal
> changes.

OpenMP is "simple" to use in a program that has been designed from the
beginning to run in parallel. Parallelizing serial code is much harder;
in my experience it is harder than optimizing the code, unless you only
have simple matrix and vector operations.
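
That simple case is the one where a single pragma is enough. A toy
sketch of it (my own example, not GRASS code): an element-wise vector
operation has no loop-carried dependency, so OpenMP parallelizes it
safely and almost for free.

#include <omp.h>

/* Toy example (not GRASS code): y += a * x, element-wise.  No
 * iteration depends on another, so the pragma is safe. */
void vector_axpy(double *y, const double *x, double a, int n)
{
    int i;

#pragma omp parallel for schedule(static) private(i)
    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}

Serial code that was never written with threads in mind usually hides
dependencies between iterations, and that is where the real work starts.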

>> I'm sure there is still some potential for optimization
>> in the code base...
>
> I'd say there is huge potential. [*cough* v.db.select]
>
>
> wrt v.surf.rst, I've been playing with its lib/gmath/lu.c LU decomposition
> function. Can anyone figure out a way to get the following to work? As far
> as I can guess, a[i] in one thread is having a race with a[k] in another.
>
>
> /* corrupts data but speeds it up a lot: */
> #pragma omp parallel for private(i, k, sum) shared(j, a)
>        for (i = 0; i < j; i++) {
>            sum = a[i][j];
>
> /* on its own, this pragma works but hugely slows things down: */
> //#pragma omp parallel for schedule (static) private(k) shared(i, j) reduction(-:sum)
>            for (k = 0; k < i; k++)
>                sum -= a[i][k] * a[k][j];
>            a[i][j] = sum;
>        }

Hamish, please do not optimize the LU code, as it should be removed (NR
licence issue). Have a look at the other direct solvers in the gmath
library and the ccmath library, in solvers_direct.c:
* G_math_solver_cholesky
* G_math_solver_lu
* G_math_solver_gauss

These solvers are parallelized, but not optimized. The drawback of
parallelizing these solvers is the triangular nature of the matrix:
for at least a third of the matrix computation the parallelized loops
are so short that the thread creation overhead dominates. Hence the
speedup of the LU, Gauss and Cholesky solvers will always be poor,
unless you design the code to run more efficiently in parallel.
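
To illustrate what I mean, here is a rough sketch of the pattern (my
own illustration, not the actual solvers_direct.c code): in a forward
substitution only the inner reduction can be spread over threads, the
outer loop is inherently serial, and for a large part of the rows the
inner trip count is far too small to pay for the parallel region.

/* Rough sketch (not the gmath code): forward substitution with a
 * lower triangular matrix L.  The outer loop over i is serial; only
 * the inner sum can be parallelized, and its length grows from 0 to
 * n-1, so thread startup dominates for the short rows. */
void forward_substitution(double **L, double *x, const double *b, int n)
{
    int i, k;

    for (i = 0; i < n; i++) {
        double sum = b[i];

#pragma omp parallel for schedule(static) private(k) reduction(-:sum)
        for (k = 0; k < i; k++)
            sum -= L[i][k] * x[k];

        x[i] = sum / L[i][i];
    }
}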

The NR (Numerical Recipes) solver is designed to run very fast in
serial, but I think it is very hard to parallelize.
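
To answer your race question directly: as far as I can see it is not
just a race you could lock away, but a true loop-carried dependency.
A minimal sketch of the column update (my own illustration, not the
lu.c code itself):

/* Sketch (my illustration, not lu.c): the Crout-style update of one
 * column.  Iteration i writes col[i], and every later iteration reads
 * col[k] with k == i again, so the outer loop cannot be split across
 * threads without corrupting the factorization. */
void update_column(double **a, double *col, int j)
{
    int i, k;

    for (i = 0; i < j; i++) {
        double sum = col[i];

        for (k = 0; k < i; k++)
            sum -= a[i][k] * col[k];   /* needs col[k] from iteration k */

        col[i] = sum;                  /* needed by iterations i+1 .. j-1 */
    }
}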

I have not had a look at the ccmath LU solver yet.

>
> still need to investigate if we can wrap an already optimized and
> parallelized BLAS/LAPACK/ATLAS function to replace that, if present.
> (even a 20% slower algorithm well parallelized will be <=60% quicker on a
> dual-core and <=540% quicker on an 8-core cpu)

You need a solver using the same matrix storage concept as the
existing one, otherwise the memory consumption will double.
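
The gmath solvers take a double ** matrix, while LAPACK expects one
contiguous column-major block, so a straightforward wrapper has to copy
the whole matrix first. A rough, untested sketch of such a wrapper
(assuming the usual Fortran dgesv_ interface; the malloc of the
contiguous buffer is exactly where the memory doubles):

#include <stdlib.h>

/* Hedged sketch: solve A x = b via LAPACK dgesv when A is stored as a
 * double ** row-pointer matrix.  The column-major copy is a second,
 * full-size copy of A. */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int solve_with_lapack(double **A, double *x, const double *b, int n)
{
    double *a = malloc((size_t)n * n * sizeof(double));
    int *ipiv = malloc((size_t)n * sizeof(int));
    int nrhs = 1, info, i, j;

    for (j = 0; j < n; j++)             /* row pointers -> column-major */
        for (i = 0; i < n; i++)
            a[(size_t)j * n + i] = A[i][j];

    for (i = 0; i < n; i++)
        x[i] = b[i];                    /* dgesv overwrites the RHS with x */

    dgesv_(&n, &nrhs, a, &n, ipiv, x, &n, &info);

    free(a);
    free(ipiv);
    return info == 0 ? 0 : -1;
}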

Have a look at SuperLU to process dense matrices efficiently in parallel.

Best regards
Soeren

>
>
> thanks,
> Hamish

