[GRASS-dev] multithreading v.surf.bspline and others with OpenMP

Mon Nov 28 02:21:52 EST 2011

Hi,

On June 2, 2011, Soeren wrote:
> Have a look at:
> http://grass.osgeo.org/wiki/OpenMP
>
> There are only a few modules using OpenMP (r.gwflow, r3.gwflow and
> r.solute.transport). Parts of the gpde and gmath libraries in grass
> 6.5 and 7 are parallelized using OpenMP. These libraries provide tests
> and benchmarks in the libraries test directories.


I am looking to add OpenMP support to v.surf.bspline and am very much
a beginner with it. Profiling with valgrind and kcachegrind (see the
wiki Bugs page) shows that 99% of the module time was spent in gmath lib's
G_math_cholesky_sband_decomposition(). This looks like low hanging fruit.


So I build grass7 with openmp support and add the following patch:
(see OpenMP ./configure patch in the trac'er)

Index: lib/gmath/solvers_direct_cholesky_band.c
===================================================================

--- lib/gmath/solvers_direct_cholesky_band.c    (revision 49355)
+++ lib/gmath/solvers_direct_cholesky_band.c    (working copy)
@@ -24,6 +24,7 @@
 
     for (i = 0; i < rows; i++) {
        G_percent(i, rows, 2);
+#pragma omp parallel for schedule (static) private(j, k, sum, end) shared(A, T, i, bandwidth)
        for (j = 0; j < bandwidth; j++) {
            sum = A[i][j];
            /* start = 1 */


and on a 6-core CPU the module runs 3.5x faster. but the resulting map
is just full of NaNs. Running with OMP_NUM_THREADS=1 gives me data again.

a) what have I done wrong? why all NaNs?

b) 3.5x speedup is very nice, but any way to improve on that 40% efficiency
loss?

c) for the many raster modules that do
        for (row = 0; row < nrows; row++) {
            for (col = 0; col < ncols; col++) {

I'm guessing it is better to multithread the columns for loop (filling
a row's array in parallel), but to keep writing the raster rows serially.
But does the cost of creating and destroying 2-16 threads per row end up
costing too much in terms of create/destroy overhead? IIUC "schedule
(static)" helps to minimize the cost of creating each thread a little? (??)

d) I had a look at 'r.univar -e'; it uses heapsort() which AFAIU is a
sorting method which can not be parallelized.

e) I had a look at r.neighbors, which spends 50% of its time in gather().
I got it to run multithreaded on all cores, but the result was many times
slower than if I just had it run serially. why? too much waiting for "n"?

int gather(DCELL * values, int offset)
{
    int row, col;
    int n = 0;
    DCELL *c;

    *values = 0;

#pragma omp parallel for private(c, col) reduction(+:n)
    for (row = 0; row < ncb.nsize; row++)
        for (col = 0; col < ncb.nsize; col++) {
            c = &ncb.buf[row][offset + col];

            if (ncb.mask && !ncb.mask[row][col])
                continue;

            if (Rast_is_d_null_value(c))
                Rast_set_d_null_value(&values[n], 1);
            else
                values[n] = *c;

            n++;
        }

    return n ? n : -1;
}

(I moved "DCELL *c" out of the inner loop as I suspected allocating a
new variable might be expensive. But it still ran at about the same speed
so maybe it is cheap to do that after all, or maybe the compiler did
something smart)


f) we talked a little about this before, but it would be good to settle
on a uniform name for test suite scripts within a module or libraries
directory. I humbly suggest to use "test_suite/". Using "profiler/"
doesn't cover regression testing, "test" is too generic, and putting
it in the main module dir introduces too much file clutter IMO.

g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20% of
its time in __iee754_log().  G_ludcmp() also looks like very low hanging
fruit. (great!)

h) is it possible &/or desirable to use (aka outsource) pre-optimized &
pre-threaded BLAS or LAPACK libraries for our linear algebra needs?


thanks,
Hamish