[GRASS-dev] Re: multithreading v.surf.bspline and others with OpenMP

Sören Gebbert soerengebbert at googlemail.com
Mon Nov 28 05:11:40 EST 2011


Hi Hamish,

2011/11/28 Hamish <hamish_b at yahoo.com>:
> Hi,
>
> On June 2, 2011, Soeren wrote:
>> Have a look at:
>> http://grass.osgeo.org/wiki/OpenMP
>>
>> There are only a few modules using OpenMP (r.gwflow, r3.gwflow and
>> r.solute.transport). Parts of the gpde and gmath libraries in grass
>> 6.5 and 7 are parallelized using OpenMP. These libraries provide tests
>> and benchmarks in the libraries test directories.
>
>
> I am looking to add OpenMP support to v.surf.bspline and am very much
> a beginner with it. Profiling with valgrind and kcachegrind (see the
> wiki Bugs page) shows that 99% of the module time was spent in gmath lib's
> G_math_cholesky_sband_decomposition(). This looks like low hanging fruit.
>
>
> So I built grass7 with OpenMP support and added the following patch:
> (see the OpenMP ./configure patch in the trac'er)
>
> Index: lib/gmath/solvers_direct_cholesky_band.c
> ===================================================================
> --- lib/gmath/solvers_direct_cholesky_band.c    (revision 49355)
> +++ lib/gmath/solvers_direct_cholesky_band.c    (working copy)
> @@ -24,6 +24,7 @@
>
>     for (i = 0; i < rows; i++) {
>        G_percent(i, rows, 2);
> +#pragma omp parallel for schedule (static) private(j, k, sum, end) shared(A, T, i, bandwidth)
>        for (j = 0; j < bandwidth; j++) {
>            sum = A[i][j];
>            /* start = 1 */
>
>
> On a 6-core CPU the module then runs 3.5x faster, but the resulting map
> is just full of NaNs. Running with OMP_NUM_THREADS=1 gives me data again.
>
> a) what have I done wrong? why all NaNs?

Try r49406.

You need to separate the computation for j == 0 and j > 0.
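
In that loop the j == 0 pass computes the diagonal entry T[i][0], and
every j > 0 pass of the same row divides by it, so your pragma lets
threads read T[i][0] before it has been written. A minimal sketch of
the separation (the loop bounds here are reconstructed from the band
Cholesky recurrence, not copied from the commit):

    for (i = 0; i < rows; i++) {
        G_percent(i, rows, 2);

        /* j == 0: the diagonal entry; every other entry of this row
           divides by it, so compute it serially first */
        sum = A[i][0];
        end = (bandwidth < i + 1 ? bandwidth : i + 1);
        for (k = 1; k < end; k++)
            sum -= T[i - k][k] * T[i - k][k];
        T[i][0] = sqrt(sum);

        /* j > 0: these depend only on T[i][0] and on earlier rows,
           not on each other, so they can run in parallel */
#pragma omp parallel for schedule(static) private(j, k, sum, end) shared(A, T, i, bandwidth)
        for (j = 1; j < bandwidth; j++) {
            sum = A[i][j];
            end = (bandwidth - j < i + 1 ? bandwidth - j : i + 1);
            for (k = 1; k < end; k++)
                sum -= T[i - k][k] * T[i - k][j + k];
            T[i][j] = sum / T[i][0];
        }
    }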

>
> b) A 3.5x speedup is very nice, but is there any way to improve on that
> 40% efficiency loss?

The speedup gets better the larger the band matrix is. But one limiting
factor of the parallel speedup is the first computation, for j == 0:
it must be finished before the rest of the row can be processed in
parallel. The next limiting factor is the time the OS needs to create
the threads, unless the OpenMP implementation uses a thread
pool ... .
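
As a rough Amdahl's law estimate: if a fraction s of the per-row work
is that serial j == 0 pass, then p cores give at most a speedup of
1 / (s + (1 - s)/p). With s = 0.1 and p = 6 that is 4x, which is in
the region of your measured 3.5x. (The 10% serial fraction is only an
illustrative guess, not a measurement.)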

>
> c) for the many raster modules that do
>        for (row = 0; row < nrows; row++) {
>            for (col = 0; col < ncols; col++) {
>
> I'm guessing it is better to multithread the columns for loop (filling
> a row's array in parallel), but to keep writing the raster rows serially.

Yes, much better indeed.
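
A minimal sketch of that pattern with the GRASS 7 raster API; the
per-cell work is a placeholder, not code from any particular module:

#include <grass/gis.h>
#include <grass/raster.h>

void process_rows(int in_fd, int out_fd, int nrows, int ncols)
{
    DCELL *in_buf = Rast_allocate_d_buf();
    DCELL *out_buf = Rast_allocate_d_buf();
    int row, col;

    for (row = 0; row < nrows; row++) {
        Rast_get_d_row(in_fd, in_buf, row);     /* serial read */

#pragma omp parallel for private(col)
        for (col = 0; col < ncols; col++)
            out_buf[col] = in_buf[col] * 2.0;   /* placeholder work */

        Rast_put_d_row(out_fd, out_buf);        /* serial write */
    }

    G_free(in_buf);
    G_free(out_buf);
}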

> But does creating and destroying 2-16 threads per row end up costing too
> much in create/destroy overhead? IIUC "schedule(static)" helps to minimize
> the cost of creating each thread a little? (??)

This only pays off when processing the columns takes much more time
than the thread creation by the OS, unless a thread pool
.... .
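
Two notes. "schedule(static)" does not make thread creation cheaper;
it only reduces scheduling overhead, because the iterations are divided
among the threads up front instead of being handed out at run time.
What can help is creating the team once, outside the row loop, so each
row only pays for the worksharing barriers. A hedged sketch, reusing
the buffers and raster calls from the sketch above:

#pragma omp parallel private(row, col)
    for (row = 0; row < nrows; row++) {   /* every thread runs the row loop */
#pragma omp single
        Rast_get_d_row(in_fd, in_buf, row);     /* one thread reads */

#pragma omp for schedule(static)
        for (col = 0; col < ncols; col++)
            out_buf[col] = in_buf[col] * 2.0;   /* shared column work */

#pragma omp single
        Rast_put_d_row(out_fd, out_buf);        /* one thread writes */
    }

The implicit barriers after "single" and "for" keep the threads in
lockstep per row. Note also that many OpenMP runtimes (libgomp
included) keep the worker threads alive between parallel regions
anyway, which is exactly the thread pool caveat above.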

>
> d) I had a look at 'r.univar -e'; it uses heapsort(), which AFAIU is a
> sorting method that cannot be parallelized.
>
> e) I had a look at r.neighbors, which spends 50% of its time in gather().
> I got it to run multithreaded on all cores, but the result was many times
> slower than if I just ran it serially. Why? Too much waiting for "n"?

Multithreading, especially with an OpenMP reduction, is only
meaningful when the data is large enough; otherwise the serial
gathering of n and the thread creation take much longer than the
computation, unless a thread pool ..... .

>
> int gather(DCELL * values, int offset)
> {
>    int row, col;
>    int n = 0;
>    DCELL *c;
>
>    *values = 0;
>
> #pragma omp parallel for private(c, col) reduction(+:n)
>    for (row = 0; row < ncb.nsize; row++)
>        for (col = 0; col < ncb.nsize; col++) {
>            c = &ncb.buf[row][offset + col];
>
>            if (ncb.mask && !ncb.mask[row][col])
>                continue;
>
>            if (Rast_is_d_null_value(c))
>                Rast_set_d_null_value(&values[n], 1);
>            else
>                values[n] = *c;
>
>            n++;
>        }
>
>    return n ? n : -1;
> }
>
> (I moved "DCELL *c" out of the inner loop as I suspected allocating a
> new variable might be expensive. But it still ran at about the same speed
> so maybe it is cheap to do that after all, or maybe the compiler did
> something smart)
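
One more thing about the code above: reduction(+:n) gives every thread
a private n that starts at 0, so all the threads write to values[0],
values[1], ... at the same time. Besides the data race, those
overlapping writes to the same cache lines cause false sharing, which
alone can explain the slowdown. And ncb.nsize is only the neighborhood
size, so the loop is far too small to amortize the parallel overhead
anyway. For windows large enough to be worth it, here is a hedged
sketch that keeps values[] packed (the ordering of values[] changes,
which should not matter for order-independent statistics; memcpy needs
string.h):

int gather_omp(DCELL * values, int offset)
{
    int n = 0;

#pragma omp parallel
    {
        /* per-thread packed buffer, merged below */
        DCELL *local = G_malloc(ncb.nsize * ncb.nsize * sizeof(DCELL));
        int row, col, ln = 0;

#pragma omp for nowait
        for (row = 0; row < ncb.nsize; row++)
            for (col = 0; col < ncb.nsize; col++) {
                DCELL *c = &ncb.buf[row][offset + col];

                if (ncb.mask && !ncb.mask[row][col])
                    continue;

                if (Rast_is_d_null_value(c))
                    Rast_set_d_null_value(&local[ln], 1);
                else
                    local[ln] = *c;
                ln++;
            }

        /* append this thread's buffer to values[] one thread at a time */
#pragma omp critical
        {
            memcpy(values + n, local, ln * sizeof(DCELL));
            n += ln;
        }
        G_free(local);
    }

    return n ? n : -1;
}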
>
>
> f) we talked a little about this before, but it would be good to settle
> on a uniform name for test suite scripts within a module or libraries
> directory. I humbly suggest using "test_suite/". Using "profiler/"
> doesn't cover regression testing, "test" is too generic, and putting
> it in the main module dir introduces too much file clutter IMO.

No problem with test_suite for library tests and benchmarks.

>
> g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20% of
> its time in __ieee754_log().  G_ludcmp() also looks like very low hanging
> fruit. (great!)
>
> h) is it possible &/or desirable to use (aka outsource) pre-optimized &
> pre-threaded BLAS or LAPACK libraries for our linear algebra needs?

The GRASS ATLAS wrapper is an example of such an approach: ATLAS is
used when it is installed, otherwise the default GRASS implementation
is used. Feel free to add new function wrappers. :)
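
The pattern is a compile-time fallback, roughly like this (the macro
name is illustrative; see lib/gmath for the real wrappers):

double G_math_ddot(double *x, double *y, int rows)
{
#if defined(HAVE_LIBATLAS)
    /* tuned, possibly threaded CBLAS routine */
    return cblas_ddot(rows, x, 1, y, 1);
#else
    /* plain GRASS fallback */
    int i;
    double sum = 0.0;

    for (i = 0; i < rows; i++)
        sum += x[i] * y[i];
    return sum;
#endif
}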

Best regards
Soeren

>
>
> thanks,
> Hamish
>

