[GRASS-dev] Re: multithreading v.surf.bspline and others with OpenMP

Hamish hamish_b at yahoo.com
Tue Nov 29 02:20:03 EST 2011


Hamish:
> > I am looking to add OpenMP support to v.surf.bspline
...

Sören wrote:
> Try r49406.
> You need to separate the computation for j == 0 and j > 0.

nice, thanks.
 
> > b) 3.5x speedup is very nice, but any way to improve
> > on that 40% efficiency loss?
> 
> The speedup gets better the larger the band matrix is. But the
> limiting factor for the parallel speedup is the first computation,
> for j == 0. This operation must be done before the rest can be
> processed in parallel. The next limiting factor is the time the OS
> needs to create a thread, unless the OpenMP implementation uses a
> thread pool ... .

I guess that getting a thread pool typically involves waiting until the
next OS upgrade, or longer?
 

> > c) for the many raster modules that do
> >        for (row = 0; row < nrows; row++) {
> >            for (col = 0; col < ncols; col++) {
> >
> > I'm guessing it is better to multithread the columns for loop
> > (filling a row's array in parallel), but to keep writing the raster
> > rows serially.
> 
> Yes, much better indeed.

... but parallelizing the row loop would be much better if the
whole array were in memory, or mmap()'d?
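
For reference, a minimal sketch of the pattern as discussed, i.e.
multithread the columns, keep the row I/O serial. This uses GRASS 7
trunk's Rast_*() names, and the *2.0 is just placeholder work:

    #include <grass/gis.h>
    #include <grass/raster.h>

    /* parallelize each row's columns; read and write rows serially */
    static void process_rows(int in_fd, int out_fd, int nrows, int ncols)
    {
        DCELL *in_buf = Rast_allocate_d_buf();
        DCELL *out_buf = Rast_allocate_d_buf();
        int row, col;

        for (row = 0; row < nrows; row++) {
            Rast_get_d_row(in_fd, in_buf, row);        /* serial read */

            /* a parallel region is entered once per row */
            #pragma omp parallel for schedule(static)
            for (col = 0; col < ncols; col++)
                out_buf[col] = in_buf[col] * 2.0;      /* placeholder work */

            Rast_put_d_row(out_fd, out_buf);           /* serial write */
        }

        G_free(in_buf);
        G_free(out_buf);
    }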

 
> > But does creating and destroying 2-16 threads per row end up
> > costing too much in create/destroy overhead?
> > IIUC "schedule (static)" helps to minimize the cost of creating
> > each thread a little? (??)
> 
> This only helps in case the processing of the columns takes
> much more time than the thread creation by the OS, unless a
> thread pool .... .

Under the current method of parallelizing the inner loop, though:
say for a quad-core CPU and a 1200 cols x 800 rows array, we get
4 threads per row, each handling 300 columns, so over the whole task
we have created and destroyed 4 * 800 = 3200 threads on a system
which will only run 4 at a time.

Much better (but perhaps harder) would be to parallelize as close to
the main process level as we can, and then only pay the overhead of
creating/destroying e.g. 4 threads, not 3200.
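
To make the 3200-vs-4 point concrete, a hedged sketch of hoisting the
parallel region above the row loop so one team services the whole map
(same placeholder computation as in the sketch above; how much this
actually buys depends on how well the OpenMP runtime already reuses
its threads):

    #include <grass/gis.h>
    #include <grass/raster.h>

    /* one thread team for the whole map instead of one team per row */
    static void process_rows_one_team(int in_fd, int out_fd,
                                      int nrows, int ncols)
    {
        DCELL *in_buf = Rast_allocate_d_buf();
        DCELL *out_buf = Rast_allocate_d_buf();

        #pragma omp parallel
        {
            int row, col;

            for (row = 0; row < nrows; row++) {
                #pragma omp single    /* one thread reads, the rest wait */
                Rast_get_d_row(in_fd, in_buf, row);

                #pragma omp for schedule(static)
                for (col = 0; col < ncols; col++)
                    out_buf[col] = in_buf[col] * 2.0;  /* placeholder work */
                /* implicit barrier: the row is done before being written */

                #pragma omp single    /* rows are still written serially */
                Rast_put_d_row(out_fd, out_buf);
            }
        }

        G_free(in_buf);
        G_free(out_buf);
    }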

On the other hand, for OpenCL (I'll work on support for that after the
OpenMP stuff has been committed) a modern GPU may well have 500 cores.


In the case of v.surf.bspline, I note it ran using 4-16 subregions for
the test runs I did. If each of those could be sent to its own thread,
I think we'd be done (for a few years), without the 40% efficiency loss.

If so, is it then possible to call omp_set_num_threads(1) to tell the
gmath lib not to try to parallelize any further? The function description
says "number of threads used by default in subsequent parallel sections",
so maybe so.
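
Something like the following, perhaps. interpolate_subregion() is a
hypothetical stand-in for the per-subregion work, and as far as I
understand omp_set_num_threads() only affects parallel regions
subsequently started by the calling thread. Note too that nested
parallelism is off by default, so gmath's inner parallel sections would
collapse to a team of one inside the outer loop anyway:

    #include <omp.h>

    void interpolate_subregion(int i);  /* hypothetical per-subregion work */

    /* hypothetical sketch: one thread per bspline subregion */
    void interpolate_all(int n_subregions)
    {
        int i;

        #pragma omp parallel for schedule(dynamic)
        for (i = 0; i < n_subregions; i++) {
            /* keep this thread's subsequent parallel sections serial */
            omp_set_num_threads(1);
            interpolate_subregion(i);
        }
    }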


> Multithreading, especially in the case of an OpenMP reduction, is
> only meaningful in case the data is large enough; otherwise the
> serial gathering of n and the thread creation take much longer than
> the computation, unless a thread pool ..... .

And even more so for OpenCL, as copying the data into, and the result
back out of, video card memory is very, very slow.
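
Going back to the reduction point, a minimal sketch for reference; the
parallel version only pays off once n is large enough to amortize the
team startup and the serial combining of the per-thread partial sums:

    /* dot product with an OpenMP reduction */
    double dot(const double *x, const double *y, int n)
    {
        double sum = 0.0;
        int i;

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += x[i] * y[i];

        return sum;
    }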



> > f) we talked a little about this before, but it would
> > be good to settle on a uniform name for test suite scripts
...

Also, it would be good to confirm a standard dataset to use. Generating
fake data on the fly is self-bootstrapping, but requires passing
fixed seeds to v.random etc. Otherwise, N.C. 2008 probably gives a
wider range of possibilities than the much smaller Spearfish dataset
(mainly in that Spearfish doesn't include lidar data).
Any thoughts?


> > g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20%
> > of its time in __iee754_log().  G_ludcmp() also looks like very low
> > hanging fruit. (great!)

It also looks like a very similar clone of other code in the gmath lib
and, I'd expect, of BLAS/LAPACK/ATLAS too.

I was able to get v.surf.rst to run faster by putting some pragmas into
G_ludcmp(), but again I wonder if it would be more efficient to concentrate
on parallelizing the module's quadtree segments instead of the inner loops
of the linear algebra. And again, maybe a two-step approach: do the libs
now (relatively easy), then later do the segments, and have that module
code switch off threading for its library calls with omp_set_num_threads().
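
For illustration, a hedged sketch of where such pragmas go: this is plain
Gaussian elimination without the partial pivoting G_ludcmp() actually
does, but it shows the independent row updates a pragma can share out
(and it still enters a parallel region once per k step, so the same
per-row overhead caveat applies):

    /* in-place LU factorization, pivoting omitted for brevity;
       for a fixed k the row updates are independent of each other */
    void lu_nopivot(double **a, int n)
    {
        int i, j, k;

        for (k = 0; k < n - 1; k++) {
            #pragma omp parallel for private(j)
            for (i = k + 1; i < n; i++) {
                a[i][k] /= a[k][k];                /* multiplier */
                for (j = k + 1; j < n; j++)
                    a[i][j] -= a[i][k] * a[k][j];  /* eliminate */
            }
        }
    }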


> > h) is it possible &/or desirable to use (aka outsource) pre-optimized
> > & pre-threaded BLAS or LAPACK libraries for our linear algebra needs?
> 
> The GRASS ATLAS wrapper is an example of such an approach. ATLAS can
> be used, but in case it is not installed, the default GRASS
> implementation is used.

Oh, I did not know that was there. We can work on adding it to trunk's
./configure next.


Hamish


ps- I didn't add --with-openmp-includes= to ./configure in my patch, just
--with-openmp-libs=.  To use the omp_*() fns I guess omp.h is wanted, and
so I should do that after all?

