[GRASS-dev] Re: multithreading v.surf.bspline and others with OpenMP

Sören Gebbert soerengebbert at googlemail.com
Wed Nov 30 12:12:31 EST 2011


2011/11/29 Hamish <hamish_b at yahoo.com>:
> Hamish:
>> > I am looking to add OpenMP support to v.surf.bspline
> ...
> Sören wrote:
>> Try r49406.
>> You need to separate the computation for j == 0 and j > 0.
> nice, thanks.
>> > b) 3.5x speedup is very nice, but any way to improve
>> > on that 40% efficiency loss?
>> The speedup improves as the band matrix gets larger. But the
>> limiting factor of the parallel speedup is the first computation for
>> j == 0. This operation must be done before the rest can be processed
>> in parallel. The next limiting factor is the time the OS needs
>> to create a thread, unless the OpenMP implementation uses a thread
>> pool ... .
> I guess that getting a thread pool typically involves waiting until the
> next OS upgrade, or longer?

OpenMP support is compiler specific, and therefore so is the thread
pool implementation. Have you tried the free Intel C compiler for
open source projects? It is really fast. Thread creation speed is OS
dependent.

>> > c) for the many raster modules that do
>> >        for (row = 0; row < nrows; row++) {
>> >            for (col = 0; col < ncols; col++)
>> {
>> >
>> > I'm guessing it is better to multithread the columns for loop
>> > (filling a row's array in parallel), but to keep writing the raster
>> > rows serially.
>> Yes, much better indeed.
> ... but parallelizing the row loop would be much better if the
> whole array was in memory, or mmap()'d?

Yes. Most of the GRASS BLAS level 2 and 3 functions can run in
parallel using OpenMP, because everything is in memory.

>> > But does the cost of creating and destroying 2-16 threads per row
>> > end up costing too much in terms of create/destroy overhead?
>> > IIUC "schedule (static)" helps to minimize the cost of creating
>> > each thread a little? (??)
>> This is only useful in case the processing of the columns
>> needs much more time than the thread creation by the OS, unless a
>> thread pool .... .
> under the current method of parallelizing the inside loop though,
> say for a quad-core CPU with a 1200 cols x 800 rows array, we
> get 4 threads per row, each handling 300 columns, and for the task
> have created and destroyed 4*800= 3200 threads on a system which
> will only handle 4 at a time.
> much better (but perhaps harder) would be to parallelize as close to
> the main process level as we can, and then only deal with the overhead
> of creating/destroying e.g. 4 threads not 3200.

This is indeed harder. First, it depends on the algorithm of the
module: can it be parallelized at the main process level at all?
It is easier, in the loops, to reuse threads doing the same job again
and again ... a thread pool. The pool gets initialized at the start of
the program and waits for work; the threads are destroyed at the end
of the program.
Some years ago we discussed a C thread pool implementation in a
parallel algorithms course at university ... unfortunately the
supervisor was too #!*&%? lazy to present us the solution. :(

> On the otherhand, for OpenCL (I'll work on support for that after the
> OpenMP stuff has been committed) a modern GPU may well have 500 cores.

Handling that many threads is pure magic in my humble opinion, but I
guess this is implemented very efficiently in hardware on the
graphics card ... .

> in the case of v.surf.bspline I note it runs using 4-16 subregions for
> the tests runs I did. if those could each be sent to their own thread
> I think we'd be done (for a few years), without the 40% efficiency loss.

Indeed. Parallelization at the subregion level is the best thing to
do. In case the subregions are large enough, an MPI solution may be
meaningful.

> If so, is it then possible to call omp_set_num_threads(1); to tell gmath
> lib not to try and parallelize it any more? The fn descr says "number of
> threads used by default in subsequent parallel sections", so maybe so.

We should in principle avoid nested OpenMP regions; it is better not
to call the parallelized solvers and related code from within a
subthread. A better idea is to use MPI to distribute the jobs and
OpenMP to parallelize the work within each MPI process.

>> Multithreading, especially in the case of an OpenMP reduction, is
>> only meaningful in case the data is large enough; otherwise the
>> serial gathering of n and the thread creation take much longer than
>> the computation, unless a thread pool ..... .
> And even moreso for OpenCL, as copying the data into and the result back
> out of video card memory is very very slow.

Well, PCI Express v3.0 with 16 lanes gives 16 GB/s, which is not that
bad? A hard disk is much slower.

>> > f) we talked a little about this before, but it would
>> > be good to settle on a uniform name for test suite scripts
> ...
> also it would be good to confirm a standard dataset to use. Generating
> fake data on the fly is self-boot strapping, but requires passing
> fixed seeds to v.random etc. Otherwise N.C.2008 probably gives a
> wider range of possibilities than the much smaller spearfish. (mainly
> that spearfish doesn't include lidar data)
> any thoughts?

A standard dataset is a good idea, but very often you need to
generate the data to test specific parameter settings. Keeping all
this data in a test dataset will IMHO bloat it. Besides that, the
tests should be designed to work with small data for speed reasons,
except tests for specific bugs which only occur with large data.

>> > g) v.surf.rst spends 58% of its time in gmath lib's G_ludcmp() and 20%
>> > of its time in __iee754_log().  G_ludcmp() also looks like very low
>> > hanging fruit. (great!)
> it also looks like a very similar clone of other code in gmath lib, and
> I'd expect BLAS/LAPACK/ATLAS too.

Unfortunately G_ludcmp() is NR code, and it is faster than the
solvers in gmath and ccmath. But it should be replaced by the LU
solver implemented in the ccmath library.

Some years ago I implemented a v.surf.rst version using much faster
solvers which had been parallelized with OpenMP, but without much
success. The matrices created by spline interpolation are dense and
mostly badly conditioned. Hence, the LU decomposition solver and the
Gauss solver are the most robust and meaningful solvers available.

> I was able to get v.surf.rst to run faster by putting some pragmas into
> G_ludcmp(), but again I wonder if it would be more efficient to concentrate
> on parallelizing the module's quadtree segments instead of the inner loops
> of the linear algebra. And again, maybe a two step approach: do the libs
> now (relatively easy), then later do the segments and have that module code
> also switch off threading for its library calls with omp_set_num_threads().

Parallelizing the module's quadtree segments instead of the inner
loops is the best idea. I was thinking about this too, but the rst
code scared me too much ... functions with 40 parameters as arguments
and so on ... brrrrr.

>> > h) is it possible &/or desirable to use (aka outsource) pre-optimized
>> > & pre-threaded BLAS or LAPACK libraries for our linear algebra needs?
>> The GRASS ATLAS wrapper is an example of such an approach. ATLAS
>> can be used, but in case it is not installed, the default GRASS
>> implementation is used.
> Oh, I did not know that was there. We can work on adding it to trunk's
> ./configure next.

We can do that, but IMHO the ATLAS wrapper is not used by any module
except the library test module.

Best regards

> Hamish
> ps- I didn't add --with-openmp-includes= to ./configure in my patch, just
> --with-openmp-libs=.  To use the omp_*() fns I guess omp.h is wanted, and
> so I should do that after all?

As OpenMP is compiler dependent, the compiler should know where to
search for the includes, because they are part of its standard
library? Well, I am not 100% sure .... . IMHO, to put OpenMP support
into configure, you need a specific solution for every compiler on
the market. The gmath and gpde libraries make no use of omp_*()
functions, so the OpenMP support can be specified with compiler and
linker flags alone.
