[GRASS-stats] Example code for MRPP on a multivariate data set

Mon Jan 10 15:07:39 EST 2011

On Sun, Jan 9, 2011 at 9:18 PM, Roger Bivand <Roger.Bivand at nhh.no> wrote:
> On Sun, 9 Jan 2011, Markus Neteler wrote:
...
>> http://cran.r-project.org/web/views/HighPerformanceComputing.html
>>
>> An openMP approach or the like, with implicit parallelism, would be great
>> since I cannot rewrite R...
>
> Markus,
>
> No implicit - one needs to use mechanisms in the snow package (using sockets
> is easiest and faster than PVM or MPI, but isn't fault tolerant) or similar
> to start R on the worker nodes, and to divide up the tiles or whatever that
> can be parallelised into a list for execution.
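
A minimal sketch of the socket-based approach described above. It assumes the
parallel package that ships with current R, which provides the snow-style
cluster interface; the "tiles" here are a made-up list of number chunks, not
real raster tiles.

```r
library(parallel)  # ships with R; offers the snow-style cluster functions

# Hypothetical stand-in for "tiles": four chunks of five numbers each.
tiles <- split(1:20, rep(1:4, each = 5))

# Start two R worker processes on this machine, connected over sockets.
cl <- makeCluster(2, type = "PSOCK")

# Hand the list to the workers; results come back in the order of the list.
tile_sums <- parLapply(cl, tiles, function(tile) sum(tile))

# Always shut the workers down when finished.
stopCluster(cl)

print(unlist(tile_sums))
```

The work has to be expressed as a list up front; nothing is parallelised
implicitly, which is the point made in the quoted paragraph.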

Roger,

thanks for your help. However, for the above suggestion one needs to know the
data and algorithms (at least somewhat) - in this case I was "blindly"
executing some calculations for Nikos.

> Depending on the data
> configuration, this can help or not (if all the nodes need memory for large
> objects, then they crowd out the machine). It all depends on what is being
> done. If a task is embarrassingly parallelisable (like bootstrapping, or
> kriging from few points to many tiles), it can be effective, but one needs
> to think through all the implications and plan the work to suit the problem
> at hand and the available hardware. I don't think that openMP can schedule
> arbitrary job chunks by itself either?
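
As an illustration of the bootstrapping case mentioned above, here is a hedged
sketch using the parallel package. The sample x, the number of replicates and
the seeds are all made up for the example. Note that the data are sent to the
workers, so with large objects each worker needs its own memory for them.

```r
library(parallel)

set.seed(1)
x <- rnorm(200, mean = 5)   # made-up sample standing in for real data

cl <- makeCluster(2, type = "PSOCK")

# Give each worker its own reproducible random number stream.
clusterSetRNGStream(cl, iseed = 42)

# Each replicate is independent of the others, so the work splits cleanly.
# The data are passed as an extra argument and copied to the workers.
boot_means <- parSapply(
  cl, 1:500,
  function(i, dat) mean(sample(dat, replace = TRUE)),
  dat = x
)

stopCluster(cl)

boot_se <- sd(boot_means)   # bootstrap standard error of the mean
print(boot_se)
```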

I was thinking about the low level functions in R which are regularly called,
less about individual extensions. For sure, openMP requires good code
knowledge; I tried a bit together with Yann Chemin to parallelize i.atcorr
some time ago.

> References in:
>
> http://www.nhh.no/Admin/Public/DWSDownload.aspx?File=%2fFiles%2fFiler%2finstitutter%2fsam%2fDiscussion+papers%2f2010%2f25.pdf

Thanks for this nice paper, I wasn't aware of it.

> I can send the example script from the paper if it would be useful, as a
> rough template for simplistic use of snow.
>
> But I've no idea whether this application is embarrassingly parallelisable, I
> suspect not, and that it is using dense matrix methods where sparse methods
> might be possible - vegdist() returns a dist object, which is (n*(n-1))/2 in
> size. It will then copy these and use them as matrices. To reduce memory
> usage with big N, mrpp() should be rewritten to make dmat sparse over a
> distance threshold. Internally, dmat is promoted to a full dense
> representation (n*n), which is quite big here. Why it is bloating in your case, I
> don't know. Looking at mrpp(), it could be parallelised in:
>
>    perms <- sapply(1:permutations, function(x) grouping[permuted.index(N,
>        strata = strata)])
>    m.ds <- numeric(permutations)
>    m.ds <- apply(perms, 2, function(x) mrpp.perms(x, dmat, indls,
>        w))
>
> by spreading permutation burden across nodes if (and only if) one could avoid
> copying dmat out to each node. It would need careful analysis.
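
A rough sketch of what spreading the permutations across workers might look
like. This is not vegan's mrpp() code: delta() is a simplified, hypothetical
stand-in for the within-group distance statistic, and the data, group labels
and permutation count are made up. It also does copy dmat out to each worker,
which is exactly the cost warned about above, so it only shows the mechanics.

```r
library(parallel)

set.seed(1)
n <- 40
dat <- matrix(rnorm(n * 3), ncol = 3)        # made-up multivariate data
grouping <- rep(c("a", "b"), each = n / 2)   # made-up group labels
dat[grouping == "b", ] <- dat[grouping == "b", ] + 1

# Dense n x n distance matrix; a dist object holds only n*(n-1)/2 values.
dmat <- as.matrix(dist(dat))

# Simplified statistic: group-size-weighted mean within-group distance.
delta <- function(grp, dmat) {
  groups <- unique(grp)
  per_group <- sapply(groups, function(g) {
    idx <- which(grp == g)
    d <- dmat[idx, idx]
    mean(d[upper.tri(d)])
  })
  w <- as.vector(table(grp)[groups]) / length(grp)
  sum(w * per_group)
}

permutations <- 199
cl <- makeCluster(2, type = "PSOCK")
clusterSetRNGStream(cl, iseed = 123)

# Each worker shuffles the labels and recomputes the statistic.
# grouping, dmat and delta are shipped to the workers as arguments.
perm_delta <- parSapply(
  cl, seq_len(permutations),
  function(i, grp, dmat, delta) delta(sample(grp), dmat),
  grp = grouping, dmat = dmat, delta = delta
)
stopCluster(cl)

observed <- delta(grouping, dmat)
p_value <- (sum(perm_delta <= observed) + 1) / (permutations + 1)
print(p_value)
```

Whether this pays off depends on the size of dmat relative to the memory
available per worker; for large n the copies may outweigh any gain.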

I guess that I have to leave that to the experts... Thanks for your advice,
though; I hope it will be picked up from this list.

thanks
Markus

-- 
Markus Neteler, PhD
Fondazione Edmund Mach (FEM) - IASMA Research and Innovation Centre
Department of Biodiversity and Molecular Ecology
Head of GIS and Remote Sensing Unit
Via E. Mach, 1 - 38010 S. Michele all'Adige (TN), Italy
Web:   http://gis.cri.fmach.it  -   http://grass.osgeo.org

> Hope this helps,
>
> Roger
>
>>
>> thanks
>> Markus
>>
>>
>
> --
> Roger Bivand
> Economic Geography Section, Department of Economics, Norwegian School of
> Economics and Business Administration, Helleveien 30, N-5045 Bergen,
> Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
> e-mail: Roger.Bivand at nhh.no
>
>