[GRASS-stats] Example code for MRPP on a multivariate data set

Roger Bivand Roger.Bivand at nhh.no
Sun Jan 9 15:18:31 EST 2011


On Sun, 9 Jan 2011, Markus Neteler wrote:

> On Sat, Jan 8, 2011 at 12:48 PM, Nikos Alexandris
> <nik at nikosalexandris.net> wrote:
> ...
>> To have some results for my work, I also ran the procedures on smaller subsets
>> (say 3000 observations instead of 18865), which takes some time but is feasible
>> @home.
>>
>> Currently there is a process running on a big cluster (thanks to a very kind
>> person who's always there). Hopefully we'll know soon enough how much time
>> this will take.
>
> The job is still running on "my" blade :) Using 68GB of RAM.
>
> Does anyone on the list have experience running R on a multicore
> system? This list is rather overwhelming for me:
>
> http://cran.r-project.org/web/views/HighPerformanceComputing.html
>
> An OpenMP approach or something similar with implicit parallelism would
> be great, since I cannot rewrite R...

Markus,

No implicit parallelism - one needs to use mechanisms in the snow package 
(using sockets is easiest and faster than PVM or MPI, but isn't fault 
tolerant) or similar to start R on the worker nodes, and to divide up the 
tiles or whatever can be parallelised into a list for execution. Depending 
on the data configuration, this can help or not (if all the nodes need 
memory for large objects, then they crowd out the machine). It all depends 
on what is being done. If a task is embarrassingly parallelisable (like 
bootstrapping, or kriging from few points to many tiles), it can be 
effective, but one needs to think through all the implications and plan 
the work to suit the problem at hand and the available hardware. I don't 
think that OpenMP can schedule arbitrary job chunks by itself either?
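
For the socket route, a minimal untested sketch - process_tile() and the 
division of the work into tile_list are placeholders here, not real code:

     library(snow)

     process_tile <- function(tile) {
         ## stand-in for the real per-tile computation
         sum(tile)
     }

     ## toy division of the work into a list of four chunks
     tile_list <- split(1:16, rep(1:4, each = 4))

     cl <- makeCluster(4, type = "SOCK")  # four local workers over sockets
     res <- parLapply(cl, tile_list, process_tile)
     stopCluster(cl)

Each worker is a separate R process, so anything a function needs beyond 
its own list element must be shipped out explicitly with clusterExport(), 
which is where the memory crowding mentioned above comes from.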

References in:

http://www.nhh.no/Admin/Public/DWSDownload.aspx?File=%2fFiles%2fFiler%2finstitutter%2fsam%2fDiscussion+papers%2f2010%2f25.pdf

I can send the example script from the paper if it would be useful, as a 
rough template for simplistic use of snow.

But I've no idea whether this application is embarrassingly 
parallelisable - I suspect not - and it seems to be using dense matrix 
methods where sparse methods might be possible. vegdist() returns a dist 
object, which is (n*(n-1))/2 in size; mrpp() then copies these values and 
uses them as matrices. To reduce memory usage with big N, mrpp() should 
be rewritten to make dmat sparse over a distance threshold. Internally, 
dmat is promoted to a full dense (n*n) representation, which here is 
quite big. Why it is bloating in your case, I don't know.
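
A back-of-the-envelope calculation with your n shows the scale (a rough 
sketch only; sizes in GiB of 8-byte doubles):

     n <- 18865
     n * (n - 1) / 2 * 8 / 2^30  # dist object: ~1.3 GiB
     n * n * 8 / 2^30            # dense n*n copy: ~2.7 GiB, and copies multiply

Looking at mrpp(), it could be parallelised in: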

     perms <- sapply(1:permutations, function(x) grouping[permuted.index(N,
         strata = strata)])
     m.ds <- numeric(permutations)
     m.ds <- apply(perms, 2, function(x) mrpp.perms(x, dmat, indls,
         w))

by spreading the permutation burden across nodes, if (and only if) one 
could avoid copying dmat out to each node. It would need careful analysis.
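
As an untested sketch with snow - note that clusterExport() ships a full 
copy of dmat to every worker, exactly the copying that would have to be 
avoided, and mrpp.perms() is not exported from vegan, so this would have 
to live inside a modified mrpp():

     library(snow)
     cl <- makeCluster(4, type = "SOCK")
     ## copies dmat and friends to each worker - the expensive step
     clusterExport(cl, c("dmat", "indls", "w", "mrpp.perms"))
     m.ds <- parApply(cl, perms, 2, function(x) mrpp.perms(x, dmat, indls, w))
     stopCluster(cl)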

Hope this helps,

Roger

>
> thanks
> Markus
>
>

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no


