[GRASS-stats] Example code for MRPP on a multivariate data set

Sat Jan 8 06:48:50 EST 2011

Dear all,

after many trial-n-error attempts, re-reading basic R-stuff (like "?load"), 
I've finally ended up with several single scripts to run, each time, on one 
data.frame. I don't have the time currently to make this more automatic.

So, for ignorant and beginners like me, here is an example code on how to 
perform MRPP

  - on a single data.frame composed by 6 variables (in this example Landsat5 
bands)
  - with several observations (in this example 18865 obs. = sampled pixels)
  - grouped according to a factor column (in this example 7 major land cover 
classes)

Of course, the data.frame needs to be prepared (lot of fun with this :-)). I 
have 10 data.frames and a script for each data.frame. They are all available 
at any time for anybody interested (at a personal ftp-space).

If asked, I can also provide the vector/raster file of the samples (extracted 
from a postfire Landsat5 TM 2007 imagery in a small region northern of Athens 
in Greece).

The test could be useful to descrive the within- and between-groups 
heterogeneity. And why not to check how the statistics change after some 
transformations of the original data?

First, you need to have the "vegan" package installed. So, for (K)Ubuntu 
user's this will be:

--%<---
# superuser
sudo R

# get the package
install.packges ( "vegan" , dependencies = TRUE )

# quit and run a normal R-session
q()
R #or pick your favourite gui
--->%--

Then move to the test:
(ATTENTION: this 18865 observations is an overkill for home-machines)

--%<---
# load data.frame
load ( "samples_postfire_lsat5.RData" )

# check structure
str ( samples_postfire_lsat5 )

'data.frame':   18865 obs. of  7 variables:
 $ Band 1: int  199 119 175 170 110 109 102 129 138 122 ...
 $ Band 2: int  120 63 104 88 57 60 53 68 75 66 ...
 $ Band 3: int  132 77 118 109 70 78 68 82 89 76 ...
 $ Band 4: int  109 76 96 90 71 79 72 70 70 62 ...
 $ Band 5: int  151 126 140 142 118 123 127 125 120 120 ...
 $ Band 7: int  90 75 91 90 72 77 76 89 86 92 ...
 $ Class : Factor w/ 7 levels "Urban","Mineral extr. sites",..: 1 1 1 1 1 1 1 
1 1 1 ...

# mrpp() function is in:
library ( vegan )

# mrpp on 6 variables, grouped according to column named "Class"
samples_postfire_lsat5.mrpp <- mrpp (
  samples_postfire_lsat5[,1:6] ,
  grouping = samples_postfire_lsat5[["Class"]]
)

# see the result!
#samples_postfire_lsat5.mrpp

# save it
save (samples_postfire_lsat5.mrpp , file = "samples_postfire_lsat5.mrpp.RData" 
)

# euclidean (veg)distance - ATTENTION: default distance to estimate is "Bray"
samples_postfire_lsat5.vegdist <- vegdist (
  samples_postfire_lsat5[,1:6] ,
  method="euclidean"
)

# meandist
samples_postfire_lsat5.md <- meandist (
  samples_postfire_lsat5.vegdist ,
  samples_postfire_lsat5[["Class"]]
)

# check
summary ( samples_postfire_lsat5.md ) 

# save results
save ( samples_postfire_lsat5.md ,
  file = "samples_bitemporal_lsat5.md.RData"
)
--->%--

To have some results for my work, I also run the procedures on smaller subsets 
(say 3000 observations instead of 18865) which takes some time but is feasible 
@home.

Currently there is a process running on a big cluster (thanks to a very kind 
person who's always there). Hopefully we'll know soon enough how much time 
this will take.

Maybe this piece of info could go in the wiki at some point if it is of any 
value.

Kind regards, Nikos