[GRASS-stats] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Nikos Alexandris nikos.alexandris at felis.uni-freiburg.de
Tue Mar 3 01:27:41 EST 2009


Apologies for starting a new thread. My post is long and I thought it's
better to start an extra-testing thread. I did some tests and I present
them below in the following order:

 1. Briefly about PCA
 2. The modules/functions that implement PCA in GRASS & R
 3. My claims (Entitled Comments)
 4. Evidence (=the numbers derived from i.pca, prcomp, princomp,
m.eigensystem using some MODIS surface reflectance bands).


Finally all is clear _but_ one thing: the only "unknown" variable (to
me) is still the Eigenvalue provided by i.pca. I can't nail this one:
*** how is it calculated? *** Looks like it is some _weighted_
variance... !??

The only thing I noticed with some testing (not thoroughly though) is
that the Eigenvalue (=Variance) reported by "i.pca" is ~0.58 of the
variance reported by "prcomp()" (sdev^2) !? That is:

     sqrt(Eigenvalue(i.pca)) / sdev(prcomp) = 0.58xxx


Hamish, if you have the time (or anyone , could you please "translate"
the code in "grass6_dev/imagery/i.pca/main.c", concerning the "eigval"
variable, in some pseudocode. It is really the last thing to clarify ( I
think ).


Kindest regards, Nikos


------------------------
# 1. PCA ### ### ### ###

*. PCA is performed via:

   - method 1 (eigenvectors) : by determining the eigenvalues and
eigenvectors of a given covariance or correlation matrix.

         - covariance matrix -> non-standardised
         - correlation matrix -> standardised

   - method 2 (SVD): based on by a singular value decomposition of the
data matrix. This is a more general solution to PCA than the
"eigenvectors".

     ...The difference between the two methods is explained in [1]...
     ...SVD is used for numerical accuracy... (R Documentation)
     ...Results between the two methos can differ a bit...

     ...Data centering reduces the square mean error of 
*approximating* the input data
     ...Data scaling 


*. PCA returns:

   - eigenvalues (=variance _or_ sdev^2)
   - eigenvectors (=loadings, rotation, weighting coefficient)



-------------------------------------------
# 2. Implementations of PCA ### ### ### ###

* R's implementation of:

   - method 1 (eigen) is the _princomp()_ function

      -*applies*:
         \data centering
         \data scaling

      -returns:
         \sdev  -  standard deviation of principal components (that is:
the sqrt(Variance)
         \loadings  -  eigenvectors


   - method 2 (SVD) is the _prcomp()_ function

      -options:
         \data centering (center= TRUE | FALSE) [default is TRUE]
         \data scaling (scale=TRUE | FALSE) [default is FALSE]

      -returns:
         \sdev (as above)
         \rotation (same as loadings)



* GRASS' implementation of:

   - method 1 (eigen) is the _m.eigensystem_ module

      -returns:
         E -> eigenvalue -> Variance(E) -> sdev(E)^2
         V -> eigenvectors associated with E
         N -> normalized eigenvectors V --> This is _data centering_.
         W -> N vector multiplied by the square root of the magnitude of
      the eigenvalue (E), in other words: W = N * sdev(E)
         

   - method 2 (SVD) is the   _i.pca_ module

      -returns:
         \Eigenvalues (latest version by Hamish)
         \Eigenvectors



-----------------------------
# 3. Comments ### ### ### ###

# 1: it is (?) meaningless to compare the numerical results
(eigenvectors) of "i.pca" with the results of "m.eigensystem" since they
implement two different algebraic solutions to PCA with different
options (data centering is NOT performed by i.pca). The different
algebraic solutions don't give big differences. Data centering does.


# 2: "i.pca" corresponds to "prcomp( ,center=FALSE,scale=FALSE)" and
"m.eigensystem" corresponds to "princomp()"


# 3: The standard deviations (sdev) reported by prcomp(,center=TRUE,
scale=FALSE) _match_ the variances (=eigenvalues = sdev^2) reported by
"m.eigensystem".

# 4: Obviously, "prcomp()" with center=TRUE and scaling=FALSE by default
gives (almost) same results as "princomp()".



------------------------------------------
# 4. Comparing PCA results ### ### ### ###

# prcomp() #############################################################

### Details for prcomp {stats} from R Documentation
The calculation is done by a singular value decomposition of the
(centered and possibly scaled) data matrix, not by using eigen on the
covariance matrix. This is generally the preferred method for numerical
accuracy.
## more details in [2]
#Default settings of prcomp(): _center=TRUE_ and _scale=FALSE_

### with center=FALSE, scale=FALSE
prcomp(mod07, center=FALSE, scale=FALSE) <<== this corresponds to i.pca

Standard deviations:
[1] 4288.3788  476.8904  114.3971

Rotation:
                                    PC1        PC2        PC3
MOD2007_242_500_sur_refl_b02 -0.6353238  0.7124070 -0.2980602
MOD2007_242_500_sur_refl_b06 -0.6485551 -0.2826985  0.7067234
MOD2007_242_500_sur_refl_b07 -0.4192135 -0.6423066 -0.6416403


### with center=TRUE, scale=FALSE
prcomp(mod07, center=TRUE, scale=FALSE)
Standard deviations:
[1] 857.5749 436.0928 108.5085

Rotation:
                                   PC1         PC2        PC3
MOD2007_242_500_sur_refl_b02 0.4184468  0.83881578 -0.3482677
MOD2007_242_500_sur_refl_b06 0.7249935 -0.07751872  0.6843794
MOD2007_242_500_sur_refl_b07 0.5470710 -0.53886820 -0.6405735


### with center=TRUE, scale=TRUE
prcomp(mod07, center=TRUE, scale=TRUE)
Standard deviations:
[1] 1.5030740 0.8397807 0.1885121

Rotation:
                                   PC1        PC2        PC3
MOD2007_242_500_sur_refl_b02 0.4807060  0.8202848 -0.3099267
MOD2007_242_500_sur_refl_b06 0.6561893 -0.1020540  0.7476634
MOD2007_242_500_sur_refl_b07 0.5816677 -0.5627768 -0.5873201


### with center=FALSE, scale=TRUE
prcomp(mod07, center=FALSE, scale=TRUE)
Standard deviations:
[1] 1.71889117 0.20787871 0.04689961

Rotation:
                                    PC1         PC2        PC3
MOD2007_242_500_sur_refl_b02 -0.5747640  0.74015136 -0.3490305
MOD2007_242_500_sur_refl_b06 -0.5812896 -0.06907305  0.8107597
MOD2007_242_500_sur_refl_b07 -0.5759763 -0.66888331 -0.4699430


# i.pca ############################################################## 

# perform pca
i.pca input=MOD07_b02,MOD07_b06,MOD07_b07 output=TEST

Eigen values, (vectors), and [percent importance]:
PC1 6307563.04 ( -0.64 -0.65 -0.42 ) [ 98.71% ]
PC2 78023.63 ( -0.71 0.28 0.64 ) [ 1.22% ]
PC3 4504.60 ( -0.30 0.71 -0.64 ) [ 0.07% ]

### Comment: Comparing with prcomp()'s results it is obvious that
"i.pca" implements the SVD method _without_ centering and _without_
scaling the data prior to the analysis. ###




# princomp() - cov #####################################################

# Details for princomp {stats} from R Documentation
The calculation is done using eigen on the correlation or covariance
matrix


### using _covariance_ matrix
# prints only standard deviations
princomp(mod07)

Call:
princomp(x = mod07)

Standard deviations:
  Comp.1   Comp.2   Comp.3 
857.5737 436.0922 108.5083 

 3  variables and  350596 observations.


# get loadings (=rotation = eigenvectors)
(princomp(mod07))$loadings

Loadings:
                             Comp.1 Comp.2 Comp.3
MOD2007_242_500_sur_refl_b02 -0.418  0.839  0.348
MOD2007_242_500_sur_refl_b06 -0.725        -0.684
MOD2007_242_500_sur_refl_b07 -0.547 -0.539  0.641

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000



# m.eigensystem - cov #################################################

# Determine eigen values and vectors of _covariance_ matrix
## Marked with "#" are the _N_'s that correspond to the eigenvectors
derived from R's princomp() ##
(echo 3; r.covar MOD07_b02,MOD07_b06,MOD07_b07)|m.eigensystem

r.covar: complete ...
100%
E    778244.0258462029          .0000000000    79.20
V          .5006581842          .0000000000
V          .8256483300          .0000000000
V          .6155834548          .0000000000
N      #   .4372107421   #      .0000000000
N      #   .7210155161   #      .0000000000
N      #   .5375717557   #      .0000000000
W       385.6991853500          .0000000000
W       636.0664787886          .0000000000
W       474.2358050886          .0000000000
E    192494.5769628266          .0000000000    19.59
V         -.8689798010          .0000000000
V          .0996340298          .0000000000
V          .5731134848          .0000000000
N      #  -.8309940700   #      .0000000000
N      #   .0952787255   #      .0000000000
N      #   .5480609638   #      .0000000000
W      -364.5920328433          .0000000000
W        41.8027823088          .0000000000
W       240.4573848757          .0000000000
E     11876.4548199713          .0000000000     1.21
V          .2872248982          .0000000000
V         -.5731591248          .0000000000
V          .5351449518          .0000000000
N      #   .3439413070   #      .0000000000
N      #  -.6863370819   #      .0000000000
N      #   .6408165005   #      .0000000000
W        37.4824307850          .0000000000
W       -74.7964308085          .0000000000
W        69.8356366100          .0000000000



# princomp() - cov #####################################################

### using _correlation_ matrix
princomp(mod07, cor=TRUE)

Call:
princomp(x = mod07, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3 
1.5030740 0.8397807 0.1885121 

 3  variables and  350596 observations.


# get loadings
(princomp(mod07, cor=TRUE))$loadings

Loadings:
                             Comp.1 Comp.2 Comp.3
MOD2007_242_500_sur_refl_b02 -0.481  0.820  0.310
MOD2007_242_500_sur_refl_b06 -0.656 -0.102 -0.748
MOD2007_242_500_sur_refl_b07 -0.582 -0.563  0.587

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000




# m.eigensystem - corr #################################################

# Determine eigen values and vectors of _correlation_ matrix
## Marked with "#" are the _N_'s that correspond to the eigenvectors
derived from R's princomp() ##
(echo 3; r.covar -r MOD07_b02,MOD07_b06,MOD07_b07)|m.eigensystem

r.covar: complete ...
 100%
E         2.2915877718          .0000000000    76.39
V         -.5755655569          .0000000000
V         -.7660355041          .0000000000
V         -.6809380186          .0000000000
 N     #   -.4896413269   #      .0000000000
 N     #   -.6516766616   #      .0000000000
 N     #   -.5792830912   #      .0000000000
W         -.7412186091          .0000000000
W         -.9865075560          .0000000000
W         -.8769182329          .0000000000
E          .6740687010          .0000000000    22.47
V          .8667178982          .0000000000
V         -.1116525720          .0000000000
V         -.6069908335          .0000000000
 N      #   .8145815825   #      .0000000000
 N      #  -.1049362531   #      .0000000000
 N      #  -.5704780699   #      .0000000000
W          .6687852213          .0000000000
W         -.0861544341          .0000000000
W         -.4683721194          .0000000000
E          .0343435272          .0000000000     1.14
V          .2486404469          .0000000000
V         -.6006166822          .0000000000
V          .4655120098          .0000000000
 N      #   .3109794470   #      .0000000000
 N      #  -.7512029762   #      .0000000000
 N      #   .5822249325   #      .0000000000
W          .0576307320          .0000000000
W         -.1392129859          .0000000000
W          .1078979635          .0000000000


[1] http://www.snl.salk.edu/~shlens/pub/notes/pca.pdf

[2] Copy-pasted from R Documentation

### prcomp() returns the following:

   1. sdev
          the standard deviations of the principal components (i.e.,
          the square roots of the eigenvalues of the cov./cor. matrix,
          though the calculation is actually done with the singular
          values of the data matrix).

   2. rotation
          the matrix of variable loadings (i.e., a matrix whose
          columns contain the eigenvectors). The function princomp
          returns this in the element loadings.

   * Note

          The signs of the columns of the rotation matrix are
          arbitrary, and so may differ between different programs
          for PCA, and even between different builds of R.



More information about the grass-stats mailing list