[GRASS-user] Comparison between "i.pca" and R's "prcomp()": explanations and questions

Nikos Alexandris nikos.alexandris at felis.uni-freiburg.de
Fri Feb 6 08:12:00 EST 2009


OK, I think I got it :-)
# Apologies for cross-posting)

* First read about PCA [1] [2] [3] - extra [4]
* Understand the basic steps of the algorithm

-----------------------------------------------------------------------
  Usually:

  1. organizing a dataset in a matrix

  2. data centering (=subtracting the dimensions means from tehmself so
the dataset has mean=0)

  3. calculate
     (a) the covariance matrix (=non-standartised PCA) _or_
     [alternatively (b) the correlation matrix (=standartised PCA)]

# The principal components are the eigenvectors of the covarianve matrix

  4. calculate
     (a) the eigenvectors and eigenvalues of the covariance matrix _or_
     (b) the SVD(=singular value decomposition)

  5. sort variances in decreasing order (decreasing eigenvalue)

  6. project original dataset *signals* (PC's = eigenvector *
input-data
-----------------------------------------------------------------------

* Understand what the PCA process produces and *make clear* what the
terms mean:

  * eigenvectors = loadings = rotation
  * eigenvalues = importance = proportion of variance

=======================================================================

Then you will probably confirm that grass' "i.pca" module performs
exactly the same as R's "prcomp" *with* the option "center=FALSE". That
is, "i.pca" follows the "SVD" method and does _NOT_ subtract off the
mean(=average across each dimension) from each data dimension so that
the whole data set has "mean=0" before proceeding to calculate the
covariance matrix. The question that comes naturally is why NOT? 

Usually, all tutorials and PCA applications do perform data centering
with the argument that it reduces the square mean error of
*approximating* the input data!


Questions:
 Why is the SVD method used?
 Why is "data centering" not applied? What's the motivation behind it?
 Why are the "variances" not printed? (=Bug!?)
 Why not change it to have a nice printed eigenvector matrix?

=======================================================================

An example to compare grass' i.pca with R's prcomp:

# i.pca on 6 MODIS bands
i.pca
input=MOD2006_239_500_sur_refl_b02,MOD2006_239_500_sur_refl_b06,MOD2006_239_500_sur_refl_b07,MOD2007_242_500_sur_refl_b02,MOD2007_242_500_sur_refl_b06,MOD2007_242_500_sur_refl_b07 output=pca.500m.b267.06N_b267.07 rescale=0,0


Computing Means for band 1...
 100%
Computing Means for band 2...
 100%

[...]

Computing row 1 of covariance matrix...
 100%
Computing row 2 of covariance matrix...
 100%
[...]

Transforming <pca.500m.b267.06N_b267.07.1>...
 100%
Transforming <pca.500m.b267.06N_b267.07.2>...
 100%
[...]

Eigen values:
( -0.48 -0.45 -0.28 -0.45 -0.45 -0.29 )
( -0.48 0.24 0.37 -0.53 0.26 0.47 )
( -0.52 -0.37 -0.28 0.50 0.49 0.16 )
( 0.45 -0.32 -0.50 -0.33 0.10 0.58 )
( 0.03 -0.48 0.54 0.29 -0.45 0.42 )
( 0.26 -0.52 0.40 -0.28 0.52 -0.39 )

-----------------------------------------------------------------------
# Notice that the printed-out message "Eigen values" here is wrong. It
should be "Eigen vectors" (it is line 43 in i.pca's support.c code).
However it is correctly logged in the history of the produced PC's. I've
already filed a ticket about this in grass-trac [1] when I did not
really understand what is right and what is wrong. When I will find some
time I will submit a "diff" file.


# Important for the interpretation of i.pca's output is that the "Eigen
vectors" are reported *per-row* for each Principal Component. The order
of the columns correspond to the order of the input data so that one can
see *how much* of each of the input-bands contributes to each PC.


# That is (in the example presented here):

       b2.2006 b6.2006 b7.2007 b2.2007 b6.2007 b7.2007
PC1: (  -0.48   -0.45   -0.28   -0.45   -0.45   -0.29 )
PC2: (  -0.48    0.24    0.37   -0.53    0.26    0.47 )
PC3: (  -0.52   -0.37   -0.28    0.50    0.49    0.16 )
PC4:(    0.45   -0.32   -0.50   -0.33    0.10    0.58 )
PC5: (   0.03   -0.48    0.54    0.29   -0.45    0.42 )
PC6: (   0.26   -0.52    0.40   -0.28    0.52   -0.39 )


# The signs of the loadings are *probably* arbitrary. As one can see,
the i.pca-produced loadings differ from R's output (in fact they are
sometimes, on a per-column basis, exactly the opposite).


# With respect to the "Eigen values" which are definitely calculated but
for some reason _NOT_ printed I will file another ticket. With my poor
programing skills I have reasons to believe that it's just 2-3 lines in
i.pca's support.c code (in order to print the "eigval" variable !?).
-----------------------------------------------------------------------

# repeat in R


# load reflectances of 2006
modis_2006 <- readRAST6 (c('MOD2006_239_500_sur_refl_b02',
'MOD2006_239_500_sur_refl_b06', 'MOD2006_239_500_sur_refl_b07'))


# load reflectances of 2007
modis_2007 <- readRAST6(c('MOD2007_242_500_sur_refl_b02',
'MOD2007_242_500_sur_refl_b06', 'MOD2007_242_500_sur_refl_b07'))


# combine 2006 and 2007 bands in one data set (=data.frame)
modis_2006_2007 <- cbind.data.frame(modis_2006,modis_2007)


# check structure
str(modis_2006_2007)


# remove remaining "x", "y" columns
modis_2006_2007[4] <- NULL
modis_2006_2007[4] <- NULL
modis_2006_2007[7] <- NULL
modis_2006_2007[7] <- NULL
str(modis_2006_2007)


# workaround for NA's
modis_2006_2007.values <- which(!is.na(modis_2006_2007
$MOD2006_239_500_sur_refl_b02) & !is.na(modis_2006_2007
$MOD2006_239_500_sur_refl_b06) & !is.na(modis_2006_2007
$MOD2006_239_500_sur_refl_b07) & !is.na(modis_2006_2007
$MOD2007_242_500_sur_refl_b02) & !is.na(modis_2006_2007
$MOD2007_242_500_sur_refl_b06) & !is.na(modis_2006_2007
$MOD2007_242_500_sur_refl_b07))


#
modis0607 <- modis0607[modis_2006_2007.values, ]


# check structure
str(modis0607)


# perform pca
# *without* data centering
prcomp.modis0607 <- prcomp(modis0607, center=FALSE)


# check structure
str(prcomp.modis0607)


# get eigenvalues (=importance  or  R2  or  Proportion of Variance)
# note: pc's eigenvalues and eigenvectors are reported column-wise
summary(prcomp.modis0607)

Importance of components:
                            PC1      PC2      PC3      PC4      PC5     PC6
Standard deviation     6099.923 5.82e+02 3.77e+02 2.77e+02 1.18e+02 5.9e+01
Proportion of Variance    0.985 8.96e-03 3.75e-03 2.04e-03 3.70e-04 9.0e-05
Cumulative Proportion     0.985 9.94e-01 9.98e-01 1.00e+00 1.00e+00 1.0e+00


# get eigenvectors (=loadings)
prcomp.modis0607$rotation

                                    PC1        PC2        PC3        PC4         PC5        PC6
MOD2006_239_500_sur_refl_b02 -0.4807220  0.4825238  0.5207411 -0.4452700  0.03052128  0.2563374
MOD2006_239_500_sur_refl_b06 -0.4470113 -0.2356211  0.3660947  0.3214877 -0.48412952 -0.5224008
MOD2006_239_500_sur_refl_b07 -0.2806118 -0.3715298  0.2788246  0.4958771  0.54502126  0.4031590
MOD2007_242_500_sur_refl_b02 -0.4450009  0.5302019 -0.5006461  0.3308220  0.29127415 -0.2755536
MOD2007_242_500_sur_refl_b06 -0.4539149 -0.2634320 -0.4914164 -0.1027778 -0.45166993  0.5181755
MOD2007_242_500_sur_refl_b07 -0.2937342 -0.4723437 -0.1578723 -0.5766105  0.42283865 -0.3929577


# perform pca *with* data centering
prcomp.modis0607.centered <- prcomp(modis0607)


#
summary(prcomp.modis0607.centered)

Importance of components:
                            PC1     PC2      PC3      PC4      PC5      PC6
Standard deviation     1176.169 491.280 373.4322 264.8188 1.17e+02 58.04874
Proportion of Variance    0.747   0.130   0.0753   0.0379 7.38e-03  0.00182
Cumulative Proportion     0.747   0.878   0.9529   0.9908 9.98e-01  1.00000


#
prcomp.modis0607.centered$rotation

                                   PC1         PC2        PC3         PC4         PC5        PC6
MOD2006_239_500_sur_refl_b02 0.2804589  0.47785563  0.5596947 -0.56809295  0.00469019 -0.2387197
MOD2006_239_500_sur_refl_b06 0.4926638 -0.11460778  0.3822267  0.30434025  0.47318369  0.5305887
MOD2006_239_500_sur_refl_b07 0.4028082 -0.22631870  0.2966538  0.47206675 -0.54825334 -0.4184414
MOD2007_242_500_sur_refl_b02 0.3036884  0.72617336 -0.3987563  0.24955951 -0.28812664  0.2759387
MOD2007_242_500_sur_refl_b06 0.5181977 -0.08253658 -0.4944620 -0.08331658  0.46348579 -0.5083360
MOD2007_242_500_sur_refl_b07 0.3944810 -0.41612363 -0.2216797 -0.54090553 -0.42149429  0.3896764

================================================================================================

Kind regards, Nikos


###########
References:

[1] http://en.wikipedia.org/wiki/Principal_components_analysis

[2] http://www.snl.salk.edu/~shlens/pub/notes/pca.pdf

[3] http://trac.osgeo.org/grass/ticket/430

[4] A very nice practical use of PCA in burned area mapping (by Nikos
Koutsias et al.):
A forward/backward principal component analysis of Landsat-7 ETM+ data
to enhance the spectral signal of burnt surfaces




More information about the grass-user mailing list