[GRASS-dev] PCA question

Tue Jun 26 03:02:40 PDT 2012

Michael Barton:

> > Thanks Marcus,
> > Please see my post below, starting with "I think I've figured it out..."

Markus Metz: 

> Yes, I saw, but the formula below "I think I've figured it out..."
> looked strange to me.

> > Since I'm working in GRASS instead of R,

> [ should be "and", not "instead of" ;) ]

> Here is an example based on on the landsat imagery in mapset landsat,
> North Carolina sample dataset:

> i.pca without rescaling of the output:
> i.pca input=lsat7_2000_10,lsat7_2000_20,lsat7_2000_30
> output_prefix=lsat7_2000_pc rescale=0,0

> The eigenvectors are
> 0.4494, 0.5128, 0.7315
> 0.6726, 0.3447,-0.6548
> 0.5879,-0.7863, 0.1901

> this is a square 3x3 matrix. R gives the inverse matrix as

> 0.4493372, 0.6726549, 0.5879235
> 0.5128129, 0.3446144,-0.7862660
> 0.7315070,-0.6548316, 0.1899992

> Test with band 10:
> the general formula is
> b1' = (ev1-1' * pc1 + ev1-2' * pc2 + ev1-3' * pc3) + mean(b1)

> The mean of lsat7_2000_b10 is 79.925.
> r.mapcalc "lsat7_2000_10.pc = (lsat7_2000_pc.1 * 0.4493372 +
> lsat7_2000_pc.2 * 0.6726549 + lsat7_2000_pc.3 * 0.5879235) + 79.925"

> Difference to original:
> r.mapcalc "diff = lsat7_2000_10 - lsat7_2000_10.pc"
> 
> r.univar diff -g
> n=250325
> null_cells=0
> cells=250325
> min=-0.00140021304849824
> max=0.00723340429107111
> range=0.00863361733956935
> mean=-1.8473599814612e-05
> 
> That's close enough, given that the eigenvectors are printed by i.pca
> with limited precision (only 4 decimal places).

The math demonstrated above is pretty clear and Correct!  The above 
should/will go to the wiki, i.e. in a separate "Inverse PCA" page.  Hopefully 
I'll get to that as soon as possible (=next month!).

A general note though:

in a new produced PC-coordinate system, one has to be very careful about its 
interpretation.  There is no prior knowledge on what (new values) is what 
(i.e. relation to land cover) exactly, as in the way we do understand that 
reflectance, for example, can be associated -- and explained -- with specific 
land cover types or other surface features.

We have to keep in mind that the re-distribution of the information contained 
in the input bands, does strictly follow math rules (i.e. achieve a minimal 
RMSE, extract highest variances into first "principal" components).  It does 
not care about class labels.

Any manipulation of the data, prior to and while aiming at an inverse PCA, 
might mix-up things in a way that the final data are very difficult to interpret 
and, especially, to automatically post-process (i.e. use well known Remote 
Sensing algorithms).

In the case of a pan-sharpening attempt, I would thoroughly test.

> Without using R, the inverse of a 3x3 matrix can be manually
> calculated as follows:
> 
> original matrix:
> a b c
> d e f
> g h k
> 
> inverse matrix:
> (ek - fh)   (ch - bk)   (bf - ce)
> (fg - dk)   (ak - cg)   (cd - af)
> (dh - eg)  (gb - ah)   (ae - bd)
> 
> each entry in this inverse matrix needs to be multiplied with
> 
> 1 / (a(ek - fh) - b(kd - fg) + c(dh - eg))
> 
> in order to get the transformation matrix to be applied to the pc scores.
> 
> The formula is more complex for matrices with more than 3 dimensions,
> e.g. for i.pca with 6 input bands.
> 
> > So my questions are:
> > 
> > 1) Do I have the inversion correct in terms of how it needs to be
> > calculated in GRASS?
> I don't think so. Using your formula and testing for the difference gives me
> min=-546.255208609669
> max=174.771853180844
> range=721.027061790513
> mean=79.9249815357317
> 
> a bit too much IMHO.

It is a "bit" more than a "bit" too much...

> > 2) Even though the mean of all bands seems to be subtracted from each
> > factor score, does inverting the matrix mean that the mean of *each* band
> > is added back to its transformed value? Adding either the mean of all
> > original bands or 1 original band seems to produce values that are much
> > higher than the original, and so need to be rescaled. Maybe this is OK.

> The mean of each band is added back to each recalculated band. See
> above r.mapcalc formula for band 10.

> > 3) I do not normalize or rescale in i.pca. This seems to make it easier to
> > do the inverse PCA with fewer calculations. Is there any reason I
> > *should* rescale and/or normalize?

> Normalization applies to the input of i.pca and is by default not
> done. It needs to be done for heterogenous input data such as e.g.
> rainfall, temperature, NDVI, etc. Rescaling is automatically applied
> to the output of i.pca unless explicitly disabled with rescale=0,0.

Adding my 1 old drachma:

PCA works on/with global stats.  One would want to normalise exactly when it 
comes to "heterogenous" input data, speak largely different ranges of input 
data. For example there is an input "band" which ranges from 0 to 100 and 
another one which ranges from 0 to 100000.  Normalisation would help to make 
different inputs comparable.  That is mainly the reason to apply "scaling", as 
far as I understand PCA.

Note, though, that a normalisation will swipe out "subtle" differences between 
input "bands" of "similar" range which might or might _not_ be 
useful/destructive -- it always depends on what you want to do with your data.  
There is no guarantee that you don't want to normalise even for inputs of the 
same nature/unit.

I guess that in the case of a pan-sharpening attempt, one would not want to 
normalise the input data!

[rest deleted]

Nikos