[GRASS-stats] Re: [GRASS-user] Testing i.pca ~ prcomp(), m.eigensystem ~ princomp()

Thu Apr 2 03:49:56 EDT 2009

Edzer Pebesma wrote:
> Markus Metz wrote:
>   
>> I think scale and normalize are two different things.
>>     
> I believe that in statistics these two words don't have a generally
> accepted definition. They're useful as long as you explain what you mean
> by them.
>   
At least in the statistics literature I use, these two methods are 
defined differently. Scaling is like r.rescale, whereas normalization 
converts data to a mean of 0 and a stddev of 1, i.e. to the mean and 
stddev of a standard normal distribution. But usually I wouldn't worry 
too much about terms as long as it is explained what they mean.
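
To make the two definitions above concrete, here is a minimal sketch in
plain Python. The function names, the 0-255 target range and the cell
values are made up for illustration; rescale() is only a rough stand-in
for a linear range mapping of the r.rescale kind, not the module itself.

```python
from statistics import mean, pstdev

def rescale(values, new_min=0.0, new_max=255.0):
    """Linearly map values onto [new_min, new_max] (range scaling)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

band = [12.0, 15.0, 20.0, 22.0, 31.0]  # made-up cell values

scaled = rescale(band)
z = standardize(band)

print("rescaled range:", min(scaled), max(scaled))
print("standardized mean/stddev:", round(mean(z), 6), round(pstdev(z), 6))
```

Rescaling fixes the minimum and maximum, while standardization fixes the
mean and the stddev; the two only coincide by accident.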
> Well, PCA only captures covariance or correlation, meaning linear
> relationships, and it may be the case that the most interesting features
> are non-linear. 
So if a PCA does not capture non-linear relationships, I don't see how 
it could help to use PC's that explain nearly no variation in the 
dataset. And you could do e.g. a log transform first, or whatever else 
is appropriate to convert the suspected type of non-linear relation to a 
linear relation and then feed the transformed variables to a PCA.
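
As a small illustration of the transform-first idea, the sketch below
uses a made-up exponential relation between two variables and compares
the Pearson correlation before and after a log transform. The data and
the helper pearson() are assumptions for the example only; which
transform is appropriate depends on the suspected relation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [float(i) for i in range(1, 21)]
y = [math.exp(0.3 * xi) for xi in x]  # made-up exponential relation

r_raw = pearson(x, y)                          # linear measure on raw data
r_log = pearson(x, [math.log(yi) for yi in y]) # after log transform

print("correlation, raw:", r_raw)
print("correlation, log-transformed:", r_log)
```

After the transform the relation is linear, so a covariance- or
correlation-based PCA can pick it up in full.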
> For instance, NDVI is the ratio of a sum over a
> difference (or reversed?), which cannot be expressed as a linear
> combination of bands. 
Not directly, but being a normalized difference (should be standardised 
not normalized) it can be approximated with linear combinations, i.e. 
there is at least some correlation between the raw bands and a 
normalized difference calculated from them.
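
A rough sketch of that point, using the usual definition
NDVI = (NIR - red) / (NIR + red) and a handful of made-up reflectance
values (an assumption, not real imagery). It only shows that the ratio
is correlated with the raw bands, not how strong that correlation is in
practice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up reflectances, from bare soil towards dense vegetation.
red = [0.30, 0.25, 0.20, 0.15, 0.10, 0.08]
nir = [0.32, 0.35, 0.40, 0.45, 0.50, 0.55]

ndvi = [(n - r) / (n + r) for n, r in zip(nir, red)]

r_nir = pearson(ndvi, nir)
r_red = pearson(ndvi, red)

print("corr(NDVI, NIR):", r_nir)
print("corr(NDVI, red):", r_red)
```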
> The first PCA(s?) usually express brightness, only
> later ones give more interesting features resulting from more complex
> interactions of bands (notably differences) -- loadings usually have the
> same sign for the first PC, and mixed signs for later PC's. John C.
> Davis in "statistics and data analysis for geologists" called this the
> "size and shape effect". The most interesting PC's may have a EV smaller
> than 1, when they come from correlation matrices. Geochemists don't shy
> away from interpreting 7 or more factors.
>   
The question is not the number of factors, but what criteria to use to 
select and interpret the resulting PCs. What makes a PC interesting can 
be the amount of explained variance, but also the dominant variables in 
it. BTW, some textbooks recommend using only rotated PCs if a rotation 
could be performed. In a mathematical sense, the sign of the loadings is 
arbitrary: after new_var = -old_var, the absolute values of the loadings 
and the rest of the PCA result stay the same. The pattern of same signs 
for the first PC and mixed signs for later PCs is not generally valid. 
With regard to imagery, it probably only applies to surface reflectance 
or radiation measured at the sensor, and I would guess it depends on the 
number of bands and the wavelength captured by each.
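
The sign argument can be checked on a two-variable toy case, where the
eigen decomposition of the covariance matrix has a closed form. This is
only a sketch with made-up data and hypothetical helper names (cov2,
pca2); it is not how i.pca or prcomp() compute things.

```python
import math

def cov2(xs, ys):
    """Entries (a, b, c) of the population covariance matrix [[a, b], [b, c]]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / n
    c = sum((y - my) ** 2 for y in ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return a, b, c

def pca2(xs, ys):
    """Eigenvalues and unit eigenvectors of the 2x2 covariance matrix.

    Closed form, largest eigenvalue first; assumes the covariance b != 0.
    """
    a, b, c = cov2(xs, ys)
    mid = (a + c) / 2
    half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    result = []
    for lam in (mid + half, mid - half):
        vx, vy = b, lam - a  # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        result.append((lam, (vx / norm, vy / norm)))
    return result

x = [2.0, 4.0, 5.0, 7.0, 9.0]  # made-up values
y = [1.0, 3.0, 2.0, 5.0, 6.0]

orig = pca2(x, y)
flipped = pca2([-v for v in x], y)  # new_var = -old_var for x

for (lam_o, vec_o), (lam_f, vec_f) in zip(orig, flipped):
    print("eigenvalue:", lam_o, lam_f)
    print("loadings:  ", vec_o, vec_f)
```

The eigenvalues and the absolute loadings agree, but the first PC has
same-sign loadings before the flip and mixed signs after it.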
All this is however far from the i.pca eigenvalue problem, going towards 
comments on the general use of PCAs for remote sensing and as such 
probably only of interest to the grass-stats ml.