[GRASS-user] Error while using variance in r.series function

Fri Feb 25 07:03:19 EST 2011

Glynn wrote:
> r.series returns the actual (population) variance. The figure
> you're looking for is the sample variance, which provides an
> estimation of the population variance from a sample of the
> population.
> 
> You can calculate the sample variance using e.g.:
> 
>     r.series input=... output=out.pvar,out.count method=variance,count
>     r.mapcalc "out.svar = out.pvar * out.count / (out.count - 1)"
> 
> As you're not the first person to have asked this, we might
> want to add alternate versions of the variance and stddev
> methods (and possibly skewness and kurtosis).

I thought about that a lot for r.univar and r.in.xyz, but in the
end decided it was best to just use the biased estimator (n) and
document that in the help page+code. For r.univar n is typically
so big that it doesn't matter much. For r.series where n is
usually quite small it can have a big effect.
For r.in.xyz n can be very big or small depending on the user's
application, but in the end one of the two must be chosen, so I
rationalized that all the data that was going to arrive had
already arrived, and used the population version.
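
To put rough numbers on "doesn't matter much" versus "a big effect":
the two estimators differ only by the factor n / (n - 1). A minimal
Python sketch follows. The helper name correction_factor and the
sample values are made up for illustration and are not part of any
GRASS module.

```python
import statistics


def correction_factor(n):
    """Ratio of sample variance (divisor n - 1) to population variance (divisor n)."""
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    return n / (n - 1)


# Small n (a short map series) versus large n (cells in a big raster).
for n in (3, 10, 1000000):
    print(n, correction_factor(n))

# Made-up values standing in for one cell across four input maps.
values = [2.0, 4.0, 4.0, 5.0]
pvar = statistics.pvariance(values)            # divides by n
svar = pvar * correction_factor(len(values))   # rescaled to divide by n - 1
print(pvar, svar, statistics.variance(values))
```

With three maps the factor is 1.5; with a million cells it is
indistinguishable from 1 for most purposes.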

n.b. A goal of both r.univar and r.in.xyz was to give the user
enough raw materials to put the components together (with e.g.
r.mapcalc) to make more complicated statistical tests if needed.

What I struggled with, and still don't have a solid answer for,
is: in the context of cells (r.series, r.in.xyz) and map arrays
(r.univar), what does the entire population encompass?

I would guess that the typical application of the resulting
(mean+variance) raster maps would be to aggregate a time series,
or a series of different analytical methods based on the same
starting data. Those inputs are drawn from an infinite population
of possible times or methods, so maybe the sample estimator is
the better fit there, not the population estimator.

But maybe the input data is all you are concerned with, and you
are not planning to use the data to extrapolate to the rest of
the "population" (e.g. by a linear regression), in which case
you might claim to have the full population data in hand.

I'd be interested to hear from a proper stats-philosopher about
what would be the favoured tack.

For r.series, if there's an easy r.mapcalc method to get the
unfavoured metric, I could live with the one used being documented
and an example in the help page on how to get the other if that's
what the user wants.
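
For anyone who wants to check the per-cell arithmetic of Glynn's
recipe outside GRASS, here is a rough pure-Python sketch. It is not
how r.series is implemented. It assumes that null cells (None here)
are skipped when computing the variance and count, and it returns
None where fewer than two values are available, since the n - 1
divisor would be zero. The function names and the tiny 2x2 "maps"
are hypothetical.

```python
def cell_stats(stack, row, col):
    """Return (population variance, count) of the non-null values at one cell."""
    vals = [layer[row][col] for layer in stack if layer[row][col] is not None]
    n = len(vals)
    if n == 0:
        return None, 0
    mean = sum(vals) / n
    pvar = sum((v - mean) ** 2 for v in vals) / n
    return pvar, n


def sample_variance_map(stack):
    """Apply pvar * n / (n - 1) cell by cell; None where n < 2 (assumption)."""
    rows, cols = len(stack[0]), len(stack[0][0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            pvar, n = cell_stats(stack, r, c)
            out_row.append(pvar * n / (n - 1) if n > 1 else None)
        out.append(out_row)
    return out


# Three made-up 2x2 maps; None marks a null cell.
stack = [
    [[1.0, 2.0], [None, 4.0]],
    [[3.0, 2.0], [None, 6.0]],
    [[5.0, 2.0], [7.0, None]],
]
svar = sample_variance_map(stack)
print(svar)
```

Carrying the count along per cell matters because, with nulls in the
inputs, n can differ from one cell to the next.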

Hamish