[GRASS5] [bug #2380] (grass) unexpected '(' in r.univar

Sun Apr 18 05:07:41 EDT 2004

> > a) Population vs. sample variance (& standard deviation)
> > 
> > r.series and r.univar use sum((xi-mean(x))^2)/n
> >    (i.e. population variance aka "sigma^2")
> > 
> > while 
> > 
> > s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
> >    (i.e. sample or bias-corrected variance aka "s^2")
> > 
> > 
> > For consistency we should pick one way & document it.
> 
> Or consistently offer both options.

I think that's overkill (see below). As long as we document which one
we report, a user who knows n can always derive the other in the rare
case they need it.
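
To make the "derive the other one" step concrete, here is a minimal
Python sketch. It is not GRASS code, and the function names are made up
for illustration. Both forms share the same sum of squares, so the
conversion only needs n:

```python
def variances(values):
    """Return (n, population variance, sample variance) for a sequence."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for the n-1 form")
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    return n, ss / n, ss / (n - 1)


def population_to_sample(var_pop, n):
    """Convert sum(...)/n into sum(...)/(n-1)."""
    return var_pop * n / (n - 1)


def sample_to_population(var_samp, n):
    """Convert sum(...)/(n-1) into sum(...)/n."""
    return var_samp * (n - 1) / n
```

The two results differ by the factor n/(n-1), which is close to 1 for a
map with many cells but not for a handful of values.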

> > The difference
> > between n and n-1 for big maps with huge numbers of cells isn't very
> > much, so this isn't too critical, but someone might need to do
> > analysis on very small/sparse maps one day.... I've used n-1, for no
> > great reason besides the current region is a 'sample' of a larger
> > location. Can any stats people comment?
> 
> Note that, for r.series, n is the number of maps, not the number of
> cells, so n vs n-1 would be more significant there.

Pulling out the stats textbook...
"Regardless of sample size, however, it is good practice to divide a
sum by n-1 when computing a variance or standard deviation. It should be
assumed that the symbol s^2 refers to a variance obtained by the
division of the sum of squares by the degrees of freedom, as the
quantity n-1 is generally called. The only time when division of the sum
of squares by n is appropriate is when the interest of the investigator
is limited to the sample at hand and to its variance and standard
deviation as descriptive statistics of the sample, in contrast to using
these as estimates of the population parameters. In the rare cases in
which the investigator possesses data on the entire population, division
by n is justified because then the investigator is not estimating a
parameter, but is evaluating it."
-- "Biometry" by Sokal & Rohlf, 3rd Ed. p.53
 (proudly from SUNY Stony Brook!)

So you could argue that we could use n for r.series, but not for
r.univar. Still, I think we should stick with n-1 for everything: the
degrees-of-freedom correction is what you want when your sample size
is small, AFAIK.

> We would also need to consider implementing median (quartiles,
> percentiles) efficiently for large amounts of data. Sorting the entire
> set is usually overkill, but binning and sorting requires two passes.
> 
> Similar issues apply to computing the mode.

This remains on the TODO list of the newly uploaded r.univar(2) now in
CVS. Feel free to give it a shot; I'm probably not going to have time to
add these extended stats anytime soon.
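
For anyone picking this up, the two-pass binning idea could look
roughly like the Python sketch below. It is not the r.univar
implementation. The names `binned_median` and `read_values` are
hypothetical, and it assumes the data range [lo, hi] is already known
before the first pass (otherwise finding it costs an extra pass).

```python
def binned_median(read_values, lo, hi, nbins=1000):
    """Exact median in two passes over the data.

    read_values: callable returning a fresh iterator over the data
                 (hypothetical stand-in for re-reading a raster map).
    lo, hi:      assumed known bounds of the data.
    """
    width = (hi - lo) / nbins or 1.0

    def bin_of(x):
        # Clamp so that x == hi (or slightly out-of-range input)
        # still lands in a valid bin.
        return max(0, min(int((x - lo) / width), nbins - 1))

    # Pass 1: build a histogram and count the values.
    counts = [0] * nbins
    n = 0
    for x in read_values():
        counts[bin_of(x)] += 1
        n += 1
    if n == 0:
        raise ValueError("no data")

    # 0-based ranks of the middle value(s): one if n is odd, two if even.
    ranks = sorted({(n - 1) // 2, n // 2})

    # Find the bin holding each rank and the rank's offset inside it.
    targets = []
    below = 0  # number of values in bins before bin b
    b = 0
    for r in ranks:
        while below + counts[b] <= r:
            below += counts[b]
            b += 1
        targets.append((b, r - below))

    # Pass 2: keep only the values from the target bin(s) and sort those.
    buckets = {tb: [] for tb, _ in targets}
    for x in read_values():
        xb = bin_of(x)
        if xb in buckets:
            buckets[xb].append(x)
    for vals in buckets.values():
        vals.sort()

    picked = [buckets[tb][off] for tb, off in targets]
    return sum(picked) / len(picked)
```

Only the contents of one or two bins are ever held in memory and
sorted, so the cost depends on how evenly the values spread over the
bins. The same histogram from pass 1 would also identify the modal bin.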

best,
Hamish