[GRASS5] statistics library?

Glynn Clements glynn.clements at virgin.net
Tue Sep 9 09:15:56 EDT 2003


Hamish wrote:

> I was looking to modify r.series to add an option for x% trimmed mean,
> and it occurred to me that I'd probably want the same for r.statistics.
> 
> Looking further, s.univar, s.windavg, s.cellstats, r.series,
> r.statistics, and probably others all implement their own mix of
> statistical queries, some with more options, some with less.
> 
> Wouldn't it be better to have a standard (simple) stats library in
> src/libes/gmath/ which worked on an unsorted array of floats?

Probably. Although there might be situations where having both
integer and FP versions would be useful (obviously that doesn't make
sense for e.g. mean, variance etc, but it might for others, e.g. sum,
median).
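
As a rough sketch of what paired integer and FP versions might look
like for sum and median (all function names here are hypothetical, and
returning the lower of the two middle values for an even-sized integer
array is just one possible convention, not something settled above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callbacks for qsort(); avoid subtraction to prevent overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Integer sum, accumulated in a wider type so the result stays exact. */
long long sum_int(const int *v, size_t n)
{
    long long t = 0;
    size_t i;
    for (i = 0; i < n; i++)
        t += v[i];
    return t;
}

/* Floating-point sum. */
double sum_double(const double *v, size_t n)
{
    double t = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        t += v[i];
    return t;
}

/* Integer median; sorts v in place. For even n this returns the lower
   middle value (an assumption), so the result is always a sample value. */
int median_int(int *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof *v, cmp_int);
    return v[(n - 1) / 2];
}

/* FP median; sorts v in place and averages the two middle values
   when n is even. */
double median_double(double *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof *v, cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

Neither median handles NaN or null cells; a real version would need to
decide what to do with those.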

> Is there something in R or somewhere else that could be reused? Would
> that just lead to dependency vs. sync-ing headaches, and be overkill
> anyway?

Using R would definitely be overkill.

> Start off with univar.c?

I think that src/raster/r.series/cmd/c_*.c might be a better
interface, in the sense of each function computing a single measure
rather than computing everything. If you just want e.g. sum/mean, all
of those calls to pow() for the variance/skew/kurtosis computations
would be excessive.

OTOH, if you wanted both variance and standard deviation, you wouldn't
want to compute the variance twice. So, we might want some sort of
hybrid, which doesn't compute values which aren't required, and which
only computes the required values once.

For cases where the number of samples is likely to be large, it would
be better to have an interface which allows the data to be passed in
chunks, rather than having to have all of the data in memory at once. 
However, the median (and quartiles, percentiles) can't be computed
this way; you have to have all of the data in memory at once.
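
A chunked interface along those lines might look something like the
following. This is only a sketch; struct accum and the accum_*()
functions are made-up names, and only measures which can be updated
incrementally (count, sum, min, max, mean) are covered:

```c
#include <assert.h>
#include <stddef.h>

/* Running totals; the data itself is never stored. */
struct accum {
    size_t count;
    double sum;
    double min, max;   /* only meaningful once count > 0 */
};

void accum_init(struct accum *a)
{
    a->count = 0;
    a->sum = 0.0;
    a->min = a->max = 0.0;
}

/* Feed one chunk of samples; may be called any number of times. */
void accum_add(struct accum *a, const double *chunk, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        double v = chunk[i];
        if (a->count == 0 || v < a->min)
            a->min = v;
        if (a->count == 0 || v > a->max)
            a->max = v;
        a->sum += v;
        a->count++;
    }
}

double accum_mean(const struct accum *a)
{
    assert(a->count > 0);
    return a->sum / (double)a->count;
}
```

The caller would typically call accum_add() once per row (or per
buffer) and read the results at the end. There is deliberately no
accum_median(), since that needs the whole data set.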

[Also, while you can compute the variance from just the count, sum and
sum-of-squares, it is more accurate to compute the mean first then
accumulate the deviation-squared values in a second pass. This came up
a while back in the context of r.univar computing a negative variance
(due to rounding error) when all of the values are identical,
resulting in the standard deviation computation failing.]
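
To illustrate the difference, here are the two approaches side by
side, as a sketch rather than the actual r.univar code (the function
names are hypothetical, and both compute the population variance):

```c
#include <assert.h>
#include <stddef.h>

/* One pass: variance from count, sum and sum-of-squares.
   The final subtraction of two nearly equal numbers can lose most of
   the significant digits, and the result can come out slightly
   negative due to rounding. */
double variance_one_pass(const double *v, size_t n)
{
    double sum = 0.0, sumsq = 0.0, mean;
    size_t i;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        sum += v[i];
        sumsq += v[i] * v[i];
    }
    mean = sum / (double)n;
    return sumsq / (double)n - mean * mean;
}

/* Two passes: compute the mean first, then accumulate the squared
   deviations. Every term added is >= 0, so the result can never be
   negative. */
double variance_two_pass(const double *v, size_t n)
{
    double sum = 0.0, ss = 0.0, mean;
    size_t i;
    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += v[i];
    mean = sum / (double)n;
    for (i = 0; i < n; i++) {
        double d = v[i] - mean;
        ss += d * d;
    }
    return ss / (double)n;
}
```

With the one-pass version, passing a slightly negative result to
sqrt() yields NaN, which is the kind of failure described above. The
two-pass version costs a second loop over the data, so it doesn't fit
a purely chunked interface unless the data can be read twice.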

-- 
Glynn Clements <glynn.clements at virgin.net>



