[GRASS5] [bug #2380] (grass) unexpected '(' in r.univar

Glynn Clements glynn.clements at virgin.net
Wed Apr 14 23:58:34 EDT 2004


Hamish wrote:

> a) Population vs. sample variance (& standard deviation)
> 
> r.series and r.univar use sum((xi-mean(x))^2)/n
>    (i.e. population variance aka "sigma^2")
> 
> while 
> 
> s.univar and s.cellstats use sum((xi-mean(x))^2)/(n-1)
>    (i.e. sample or bias-corrected variance aka "s^2")
> 
> 
> For consistency we should pick one way & document it.

Or consistently offer both options.
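
As a rough illustration of offering both, here is a minimal sketch in Python rather than the modules' C. The function name `variance` and the `sample` flag are made up for this example and are not part of any GRASS module or library:

```python
def variance(values, sample=False):
    """Variance of a sequence of numbers.

    sample=False: population variance, sum((xi - mean)^2) / n
    sample=True:  bias-corrected sample variance, sum((xi - mean)^2) / (n - 1)
    """
    n = len(values)
    divisor = n - 1 if sample else n
    if divisor <= 0:
        raise ValueError("not enough values for the requested variance")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / divisor
```

With a flag like this, each module could keep its current behaviour as the default and document the alternative alongside it.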

> The difference
> between n and n-1 for big maps with huge numbers of cells isn't very
> much, so this isn't too critical, but someone might need to do analysis
> on very small/sparse maps one day... I've used n-1, for no great reason
> other than that the current region is a 'sample' of a larger location.
> Can any stats people comment?

Note that, for r.series, n is the number of maps, not the number of
cells, so n vs n-1 would be more significant there.
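
To put numbers on that: the two estimates differ by a factor of n/(n-1), so with 5 input maps the sample variance is 25% larger than the population variance. A small sketch (the helper name is invented for this example):

```python
def sample_to_population_ratio(n):
    """Ratio of sample variance to population variance: n / (n - 1)."""
    if n < 2:
        raise ValueError("need at least two values")
    return n / (n - 1)


# A handful of maps (r.series-like) versus every cell of a 10000x10000 map.
for n in (5, 20, 10000 * 10000):
    print(n, sample_to_population_ratio(n))
```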

> b) gmath library: I looked at using the c_var.c & co. functions from
> r.series, but these require all input values to be passed at once (i.e.
> the whole map in memory). That is fine for a general library function,
> or for the n<1000 cells-at-the-same-coordinate that r.series or
> r.mapcalc might use, but it doesn't cut it for a 10000x10000 DCELL map.

Yep. General purpose versions of the common aggregate functions would
need to provide a begin/update/end API, so that the data can be
supplied in chunks.
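
A minimal sketch of what such an interface could look like for mean and variance, in Python rather than C. The class and method names are hypothetical, and the update step uses Welford's online algorithm, which is my choice for the example and not something taken from the existing r.series code:

```python
class RunningVariance:
    """Accumulates mean and variance from data supplied in chunks."""

    def __init__(self):
        # "begin": start with an empty accumulator.
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, chunk):
        # "update": fold one chunk (e.g. one raster row) into the state.
        for x in chunk:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    def end(self, sample=False):
        # "end": return (mean, variance); divide by n-1 if sample=True.
        divisor = self.n - 1 if sample else self.n
        if divisor <= 0:
            raise ValueError("not enough values")
        return self.mean, self.m2 / divisor


acc = RunningVariance()
for row in ([1.0, 2.0], [3.0, 4.0]):  # stand-in for rows read from a map
    acc.update(row)
mean, var = acc.end()
```

Only three numbers are kept between calls, so memory use does not depend on the size of the map.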

We would also need to consider implementing median (quartiles,
percentiles) efficiently for large amounts of data. Sorting the entire
set is usually overkill, but binning the values and then sorting only
the bin that contains the quantile requires two passes over the data.

Similar issues apply to computing the mode.
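
The binning approach for the median can be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: `read_chunks` is a hypothetical callable that returns a fresh iterator over the data each time it is called (i.e. the map can be read twice), the value range [lo, hi] is assumed to be known in advance, and for an even count the lower of the two middle values is returned.

```python
def binned_median(read_chunks, lo, hi, nbins=1024):
    """Exact (lower) median of chunked data using two passes."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / nbins

    def bin_of(x):
        # Clamp so that x == hi (or a stray out-of-range value) stays in range.
        return max(0, min(int((x - lo) / width), nbins - 1))

    # Pass 1: histogram of the values.
    counts = [0] * nbins
    n = 0
    for chunk in read_chunks():
        for x in chunk:
            counts[bin_of(x)] += 1
            n += 1
    if n == 0:
        raise ValueError("no data")

    # Find the bin that holds the value of rank k (0-based, lower median).
    k = (n - 1) // 2
    below = 0  # number of values in bins before the target bin
    for target, count in enumerate(counts):
        if below + count > k:
            break
        below += count

    # Pass 2: keep and sort only the values that fall in the target bin.
    kept = sorted(x for chunk in read_chunks() for x in chunk
                  if bin_of(x) == target)
    return kept[k - below]


rows = [[5, 1, 9], [3, 7], [2, 8, 6, 4]]
median = binned_median(lambda: iter(rows), lo=1, hi=9, nbins=4)
```

The sort in the second pass only touches the values in one bin, so with reasonably spread data and enough bins it stays small. Other quantiles would only change the rank k.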

-- 
Glynn Clements <glynn.clements at virgin.net>



