[GRASS-dev] r.statistics limitation to CELL

Glynn Clements glynn at gclements.plus.com
Mon Jun 4 04:04:15 EDT 2007


Markus Neteler wrote:

> > > > 2. r.statistics uses r.stats to calculate the statistics, and r.stats
> > > > reads its inputs as CELL.
> > > 
> > > In this case, r.statistics could also accept an FCELL map
> > > without complaining? Currently I need extra steps to round
> > > elevation to a CELL map before running r.statistics.
> > 
> > You should be able to simply remove the check at
> > raster/r.statistics/main.c:83:
> > 
> >     if( G_raster_map_is_fp(covermap->answer, mapset) != 0 )
> >         G_fatal_error (_("This module currently only works for integer (CELL) maps"));
> > 
> > Or you might want to replace the error with a warning that the result
> > is inaccurate.
> 
> If so, the latter. Currently I have to round() the elevation map
> before running r.statistics - so the results are inaccurate anyway,
> but the procedure requires extra steps.

But if you round() manually, you know that you're introducing
inaccuracy, hence the suggestion of a warning if you allow r.stats to
do it automatically.
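
For reference, a minimal sketch of that change (the same check as
quoted above, with the fatal error turned into a warning; untested):

    if (G_raster_map_is_fp(covermap->answer, mapset) != 0)
        G_warning(_("Cover map is floating-point; values will be "
                    "truncated to integers and results may be inaccurate"));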

> The main question is: is it possible to conditionalize the calls in
> r3.stats to make it work for 2D maps? The new implementation from
> Soeren sounds promising, and is along the lines of your suggestion to
> rewrite r.stats from scratch.

I don't know how r3.stats works.

However, if we were to re-write something, it would probably be better
to re-write r.statistics to avoid using r.stats than to re-write
r.stats itself. r.stats has other uses beyond being a back-end for
r.statistics.

r.statistics is a simpler case, as I would expect the number of
distinct categories in the base map to be low enough that you could
comfortably store an accumulator per category.
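
As a rough sketch, such an accumulator might look like this (names and
fields are hypothetical, not taken from the existing code):

    struct accum {
        long count;
        double sum, sum2;       /* running sum and sum of squares */
        double min, max;
    };

    /* Fold one cover-map value into the accumulator */
    static void update_accum(struct accum *a, double val)
    {
        if (a->count == 0 || val < a->min)
            a->min = val;
        if (a->count == 0 || val > a->max)
            a->max = val;
        a->sum += val;
        a->sum2 += val * val;
        a->count++;
    }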

To simplify matters further, I would use a linear array, indexed by
base category, rather than an associative array (btree, hash table,
etc). Base categories would normally be small non-negative integers
(i.e. suitable for use as indices), and if they aren't (e.g. the
allocation is sparse or includes negative values), you can always
"renumber" the categories using a reclass, e.g.:

r.stats -n $inmap | awk '{print $1,"=",NR-1}' | r.reclass input=$inmap output=$outmap
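
With the categories renumbered to 0..n-1, the per-category lookup is
then just an array index (again a hypothetical sketch, reusing the
accumulator above; base_row and cover_row stand for the current row
buffers of the base and cover maps):

    /* ncats = highest base category + 1 */
    struct accum *acc = G_calloc(ncats, sizeof(struct accum));

    for (col = 0; col < ncols; col++) {
        int cat = base_row[col];

        if (cat >= 0 && cat < ncats)
            update_accum(&acc[cat], cover_row[col]);
    }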

Finally, it may be better to move the "awkward" aggregates (median,
mode) to a separate module, as the algorithms involved are completely
different.

It would probably be worth offering a choice of a single- or
multiple-pass algorithm. Count, sum, mean, minimum and maximum only
require a single pass. AFAICT, variance, standard deviation, skew and
kurtosis can all be done in a single pass, but a multiple-pass
algorithm would be more accurate.
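
As an illustration of the single-pass case, the standard incremental
(Welford-style) update for mean and variance is something like:

    /* Single-pass update of running mean and variance */
    static void update_moments(long *n, double *mean, double *m2, double val)
    {
        double delta;

        (*n)++;
        delta = val - *mean;
        *mean += delta / *n;
        *m2 += delta * (val - *mean);
        /* sample variance = *m2 / (*n - 1) after the last cell */
    }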

Actually, computing the mode of a large amount of FP data (or integer
data with a large range) is even worse than computing quantiles. 
AFAICT, any general solution must have worst-case memory consumption
proportional to either the number of cells in the input map or the
number of possible values (i.e. 2^32 for CELL/FCELL, 2^64 for DCELL).

For a large map with a small range, you need one counter for each
possible value. For a map with a large range, the optimal solution is
to simply sort the input map then scan the sorted data while tracking
the most common value found so far.
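
The scan over the sorted data is trivial; a minimal in-memory sketch
(assuming no NULL cells and that the data fits in memory; otherwise an
external sort would be needed):

    #include <stdlib.h>

    static int cmp_dbl(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;

        return (x > y) - (x < y);
    }

    /* Return the most frequent value in vals[0..n-1], n > 0 */
    double mode_by_sort(double *vals, size_t n)
    {
        size_t i, run = 1, best_run = 1;
        double best = vals[0];

        qsort(vals, n, sizeof(double), cmp_dbl);
        for (i = 1; i < n; i++) {
            run = (vals[i] == vals[i - 1]) ? run + 1 : 1;
            if (run > best_run) {
                best_run = run;
                best = vals[i];
            }
        }
        return best;
    }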

For a DCELL map which is too large to fit into memory, the only
practical solution is to reduce the precision.
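
"Reduce the precision" here just means quantizing the values to some
step before counting, e.g. (hypothetical helper):

    #include <math.h>

    /* Snap val to the nearest multiple of step (e.g. step = 0.01),
       so that nearly-equal values fall into the same bin */
    double quantize(double val, double step)
    {
        return floor(val / step + 0.5) * step;
    }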

-- 
Glynn Clements <glynn at gclements.plus.com>



