[GRASSLIST:4085] Re: normal distribution
Quantitative Decisions
whuber at quantdec.com
Tue Jul 16 11:58:16 EDT 2002
At 03:01 PM 7/16/02 +0100, Thomas Dewez wrote:
>I created a difference image between two DEMs of the same area and would like
>to filter out only the outliers. ...
>
>The difference image is standardized ((obs-mean)/stdev) and I intend to
>reject any score outside +-1.96. How could I test that the difference
>image is truly normal so that the threshold is meaningful? Do you reckon
>this is a sensible way to reject values? It is nicely context-sensitive,
>but have I missed something?
This is difficult to answer because the most important part of the
question, your objective, is unstated. What follows therefore is a set of
general remarks.
The +-1.96 threshold will reject approximately 5% of all differences,
assuming they are Normally distributed, *regardless* of the cause of any
differences between the two DEMs. Such a Procrustean solution is unlikely
to be useful or even relevant. It will poke too many holes in your images.
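For instance, a quick simulation makes the point (a minimal Python sketch
with numpy; purely illustrative, not part of any GRASS workflow):

    import numpy as np

    # Differences that are pure Gaussian "noise" -- no real change anywhere.
    rng = np.random.default_rng(0)
    diff = rng.normal(loc=0.0, scale=0.5, size=1_000_000)

    # Standardize as proposed and apply the +-1.96 rule.
    z = (diff - diff.mean()) / diff.std()
    print((np.abs(z) > 1.96).mean())   # ~0.05: about 5% flagged, by construction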
(1) Useful. This means you should be rejecting true "outliers,"
whatever they are. If you are assuming the difference in DEMs is a
stationary multigaussian random function (representing "noise" or "error"
or what have you), then an outlier would be any difference not consistent
with that model. Assuming the images are fairly large, say with N pairs of
matched pixels, you should not use +-1.96 but instead something around the
1/(2N) and 100 - 1/(2N) percentage points (in percent) of the standard Normal
distribution. Indeed, you can view this as an approximation of a
(99%-confidence) prediction interval; see Hahn & Meeker, Statistical
Intervals, p. 62 (Wiley, 1991). For typical DEMs (hundreds to millions of
points), these percentage points will be in the 4-7 range, considerably
larger than 1.96. Furthermore, for a more robust approach, consider
estimating the standard deviation from the middle percentiles of the
differences, for instance from the interquartile range (for a Normal
distribution the IQR is about 1.35 standard deviations), and using that
estimate to compute the scores; a sketch follows point (2) below.
(2) Relevant. You could be testing extremely small differences. Maybe
they don't matter? You could instead identify differences that are of a
size that matters to your application and forget about finding statistical
outliers. At least be sure to evaluate all the differences in the context
of the elevation accuracy expected from each DEM.
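To make point (1) concrete, here is a minimal sketch (Python with numpy and
scipy; the function names are mine and purely illustrative):

    import numpy as np
    from scipy.stats import norm

    def outlier_cutoff(n_pixels, alpha=0.01):
        """Two-sided cutoff so that, under the Normal model, all n_pixels
        differences fall inside +-cutoff with roughly (1 - alpha) confidence.
        The tail probability alpha/(2N) with alpha = 0.01 is the 1/(2N)
        percent point mentioned in point (1)."""
        return norm.ppf(1.0 - alpha / (2.0 * n_pixels))

    def robust_scores(diff):
        """Standardized scores using the median and IQR instead of the mean
        and standard deviation.  For Normal data, IQR = 1.349 * sigma."""
        q1, q3 = np.percentile(diff, [25, 75])
        return (diff - np.median(diff)) / ((q3 - q1) / 1.349)

    print(outlier_cutoff(1_000_000))   # about 5.7 -- well beyond 1.96

For a million pixels the cutoff comes out near 5.7, squarely in the 4-7
range mentioned above.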
As to the second part of your question, a Normal probability plot would be
an excellent diagnostic test. You don't want to apply a standard test
(Kolmogorov-Smirnov, Shapiro-Wilk, or Anderson-Darling) because it will be
so powerful (due to the sheer amount of data) that it's sure to reject the
hypothesis of Normality no matter what. If your software won't handle a
probability plot with zillions of points, then sample the differences
either randomly or systematically (on a grid) and plot the sample. A
sample size of a thousand or so should be fine. But it would be nice to
use all the data, because that will highlight the nature of any truly
outlying data (which might not be picked up in a subsample of the pixels).
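If your software can't draw the plot directly, a subsampled version takes
only a few lines (again a Python sketch, with scipy and matplotlib assumed;
the stand-in data here are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import probplot

    # Stand-in for the real difference image; substitute your own raster.
    diff = np.random.default_rng(0).normal(size=(1000, 1000))

    # Random subsample of about a thousand pixels, as suggested above.
    sample = np.random.default_rng(1).choice(diff.ravel(), 1000, replace=False)

    # Normal probability plot: points near a straight line are consistent
    # with Normality; curvature or stray points at the ends are not.
    probplot(sample, dist="norm", plot=plt)
    plt.show()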
I would say this approach is not very "context-sensitive," at least not if
you mean spatial context, because it ignores location information
altogether. You are likely to discover that the differences between the DEMs
reflect artifacts of their construction (such as interpolation from contour
lines) as much as, or more than, true changes in ground
elevation. Be prepared not only for strong non-Normality, but also for
differences that have strong spatial patterns.
--Bill Huber
Quantitative Decisions
www.quantdec.com (contains pages on environmental statistics, including
software and aids for probability plotting and prediction limits)