[GRASSLIST:4085] Re: normal distribution
Quantitative Decisions
whuber at quantdec.com
Tue Jul 16 11:58:16 EDT 2002
At 03:01 PM 7/16/02 +0100, Thomas Dewez wrote:
>I created a difference image between two DEMs of the same area and would like
>to filter out only the outliers. ...
>
>The difference image is standardized ((obs-mean)/stdev) and I intend to
>reject any score outside +-1.96. How could I test that the difference
>image is truly normal so that the threshold is meaningful? Do you reckon
>this is a sensible way to reject values? It is nicely context-sensitive,
>but have I missed something?
This is difficult to answer because the most important part of the
question, your objective, is unstated. What follows therefore is a set of
general remarks.
The +-1.96 threshold will reject approximately 5% of all differences,
assuming they are Normally distributed, *regardless* of the cause of any
differences between the two DEMs. Such a Procrustean solution is unlikely
to be useful or even relevant. It will poke too many holes in your images.
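For instance, a quick simulation makes the point (a minimal Python sketch
with numpy; purely illustrative, not part of any GRASS workflow):

    import numpy as np

    # Differences that are pure Gaussian "noise" -- no real change anywhere.
    rng = np.random.default_rng(0)
    diff = rng.normal(loc=0.0, scale=0.5, size=1_000_000)

    # Standardize as proposed and apply the +-1.96 rule.
    z = (diff - diff.mean()) / diff.std()
    print((np.abs(z) > 1.96).mean())   # ~0.05: about 5% flagged, by construction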
(1) Useful. This means you should be rejecting true "outliers,"
whatever they are. If you are assuming the difference in DEMs is a
stationary multigaussian random function (representing "noise" or "error"
or what have you), then an outlier would be any difference not consistent
with that model. Assuming the images are fairly large, say with N pairs of
matched pixels, you should not use +-1.96 but instead something around the
1/(2N) and 100 - 1/(2N) percentage points (in percent) of the standard Normal
distribution. Indeed, you can view this as an approximation of a
(99%-confidence) prediction interval; see Hahn & Meeker, Statistical
Intervals, p. 62 (Wiley, 1991). For typical DEMs (hundreds to millions of
points), these percentage points will be in the 4-7 range, considerably
larger than 1.96. Furthermore, for a more robust approach, consider
estimating the standard deviation from the middle percentiles of the
differences, for instance from the interquartile range (for a Normal
distribution the IQR is about 1.35 standard deviations), and using that
estimate to compute the scores; a sketch follows point (2) below.
(2) Relevant. You could be testing extremely small differences. Maybe
they don't matter? You could instead identify differences that are of a
size that matters to your application and forget about finding statistical
outliers. At least be sure to evaluate all the differences in the context
of the elevation accuracy expected from each DEM.
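To make point (1) concrete, here is a minimal sketch (Python with numpy and
scipy; the function names are mine and purely illustrative):

    import numpy as np
    from scipy.stats import norm

    def outlier_cutoff(n_pixels, alpha=0.01):
        """Two-sided cutoff so that, under the Normal model, all n_pixels
        differences fall inside +-cutoff with roughly (1 - alpha) confidence.
        The tail probability alpha/(2N) with alpha = 0.01 is the 1/(2N)
        percent point mentioned in point (1)."""
        return norm.ppf(1.0 - alpha / (2.0 * n_pixels))

    def robust_scores(diff):
        """Standardized scores using the median and IQR instead of the mean
        and standard deviation.  For Normal data, IQR = 1.349 * sigma."""
        q1, q3 = np.percentile(diff, [25, 75])
        return (diff - np.median(diff)) / ((q3 - q1) / 1.349)

    print(outlier_cutoff(1_000_000))   # about 5.7 -- well beyond 1.96

For a million pixels the cutoff comes out near 5.7, squarely in the 4-7
range mentioned above.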
As to the second part of your question, a Normal probability plot would be
an excellent diagnostic test. You don't want to apply a standard test
(Kolmogorov-Smirnov, Shapiro-Wilk, or Anderson-Darling) because it will be
so powerful (due to the sheer amount of data) that it's sure to reject the
hypothesis of Normality no matter what. If your software won't handle a
probability plot with zillions of points, then sample the differences
either randomly or systematically (on a grid) and plot the sample. A
sample size of a thousand or so should be fine. But it would be nice to
use all the data, because that will highlight the nature of any truly
outlying data (which might not be picked up in a subsample of the pixels).
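If your software can't draw the plot directly, a subsampled version takes
only a few lines (again a Python sketch, with scipy and matplotlib assumed;
the stand-in data here are hypothetical):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import probplot

    # Stand-in for the real difference image; substitute your own raster.
    diff = np.random.default_rng(0).normal(size=(1000, 1000))

    # Random subsample of about a thousand pixels, as suggested above.
    sample = np.random.default_rng(1).choice(diff.ravel(), 1000, replace=False)

    # Normal probability plot: points near a straight line are consistent
    # with Normality; curvature or stray points at the ends are not.
    probplot(sample, dist="norm", plot=plt)
    plt.show()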
I would say this approach is not very "context-sensitive," at least not if
you mean spatial context, because it ignores location information
altogether. You are likely to discover that the differences between the DEMs
reflect artifacts of their construction (such as interpolation from contour
lines) as much as, or more than, true changes in ground
elevation. Be prepared not only for strong non-Normality, but also for
differences that have strong spatial patterns.
--Bill Huber
Quantitative Decisions
www.quantdec.com (contains pages on environmental statistics, including
software and aids for probability plotting and prediction limits)