[postgis-devel] standard deviation based histogram extent reduction

Thu Jun 10 08:42:26 PDT 2004

On Thu, Jun 10, 2004 at 12:36:35PM +0100, Mark Cave-Ayland wrote:

(many cuts)

> The other corner case I have seen was when one of our import scripts
> went wrong and we ended up with several very large x/y coords in our
> table along with our normal data. When we did an ANALYZE, because the
> overall extents were so large, all of the geometries appeared in a
> single box in the bottom left hand corner of histogram and hence
> provided really bad query plans - deleting these erroneus geometries and
> doing ANALYZE solved the problem and everything went back to normal.
> 
> The solution I thought of was to try and reduce the effect of outliers
> using some statistical theory: for example we know that for a random
> sample of data, 95% of the data lies within 2 standard deviations of the
> mean. So I was considering calculating the mean of all LLB.x/y and
> URT.x/y, using this to calculate the standard deviation, and hence
> calculate the geomextents from this. This should hopefully reduce the
> effect of excessively large coordinates on the extents of the
> selectivity histogram at the cost of making the histogram area slightly
> smaller than that currently indicated by the overall sample extents.
> 
> The pseudo-code example for LLB.x would look something like:
> 
> 	// Calculate mean
> 	for each LLB.x in sample
> 		sumLLB.x += LLB.x
> 	next
> 	meanLLB.x = sumLLB.x / samplerows
> 
> 	// Calculate standard deviation
> 	for each LLB.x in sample
> 		s2 = (LLB.x - meanLLB.x) * (LLB.x - meanLLB.x)
> 	next
> 	s2 /= (samplerows - 1)
> 	s = sqrt(s2)
> 
> 	// For 95% coverage we should lie within 2sds of the mean
> 	geomstats->xmin = meanLLB.x - 2 * s
> 
> 
> The downside is that we have to iterate through the sample tuples twice,
> once to calculate the mean and then once again to calculate the standard
> deviation. However, since we are only using a sample of rows from the
> table then hopefully the effect on larger tables will be negligible.
> Comments anyone?

I'm working on your suggested histogram extent reduction algorithm.

I've added an intermediate scan to compute standard deviation,
and I use that to compute histogram extent:

        geomstats->xmin = max((avgLOWx - 2 * sdLOWx), sample_extent->LLB.x);
        geomstats->ymin = max((avgLOWy - 2 * sdLOWy), sample_extent->LLB.y);
        geomstats->xmax = min((avgHIGx + 2 * sdHIGx), sample_extent->URT.x);
        geomstats->ymax = min((avgHIGy + 2 * sdHIGy), sample_extent->URT.y);

On the third scan (when histogram cells are given a value), every 
feature is checked to still fall in the modified extent, thus skipping
'hard deviant' completely out.

Test case is: a bunch of equal points, an hard deviant one.
Samplep rows are: 3000.

When the hard deviant is not cought in the sample bucket, this
is what I get:

 NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
 NOTICE:   sample_extent: xmax,ymax: 10.000000,10.000000
 NOTICE:   standard deviations:
 NOTICE:    LOWx: 10.000000 +- 0.000000
 NOTICE:    LOWy: 10.000000 +- 0.000000
 NOTICE:    HIGx: 10.000000 +- 0.000000
 NOTICE:    HIGy: 10.000000 +- 0.000000
 NOTICE:   histo: xmin,ymin: 10.000000,10.000000
 NOTICE:   histo: xmax,ymax: 10.000000,10.000000

When the hard deviant is cought, I get this:

 NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
 NOTICE:   sample_extent: xmax,ymax: 14.000000,11000.000000
 NOTICE:   standard deviations:
 NOTICE:    LOWx - avg:10.001333 sd:0.073030
 NOTICE:    LOWy - avg:13.663333 sd:200.649030
 NOTICE:    HIGx - avg:10.001333 sd:0.073030
 NOTICE:    HIGy - avg:13.663333 sd:200.649030
 NOTICE:   feat 1481 is an hard deviant, skipped
 NOTICE:   histo: xmin,ymin: 10.000000,10.000000
 NOTICE:   histo: xmax,ymax: 10.147392,414.961395

After histogram extent reduction the hard deviant falls completely
outside of it.

I've handled this case in normalizzation skipping these features
and counting the actually examined ones.

Still, the extent has been enlarged too much!.
All examined features fall in the first cell (upper-left) of
the histogram, but this cell is very tall, and in order to
completely overlap it My search box has to be bigger then
actually needed.

Since this histogram is a vertical bar, if we used
a 1600*1 cells grid instead of a 40*40 the result would have been
more accurate.
Also, if we had re-computed histogram extent after the exclusion of the
'hard deviator' we would have make a better job, but this could end
up being a recursive operation ...

I haven't tested this yet on production data, but I've committed it,
to allow you to test with me.
To disable it, define USE_STANDARD_DEVIATION to 0 at top file.

The utils/ directory contains a script to test accuracy of the estimator.

Mark, it's a pleasure working with you :)

--strk;