[postgis-devel] RE: standard deviation based histogram extent reduction

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Thu Jun 10 09:40:06 PDT 2004


Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:strk at keybit.net] 
> Sent: 10 June 2004 16:42
> To: Mark Cave-Ayland
> Cc: postgis-devel at postgis.refractions.net
> Subject: standard deviation based histogram extent reduction

(big cut)

> I'm working on your suggested histogram extent reduction algorithm.
> 
> I've added an intermediate scan to compute standard 
> deviation, and I use that to compute histogram extent:
> 
>         geomstats->xmin = max((avgLOWx - 2 * sdLOWx), 
> sample_extent->LLB.x);
>         geomstats->ymin = max((avgLOWy - 2 * sdLOWy), 
> sample_extent->LLB.y);
>         geomstats->xmax = min((avgHIGx + 2 * sdHIGx), 
> sample_extent->URT.x);
>         geomstats->ymax = min((avgHIGy + 2 * sdHIGy), 
> sample_extent->URT.y);
> 
> On the third scan (when histogram cells are given a value), every 
> feature is checked to still fall in the modified extent, thus 
> skipping 'hard deviant' completely out.
> 
> Test case is: a bunch of equal points, an hard deviant one. 
> Samplep rows are: 3000.
> 
> When the hard deviant is not cought in the sample bucket, 
> this is what I get:
> 
>  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
>  NOTICE:   sample_extent: xmax,ymax: 10.000000,10.000000
>  NOTICE:   standard deviations:
>  NOTICE:    LOWx: 10.000000 +- 0.000000
>  NOTICE:    LOWy: 10.000000 +- 0.000000
>  NOTICE:    HIGx: 10.000000 +- 0.000000
>  NOTICE:    HIGy: 10.000000 +- 0.000000
>  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
>  NOTICE:   histo: xmax,ymax: 10.000000,10.000000
> 
> When the hard deviant is cought, I get this:
> 
>  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
>  NOTICE:   sample_extent: xmax,ymax: 14.000000,11000.000000
>  NOTICE:   standard deviations:
>  NOTICE:    LOWx - avg:10.001333 sd:0.073030
>  NOTICE:    LOWy - avg:13.663333 sd:200.649030
>  NOTICE:    HIGx - avg:10.001333 sd:0.073030
>  NOTICE:    HIGy - avg:13.663333 sd:200.649030
>  NOTICE:   feat 1481 is an hard deviant, skipped

>  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
>  NOTICE:   histo: xmax,ymax: 10.147392,414.961395

Great! So this is the rectangle in which we are 95% sure any data inside
here is valid... so why don't we use this as the cut-off filter for the
histogram bounding box too? It would mean a four pass algorithm but....
:)

So keep passes 1 & 2 as they are at the moment. On pass 3, we calculate
the bounding box of the histogram as the original algorithm did,
*except* we do not include any geometries which lie outside of the
xmin,ymin,xmax,ymax rectangle calculated above as these are assumed to
be noise. Finally on pass 4, we populate the histogram, again ignoring
values which lie outside of this range.

You may even find that we can increase the level to 3 * SDs from the
mean and still manage to cut out the outliers - this puts us at being
somewhere around 99% certain that any data within the bounding rectangle
is valid. Which is nice.

> Still, the extent has been enlarged too much!.
> All examined features fall in the first cell (upper-left) of 
> the histogram, but this cell is very tall, and in order to 
> completely overlap it My search box has to be bigger then 
> actually needed.
> 
> Since this histogram is a vertical bar, if we used
> a 1600*1 cells grid instead of a 40*40 the result would have 
> been more accurate. Also, if we had re-computed histogram 
> extent after the exclusion of the 'hard deviator' we would 
> have make a better job, but this could end up being a 
> recursive operation ...

Hopefully this will be solved by cutting the outliers out of the data
from the histogram bounding box too (see above).

> I haven't tested this yet on production data, but I've 
> committed it, to allow you to test with me. To disable it, 
> define USE_STANDARD_DEVIATION to 0 at top file.
> 
> The utils/ directory contains a script to test accuracy of 
> the estimator.
>
> Mark, it's a pleasure working with you :)

Seriously strk, this is all great stuff! I think that everyone, and not
just myself, appreciates the effort you've put into this. I look forward
to doing some more serious testing on this when I have more time on my
hands :)


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.





More information about the postgis-devel mailing list