[postgis-devel] RE: standard deviation based histogram extent reduction

Thu Jun 10 12:01:09 PDT 2004

I've added irregular sized histogram grid (cell aspect is always
nearly square keeping total cells near to requested precision).

I'd like to have some test results before working on other
refinements. Just to make sure we are not introducing any bug.

Thanks for you attention.

--strk;

On Thu, Jun 10, 2004 at 05:40:06PM +0100, Mark Cave-Ayland wrote:
> Hi strk,
> 
> > -----Original Message-----
> > From: 'strk' [mailto:strk at keybit.net] 
> > Sent: 10 June 2004 16:42
> > To: Mark Cave-Ayland
> > Cc: postgis-devel at postgis.refractions.net
> > Subject: standard deviation based histogram extent reduction
> 
> (big cut)
> 
> > I'm working on your suggested histogram extent reduction algorithm.
> > 
> > I've added an intermediate scan to compute standard 
> > deviation, and I use that to compute histogram extent:
> > 
> >         geomstats->xmin = max((avgLOWx - 2 * sdLOWx), 
> > sample_extent->LLB.x);
> >         geomstats->ymin = max((avgLOWy - 2 * sdLOWy), 
> > sample_extent->LLB.y);
> >         geomstats->xmax = min((avgHIGx + 2 * sdHIGx), 
> > sample_extent->URT.x);
> >         geomstats->ymax = min((avgHIGy + 2 * sdHIGy), 
> > sample_extent->URT.y);
> > 
> > On the third scan (when histogram cells are given a value), every 
> > feature is checked to still fall in the modified extent, thus 
> > skipping 'hard deviant' completely out.
> > 
> > Test case is: a bunch of equal points, an hard deviant one. 
> > Samplep rows are: 3000.
> > 
> > When the hard deviant is not cought in the sample bucket, 
> > this is what I get:
> > 
> >  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   sample_extent: xmax,ymax: 10.000000,10.000000
> >  NOTICE:   standard deviations:
> >  NOTICE:    LOWx: 10.000000 +- 0.000000
> >  NOTICE:    LOWy: 10.000000 +- 0.000000
> >  NOTICE:    HIGx: 10.000000 +- 0.000000
> >  NOTICE:    HIGy: 10.000000 +- 0.000000
> >  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   histo: xmax,ymax: 10.000000,10.000000
> > 
> > When the hard deviant is cought, I get this:
> > 
> >  NOTICE:   sample_extent: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   sample_extent: xmax,ymax: 14.000000,11000.000000
> >  NOTICE:   standard deviations:
> >  NOTICE:    LOWx - avg:10.001333 sd:0.073030
> >  NOTICE:    LOWy - avg:13.663333 sd:200.649030
> >  NOTICE:    HIGx - avg:10.001333 sd:0.073030
> >  NOTICE:    HIGy - avg:13.663333 sd:200.649030
> >  NOTICE:   feat 1481 is an hard deviant, skipped
> 
> >  NOTICE:   histo: xmin,ymin: 10.000000,10.000000
> >  NOTICE:   histo: xmax,ymax: 10.147392,414.961395
> 
> Great! So this is the rectangle in which we are 95% sure any data inside
> here is valid... so why don't we use this as the cut-off filter for the
> histogram bounding box too? It would mean a four pass algorithm but....
> :)
> 
> So keep passes 1 & 2 as they are at the moment. On pass 3, we calculate
> the bounding box of the histogram as the original algorithm did,
> *except* we do not include any geometries which lie outside of the
> xmin,ymin,xmax,ymax rectangle calculated above as these are assumed to
> be noise. Finally on pass 4, we populate the histogram, again ignoring
> values which lie outside of this range.
> 
> You may even find that we can increase the level to 3 * SDs from the
> mean and still manage to cut out the outliers - this puts us at being
> somewhere around 99% certain that any data within the bounding rectangle
> is valid. Which is nice.
> 
> > Still, the extent has been enlarged too much!.
> > All examined features fall in the first cell (upper-left) of 
> > the histogram, but this cell is very tall, and in order to 
> > completely overlap it My search box has to be bigger then 
> > actually needed.
> > 
> > Since this histogram is a vertical bar, if we used
> > a 1600*1 cells grid instead of a 40*40 the result would have 
> > been more accurate. Also, if we had re-computed histogram 
> > extent after the exclusion of the 'hard deviator' we would 
> > have make a better job, but this could end up being a 
> > recursive operation ...
> 
> Hopefully this will be solved by cutting the outliers out of the data
> from the histogram bounding box too (see above).
> 
> > I haven't tested this yet on production data, but I've 
> > committed it, to allow you to test with me. To disable it, 
> > define USE_STANDARD_DEVIATION to 0 at top file.
> > 
> > The utils/ directory contains a script to test accuracy of 
> > the estimator.
> >
> > Mark, it's a pleasure working with you :)
> 
> Seriously strk, this is all great stuff! I think that everyone, and not
> just myself, appreciates the effort you've put into this. I look forward
> to doing some more serious testing on this when I have more time on my
> hands :)
> 
> 
> Cheers,
> 
> Mark.
> 
> ---
> 
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
> 
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
> 
> 
> This email and any attachments are confidential to the intended
> recipient and may also be privileged. If you are not the intended
> recipient please delete it from your system and notify the sender. You
> should not copy it or use it for any purpose nor disclose or distribute
> its contents to any other person.
> 
> 
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel