[postgis-devel] RE: standard deviation based histogram extent reduction

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Fri Jun 18 03:31:18 PDT 2004


Hi strk,

I'm currently working on other projects, so I haven't had a chance to
play with the standard deviation stats code as much as I would like.
However, it works really well at filtering erroneous values out of the
histogram :)

I loaded in a couple of TIGER shapefiles of about 10,000 geometries,
analyzed, tried some queries, added some extra features at a distance
from the main data set and repeated. The extra features had no effect on
the histogram and everything continued to work as it should. The one
exception was the test_estimation.pl script, which returned strange
results because it did a SELECT extent(the_geom) and therefore included
the extra features when dividing the extent into n by n boxes.
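
To illustrate why a grid built from the full extent gets distorted, here
is a small Python sketch. It is not the logic of test_estimation.pl; the
helper names (extent_of, grid_cells) and the sample points are made up
for illustration only.

```python
def extent_of(points):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def grid_cells(extent, n):
    """Split an extent into n by n equally sized boxes (row by row)."""
    xmin, ymin, xmax, ymax = extent
    dx = (xmax - xmin) / n
    dy = (ymax - ymin) / n
    return [
        (xmin + i * dx, ymin + j * dy, xmin + (i + 1) * dx, ymin + (j + 1) * dy)
        for j in range(n)
        for i in range(n)
    ]


# Hypothetical data: a dense 11 x 11 block of points, plus one distant feature.
main = [(x, y) for x in range(11) for y in range(11)]
with_outlier = main + [(1000.0, 1000.0)]

cells_main = grid_cells(extent_of(main), 4)
cells_out = grid_cells(extent_of(with_outlier), 4)

print(cells_main[0])  # first cell covers a small part of the main data
print(cells_out[0])   # first cell is now large enough to hold all of it
```

With the distant feature included, every point of the main data set falls
into the first cell, so nearly all of the test boxes end up empty.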

The only thing I did find was that the current error was a little too
high for my liking; I found better results by increasing the threshold
to 3 standard deviations from the mean. For roughly normally distributed
data this should put the error somewhere around 0.3%, and it had the
effect of reducing the difference between sample_extent and sd_histbox
for me, especially on smaller datasets. I would be interested to compare
results with other users who have experimented with PG 7.5 and CVS to
see whether they agree with these results.
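
As a quick check of the 0.3% figure, the share of a normal distribution
lying more than k standard deviations from the mean can be computed
directly. This is only a sketch: it assumes the coordinates are normally
distributed and looks at a single axis, which real data will only
approximate. The function name fraction_outside is mine.

```python
import math


def fraction_outside(k):
    """Two-sided tail mass of a normal distribution beyond k standard deviations."""
    return math.erfc(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"{k} sd: {100 * fraction_outside(k):.2f}% outside")
# 1 sd: 31.73% outside
# 2 sd: 4.55% outside
# 3 sd: 0.27% outside
```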


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446



> -----Original Message-----
> From: 'strk' [mailto:strk at keybit.net] 
> Sent: 11 June 2004 12:32
> To: Mark Cave-Ayland
> Cc: postgis-devel at postgis.refractions.net
> Subject: Re: [postgis-devel] RE: standard deviation based 
> histogram extent reduction
> 
> 
> On Fri, Jun 11, 2004 at 08:14:59AM +0100, Mark Cave-Ayland wrote:
> > Hi strk,
> > 
> > > -----Original Message-----
> > > From: 'strk' [mailto:strk at keybit.net]
> > > Sent: 10 June 2004 20:01
> > > To: Mark Cave-Ayland
> > > Cc: postgis-devel at postgis.refractions.net
> > > Subject: Re: [postgis-devel] RE: standard deviation based 
> > > histogram extent reduction
> > > 
> > > 
> > > I've added an irregularly sized histogram grid (cell aspect is
> > > always nearly square, keeping total cells near to the requested
> > > precision).
> > > 
> > > I'd like to have some test results before working on other
> > > refinements. Just to make sure we are not introducing any bugs.
> > > 
> > > Thanks for your attention.
> > > 
> > > --strk;
> > 
> > The irregular sized histogram code looks good to me.
> 
> Actually I've found a bug in it. It should be fixed now.
> 
> > 
> > The only improvement I was suggesting was that instead of considering
> > the cutoff rectangle as being the overall histogram extent, we should
> > recalculate the histogram extent ignoring everything outside of this
> > rectangle. This would have the result in most cases of bringing in the
> > histogram extent much "tighter" around the dataset and hence increase
> > the accuracy - other than changing the histogram extents, it won't
> > change any of the existing code or methodology.
> > 
> > Cheers,
> > 
> > Mark.
> 
> I've committed the "improvement", together with handling of 
> infinite geometries.
> 
> Debugging output will report the three steps of histogram extent
> definition: sample extent (sample_extent), standard deviation 
> based reduced extent (sd_histbox), new histogram extent after 
> outliers cut (histobox).
> 
> The number of examined features will also be reported, to check
> how many samples were cut off (this is actually
> outliers+nulls+infinite, but it is easy to check; if you
> want a finer report, set DEBUG_GEOMETRY_STATS to 2).
> 
> Here are a couple of tests with default stat target.
> 
>   ---
>   --- 20610 Multipolygons 
>   ---
>   
>   $ grep best mpoly-NOsd # examined: 3000/3000
>       2   (best/worst/avg)        1.32    -2.68   +-1.97
>       4   (best/worst/avg)        0       -5.64   +-0.8
>       8   (best/worst/avg)        0       -5.09   +-0.29
>       16  (best/worst/avg)        0       -4.75   +-0.1
>       32  (best/worst/avg)        0       -4.08   +-0.04
>   
>   $ grep best mpoly-sd  # examined: 2759/3000
>       2   (best/worst/avg)        0.2     3.6     +-2.22
>       4   (best/worst/avg)        0       2.96    +-1.11
>       8   (best/worst/avg)        0       -2.79   +-0.41
>       16  (best/worst/avg)        0       -3      +-0.12
>       32  (best/worst/avg)        0       -3.21   +-0.04
> 
>   --- 
>   --- 2125 Multilinestrings (too few to tell..)
>   --- 
>   
>   $ grep best mline-NOsd # examined: 2125/2125
>       2   (best/worst/avg)        -0.37   -2.72   +-1.41
>       4   (best/worst/avg)        0       -2.77   +-0.58
>       8   (best/worst/avg)        0       -2.72   +-0.17
>       16  (best/worst/avg)        0       -3.29   +-0.1
>       32  (best/worst/avg)        0       -4.94   +-0.07
>   
>   $ grep best mline-sd # examined: 1913/2125
>       2   (best/worst/avg)        0.51    -5.97   +-2.84
>       4   (best/worst/avg)        0.04    -2.54   +-0.89
>       8   (best/worst/avg)        0       -2.44   +-0.35
>       16  (best/worst/avg)        0       -2.44   +-0.12
>       32  (best/worst/avg)        0       -2.72   +-0.06
> 
>  
> 
> --strk;
> 
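
The three steps of histogram extent definition described in the quoted
message (sample_extent, sd_histbox, histobox) can be sketched roughly as
follows. This is a Python illustration under my own assumptions, not the
PostGIS C code: features are reduced to single points, the cutoff is
mean +/- k standard deviations on each axis, and the function names and
test data are hypothetical.

```python
import statistics


def extent_of(points):
    """Bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def sd_box(points, k):
    """Box spanning mean +/- k standard deviations on each axis."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mx - k * sx, my - k * sy, mx + k * sx, my + k * sy)


def inside(p, box):
    """True if point p lies within box (edges included)."""
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def histogram_extent(points, k=3.0):
    """Return (sample_extent, sd_histbox, histobox, number of kept samples)."""
    sample_extent = extent_of(points)            # step 1: extent of all samples
    sd_histbox = sd_box(points, k)               # step 2: SD-based cutoff box
    kept = [p for p in points if inside(p, sd_histbox)]
    # step 3: recompute the extent from the samples that survived the cut
    histobox = extent_of(kept) if kept else sample_extent
    return sample_extent, sd_histbox, histobox, len(kept)


# Hypothetical data: a dense block of points plus one distant outlier.
points = [(x, y) for x in range(11) for y in range(11)] + [(1000.0, 1000.0)]
sample_extent, sd_histbox, histobox, n_kept = histogram_extent(points)
print(sample_extent, histobox, f"examined: {n_kept}/{len(points)}")
```

Here the single distant feature stretches sample_extent out to 1000 on
both axes, while histobox shrinks back to the block that holds the rest
of the data.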




