[postgis-devel] RE: standard deviation based histogram extent reduction

'strk' strk at keybit.net
Tue Jun 22 09:53:02 PDT 2004


I've made the standard deviation multiplication factor a compile-time
define. You can play with SDFACTOR, right below USE_STANDARD_DEVIATION.
Let me know how it goes.

--strk;

On Fri, Jun 18, 2004 at 11:31:18AM +0100, Mark Cave-Ayland wrote:
> Hi strk,
> 
> I'm currently working on other projects so I haven't had a chance to
> play with the standard deviation stats code as much as I would like....
> however it works really well at filtering erroneous values out of the
> histogram :)
> 
> I loaded in a couple of Tiger shapefiles of about 10,000 geometries,
> analyzed, tried some queries, added some extra features at a distance
> from the main data set and repeated. The extra features had no effect on
> the histogram and everything continued to work as it should. The one
> exception was the test_estimation.pl script, which returned strange
> results because it did a SELECT extent(the_geom) and therefore included
> the extra features when dividing the extent into n by n boxes.
> 
> The only thing I did find was that the current error was a little too
> high for my liking; I found better results by increasing the threshold
> to 3 standard deviations from the mean. This should put the error
> somewhere around 0.3% and had the effect of reducing the difference
> between sample_extent and sd_histbox for me, especially on smaller
> datasets. I would be interested to compare results with other users who
> have experimented with PG 7.5 and CVS to see whether they agree with
> these results.
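
For reference, the "around 0.3%" figure can be checked with a short calculation. This is a minimal sketch that assumes the filtered values are normally distributed, which real spatial data often is not; the function name fraction_outside is hypothetical and not part of the PostGIS code.

```python
import math

def fraction_outside(k):
    """Fraction of a normal distribution lying more than k standard
    deviations away from the mean (both tails combined)."""
    return math.erfc(k / math.sqrt(2))

# Compare the share of samples cut off at a few candidate factors.
for k in (1.0, 2.0, 3.0):
    print(f"{k:.0f} sd: {100 * fraction_outside(k):.2f}% outside")
```

Under that assumption a factor of 3 leaves roughly 0.27% of samples outside the threshold, which is consistent with the figure quoted above.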
> 
> 
> Cheers,
> 
> Mark.
> 
> ---
> 
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
> 
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
> 
> 
> > -----Original Message-----
> > From: 'strk' [mailto:strk at keybit.net] 
> > Sent: 11 June 2004 12:32
> > To: Mark Cave-Ayland
> > Cc: postgis-devel at postgis.refractions.net
> > Subject: Re: [postgis-devel] RE: standard deviation based 
> > histogram extent reduction
> > 
> > 
> > On Fri, Jun 11, 2004 at 08:14:59AM +0100, Mark Cave-Ayland wrote:
> > > Hi strk,
> > > 
> > > > -----Original Message-----
> > > > From: 'strk' [mailto:strk at keybit.net]
> > > > Sent: 10 June 2004 20:01
> > > > To: Mark Cave-Ayland
> > > > Cc: postgis-devel at postgis.refractions.net
> > > > Subject: Re: [postgis-devel] RE: standard deviation based 
> > > > histogram extent reduction
> > > > 
> > > > 
> > > > I've added an irregular-sized histogram grid (cell aspect is
> > > > always nearly square, keeping the total number of cells near
> > > > the requested precision).
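
One way to pick such a grid is sketched below. It is only an illustration of the idea (near-square cells, total count close to a target), not the committed PostGIS code; the function name grid_dimensions and its parameters are hypothetical, and degenerate (zero-width or zero-height) extents are rejected rather than handled.

```python
import math

def grid_dimensions(width, height, target_cells):
    """Return (cols, rows) so that cells are roughly square and
    cols * rows stays close to target_cells."""
    if width <= 0 or height <= 0 or target_cells < 1:
        raise ValueError("extent must have positive size and target >= 1")
    # Side length of an ideal square cell if the extent were split
    # into exactly target_cells equal-area cells.
    cell_side = math.sqrt(width * height / target_cells)
    cols = max(1, round(width / cell_side))
    rows = max(1, round(height / cell_side))
    return cols, rows
```

For a 200 x 50 extent and a target of 100 cells this gives 20 columns by 5 rows, so each cell is 10 x 10 instead of the 20 x 5 cells a regular 10 x 10 grid would produce.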
> > > > 
> > > > I'd like to have some test results before working on other
> > > > refinements. Just to make sure we are not introducing any bug.
> > > > 
> > > > Thanks for your attention.
> > > > 
> > > > --strk;
> > > 
> > > The irregular sized histogram code looks good to me.
> > 
> > Actually I've found a bug in it. It should be fixed now.
> > 
> > > 
> > > The only improvement I was suggesting was that instead of considering
> > > the cutoff rectangle as being the overall histogram extent, we should
> > > recalculate the histogram extent ignoring everything outside of this
> > > rectangle. This would have the result in most cases of bringing in the
> > > histogram extent much "tighter" around the dataset and hence increase
> > > the accuracy - other than changing the histogram extents, it won't
> > > change any of the existing code or methodology.
> > > 
> > > Cheers,
> > > 
> > > Mark.
> > 
> > I've committed the "improvement", together with handling of 
> > infinite geometries.
> > 
> > Debugging output will report the three steps of histogram extent
> > definition: sample extent (sample_extent), standard deviation 
> > based reduced extent (sd_histbox), new histogram extent after 
> > outliers cut (histobox).
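
The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: boxes are (xmin, ymin, xmax, ymax) tuples, the reduced box is taken as the mean of each edge coordinate plus or minus sd_factor standard deviations, clamped to the sample extent, and a feature is kept only if its box lies entirely inside the reduced box. The function names and the exact statistics used are hypothetical.

```python
import math
from statistics import mean, pstdev

def union_extent(boxes):
    """Smallest (xmin, ymin, xmax, ymax) box covering all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def contains(outer, inner):
    """True if box inner lies entirely within box outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def histogram_extents(boxes, sd_factor=3.0):
    # Infinite (or NaN) boxes are dropped before any statistics are taken.
    finite = [b for b in boxes if all(math.isfinite(v) for v in b)]
    if not finite:
        raise ValueError("no finite boxes in sample")

    # Step 1: extent of the whole sample.
    sample_extent = union_extent(finite)

    # Step 2: reduced extent, mean -/+ sd_factor * sd per edge coordinate,
    # never larger than the sample extent.
    edges = []
    for i in range(4):
        values = [b[i] for b in finite]
        m, s = mean(values), pstdev(values)
        if i < 2:   # xmin, ymin: move the low edge down from the mean
            edges.append(max(m - sd_factor * s, sample_extent[i]))
        else:       # xmax, ymax: move the high edge up from the mean
            edges.append(min(m + sd_factor * s, sample_extent[i]))
    sd_histbox = tuple(edges)

    # Step 3: drop outliers and recompute the extent from what is left.
    kept = [b for b in finite if contains(sd_histbox, b)]
    histobox = union_extent(kept) if kept else sd_histbox
    return sample_extent, sd_histbox, histobox, len(kept)
```

With 100 unit boxes tiling a 10 x 10 area plus one box near (1000, 1000), the sample extent stretches out to the far box, while the final extent shrinks back to the 10 x 10 area and 100 of the 101 boxes are kept.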
> > 
> > The number of examined features will also be reported, to check 
> > how many samples were cut off (this is actually 
> > outliers+nulls+infinite, but it is easy to check - if you 
> > want a finer report, set DEBUG_GEOMETRY_STATS to 2).
> > 
> > Here are a couple of tests with default stat target.
> > 
> >   ---
> >   --- 20610 Multipolygons 
> >   ---
> >   
> >   $ grep best mpoly-NOsd # examined: 3000/3000
> >       2   (best/worst/avg)        1.32    -2.68   +-1.97
> >       4   (best/worst/avg)        0       -5.64   +-0.8
> >       8   (best/worst/avg)        0       -5.09   +-0.29
> >       16  (best/worst/avg)        0       -4.75   +-0.1
> >       32  (best/worst/avg)        0       -4.08   +-0.04
> >   
> >   $ grep best mpoly-sd  # examined: 2759/3000
> >       2   (best/worst/avg)        0.2     3.6     +-2.22
> >       4   (best/worst/avg)        0       2.96    +-1.11
> >       8   (best/worst/avg)        0       -2.79   +-0.41
> >       16  (best/worst/avg)        0       -3      +-0.12
> >       32  (best/worst/avg)        0       -3.21   +-0.04
> > 
> >   --- 
> >   --- 2125 Multilinestrings (too few to tell..)
> >   --- 
> >   
> >   $ grep best mline-NOsd # examined: 2125/2125
> >       2   (best/worst/avg)        -0.37   -2.72   +-1.41
> >       4   (best/worst/avg)        0       -2.77   +-0.58
> >       8   (best/worst/avg)        0       -2.72   +-0.17
> >       16  (best/worst/avg)        0       -3.29   +-0.1
> >       32  (best/worst/avg)        0       -4.94   +-0.07
> >   
> >   $ grep best mline-sd # examined: 1913/2125
> >       2   (best/worst/avg)        0.51    -5.97   +-2.84
> >       4   (best/worst/avg)        0.04    -2.54   +-0.89
> >       8   (best/worst/avg)        0       -2.44   +-0.35
> >       16  (best/worst/avg)        0       -2.44   +-0.12
> >       32  (best/worst/avg)        0       -2.72   +-0.06
> > 
> >  
> > 
> > --strk;
> > 
> 


