[postgis-devel] RE: standard deviation based histogram extent reduction
'strk'
strk at keybit.net
Tue Jun 22 09:53:02 PDT 2004
I've made the standard deviation multiplication factor a compile-time
define. You can play with SDFACTOR right below USE_STANDARD_DEVIATION.
Let me know...
--strk;
On Fri, Jun 18, 2004 at 11:31:18AM +0100, Mark Cave-Ayland wrote:
> Hi strk,
>
> I'm currently working on other projects so I haven't had a chance to
> play with the standard deviation stats code as much as I would like....
> however it works really well at filtering erroneous values out of the
> histogram :)
>
> I loaded in a couple of Tiger shapefiles of about 10,000 geometries,
> analyzed, tried some queries, added some extra features at a distance
> from the main data set and repeated. The extra features had no effect on
> this histogram and everything continued to work as it should - except
> the test_estimation.pl script which returned strange results because it
> did a SELECT extent(the_geom).... and therefore included the extra
> features when dividing the extent into n by n boxes....
>
> The only thing I did find was that the current error was a little too
> high for my liking; I found better results by increasing the threshold
> to 3 standard deviations from the mean. This should put the error
> somewhere around 0.3% and had the effect of reducing the difference
> between sample_extent and sd_histbox for me, especially on smaller
> datasets. I would be interested to compare results with other users who
> have experimented with PG 7.5 and CVS to see whether they agree with
> these results.
>
>
> Cheers,
>
> Mark.
>
> ---
>
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
>
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
>
>
> > -----Original Message-----
> > From: 'strk' [mailto:strk at keybit.net]
> > Sent: 11 June 2004 12:32
> > To: Mark Cave-Ayland
> > Cc: postgis-devel at postgis.refractions.net
> > Subject: Re: [postgis-devel] RE: standard deviation based
> > histogram extent reduction
> >
> >
> > On Fri, Jun 11, 2004 at 08:14:59AM +0100, Mark Cave-Ayland wrote:
> > > Hi strk,
> > >
> > > > -----Original Message-----
> > > > From: 'strk' [mailto:strk at keybit.net]
> > > > Sent: 10 June 2004 20:01
> > > > To: Mark Cave-Ayland
> > > > Cc: postgis-devel at postgis.refractions.net
> > > > Subject: Re: [postgis-devel] RE: standard deviation based
> > > > histogram extent reduction
> > > >
> > > >
> > > > I've added an irregular sized histogram grid (cell aspect is
> > > > always nearly square, keeping total cells near to the requested
> > > > precision).
> > > >
> > > > I'd like to have some test results before working on other
> > > > refinements. Just to make sure we are not introducing any bug.
> > > >
> > > > Thanks for your attention.
> > > >
> > > > --strk;
> > >
> > > The irregular sized histogram code looks good to me.
> >
> > Actually I've found a bug in it. Now should be fixed.
> >
> > >
> > > The only improvement I was suggesting was that instead of considering
> > > the cutoff rectangle as being the overall histogram extent, we should
> > > recalculate the histogram extent ignoring everything outside of this
> > > rectangle. This would have the result in most cases of bringing in the
> > > histogram extent much "tighter" around the dataset and hence increase
> > > the accuracy - other than changing the histogram extents, it won't
> > > change any of the existing code or methodology.
> > >
> > > Cheers,
> > >
> > > Mark.
> >
> > I've committed the "improvement", together with handling of
> > infinite geometries.
> >
> > Debugging output will report the three steps of histogram extent
> > definition: sample extent (sample_extent), standard deviation
> > based reduced extent (sd_histbox), new histogram extent after
> > outliers cut (histobox).
> >
> > The number of examined features will also be reported to check
> > how many samples were cut off (this is actually:
> > outliers+nulls+infinite, but it is easy to check - if you
> > want a finer report set DEBUG_GEOMETRY_STATS to 2).
> >
> > Here are a couple of tests with default stat target.
> >
> > ---
> > --- 20610 Multipolygons
> > ---
> >
> > $ grep best mpoly-NOsd # examined: 3000/3000
> > 2 (best/worst/avg) 1.32 -2.68 +-1.97
> > 4 (best/worst/avg) 0 -5.64 +-0.8
> > 8 (best/worst/avg) 0 -5.09 +-0.29
> > 16 (best/worst/avg) 0 -4.75 +-0.1
> > 32 (best/worst/avg) 0 -4.08 +-0.04
> >
> > $ grep best mpoly-sd # examined: 2759/3000
> > 2 (best/worst/avg) 0.2 3.6 +-2.22
> > 4 (best/worst/avg) 0 2.96 +-1.11
> > 8 (best/worst/avg) 0 -2.79 +-0.41
> > 16 (best/worst/avg) 0 -3 +-0.12
> > 32 (best/worst/avg) 0 -3.21 +-0.04
> >
> > ---
> > --- 2125 Multilinestrings (too few to tell..)
> > ---
> >
> > $ grep best mline-NOsd # examined: 2125/2125
> > 2 (best/worst/avg) -0.37 -2.72 +-1.41
> > 4 (best/worst/avg) 0 -2.77 +-0.58
> > 8 (best/worst/avg) 0 -2.72 +-0.17
> > 16 (best/worst/avg) 0 -3.29 +-0.1
> > 32 (best/worst/avg) 0 -4.94 +-0.07
> >
> > $ grep best mline-sd # examined: 1913/2125
> > 2 (best/worst/avg) 0.51 -5.97 +-2.84
> > 4 (best/worst/avg) 0.04 -2.54 +-0.89
> > 8 (best/worst/avg) 0 -2.44 +-0.35
> > 16 (best/worst/avg) 0 -2.44 +-0.12
> > 32 (best/worst/avg) 0 -2.72 +-0.06
> >
> >
> >
> > --strk;
> >
>