[postgis-devel] RE: standard deviation based histogram extent reduction

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Thu Jun 24 03:28:43 PDT 2004


Hi strk,

Cheers for this! Last week's experiments were leading me towards making
the filter less strong by moving to a value of around 3.25, which still
cuts out extreme outliers. Unfortunately I'm working on another project,
so I don't get much time to play with PostGIS at the moment...
however, I've got a CVS account, so I can commit the change myself when I
get time, to save you bandwidth. I'll let you know when I can start
playing again :)
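
As a rough check on what changing the factor does, here is a minimal
Python sketch (not PostGIS code). It assumes the values being filtered are
normally distributed, which real datasets generally are not, and the
helper name fraction_outside is made up for illustration. It reports the
share of values that would fall more than k standard deviations from the
mean.

```python
import math


def fraction_outside(k: float) -> float:
    """Two-sided tail mass of a normal distribution beyond k standard deviations."""
    return math.erfc(k / math.sqrt(2.0))


# Compare the two factors discussed in this thread.
for k in (3.0, 3.25):
    print(f"{k} sd: {100.0 * fraction_outside(k):.3f}% of values cut")
```

Under that assumption a factor of 3 cuts about 0.27% of values and 3.25
about 0.12%, so the larger factor is the gentler filter.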


Kind regards,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

> -----Original Message-----
> From: 'strk' [mailto:strk at keybit.net] 
> Sent: 22 June 2004 17:53
> To: Mark Cave-Ayland
> Cc: postgis-devel at postgis.refractions.net
> Subject: Re: [postgis-devel] RE: standard deviation based 
> histogram extent reduction
> 
> 
> I've made standard deviation multiplication factor a 
> compile-time define. You can play with SDFACTOR right below 
> USE_STANDARD_DEVIATION. Let me know...
> 
> --strk;
> 
> On Fri, Jun 18, 2004 at 11:31:18AM +0100, Mark Cave-Ayland wrote:
> > Hi strk,
> > 
> > I'm currently working on other projects so I haven't had a chance to
> > play with the standard deviation stats code as much as I would
> > like... however, it works really well at filtering erroneous values out
> > of the histogram :)
> > 
> > I loaded in a couple of Tiger shapefiles of about 10,000 geometries,
> > analyzed, tried some queries, added some extra features at a distance
> > from the main data set and repeated. The extra features had no effect
> > on this histogram and everything continued to work as it should -
> > except the test_estimation.pl script which returned strange results
> > because it did a SELECT extent(the_geom).... and therefore included
> > the extra features when dividing the extent into n by n boxes....
> > 
> > The only thing I did find was that the current error was a little too
> > high for my liking; I found better results by increasing the threshold
> > to 3 standard deviations from the mean. This should put the error
> > somewhere around 0.3% and had the effect of reducing the difference
> > between sample_extent and sd_histbox for me, especially on smaller
> > datasets. I would be interested to compare results with other users
> > who have experimented with PG 7.5 and CVS to see whether they agree
> > with these results.
> > 
> > 
> > Cheers,
> > 
> > Mark.
> > 
> > > -----Original Message-----
> > > From: 'strk' [mailto:strk at keybit.net]
> > > Sent: 11 June 2004 12:32
> > > To: Mark Cave-Ayland
> > > Cc: postgis-devel at postgis.refractions.net
> > > Subject: Re: [postgis-devel] RE: standard deviation based 
> > > histogram extent reduction
> > > 
> > > 
> > > On Fri, Jun 11, 2004 at 08:14:59AM +0100, Mark Cave-Ayland wrote:
> > > > Hi strk,
> > > > 
> > > > > -----Original Message-----
> > > > > From: 'strk' [mailto:strk at keybit.net]
> > > > > Sent: 10 June 2004 20:01
> > > > > To: Mark Cave-Ayland
> > > > > Cc: postgis-devel at postgis.refractions.net
> > > > > Subject: Re: [postgis-devel] RE: standard deviation based
> > > > > histogram extent reduction
> > > > > 
> > > > > 
> > > > > I've added an irregular sized histogram grid (cell aspect is
> > > > > always nearly square, keeping total cells near to the requested
> > > > > precision).
> > > > > 
> > > > > I'd like to have some test results before working on other 
> > > > > refinements. Just to make sure we are not introducing any bug.
> > > > > 
> > > > > Thanks for your attention.
> > > > > 
> > > > > --strk;
> > > > 
> > > > The irregular sized histogram code looks good to me.
> > > 
> > > Actually I've found a bug in it. It should now be fixed.
> > > 
> > > > 
> > > > The only improvement I was suggesting was that instead of considering
> > > > the cutoff rectangle as being the overall histogram extent, we should
> > > > recalculate the histogram extent ignoring everything outside of this
> > > > rectangle. This would have the result in most cases of bringing in the
> > > > histogram extent much "tighter" around the dataset and hence increase
> > > > the accuracy - other than changing the histogram extents, it won't
> > > > change any of the existing code or methodology.
> > > > 
> > > > Cheers,
> > > > 
> > > > Mark.
> > > 
> > > I've committed the "improvement", together with handling of
> > > infinite geometries.
> > > 
> > > Debugging output will report the three steps of histogram extent
> > > definition: sample extent (sample_extent), standard deviation
> > > based reduced extent (sd_histbox), new histogram extent after 
> > > outliers cut (histobox).
> > > 
> > > Number of examined features will also be reported to check
> > > how many samples were cut off (this is actually
> > > outliers+nulls+infinite, but it is easy to check - if you
> > > want a finer report set DEBUG_GEOMETRY_STATS to 2).
> > > 
> > > Here are a couple of tests with default stat target.
> > > 
> > >   ---
> > >   --- 20610 Multipolygons 
> > >   ---
> > >   
> > >   $ grep best mpoly-NOsd # examined: 3000/3000
> > >       2   (best/worst/avg)        1.32    -2.68   +-1.97
> > >       4   (best/worst/avg)        0       -5.64   +-0.8
> > >       8   (best/worst/avg)        0       -5.09   +-0.29
> > >       16  (best/worst/avg)        0       -4.75   +-0.1
> > >       32  (best/worst/avg)        0       -4.08   +-0.04
> > >   
> > >   $ grep best mpoly-sd  # examined: 2759/3000
> > >       2   (best/worst/avg)        0.2     3.6     +-2.22
> > >       4   (best/worst/avg)        0       2.96    +-1.11
> > >       8   (best/worst/avg)        0       -2.79   +-0.41
> > >       16  (best/worst/avg)        0       -3      +-0.12
> > >       32  (best/worst/avg)        0       -3.21   +-0.04
> > > 
> > >   --- 
> > >   --- 2125 Multilinestrings (too few to tell..)
> > >   ---
> > >   
> > >   $ grep best mline-NOsd # examined: 2125/2125
> > >       2   (best/worst/avg)        -0.37   -2.72   +-1.41
> > >       4   (best/worst/avg)        0       -2.77   +-0.58
> > >       8   (best/worst/avg)        0       -2.72   +-0.17
> > >       16  (best/worst/avg)        0       -3.29   +-0.1
> > >       32  (best/worst/avg)        0       -4.94   +-0.07
> > >   
> > >   $ grep best mline-sd # examined: 1913/2125
> > >       2   (best/worst/avg)        0.51    -5.97   +-2.84
> > >       4   (best/worst/avg)        0.04    -2.54   +-0.89
> > >       8   (best/worst/avg)        0       -2.44   +-0.35
> > >       16  (best/worst/avg)        0       -2.44   +-0.12
> > >       32  (best/worst/avg)        0       -2.72   +-0.06
> > > 
> > >  
> > > 
> > > --strk;
> > > 
> > 
> 
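
For readers who want to experiment outside the C code, the three-step
extent definition described in the quoted messages (sample_extent,
sd_histbox, histobox) can be sketched in Python as below. This is an
illustration under stated assumptions, not the committed implementation:
the function names, the per-edge mean and standard deviation, the
clamping to the sample extent, and the "keep any box that touches
sd_histbox" rule are all guesses, and SDFACTOR here is an ordinary
constant standing in for the compile-time define.

```python
import math
from statistics import mean, pstdev

SDFACTOR = 3.25  # illustrative value only; stands in for the compile-time define


def extent(boxes):
    """Smallest (xmin, ymin, xmax, ymax) box covering every input box."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def overlaps(a, b):
    """True when boxes a and b share at least one point."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def histogram_extent(boxes, sdfactor=SDFACTOR):
    """Return (sample_extent, sd_histbox, histobox, number of boxes kept)."""
    # Drop boxes with infinite or NaN coordinates before computing statistics.
    finite = [b for b in boxes if all(math.isfinite(v) for v in b)]
    if not finite:
        raise ValueError("no finite boxes in sample")

    # Step 1: extent of the whole finite sample.
    sample_extent = extent(finite)

    # Step 2: mean -/+ sdfactor * sd for each edge, clamped to the sample extent.
    lo_x, lo_y, hi_x, hi_y = (list(col) for col in zip(*finite))
    sd_histbox = (
        max(mean(lo_x) - sdfactor * pstdev(lo_x), sample_extent[0]),
        max(mean(lo_y) - sdfactor * pstdev(lo_y), sample_extent[1]),
        min(mean(hi_x) + sdfactor * pstdev(hi_x), sample_extent[2]),
        min(mean(hi_y) + sdfactor * pstdev(hi_y), sample_extent[3]),
    )

    # Step 3: recompute the extent from the boxes that touch sd_histbox only.
    kept = [b for b in finite if overlaps(b, sd_histbox)]
    histobox = extent(kept) if kept else sd_histbox
    return sample_extent, sd_histbox, histobox, len(kept)
```

Feeding it a tight cluster of boxes plus one far-away box should give a
histobox that hugs the cluster, while sample_extent still stretches out to
the stray box.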

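The irregular sized histogram grid mentioned in the quoted thread (cells
nearly square, total cell count near the requested precision) can be
approximated as follows. This is a minimal Python sketch under
assumptions: grid_shape and its parameters are hypothetical names, and the
rounding rule is one plausible choice, not necessarily what the PostGIS
code does.

```python
import math


def grid_shape(width, height, target_cells):
    """Pick (cols, rows) so cells are near-square and cols * rows is near target_cells."""
    if width <= 0 or height <= 0 or target_cells < 1:
        raise ValueError("width, height and target_cells must be positive")
    # For square cells cols / rows should equal width / height, and
    # cols * rows should equal target_cells; solve for cols, then round.
    cols = max(1, round(math.sqrt(target_cells * width / height)))
    rows = max(1, round(target_cells / cols))
    return cols, rows


# A 400 x 100 extent with about 1600 cells requested.
cols, rows = grid_shape(400.0, 100.0, 1600)
print(cols, rows, 400.0 / cols, 100.0 / rows)
```

For a wide extent this spends more columns than rows, so each cell stays
close to square instead of being stretched along the longer axis.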



