[postgis-devel] RE: geometry stats

Tue Feb 24 10:14:01 PST 2004

Hi strk,

> -----Original Message-----
> From: strk [mailto:strk at keybit.net] 
> Sent: 23 February 2004 12:59
> To: David Blasby; Mark Cave-Ayland
> Cc: postgis-devel at postgis.refractions.net
> Subject: geometry stats
> 
> 
> Hi Dave,
> I've prepared and committed a skeleton for
> PG75 integrated stats support.
> 
> What is done is:
> 
> 	1) histogram creation based on 300*attstattarget
> 	   sample rows (if available).
> 
> 	2) null fraction computation
> 
> 	3) average width of column values:
> 	   SUM(samplegeom->size) / not_null samples 
> 
> What remains to do now is:
> 
> 	1) fill the histogram values (float4).
> 
> 	2) estimate the histogram.
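
[Illustration of the three completed items quoted above. This is only a
minimal Python sketch, not the committed C code: the function name
sample_stats, its arguments, and the use of None for a NULL row are all
assumptions made for the example. It computes the sample size target,
the null fraction, and the average width as SUM(size) / not-null
samples.]

```python
def sample_stats(sample_sizes, stats_target=10):
    """Summarise a sample of geometry sizes.

    sample_sizes: serialized size in bytes of each sampled geometry,
                  with None standing in for a NULL value (assumption).
    stats_target: the column's statistics target (assumption: passed in
                  by the caller rather than read from the catalog).
    """
    # The quoted mail bases the sample on 300 * statistics target rows,
    # "if available", so fewer rows are simply used as they are.
    target_rows = 300 * stats_target
    rows = sample_sizes[:target_rows]

    total = len(rows)
    not_null = [size for size in rows if size is not None]

    # Fraction of sampled rows that are NULL.
    null_frac = (total - len(not_null)) / total if total else 0.0

    # Average width: sum of sizes divided by the not-null sample count.
    avg_width = sum(not_null) / len(not_null) if not_null else 0.0

    return target_rows, null_frac, avg_width
```

For example, a sample of four rows with sizes 100, NULL, 300 and 200
gives a null fraction of 0.25 and an average width of 200 bytes.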

Wow! I downloaded the hourly snapshot from CVS and had a look at the
changes you had put in. This really is great stuff - I think I'm
experiencing one of Dave's 'warm fuzzies' when someone works on a patch.
The work you've done on this is greatly appreciated :)

> I've seen that the current code is pretty fuzzy, and in my 
> experience this kind of algorithm requires many iterations of 
> fine-tuning.
> 
> I'd like to move the 'tunable' parts into the estimator, 
> keeping the builder as strict as possible.

Yup, that sounds good. As you've probably realised, the only thing that
can't be configured in the estimator is the number of boxes per side :)

> Since we will use floats instead of integers, we could use a 
> number in the range 0-1 to express the fraction of overlap 
> between a sample feature's box and a histogram cell. 
> Currently 1 is added to the cell value if at least 5% of a 
> feature overlaps it (correct me if I'm wrong). Finally we 
> should 'normalize' the histogram by dividing the value of each 
> cell by the number of not-null (or total) samples handled.  
> This should give a tune-free histogram, what do you think? Mark?

Sounds good to me - i.e. if all geometries sampled were to fit within a
single box then the value of that histogram box would be 1.0. I don't
think Dave's idea of updating the histogram will work, since as you
rightly point out, you don't know which rows have been deleted between
samples...
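
[Illustration of the normalized histogram discussed above. This is a
rough Python sketch under stated assumptions, not the PostGIS builder:
the name build_histogram, the (xmin, ymin, xmax, ymax) tuple layout and
the square grid are hypothetical, and the overlap factor is taken here
to be the share of the feature's box area that falls inside each cell.
Boxes with zero area (points, axis-parallel lines) are skipped for
brevity; a real builder would need to handle them.]

```python
def build_histogram(boxes, extent, cells_per_side):
    """Build a normalized 2D histogram from sample bounding boxes.

    boxes:  list of (xmin, ymin, xmax, ymax) for the not-null samples.
    extent: (xmin, ymin, xmax, ymax) covered by the histogram.
    """
    xmin, ymin, xmax, ymax = extent
    cell_w = (xmax - xmin) / cells_per_side
    cell_h = (ymax - ymin) / cells_per_side
    hist = [[0.0] * cells_per_side for _ in range(cells_per_side)]
    if not boxes:
        return hist

    for bx0, by0, bx1, by1 in boxes:
        area = (bx1 - bx0) * (by1 - by0)
        if area <= 0:
            continue  # degenerate box: not handled in this sketch
        for row in range(cells_per_side):
            for col in range(cells_per_side):
                cx0 = xmin + col * cell_w
                cy0 = ymin + row * cell_h
                # Width and height of the box/cell intersection.
                w = min(bx1, cx0 + cell_w) - max(bx0, cx0)
                h = min(by1, cy0 + cell_h) - max(by0, cy0)
                if w > 0 and h > 0:
                    # Add a 0-1 overlap factor instead of a flat 1.
                    hist[row][col] += (w * h) / area

    # Normalize by the number of samples handled.
    n = len(boxes)
    return [[value / n for value in row] for row in hist]
```

With this normalization, a sample whose boxes all fit inside one cell
yields 1.0 for that cell and 0.0 elsewhere, which matches the case
described above.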

> Then the estimator will need a change too... but I'd like to 
> discuss this later.

Was there anything in particular you had in mind? The only case I can
think of is that since we are working on a sample then we may have to
calculate the value for some extents that lie outside the bounds of our
histogram. However since the data is randomly sampled from the whole
table then we can be fairly sure that this number will need to be a
small fraction of the number of rows in the table - but we'll probably
have to determine the best value by trial and error. I'll try to
refresh my memory of the algorithm tomorrow to see if things will still
work as they should when working on a sample.
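
[Illustration of the out-of-bounds case raised above. This Python
sketch is an assumption about how an estimator could read a normalized
histogram, not the existing algorithm: estimate_selectivity and its
outside_default parameter are hypothetical names, and the 0.001 default
is only a placeholder for the small fraction that would have to be
found by trial and error.]

```python
def estimate_selectivity(hist, extent, search, outside_default=0.001):
    """Estimate the fraction of rows matching a search box.

    hist:    square grid of normalized cell values (rows of floats).
    extent:  (xmin, ymin, xmax, ymax) covered by the histogram.
    search:  (xmin, ymin, xmax, ymax) of the query box.
    """
    xmin, ymin, xmax, ymax = extent
    sx0, sy0, sx1, sy1 = search

    # A search box entirely outside the sampled extent cannot be read
    # from the histogram, so fall back to a small placeholder fraction.
    if not (sx0 < xmax and sx1 > xmin and sy0 < ymax and sy1 > ymin):
        return outside_default

    cells = len(hist)
    cell_w = (xmax - xmin) / cells
    cell_h = (ymax - ymin) / cells

    total = 0.0
    for row in range(cells):
        for col in range(cells):
            cx0 = xmin + col * cell_w
            cy0 = ymin + row * cell_h
            w = min(sx1, cx0 + cell_w) - max(sx0, cx0)
            h = min(sy1, cy0 + cell_h) - max(sy0, cy0)
            if w > 0 and h > 0:
                # Weight each cell by the share of it the search covers.
                total += hist[row][col] * (w * h) / (cell_w * cell_h)

    return min(1.0, total)
```

A search box covering half of the only populated cell returns 0.5 here,
while a box lying wholly outside the histogram returns the placeholder
default.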

Keep up the good work!

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446
