[postgis-devel] Re: geometry stats

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Mon Mar 1 07:41:13 PST 2004


Hi strk,

> -----Original Message-----
> From: strk [mailto:strk at keybit.net] 
> Sent: 01 March 2004 14:12
> To: Mark Cave-Ayland
> Cc: 'David Blasby'; postgis-devel at postgis.refractions.net; 
> 'Paul Ramsey'
> Subject: Re: [postgis-devel] Re: geometry stats

(cut)

> > The one thing that was apparent from the above was that we
> > currently need to increase the number of boxesPerSide when the
> > statistics target is increased, since the implication is that users
> > would increase this number if greater accuracy were required. Since
> > the default value of 40bps with a target of 10 worked quite well,
> > it may be worth using this ratio as the basis of the boxesPerSide
> > calculation, e.g.
> > 
> > 		boxesPerSide = sqrt(150 * statistics target);
> > 
> > This gives a value of 38 for the default target of 10, and a value
> > of 388 if someone is concerned enough to use the maximum.
> 
> Yes. I think we can do that.
> What's the maximum? statarget is an integer, isn't it?
> We should give statistics_target a meaning and then conform to
> it. Can we treat it as a level of accuracy in the range 0-100?

It's limited to the range 0 (no statistics) to 1000, with -1 meaning
use the default value (currently 10). See
http://www.postgresql.org/docs/7.4/static/sql-altertable.html.
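
(For reference, that per-column value is the one set with ALTER TABLE
... ALTER COLUMN ... SET STATISTICS on the page above.) Here is a
minimal sketch of how the proposed calculation could be clamped to that
range - the function name is made up for illustration and is not an
existing PostGIS symbol:

    #include <math.h>

    /*
     * Sketch only: derive the number of boxes per side of the histogram
     * grid from the column's statistics target, using the proposed
     * boxesPerSide = sqrt(150 * statistics_target) relationship.
     */
    int
    compute_boxes_per_side(int stats_target)
    {
        if (stats_target < 0)
            stats_target = 10;      /* -1 means "use the default" (10)  */
        if (stats_target > 1000)
            stats_target = 1000;    /* upper limit of the target        */
        if (stats_target == 0)
            return 0;               /* 0 means no statistics at all     */

        /* sqrt(150 * 10) ~= 38, sqrt(150 * 1000) ~= 387 */
        return (int) sqrt(150.0 * stats_target);
    }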

> > I think that what we need now is for people to try using the new
> > algorithm with larger datasets of different types to see when/if
> > the algorithm breaks down when the sample drops below a certain
> > percentage. Perhaps this will give more information as to whether
> > we need to use a probability curve outside of the sampled area?
> > Regardless of this, it does appear that the random fetch routine
> > really does a good job of picking values from all areas of the
> > dataset.
> 
> I think we should also make the number of rows dynamic.
> The statistics_target could be interpreted as the
> percentage of table features to be used as the sample
> (reltuples * statistics_target / 100).

We should not allow this, as it would mean that the whole table could
be read as part of the ANALYZE routine. Remember the case of our
70M-row table: this would mean a sequential scan *per column* during an
ANALYZE operation. It's just not practical for the larger datasets,
especially when you bear in mind that the btree statistics use only 10
histogram buckets (the default target) for the entire sample.
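
To put concrete numbers on that, using only the figures quoted above:

    sample size = reltuples * statistics_target / 100
                = 70,000,000 *  10 / 100 =  7,000,000 rows (default target)
                = 70,000,000 * 100 / 100 = 70,000,000 rows (target of 100)

as opposed to the small, random sample that ANALYZE is designed to work
with.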

If we find that this approach doesn't work well using a sample of the
data set, then it is the algorithm that is flawed, since ANALYZE is
designed to work on a small, random statistical sample of the table.
It is up to us to devise an algorithm that works on a sample. The same
problems must be present in the current btree statistics, since they
don't know whether the sample contains the very beginning or the very
end of the distribution. Perhaps, instead of the current algorithm, we
should look at extending what is already there into two dimensions?
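
To make the "two dimensions" idea a bit more concrete, here is a rough
sketch of the kind of structure I mean - this is not working PostGIS
code, and the type and field names are made up purely for illustration.
It just bins a sample of bounding boxes into a fixed grid, the 2D
analogue of dropping sampled values into the btree histogram buckets:

    /* Illustrative only: a 2D extent and a flat grid of cell counters. */
    typedef struct
    {
        double xmin, ymin, xmax, ymax;
    } Box2D;

    typedef struct
    {
        Box2D  extent;        /* extent covered by the sample            */
        int    boxesPerSide;  /* grid resolution, e.g. sqrt(150*target)  */
        float *cell;          /* boxesPerSide * boxesPerSide counters    */
    } Grid2D;

    /*
     * Add one sampled bounding box to the grid by incrementing every
     * cell it overlaps. Assumes a non-degenerate extent and
     * boxesPerSide > 0.
     */
    void
    grid_add_box(Grid2D *g, const Box2D *b)
    {
        double cellw = (g->extent.xmax - g->extent.xmin) / g->boxesPerSide;
        double cellh = (g->extent.ymax - g->extent.ymin) / g->boxesPerSide;
        int    x0 = (int) ((b->xmin - g->extent.xmin) / cellw);
        int    x1 = (int) ((b->xmax - g->extent.xmin) / cellw);
        int    y0 = (int) ((b->ymin - g->extent.ymin) / cellh);
        int    y1 = (int) ((b->ymax - g->extent.ymin) / cellh);
        int    x, y;

        /* clamp boxes that poke outside the sampled extent */
        if (x0 < 0) x0 = 0;
        if (y0 < 0) y0 = 0;
        if (x1 >= g->boxesPerSide) x1 = g->boxesPerSide - 1;
        if (y1 >= g->boxesPerSide) y1 = g->boxesPerSide - 1;

        for (y = y0; y <= y1; y++)
            for (x = x0; x <= x1; x++)
                g->cell[y * g->boxesPerSide + x] += 1.0f;
    }

A selectivity estimate for a query box would then just be the sum of
the cells it overlaps, scaled by the number of sampled rows; the open
question is how well that holds up as the sample becomes a smaller and
smaller fraction of the table.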
 
> > I did think of one more modification: if PostGIS is compiled with
> > PG7.5 then doing a 'SELECT update_geometry_stats()' should return a
> > notice indicating that statistics are automatically generated
> > during a VACUUM/VACUUM ANALYZE - strk, has any code been put in for
> > this?
> 
> Nope.
> We are working on reducing the number of *sql* files in the 
> distrib before putting any PG75-specific code in.

OK, I've seen some of the traffic on the list about this - I'll
consider it a work in progress :)
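
Just to sketch the sort of thing I had in mind there - purely
illustrative, since as you say no code exists for this yet, and I'm not
claiming this matches the real function's current shape:

    #include "postgres.h"
    #include "fmgr.h"

    /*
     * Sketch for a PG75 build: geometry statistics are gathered
     * automatically by ANALYZE / VACUUM ANALYZE, so the old entry point
     * would just emit a notice and do nothing.
     */
    PG_FUNCTION_INFO_V1(update_geometry_stats);
    Datum
    update_geometry_stats(PG_FUNCTION_ARGS)
    {
        elog(NOTICE, "update_geometry_stats() is not needed on PostgreSQL "
             "7.5: geometry statistics are gathered automatically by "
             "VACUUM ANALYZE");
        PG_RETURN_NULL();
    }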


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446






