[postgis-devel] Re: ANALYZE patch applied to 7.5 CVS

Fri Feb 20 02:45:49 PST 2004

m.cave-ayland wrote:
> Hi strk,
> > 
> > Do you mean the ANALYZE function should return the number of 
> > rows that will be fed to the custom stats builder ? I guess 
> > for postgis this would be the full dataset, are there cases 
> > where this does not hold true ?
> 
> Yes and yes. Due to the way that the statistics process works, there is
> no way that we can pipe an entire dataset into a stats builder function.
> This is because the rows requested are shared between all the analysis
> function for all the different column types so each ANALYZE would then
> iterate over the entire dataset multiple times..... this approach just
> would not scale :(

Stat builders could work taking a single value and updating a state
(like aggregates do). That way you could make a single scan for
all the columns.

> The existing selectivity functions use sampling to approximate the
> distribution of the data, and they work with few problems, so all we
> have to do is to provide enough information to indicate whether our plan
> is better than the inbuilt ones, i.e. at a minimum level of accuracy
> required to do this. Remember the aim here is to provide an informed
> *estimate* to the planner.

Dave, do you think your histogram-based-estimator would work if
the histogram has been built on a random subset of the geometries
in the table ? I guess we can get to the whole set, but that would
be rude :|

> If you look at analyze.c the current functions use an estimate of 300 *
> the statistics target of the column as the number of rows to sample. I
> would imagine that this would be a fairly good starting point to begin
> with, although we should aim to give a 'reasonable' estimate using the
> in-built statistics default value of 10.

How is the statistics target set ?

--strk;