[postgis-devel] Re: geometry stats

strk strk at keybit.net
Mon Mar 1 06:11:50 PST 2004


m.cave-ayland wrote:
> Hi strk,
>
> I had a chance to do some more testing over the weekend, so I loaded a
> 280,000 point world gazetteer dataset into a 7.5dev PostGIS database to
> test the new statistics code, ran ANALYZE and then, using a map as a
> reference, messed about with some explain analyze queries. This was done
> using the default statistics target of 10, so only 3,000 rows were
> sampled from the entire dataset (just over 1%).
> 
> The results were surprisingly accurate when zoomed in around single
> countries; I was seeing errors of somewhere under 0.5%, e.g.
> estimated = 53000, actual = 55000. Zooming much closer into the map,
> the percentage error became increasingly large, but the row counts were
> much smaller (e.g. estimated = 20, actual = 90), so it didn't have much
> impact on the query plan.
> 
> The one thing that was apparent from the above was that we currently
> need to increase the number of boxesPerSide when the statistics target
> is increased, since the implication is that users would increase this
> number if greater accuracy were required. Since the default value of
> 40bps with a target of 10 worked quite well, it may be worth using this
> ratio as the basis of the boxesPerSide calculation, e.g.
> 
> 		boxesPerSide = sqrt(150 * statistics_target);
> 
> This gives a value of 38 for the default target of 10, and a value of
> 388 if someone is concerned enough to use the maximum.

Yes, I think we can do that.
What's the maximum? statistics_target is an integer, isn't it?
We should give statistics_target a well-defined meaning and then conform to it.
Can we treat it as a level of accuracy in the range 0-100?
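
Something along these lines, maybe (just a sketch, not code in CVS; the
function name and the 1-1000 clamp are my assumptions):

	#include <math.h>

	/*
	 * Sketch only: derive the histogram grid resolution from the
	 * column's statistics target, keeping roughly the current
	 * 40 boxesPerSide behaviour at the default target of 10.
	 */
	static int
	bps_from_stats_target(int stats_target)
	{
		if ( stats_target < 1 ) stats_target = 1;
		if ( stats_target > 1000 ) stats_target = 1000;

		/* sqrt(150*10) ~= 38, sqrt(150*1000) ~= 387 */
		return (int) rint(sqrt(150.0 * stats_target));
	}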


> I think that what we need now is for people to try using the new
> algorithm with larger datasets of different types to see when/if the
> algorithm breaks down when the sample drops below a certain percentage.
> Perhaps this will give more information as to whether we need to use a
> probability curve outside of the sampled area? Regardless of this, it
> does appear that the random fetch routine really does a good job of
> picking values from all areas of the dataset.

I think we should also make the number of sampled rows dynamic.
The statistics_target could be interpreted as the
percentage of table features to be used as the sample
(reltuples * statistics_target / 100).

Is there any literature on the interpretation of statistics_target in pgsql?
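
As a sketch of what I mean (illustrative names only, nothing committed):

	/*
	 * Sketch only: read statistics_target as a percentage of the
	 * table's estimated row count (reltuples) instead of using a
	 * fixed number of sample rows.
	 */
	static int
	sample_rows_from_target(double reltuples, int stats_target)
	{
		double nrows = reltuples * (double) stats_target / 100.0;

		if ( nrows < 1.0 ) nrows = 1.0;
		return (int) nrows;
	}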

> All in all, things are looking really promising :)

Happy to hear about positive tests.

> I did think of one more modification: if PostGIS is compiled with PG7.5
> then doing a 'SELECT update_geometry_stats()' should return a notice
> indicating that statistics are automatically generated during a
> VACUUM/VACUUM ANALYZE - strk, has any code been put in for this?

Nope.
We are working on reducing the number of *sql* files in the distrib
before putting any PG75-specific code in.
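
When we do get to it, the 7.5 build could simply raise a notice from a C
function, something along these lines (a sketch only; the function name is
made up and nothing like this is in CVS):

	#include "postgres.h"
	#include "fmgr.h"

	/*
	 * Sketch only: under 7.5 an update_geometry_stats() wrapper could
	 * just tell the user that ANALYZE now gathers the statistics
	 * automatically.
	 */
	PG_FUNCTION_INFO_V1(update_geometry_stats_notice);
	Datum
	update_geometry_stats_notice(PG_FUNCTION_ARGS)
	{
		elog(NOTICE, "update_geometry_stats() is a no-op under PostgreSQL 7.5: "
			"statistics are gathered automatically by ANALYZE");
		PG_RETURN_NULL();
	}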

> Cheers,
> 
> Mark.

--strk;


