[postgis-devel] Re: geometry stats

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Mon Mar 1 05:09:38 PST 2004


Hi strk,

> -----Original Message-----
> From: strk [mailto:strk at keybit.net] 
> Sent: 26 February 2004 16:48
> To: Mark Cave-Ayland
> Cc: David Blasby; postgis-devel at postgis.refractions.net; Paul Ramsey
> Subject: Re: [postgis-devel] Re: geometry stats
> 

(bits cut)

> In the histogram evaluator (1094):
> 
>                 x_idx_max = (box->high.x-geomstats->xmin) / 
> geow * bps;
>                 if (x_idx_max <0)
>                 {
>                         // should increment the value somehow
>                         x_idx_max = 0;
>                 }
>                 if (x_idx_max >= bps )
>                 {
>                         // should increment the value somehow
>                         x_idx_max = bps-1;
>                 }
> 
> Do you think it's worth actually incrementing the value ?
> What kind of curve should we apply in that case ?
> What we are detecting is that the query box goes outside the 
> histogram extent, we should find the values of approximate 
> cells ? The worst thing that can happen is that the estimator 
> gives a smaller estimation, which eventually could be 0.0. 
> Would that estimation just force the planner to make use of 
> the index or would it make it think there are no rows in the result ?
> 
> In the histogram evaluator:
> 
> 	I've re-introduced histogram cell value be divided
> 	by the fraction of cell that really overlaps query box.

I had a chance to do some more testing over the weekend so I ran in a
280,000 point world gazetteer dataset into a 7.5dev PostGIS database to
test the new statistics code, ran ANALYZE and then using a map as a
reference, messed about with some explain analyze queries. This was done
using the default statistics target of 10, so only 3,000 rows were
sampled from the entire dataset (just over 1%).

The results were surprisingly accurate when zoomed in around single
countries; I was seeing errors somewhere of less than 0.5%, e.g.
estimated = 53000, actual = 55000. Zooming  much closer into the map,
the percentage error became increasingly large, but the numbers were
much smaller (e.g. estimated=20 actual=90) so it didn't have much impact
on the query  plan.

The one thing that was apparent from the above was that we currently
need to increase the number of boxesPerSide when the statistic target is
increased, since the implication is that users would increase this
number if greater accuracy were required. Since the default values of
40bps with a target of 10 worked quite well, it may be worth using this
ratio as the basis of the boxesPerSide calculation, e.g.

		boxesPerSide = sqrt(150 * statistics target);

This gives values 38 for the default value of 10, and a value of 388 if
someone is concerned enough to use the maximum.

I think that what we need now is for people to try using the new
algorithm with larger datasets of different types to see when/if the
algorithm breaks down when the sample drops below a certain percentage.
Perhaps this will give more information as to whether we need to use a
probability curve outside of the sampled area? Regardless of this, it
does appear that the random fetch routine really does a good job of
picking values from all areas of the dataset.

All in all, things are looking really promising :)

I did think of one more modification: if PostGIS is compiled with PG7.5
then doing a 'SELECT update_geometry_stats()' should return a notice
indicating that statistics are automatically generated during a
VACUUM/VACUUM ANALYZE - strk, has any code been put in for this?


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.





More information about the postgis-devel mailing list