[postgis-users] new estimation functions in PostGIS

David Blasby dblasby at refractions.net
Fri Oct 11 12:00:00 PDT 2002


I'm at the final stages of testing the histogram and new cost estimate
functions in PostGIS.

The histogram2d number-of-rows-returned estimator is working very
well.  This should allow PostgreSQL to choose the correct plan for joins
and other types of queries.
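
For example, something like this (the table and column names are just
examples here) should show the planner's estimated row count tracking
the real number of rows that fall inside the query box:

-- hypothetical check: 'roads'/'geom' are made-up names
EXPLAIN SELECT * FROM roads
WHERE geom && 'BOX3D(1000 1000, 2000 2000)'::box3d;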


The estimate of the cost of doing an index scan has been vastly improved.
PostgreSQL has a really, really bad built-in estimate of index query cost.
Unfortunately, I don't have much access to how it actually makes its
estimates, but I did manage to fiddle some numbers so it's really quite
accurate on my machine.


PostgreSQL does its index-use estimates like this:
    a. Calculate the selectivity of the query (i.e. use the 2d
histogram).
    b. Calculate the cost to traverse the index (basically, the size of
the index * the selectivity).  This is usually a fairly small cost
because the index is small relative to the data.
    c. PostgreSQL then estimates the cost of retrieving the actual
result tuples from the table.

I really only have access to (a) and (b) above.  PostgreSQL makes the
(c) estimate itself (and does a bad job of it).
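
To put some toy numbers on those three pieces (everything here is made
up: a million-row table, a 500-page index, a query box matching about
1% of the rows):

-- toy numbers only: (a) selectivity from the histogram2d,
-- (b) index traversal cost, and the row count that feeds step (c)
SELECT 0.01            AS selectivity,        -- (a) ~1% of the table matches the box
       0.01 * 500      AS index_pages_read,   -- (b) selectivity * index size (small)
       0.01 * 1000000  AS rows_for_step_c;    -- rows whose pages step (c) will cost out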

A sequential scan always costs the same no matter what the query asks
for: read the entire table and filter out the required rows.

The index scan is more complex (see the attached diagram - sorry it is
so crude).  First there are the index traversal steps - those are the
round circles.  Then there is the tuple-access step (those are the
vertical lines).

PostgreSQL estimates step (c) as follows (this is highly simplified!):

For each of the result tuples (total rows in the table * selectivity),
load the disk page that contains that tuple.

Currently, this "random_page_access" cost is 4 times the
"sequential_page_access" cost.

As you can see, if there are lots of tuples on a single disk page, and
you are querying a large number of tuples, this estimate will be very
very high.

This estimate is inaccurate for three big reasons: spatial
autocorrelation of the tuples, caching, and the "random_page_access"
cost itself.  The autocorrelation matters because tuples that are
spatially nearby are often also nearby on disk.  Caching matters because
the cache means the same disk page will not actually be loaded twice (or
10 times).  And the "random_page_access" cost itself appears to be set
too high.
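
A made-up example shows how far off that per-tuple estimate can be.
Suppose a query matches 100,000 tuples, but because of the spatial
clustering they sit on only about 2,000 disk pages:

-- made-up numbers: 100,000 matching tuples clustered onto ~2,000 disk pages
SELECT 100000 * 4  AS naive_estimate,      -- one random page fetch costed per tuple
       2000 * 1    AS clustered_reality;   -- each page really read (about) once, then cached

That is roughly a 200x overestimate before the cache is even considered.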

I adjusted the indexCorrelation to be about 97%, and that seems to give
good numbers.  This is mostly a made-up number, but it does help
PostgreSQL estimate the cache and autocorrelation effects more
effectively.
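
My reading of the planner (take this as a sketch only) is that the
correlation gets squared and used to interpolate between a best-case and
a worst-case I/O cost, so a value near 1 pulls the estimate down towards
the cheap, clustered case:

-- sketch only, re-using the made-up numbers from above:
-- min_io = read the ~2,000 pages once; max_io = one random fetch per tuple
SELECT 2000                                      AS min_io,
       400000                                    AS max_io,
       400000 + (0.97 * 0.97) * (2000 - 400000)  AS blended_io;  -- correlation^2 drags it towards min_io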

The end result is that the estimated index scan cost and sequential scan
cost now agree with the actual (wall clock) times.

The side effect of this is that the spatial index will be used more
often than other indexes, because those other indexes still have such
bad cost estimates (always much higher than they should be - often 100x
an accurate assessment).

I'll be committing my changes to CVS quite soon (after I do a bit more
testing and make sure it doesn't do anything bad under 7.1).  You can
upgrade a current database, but it is much easier to start with a fresh
one.  Basically the only thing you have to do is run:

SELECT UPDATE_GEOMETRY_STATS();
or
SELECT UPDATE_GEOMETRY_STATS(<table name>, <column name>);

every so often - it will update the histogram2d in the geometry_columns
table.  If you don't do this, it will continue to use the standard
(fixed-selectivity) row-count estimate, but with the improved total-cost
estimate.  This will only work under PostgreSQL 7.2, since the
statistics/ANALYZE machinery changed after 7.1.  If you use the new code
under 7.1 you will not see any difference.
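
For example (table and column names made up), a maintenance pass after
loading a batch of new data would just be:

-- refresh the histogram2d for everything registered in geometry_columns,
-- or for a single table/column ('roads'/'geom' are just example names)
SELECT UPDATE_GEOMETRY_STATS();
SELECT UPDATE_GEOMETRY_STATS('roads', 'geom');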

dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: index.gif
Type: image/gif
Size: 3912 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20021011/81f59124/attachment.gif>

