[postgis-devel] Re: [postgis-users] Possible index stats problem?

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Tue Sep 14 03:54:42 PDT 2004


(Alexandre, I've CC'd this to you as I'm not sure if you're on the
postgis-devel mailing list)

> -----Original Message-----
> From: strk at refractions.net [mailto:strk at refractions.net] 
> Sent: 14 September 2004 08:22
> To: Mark Cave-Ayland; postgis-devel at postgis.refractions.net
> Subject: Re: [postgis-devel] Re: [postgis-users] Possible 
> index stats problem?
> 

(lots cut)

> 
> I get it back.
> There is no change in cost estimation changing stawidth....
> The cost estimation is performed by gistcostestimate() 
> function, which is defined by standard postgresql 
> distribution and in turn calls the default cost estimation 
> routine. Now I need to know how to use a custom one (or patch 
> the gist one). --strk;


Hi strk,

I don't think you need to be looking at changing gistcostestimate(),
since it operates at the higher index AM level; any changes made there
would affect all GiST index accesses, not just those concerned with
geometries.

In fact, scanning through postgis_estimate.c and postgis.sql.in, I've
noticed that postgisgistcostestimate() and genericcostestimate2() aren't
even referenced; since they are replacement index AM cost functions,
they couldn't be used from within PostGIS anyway. I can only assume that
Dave used them as a debugging aid when writing his first cut of the
index selectivity code.

It might be worth looking at the default settings in postgresql.conf to
see the effect of changing some of the cost parameters; for example, I
find that the default value of random_page_cost = 4 grossly
overestimates the cost of a non-sequential page fetch on modern
hardware - I tend to reduce it to either 1 or 2, which lowers the
overall cost the planner assigns to an index scan.
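As a quick experiment, something like the following should show whether
the plan costs respond (untested here; the query window and the
"the_geom" column name are just placeholders for Alexandre's actual
schema):

    -- try a cheaper random page fetch for this session only
    SET random_page_cost = 2;

    -- re-run the problem query and compare the estimated costs
    EXPLAIN ANALYZE
    SELECT count(*) FROM municipios
    WHERE the_geom && 'BOX3D(0 0, 10 10)'::box3d;

Note this only changes the cost side of the plan; it won't fix the row
estimate itself.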

I'm also reasonably sure that this is not caused by using a sample of
geometries rather than the entire set. If it were, I would expect to see
wildly differing query plans for the same query across repeated ANALYZE
runs. In any case, Alexandre reports that there are only 5260 rows in
the table, and from testing here I've still seen good results with
~300k geometries using the default statistics target of 10.
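If we want to rule sampling out completely, one cheap test (again
assuming the geometry column is called "the_geom") is to raise the
statistics target for that column and re-ANALYZE:

    -- sample more rows when building the column statistics
    ALTER TABLE municipios ALTER COLUMN the_geom SET STATISTICS 100;
    ANALYZE municipios;

If the row estimates barely move afterwards, the sample size isn't the
problem.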

>    ->  Index Scan using municipios_gidx on municipios  
> (cost=0.00..475.56 rows=52 width=0) (actual time=75.169..6712.530 
> rows=413 loops=1)  Total runtime: 6718.763 ms

Now this part of the estimate seems really wrong: 52 estimated rows
versus 413 actual. The only thing I can think of is that some
characteristic of the data is skewing the estimates; for example,
perhaps it contains many polygons that overlap heavily within a small
area, or maybe the data consists of large irregular
polygons/linestrings?
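A quick sanity check along these lines (again, "the_geom" is an assumed
column name and the query window is purely illustrative) would be to
compare the overall extent of the data with how many rows a typical
window actually hits:

    -- overall bounding box of the data set
    SELECT extent(the_geom) FROM municipios;

    -- how many geometries a given window really touches
    SELECT count(*) FROM municipios
    WHERE the_geom && 'BOX3D(0 0, 10 10)'::box3d;

If a fairly small window selects a large fraction of the rows, heavy
overlap would go some way towards explaining the underestimate.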

To get to the bottom of this, I think we need to ask Alexandre for a
copy of his data off-list so we can look in much greater detail at
exactly how the selectivity code handles it.


Kind regards,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446






