[postgis-users] Re: [PERFORM] Bad query optimizer misestimation because of TOAST

Markus Schaber schabios at logi-track.com
Thu Feb 3 02:30:32 PST 2005


Hi, Tom,

Tom Lane wrote:

> What I would be inclined to do is to extend ANALYZE to make an estimate
> of the extent of toasting of every toastable column, and then modify
> cost_qual_eval to charge a nonzero cost for evaluation of Vars that are
> potentially toasted.

I currently do not have any knowledge of the query planner internals,
but that sounds good to me :-)

My (simpler) alternative would have been to simply add the number of
TOAST pages to the table size when estimating sequential scan costs.
This would clearly help in my case, but I now realize that it would
give rather bad misestimates when the TOASTed columns are never
touched.
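
For reference, both page counts that such an estimate would have to
combine are visible in pg_class; a quick look, with the table name as a
placeholder:

  SELECT c.relpages AS heap_pages,
         t.relpages AS toast_pages
    FROM pg_class c
    LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
   WHERE c.relname = 'my_geom_table';

Adding toast_pages to heap_pages is essentially what my simple approach
would have amounted to.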

> This implies an initdb-forcing change in pg_statistic, which might or
> might not be allowed for 8.1 ... we are still a bit up in the air on
> what our release policy will be for 8.1.

Is it possible to add columns to a metadata table in an existing
database? At least while the database is offline (no postmaster running
on it)?

You could make the query optimizer code work with both the old and the
new statistics schema (at least during the 8.x series). Users could
then upgrade as usual (without dump/restore, and without benefiting
from this change), and later manually change the schema to get the
benefit (maybe using some offline tool or special command). Of course,
this should be clearly documented. ANALYZE could spit out a warning
about the missing columns.
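
(To illustrate what I mean by supporting both schemas: the code only
needs to check whether the new columns exist. In SQL terms, with
'staextchunks' being a made-up name for whatever the new statistics
column ends up being called, the probe would look like the query below;
the C code would do the equivalent lookup through the catalog caches.)

  SELECT count(*) > 0 AS has_toast_stats
    FROM pg_attribute
   WHERE attrelid = 'pg_statistic'::regclass
     AND attname  = 'staextchunks';  -- hypothetical column name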

The most convenient method might be to have ANALYZE add those columns
automatically, but I'm somewhat reluctant to accept such an unexpected
side effect (a metadata schema change).

> My first thought about what stat ANALYZE ought to collect is "average
> number of out-of-line TOAST chunks per value".  Armed with that number
> and size information about the TOAST table, it'd be relatively simple
> for costsize.c to estimate the average cost of fetching such values.

This sounds good.
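
And it should be cheap to gather: a very crude version of that number
can already be read from the catalogs after a VACUUM, because the TOAST
table stores one row per chunk, so its reltuples is the total chunk
count. Again with a placeholder table name, and lumping all toastable
columns of the table together:

  SELECT t.reltuples / NULLIF(c.reltuples, 0) AS avg_chunks_per_row,
         t.relpages                           AS toast_pages
    FROM pg_class c
    JOIN pg_class t ON t.oid = c.reltoastrelid
   WHERE c.relname = 'my_geom_table';

A real per-column figure, as you propose, would of course have to come
from ANALYZE looking at the sampled values themselves.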

> I'm not sure if it's worth trying to model the cost of decompression of
> compressed values.  Surely that's a lot cheaper than fetching
> out-of-line values, so maybe we can just ignore it.  If we did want to
> model it then we'd also need to make ANALYZE note the fraction of values
> that require decompression, and maybe something about their sizes.

Well, the first step is to gather those statistics (they may be of
interest to administrators and developers, too), and since we are
already changing the metadata schema, I would vote for adding those
columns even if the query optimizer does not exploit them yet.

> This approach would overcharge for operations that are able to work with
> partially fetched values, but it's probably not reasonable to expect the
> planner to account for that with any accuracy.

I think it is impossible to give accurate statistics for this. We could
give some hints in "CREATE OPERATOR" to tell the query planner whether
the operator could make use of partial fetches, but this could never be
really accurate, as the amount of fetched data may vary wildly depending
on the value itself.

> A bigger concern is "what about lossy indexes"?  We currently ignore the
> costs of rechecking qual expressions for fetched rows, but this might be
> too inaccurate for situations like yours. I'm hesitant to mess with it
> though.  For one thing, to get it right we'd need to understand how many
> rows will be returned by the raw index search (which is the number of
> times we'd need to recheck).  At the moment the only info we have is the
> number that will pass the recheck, which could be a lot less ... and of
> course, even that is probably a really crude estimate when we are
> dealing with this sort of operator.

I do not know whether PostGIS actually rechecks against the real
geometry. If the application only needs the bbox check (the &&
operator), then the lossy index already contains all the information
needed. If a real intersection is needed, PostGIS users usually write
"(column && bbox_of_reference) AND intersects(column, reference)". This
uses the bbox-based index for efficient candidate selection, and then
the rather expensive geometric intersection algorithm for the final
decision.
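
For the archives, a minimal example of that idiom (table and column
names are placeholders, the search polygon is arbitrary, and I assume
the usual GiST index on the_geom):

  SELECT gid
    FROM roads
   WHERE -- lossy bbox stage, can use the GiST index:
         the_geom && GeomFromText('POLYGON((0 0,0 10,10 10,10 0,0 0))', -1)
     -- exact stage, evaluated only for the candidate rows:
     AND intersects(the_geom,
                    GeomFromText('POLYGON((0 0,0 10,10 10,10 0,0 0))', -1));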

But maybe there will be a real intersection operator in the future that
uses the bbox index as its first stage.

> Seems like a bit of a can of worms ...

Sorry :-)

I did not really expect the problem to be this complicated when I
posted it; I merely had in mind a 10-line patch to the query estimator
that I could backport and apply to my 7.4.6. It seems that my personal
problem-size estimator has some serious bugs, too...

Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios at logi-track.com | www.logi-track.com