[postgis-devel] GiST JoinSel Bogus?

Tue Nov 20 03:35:35 PST 2012

I think the handling of JOINS is highly dependent on geometry types.
For POINT vs POINT joins I think it's very unlikely to find a match,
for example, while for POINT vs. POLYGON the presence of 4 polygon
overlaps and 10 points in an histogram cell means that there will be
10 rows returned for each of those 4 polygons.

Thinking further along these lines I think what matters is the presence
of elements from BOTH datasets in a given cell.
As an extreme, if any histogram cell is EITHER occupied by one table
OR by the other (but never by BOTH) the estimate would be 0, right ?

So we could start by only counting cells where BOTH are found, and
take a product of that, considering that the count for POINT really
counts ONE, while the count for POLYGONS or LINES count slighly less
because the same POLYGON or LINE would intersect multiple cells. I
think we have code that accounts for that "diluition" in the straight
case using the concept of "intersection" between the average input
bounding box and the query cell, which in this case could be the
intersection of the average input bounding box of the two input tables.
See
http://trac.osgeo.org/postgis/browser/trunk/postgis/geometry_estimate.c#L553

I'm very proud of what came out from the work I've done (with help from
Mark) on the estimation code. It took a lot of neurons bakery.
It'll be very satisfactory once join selectivity is also done right.
Please put as much effort in comments as I did in there :)

One thing would be surely worth it: preparing a way to automate testing
for all of this.

--strk;

On Mon, Nov 19, 2012 at 04:55:40PM -0800, Paul Ramsey wrote:
> I've been looking over selectivity code, getting ready to attached
> selectivity on &&&, and also to try and figure out some of our knotty
> tickets on selectivity, and the GiST JoinSel code seems bogus to me,
> anyone care to look?
> 
> http://trac.osgeo.org/postgis/browser/trunk/postgis/geometry_estimate.c#L170
> 
> Basically, the core assumption is that, within the area that the
> tables overlap, every feature in one table will find two features in
> the other table it intersects with. Kind of an arbitrary scaling. The
> right way to do it, IMO, is to do the table-vs-table equivalent of the
> table-vs-extent calculation done in the plain GiST Sel function: add
> up the cell intersections individually, and adjust for proportion of
> feature overlapping. It's way more complex, but far more likely to
> reflect Truth.
> 
> Since the correct logic undercounts potential interactions quite
> dramatically, it might explain the bad plans reported out of here:
> http://gis.stackexchange.com/questions/41199/postgis-intermittent-index-performance,
> which look to be caused by a very aggressive selectivity number being
> fed to the planner by the && cost function.
> 
> Commentary?
> 
> P.
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel