[postgis-devel] GiST JoinSel Bogus?

Tue Nov 20 09:31:16 PST 2012

On Tue, Nov 20, 2012 at 3:35 AM, Sandro Santilli <strk at keybit.net> wrote:
> I think the handling of JOINS is highly dependent on geometry types.
> For POINT vs POINT joins I think it's very unlikely to find a match,
> for example, while for POINT vs. POLYGON the presence of 4 polygon
> overlaps and 10 points in an histogram cell means that there will be
> 10 rows returned for each of those 4 polygons.
>
> Thinking further along these lines I think what matters is the presence
> of elements from BOTH datasets in a given cell.
> As an extreme, if any histogram cell is EITHER occupied by one table
> OR by the other (but never by BOTH) the estimate would be 0, right ?

Yes, the reductio ad absurdum I was thinking of while looking at the
current code is two 2x2 matrixes like this:

2 0
0 2

0  10
10  0

The current code would say that join would return 48 entries. The
danger of changing it, of course, it getting worse results in the
other direction: I can see how cell-by-cell counting could
over-estimate quite easily.

> So we could start by only counting cells where BOTH are found, and
> take a product of that, considering that the count for POINT really
> counts ONE, while the count for POLYGONS or LINES count slighly less
> because the same POLYGON or LINE would intersect multiple cells. I
> think we have code that accounts for that "diluition" in the straight
> case using the concept of "intersection" between the average input
> bounding box and the query cell, which in this case could be the
> intersection of the average input bounding box of the two input tables.
> See
> http://trac.osgeo.org/postgis/browser/trunk/postgis/geometry_estimate.c#L553

Yes, the current approach for simple selectivity seems well tested and
proven, so I feel less worried about trying to apply it to the joinsel
problem.

> I'm very proud of what came out from the work I've done (with help from
> Mark) on the estimation code. It took a lot of neurons bakery.
> It'll be very satisfactory once join selectivity is also done right.
> Please put as much effort in comments as I did in there :)

Indeed, proud you should be: it's one of the things that is quite
unique about PostGIS and makes our project work so well across such a
wide range of queries. "Bogus" was a poorly chosen work: "suboptimal"
would be better. What we have is better than the alternative (a
constant) but not yet the ideal. The current simple selective code is
close to the ideal, the only thing left to improve is this nagging
ticket:

http://trac.osgeo.org/postgis/ticket/1828

Which, incidentally, despite some great discussion on the ticket, we
don't really have an answer for. The problem isn't the selectivity
*calculation*, which is fine, it's the interface with PostgreSQL and
ensuring a useful answer for the <col> && <func> case. I wish there
was some way we could return "I dunno" on that case.

P.

> One thing would be surely worth it: preparing a way to automate testing
> for all of this.
>
> --strk;
>
>
> On Mon, Nov 19, 2012 at 04:55:40PM -0800, Paul Ramsey wrote:
>> I've been looking over selectivity code, getting ready to attached
>> selectivity on &&&, and also to try and figure out some of our knotty
>> tickets on selectivity, and the GiST JoinSel code seems bogus to me,
>> anyone care to look?
>>
>> http://trac.osgeo.org/postgis/browser/trunk/postgis/geometry_estimate.c#L170
>>
>> Basically, the core assumption is that, within the area that the
>> tables overlap, every feature in one table will find two features in
>> the other table it intersects with. Kind of an arbitrary scaling. The
>> right way to do it, IMO, is to do the table-vs-table equivalent of the
>> table-vs-extent calculation done in the plain GiST Sel function: add
>> up the cell intersections individually, and adjust for proportion of
>> feature overlapping. It's way more complex, but far more likely to
>> reflect Truth.
>>
>> Since the correct logic undercounts potential interactions quite
>> dramatically, it might explain the bad plans reported out of here:
>> http://gis.stackexchange.com/questions/41199/postgis-intermittent-index-performance,
>> which look to be caused by a very aggressive selectivity number being
>> fed to the planner by the && cost function.
>>
>> Commentary?
>>
>> P.
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at lists.osgeo.org
>> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel