[postgis-users] Points and Row Count Estimates

Wed Mar 1 08:44:46 PST 2023

> On Mar 1, 2023, at 7:23 AM, Gareth Bell <gareth.bell at outlook.com> wrote:
> 
> Hi postgis-users,
> 
> Please may I have some guidance. I have been looking in to bad row count estimates for a simple query, similar to the following:
>    SELECT area_id FROM area WHERE ST_INTERSECTS(area.polygon, ST_MakePoint(x, y))
> 
> I have found that the row count estimate is always '1', even if the point intersects multiple geometries and the actual row count is in the thousands. Is this expected behaviour?
> 
> To troubleshoot, I have tried:
>    1. Manually running `VACUUM ANALYZE`
>    2. Ensuring that stats are returned by `__postgis_stats`
>    3. Trying different but similar functions such as `ST_Contains`
> 
> I have also found that the following selectivity function returns `0` for any point geometry:
>    `__postgis_selectivity('area', 'polygon', ST_MakePoint(x, y))`

Yes, this is “expected” (the code is operating as written) though perhaps not correct. In general estimates will be estimates. The culpable line of code is 

		/* Add the pro-rated count for this cell to the overall total */
		total_count += cell_count * ratio;

Since we are usually checking things with area as our query keys, pro-rating coverage of our grid makes sense. 

Most cases of point-on-poly queries, the number of polys under a given point is one. So the selectivity of a single point is 1/N where N is the table size. Which converges nicely on zero.

You have some kind of fun oddball case with a lot of (presumably variable) polygon overlap such that this simplification doesn’t make sense for you. Unfortunately we don’t have any geometry typology to play with to special case points as the selectivity key. We just have the bounds. And we cannot reasonably treat a “zero size box” as a “point”, since a “zero size polygon” will presumably have zero selectivity and will also have a “zero size box". 

The “trouble” is that the grid is constructed quite deliberately knowing that we will be dong pro-rata calculations against it. We reduce the cell density in dimensions with low variability, and use pro-rating to get the correct values: if the variable is uniform and you cover half the variable range, you get half the expected results back. Once you stop pro-rating, by, for example, assigning all the cell’s contents to a “zero size box” you get a completely different flavour of wrong results.

We are trying to squeeze a lot of information out of a relatively small data structure. We get 30000 samples in an ANALYZE. That’s 30000 boxes, or 30000 * sizeof(float) * sizeof(float) bytes… 480K. That’s the raw sample! If we quantized that into a 1000x1000 uniform grid, we’d end up with 1Mb of “sample” grid data, our data structure would be larger than our inputs! 

So to try and squeeze more out of the grid, we make it non-uniform and apply pro-rata scaling in the selectivity estimation, but then we get the issue you’ve identified for zero dimension query inputs. 

If we had explicit column types (point, line, polygon) we could actually have different estimators for each, albeit at immense extra complexity. 

> After reading through https://www.crunchydata.com/blog/indexes-selectivity-and-statistics, I think a selectivity of `0` for a point might be the intended behaviour, but I'm not sure -- based on the blog post (and reading through the PostGIS source), spatial selectivity is the % of geometries covered by the search geometry. When the search geometry is a point (which has no dimensions), none of the geometry will be covered, giving a selectivity of `0`.

So, yes and no. It’s not really intended but it’s also not avoidable under the current constraints of the design. The “intent” would be to always have the selectivity estimate exactly match the actual returned rows, but that’s obviously not achievable. The question is where we get things wrong and by how much.

ATB,

P

> 
> This is where I've got to on the investigation so far. Could I have some guidance where to look next please? In particular:
>    1. Is always returning a specificity of `0` for a point geometry expected behaviour?
>    2. Is there a way to get more accurate row estimates for point-in-poly lookups such as `ST_INTERSECTS(area.polygon, ST_MakePoint(x, y)`?
> 
> Many thanks in advance,
> Gareth
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users