[postgis-users] Row estimations

Paul Ramsey pramsey at cleverelephant.ca
Wed Dec 7 16:00:54 PST 2022


Experiments are here if you would like to try them.

https://github.com/postgis/postgis/pull/713

On Wed, Dec 7, 2022 at 2:22 PM Paul Ramsey <pramsey at cleverelephant.ca> wrote:
>
> So, this is rather hard to explain without pictures, but you first have to recognize that the selectivity estimator is an "estimator", so you cannot construct synthetic situations and expect things to match exactly. You can only expect them to be better or worse.
>
> In particular, things like "how many points are selected by this box" are a little fraught.
>
> First, understand that underlying the spatial stats system is (in the 2D case) a simple 2D grid. The database draws a sample and fills in the counts of features lying in each grid cell. Simple, yes? Except that spatial data sets can get quite big. There's definitely a point at which you don't want to make the grid any larger, for fear of making it too expensive to work with. Right now we tend towards grids with about 20K cells at the highest end, and very few at the lowest end.
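>
> (A rough illustration of the idea, not the actual estimator code: PostgreSQL's width_bucket can bin a sample into a coarse grid, which is essentially what the stats gathering stores: a per-cell count of sampled features. Using the 5-point test table from later in the thread:)
>
> -- sketch only: bin the sample points into a 2x2 grid over a 0..52 by 0..3 extent
> SELECT width_bucket(ST_X(p), 0, 52, 2) AS cell_x,
>        width_bucket(ST_Y(p), 0, 3, 2)  AS cell_y,
>        count(*)                        AS features_in_cell
> FROM test
> GROUP BY 1, 2
> ORDER BY 1, 2;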
>
> As the code currently stands, it's possible for very small input tables to generate grids of 1x1 cell... which is what your case generates. In that case, the selectivity ends up being the proportion of the cell that the query box overlaps. So you can get (as you did) a selectivity of about 5% when the query box neatly covers 80% of the input rows.
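>
> (Just a back-of-envelope check with the 5-point table from later in the thread, not literally what the planner runs: with a single histogram cell, the cell is the whole sample extent, roughly x 1..51 and y 1..2, and the estimate degenerates to the fraction of that box covered by the query box.)
>
> -- query box (0 0, 3 3) against the one-cell "histogram" covering the sample extent
> SELECT ST_Area(ST_Intersection(ST_MakeEnvelope(0, 0, 3, 3),
>                                ST_MakeEnvelope(1, 1, 51, 2)))
>        / ST_Area(ST_MakeEnvelope(1, 1, 51, 2)) AS overlap_fraction;
> -- = 0.04, i.e. roughly the ~5% selectivity observed, even though 4 of 5 rows match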
>
> To a first approximation, this can be improved by tweaking the analysis code to increase the number of cells in the histogram for small tables. However, the selectivity calculations are still necessarily proportional to the area of cell overlapped, and in your example data, for, say, a 2x2 grid (which does capture the outlying point in its own cell), the cell boundary will be halfway between the extrema, so your first small query box will only pick up a selectivity of 0.2. Better than 0.05, but still not really right.
>
> The reasoning behind having the shrinking dimensions may strike you as odd: why not just a minimum 100x100 grid? Well, that comes down to our (my) philosophy of trying to extract as much analytical power out of each cell as possible. Basically, your data is quite variable on the X axis and not really variable on the Y axis... should we allocate as many cells in X in our stats grid as we do in Y? If you answer "no", then you're off down the rabbit hole with me: for each dimension, statistically figure out how much variability there is in the coverage, and then parcel out the cell budget so that the most variable dimensions get the most cells and the homogeneous dimensions get fewer. This means that, in the limit, with a data set that is quite variable in X and quite homogeneous in Y, you could end up with a 1x20000 grid, with lots of resolution in the axis we care about and none in the homogeneous axis.
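>
> (Again just a sketch of the intuition rather than the real algorithm: you can see the asymmetry the code reacts to by comparing the per-axis spread of the sample.)
>
> -- X carries nearly all the variability, so it earns most of the cell budget
> SELECT stddev(ST_X(p)) AS spread_x,  -- about 22 for the 5-point table
>        stddev(ST_Y(p)) AS spread_y   -- about 0.55
> FROM test;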
>
> Very clever, but lots of moving parts, and in the limit, as the number of incoming records goes down, it's not going to yield estimates that are as close to "reality" as we might like. But in return it extracts more precise estimates in the case where the number of rows is very, very large.
>
> At least in theory. One possibility might be to push up the current maximum number of cells, and use some of that budget to put a higher floor on the minimum number of cells per dimension, so that the minimum isn't 1, as it is currently, but something more like 10, to cut down on the kind of divergence you're seeing in your small-table case. I have some tweaks in hand that do that already, but now that I'm into the code I'm trying to dream up ways to cut it back while still maintaining a certain measure of dynamism.
>
> How much operational difference does the mis-estimate create? For a small table, I really wonder. Because the number of tuples will always be small, it's not like you're likely to be dropped into an index scan by accident, even if the selectivity is coming back too high.
>
> I am way more interested in cases with larger numbers of rows generating mis-estimates. I assume that my code might also be vulnerable to tables with a lot of duplication, and also tables of points versus tables of polygons / lines.
>
> ATB,
> P
>
> > On Dec 6, 2022, at 6:06 PM, Regina Obe <lr at pcorp.us> wrote:
> >
> > Correction.  I was mistaken; I tested the wrong example for rows=1.
> >
> > My ST_Contains returns the same answer as &&
> >
> > explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> >
> > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=1)
> >   Filter: st_contains('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry, p)
> >   Rows Removed by Filter: 4
> > Planning Time: 0.121 ms
> > Execution Time: 0.047 ms
> >
> > So the fact that it works for me and not for you might just be dumb luck: I ran the && query first, and the cached stats from that are being used.
> >
> > The difference in answer might also be because you modified your default_statistics_target.
> >
> > SHOW default_statistics_target;
> >
> > Mine shows 100.
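> >
> > (If you want ANALYZE to draw a bigger sample for just that column rather than changing the global setting, a per-column override works too; adjust the table/column names to your own schema:)
> >
> > ALTER TABLE test ALTER COLUMN p SET STATISTICS 1000;
> > ANALYZE test;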
> >
> > Thanks,
> > Regina
> >
> >
> > From: Regina Obe [mailto:lr at pcorp.us]
> > Sent: Tuesday, December 6, 2022 8:59 PM
> > To: 'PostGIS Users Discussion' <postgis-users at lists.osgeo.org>
> > Subject: RE: [postgis-users] Row estimations
> >
> > Igor,
> >
> > The stats estimation for PostGIS has never been the greatest.  It’s more of a limitation of PostgreSQL, as I understand it.
> > It’s been getting better over the years, though.  Even in the best-case scenario, only the bounding-box estimation would happen, and Paul may be better able to answer what happens with functions such as ST_Contains (which use the function instrumentation).
> >
> > So a better test to confirm your stats are working correctly is an && test, which I have below.
> >
> > I’m running  PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1" SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 3.4.3, released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
> >
> > explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') &&  p;
> >
> >
> > which gives me the expected answer of:
> >
> > Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual time=0.019..0.024 rows=4 loops=1)
> >   Filter: ('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry && p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.131 ms
> > Execution Time: 0.046 ms
> >
> > And 4 rows are indeed returned for me.
> >
> > explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))') && p;
> >
> > Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1)
> >   Filter: ('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry && p)
> >   Rows Removed by Filter: 4
> > Planning Time: 0.124 ms
> > Execution Time: 0.047 ms
> >
> > I confirm that I get the same answer as below when doing
> >
> > explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);
> >
> >
> > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual time=0.026..0.038 rows=4 loops=1)
> >   Filter: st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, p)
> >   Rows Removed by Filter: 1
> > Planning Time: 0.132 ms
> > Execution Time: 0.061 ms
> >
> >
> > So it does seem odd to me that ST_Contains would give a higher estimate than the && operation.  That could be a bug in the function instrumentation of  ST_Contains and ST_Intersects.
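> >
> > (If you want to poke at what the estimator itself thinks, I believe recent PostGIS versions ship undocumented helper functions for this; check \df _postgis_* on your install before trusting the exact signatures:)
> >
> > -- what selectivity does the stats machinery report for a given query box?
> > SELECT _postgis_selectivity('test', 'p',
> >        ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'));
> > -- dump the stored stats (histogram) for the column
> > SELECT _postgis_stats('test', 'p');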
> >
> >
> > Thanks,
> > Regina
> >
> > From: postgis-users [mailto:postgis-users-bounces at lists.osgeo.org] On Behalf Of Igor ALBUQUERQUE SILVA
> > Sent: Thursday, December 1, 2022 4:37 AM
> > To: postgis-users at lists.osgeo.org
> > Subject: [postgis-users] Row estimations
> >
> > Hello everyone,
> >
> > I have a question regarding the row estimations/gist indexes. Here's a minimal reproduction of it:
> >
> > create table test(p geometry(point));
> > insert into test(p) values (st_makepoint(1,1));
> > insert into test(p) values (st_makepoint(1,2));
> > insert into test(p) values (st_makepoint(2,1));
> > insert into test(p) values (st_makepoint(2,2));
> > insert into test(p) values (st_makepoint(51,1));
> > analyze test;
> > explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
> > explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> >
> > The two queries get the same cost/row estimation, of 1 row. This is the EXPLAIN ANALYZE of the first query:
> >
> > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual time=0.015..0.022 rows=4 loops=1)
> >    Filter: st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry, p)
> >    Rows Removed by Filter: 1
> >  Planning Time: 0.072 ms
> >  Execution Time: 0.035 ms
> > (5 rows)
> >
> > What I was expecting is for the first query to estimate 4 rows and the second to estimate 1, like what I get if I try the same thing using integers.
> >
> > create table test(x integer, y integer);
> > insert into test(x, y) values (0, 0);
> > insert into test(x, y) values (0, 1);
> > insert into test(x, y) values (1, 0);
> > insert into test(x, y) values (1, 1);
> > insert into test(x, y) values (50, 0);
> > analyze test;
> > explain analyze select * from test where x between 0 and 1 and y between 0 and 1;
> > explain analyze select * from test where x between 50 and 51 and y between 0 and 1;
> >
> > My question is: is this expected behaviour? I actually have a much larger table with a gist index where I found this occurring, and it causes the planner to make bad decisions: every query I run gets the same estimate, and whenever that estimate is very wrong, the planner does not take the optimal decision when it has to compare against other indexes' costs in a more complicated query.
> >
> > I'm using the official docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY
> >
> > Best regards,
> > Igor
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>

