[postgis-users] Row estimations

Igor ALBUQUERQUE SILVA i.albuquerque-silva at kayrros.com
Fri Dec 9 10:11:53 PST 2022

Thanks a lot Paul and Regina!

This was a very clear overview of how the estimations work; it's much clearer to me now. I've done a few more tests, and what you said holds for the test case I sent before. I've also experimented with larger test cases with more points, and I think I found a solution for the problem I was having on my actual database (which uses a gist index): if I switch ST_Contains to ST_Intersects I get better query plans, so I'm already happy with this. Thank you so much for the help and insights!

If someone is curious, here's an example to illustrate this new claim (a million random draws, keeping the roughly 785k points that fall inside a quarter circle):

create table test(p geometry(point));
do $$
declare
  x double precision;
  y double precision;
begin
  for i in 1..1000000 loop
    x := random();
    y := random();
    if x * x + y * y < 1 then
      insert into test(p) values (st_makepoint(x, y));
    end if;
  end loop;
end $$;
create index index_gist on test using gist(p);
analyze test;

First test: explain analyze select * from test where
ST_Intersects(ST_GeomFromText('POLYGON((0 0,0.8 0,0.8 0.8,0 0.8,0 0))'), p);

Gather  (cost=18182.17..6499204.94 rows=615469 width=32) (actual time=81.817..237.974 rows=621571 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Bitmap Heap Scan on test  (cost=17182.17..6436658.04 rows=256445 width=32) (actual time=73.249..171.482 rows=207190 loops=3)
        Filter: st_intersects('01030000000100000005000000000000000000000000000000000000009A9999999999E93F00000000000000009A9999999999E93F9A9999999999E93F00000000000000009A9999999999E93F00000000000000000000000000000000'::geometry, p)
        Rows Removed by Filter: 0
        Heap Blocks: exact=2032
        ->  Bitmap Index Scan on index_gist  (cost=0.00..17028.30 rows=615469 width=0) (actual time=64.899..64.900 rows=621572 loops=1)
              Index Cond: (p && '01030000000100000005000000000000000000000000000000000000009A9999999999E93F00000000000000009A9999999999E93F9A9999999999E93F00000000000000009A9999999999E93F00000000000000000000000000000000'::geometry)
Planning Time: 0.411 ms
JIT:
  Functions: 6
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 0.905 ms, Inlining 103.928 ms, Optimization 28.149 ms, Emission 15.308 ms, Total 148.290 ms
Execution Time: 255.509 ms

Second test: explain analyze select * from test where
ST_Contains(ST_GeomFromText('POLYGON((0 0,0.8 0,0.8 0.8,0 0.8,0 0))'), p);

Bitmap Heap Scan on test  (cost=176.04..21964.71 rows=615469 width=32) (actual time=98.093..421.178 rows=621571 loops=1)
  Filter: st_contains('01030000000100000005000000000000000000000000000000000000009A9999999999E93F00000000000000009A9999999999E93F9A9999999999E93F00000000000000009A9999999999E93F00000000000000000000000000000000'::geometry, p)
  Heap Blocks: exact=5776
  ->  Bitmap Index Scan on index_gist  (cost=0.00..22.17 rows=785 width=0) (actual time=97.300..97.301 rows=621571 loops=1)
        Index Cond: (p @ '01030000000100000005000000000000000000000000000000000000009A9999999999E93F00000000000000009A9999999999E93F9A9999999999E93F00000000000000009A9999999999E93F00000000000000000000000000000000'::geometry)
Planning Time: 0.668 ms
Execution Time: 438.038 ms

In this example ST_Contains estimates 785 rows, while ST_Intersects estimates 615469, which is much closer to the real 621571 rows. Using the bounding-box operator directly I also get good estimates, and likewise if I don't use the gist index.

Best regards,
Igor

On Thu, 8 Dec 2022 at 01:01, Paul Ramsey <pramsey at cleverelephant.ca> wrote:

> Experiments are here if you would like to try them.
>
> https://github.com/postgis/postgis/pull/713
>
> On Wed, Dec 7, 2022 at 2:22 PM Paul Ramsey <pramsey at cleverelephant.ca>
> wrote:
> >
> > So, this is rather hard to explain without pictures, but you have to first of all recognize that the selectivity estimator is an "estimator", so you cannot construct synthetic situations and expect things to match exactly. You can only expect them to be better or worse.
> >
> > In particular, things like "how many points are selected by this box" are a little fraught.
> >
> > First, understand that underlying the spatial stats system is (in the 2d case) a simple 2d grid. The database draws a sample and fills in the counts of features lying in each grid cell. Simple, yes? Except that spatial data sets can get quite big. There's definitely a point at which you don't want to make the grid any larger, for fear of making working with it too expensive. Right now we tend towards grids with about 20K cells at the highest end, and very few at the lowest end.
> >
> > As the code currently stands, it's possible for very small input tables
> to generate grids of 1x1 cell... which is what your case generates with the
> current code. In that case, the selectivity will end up being the
> proportion of the cell that the query box overlaps. So you can get (as you
> did) a selectivity of about 5% when the query box neatly covered 80% of the
> input rows.
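To make that concrete, here is a toy version of the grid estimator in Python (my own simplification for illustration, not the actual PostGIS C code): histogram the sample into cells over the data extent, then credit each cell by the fraction of it the query box overlaps. On the five-point table from the original mail, a 1x1 grid reproduces the "about 5%" figure:

```python
def estimate_selectivity(points, qbox, nx, ny):
    """Toy grid estimator: histogram the sample into nx*ny cells over the
    data extent, then sum count * (fraction of each cell the box covers)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    x1 = x1 if x1 > x0 else x0 + 1.0   # guard against zero-width extents
    y1 = y1 if y1 > y0 else y0 + 1.0
    cw, ch = (x1 - x0) / nx, (y1 - y0) / ny
    counts = [[0] * ny for _ in range(nx)]
    for x, y in points:
        i = min(int((x - x0) / cw), nx - 1)
        j = min(int((y - y0) / ch), ny - 1)
        counts[i][j] += 1
    qx0, qy0, qx1, qy1 = qbox
    expected = 0.0
    for i in range(nx):
        for j in range(ny):
            cx, cy = x0 + i * cw, y0 + j * ch
            ox = max(0.0, min(cx + cw, qx1) - max(cx, qx0))
            oy = max(0.0, min(cy + ch, qy1) - max(cy, qy0))
            expected += counts[i][j] * (ox / cw) * (oy / ch)
    return expected / len(points)

# Igor's five points; the query box actually covers 4 of the 5 rows.
pts = [(1, 1), (1, 2), (2, 1), (2, 2), (51, 1)]
print(estimate_selectivity(pts, (0, 0, 3, 3), 1, 1))    # 0.04
print(estimate_selectivity(pts, (0, 0, 3, 3), 100, 2))  # 0.8
```

With a 1x1 grid everything collapses to the box-versus-extent overlap fraction (0.04), while a finer grid recovers the true 80% selectivity.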
> >
> > To an approximation, this can be improved by tweaking the analysis code to increase the number of cells in the histogram for small tables. However, the selectivity calculations are still necessarily proportional to the area of cell overlapped, and in your example data, for, say, a 2x2 grid, which does capture the outlying point in its own cell, the cell boundary will be halfway between the extrema, so your first small query box will only pick up a selectivity of 0.2. Better than 0.05! But still not really right.
> >
> > The reasoning behind having the shrinking dimensions will strike you as odd: why not just a minimum 100x100 grid? Well, that goes down to our (my) philosophy of trying to extract as much analytical power out of each cell as possible. Basically, your data is quite variable on the X axis and not really variable on the Y axis... should we allocate as many cells in X in our stats grid as we do in Y? If you answer "no", then you're off down the rabbit hole with me: for each dimension, statistically figure out how much variability there is in coverage, and then parcel out grid cells in each dimension to allocate the budget to the most variable dimensions, and less to homogeneous dimensions. This means that, in the limit, with a data set that is quite variable in X and quite homogeneous in Y, you could end up with a 1x20000 grid, with lots of resolution in the axis we care about and none in the homogeneous axis.
> >
> > Very clever, but lots of moving parts, and in the limit as the number of incoming records goes down, not going to yield estimates that are as close to "reality" as we might like. But in return it extracts more precise estimates in the case where the number of rows is very, very large.
> >
> > At least in theory. One possibility might be to push up the current maximum number of cells, and use some of that budget to put a higher floor on the minimum number of cells per dimension, so that the minimum isn't 1, as currently, but something more like 10, to cut down on the kind of divergence you're seeing in your small-table case. I have some tweaks in hand that do that already, but now that I'm into the code I'm trying to dream up ways to cut it back while still maintaining a certain measure of dynamism.
> >
> > How much operational difference does the mis-estimate create? For a small table, I really wonder. Because the number of tuples will always be small, it's not like you're likely to be dropped into an index scan by accident, even if the selectivity is coming back too high.
> >
> > I am way more interested in cases with larger numbers of rows generating
> mis-estimates. I assume that my code might also be vulnerable to tables
> with a lot of duplication, and also tables of points versus tables of
> polygons / lines.
> >
> > ATB,
> > P
> >
> > > On Dec 6, 2022, at 6:06 PM, Regina Obe <lr at pcorp.us> wrote:
> > >
> > > Correction.  I was mistaken, I tested the wrong example for rows=1.
> > >
> > > My ST_Contains returns the same answer as &&
> > >
> > > explain analyze select * from test where
> ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> > >
> > > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual
> time=0.025..0.027 rows=1 loops=1)
> > >   Filter:
> st_contains('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry,
> p)
> > >   Rows Removed by Filter: 4
> > > Planning Time: 0.121 ms
> > > Execution Time: 0.047 ms
> > >
> > > So the fact that it works for me and not for you might just be dumb luck: I ran the && test first, so the stats were already cached and being used.
> > >
> > > The difference in answer might also be if you modified your
> default_statistics_target.
> > >
> > > SHOW default_statistics_target;
> > >
> > > Mine shows 100.
> > >
> > > Thanks,
> > > Regina
> > >
> > >
> > > From: Regina Obe [mailto:lr at pcorp.us]
> > > Sent: Tuesday, December 6, 2022 8:59 PM
> > > To: 'PostGIS Users Discussion' <postgis-users at lists.osgeo.org>
> > > Subject: RE: [postgis-users] Row estimations
> > >
> > > Igor,
> > >
> > > The stats estimation for PostGIS has never been the greatest. It’s more of a limitation of PostgreSQL, as I understand it.
> > > It’s been getting better over the years though. Even in the best-case scenario only the bounding-box estimation would happen, and Paul may be better able to answer what happens with functions such as ST_Contains (which use the function instrumentation).
> > >
> > > So a better test to confirm your stats are working correctly is an && test, which I have below.
> > >
> > > I’m running  PostgreSQL 15.1, compiled by Visual C++ build 1914,
> 64-bit POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150"
> GEOS="3.11.1-CAPI-1.17.1" SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0"
> PROJ="7.2.1" GDAL="GDAL 3.4.3, released 2022/04/22" LIBXML="2.9.9"
> LIBJSON="0.12" LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER
> > >
> > > explain analyze select * from test where ST_GeomFromText('POLYGON((0
> 0,3 -0,3 3,0 3,0 0))') &&  p;
> > >
> > >
> > > which gives me the expected answer:
> > >
> > > Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual
> time=0.019..0.024 rows=4 loops=1)
> > >   Filter:
> ('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry
> && p)
> > >   Rows Removed by Filter: 1
> > > Planning Time: 0.131 ms
> > > Execution Time: 0.046 ms
> > >
> > > And 4 rows are indeed returned for me.
> > >
> > > explain analyze select * from test where ST_GeomFromText('POLYGON((50
> 0,52 0,52 2,50 2,50 0))') && p;
> > >
> > > Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual
> time=0.022..0.024 rows=1 loops=1)
> > >   Filter:
> ('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry
> && p)
> > >   Rows Removed by Filter: 4
> > > Planning Time: 0.124 ms
> > > Execution Time: 0.047 ms
> > >
> > > I confirm that I get the same answer, shown below, when doing
> > >
> > > explain analyze select * from test where
> ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);
> > >
> > >
> > > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual
> time=0.026..0.038 rows=4 loops=1)
> > >   Filter:
> st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry,
> p)
> > >   Rows Removed by Filter: 1
> > > Planning Time: 0.132 ms
> > > Execution Time: 0.061 ms
> > >
> > >
> > > So it does seem odd to me that ST_Contains would give a higher estimate than the && operation. That could be a bug in the function instrumentation of ST_Contains and ST_Intersects.
> > >
> > >
> > > Thanks,
> > > Regina
> > >
> > > From: postgis-users [mailto:postgis-users-bounces at lists.osgeo.org] On
> Behalf Of Igor ALBUQUERQUE SILVA
> > > Sent: Thursday, December 1, 2022 4:37 AM
> > > To: postgis-users at lists.osgeo.org
> > > Subject: [postgis-users] Row estimations
> > >
> > > Hello everyone,
> > >
> > > I have a question regarding the row estimations/gist indexes. Here's a
> minimal reproduction of it:
> > >
> > > create table test(p geometry(point));
> > > insert into test(p) values (st_makepoint(1,1));
> > > insert into test(p) values (st_makepoint(1,2));
> > > insert into test(p) values (st_makepoint(2,1));
> > > insert into test(p) values (st_makepoint(2,2));
> > > insert into test(p) values (st_makepoint(51,1));
> > > analyze test;
> > > explain analyze select * from test where
> ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
> > > explain analyze select * from test where
> ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);
> > >
> > > The two queries get the same cost/row estimation, of 1 row. This is
> the EXPLAIN ANALYZE of the first query:
> > >
> > > Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual
> time=0.015..0.022 rows=4 loops=1)
> > >    Filter:
> st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry,
> p)
> > >    Rows Removed by Filter: 1
> > >  Planning Time: 0.072 ms
> > >  Execution Time: 0.035 ms
> > > (5 rows)
> > >
> > > What I was expecting is the first query to estimate 4 rows and the
> second to estimate 1, like what I get If I try the same thing using
> integers.
> > >
> > > create table test(x integer, y integer);
> > > insert into test(x, y) values (0, 0);
> > > insert into test(x, y) values (0, 1);
> > > insert into test(x, y) values (1, 0);
> > > insert into test(x, y) values (1, 1);
> > > insert into test(x, y) values (50, 0);
> > > analyze test;
> > > explain analyze select * from test where x between 0 and 1 and y
> between 0 and 1;
> > > explain analyze select * from test where x between 50 and 51 and y
> between 0 and 1;
> > >
> > > My question is: is this expected behaviour? I actually have a much larger table with a gist index where I found this occurring, and it causes the planner to make bad decisions: every query that I do gets the same estimate, and whenever that estimate is very wrong, the planner does not make the optimal choice when it has to compare against other indexes' costs in a more complicated query.
> > >
> > > I'm using the official docker image, PostgreSQL 15.1 (Debian
> 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6)
> 10.2.1 20210110, 64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150"
> GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15"
> LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY
> > >
> > > Best regards,
> > > Igor
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users at lists.osgeo.org
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
>