[postgis-users] Row estimations

Tue Dec 6 17:58:46 PST 2022

Igor,

The stats estimation for PostGIS has always been not the greatest.  It’s more of a limitation with PostgreSQL as I understand it.

It’s been getting better over the years though.  Even in the best case scenario only the bounding box estimation would happen and Paul may be better able to answer what happens with the functions such as ST_Contains (which use the Function instrumentation).

So a better test to confirm your stats are working correctly is an && test which I have below.

I’m running  PostgreSQL 15.1, compiled by Visual C++ build 1914, 64-bit POSTGIS="3.3.2 3.3.2" [EXTENSION] PGSQL="150" GEOS="3.11.1-CAPI-1.17.1" SFCGAL="SFCGAL 1.4.1, CGAL 5.3, BOOST 1.78.0" PROJ="7.2.1" GDAL="GDAL 3.4.3, released 2022/04/22" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" TOPOLOGY RASTER

explain analyze select * from test where ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') &&  p;

which give me the expected answer of  – 

Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual time=0.019..0.024 rows=4 loops=1)

  Filter: ('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry && p)

  Rows Removed by Filter: 1

Planning Time: 0.131 ms

Execution Time: 0.046 ms

And 4 rows are indeed returned for me.

explain analyze select * from test where ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))') && p;

Seq Scan on test  (cost=0.00..1.06 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1)

  Filter: ('01030000000100000005000000000000000000494000000000000000000000000000004A4000000000000000000000000000004A4000000000000000400000000000004940000000000000004000000000000049400000000000000000'::geometry && p)

  Rows Removed by Filter: 4

Planning Time: 0.124 ms

Execution Time: 0.047 ms

I confirm that I get the same answer of below when doing

explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))') , p);

Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual time=0.026..0.038 rows=4 loops=1)

  Filter: st_contains('010300000001000000050000000000000000000000000000000000000000000000000008400000000000000080000000000000084000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, p)

  Rows Removed by Filter: 1

Planning Time: 0.132 ms

Execution Time: 0.061 ms

So it does seem odd to me that ST_Contains would give a higher estimate than the && operation.  That could be a bug in the function instrumentation of  ST_Contains and ST_Intersects.

Thanks,

Regina

From: postgis-users [mailto:postgis-users-bounces at lists.osgeo.org] On Behalf Of Igor ALBUQUERQUE SILVA
Sent: Thursday, December 1, 2022 4:37 AM
To: postgis-users at lists.osgeo.org
Subject: [postgis-users] Row estimations

Hello everyone,

I have a question regarding the row estimations/gist indexes. Here's a minimal reproduction of it:

create table test(p geometry(point));
insert into test(p) values (st_makepoint(1,1));
insert into test(p) values (st_makepoint(1,2));
insert into test(p) values (st_makepoint(2,1));
insert into test(p) values (st_makepoint(2,2));
insert into test(p) values (st_makepoint(51,1));
analyze test;
explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((0 0,3 -0,3 3,0 3,0 0))'), p);
explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((50 0,52 0,52 2,50 2,50 0))'), p);

The two queries get the same cost/row estimation, of 1 row. This is the EXPLAIN ANALYZE of the first query:

Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual time=0.015..0.022 rows=4 loops=1)
   Filter: st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry, p)
   Rows Removed by Filter: 1
 Planning Time: 0.072 ms
 Execution Time: 0.035 ms
(5 rows)

What I was expecting is the first query to estimate 4 rows and the second to estimate 1, like what I get If I try the same thing using integers.

create table test(x integer, y integer);
insert into test(x, y) values (0, 0);
insert into test(x, y) values (0, 1);
insert into test(x, y) values (1, 0);
insert into test(x, y) values (1, 1);
insert into test(x, y) values (50, 0);
analyze test;
explain analyze select * from test where x between 0 and 1 and y between 0 and 1;
explain analyze select * from test where x between 50 and 51 and y between 0 and 1;

My question is: is this expected behaviour? I actually have a much larger table with a gist index where I found this occurring, and this causes the planner to make bad decisions: every query that I do will have the same estimation, and whenever this estimation is very wrong, the planner does not take the optimal decision when it has to compare with other indexes costs in a more complicated query.

I'm using the official docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit POSTGIS="3.3.1 3786b21" [EXTENSION] PGSQL="150" GEOS="3.9.0-CAPI-1.16.2" PROJ="7.2.1" LIBXML="2.9.10" LIBJSON="0.15" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY

Best regards,

Igor

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20221206/40cacb02/attachment.htm>