I have a table that contains many point rows. The points all fall within a polygon in another table. There are roughly 5 to 20 thousand points per polygon. The polygons will frequently overlap. (The data is annual.) I have to calculate aggregate values on the point data, and the query planner is getting horrible estimates for the number of point rows returned by the spatial index. I have create a psql script that will generate a database with similar data. (My apologies for the length.)<br>
<br><br>DROP DATABASE IF EXISTS test_db;<br>CREATE DATABASE test_db;<br><br>\c test_db<br><br>CREATE EXTENSION postgis;<br><br>-- SRID is UTM 15 with NAD 83 datum<br><br>CREATE TABLE polygon<br>(<br> polygon_id SERIAL PRIMARY KEY,<br>
geom GEOMETRY(MULTIPOLYGON, 26915)<br>);<br><br>CREATE TABLE point<br>(<br> point_id SERIAL PRIMARY KEY,<br> some_value DECIMAL(10,5),<br> polygon_id INTEGER NOT NULL REFERENCES polygon (polygon_id),<br> geom GEOMETRY(POINT, 26915)<br>
);<br><br>-- Randomly generate 100 rectangles<br>INSERT INTO polygon (geom)<br>SELECT ST_Multi(ST_SetSRID(ST_Envelope(('LINESTRING('||x||' '||y||','||(x+side)||' '||(y+side)||')')::GEOMETRY), 26915))<br>
FROM (SELECT random()*(823749.9411-176250.0589) + 176250.0589 AS x<br> , random()*(9106037.1690-1577463.0797) + 1577463.0797 AS y<br> , random()*(5000-1000) + 1000 AS side<br> FROM generate_series(1, 100)<br>
) box_data<br>;<br><br>-- Insert a lot of points contained by the rectangles<br>INSERT INTO point (some_value, polygon_id, geom)<br>SELECT random()*100 + 20<br> , polygon_id<br> , ST_SetSRID(('POINT('||x||' '||y||')')::GEOMETRY, 26915)<br>
FROM (SELECT polygon_id<br> , random()*(ST_XMax(geom) - ST_XMin(geom)) + ST_XMin(geom) AS x<br> , random()*(ST_YMax(geom) - ST_YMin(geom)) + ST_YMin(geom) AS y<br> , generate_series(1,(random()*10000+5000)::INTEGER)<br>
FROM polygon) num_points<br>;<br><br>CREATE INDEX polygon_index ON polygon USING GIST (geom);<br>CREATE INDEX point_index ON point USING GIST (geom);<br>CREATE INDEX fki_point_polygon_id_fkey ON point USING btree (polygon_id);<br>
<br><br>Now connect to the database and run this query:<br><br>EXPLAIN ANALYZE<br>SELECT SUM(some_value)<br>FROM point<br>JOIN polygon ON ST_Contains(polygon.geom, point.geom)<br>WHERE polygon.polygon_id = 50;<br><br>Here is an example result:<br>
<br>Aggregate (cost=45.91..45.92 rows=1 width=8) (actual time=303.788..303.793 rows=1 loops=1)<br> -> Nested Loop (cost=4.50..29.77 rows=6457 width=8) (actual time=77.441..247.409 rows=10180 loops=1)<br> -> Seq Scan on polygon (cost=0.00..4.25 rows=1 width=128) (actual time=0.024..0.039 rows=1 loops=1)<br>
Filter: (polygon_id = 50)<br> Rows Removed by Filter: 99<br> -> Bitmap Heap Scan on point (cost=4.50..25.50 rows=2 width=56) (actual time=77.382..151.578 rows=10180 loops=1)<br> Recheck Cond: (polygon.geom && geom)<br>
Filter: _st_contains(polygon.geom, geom)<br> -> Bitmap Index Scan on point_index (cost=0.00..4.50 rows=5 width=0) (actual time=1.869..1.869 rows=10180 loops=1)<br> Index Cond: (polygon.geom && geom)<br>
Total runtime: 304.766 ms<br><br>Note the "Bitmap Index Scan on point_index" line. The query planner estimates 5 rows will come back. In reality, over 10000 (a 2000 times increase) are returned. Is this a bug? Is there anything I can do to improve the estimated? I've been having performance issues on some very complex queries with my real data, and while this might not be the cause of them, it certainly doesn't help the query planner find the best plan. On my real data, increasing statistics did not help; I haven't tested increasing statistics on this sample.<br>