[postgis-users] Bad estimate from index on points

BladeOfLight16 bladeoflight16 at gmail.com
Wed Jun 12 01:23:18 PDT 2013


I have a table that contains many point rows. The points all fall within a
polygon in another table. There are roughly 5 to 20 thousand points per
polygon. The polygons will frequently overlap. (The data is annual.) I have
to calculate aggregate values on the point data, and the query planner is
getting horrible estimates for the number of point rows returned by the
spatial index. I have created a psql script that will generate a database
with similar data. (My apologies for the length.)


DROP DATABASE IF EXISTS test_db;
CREATE DATABASE test_db;

\c test_db

CREATE EXTENSION postgis;

-- SRID is UTM 15 with NAD 83 datum

CREATE TABLE polygon
(
    polygon_id SERIAL PRIMARY KEY,
    geom GEOMETRY(MULTIPOLYGON, 26915)
);

CREATE TABLE point
(
    point_id SERIAL PRIMARY KEY,
    some_value DECIMAL(10,5),
    polygon_id INTEGER NOT NULL REFERENCES polygon (polygon_id),
    geom GEOMETRY(POINT, 26915)
);

-- Randomly generate 100 rectangles
INSERT INTO polygon (geom)
SELECT ST_Multi(ST_SetSRID(ST_Envelope(
           ('LINESTRING('||x||' '||y||','||(x+side)||' '||(y+side)||')')::GEOMETRY
       ), 26915))
FROM (SELECT random()*(823749.9411-176250.0589) + 176250.0589 AS x
           , random()*(9106037.1690-1577463.0797) + 1577463.0797 AS y
           , random()*(5000-1000) + 1000 AS side
      FROM generate_series(1, 100)
     ) box_data
;

-- Insert a lot of points contained by the rectangles
INSERT INTO point (some_value, polygon_id, geom)
SELECT random()*100 + 20
     , polygon_id
     , ST_SetSRID(('POINT('||x||' '||y||')')::GEOMETRY, 26915)
FROM (SELECT polygon_id
           , random()*(ST_XMax(geom) - ST_XMin(geom)) + ST_XMin(geom) AS x
           , random()*(ST_YMax(geom) - ST_YMin(geom)) + ST_YMin(geom) AS y
           , generate_series(1,(random()*10000+5000)::INTEGER)
      FROM polygon) num_points
;

CREATE INDEX polygon_index ON polygon USING GIST (geom);
CREATE INDEX point_index ON point USING GIST (geom);
CREATE INDEX fki_point_polygon_id_fkey ON point USING btree (polygon_id);
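
One note if you are reproducing this: the script above never runs ANALYZE after
the bulk load, so (assuming autovacuum has not gotten to these tables yet, which
may not match your setup) you may want to refresh the planner statistics by hand
before running the query below:

-- refresh planner statistics after the bulk load
-- (assumption: autovacuum has not analyzed these tables yet)
ANALYZE polygon;
ANALYZE point;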


Now connect to the database and run this query:

EXPLAIN ANALYZE
SELECT SUM(some_value)
FROM point
JOIN polygon ON ST_Contains(polygon.geom, point.geom)
WHERE polygon.polygon_id = 50;

Here is an example result:

Aggregate  (cost=45.91..45.92 rows=1 width=8) (actual time=303.788..303.793 rows=1 loops=1)
  ->  Nested Loop  (cost=4.50..29.77 rows=6457 width=8) (actual time=77.441..247.409 rows=10180 loops=1)
        ->  Seq Scan on polygon  (cost=0.00..4.25 rows=1 width=128) (actual time=0.024..0.039 rows=1 loops=1)
              Filter: (polygon_id = 50)
              Rows Removed by Filter: 99
        ->  Bitmap Heap Scan on point  (cost=4.50..25.50 rows=2 width=56) (actual time=77.382..151.578 rows=10180 loops=1)
              Recheck Cond: (polygon.geom && geom)
              Filter: _st_contains(polygon.geom, geom)
              ->  Bitmap Index Scan on point_index  (cost=0.00..4.50 rows=5 width=0) (actual time=1.869..1.869 rows=10180 loops=1)
                    Index Cond: (polygon.geom && geom)
Total runtime: 304.766 ms

Note the "Bitmap Index Scan on point_index" line. The query planner
estimates 5 rows will come back. In reality, over 10000 (a 2000 times
increase) are returned. Is this a bug? Is there anything I can do to
improve the estimated? I've been having performance issues on some very
complex queries with my real data, and while this might not be the cause of
them, it certainly doesn't help the query planner find the best plan. On my
real data, increasing statistics did not help; I haven't tested increasing
statistics on this sample.
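
For reference, what I mean by increasing statistics is something along these
lines (the target of 1000 here is just an arbitrary example value, not what I
used on the real data):

-- raise the per-column statistics target on the geometry column and re-analyze
-- (1000 is an arbitrary example value)
ALTER TABLE point ALTER COLUMN geom SET STATISTICS 1000;
ANALYZE point;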