[postgis-users] Bad estimate from index on points

BladeOfLight16 bladeoflight16 at gmail.com
Thu Jun 13 11:29:42 PDT 2013


Re: The factor of 10
There wasn't really anything substantive there. I was just musing about how
the query planner could get a better estimate and why that wouldn't work.
The number of points per polygon ranges from about 5,000 to 20,000. An
average would probably be somewhere around 10-15 thousand, and all the
point counts would fall well within average/10 to average*10. I recall
reading somewhere that estimates off by single digit factors aren't a cause
for concern when looking at query plans. So I was saying the average number
of points would probably be a good enough estimate for the query planner to
make intelligent decisions, but since that's a multi-table relationship,
the database wouldn't track that information and couldn't use it for query
planning. That's all.

On Thu, Jun 13, 2013 at 2:00 AM, Sandro Santilli <strk at keybit.net> wrote:

> On Wed, Jun 12, 2013 at 02:29:05PM -0400, BladeOfLight16 wrote:
>
> > Re: JOIN selectivity
> >
> > Not sure that's an option. In my real data, I'm joining to several
> > polygons. Many of my queries go off to different tables that have
> polygons
> > subdividing the main polygon table. So are you basically saying that JOIN
> > and spatial data don't mesh well? Is there anything that can be done
> about
> > that fact? Hm. My data is pretty regular; all the points in the point
> table
> > fall within a polygon in the polygon table. If there were a way for the
> > system to figure out some kind of average points per polygon or something
> > even close to that, that would probably be within a factor of 10 or so.
> Of
> > course, indexes don't usually track cross table relationships (ever, that
> > I'm aware of).
>
> I don't understand the factor of 10 here. Your case is really returning
> 10000 rows within a single polygon, right ? What's 10 here ?
> I'm not even sure the join selectivity code is being engaged, consider
> rebuilding postgis with --enable-debug to get traces of what the estimator
> thinks about that. Sharing the dataset could be also interesting.
>
> > I suppose what I really need is some kind of topology. Would the topology
> > libraries allow me to inform the system about the relationship between
> the
> > points and polygons and allow the query planner to take advantage of that
> > information? Even if so, topology functionality confuses me. Do you know
> of
> > a good introductory tutorial? I'm sure I won't have a problem with the
> > concepts; I just need some direction on how the tools in PostGIS work.
>
> Of course if you put all your points and your polygons into a topology
> those relationships will be explicit so things would be faster, but
> building
> the topology and maintaining it if changes occur will be more expensive.
> The postgis manual contains all the information we have about it, the
> closer to a tutorial is likely my slides, linked from the manual.
>
> --strk;
>
> >
> > Thanks for the help.
> >
> >
> > On Wed, Jun 12, 2013 at 5:32 AM, Sandro Santilli <strk at keybit.net>
> wrote:
> >
> > > On Wed, Jun 12, 2013 at 04:23:18AM -0400, BladeOfLight16 wrote:
> > >
> > > > the query planner is
> > > > getting horrible estimates for the number of point rows returned by
> the
> > > > spatial index.
> > >
> > > [...]
> > >
> > > > -- Insert a lot of points contained by the rectangles
> > > > INSERT INTO point (some_value, polygon_id, geom)
> > > > SELECT random()*100 + 20
> > > >      , polygon_id
> > > >      , ST_SetSRID(('POINT('||x||' '||y||')')::GEOMETRY, 26915)
> > > > FROM (SELECT polygon_id
> > > >            , random()*(ST_XMax(geom) - ST_XMin(geom)) +
> ST_XMin(geom) AS
> > > x
> > > >            , random()*(ST_YMax(geom) - ST_YMin(geom)) +
> ST_YMin(geom) AS
> > > y
> > > >            , generate_series(1,(random()*10000+5000)::INTEGER)
> > > >       FROM polygon) num_points
> > > > ;
> > > >
> > > > CREATE INDEX polygon_index ON polygon USING GIST (geom);
> > > > CREATE INDEX point_index ON point USING GIST (geom);
> > >
> > > [...]
> > >
> > > > EXPLAIN ANALYZE
> > > > SELECT SUM(some_value)
> > > > FROM point
> > > > JOIN polygon ON ST_Contains(polygon.geom, point.geom)
> > > > WHERE polygon.polygon_id = 50;
> > >
> > > [...]
> > >
> > > >               ->  Bitmap Index Scan on point_index  (cost=0.00..4.50
> > > rows=5
> > > > width=0) (actual time=1.869..1.869 rows=10180 loops=1)
> > > >                     Index Cond: (polygon.geom && geom)
> > >
> > > [...]
> > >
> > > > Note the "Bitmap Index Scan on point_index" line. The query planner
> > > > estimates 5 rows will come back. In reality, over 10000 (a 2000 times
> > > > increase) are returned. Is this a bug? Is there anything I can do to
> > > > improve the estimated?
> > >
> > > I didn't see an ANALYZE run between the INSERT and the EXPLAIN,
> > > does analyzing both the polygon table and the point table help
> > > in any way ? What version of PostGIS are you running ?
> > >
> > > Note that the JOIN selectivity estimator can't be that good (doesn't
> > > have enough information about which polygon you're going to pick from
> > > the polygons set, but makes a guess based on the whole table instead)
> > > so if you can turn that single polygon into a constant it should help
> > > the estimator.
> > >
> > > --strk;
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20130613/41b57fd2/attachment.html>


More information about the postgis-users mailing list