[postgis-users] Bad estimate from index on points

Wed Jun 12 11:29:05 PDT 2013

Re: Missing ANALYZE

Yeah, I saw that. I sent out a second message saying to add a VACUUM FULL
ANALYZE; on the whole database to the end of the script; it doesn't change
the outcome. (On my real data, I was most definitely doing this.) There's
no difference between analyzing individual tables and analyzing the full
database; is there?

Re: JOIN selectivity

Not sure that's an option. In my real data, I'm joining to several
polygons. Many of my queries go off to different tables that have polygons
subdividing the main polygon table. So are you basically saying that JOIN
and spatial data don't mesh well? Is there anything that can be done about
that fact? Hm. My data is pretty regular; all the points in the point table
fall within a polygon in the polygon table. If there were a way for the
system to figure out some kind of average points per polygon or something
even close to that, that would probably be within a factor of 10 or so. Of
course, indexes don't usually track cross table relationships (ever, that
I'm aware of).

I suppose what I really need is some kind of topology. Would the topology
libraries allow me to inform the system about the relationship between the
points and polygons and allow the query planner to take advantage of that
information? Even if so, topology functionality confuses me. Do you know of
a good introductory tutorial? I'm sure I won't have a problem with the
concepts; I just need some direction on how the tools in PostGIS work.

Thanks for the help.

On Wed, Jun 12, 2013 at 5:32 AM, Sandro Santilli <strk at keybit.net> wrote:

> On Wed, Jun 12, 2013 at 04:23:18AM -0400, BladeOfLight16 wrote:
>
> > the query planner is
> > getting horrible estimates for the number of point rows returned by the
> > spatial index.
>
> [...]
>
> > -- Insert a lot of points contained by the rectangles
> > INSERT INTO point (some_value, polygon_id, geom)
> > SELECT random()*100 + 20
> >      , polygon_id
> >      , ST_SetSRID(('POINT('||x||' '||y||')')::GEOMETRY, 26915)
> > FROM (SELECT polygon_id
> >            , random()*(ST_XMax(geom) - ST_XMin(geom)) + ST_XMin(geom) AS
> x
> >            , random()*(ST_YMax(geom) - ST_YMin(geom)) + ST_YMin(geom) AS
> y
> >            , generate_series(1,(random()*10000+5000)::INTEGER)
> >       FROM polygon) num_points
> > ;
> >
> > CREATE INDEX polygon_index ON polygon USING GIST (geom);
> > CREATE INDEX point_index ON point USING GIST (geom);
>
> [...]
>
> > EXPLAIN ANALYZE
> > SELECT SUM(some_value)
> > FROM point
> > JOIN polygon ON ST_Contains(polygon.geom, point.geom)
> > WHERE polygon.polygon_id = 50;
>
> [...]
>
> >               ->  Bitmap Index Scan on point_index  (cost=0.00..4.50
> rows=5
> > width=0) (actual time=1.869..1.869 rows=10180 loops=1)
> >                     Index Cond: (polygon.geom && geom)
>
> [...]
>
> > Note the "Bitmap Index Scan on point_index" line. The query planner
> > estimates 5 rows will come back. In reality, over 10000 (a 2000 times
> > increase) are returned. Is this a bug? Is there anything I can do to
> > improve the estimated?
>
> I didn't see an ANALYZE run between the INSERT and the EXPLAIN,
> does analyzing both the polygon table and the point table help
> in any way ? What version of PostGIS are you running ?
>
> Note that the JOIN selectivity estimator can't be that good (doesn't
> have enough information about which polygon you're going to pick from
> the polygons set, but makes a guess based on the whole table instead)
> so if you can turn that single polygon into a constant it should help
> the estimator.
>
> --strk;
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20130612/0b01a9cb/attachment.html>