[postgis-users] Help with Bad Query Plan

Oliver Snowden Oliver.Snowden at hpa.org.uk
Thu Jan 8 11:30:51 PST 2009


Dear all, I am having some problems with a bad query plan.  Essentially,
I have two tables.  A table of many different types of geography and
another table (reports) linking to those geographies.  We want to select
a small subset of reports (e.g. 30 reports) based on date and intersect
those report geographies (potentially historical geographies) with a
(current) selected geography.

I have (conceptually) worked out what I believe to be most efficient by
putting the query into canonical form, moving the restricts down and
putting the most restrictive item first; although I haven't actually
done any specific calculations - index/disk access speeds etc.

My view is that the query should be executed as shown below.

Any comments or suggestions will be greatly appreciated.  We have tried
various things inc. subqueries, using &&, and indexing almost everything
etc.  Still no luck.

Many thanks, Oliver.


                                       project 
                                 geolink.report_id
                                         |
                   intersect (geography.the_geom,
selected_geography.the_geom)
                                         |
                 ------------------------------------------------
                 |                                               |
               project                                       project
geolink.report_id, geography.the_geom
selected_geography.the_geom
                 |                                               |
                join                                          restrict
     geolink.geom_id = geography.gid
selected_geography.gid = 3
                 |                                               |
    ---------------------------                    [geography as
selected_geography]
    |                         |
 project                   project
report_id, geom_id      gid, the_geom
    |                         |
  restrict                    |
based on date                 |
    |                         |
 [geolink]                [geography]


Preparation:

1. First experienced the issue on PostGIS 1.3.3-1 on Ubuntu 8.10.
Installed PostGIS 1.3.5 (PostgreSQL 8.3) on Windows Vista and the
details, below, refer to that Windows installation.

2. Loaded continent, region, country, admin areas, and cities from the
ESRI Data & Maps CD (2006 data).
 Data was loaded with SPIT in Quantum GIS 0.11.0-Metis using SRID 4326.


##############################
#
# COMPLETE SCRIPT v4
#
##############################

-- 1.
CREATE SEQUENCE gdors_geography_pkey_seq;

CREATE TABLE gdors_geography (gid serial PRIMARY KEY,
 geotype varchar(32),
 published date,
 supplier varchar(64),
 objectid integer,
 name character varying(128),
 sqmi double precision,
 sqkm double precision,
 comment character varying(192));
SELECT
AddGeometryColumn('','gdors_geography','the_geom','4326','GEOMETRY',2);
ALTER TABLE gdors_geography ALTER COLUMN gid SET DEFAULT
NEXTVAL('gdors_geography_pkey_seq');


-- 2.
-- (Note: inserted the following geographies four times, each, to
increase the number of geometries)

INSERT INTO gdors_geography (geotype, published, supplier, the_geom,
objectid, name, sqmi, sqkm)
SELECT 'CONTINENT', '2006-01-01', 'ESRI', c.the_geom, c.objectid,
c.continent, c.sqmi, c.sqkm
FROM continent c;

INSERT INTO gdors_geography (geotype, published, supplier, the_geom,
objectid, name, sqmi, sqkm)
SELECT 'REGION', '2006-01-01', 'ESRI', r.the_geom, r.objectid, r.region,
r.sqmi, r.sqkm
FROM region r;

INSERT INTO gdors_geography (geotype, published, supplier, the_geom,
objectid, name, sqmi, sqkm)
SELECT 'COUNTRY', '2006-01-01', 'ESRI', c.the_geom, c.objectid,
c.cntry_name, c.sqmi, c.sqkm
FROM cntry06 c;

INSERT INTO gdors_geography (geotype, published, supplier, the_geom,
objectid, name, sqmi, sqkm)
SELECT 'ADMIN','2006-01-01', 'ESRI', a.the_geom, a.objectid,
a.admin_name, a.sqmi, a.sqkm
FROM admin a;

INSERT INTO gdors_geography (geotype, published, supplier, the_geom,
objectid, name)
SELECT 'CITY','2006-01-01', 'ESRI', c.the_geom, c.objectid, c.city_name
FROM city c;

-- 3.  Create table linking, say, reports to geography.
CREATE TABLE gdors_geolink
(
  report_id integer,
  geom_id integer,
  lastupdated date,
  temp_report_date date,
  CONSTRAINT gdors_geolink_pkey PRIMARY KEY (report_id, geom_id), 
  CONSTRAINT gdors_geolink_geom_id_fkey FOREIGN KEY (geom_id)
      REFERENCES gdors_geography (gid) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);


-- 4. Insert lots of report data to geography. For example:
INSERT INTO gdors_geolink(report_id, geom_id, lastupdated,
temp_report_date) VALUES (1,5061,'2006-06-01', '2005-10-01');
INSERT INTO gdors_geolink(report_id, geom_id, lastupdated,
temp_report_date) VALUES (2,20034,'2006-06-01', '1980-07-27');
INSERT INTO gdors_geolink(report_id, geom_id, lastupdated,
temp_report_date) VALUES (3,20848,'2006-06-01', '1991-07-15');
--...10,000 report entries

-- 5. Display table sizes.

-- FYI: 10000 Reports
SELECT COUNT(*) AS count
FROM gdors_geolink;

--FYI: 21620 Geographies
SELECT COUNT(*) AS count
FROM gdors_geography;


-- 6. Build up query.

-- STEP 1 (reports between dates)
-- 275 rows, 15ms.
SELECT geolink.report_id
FROM gdors_geolink geolink
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06';

-- STEP 2 (join geography)
-- 275 rows, 16ms.
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid;

-- STEP 3 (select geography, e.g. Asia continent)
-- 275 rows, 26ms.
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3;

-- STEP 4 (intersect Asia continent over all other geographies)
-- 81 rows, ~13000ms - a large increase in query time.
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3
AND ST_Intersects(selected_geography.the_geom, geography.the_geom);


-- STEP 5 (count reports within geography)
-- 81 rows, ~13000ms.
SELECT COUNT(*) AS count
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3
AND ST_Intersects(selected_geography.the_geom, geography.the_geom);

-- 7. Create Index:
CREATE INDEX gdors_geography_geom_index ON gdors_geography USING GIST
(the_geom);
VACUUM ANALYZE;

-- 8. Execute intersect query.
-- EXECUTES FOR A VERY LONG TIME (hours? days?) - it hasn't ever
finished for me.
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3
AND ST_Intersects(selected_geography.the_geom, geography.the_geom);


-- 9. Compare query plan with and without the ST_Intersects statement:

EXPLAIN ANALYZE SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3


"Nested Loop  (cost=0.00..1490.77 rows=265 width=8) (actual
time=1.063..16.169 rows=275 loops=1)"
"  ->  Index Scan using gdors_geography_pkey on gdors_geography
selected_geography  (cost=0.00..8.27 rows=1 width=0) (actual
time=0.047..0.051 rows=1 loops=1)"
"        Index Cond: (gid = 3)"
"  ->  Nested Loop  (cost=0.00..1479.85 rows=265 width=8) (actual
time=0.168..14.059 rows=275 loops=1)"
"        ->  Seq Scan on gdors_geolink geolink  (cost=0.00..205.00
rows=265 width=8) (actual time=0.112..5.842 rows=275 loops=1)"
"              Filter: ((temp_report_date >= '2008-01-06'::date) AND
(temp_report_date <= '2009-01-06'::date))"
"        ->  Index Scan using gdors_geography_pkey on gdors_geography
geography  (cost=0.00..4.80 rows=1 width=4) (actual time=0.014..0.017
rows=1 loops=275)"
"              Index Cond: (geography.gid = geolink.geom_id)"
"Total runtime: 16.927 ms"

(Used 'Explain query' in pgAdminIII to get result.  EXPLAIN ANALYZE too
slow - it would not finish within hours? days?)
SELECT geolink.report_id, geography.gid
FROM gdors_geolink geolink, gdors_geography geography, gdors_geography
selected_geography
WHERE geolink.temp_report_date BETWEEN '2008-01-06' AND '2009-01-06'
AND geolink.geom_id = geography.gid
AND selected_geography.gid=3
AND ST_Intersects(selected_geography.the_geom, geography.the_geom);


"Nested Loop  (cost=0.00..91.85 rows=1 width=8)"
"  ->  Nested Loop  (cost=0.00..16.56 rows=1 width=4)"
"        Join Filter: _st_intersects(selected_geography.the_geom,
geography.the_geom)"
"        ->  Index Scan using gdors_geography_pkey on gdors_geography
selected_geography  (cost=0.00..8.27 rows=1 width=13706)"
"              Index Cond: (gid = 3)"
"        ->  Index Scan using gdors_geography_geom_index on
gdors_geography geography  (cost=0.00..8.27 rows=1 width=13710)"
"              Index Cond: (selected_geography.the_geom &&
geography.the_geom)"
"              Filter: (selected_geography.the_geom &&
geography.the_geom)"
"  ->  Index Scan using gdors_geolink_pkey on gdors_geolink geolink
(cost=0.00..75.28 rows=1 width=8)"
"        Index Cond: (geolink.geom_id = geography.gid)"
"        Filter: ((geolink.temp_report_date >= '2008-01-06'::date) AND
(geolink.temp_report_date <= '2009-01-06'::date))"


-- 10. Created various indexes but could not seem to improve the
performance / influence the query plan.
-----------------------------------------
**************************************************************************
The information contained in the EMail and any attachments is
confidential and intended solely and for the attention and use of
the named addressee(s). It may not be disclosed to any other person
without the express authority of the HPA, or the intended
recipient, or both. If you are not the intended recipient, you must
not disclose, copy, distribute or retain this message or any part
of it. This footnote also confirms that this EMail has been swept
for computer viruses, but please re-sweep any attachments before
opening or saving. HTTP://www.HPA.org.uk
**************************************************************************



More information about the postgis-users mailing list