[postgis-users] PostGIS spatial index use optimization (Proposal)

Wed Oct 2 16:47:10 PDT 2002

Several people have reported index selection problems with PostGIS:
sometimes it ops to use the spatial (GiST) index when its inappropriate.

I suggest you also read
http://www.postgresql.org/idocs/index.php?planner-stats.html
since it gives a good introduction to the statistics collected and used by
the query planner.

Postgresql query planning
-------------------------

I'll start with a quick example, then move on to spatial indexing.
Lets start with a simple table:

name      location          age
-------------------------------
dave      james bay         31
paul      james bay         30
oscar     central park      23
chris     downtown          22

With some indexes:

CREATE index people_name_idx  on people (name);
CREATE index people_age_idx   on people (age);

We then start to execute some queries:

#1) SELECT * FROM people WHERE location = 'james bay';

Postgresql's only possible query plan is to do a sequential scan of the
table and check to see if location ='james bay' - there isnt an index to
consult.  The sequential scan is simple - it loads each row from the table
and checks to see if location='james bay'.

Here's what postgresql says about this query:

dblasby=# explain analyse SELECT * FROM people WHERE location = 'james bay';

NOTICE:  QUERY PLAN:

Seq Scan on people  (cost=0.00..1.05 rows=2 width=25) (actual
time=0.02..0.03 rows=2 loops=1)
Total runtime: 0.07 msec

Note that postgresql is using a "Seq Scan" and predicts 2 result rows.  The
planner is very accurate in this estimate because the statistics collected
explicitly say that 'james bay' is the most common value in this column (cf.
pg_stats and pg_statistics tables).

#2) SELECT * FROM people WHERE name ='dave';

Here the query planner has two option - it can do a sequential scan or it
can use the people_name_idx.  Here's what the query planner says (I
explicitly tell it to use the index in the 2nd run):

dblasby=# explain analyse SELECT * FROM people WHERE name ='dave';
NOTICE:  QUERY PLAN:

Seq Scan on people  (cost=0.00..1.05 rows=1 width=25) (actual
time=0.02..0.03 rows=1 loops=1)
Total runtime: 0.07 msec

dblasby=# set enable_seqscan=off;
dblasby=# explain analyse SELECT * FROM people WHERE name ='dave';
NOTICE:  QUERY PLAN:

Index Scan using people_name_idx on people  (cost=0.00..4.41 rows=1
width=25) (actual time=33.72..33.73 rows=1 loops=1)
Total runtime: 33.82 msec

In this case the sequential scan is faster because it only has to load one
page from the disk and check 4 rows.  The index scan will have to load in
the index, perform the scan, then load in the page with the data in it.
On a larger table, the index scan would probably be significantly faster.

#3)SELECT * FROM people WHERE age < 30;

Again, we have a chose of the index or sequential scan. The estimate of the
number of rows is calculated by the "<" operator for integers and the table
statistics. We'll talk about row # estimates later.

dblasby=# explain analyse SELECT * FROM people WHERE age < 30;
NOTICE:  QUERY PLAN:

Seq Scan on people  (cost=0.00..1.05 rows=3 width=25) (actual
time=0.03..0.03 rows=2 loops=1)
Total runtime: 0.10 msec

EXPLAIN
dblasby=#  set enable_seqscan=off;
SET VARIABLE
dblasby=# explain analyse SELECT * FROM people WHERE age < 30;
NOTICE:  QUERY PLAN:

Index Scan using people_age_idx on people  (cost=0.00..3.04 rows=3 width=25)
(actual time=0.06..0.07 rows=2 loops=1)
Total runtime: 0.15 msec

#4)  SELECT * FROM people WHERE  age < 30 AND  name='dave';

Here we have 3 plans - sequence scan and index scan (age) and index scan
(name).  The actual plan chosen will be determined by looking at all the
plans and estimating which is fastest.
If the 'people' table had a lot of rows, the "name='dave'" clause would
probably be much more selective (return less results) than the "age<30"
clause.  The planner would probably use the name index.

Spatial Indexing
----------------

For spatial indexing, queries look like:
    SELECT * FROM mygeomtable WHERE the_geom && 'BOX3D(...)'::box3d;

The planner has two options - do a sequential scan of the table (and check
each geometry against the BOX3D) or do an index scan using the GiST index.
Its further confused because TOASTed (large) geometries will be expensive to
test in the sequential scan because the entire geometry much be read.

The planner makes it's choice by estimating how many rows the " the_geom &&
'BOX3D(...)'::box3d " clause will return - if its just a few row, the index
will be extremely efficient.  If it returns a large number of rows, the cost
involved in consulting the index AND then actually reading the data from the
table will be high.

When PostGIS was first created, it was very difficult to get postgresql to
actually use the index because the estimate of the number of rows returned
was always high.  The planner thought the sequential scan was faster.

A later release of PostGIS tried to fix this by lowering this estimate.
This made postgresql use the spatial index almost always.  Since the GiST
index has low overhead, this worked well even if the index was used when it
wasnt the best plan.  The estimate was very simple - it was a fixed % of the
total data.

Everything was great until people started doing queries like:

SELECT * FROM mygeomtable WHERE
        the_geom && 'BOX3D(...)'::box3d AND
        road_type =27;

Postgresql now has to make a decision between the spatial index and the
attribute index (on 'road_type').  Since we were doing a very simple
estimate for the selectivity of the && operator, the spatial index is often
used when another index (ie. road_type) would be more appropriate.  When the
BOX3D is "large" the problem is applified because the estimate is very
wrong.
Second, problems with joins and other queries occur because the estimate of
the number of rows returned was much lower than in actuality and postgresql
choose poor table joining strategies.

New Proposal
------------

The current selectivity of the "&&" operator is fixed - it always reports (i
think 5%) the same number of rows no matter where the query BOX3D is or its
size.  This is as simple as it gets.

The next simplest method is to look at the extents of the data, and the size
of the query window.  If the query window is 1/4 the size of the data, then
we can estimate that 1/4 of the rows will be returned.  This is quite
workable, but has problem because spatial data is not uniformly distributed.

The proposed method extends the simple method - we use another index to
estimate the number of results for the "&&" operator.  WHEW - indexing our
indexes!  In a nutshell, the plan is to use a modified quad tree (ie. has
fixed cell sizes and location) to store a 2d histogram of the spatial data.
This is best explained in a few diagrams.  See the attached 2 diagrams.

In diagram one we see typical spatial data - road in the province of BC.  In
place like Vancouver, Victoria, Kelowna, Prince George there are LOTS of
roads.  Elsewhere there are hardly any.  There are a total of 200,000 rows
in the roads table.  The proposed method does a 3 stage pre-process.  (See
figure 2)

1. computes the extents of the data.  Basically
    SELECT extent(the_geom) FROM mygeometrytable;
2. make a fixed size grid from the extents - something like a 15*15 matrix
3. scan through each row of the geometry table and determine which of the
   grid's cells overlap the bounding box of the geometry.  Increment
   those grid cells by 1.

The result will be a 2d histogram of the spatial data.  Cells with lots of
geometries in them will have a high value, and cells with few will have a
low value.  This is stored somewhere (ie. the geometry_columns metatable but
it would be nice to have it siting in shared memory somewhere) - the
histogram will be small enough to sit on one disk page.

An incomming query's (ie. the_geom && 'BOX3D(..)') selectivity is calculated
this way:
1. Load the appropriate 2d histogram (one disk page)
2. find which cells the BOX3D overlaps and the % area of the cell that the
   BOX3D overlaps
3. sum ( % overlap * <number of geometries in that cell>)

This should give a 'reasonable' estimate of the particular "&&"
selectivity.  The estimate will be poor if the data is extreamly skewed.

An extention of this would be to make a proper quad tree (instead of fixed
cells), then simplify it.  Unfortunately this could be very difficult and
isnt proposed (yet).

Thoughts?

dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: postgis2.gif
Type: image/gif
Size: 31862 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20021002/b25bea50/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postgis1.gif
Type: image/gif
Size: 17146 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20021002/b25bea50/attachment-0001.gif>