Pushing bbox and filter constraints into PostGIS subquery

Fri Sep 21 14:14:54 EDT 2007

I have a suggestion for improving performance of PostGIS queries for layers
that use subselect statements.  I don't see anything in the archives related
to this issue, so I thought I'd post the idea and see if anyone has any
thoughts about it.

The PostGIS DATA parameter allows users to specify data sources as either a
single table or as a sub-query that may join several tables.  Queries are
then built from this DATA statement into SQL of the following form
(simplified from mappostgis.c code to avoid some minor flavor differences):

DECLARE mycursor BINARY CURSOR FOR 
SELECT <columns_wanted> 
FROM <data_source>
WHERE (<filter_constraints>) AND (<geom_column> &&amp; <bounding_box>)

In the case where <data_source> is a simple table, the filter constraints
and bounding box constraints are used in the initial query planning. 
However, if the <data_source> is a subquery, the query ends up with this form:

DECLARE mycursor BINARY CURSOR FOR 
SELECT <columns_wanted> 
FROM (<sub_query>)  AS subquery
WHERE (<filter_constraints>) AND (<geom_column> &&amp; <bounding_box>)

The result is that the subselect is executed without the filter_constraints
or the bounding box constraints.  These constraints are applied to the
results of the subquery as a secondary step.

While this process gives the correct results, it can lead to suboptimal
performance.  For example, if one is querying for data in a small region
from a global dataset, the bounding box constraint could greatly reduce the
number of rows that need to be inspected.  In general, the PostgreSQL query
engine will nearly always do a better job of optimization if more
constraints can be given to the initial query.  

I was wondering if anyone had considered trying to push the filter and
bounding box constraints inside of the subquery?   This would require
spliting the subquery into pieces and merging in the filter and bounding box
constraints, resulting in something like this:

DECLARE mycursor BINARY CURSOR FOR 
SELECT <columns_wanted> 
FROM (<subquery_select_columns_from_tables>
WHERE (<filter_constraints>) AND (<geom_column> &&amp; <bounding_box>)
AND <subquery_constraints_order_and_limit>) as subquery

Here <subquery_select_columns_from_tables> would have the beginning part of
the subquery (i.e. SELECT, FROM, and all the table joins),  and
<subquery_constraints_order_and limit> would the end of the subquery (e.g.
any WHERE-clause constraints, ORDER BY, LIMIT, and OFFSET clauses)

I could see that this might not work properly if the subquery had a "GROUP
BY" statement.  In such cases, the original filter and bounding box would be
operating on the aggregated rows, while this new statement would operated on
pre-aggegated rows.  However, we could circumvent this optimization if we
detect the presence of a GROUP BY clause in the subquery.  We could do
similar checks to avoid other troublesome SQL, to ensure optmized and
unoptimized queries always return the same results.

Is there some reason that pushing the filtering into the subquery would be a
bad thing to do?  Looking at the code in mappostgis.c (prepare_database
function), it does not look like it would be too hard to implement.  Are
there other insertion points that would need to be supported?

If the community is interested in this enhancement, I'd be willing to take a
crack at it.

Steve Marshall