Pushing bbox and filter constraints into PostGIS subquery

Fri Sep 21 20:50:16 EDT 2007

Steve and Steve,

I already did something like this for Oracle spatial but 
fsimon at univali.br has not applied the patch to svn yet.

http://trac.osgeo.org/mapserver/ticket/2129

It is critical the we allow additional control over these things in out 
database connection. In the Oracle case (check the query the DBA came up 
with) we got a very significant performance improvement after I applied 
the patch attached to the ticket.

THis is a generic implementation that also could be applied to PostGIS. 
And I would like to see the filter extension applied to both database 
providers.

-Steve

Steve Lime wrote:
> Hi Steve: We love suggestions where folks offer to roll up their sleeves!
> 
> Personally I think it's a necessary next step (pushing filter conditions to the RDBMS)
> for OGC filter support. I've not thought about implementation but would like to see
> support for all supported databases (PostGIS, Oracle and SDE). Getting it working
> for one would likely serve for a model for others. I've been kicking the idea around
> to move CGI attribute queries to use OGC filters so a modification like this would have
> benefits beyond just WFS (MapScript would benefit too). Assefa should really have to 
> chime in though since the filter support was his. 
> 
> Perhaps the next step is to start putting an RFC together?
> 
> Steve
> 
>>>> Steve Marshall <smarshall at WSI.COM> 09/21/07 1:14 PM >>>
> I have a suggestion for improving performance of PostGIS queries for layers
> that use subselect statements.  I don't see anything in the archives related
> to this issue, so I thought I'd post the idea and see if anyone has any
> thoughts about it.
> 
> The PostGIS DATA parameter allows users to specify data sources as either a
> single table or as a sub-query that may join several tables.  Queries are
> then built from this DATA statement into SQL of the following form
> (simplified from mappostgis.c code to avoid some minor flavor differences):
> 
> DECLARE mycursor BINARY CURSOR FOR 
> SELECT <columns_wanted> 
> FROM <data_source>
> WHERE (<filter_constraints>) AND (<geom_column> && <bounding_box>)
> 
> In the case where <data_source> is a simple table, the filter constraints
> and bounding box constraints are used in the initial query planning. 
> However, if the <data_source> is a subquery, the query ends up with this form:
> 
> DECLARE mycursor BINARY CURSOR FOR 
> SELECT <columns_wanted> 
> FROM (<sub_query>)  AS subquery
> WHERE (<filter_constraints>) AND (<geom_column> && <bounding_box>)
> 
> The result is that the subselect is executed without the filter_constraints
> or the bounding box constraints.  These constraints are applied to the
> results of the subquery as a secondary step.
> 
> While this process gives the correct results, it can lead to suboptimal
> performance.  For example, if one is querying for data in a small region
> from a global dataset, the bounding box constraint could greatly reduce the
> number of rows that need to be inspected.  In general, the PostgreSQL query
> engine will nearly always do a better job of optimization if more
> constraints can be given to the initial query.  
> 
> I was wondering if anyone had considered trying to push the filter and
> bounding box constraints inside of the subquery?   This would require
> spliting the subquery into pieces and merging in the filter and bounding box
> constraints, resulting in something like this:
> 
> DECLARE mycursor BINARY CURSOR FOR 
> SELECT <columns_wanted> 
> FROM (<subquery_select_columns_from_tables>
> WHERE (<filter_constraints>) AND (<geom_column> && <bounding_box>)
> AND <subquery_constraints_order_and_limit>) as subquery
> 
> Here <subquery_select_columns_from_tables> would have the beginning part of
> the subquery (i.e. SELECT, FROM, and all the table joins),  and
> <subquery_constraints_order_and limit> would the end of the subquery (e.g.
> any WHERE-clause constraints, ORDER BY, LIMIT, and OFFSET clauses)
> 
> I could see that this might not work properly if the subquery had a "GROUP
> BY" statement.  In such cases, the original filter and bounding box would be
> operating on the aggregated rows, while this new statement would operated on
> pre-aggegated rows.  However, we could circumvent this optimization if we
> detect the presence of a GROUP BY clause in the subquery.  We could do
> similar checks to avoid other troublesome SQL, to ensure optmized and
> unoptimized queries always return the same results.
> 
> Is there some reason that pushing the filtering into the subquery would be a
> bad thing to do?  Looking at the code in mappostgis.c (prepare_database
> function), it does not look like it would be too hard to implement.  Are
> there other insertion points that would need to be supported?
> 
> If the community is interested in this enhancement, I'd be willing to take a
> crack at it.
> 
> Steve Marshall