[gdal-dev] Optimizing random access in SQL result set of OGR DB drivers

Wed Aug 26 17:54:58 EDT 2009

Martin,

I take the liberty to CC the list, as it is an interesting issue and it's better
to go public if we want to make progress on that.

In a few words, this is about how we could speed up GetFeature() on the layers
returned when issuing a SQL request on OGR drivers that rely on databases : PG
driver, MySQL driver, SQLite driver (... ?). For people wanting to take part,
you might be interested in the below email exchange first.

After examining Martin's patches and reading its explanations, I cann see 2
possibilities if we want to implement such optimization :

1) we change the GetNextFeature() of the ResultSetLayer of the PG, MySQL and
SQLite drivers so that they always return a "FID" which is an integer : 0, 1, 2,
3 instead of the current behaviour where sometimes it is the FID from the table
when the result is just a sub-set of the table, or this ordinal suite in other
cases. I don't think that we have a strict need to match the FID of a SQL result
set with one of the database table since it is not possible in all cases (see
the example of the PG driver which sometimes return a "true" FID, sometimes just
an ordinal). Always returning 0, 1, 2, etc. as an FID on such layers might be
acceptable I think. This way the current GetFeature() would match both its
advertized semantics (GetFeature(nFID)->GetFID() == nFID) and the use you want
to make from it. Perhaps we should have a capability (OLCFIDIsOrdinal) on that
layer that means that feature ids are guaranteed to be just ordinals, so that
the user can be confident in the fact that GetFeature(nFID) where 0 <= nFID <
nFeatureCount will always succeed. (Of course, the layer should also advertizes
the OLCRandomRead so that the user knows that GetFeature() is specialized and
fast)

Pro : no need of new API, existing apps can benefit from enhanced performance
for free
Cons : change in behaviour for existing applications that would rely
on the fact that the feature id of a sql layer matches - sometimes - the feature
id of the associated table layer. AFAIK, this assumption is not made in the doc
of the OGR API.

2) add a new method in OGRLayer, something like GetFeatureByIndex(nIndex) ? that
would always return the nIndex th element (starting with 0) returned by
iterating GetNextFeature() nIndex+1 times. A default implementation of that
would be trivial in OGRLayer, and it could be specialized in the interested
drivers.

Pro : doesn't break anything existing
Cons :
  * new API
  * might be confusing as close to GetFeature()
  * only useful in very few cases (drivers which don't have a concept of FID
just returns ordinals)

3) do nothing... The user application can iterate on the result set and stores
in its own array a pointer to the features returned by GetNextFeature()

Pro : no change in GDAL
Cons :
  * no change in GDAL ;-)
  * if the result set is significant, might require a lot of memory and time to
build this array

So, personnaly, I think solution 1) might be the best solution.

What do you think ?

Best regards,

Even

> Even,
>
> Ok, after reviewing the code what I did is essentially overload the virtual
> function GetFeature(long nFeatureId) in the resultset layer to provide
> myself fast random access functionality in a resultset returned by
> ExecuteSQL() for a resultset row "index" versus an FID.  GetFeature() on the
> table class uses an FID which is not the same as the row index for a
> resultset.  I really should have named it GetFeature(long index) instead of
> GetFeature(long nFeatureId).
>
> Now I understand better what I did.  If you call ExecuteSQL() with a spatial
> of attribute filter I believe FID's come back like the following:
>
> 3
> 20
> 25
> 26
> 100
> 48
> ...
>
> And I needed an ordinal offset regardless of resultset filter like
>
> 0
> 1
> 2
> 3
> ...
>
> The reason was because I had other data structures in my application (like
> arrays and an in-memory rtree structure) that only worked with ordinals.
> Thus, in order to link the multiple structures without sacrificing
> performance I needed to work with array offsets versus FID's.
>
> Moral of the story, GDAL may be able to use FID's in an efficient manor when
> being used alone, but when included in higher level applications that use
> additional data structures the FID concept is limiting when using other
> linked data structures that are limited to ordinal array offsets.
>
> Martin
>
>
>
> -----Original Message-----
> From: Even Rouault [mailto:even.rouault at mines-paris.org]
> Sent: Tuesday, August 25, 2009 4:22 PM
> To: chapmanm at pixia.com
> Subject: About your proposed optimizations in OGR db drivers
>
> Martin,
>
> don't ask me how, but I've just happened to read this page :
> http://postgis.refractions.net/pipermail/postgis-users/2008-April/019246.htm
> l
>
> Could you explain a bit more the background of those optimizations that at
> first
> sight looks interesting ? I see that the 3 ones are for the layer returned
> by
> ExecuteSQL on PG, MySQL and sqlite.
>
> 0) You mentionned that you have proposed them before, but I couldn't find
> where.
> (I haven't dig very deep in trac or the archives to be honest ;-))
>
> 1) I'm wondering why you choose implement random reading by GetFeature() on
> SQL
> results, and not use directly the one that exists for the layers that map to
> the
> DB tables (AFAIK, GetFeature() is properly implemented in OGRPGTableLayer,
> OGRSQLiteTableLayer and OGRMySQLTableLayer) ?
>
> 2) I think your optimizations would need some rework to be always correct :
>
>    a) For SQLite, it would only work if the user has created the layer with
> a
> "SELECT * FROM a_table WHERE ....". If the user only request a few columns,
> the
> poFeature->SetField() would probably try to feed fields that don't exist in
> the
> layer. If the SQL result is a result of something more complex like a join,
> there is really no table to query a FID in... So the analysis of the initial
> request would require to be restricted to a few selected cases. But then I
> go
> back to 1)
>
>   b) For Postgres, your optimization violates the contract of GetFeature()
> which
> says "If this function returns a non-NULL feature, it is guaranteed that its
> feature id (OGR_F_GetFID()) will be the same as nFID.". But "FETCH ABSOLUTE
> nFeatureId" will return the nFeatureId th feature of the result of the
> request.
> I've observed that when you do a "SELECT * FROM a_table WHERE....",
> GetNextFeature() manages to preserve the FID from the table, so the
> optimization
> wouldn't clearly respect the contract (it would return the nFeatureId th
> element, instead of the feature whose FID = nFeatureId). For a request like
> "SELECT a_column FROM a_table WHERE...", GetNextFeature() returns the
> features
> with fids of 0,1,2,..., so the optimization would almost work, except that
> it
> would have a off-by-one error
>
>   c) For MySQL, I haven't yet analyzed due to lack of testing env.
>
> Best regards,
>
> Even
>
>