[Qgis-developer] QEP/RFC sqlite virtual tables

Hugo Mercier hugo.mercier at oslandia.com
Tue Oct 28 04:15:44 PDT 2014


Hi Matthias,

Le 28/10/2014 09:46, Matthias Kuhn a écrit :
> Hi Hugo,
> 
> Sorry for my slow response, I was quite busy the last days.

With the 2.6 coming I know this is not the best time for this kind of
discussion :( So thanks for your time.

> 
> I'll try to explain a bit more what I have in mind.
> 
> For example there is the new expression function "getFeature()" which is
> an ad-hoc replacement for a join. I have been told that the performance
> for rendering is not very good, I assume this is caused by a lot of
> requests performed in the background. It would be great to be able to
> join this on the server side.

I agree. This getFeature() is a symptom that we need more advanced
relational-oriented functionalities.

> 
> I would be very happy to see more of this introduced. For example I
> could well imagine a new Expression (Syntax just made up, that can still
> be designed in a more appropriate way)
> 
>   avg( c.population ) { LEFT JOIN cities c ON c.country_code =
> $thistable.code }
> 
> This should then be locally evaluated (for shapefiles et al) but
> forwarded to the server for database services (postgres and the like).

How would it be "forwarded" for database services ? If the two tables
are from the same db, then ok, we are in the "simple optimization" case
that I was talking about previously : more or less a translation between
SQL dialects.
But what if the current table is a local file (shapefile) and cities is
a postgis table ?
You would then need a magical optimization engine that will say you need
to send the list of local country codes to a remote query like :

SELECT avg(population)
FROM cities
WHERE country_code IN ( $local_country_code_array )
GROUP BY country_code

But actually, the best optimization depends in the general case on the
actual data stored in tables. If the bandwith required to send the IN
part to PostGIS is too important, there may be cases where it is better
to fetch everything remotely and sort and group by locally or they may
be cases where another query would be better.

This is why writing an optimal SQL query is hard and why database
planners do lots of work, rely on lots of heuristics and statistics  and
are hard to implement.

> 
> Now, as we have different backends and their syntax does not always
> match (most do standard sql, but there are extensions, think of ST_* in
> postgis, SDO_* in oracle spatial etc, and there is no guarantee, that a
> dataprovider supports standard SQL) we need an abstraction layer and
> cannot just assume that every dataprovider will understand the same thing.

Yes, exactly, and this was to whole point of relying on SQLite and
Spatialite as an abstraction layer.

> 
> I am planning to implement this abstraction layer for ordinary
> expressions in the near future. And I would be very happy to see that
> there is a way to extend this with aggregate functions. This will all
> require a fallback level (if the server does not support a given
> functionality, for shapefiles, to be sure that the result is the very
> same regardless of server function implementation details...). We
> already have this for ordinary expressions as we can just use the
> QgsExpression implementation.
> 
> I can only repeat, that I will be more than happy to see this fallback
> level implemented by means of sqlite virtual tables. But in my humble
> opinion it would be very unfortunate if there would be no way to
> optimize this tomorrow or the day after tomorrow. That's why I am asking
> if sqlite developers are open to collaborate because if the will to
> collaborate there is missing we will end up having to reimplement this
> functionality.
> 
> I totally agree with your statement "A general cross-database SQL
> optimizer would need lots of work (in QGIS and/or SQLite) and may come
> later (if really needed).", but would append "and will require either a
> re-implementation in QGIS, ending up in duplicated functionality"
> (that's what Régis was worried about in the original mail) "or will need
> the possibility to get deeper access to sqlite's internal data
> structures or callbacks to optimize where required. This will require
> cooperation from SQLite developer's side or forking sqlite.".
> 

I really don't think the difficulty is here. Even if we have access to
the parsed SQL and even to the result of the SQLite planner, splitting
the execution plan in optimal local and remote operations is *very hard*
in the general case. And we would have to do it whatever the abstraction
layer we choose : SQLite or our own QgsExpression-based solution.
And I really don't think that would be a good idea to do this in QGIS.

The optimization plan could be: don't try to optimize in the general
case (premature optimization ...), only optimize specific
well-identified cases.
For now the only simple case I can see is when a join is done on tables
from the same database (and the user don't know or can't do a join on
the remote database), then yes we would have to know the syntax of the
original query and translate it into the desired SQL dialect.
And yes it would be better to have it from SQLite. It would require
either to work with the developers, to patch it or to somehow include
the parsing part of SQLite (we already ship sqlite / spatialite with
QGIS right ?). But I really don't see it as a blocker.

I really think using SQLite as our database engine has a good potential.
It could extend the abilities and expressivity of QGIS. And it could
also allow to use *less* code in QGIS (seeing every provider as a
virtual layer). And it could also paves the way for a native QGIS file
format (this is another discussion, but somehow related).

But whatever the implementation is, trying to be (automatically) as
performant as a well-designed query on a well-designed db is a waste of
energy.

On a more pragmatic side, people are already interested and might pay
for db-oriented functionalities in QGIS (the very first need is to be
able to filter joined tables). It does not have to be the only incentive
for design choices, but this is a good opportunity. Then a decision has
to be taken.

So, if it is hard for you and me to agree :), are there other opinions ?
Other arguments for one or the other side ?



More information about the Qgis-developer mailing list