[Qgis-developer] QEP/RFC sqlite virtual tables

Matthias Kuhn matthias.kuhn at gmx.ch
Tue Oct 28 05:21:49 PDT 2014


Hi Hugo

On 28.10.2014 12:15, Hugo Mercier wrote:
> Hi Matthias,
>
> Le 28/10/2014 09:46, Matthias Kuhn a écrit :
>> Hi Hugo,
>>
>> Sorry for my slow response, I was quite busy the last days.
> With the 2.6 coming I know this is not the best time for this kind of
> discussion :( So thanks for your time.
>
>> I'll try to explain a bit more what I have in mind.
>>
>> For example there is the new expression function "getFeature()" which is
>> an ad-hoc replacement for a join. I have been told that the performance
>> for rendering is not very good, I assume this is caused by a lot of
>> requests performed in the background. It would be great to be able to
>> join this on the server side.
> I agree. This getFeature() is a symptom that we need more advanced
> relational-oriented functionalities.
>
>> I would be very happy to see more of this introduced. For example I
>> could well imagine a new Expression (Syntax just made up, that can still
>> be designed in a more appropriate way)
>>
>>   avg( c.population ) { LEFT JOIN cities c ON c.country_code =
>> $thistable.code }
>>
>> This should then be locally evaluated (for shapefiles et al) but
>> forwarded to the server for database services (postgres and the like).
> How would it be "forwarded" for database services ? If the two tables
> are from the same db, then ok, we are in the "simple optimization" case
> that I was talking about previously : more or less a translation between
> SQL dialects.
> But what if the current table is a local file (shapefile) and cities is
> a postgis table ?

In this case I think the sqlite virtual table approach (or any other
local filtering method) should be preferred. That would involve too much
black magic.
There may be some cases where we can mix things (e.g. two parts of a
where clause combined with "AND" where only for one of the two WHERE
clauses a native SQL translation can be found could be performed as a
two-step filter, but that's really not the main point)

> You would then need a magical optimization engine that will say you need
> to send the list of local country codes to a remote query like :
>
> SELECT avg(population)
> FROM cities
> WHERE country_code IN ( $local_country_code_array )
> GROUP BY country_code
>
> But actually, the best optimization depends in the general case on the
> actual data stored in tables. If the bandwith required to send the IN
> part to PostGIS is too important, there may be cases where it is better
> to fetch everything remotely and sort and group by locally or they may
> be cases where another query would be better.
>
> This is why writing an optimal SQL query is hard and why database
> planners do lots of work, rely on lots of heuristics and statistics  and
> are hard to implement.
>
>> Now, as we have different backends and their syntax does not always
>> match (most do standard sql, but there are extensions, think of ST_* in
>> postgis, SDO_* in oracle spatial etc, and there is no guarantee, that a
>> dataprovider supports standard SQL) we need an abstraction layer and
>> cannot just assume that every dataprovider will understand the same thing.
> Yes, exactly, and this was to whole point of relying on SQLite and
> Spatialite as an abstraction layer.
>
>> I am planning to implement this abstraction layer for ordinary
>> expressions in the near future. And I would be very happy to see that
>> there is a way to extend this with aggregate functions. This will all
>> require a fallback level (if the server does not support a given
>> functionality, for shapefiles, to be sure that the result is the very
>> same regardless of server function implementation details...). We
>> already have this for ordinary expressions as we can just use the
>> QgsExpression implementation.
>>
>> I can only repeat, that I will be more than happy to see this fallback
>> level implemented by means of sqlite virtual tables. But in my humble
>> opinion it would be very unfortunate if there would be no way to
>> optimize this tomorrow or the day after tomorrow. That's why I am asking
>> if sqlite developers are open to collaborate because if the will to
>> collaborate there is missing we will end up having to reimplement this
>> functionality.
>>
>> I totally agree with your statement "A general cross-database SQL
>> optimizer would need lots of work (in QGIS and/or SQLite) and may come
>> later (if really needed).", but would append "and will require either a
>> re-implementation in QGIS, ending up in duplicated functionality"
>> (that's what Régis was worried about in the original mail) "or will need
>> the possibility to get deeper access to sqlite's internal data
>> structures or callbacks to optimize where required. This will require
>> cooperation from SQLite developer's side or forking sqlite.".
>>
> I really don't think the difficulty is here. Even if we have access to
> the parsed SQL and even to the result of the SQLite planner, splitting
> the execution plan in optimal local and remote operations is *very hard*
> in the general case. And we would have to do it whatever the abstraction
> layer we choose : SQLite or our own QgsExpression-based solution.
> And I really don't think that would be a good idea to do this in QGIS.
>
> The optimization plan could be: don't try to optimize in the general
> case (premature optimization ...), only optimize specific
> well-identified cases.
> For now the only simple case I can see is when a join is done on tables
> from the same database (and the user don't know or can't do a join on
> the remote database
... or a developer wants to develop a provider-independent plugin/core
functionality that has a good cross-table performance.
> ), then yes we would have to know the syntax of the
> original query and translate it into the desired SQL dialect.
> And yes it would be better to have it from SQLite. It would require
> either to work with the developers, to patch it or to somehow include
> the parsing part of SQLite (we already ship sqlite / spatialite with
> QGIS right ?). But I really don't see it as a blocker.

I don't think we do ship it but I may be wrong?
Please don't get me wrong, I didn't say it's a blocker. The original
discussion started from Régis statement "there's a QEP taking care of
relational queries, is there a need to duplicate functionality?" and my
response "I am not sure if sqlite virtual tables will be able to satisfy
all our needs". I was just wondering about the limitations of this
approach and I am still very open to hear about any experiences you had
with sqlite developers (I have none) because that is what counts when it
comes to working with upstream as you proposed.

>
> I really think using SQLite as our database engine has a good potential.
> It could extend the abilities and expressivity of QGIS. And it could
> also allow to use *less* code in QGIS (seeing every provider as a
> virtual layer). And it could also paves the way for a native QGIS file
> format (this is another discussion, but somehow related).
>
> But whatever the implementation is, trying to be (automatically) as
> performant as a well-designed query on a well-designed db is a waste of
> energy.
Yes, we cannot do that. But we can try to find a way that allows
developers to develop and users to create projects independent of
providers while still providing good performance on a capable backend.

>
> On a more pragmatic side, people are already interested and might pay
> for db-oriented functionalities in QGIS (the very first need is to be
> able to filter joined tables). It does not have to be the only incentive
> for design choices, but this is a good opportunity. Then a decision has
> to be taken.
>
> So, if it is hard for you and me to agree :), are there other opinions ?
> Other arguments for one or the other side ?

I don't want to disagree, I just wanted to raise questions / to
understand the limitations. I see the demand for this functionality and
I see the potential that sqlite virtual tables have to offer. I just
wonder what the performance will be like in a scenario where there's a
network (with latency for every request) involved. And what it would
require to overcome this issue (if there is any).

Best regards,
Matthias



More information about the Qgis-developer mailing list