[Qgis-developer] Table join branch

Mon Aug 2 06:32:26 EDT 2010

Hi Martin

Thanks for your feedback. The idea of the joining branch was indeed to collect 
feedback on the list in an early  stage (certainly before implementing any 
cache).

>Just to be sure with the table joins, do we refer only to left outer
>joins on one table where the join predicate checks for equality of
>attributes? I.e. something like this:
>SELECT * FROM layer LEFT OUTER JOIN table1 ON layer.a = table1.x [
>LEFT OUTER JOIN table2 ON layer.b = table2.y ... ]
>Or could there be other scenarios like other join types (inner, right
>outer), more complex join predicates (possibly involving geometries)? 

Left outer join is the most important one. Spatial joins are commonly used 
too. 
Right outer and inner joins seem a bit exotic to me in a desktop GIS.

>If we can assume that the joined table won't be modified (much), we
>could build a hash table for each joined table: key = join attribute,
>value = vector of joined attributes. So, later when the layer is being
>iterated, for each feature we would only have to look up the attribute
>in hash table and copy the joined attributes. (Additionally, if we
>wanted to save memory, the hash table values could be only feature ID
>in the joined table - the joined attributes would be loaded at once
>when starting iteration over layer and deleted again when the
>iteration over layer has finished).
>
>This way I hope we could achieve good performance when fetching
>features and to keep the design simple (and with no need for external
>attribute indexes).

I agree a hashtable is the good solution for most cases. 
But what to do in the few cases where the joined attributes don't fit into 
memory at a time? 
Without a memory cache, there is a number of things that can be done efficiently 
even on very large datasets (at least with attribute index):
- Usage of identify tool
- Looking at the attribute table
- Symbology classification on joined attribute on small spatial subset
- Saving joins to disk is much faster compared to the current possibility in 
the vector tools

Storing only the ids in the hashtable would be my preferred solution. Still, 
if we load all the joined attributes 'at once' before iterating, memory might 
be an issue. Not loading at once and calling 'featureAtId' for each feature 
might decrease performance. And yes, there could be cases where the assumption 
about the joined table not changing much is not appropriate (e.g. GIS for real 
time sensor data).

A possible solution would be to provide both 'hashtable in memory' mode and 
'no memory cache' mode. In the join dialog, the user could select what to use 
(with hashtable in memory as default preselected). That way, user cannot 
complain about wrong values if he uses a datasource that changes frequently or 
if the datasource is very large.

What do you think?

Regards,
Marco

Am Donnerstag, 29. Juli 2010, um 15.20:06 schrieb Martin Dobias:
> Hi Marco
> 
> finally I've found time to look at the code...
> 
> On Mon, Jul 19, 2010 at 9:10 AM, Marco Hugentobler
> 
> <marco.hugentobler at sourcepole.ch> wrote:
> > Internally, QgsVectorLayer stores the join info (fields and join table
> > provider key) and remaps the field ids (first the provider ids, then the
> > joined field ids and last the added field ids). When querying the joined
> > field values, a subset string is set to the table provider to query a
> > feature with the value of the join field.
> > 
> > As you might expect, there are still a number of issues:
> > - Joined fields cannot be edited (should they?). So attribute table and
> > info- tool should disable those rows. QgsVectorLayer probably needs a
> > method to tell attribute table and info tool which rows are not
> > editable.
> > - Performance is an issue. The current implementation tries to improve
> > performance by automatically creating attribute indices in the ogr
> > provider (could be implemented in other providers too). Still there is a
> > siginificant performance penalty when doing classifications based on
> > joined attributes or attributes searches (info tool and opening
> > attribute table are usually fast, because not many features are queried
> > at once).
> > One possibility to enhance this could be to load tables into memory
> > providers and implementing attribute index capability there. Are there
> > other possibilities?
> 
> Just to be sure with the table joins, do we refer only to left outer
> joins on one table where the join predicate checks for equality of
> attributes? I.e. something like this:
> SELECT * FROM layer LEFT OUTER JOIN table1 ON layer.a = table1.x [
> LEFT OUTER JOIN table2 ON layer.b = table2.y ... ]
> Or could there be other scenarios like other join types (inner, right
> outer), more complex join predicates (possibly involving geometries)?
> 
> (I will assume only left outer join from now on)
> 
> In my opinion, the current implementation is not suitable for
> production use. In general case, the joining needs linear time for
> fetching attributes for each feature. Using the attribute indexes can
> get this typically to logarithmic time, but still there's need that
> user creates the indexes and provider supports that (probably not many
> ogr drivers have this functionality, postgres provider might not have
> sufficient privileges etc). I think we should aim to find joined
> table's row for each feature in constant time, otherwise the
> performance will be still an issue.
> 
> If we can assume that the joined table won't be modified (much), we
> could build a hash table for each joined table: key = join attribute,
> value = vector of joined attributes. So, later when the layer is being
> iterated, for each feature we would only have to look up the attribute
> in hash table and copy the joined attributes. (Additionally, if we
> wanted to save memory, the hash table values could be only feature ID
> in the joined table - the joined attributes would be loaded at once
> when starting iteration over layer and deleted again when the
> iteration over layer has finished).
> 
> This way I hope we could achieve good performance when fetching
> features and to keep the design simple (and with no need for external
> attribute indexes).
> 
> I will not post comments on any particular source code issues, as it
> is probably too early for that and the focus should stay on the design
> level.
> 
> Regards
> Martin

-- 
Dr. Marco Hugentobler
Sourcepole -  Linux & Open Source Solutions
Webereistrasse 66, 8134 Adliswil, Switzerland
marco.hugentobler at sourcepole.ch http://www.sourcepole.ch
Technical Advisor QGIS Project Steering Committee