[Qgis-developer] Table join branch
Marco Hugentobler
marco.hugentobler at sourcepole.ch
Mon Aug 2 06:32:26 EDT 2010
Hi Martin
Thanks for your feedback. The idea of the joining branch was indeed to collect
feedback on the list in an early stage (certainly before implementing any
cache).
>Just to be sure with the table joins, do we refer only to left outer
>joins on one table where the join predicate checks for equality of
>attributes? I.e. something like this:
>SELECT * FROM layer LEFT OUTER JOIN table1 ON layer.a = table1.x [
>LEFT OUTER JOIN table2 ON layer.b = table2.y ... ]
>Or could there be other scenarios like other join types (inner, right
>outer), more complex join predicates (possibly involving geometries)?
Left outer join is the most important one. Spatial joins are commonly used
too.
Right outer and inner joins seem a bit exotic to me in a desktop GIS.
>If we can assume that the joined table won't be modified (much), we
>could build a hash table for each joined table: key = join attribute,
>value = vector of joined attributes. So, later when the layer is being
>iterated, for each feature we would only have to look up the attribute
>in hash table and copy the joined attributes. (Additionally, if we
>wanted to save memory, the hash table values could be only feature ID
>in the joined table - the joined attributes would be loaded at once
>when starting iteration over layer and deleted again when the
>iteration over layer has finished).
>
>This way I hope we could achieve good performance when fetching
>features and to keep the design simple (and with no need for external
>attribute indexes).
I agree a hashtable is the good solution for most cases.
But what to do in the few cases where the joined attributes don't fit into
memory at a time?
Without a memory cache, there is a number of things that can be done efficiently
even on very large datasets (at least with attribute index):
- Usage of identify tool
- Looking at the attribute table
- Symbology classification on joined attribute on small spatial subset
- Saving joins to disk is much faster compared to the current possibility in
the vector tools
Storing only the ids in the hashtable would be my preferred solution. Still,
if we load all the joined attributes 'at once' before iterating, memory might
be an issue. Not loading at once and calling 'featureAtId' for each feature
might decrease performance. And yes, there could be cases where the assumption
about the joined table not changing much is not appropriate (e.g. GIS for real
time sensor data).
A possible solution would be to provide both 'hashtable in memory' mode and
'no memory cache' mode. In the join dialog, the user could select what to use
(with hashtable in memory as default preselected). That way, user cannot
complain about wrong values if he uses a datasource that changes frequently or
if the datasource is very large.
What do you think?
Regards,
Marco
Am Donnerstag, 29. Juli 2010, um 15.20:06 schrieb Martin Dobias:
> Hi Marco
>
> finally I've found time to look at the code...
>
> On Mon, Jul 19, 2010 at 9:10 AM, Marco Hugentobler
>
> <marco.hugentobler at sourcepole.ch> wrote:
> > Internally, QgsVectorLayer stores the join info (fields and join table
> > provider key) and remaps the field ids (first the provider ids, then the
> > joined field ids and last the added field ids). When querying the joined
> > field values, a subset string is set to the table provider to query a
> > feature with the value of the join field.
> >
> > As you might expect, there are still a number of issues:
> > - Joined fields cannot be edited (should they?). So attribute table and
> > info- tool should disable those rows. QgsVectorLayer probably needs a
> > method to tell attribute table and info tool which rows are not
> > editable.
> > - Performance is an issue. The current implementation tries to improve
> > performance by automatically creating attribute indices in the ogr
> > provider (could be implemented in other providers too). Still there is a
> > siginificant performance penalty when doing classifications based on
> > joined attributes or attributes searches (info tool and opening
> > attribute table are usually fast, because not many features are queried
> > at once).
> > One possibility to enhance this could be to load tables into memory
> > providers and implementing attribute index capability there. Are there
> > other possibilities?
>
> Just to be sure with the table joins, do we refer only to left outer
> joins on one table where the join predicate checks for equality of
> attributes? I.e. something like this:
> SELECT * FROM layer LEFT OUTER JOIN table1 ON layer.a = table1.x [
> LEFT OUTER JOIN table2 ON layer.b = table2.y ... ]
> Or could there be other scenarios like other join types (inner, right
> outer), more complex join predicates (possibly involving geometries)?
>
> (I will assume only left outer join from now on)
>
> In my opinion, the current implementation is not suitable for
> production use. In general case, the joining needs linear time for
> fetching attributes for each feature. Using the attribute indexes can
> get this typically to logarithmic time, but still there's need that
> user creates the indexes and provider supports that (probably not many
> ogr drivers have this functionality, postgres provider might not have
> sufficient privileges etc). I think we should aim to find joined
> table's row for each feature in constant time, otherwise the
> performance will be still an issue.
>
> If we can assume that the joined table won't be modified (much), we
> could build a hash table for each joined table: key = join attribute,
> value = vector of joined attributes. So, later when the layer is being
> iterated, for each feature we would only have to look up the attribute
> in hash table and copy the joined attributes. (Additionally, if we
> wanted to save memory, the hash table values could be only feature ID
> in the joined table - the joined attributes would be loaded at once
> when starting iteration over layer and deleted again when the
> iteration over layer has finished).
>
> This way I hope we could achieve good performance when fetching
> features and to keep the design simple (and with no need for external
> attribute indexes).
>
> I will not post comments on any particular source code issues, as it
> is probably too early for that and the focus should stay on the design
> level.
>
> Regards
> Martin
--
Dr. Marco Hugentobler
Sourcepole - Linux & Open Source Solutions
Webereistrasse 66, 8134 Adliswil, Switzerland
marco.hugentobler at sourcepole.ch http://www.sourcepole.ch
Technical Advisor QGIS Project Steering Committee
More information about the Qgis-developer
mailing list