[Qgis-developer] Aggregates within expression engine

Wed Mar 16 00:58:08 PDT 2016

On Wed, Mar 16, 2016 at 6:17 AM, Nyall Dawson <nyall.dawson at gmail.com> wrote:

> I'm also torn regarding the best syntax to use for aggregates within
> expressions. I'm unsure if the traditional SQL "group by" clauses
> would be a good fit within the existing QGIS expression syntax (eg
> "sum("some_field") group by "some_other_field"). To me it doesn't fit
> with the existing functional approach that the expressions take. But
> on the other hand, trying to implement this as functions would result
> in some very clumsy expressions: "aggregate('sum', "some_field",
> "some_other_field")" or "sum("some_field", "some_other_field") ". Has
> anyone got any other ideas for syntax which would be a good fit?
>

 In R (stats programming language/package) these things are done with
functions, so you can get expressions like:

   means = summarize(group_by(filter(data,"Outcome"=="fatal"),"Disease"),
                     mean=mean("Age"))

where "Disease" is the name of a column in the "data" dataset. This
expression would return a vector (like a Python "list") of the mean
ages for people who died with each disease. Now, once you nest a few
more function calls in these expressions they get messy, so people
implemented a new operator to write them as a pipeline. The above can
be written:

  means = data %>% filter("Outcome"=="fatal") %>% group_by("Disease") %>%
    summarise(mean=mean("Age"))

which reads much more naturally from left to right than trying to
unwrap nested function calls. (The operator looks ugly because
user-defined R operators have to be wrapped in % signs.) So if I
wanted to know the range of the mean ages by disease I'd just add:

   %>% range()

on the end of the above expression, rather than sticking `range(` at
the start and then going to the end and putting the closing `)` next
to, quite probably, a bunch of uncountable closing brackets.
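
For anyone who doesn't read R, the nested form above can be roughly
sketched in Python. This is only an illustration: the sample records and
the helper names `filter_rows`, `group_by` and `summarise_mean` are made
up here, and are not dplyr or QGIS functions.

```python
from collections import defaultdict
from statistics import mean

# Made-up records standing in for the "data" dataset.
data = [
    {"Disease": "flu", "Outcome": "fatal", "Age": 80},
    {"Disease": "flu", "Outcome": "fatal", "Age": 70},
    {"Disease": "flu", "Outcome": "recovered", "Age": 30},
    {"Disease": "measles", "Outcome": "fatal", "Age": 5},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, field):
    """Split rows into a dict of lists keyed by the value of `field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    return dict(groups)


def summarise_mean(groups, field):
    """Return the mean of `field` within each group."""
    return {key: mean(row[field] for row in rows)
            for key, rows in groups.items()}


# Nested calls read inside-out, like the first R expression.
means = summarise_mean(
    group_by(filter_rows(data, lambda r: r["Outcome"] == "fatal"), "Disease"),
    "Age",
)
print(means)
```

Adding one more step (say, the range of those means) means wrapping yet
another call around the outside, which is the readability problem the
pipe operator is meant to address.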

 The operator works quite simply by rewriting the expression `foo %>%
bar(x,y,z)` as `bar(foo,x,y,z)`.
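
The same "left-hand side becomes the first argument" rule can be mimicked
in Python as a rough analogy. Note the assumptions: the `Pipeable` wrapper
and the `bar` and `keep_above` helpers are hypothetical names invented for
this sketch, `|` stands in for `%>%`, and Python evaluates the call at
runtime rather than rewriting the expression as R does.

```python
class Pipeable:
    """Wrap a call so that `value | Pipeable(f, x, y)` runs f(value, x, y)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __ror__(self, value):
        # Python calls this for `value | self` when the left operand's
        # own type does not handle `|` with a Pipeable.
        return self.func(value, *self.args, **self.kwargs)


def bar(first, x, y, z):
    """Hypothetical function that just reports its arguments."""
    return (first, x, y, z)


def keep_above(values, threshold):
    """Hypothetical filter step: keep values greater than threshold."""
    return [v for v in values if v > threshold]


foo = "foo"
piped = foo | Pipeable(bar, 1, 2, 3)   # same as bar(foo, 1, 2, 3)
direct = bar(foo, 1, 2, 3)

# Steps chain left to right; each result feeds the next call.
ages = [5, 70, 80]
span = (ages
        | Pipeable(keep_above, 10)
        | Pipeable(lambda values: (min(values), max(values))))
```

Appending another step is just one more `| Pipeable(...)` at the end of
the chain, with no closing brackets to go back and balance.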

 This pipe operator has divided R users a bit, and it has been
overused and abused greatly. It's also slower than direct function
evaluation because of the rearrangement of the expression, but it's
unlikely that any pipeline chain will spend much of its time doing
that compared to doing the actual work of the pipeline.

 Anyway, if something like that might be useful in QGIS, take a look
at the documentation for R's `dplyr` package, or tutorials with it.

Barry