[postgis-devel] ST_Union Parallel Experiment

Tue Mar 13 09:09:43 PDT 2018

Not marking a function as PARALLEL SAFE simply disables parallelism for
the whole query; that was the main reason for marking things PARALLEL SAFE.
Even without Paul's modification you can still benefit from parallelism,
for example in a query like

SELECT class, ST_Union(ST_Buffer(geom, 1)) from sometable group by class;

Here the buffers are calculated in parallel, but the union is then
performed over the whole table in the main worker.
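
A hedged way to check where each step runs is to look at the plan. This is
only a sketch: sometable, class and geom are the placeholder names from the
query above, and the exact plan shape depends on the PostgreSQL version,
table size and cost settings.

```sql
-- Nodes below "Gather" run in parallel workers; nodes above it run in
-- the main (leader) process. Without a combinefunc, the aggregate node
-- is expected to sit above Gather.
EXPLAIN (COSTS OFF)
SELECT class, ST_Union(ST_Buffer(geom, 1))
FROM sometable
GROUP BY class;
```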

Here's a schematic of a parallel aggregate:
 [image: image.png]
 - every aggregate must have an sfunc :)
 - if your transition (state) type does not match your output type, you
add a finalfunc;
 - if you define a combinefunc, Postgres gains the option of running
several workers; without it, it has no idea how to combine two outputs from
different workers;
 - serialfunc/deserialfunc default to the in/out functions of the datatype.
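
To make the pieces above concrete, here is a minimal sketch of a
parallel-capable aggregate written in plain SQL. It is not the ST_Union
definition: every avg_sketch_* name is hypothetical, and it just computes
an average so that the state type (float8[]) differs from the output type
(float8). No serialfunc/deserialfunc is given, since the state is an
ordinary array rather than an internal type.

```sql
-- sfunc: fold one input row into the state {sum, count}.
-- STRICT plus an initcond means NULL inputs are skipped.
CREATE FUNCTION avg_sketch_sfunc(state float8[], val float8)
RETURNS float8[] LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$$ SELECT ARRAY[state[1] + val, state[2] + 1] $$;

-- combinefunc: merge two partial states coming from different workers.
CREATE FUNCTION avg_sketch_combine(a float8[], b float8[])
RETURNS float8[] LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$$ SELECT ARRAY[a[1] + b[1], a[2] + b[2]] $$;

-- finalfunc: turn the state into the user-visible result.
CREATE FUNCTION avg_sketch_final(state float8[])
RETURNS float8 LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
$$ SELECT CASE WHEN state[2] = 0 THEN NULL
               ELSE state[1] / state[2] END $$;

CREATE AGGREGATE avg_sketch(float8) (
    sfunc       = avg_sketch_sfunc,
    stype       = float8[],
    initcond    = '{0,0}',
    combinefunc = avg_sketch_combine,
    finalfunc   = avg_sketch_final,
    parallel    = safe
);

-- Usage: sum is 55 over 10 rows, so this returns 5.5.
SELECT avg_sketch(x::float8) FROM generate_series(1, 10) AS g(x);
```

Dropping the combinefunc line still gives a working aggregate, but the
planner can then only run it in the main process.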

Schematically, the serialfunc in each worker looks almost like a finalfunc,
except that its output goes to another worker rather than to the user.

Work is distributed simply by handing ranges of relation pages to each
worker. Workers don't pass parts of their work to one another.

Each piece of this machinery has its cost, so you'd want it to start only
when you're sure it will get you results faster than a serial plan. That's
exactly why it's so hard to make parallelism kick in on small tables, and
why the results on them are not that good :)
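
For experiments on small tables, the planner settings that gate
parallelism can be lowered for one transaction. A sketch only: the values
are illustrative rather than recommendations, and sometable, class and
geom are the placeholder names used earlier in this message.

```sql
BEGIN;
-- Make parallel plans look cheap so the planner is willing to pick them.
SET LOCAL max_parallel_workers_per_gather = 4;
SET LOCAL parallel_setup_cost = 0;
SET LOCAL parallel_tuple_cost = 0;
-- PostgreSQL 10+; on 9.6 the setting is min_parallel_relation_size.
SET LOCAL min_parallel_table_scan_size = 0;

-- ANALYZE actually runs the query and reports per-node timings.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT class, ST_Union(ST_Buffer(geom, 1))
FROM sometable
GROUP BY class;
ROLLBACK;  -- SET LOCAL values end with the transaction
```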

The whole part in the main worker becomes almost a no-op if all the rows
for a GROUP BY key land in a single worker's part of the relation. The
combinefunc is then not called, and the finalfunc skips the GEOS call
because its array holds just one element.

On Tue, 13 Mar 2018 at 18:15, Regina Obe <lr at pcorp.us> wrote:

> > I think the benchmark you do here does not cover a common case of big
> > table grouped by some attributive column. If the dataset is reasonably
> > clustered, and number of threads is smaller than number of groups, one
> > can expect a Parallel Seq Scan to bring all the rows for one group most
> > of the time, so that Cascaded Union is performed in parallel worker and
> > then main worker is just passing the result upwards. Costs adjustments
> > can be tricky for that though.
>
> I may have misunderstood how this works, but in the case you describe I
> thought the ST_Union would happen after data is partitioned to each worker
> node so the union step wouldn't be parallelized but would occur in each
> worker so would run in parallel for each set of groups.
>
> Since ST_Union is marked as safe, wouldn't it be already taking advantage
> of this?
>
>
>
> To be honest I've never tested this out but that was the main impetus for
> marking ST_Union parallel safe to allow the ST_Union to still happen in a
> worker node.
>
>
>
> Thanks,
>
> Regina
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 91453 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20180313/45c3a9f2/attachment-0001.png>