[postgis-devel] ST_Union Parallel Experiment

Darafei "Komяpa" Praliaskouski me at komzpa.net
Tue Mar 13 07:11:32 PDT 2018


Hi!

I've tried to glue back my big subdivided polygon and see a 21% improvement
with 3 workers on my laptop. I expect to get the result even faster if I run
it on a machine with more cores :)

Execution is also not easily cancellable while it is inside GEOS UnaryUnion.
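
For reference, a minimal sketch of how such a table can be built and glued
back (the source table big_polygon and the 128-vertex limit are illustrative;
big_polygon_subdivided matches the session below):

-- Illustrative setup: split one large polygon into small pieces with
-- ST_Subdivide, then union them back with parallel workers enabled.
CREATE TABLE big_polygon_subdivided AS
SELECT ST_Subdivide(geom, 128) AS geom  -- at most 128 vertices per piece
FROM big_polygon;                       -- hypothetical source table

SET max_parallel_workers_per_gather = 3;  -- the "3 workers" case above
SELECT ST_Area(ST_Union(geom)) FROM big_polygon_subdivided;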

I think the benchmark you ran here does not cover a common case: a big table
grouped by some attribute column. If the dataset is reasonably clustered, and
the number of workers is smaller than the number of groups, one can expect a
Parallel Seq Scan to bring all the rows for one group to the same worker most
of the time, so that the cascaded union is performed in the parallel worker
and the main worker just passes the result upwards. Cost adjustments can be
tricky for that, though.
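
For example (the table and column names are hypothetical, and the parallel
ST_Union aggregate from Paul's branch is assumed), the case I have in mind is
roughly:

-- Big table grouped by an attribute: with clustered data and more groups
-- than workers, a worker can often cascaded-union a whole group itself and
-- the leader mostly just forwards the per-group results.
SET max_parallel_workers_per_gather = 4;
SELECT admin_region, ST_Union(geom) AS geom
FROM land_parcels
GROUP BY admin_region;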

Another observation: the main worker spends much of its time (or at least my
poor man's profiler happened to catch it there often) building the STRTree
for the final geometry. According to
https://lin-ear-th-inking.blogspot.com.by/2007/11/fast-polygon-merging-in-jts-using.html
the STRTree can be replaced with another spatial clustering method, and we
already have gbox_get_sortable_hash as a quick-and-dirty option.
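
As a SQL-level illustration of the same idea (clustering by a cheap sort key
instead of building a tree), one could pre-order the pieces by a spatial hash
before unioning. ST_GeoHash of the centroid is only a stand-in here for
gbox_get_sortable_hash, and the transform to 4326 is an assumption about the
data's SRID:

-- Sketch: order the pieces by a quick-and-dirty spatial hash so that nearby
-- pieces tend to be merged together, approximating the STRTree clustering.
SELECT ST_Area(ST_Union(geom ORDER BY
         ST_GeoHash(ST_Centroid(ST_Transform(geom, 4326)), 8)))
FROM big_polygon_subdivided;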

Also, I get two different values for the union area in the two runs below
(parallel vs. single-worker), so there's a bug somewhere.

[local] gis at postgis_reg=# explain select ST_Area(ST_Union(geom)) from big_polygon_subdivided;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                QUERY PLAN                                                 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Finalize Aggregate  (cost=69572.58..69572.61 rows=1 width=8)                                              │
│   ->  Gather  (cost=69562.37..69562.58 rows=2 width=32)                                                   │
│         Workers Planned: 2                                                                                │
│         ->  Partial Aggregate  (cost=68562.37..68562.38 rows=1 width=32)                                  │
│               ->  Parallel Seq Scan on big_polygon_subdivided  (cost=0.00..1717.37 rows=26737 width=154)  │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(5 rows)

Time: 0,443 ms
[local] gis at postgis_reg=# select ST_Area(ST_Union(geom)) from big_polygon_subdivided;
┌──────────────────┐
│     st_area      │
├──────────────────┤
│ 26700397396.7594 │
└──────────────────┘
(1 row)

Time: 38183,094 ms (00:38,183)
[local] gis at postgis_reg=# set max_parallel_workers_per_gather = 0;
SET
Time: 0,216 ms
[local] gis at postgis_reg=# select ST_Area(ST_Union(geom)) from big_polygon_subdivided;
┌──────────────────┐
│     st_area      │
├──────────────────┤
│ 29321997075.9356 │
└──────────────────┘
(1 row)

Time: 46932,512 ms (00:46,933)


Tue, 13 Mar 2018 at 1:48, Paul Ramsey <pramsey at cleverelephant.ca>:

> Hey all,
> I just wanted to record for posterity the results of my experiment in
> making a parallel version of ST_Union().
>
> The basic theory was:
>
> * add serialfn/deserialfn/combinefn to the aggregate
> * in the serialfn, do an initial cascaded union of everything
> collected by the worker
> * in the combinefn, do pairwise union of each set of partials
>
> The obvious drawback is that, particularly for inputs that are a "coverage"
> (many polygons covering an area with no overlaps), the workers won't be
> fed a neat contiguous area, so the main promise of cascaded union, that it
> eliminates the maximum number of vertices possible at each step, is broken.
>
> In fact, that is more or less what I observed. The union was quite a bit
> slower, even though it was using up twice as much CPU (two-core laptop).
>
> (The debug messages are the parallel-only functions
> (serialfn/deserialfn/combinefn) being called in the parallel
> execution.)
>
> postgis25=# select st_area(st_union(geom)) from va_ply_17;
> DEBUG:  pgis_geometry_union_serialfn called
> DEBUG:  pgis_geometry_union_serialfn called
> DEBUG:  pgis_geometry_union_serialfn called
> DEBUG:  pgis_geometry_union_deserialfn called
> DEBUG:  pgis_geometry_union_deserialfn wkb size = 8526407
> DEBUG:  pgis_accum_combinefn called
> DEBUG:  pgis_geometry_union_deserialfn called
> DEBUG:  pgis_geometry_union_deserialfn wkb size = 4236637
> DEBUG:  pgis_accum_combinefn called
> DEBUG:  pgis_geometry_union_deserialfn called
> DEBUG:  pgis_geometry_union_deserialfn wkb size = 6526511
> DEBUG:  pgis_accum_combinefn called
>      st_area
> -----------------
>  1070123068374.1
> (1 row)
>
> Time: 106545.200 ms (01:46.545)
>
> Force the plan to be single-threaded, and run again.
>
> postgis25=# set max_parallel_workers_per_gather = 0;
> postgis25=# select st_area(st_union(geom)) from va_ply_17;
>      st_area
> ------------------
>  1070123068374.11
> (1 row)
>
> Time: 66527.914 ms (01:06.528)
>
> Damn, it's faster.
>
> It’s possible that if the partials were fed inputs in a spatially
> correlated order the final merge might be no worse than the usual
> top-level merge in a cascaded union. However, forcing an ordering in
> the aggregate strips out the parallel plans.
>
> postgis25=# set max_parallel_workers_per_gather = 2;
> postgis25=# explain select st_area(st_union(geom order by geom)) from va_ply_17;
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Aggregate  (cost=15860.58..15860.62 rows=1 width=8)
>    ->  Seq Scan on va_ply_17  (cost=0.00..1715.58 rows=5658 width=6181)
>
> If the order by trick worked, I'd hope that the parallel execution
> might win, but since it doesn't, it's best to just leave it "as is".
>
> The branch is available here for anyone interested in perusing it.
>
> https://github.com/pramsey/postgis/tree/svn-trunk-parallel-union
>
> ATB,
>
> P