[postgis-users] Forcibly parallelizing very expensive st_unaryunion in sequential scan on 20 row MultiLineString Table

Fri Nov 27 14:16:25 PST 2020

Workaround: 'pyodbc' or 'psycopg2'...

I am currently running a Hyper-V Ubuntu 20.04.1 VM instance on a dual 
Xeon E5-2680 v4 equipped workstation, 28C/56T total, and PostgreSQL set 
to 56 workers max.

I wrote a custom Python implementation using 
'concurrent.futures.ThreadPoolExecutor' that fully parallelizes INSERTS 
and UPDATES using Python multi-threading against geometry columns using 
PostGIS functions, and it both handles "minor" loads with heavy 
computation of a few dozen records, up to mega scale with the system 
going full throttle at 100% CPU usage, with pgAdmin showing all 56 
PostgreSQL workers active. I only recently acquired this system, and am 
still in a testing phase, but for INSERTs I already saw figures of 37,4k 
records/s or 135 M/h... UPDATEs involving some heavy computation on 
geometries showed in one case about 4,2 k records/s, so about 15M/h. Joy 
to see it run like that. OpenStreetMap data by the way.

This wasn't even with an unlogged table, just running from SSD.

All that said, if you only have a 4C/8T system, the starting and 
finishing overhead of parallelized processing, even with PostgreSQL's 
own options for that, may be significant and undo much of the benefit of 
it. It seems to be more beneficial on higher core count systems.

Marco

Op 27-11-2020 om 21:27 schreef Andrew Joseph:
>
> It appears it has nothing to do with TOAST or Postgis. Postgres simply 
> doesn’t seem to parallelize any operations on tables with few rows 
> regardless of function cost, as shown in the easily replicable example 
> below. Anyone know a workaround for this that doesn’t involve using 
> dblink?
>
> CREATE OR REPLACE FUNCTION very_expensive_operation(value anyelement, 
> sleep_time integer=2) RETURNS integer as $$
>
>     BEGIN
>
>         perform pg_sleep(sleep_time);
>
>         return sleep_time;
>
>     END;
>
> $$ LANGUAGE plpgsql immutable strict parallel safe cost 10000;
>
> CREATE UNLOGGED TABLE expensive_rows (
>
>     id  serial PRIMARY KEY,
>
>     value uuid
>
> ) WITH (parallel_workers = 8);
>
> INSERT INTO expensive_rows(value) select gen_random_uuid() identifier 
> from generate_series(1,16);
>
> EXPLAIN ANALYSE VERBOSE
>
>     SELECT
>
>         very_expensive_operation(value,2)
>
>     FROM
>
>       expensive_rows
>
> ;
>
> Gather  (cost=0.00..5312.12 rows=1700 width=4) (actual 
> time=2010.650..32042.558 rows=16 loops=1)
>
> "  Output: (very_expensive_operation(value, 2))"
>
>   Workers Planned: 8
>
>   Workers Launched: 7
>
>   ->  Parallel Seq Scan on public.expensive_rows  (cost=0.00..5312.12 
> rows=212 width=4) (actual time=286.078..4575.903 rows=2 loops=7)
>
> "        Output: very_expensive_operation(value, 2)"
>
>         Worker 0:  actual time=0.001..0.001 rows=0 loops=1
>
>         Worker 1:  actual time=0.001..0.001 rows=0 loops=1
>
>         Worker 2:  actual time=0.001..0.001 rows=0 loops=1
>
>         Worker 3:  actual time=2002.537..32031.311 rows=16 loops=1
>
>         Worker 4:  actual time=0.001..0.001 rows=0 loops=1
>
>         Worker 5:  actual time=0.002..0.002 rows=0 loops=1
>
>         Worker 6:  actual time=0.002..0.002 rows=0 loops=1
>
> Planning Time: 0.086 ms
>
> Execution Time: 32042.609 ms
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for 
> Windows 10
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20201127/01af33ba/attachment.html>