[postgis-users] How does PostGIS / PostgreSQL distribute parallel work to cores for GIS type data?
Marco Boeringa
marco at boeringa.demon.nl
Fri Feb 28 14:06:33 PST 2020
Hi all,
With some of the recent discussions related to PostgreSQL 12.x / PostGIS
3.x query execution performance and parallel query processing, I have
been wondering how PostGIS / PostgreSQL actually distribute parallel
work for GIS type data? Is there any form of load balancing to different
cores?
GIS type data fundamentally differs from "traditional" data with simple
text or numeric fields in one crucial aspect: the variation in size
between geometries is vastly greater than for any other basic field
type in a database.
E.g. polygon geometries may vary from very simple 3-vertex triangles
to vast country boundary polygons with over 1M vertices. This has a
*huge* impact on processing time per record as well: with geometries
over 100k vertices, you quickly end up with processing times - per
record! - that may extend into minutes or even hours, and in bad cases
a day for a single geometry.
This means that naively distributing parallel work - e.g. splitting
100M records into 10M-record chunks for 10 parallel workers, each
worker simply taking the next available chunk until all records have
been assigned to a core / parallel worker - can end up very badly in
terms of load balancing.
Let's assume for example that the table indeed has multiple records
with geometries over 1M vertices in size, and that all of them happen
to fall in the first 10M-record chunk, whereas the rest of the records
have vastly smaller geometries. A single parallel worker could then
end up processing the bulk of the data in terms of sheer vertex count,
and might well be chewing on it for days, whereas the other workers,
perhaps handling only 10% of the total vertex complexity in the table,
finish in a few hours.
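To make the skew concrete, here is a small synthetic sketch (the vertex
counts, record counts and worker count are invented for illustration)
of what contiguous chunking can do when the heavy geometries cluster at
the start of the table:

```python
# Synthetic illustration of naive contiguous chunking: all "heavy"
# geometries happen to sit in the first chunk of the table.
vertex_counts = [1_000_000] * 5 + [100] * 95  # invented distribution
workers = 10
chunk = len(vertex_counts) // workers  # 10 records per worker

# Total vertex load per worker under contiguous chunk assignment.
loads = [sum(vertex_counts[i * chunk:(i + 1) * chunk])
         for i in range(workers)]

print(loads[0])        # worker 0 gets 5,000,500 vertices
print(max(loads[1:]))  # every other worker gets only 1,000
```

Worker 0 carries roughly 5000x the vertex load of the others, which is
exactly the days-versus-hours disparity described above.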
I actually developed a Python multi-threaded generalization option
myself that circumvents such issues: it uses the geometry's vertex
count, as returned by the PostGIS 'ST_NPoints' function, to distribute
records evenly across all parallel threads. By sorting the records by
vertex count and then interleaving them, from the highest to the
lowest count, across all parallel threads, every thread ends up
processing approximately the same number of vertices. This leads to
pretty nice load balancing, with all threads finishing in
approximately the same timespan and all cores of my server maxed out.
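A minimal sketch of that sort-and-interleave scheme (the record data
here is synthetic; in the real pipeline the counts would come from a
query like SELECT id, ST_NPoints(geom) FROM ..., and the exact
implementation may differ from the author's):

```python
def balance(records, n_threads):
    """Round-robin assignment of (record_id, vertex_count) tuples,
    sorted by vertex count descending, so every thread receives
    roughly the same total vertex load."""
    ordered = sorted(records, key=lambda r: r[1], reverse=True)
    buckets = [[] for _ in range(n_threads)]
    for i, rec in enumerate(ordered):
        buckets[i % n_threads].append(rec)  # interleave high-to-low
    return buckets

# Synthetic example: two huge polygons among many small ones.
counts = [1_200_000, 900_000, 800, 750, 600, 500, 40, 30, 20, 10]
records = list(enumerate(counts))
buckets = balance(records, 2)
loads = [sum(c for _, c in b) for b in buckets]
```

Each bucket gets the same number of records, and the heaviest
geometries are split between the threads rather than piling onto one.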
I realize PostGIS depends on PostgreSQL for parallel query execution,
and that something like the scheme described above may not be directly
feasible in a default PostgreSQL / PostGIS installation. Nonetheless,
I have been wondering how the current mechanism for parallel query
with GIS data is implemented, and whether there is any load balancing
based on geometry complexity.
Marco Boeringa