[postgis-users] How does PostGIS / PostgreSQL distribute parallel work to cores for GIS type data?

Marco Boeringa marco at boeringa.demon.nl
Wed Mar 4 05:49:16 PST 2020


Hi Paul and Darafei,

Thanks for the insightful answers.

I may attempt a comparative test between my multi-threaded 
implementation, which uses vertex-count load balancing between threads, 
and PostgreSQL's own parallel query. It will require some custom 
development, though, so I don't know when I will be able to show results.
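To illustrate what vertex-count load balancing between threads could look like, here is a minimal sketch. This is not my actual implementation; the greedy heuristic (largest records first, each assigned to the currently lightest batch) and the (record_id, vertex_count) input shape are illustrative assumptions.

```python
# Hypothetical sketch of vertex-count load balancing: assign each record
# to the thread batch with the smallest cumulative vertex count so far.
# The (record_id, vertex_count) pairs are illustrative, not the actual
# data structures of the implementation discussed in this thread.
import heapq

def balance_by_vertices(records, num_threads):
    """Distribute (record_id, vertex_count) pairs over num_threads
    batches so each batch carries a roughly equal total vertex count."""
    # Min-heap of (cumulative_vertex_count, batch_index).
    heap = [(0, i) for i in range(num_threads)]
    batches = [[] for _ in range(num_threads)]
    # Handing out the largest records first gives the greedy
    # heuristic its best shot at an even split.
    for rec_id, vertices in sorted(records, key=lambda r: -r[1]):
        total, idx = heapq.heappop(heap)
        batches[idx].append(rec_id)
        heapq.heappush(heap, (total + vertices, idx))
    return batches
```

Each batch would then be handed to its own worker thread and database connection.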

Anyway, there is a secondary reason I implemented this.

As part of the multi-stage generalization process, a custom 
non-parallel-safe function is called. This blocks the use of default 
PostgreSQL / PostGIS parallel query. By using the multi-threaded 
approach with a record-by-record UPDATE process, I am able to 
parallelize this anyway, overcoming the limitation of the 
non-parallel-safe function. An added benefit of this approach is that I 
can display real-time progress of the query execution as a 0-100% 
progress bar in my application, using a Python thread-safe counter. This 
is really nice to have for queries that may run for hours or days on 
100M+ records. Hence I will keep using my implementation for this work, 
even if PostgreSQL / PostGIS could do it equally efficiently (which 
still needs to be tested in my particular case).
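For readers curious about the thread-safe counter idea, a minimal sketch might look like the following. This is not the actual application code; the ProgressCounter class and the update_record callback standing in for the per-record UPDATE statement are illustrative assumptions.

```python
# Illustrative sketch of a thread-safe progress counter: worker threads
# process one record at a time and bump a shared counter, which a UI
# thread can poll to render a 0-100% progress bar. The update_record
# callable stands in for the per-record UPDATE that invokes the
# non-parallel-safe function; it is a placeholder, not real code from
# the implementation discussed in this thread.
import threading

class ProgressCounter:
    def __init__(self, total):
        self._lock = threading.Lock()
        self._done = 0
        self.total = total

    def increment(self):
        with self._lock:
            self._done += 1

    def percent(self):
        with self._lock:
            return 100.0 * self._done / self.total

def worker(record_ids, update_record, counter):
    # Each completed per-record UPDATE bumps the shared counter.
    for rec_id in record_ids:
        update_record(rec_id)
        counter.increment()
```

A UI thread would periodically call counter.percent() to refresh the progress bar while the worker threads run.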

Quote:

"One nuance I’m not 100% sure of is if the master hands out records to 
workers in batches of num_records, or batches of num_pages. If the 
latter, then the scheme would be very much like you propose anyways, 
since large records would take up more space on a page, and data volume 
would determine distribution, not record volume."

It would indeed be interesting to know about the "num_records versus 
num_pages" assignment. Based on my experience up to now, I have the 
feeling it is num_records, but if you find out otherwise, let us know on 
the list.

Marco Boeringa

Op 2-3-2020 om 18:08 schreef Paul Ramsey:
>
>> On Mar 1, 2020, at 3:36 AM, Marco Boeringa <marco at boeringa.demon.nl> wrote:
>>
>> Although it is hard to give figures here, because I do not have a fully equivalent non-multi threaded processing flow, I do see significant benefits from distributing records based on vertex complexity.
> Yes and no. The executor doesn’t say “I have N records and C cores, so every core gets N/C records”.
> It says “I still have records, here Core 1, have 10K”. “I still have records, here Core 2, have 10K”, and so on. The chunks are generally smaller than N/C, so the net effect over a large table is that all the cores stay busy most of the time.
> In theory, carefully optimizing by handing out records based on vertices is a thing; in practice, it’s not a big deal.
>    One nuance I’m not 100% sure of is if the master hands out records to workers in batches of num_records, or batches of num_pages. If the latter, then the scheme would be very much like you propose anyways, since large records would take up more space on a page, and data volume would determine distribution, not record volume.
>
> ATB,
> P
> _______________________________________________
> postgis-users mailing list
> postgis-users at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users
