[postgis-users] How does PostGIS / PostgreSQL distribute parallel work to cores for GIS type data?
Darafei "Komяpa" Praliaskouski
me at komzpa.net
Sun Mar 1 23:53:46 PST 2020
Hi,
There are different ways to push down the complexity of each record.
The most popular ones are simplification and subdivision.
For polygons, you can create a new table in which all polygons are
split into small pieces with a limited number of vertices using
ST_Subdivide.
This way, instead of roughly O(N^2)-but-parallel processing, you get
more entries to look up in the index at O(N log N).
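A minimal sketch of that subdivision step (table and column names are
just illustrative; 256 is ST_Subdivide's default maximum vertex count):

    -- Split every polygon of a hypothetical table "areas(id, geom)"
    -- into pieces of at most 256 vertices, keeping the source row's id.
    CREATE TABLE areas_subdivided AS
    SELECT id, ST_Subdivide(geom, 256) AS geom
    FROM areas;

    CREATE INDEX ON areas_subdivided USING gist (geom);
    ANALYZE areas_subdivided;
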
Otherwise, the mechanism for parallelization is provided by Postgres,
and not much can be done on the PostGIS side, apart from pre-clustering
the table manually on whatever feels right for your query plan.
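One way to do such manual pre-clustering is to rewrite the table in a
spatial sort order, for instance by geohash (just a sketch with
illustrative names; whether it helps depends on your query plan):

    -- Rewrite the table so rows that are close in space are also close
    -- on disk. Assumes a hypothetical table "areas(id, geom)".
    CREATE TABLE areas_clustered AS
    SELECT *
    FROM areas
    ORDER BY ST_GeoHash(ST_Transform(geom, 4326));

    CREATE INDEX ON areas_clustered USING gist (geom);
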
On Sun, Mar 1, 2020 at 2:36 PM Marco Boeringa <marco at boeringa.demon.nl> wrote:
>
> Hi Darafei,
>
> Sorry, but I do not have the intimate knowledge of the intricacies of
> PostgreSQL / PostGIS that you have, so my post was more at a layman's
> level regarding the database itself. My main expertise is more in the
> field of GIS.
>
> Anyway, regarding your post, I think this already answers some of my
> questions and thoughts.
>
> As I see it from a GIS perspective, there is one other main difference
> with "traditional" OLTP databases for e.g. a commercial bank: many GIS
> type - analytical - operations hit *all* records / tuples of a table by
> default. This is a really uncommon scenario for a "traditional" database
> system. E.g., if you want to calculate the area, perimeter, buffer or
> other types of geometry attributes or derivatives, you usually - but
> not always - do it for all records in an entire table.
>
> This means a sequential scan of the entire table, with all records being
> updated and all geometries read (yes, I know this is not recommended,
> and dropping and re-creating may be more efficient, but it is a really
> common operation from a GIS perspective; both QGIS and ArcGIS have
> "Field Calculator" type tools to do exactly that). This is why it might
> make sense to distribute records based on geometry complexity, to level
> out any differences in vertex load in tables having an unpredictable and
> uneven distribution of records with large geometries.
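>
> For illustration only, with made-up table and column names, such a
> "Field Calculator" type operation is essentially a full-table update:
>
>     -- Recompute derived geometry attributes for every record,
>     -- in the planar units of the geometry's SRID.
>     UPDATE my_polygons
>        SET area      = ST_Area(geom),
>            perimeter = ST_Perimeter(geom);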
>
> Quote:
>
> "Parallel Seq Scan divides table in equal size and gives each part to
> its own worker. It is expected that any nuance in parallel scan is
> offset by multiple queries running at the same time."
>
> While I can see that logic being utterly reasonable for a "traditional"
> OLTP database, is that also true for GIS type data with its vast size
> distribution of geometries? Although it is hard to give figures here,
> because I do not have a fully equivalent non-multi-threaded processing
> flow, I do see significant benefits from distributing records based on
> vertex complexity. This is OpenStreetMap data, and some of the records
> being processed (also due to some other intermediate processing I do)
> are well over 100k vertices and need minutes of processing per record.
> I certainly do not see "multiple queries running at the same time"
> inhibiting the multi-threaded process flow. With 8-16 connections opened
> (one for each processing thread) in my multi-threaded implementation,
> PostgreSQL / PostGIS seems to be handling the heavy processing fine (on
> SSD of course).
>
> Marco
>
> On 1-3-2020 at 11:15, Darafei "Komяpa" Praliaskouski wrote:
> > Hi,
> >
> > Can you please be more specific with the question?
> >
> > Parallelism is tuple-level. All load balancing is provided by Postgres
> > - different Parallel Scan types have different ordering properties and
> > different tuple-passing graphs.
> > Parallel Index Scan pushes each tuple out to the next free worker,
> > essentially balancing the load. Parallel Seq Scan divides the table
> > into equal-sized parts and gives each part to its own worker.
> >
> > It is expected that any unevenness in a parallel scan is offset by
> > multiple queries running at the same time.
> > Also, proper index usage beats parallel scan most of the time.
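> >
> > You can check which strategy the planner actually picks for a given
> > query with EXPLAIN; the table name below is just an example:
> >
> >     SET max_parallel_workers_per_gather = 4;
> >     EXPLAIN (ANALYZE, BUFFERS)
> >     SELECT sum(ST_Area(geom)) FROM my_polygons;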
> >
> > Thanks.
> >
> > On Sat, Feb 29, 2020 at 1:06 AM Marco Boeringa <marco at boeringa.demon.nl> wrote:
> >> Hi all,
> >>
> >> With some of the recent discussions related to PostgreSQL 12.x / PostGIS
> >> 3.x query execution performance and parallel query processing, I have
> >> been wondering: how do PostGIS / PostgreSQL actually distribute parallel
> >> work for GIS type data? Is there any form of load balancing across
> >> different cores?
> >>
> >> GIS type data fundamentally differs from "traditional" data with simple
> >> text or numeric fields in one crucial aspect: the variation in size of
> >> geometries is vastly greater than for any other basic field type in a
> >> database.
> >>
> >> E.g. polygon geometries may vary from very simple 3-vertex triangles
> >> to vast country boundary polygons with over 1M vertices. This has a
> >> *huge* impact on processing time per record as well. With geometries
> >> over 100k vertices, you quickly end up with processing times - per
> >> record! - that may extend into minutes or even hours, and in bad cases
> >> a day for a single geometry.
> >>
> >> This means that simply distributing parallel work by splitting e.g.
> >> 100M records into 10M-record chunks for 10 parallel workers, with each
> >> worker simply getting the next set of available records to process
> >> until all have been assigned to a core / parallel worker, can end up
> >> very badly in terms of load balancing.
> >>
> >> Let's assume for example that the table indeed has multiple records
> >> with geometries over 1M vertices in size, and that all of them happen
> >> to be in the first 10M-record chunk, whereas the rest of the records
> >> have vastly smaller geometries. This could mean a single parallel
> >> worker ending up having to process the bulk of the data in terms of
> >> sheer vertex count, and it might well be processing it for days!
> >> Meanwhile the other parallel workers, maybe needing to process only
> >> 10 percent of all vertex complexity in the table, finish in a few
> >> hours.
> >>
> >> I actually personally developed a Python multi-threaded generalization
> >> option that circumvents such issues: it uses the geometry's vertex
> >> count, as returned by the PostGIS 'ST_NPoints' function, to distribute
> >> records evenly across all parallel threads. By sorting the records by
> >> vertex count, and then interleaving all records from the highest to the
> >> lowest vertex count across all parallel threads, all threads end up
> >> processing approximately the same number of vertices. This leads to
> >> pretty nice load balancing, with all threads finishing in approximately
> >> the same timespan, and all cores of my server maxed out.
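> >>
> >> The core of that bucketing logic boils down to something like the
> >> following (sketched here in SQL rather than the Python I actually use;
> >> the table and column names and the thread count of 8 are illustrative):
> >>
> >>     -- Order records by vertex count and deal them out round-robin
> >>     -- over 8 buckets, so each thread gets a similar vertex load.
> >>     SELECT id,
> >>            (row_number() OVER (ORDER BY ST_NPoints(geom) DESC) - 1) % 8
> >>              AS bucket
> >>     FROM my_polygons;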
> >>
> >> I realize PostGIS is dependent on PostgreSQL for parallel query
> >> execution, and that something like what is described above may not be
> >> directly feasible in a default PostgreSQL / PostGIS installation.
> >> Nonetheless I have wondered how the current mechanism for parallel
> >> query with GIS data is implemented, and whether there is any load
> >> balancing based on geometry complexity.
> >>
> >> Marco Boeringa
> >>
> >
> >
--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa