Qgis/Postgis : Multi Union (25 polygonal layers)

celati Laurent laurent.celati at gmail.com
Wed Jul 23 06:38:25 PDT 2025


Good afternoon,

I'm taking the liberty of adding one additional question. The script proposed
in this post works. I "just" need two small changes:

   - In the final layer, I don't just want to retrieve the res_id,
   data_id_ori, num_sources, and geom fields, but also many other fields
   from my source layer, layer_union. Could you tell me at what level of
   the script I should add these fields that I ultimately want to keep? Is
   it in the section where the "res_assoc" table is created, or where the
   "res_final" table is created? The problem is that at the "res_assoc"
   level, I would have no choice but to add them to the GROUP BY (see the
   sketch after this list).

   - Second complexity: in the current script, I'm creating a union from a
   single source dataset, layer_union. But I need to combine this
   layer_union data with another source dataset, land use.
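
For what it's worth, a minimal sketch of both changes, with two assumptions:
field_a stands in for one of your real extra columns on layer_union, and
land_use is a hypothetical table with compatible data_id/geom columns.

-- Change 1 (sketch): carry extra source fields through res_assoc by
-- aggregating them like data_id, so nothing new enters the GROUP BY.
-- field_a is a hypothetical placeholder for a real layer_union column.
drop table if exists res_assoc;
create table res_assoc as
select res.res_id,
       array_agg(l.data_id) as data_id_ori,
       array_agg(l.field_a) as field_a_ori,
       count(distinct l.data_id) as num_sources
from res
join layer_union l on st_contains(l.geom, st_pointonsurface(res.geom))
group by res.res_id;

-- Change 2 (sketch): append the land-use source (hypothetical table
-- land_use) to layer_union, so the rest of the pipeline runs unchanged.
insert into layer_union (data_id, geom)
select data_id, geom from land_use;

Since each result polygon can come from several source polygons, per-source
fields naturally become arrays here; res_final can then carry them through
its join unchanged.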

Thanks so much.


On Sat, Jul 12, 2025 at 11:04, Franck Theeten <
franck.theeten at africamuseum.be> wrote:

> Hi,
>
> Out of curiosity, why are you indexing the intermediate table by
> ST_PointOnSurface?
> The documentation doesn't guarantee that the returned point is always the
> same, so it introduces some degree of randomness.
> This should not have an influence on the number of rows in res, but could
> affect the default sort order when joining it with layer_union.
> Maybe indexing on ST_Centroid and/or using an explicit ORDER BY on res_id
> when building res_assoc could make the process more stable?
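>
> For example, a sketch of the explicit-ordering part (reusing the
> res_assoc definition from the script below):
>
> -- sketch: make the aggregation and output order explicit so reruns agree
> drop table if exists res_assoc;
> create table res_assoc as
> select res.res_id,
>        array_agg(l.data_id order by l.data_id) as data_id_ori,
>        count(distinct l.data_id) as num_sources
> from res
> join layer_union l on st_contains(l.geom, st_pointonsurface(res.geom))
> group by res.res_id
> order by res.res_id;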
>
> ------------------------------
> *From:* Martin Davis <mtnclimb at gmail.com>
> *Sent:* Friday, July 11, 2025 21:24
> *To:* celati Laurent <laurent.celati at gmail.com>
> *Cc:* postgis-users at lists.osgeo.org <postgis-users at lists.osgeo.org>
> *Subject:* Re: Qgis/Postgis : Multi Union (25 polygonal layers)
>
> The reply on GIS StackExchange was a good one, so reproducing it here for
> this list:
>
> Can you investigate further as to the differences between two identical
> runs, apart from the number of features? Do you get the same behaviour
> working on a small subset of the data? Can you see if it's "sliver"
> features or other tiny differences, due perhaps to arithmetic rounding
> errors - these can result in non-reproducibility in parallel processing
> systems when the order of operations isn't well-defined (i.e. (a+b+c)
> could be (a+b)+c or a+(b+c), which might not be equal because of floating
> point precision...)
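>
> A quick SQL illustration of that non-associativity:
>
> -- float8 addition depends on grouping:
> select (0.1::float8 + 0.2) + 0.3 as left_assoc,   -- 0.6000000000000001
>         0.1::float8 + (0.2 + 0.3) as right_assoc; -- 0.6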
>
> I would add that this situation can happen even when there is no
> parallelism.  It can arise whenever the order of inputs to operations is
> not deterministic throughout the process.  You might be able to provide
> this determinism by sorting every input and intermediate result, but it's
> questionable whether that would be worth the time and effort.
>
> It would be interesting to know the difference in total result area
> between the different runs.  This should be very low.
>
> A more detailed test would be to match the two results (via
> point-in-polygon) and then compare matched polygon boundaries (via
> Hausdorff distance).  If there is a significant difference between two
> results, that would be worth investigation.
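>
> For instance, a sketch of both checks (res_final_v1 and res_final_v2 are
> hypothetical snapshots of res_final from two runs):
>
> -- total-area difference between the two runs (should be very low)
> select (select sum(st_area(geom)) from res_final_v1)
>      - (select sum(st_area(geom)) from res_final_v2) as area_diff;
>
> -- match polygons across runs by point-in-polygon, then compare their
> -- boundaries; large Hausdorff distances would deserve a closer look
> select v1.res_id,
>        st_hausdorffdistance(st_boundary(v1.geom),
>                             st_boundary(v2.geom)) as hd
> from res_final_v1 v1
> join res_final_v2 v2
>   on st_contains(v2.geom, st_pointonsurface(v1.geom))
> order by hd desc
> limit 20;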
>
> On Fri, Jul 11, 2025 at 3:22 AM celati Laurent <laurent.celati at gmail.com>
> wrote:
>
> Good afternoon,
> Thanks so much for your message.
> I succeeded in executing a PostGIS script without any error message. It
> works.
> However, I notice a lack of reliability/stability: when I rerun the same
> process several times, I never end up with exactly the same number of
> features in my intermediate/final result tables.
> I'm taking the liberty of sharing the SQL script with you.
> The screenshot compares the number of objects for test v1 and test v2
> (which are identical tests). We can see that there is a difference in the
> res_final table, but also in the res table. I ran several tests again,
> still with different numbers of objects for the res and res_final tables,
> often with larger differences than the one shown in the screenshot.
>
> Number of entities for each table:
> *Test 1*:
> layer_union table: 1026194
> res table: 1462661
> res_assoc table: 1462661
> res_final table: 1462661
>
> *Test 2*:
> layer_union table: 1026194
> res table: 1462645
> res_assoc table: 1462645
> res_final table: 1462635
>
> I share the script below (it is also attached):
>
> -- import all shp into the PostGIS db
> -- UNION ALL of the geometries and attributes of the 28 data sources into
> -- a single table
> drop table if exists layer_union;
> create table layer_union as
> select inrae_2014.data_id, inrae_2014.geom from inrae_2014 UNION ALL
> select aesn_2006.data_id, aesn_2006.geom from aesn_2006 UNION ALL
> select aesn_2019.data_id, aesn_2019.geom from aesn_2019 UNION ALL
> --(...)etc.
>
> -- res table: node all input boundaries, polygonize, dump to single polygons
> drop table if exists res;
> create table res as
> with tmp as (
>     select st_union(ST_Force2D(st_boundary(geom))) as geom from layer_union
> ), dumped as (
>     -- polygonize once, keep only polygons (type 3), dump the collection
>     select st_dump(st_collectionextract(st_polygonize(geom), 3)) as d
>     from tmp
> )
> select (d).path[1] as id,
>        (d).geom::geometry(polygon, 2154) as geom
> from dumped;
>
> -- res table: unique id
> alter table res add column res_id int generated always as identity
> primary key;
> -- res table: index on st_pointonsurface
> create index on res using gist (st_pointonsurface(geom));
> analyze res;
>
> -- res_assoc table
> -- simple JOIN to filter out polygons with no link to the input polygons
> -- (for instance: holes in the input data)
> drop table if exists res_assoc;
> create table res_assoc as
> select res.res_id,
>        array_agg(l.data_id) as data_id_ori,
>        count(distinct l.data_id) as num_sources
> from res
> join layer_union l on st_contains(l.geom, st_pointonsurface(res.geom))
> group by res.res_id;
> -- res_assoc table: index creation
> create index on res_assoc(res_id);
> analyze res_assoc;
>
> -- cleaning: remove from the res table the polygons that did not match in
> -- res_assoc: these are polygons representing holes in the input layers
> delete from res
> where not exists (
>     select null
>     from res_assoc ra
>     where ra.res_id = res.res_id);
>
>
> -- final table with the new polygons and the source-polygon information,
> -- as a join: much faster to create a new table than to update the res
> -- table (after adding the desired columns).
> drop table if exists res_final;
> create table res_final as
> select ra.res_id, data_id_ori, num_sources,
>        geom::geometry(polygon, 2154) as geom
> from res_assoc ra
> join res r on ra.res_id = r.res_id;
>
> Thanks so much
>
>
> On Tue, Jul 8, 2025 at 02:54, <snorris at hillcrestgeo.ca> wrote:
>
> Here is a working example of Martin's suggestion, for a job that sounds
> fairly similar:
> https://github.com/bcgov/harvest-restrictions/blob/main/sql/overlay.sql
>
>
> On Jul 7, 2025, at 4:45 PM, Martin Davis <mtnclimb at gmail.com> wrote:
>
> I'd characterize your use case as "Overlay of overlapping polygonal
> datasets".  The basic state-of-the art for solving this using PostGIS is
> still the solution Paul outlined in
> https://blog.cleverelephant.ca/2019/07/postgis-overlays.html (or see
> https://dr-jts.github.io/postgis-patterns/overlay/overlay.html#count-overlap-depth-in-set-of-polygons
> for more ideas).
>
> Basically, you node and polygonize to make a flat coverage, and then join
> back to the parent layers to determine attribution (including counts).
>
> Doing this in a single query might be slow for very large datasets like
> yours, though.  You might be able to partition your large dataset and run
> smaller queries, possibly in parallel.  Also, it might be better to overlay
> the small layers first, and then overlay that with the big layer.  And if
> you don't care about overlaps in the big layer (or if there are none), that
> makes it much easier, since you can process each big-layer polygon
> independently (and ideally in parallel).
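>
> As a rough sketch of that last idea (big_layer and small_overlay are
> hypothetical placeholders: the big input layer, and a precomputed overlay
> of the 24 small layers):
>
> -- each big-layer polygon is processed independently, so this join can be
> -- partitioned by any unique id on big_layer and run in parallel
> create table big_overlaid as
> select b.data_id as big_id,
>        s.data_id as small_id,
>        st_intersection(b.geom, s.geom) as geom
> from big_layer b
> join small_overlay s
>   on st_intersects(b.geom, s.geom);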
>
> On Mon, Jul 7, 2025 at 1:16 PM celati Laurent <laurent.celati at gmail.com>
> wrote:
>
> Dear all,
> I'm working with QGIS and PostGIS. As input, I have 25 polygonal layers
> covering a large, multi-city area. One of them is a very large dataset
> (1 million objects). The other 24 are much smaller (a maximum of a hundred
> objects each).
> For information, I should point out that some of these polygonal datasets
> are in "multi-part features" mode and others in "single-part features"
> mode. I imagine this may ultimately have a slight impact on the
> method/result. These 25 polygonal .shp files have highly variable,
> non-homogeneous/non-harmonized data structures. Each layer has a "data_id"
> field that identifies, for each feature, the layer it belongs to. For
> example, all values of the "data_id" field for the first layer are '1';
> for the second layer, the values are '2', etc.
>
> My goal would be to be able to apply/adapt the existing QGIS geoprocessing
> tool called "Multiple Union":
>
> https://docs.qgis.org/3.40/en/docs/user_manual/processing_algs/qgis/vectoroverlay.html#union-multiple
>
> Below is a screenshot from the QGIS documentation:
>
> [screenshot: the QGIS "Union (multiple)" algorithm documentation]
>
> My goal would be to have an output file:
>
>
>    - It would be the result of the union/overlay of the 25 input
>    datasets. To use the terms of the QGIS documentation, the processing
>    should check for overlaps between features within the 25 layers and
>    create separate features for the overlapping and non-overlapping parts.
>    This "multiple union" geoprocessing seems well suited to my goal where
>    there is no overlap (a, NULL; b, NULL; c, NULL).
>
>
>    - For areas where there is an overlap, the QGIS union geoprocessing
>    creates as many identical overlapping features as there are features
>    participating in this overlap. This doesn't bother me in itself. But
>    since, ultimately, I'd like a field in the result/output file that
>    retrieves, for each feature, the list of input layers that
>    participate/contribute to it (in order to retrieve the origin/source
>    of the data), I was wondering whether it might be better if only one
>    feature were created per overlapping area?
>
>
>    - I'd like a field in the result file that retrieves, for each
>    feature, the list of input layers that participate/contribute to this
>    result feature, in order to retrieve the origin/source of the data.
>
>
>    - Ideally, a field that allows you to retrieve the number (COUNT) of
>    layers that contribute to this feature (at least 1 layer, at most 25
>    layers).
>
>
>    - Regarding the non-geometric attributes/fields, I would like to be
>    able to specify/define the list of fields I ultimately want to keep. I
>    don't want to keep all of the fields, but rather just some of the
>    fields from each of the 25 input layers.
>
>
> I imagine it's recommended to do this processing in PostGIS rather than in
> QGIS? I can, if necessary, import my 25 SHP files into a PostGIS database.
> I also imagine it's important to keep in mind that the "multi-part
> features" / "single-part features" mode of the input layers can affect the
> result. If I use a PostGIS query, I was thinking it might be helpful to
> force all features into single-part mode (using the PostGIS 'st_dump'
> function?), as sketched below.
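>
> For example, a minimal sketch of that idea (layer_multi is a hypothetical
> multi-part input table; 2154 is just an example SRID):
>
> -- explode multi-part features into single-part polygons with st_dump
> create table layer_single as
> select data_id,
>        (st_dump(geom)).geom::geometry(polygon, 2154) as geom
> from layer_multi;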
>
> Thanks so much in advance for your help and guidance.
>
>
>