[postgis-users] topology.TopoGeo_AddLineString performance depends very on order of input lines in some cases.

Lars Aksel Opsahl Lars.Opsahl at nibio.no
Wed Mar 4 07:50:28 PST 2020



________________________________

>On Wed, Mar 04, 2020 at 07:04:57AM +0000, Lars Aksel Opsahl wrote:

>

>> When testing adding millions of linestrings by splitting in around 4000 content based cells, one cell was using hours for adding a few thousands lines , but by changing the order and adding the non-closed line strings first I reduced the time to around two minutes from many  hours for this cell. When running I started from an empty Topology for each cell.

>

>Reading your numbers I see:

>

>  - Non-closed first, big first:               8h40m

>  - Big first (no matter closed/non-closed):   0h02m

>  - Closed first, big first:                   0h02m

>

>Correct me if I'm wrong. I'll include your tests below.

>Note that "false" comes before "true" with order by on

>boolean (ascending, by default).

>

>> select topology.TopoGeo_AddLineString('test_topo_02',geom,0.000001) from test_topology_05

>> order by is_closed, num_points desc;

>> Time: 31234035.504 ms (08:40:34.036)

>

>The above adds non-closed first, and each set (non-closed, then

>closed) it will add the lines with the most points first.

>

>> select topology.TopoGeo_AddLineString('test_topo_01',geom,0.000001) from test_topology_05

>> order by num_points desc;

>> Time: 164403.033 ms (02:44.403)

>

>The above adds with most dense lines first, no matter closed/unclosed.

>

>> select topology.TopoGeo_AddLineString('test_topo_03',geom,0.000001) from test_topology_05

>> order by is_closed desc, num_points desc;

>> Time: 131125.236 ms (02:11.125)

>

>The above adds with closed first, and each set (closed, then

>non-closed) it will add the lines with most points first.

>

>So, I'd derive from the above that "big first" makes a difference, no

>matter closed/non-closed (although I'd expect the contrary, to be

>honest).

>

>> But another strange thing was that time could vary a lot for the same

>> dataset and order. In this test I did not care about closed or not but

>> only line length.

>

>Yes, I'd expect this.

>

>> select topology.TopoGeo_AddLineString('test_topo_01',geom,0.000001) from test_topology_02 order by num_points desc;

>> Time: 57915.860 ms (00:57.916)

>>

>> select topology.TopoGeo_AddLineString('test_topo_02',geom,0.000001) from test_topology_02 order by num_points asc;

>> Time: 42424.087 ms (00:42.424)

>>

>> select topology.TopoGeo_AddLineString('test_topo_03',geom,0.000001) from test_topology_02 order by id asc;

>> Time: 738242.464 ms (12:18.242)

>> Time: 1197808.563 ms (19:57.809)

>> Time: 1003501.683 ms (16:43.502)

>>

>> The file is here https://github.com/larsop/resolve-overlap-and-gap/blob/add_postgis_topology_TopoGeo_addLinestring_thred_grid/src/test/sql/regress/test_topology_02.dump.gz .

>

>The cost of each addition is the cost of:

>

>  1. Finding all edges and nodes intersecting the new line

>  2. Noding the new line with existing nodes/edges, split accordingly

>  3. For each resulting split-segment:

>      3.1. Finding all edges and nodes intersecting the segment

>           (to bail out if found an intersecting one)

>      3.2. Finding edge rings on both sides of the edge

>           (to determine if a new face is created)

>

>Roughly, the above is the algorithm. Can probably be improved in

>some steps, but you can see how the order in which things are

>added matters. In general, reducing (by splitting) the length

>of lines and (by adding connecting edges) the area of

>faces should speed things up. Assuming indexes are correctly used

>(is autovacuum on?).

>


Hi again


- Yes we have autovacum on.


- I made some very simple code where I tried group edges together but that did not seem to help to much on performance,  but that may be related to that I am not making clean directed graph.



DO $$

DECLARE

    tables_non_closed CURSOR FOR

    select id,geom,num_points from test_topology_02 where ST_Isclosed(geom) = false order by num_points desc;



    working_geo geometry;

BEGIN

drop table if exists test_tt;

create table test_tt as (select geom,num_points from test_topology_02 where ST_Isclosed(geom) = false order by num_points desc);

drop table if exists test_tt4;

create table test_tt4 as select * from test_tt limit 0;

alter table  test_tt4 add column id serial;

    INSERT INTO test_tt4(geom)  select geom from test_topology_02 where ST_Isclosed(geom) = true order by num_points desc;

working_geo = null;

    FOR table_record IN tables_non_closed LOOP

        working_geo = table_record.geom;

        LOOP

          Select ST_Union(geom) from test_tt where ST_Intersects(geom,working_geo) into working_geo;

          EXIT WHEN working_geo is null;

          INSERT into test_tt4 select * from test_tt where ST_Intersects(geom,working_geo) order by num_points;

          DELETE from test_tt where ST_Intersects(geom,working_geo);

  END LOOP ;

    END LOOP;

END$$;


select topology.CreateTopology ('test_topo_04', 4258,0.000001);

select topology.TopoGeo_AddLineString('test_topo_04',geom,0.000001) from  test_tt4 order by id;

Time: 138741.013 ms (02:18.741)



- The randomness for the three same calls with same order I did not understand. The time vary from almost 20 minutes to 12 minutes.


>> select topology.TopoGeo_AddLineString('test_topo_03',geom,0.000001) from test_topology_02 order by id asc;

>> Time: 738242.464 ms (12:18.242)

>> Time: 1197808.563 ms (19:57.809)

>> Time: 1003501.683 ms (16:43.502)



Thanks a lot.


Lars





>--strk;

>_______________________________________________

>postgis-users mailing list

>postgis-users at lists.osgeo.org

>https://lists.osgeo.org/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20200304/0de9c539/attachment.html>


More information about the postgis-users mailing list