[postgis-users] topology.TopoGeo_AddLineString performance depends very on order of input lines in some cases.
Sandro Santilli
strk at kbt.io
Wed Mar 4 06:18:01 PST 2020
On Wed, Mar 04, 2020 at 07:04:57AM +0000, Lars Aksel Opsahl wrote:
> When testing adding millions of linestrings by splitting in around 4000 content based cells, one cell was using hours for adding a few thousands lines , but by changing the order and adding the non-closed line strings first I reduced the time to around two minutes from many hours for this cell. When running I started from an empty Topology for each cell.
Reading your numbers I see:
- Non-closed first, big first: 8h40m
- Big first (no matter closed/non-closed): 0h02m
- Closed first, big first: 0h02m
Correct me if I'm wrong. I'll include your tests below.
Note that "false" comes before "true" with order by on
boolean (ascending, by default).
> select topology.TopoGeo_AddLineString('test_topo_02',geom,0.000001) from test_topology_05
> order by is_closed, num_points desc;
> Time: 31234035.504 ms (08:40:34.036)
The above adds non-closed first, and each set (non-closed, then
closed) it will add the lines with the most points first.
> select topology.TopoGeo_AddLineString('test_topo_01',geom,0.000001) from test_topology_05
> order by num_points desc;
> Time: 164403.033 ms (02:44.403)
The above adds with most dense lines first, no matter closed/unclosed.
> select topology.TopoGeo_AddLineString('test_topo_03',geom,0.000001) from test_topology_05
> order by is_closed desc, num_points desc;
> Time: 131125.236 ms (02:11.125)
The above adds with closed first, and each set (closed, then
non-closed) it will add the lines with most points first.
So, I'd derive from the above that "big first" makes a difference, no
matter closed/non-closed (although I'd expect the contrary, to be
honest).
> But another strange thing was that time could vary a lot for the same
> dataset and order. In this test I did not care about closed or not but
> only line length.
Yes, I'd expect this.
> select topology.TopoGeo_AddLineString('test_topo_01',geom,0.000001) from test_topology_02 order by num_points desc;
> Time: 57915.860 ms (00:57.916)
>
> select topology.TopoGeo_AddLineString('test_topo_02',geom,0.000001) from test_topology_02 order by num_points asc;
> Time: 42424.087 ms (00:42.424)
>
> select topology.TopoGeo_AddLineString('test_topo_03',geom,0.000001) from test_topology_02 order by id asc;
> Time: 738242.464 ms (12:18.242)
> Time: 1197808.563 ms (19:57.809)
> Time: 1003501.683 ms (16:43.502)
>
> The file is here https://github.com/larsop/resolve-overlap-and-gap/blob/add_postgis_topology_TopoGeo_addLinestring_thred_grid/src/test/sql/regress/test_topology_02.dump.gz .
The cost of each addition is the cost of:
1. Finding all edges and nodes intersecting the new line
2. Noding the new line with existing nodes/edges, split accordingly
3. For each resulting split-segment:
3.1. Finding all edges and nodes intersecting the segment
(to bail out if found an intersecting one)
3.2. Finding edge rings on both sides of the edge
(to determine if a new face is created)
Roughly, the above is the algorithm. Can probably be improved in
some steps, but you can see how the order in which things are
added matters. In general, reducing (by splitting) the length
of lines and (by adding connecting edges) the area of
faces should speed things up. Assuming indexes are correctly used
(is autovacuum on?).
--strk;
More information about the postgis-users
mailing list