[postgis-devel] Postgis topology creation - O(n-squared)? - creates problems with large datasets.

Sandro Santilli strk at keybit.net
Mon Jan 13 13:11:48 PST 2014


On Fri, Jan 10, 2014 at 05:27:22PM +0000, Graeme B. Bell wrote:
> >> What about producing a syntetic dataset to serve as base for analisys
> >> and see how it performs on differently configured systems ?
> > 
> > I had the same thought, and I'm making one right now.
> 
> http://graemebell.net/topologish/million_poly_topo1.zip
> http://graemebell.net/topologish/million_poly_topo2.zip
> 
> If anyone has time to help, I would be super-interested to see other people's timings of building topologies with these datasets of topologish* data. > 1 million polys each.
> 
> e.g. pick polys, order by gid limit 5000, 10000, 20000, 40000, 80000, 160000, 320000, 640000.   SRID is 25833.

I did up to 160000, after which I stopped because it took ~10 hours,
and I've been running twice: once with ST_CreateTopoGeo and once
with TopoGeo_addPolygon.

= Environment =

 PostgreSQL 9.3.1 on x86_64-unknown-linux-gnu (default config)
 POSTGIS="2.2.0dev r12137" GEOS="3.4.3dev-CAPI-1.8.3 r3958"

= Data =

 million_poly_topo1: 1161235 polygons, 5/7/116 vertices (min/avg/max)

= Queries =

 SELECT ST_CreateTopoGeo('million_poly_topo1', ST_Collect(geom))
 FROM ( SELECT geom from million_poly_topo1 order by gid limit XXXX) as f;

 SELECT TopoGeo_addPolygon('million_poly_topo1', ST_GeometryN(geom,1))
 FROM ( SELECT geom from million_poly_topo1 order by gid limit XXXX) as f;


= Timing (ms) =

 Rows  | ST_CreateTopoGeo  | TopoGeo_addPolygon |
 ------+-------------------+--------------------|
  5000 |      129427.274   |      116104.691    |
 10000 |      327703.598   |      301611.556    |
 20000 |      936224.702   |      884240.862    |
 40000 |     2976178.984   |     2891650.767    |
 80000 |    10343526.099   |    10254117.319    | 80071 nodes, 80071 edges, 80071 faces
160000 |    38352283.535   |    38375126.950    | 160167 nodes, 160167 edges, 160167 faces

Attached is a plot of those numbers

= Stats =

These are the function that are run for insertion of a _single_ polygon when
the topology already contains 165000 faces.

-- Ordered by "self_time":
strk=# select * from pg_stat_user_functions order by self_time desc;

 funcid | schemaname |        funcname         | calls  | total_time | self_time
--------+------------+-------------------------+--------+------------+-----------
  62973 | topology   | _st_addfacesplit        |      2 |    424.356 |    192.07
  52356 | public     | st_lineinterpolatepoint | 165013 |    136.911 |   136.911
  52097 | public     | geometry_overlaps       | 165015 |     92.684 |    92.684
  54503 | topology   | st_addedgemodface       |      1 |    438.418 |    13.539
  54508 | topology   | st_addisonode           |      1 |       7.39 |     7.359
  54491 | topology   | getringedges            |      2 |      2.552 |     2.552
  54533 | topology   | topogeo_addlinestring   |      1 |    450.942 |     2.385
  54534 | topology   | topogeo_addpoint        |      2 |      9.287 |     1.837

Interesting enough, "geometry_overlaps" and "st_lineinterpolatepoint" are called
once for each and every already-loaded geometry, which suggests there's indeed a
problem with index usage. The "geometry_overlaps" function is the one that
implements operator&&, supposedly an index user!

--strk;
-------------- next part --------------
A non-text attachment was scrubbed...
Name: topotime.png
Type: image/png
Size: 27030 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20140113/5adb48d4/attachment.png>


More information about the postgis-devel mailing list