[postgis-devel] Postgis topology creation - O(n-squared)? - creates problems with large datasets.

Graeme B. Bell grb at skogoglandskap.no
Tue Jan 7 06:35:31 PST 2014


Hi everyone.

I tested postgis topology (2.1.0 r11822) by creating a topology from two national geometry datasets, containing 1.6 million and 7.8 million polygons respectively.

One source geometry dataset was produced by transforming an Oracle topology into PostGIS geometry; the other is a polygonised raster (a natural topology).

I randomly selected a fraction of the polygons and created a topology in two ways: first using createtopogeo, and then manually using topogeo_addpolygon.

The data has spatial indices, but I suspect these aren't being used because createtopogeo requires its input as a single geometry collection.
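
For reference, a minimal sketch of the two approaches, each run against a freshly created topology (the table name, geometry column and SRID below are placeholders, not our actual schema):

  -- create an empty topology (SRID here is a placeholder)
  SELECT topology.CreateTopology('test_topo', 25833);

  -- approach 1: build the topology in one call from a geometry collection
  SELECT topology.CreateTopoGeo(
      'test_topo',
      (SELECT ST_Collect(geom) FROM sample_polygons)
  );

  -- approach 2: add the polygons one at a time
  SELECT topology.TopoGeo_AddPolygon('test_topo', geom)
  FROM sample_polygons;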

The results looked like this:

1.7 million polygon dataset:

1/512th of the data: 24 seconds
1/256th of the data: 76 seconds
1/128th of the data: 214 seconds
1/64th of the data: 707 seconds
1/32nd of the data: 2430 seconds


7.8 million polygon dataset: 

1/512th of the data: 509 seconds
1/256th of the data: 1905 seconds
1/128th of the data: 6944 seconds


Manually adding polygons with topogeo_addpolygon produced CPU costs 50-100% higher than the createtopogeo function, growing at a similar rate; I did not complete testing with it.

In both cases, the cost of creating the topology grows by 3-4x each time the size of the source geometry set doubles. As the data becomes less sparse (e.g. 1/512th of a national dataset is pretty sparse), the trend seems to be towards 4x more CPU time for 2x extra data, i.e. the quadratic behaviour suggested in the subject line.

We would like to use postgis topology, but judging from this growth in cost, creating a topology would take on the order of years for the larger dataset and about a month for the smaller one. These are not our largest geometry datasets.
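
To make the extrapolation explicit, here is the back-of-the-envelope arithmetic, assuming purely quadratic scaling from the largest samples measured above (illustrative only, not a benchmark):

  -- rough quadratic extrapolation from the measured samples above
  SELECT
      2430.0 * 32 * 32 / 86400.0          AS days_for_full_1_7m_dataset,   -- ~29 days
      6944.0 * 128 * 128 / 86400.0 / 365  AS years_for_full_7_8m_dataset;  -- ~3.6 years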

Does anyone have any ideas or suggestions about how we could proceed from here? Unfortunately I cannot share the datasets for testing purposes.

Graeme.