[postgis-devel] Postgis topology creation - O(n-squared)? - creates problems with large datasets.

Sandro Santilli strk at keybit.net
Tue Jan 21 02:58:00 PST 2014


On Tue, Jan 21, 2014 at 09:20:36AM +0000, Graeme B. Bell wrote:
> 
> Hi Sandro,
> 
> >> I'll wrap up the rbuild topology script neatly and put it on github, and post to the list when it's ready. 
> >> That way, people can generate topologies of any complexity they like, in any zone they like. 
> > 
> > That'd be perfect for a performance test. It could be only enabled
> > when "rbuild" is found and feed a version-controlled configuration
> > to it!
> 
> 2 issues
> 
> 1. Currently generates a random topology. We could control the seed but it's probably better to have some fixed files that get downloaded if you want to do performance tests. 30MB is not too big nowadays, somewhere between a few seconds and a minute for almost everyone. 
> 
> 2. It takes *lots* of time to build the topology from raster. I hide this by making the geometry fit neatly with the tiling system and letting rbuild parallelise like mad and using a reasonably fast system for builds. It wouldn't be fun for most devs. Better to download standard test data, I'd say. 

I'd rather see an SQL script to generate the data.
A "test data" download would be a new asset, nothing like that exist
at this time.

> > The asymptote on that curve is looking a lot more healthy :-)
> > 
> > You can see what I meant about "ST_CreateTopogeo" not being
> > optimized at all, it behaves the same as the incremental builder,
> > and often it even takes more time.
> 
> Indeed.
> 
> Let's think ... 
> 
> - Is there anything clever that can be done if you know all the polygons you want to add at once?   (in terms of the order they are added (left to right? big through small? randomise?)  - or in terms of how they're grouped for processing?)

The first operation performed is noding, so after that we could
completely skip re-checking for intersections/crossing etc.
The only real operation we need is edge-linking, that is writing
next left/right edge and left/right face. I've been thinking for
some time now that we should have a standalone function to do
just that. It should likely be topology.Polygonize doing that
(what that function currently does is I think less than it should
and it's still not optimized, there should be a ticket about that).

> - Currently they are passed as a collection. Would it help if they were passed or accessed in some other way?

No, it must be a collection, that's by ISO specs.

> - Could the user hint that the geometry is already closely topology-like? Is this useful or does it encourage risk taking? 

The performance problem is _after_ the noding step, so that 
won't help. We already known its topologically valid when we
spend the most time

> - What parameters / internal constants are used; how do they affect the resulting topology and the costs of building it.

I don't get this.

> - Profiling: currently, on large datasets, what 3-5 functions eat the most time in createTopo or the incremental approach, and why? 

ST_ModEdgeSplit is the most expensive function for the incremental approach.
It performs checks that are not needed when called from the internal population
function. An improvement would be splitting the function into an internal one
and an external one, and calling the internal from internal code. This has been
done successfully in other cases.

> - If createTopo is slower than the incremental builder, can we make them the same code?

If you want to completely give up the possible optimization...
(I would rather try them)

--strk;

 ()  ASCII ribbon campaign        - against html e-mail
 /\  http://www.asciiribbon.org   - against proprietary attachments



More information about the postgis-devel mailing list