[postgis-devel] Postgis topology creation - O(n-squared)? - creates problems with large datasets.

Fri Jan 10 08:29:42 PST 2014

Hi all,

I'm replying under the original thread title.
I broke the thread subject accidentally by replying to a digest email previously. Sorry.
We also have 2 other extra alternative subject lines that were added by others.

I suggest we bring the discussion back under the original title, to make it easier for anyone visiting this thread in future. Relevant messages are in:

> Postgis topology creation - O(n-squared)? - creates problems
>      with large datasets.
> Re: postgis-devel Digest, Vol 130, Issue 5

> Topology building performance

> Topology construction speed

Replying to Remi and Sandro's messages today: 

Remi,

> When you ask the question "what are the closest n polygons to this point"
> to the database, it is ln(n)  with indexes (or better wit cache), so that
> makes the whole process  n *ln(n).

Agree, probably ln(n) - I don't know enough about spatial indexing to know if you could do a pigeonhole sort or funky hash.
Still ln(n) looks pretty good. I can work with a DB that slows down a factor of 13 when I have 10 million items, but not so much with a DB that slows by 10 million.

> But I would guess the slower is not this but self join on the "edge" table
> to "walk" trough topology. Yet it has also been indexed.

Perhaps I am making a mistake with indexing on the topology. How are other people building indexes on their topologies? I have a spatial index on the source geometry and I'm just making the topology with the standard constructor. Should I be adding another index manually along the way?

> I don't know a way to analyse execution time inside a function in postgres,
> I'm gonna ask on postgres list.

Thanks, Remi!

Sandro,

>> 0. It is not necessary to check all other already existing topology when you add a polygon. It's only necessary to check the nearest neighbours to the centerpoint that overlap the bounding box. Checking against all neighbours that overlap the bounding box would be slower but OK too. We're talking about checking a constant 3-15 polygons, not 'n'.
> 
> You mean already existing topology _elements_, right ?
> The code only checks elements whose bounding box overlap
> the bounding box of the newly added geometry. The smaller
> your geometries bounding box, the easiest. I guess ring
> insertion could likely be optimized by cutting them in 4 or
> more parts, but there's currently no ready function to do
> that efficiently and without introducing drifts (ie: split
> on existing vertex).
> 
> You can uncomment the POSTGIS_TOPOLOGY_DEBUG define in
> topology/sql/populate.sql.in (then run "make -C topology"
> and psql -f topology/topology_upgrade.sql) to get DEBUG
> messages that'd tell you how many candidates are being
> considered.

Useful to know, thanks!

> 
>> 1. Not being exponential isn't the issue here. N-squared growth plus a slow starting point is already too high for postgis topology to be practical for anything other than trivially small datasets.
> 
> I hope you can help making it better.

I will if I can. First I'm trying to gain information about this problem and gauge interest in solving it. 

>> 2. In principal, there is absolutely no reason for this to be n-squared. We know where the polygon is going. We know, from the fact it is a topology, that it can only affect or be affected by the first few polygons that border the centerpoint - whether the map is a million or a hundred polygons, you'd expect each polygon to affect or be affected by only about 3-15 neighbours. It would be a strange topology indeed if most polygons had even 30 neighbours.
> 
> This is a duplicate of 0.

Yes, same idea, responding to two different comments.

>> Given that spatial indices allow fast lookups of what those neighbouring polygons are, adding a single polygon to the topology should therefore be taking place in a small constant or near-constant time.
> 
> It should be verified that the index is well balanced and
> analisys is up to date as the index grows. Doing everything
> in a single transaction doesn't help. The diagrams I was
> talking about included runs with batch loads, and it helped
> a lot. Can you try ?

1 - I am not using a transaction.

2 - I am using the standard API constructor call to build a topology from a collection of geometry. If that is the wrong way to build a topology, then I don't know what the correct way is.

I've looked back through my postgis-devel digests and I can't find the diagrams you're talking about. Is it this it?

> I remember someone posted a nice graph with timings which didn't look
> exponential. But I can't find the link to it right now.

> Q. Do postgis topology functions (including constructors) make use of a spatial index? Is there any way to put a spatial index on topology items? 
>> 
> They do, supposedly.

... any idea where I can find out more, other than digging around in the source code? (e.g. design spec?)

> Adding elements to a topology should both
> use the index _and_ modify it. There are indices on many tables,
> which could be part of the slowdown too.
> 
> The really hard part is profiling all of this...
> 
>> "Slow" would be hours or days, and would be totally acceptable. "Decades" means "broken". For a situation where it is only necessary to clip against or adjust a few closest intersecting neighbours, 5-20 years is not an OK run time for processing a few million polygons on a modern machine, if a thousand can be done in seconds. 
>> 
>> Why should adding the 5-millionth small polygon take days to compute, when it is only touching against 5 other polygons? There is definitely a fundamental problem here. 
> 
> What about producing a syntetic dataset to serve as base for analisys
> and see how it performs on differently configured systems ?

I had the same thought, and I'm making one right now.

> (huge improvement
> made after that to the ST_BuildArea function -- wonder if our Graeme
> had that in).

Could be. I'm not in a position to update our postgis at the moment but I will try to get from 2.1.0 to 2.1.1 asap.

> IIRC the profile graph we got was also made using a batch import
> and showing time of each set on the graph.
> Was pretty linear if I'm not wrong.

Thinking of Amdahl's law, the big-O of the slowest parts of your code may only begin having a noticeable effect  on total runtime when they becomes the biggest fraction of the runtime. 

Thanks for your help and thoughts guys,

Graeme.