[postgis-devel] GeomUnion speedups

Sat Jun 25 17:04:42 PDT 2005

On Sat, Jun 25, 2005 at 06:58:07PM -0400, Bill Binko wrote:
...
> As I mentioned before, I really think the extra overhead incurred by 
> MemGeomUnion (converting from PostGIS<->GEOS repeatedly) can be removed by 
> passing an opaque handle to the internal GEOS structures that you're 
> building.

Do you mean a memory pointer ?
It would leak on ERROR though...

> That solves the problem when you're not dealing with 50 complex 
> multi-polys (like your test), but with 330,000 simple polyes (like in 
> mine).  If you load all of those into memory and detoast them all, you 
> rapidly run into a memory issue, and start thrashing or failing.
>
> Perhaps we could build do it in chunks (have the sfunc have a counter that 
> trips a collect every X shapes) and have the ffunc do any that haven't 
> been collected yet and then do the buffer()?

Note that the collect() function does its work in couples.
Still by the end of the run you end up with a collection of
330,000 simple polys which is not much smaller then 330,000
distinct ones.

The only way to really reduce memory is dissolving nodes/edges
by means of Overlay operations, performed by GEOS/JTS.

Chunking would be useful for that, with the cost of more
graph constructions and conversions (not so slow).

> That would be nicely configurable (how many per chuck, etc).

I'd use memory size to define chuncks if possible.

> I am surprised that the new collect/buffer is not impacted by order: is 
> that a supposition?  Or did you test that?

It's the evidence ;)
GEOS operation is invoked only once, so a single graph is built
with all nodes/edges of all geometries, and a single conversion
is made. Order doesn't count then.

Both conversion and graph build (more the latter then the former)
have a cost related to size of input. Order influences the size
of input between calls. Imagine a complex polygon fully contained
inside a simple one. As soon as they are involved togheter in
a union operation the complex polygon will dissolve. If the
container polygon comes *last* probabilities are higher to have
this complex polygon laying around and being converted, noded
and stuffed in a graph for as many times as there are rows in your
table.

With the short-circuit we reduced number of graph-noding
operations in the disjoint-envelopes case, but there are many
cases in which bbox intersect but geometries doesn't or they
do but only for a small portion (ie. not dissolving many
vertexes).

So chunking would again have the ORDERING factor to take in
consideration.
I'm tempted to leave all of this in the user's hand ;)
Given currently available tools you can chose your algorithm
for a performant union operation. These are the ingredients:
	collect()
	buffer()
	ORDER BY
	LIMIT

--strk;