[geos-devel] fastunion aggregate

strk strk at keybit.net
Mon Oct 27 13:45:22 EST 2003


dblasby wrote:
> strk,
> 
> I assume your fast algorthim looks like this:
> 
> 1. Define a type for "Array of Geometry" (already automatically defined 
> by postgresql)
> 
> 2. have a modified collect() that just builds a big array of geometries 
> instead of building a single big geometry.   If you wanted to get really 
> fancy, you could probably just store the geometry's subobjects (but 
> thats probably extra effort for little gain).
> 
> 4. Then do the actual union (foldl).  Be aggressive in memory deallocation.
>      result = 1st geometry in array
>      for each other geometry in array
>         result = result union new geometry
> 
> 
> 5. convert the resulting geometry to postgis.

Yes. This is exactly how it works.

> Alternatively, we could define a pair-wise union (i.e. 1 union 2 = A, 3 
> union 4 = B, 5 union 6 = C,7 union 8 = D... then A union B , C union 
> D...)  This might work out better for certain types of data.

We can modify that if you think will be faster.

> This should be pretty fast since we're only converting postgis->GEOS 
> once and only one GEOS->postgis.

It seems faster to me.

> 
> Unfortunately, for large unites, we could have memory problems.
> The above process will use memory equal to all the geometries.  During 
> step #4, we should be holding about constant memory in the worst case.
> 
> In the worst case (uniting a set of disjoint polygons), this shouldnt 
> use more memory than any other method.

The other method (GeomUnion recursion) *might* release detoasted memory
sooner. If postgresql will release row-based memory in aggregate calls
(as I think I've been told) that would reduce memory (need to be tested)

> dave
> ps. Ideally it would be great to carry around GEOS geometries instead of 
> postgis geometries, but I think we'll run into problem with events like 
> aborted queries and the like.

It'll be greater to carry around PlanarGraph, but I'm not prepared
to this yet. Maybe Martin can help here. The SweepLine would ideally
be run only once. For this to work we need to construct the graph of
the whole geometry set in input. If the graph for a geometry is not
much bigger then the geometry itself this might be not-so
memory-consuming. 

Actually I don't know who would do this - don't look at me ;)

> pps.  The geometry-constructing routine used by collect is O(n^2), but I 
> bet it would be much faster to write your own (O(n)).  Basically, you'll 
> have to scan the Array of Geometry to find the number of sub-objects and 
> their sizes (make sure you account for memory alignment).  Then you just 
> create a big resulting geometry.
> 
> The current method looks like this:
> 	a. create a single-subobject geometry
> 	b. create a brand new 2-object geometry based on a + another sub-object
> 	c. create a brand new 3-object geometry based on b + another sub-object
> 	d. create a brand new 4-object geometry based on c + another sub-object
> 
> Each sub-object will be copied, on average, n/2 times.

Well. In this case memory usage can not really grow, we will
at most use geometry header * num_input_rows more then current
usage. I can take a look at it.

--strk;



More information about the geos-devel mailing list