[postgis-devel] GeomUnion Performance Info

Tue Jun 21 14:37:39 PDT 2005

On Sun, Jun 19, 2005 at 04:46:48PM -0400, bill wrote:
...
> First, the reason I had seen a performance increase was that the 
> GeomUnion implementation is dramatically influenced by the _order_ the 
> shapes are fed to it.  The query I had used had a where clause that used 
> the index, so the shapes had returned "ordered by" the spatial index -- 
> that is, clustered spatially.  I realised this when I looked at the 
> source, and confirmed it by unioning the same shapes in the following ways:
> 
> 1) As I received them from the county (which, as you will see has some 
> order to it)
> 2) Ordered by the spatial index (order by parcel_shape)

Watch out: ORDER BY will use btree index, not the GiST one.

> I realized that I was adding overhead by making the sort happen on each 
> iteration, so I removed the "order by" and did a "cluster on" to 
> actually reorder the rows on disk, and added these tests:
> 
> 3) Ordered by a random column that I added, indexed, and clustered on
> 4) Ordered by the spatial index, but again, clustered on disk
> 
> (I also tried the "MemGeomUnion" which doesn't try to load all of the 
> shapes into memory, but the impact of that change was trivial -- and 
> negative)
> 
> I've put up the results of the test (just an Excel export) on 
> http://www.mapshine.net/PerfData/
> 
> I'd be interested in any feedback.  The basic results can be summarized as:
> 
> 1) For todays PostGIS 1.x, the best GeomUnion results will on a table 
> clustered on a GIST Spatial Index.  that is:
> pgsql> create index shape_idx on tablename GIST (shape); -- vaccum as needed
> pgsql> cluster shape_idx on tablename; -- wierd syntax (it looks backwards)
> 
> If you cannot do that, then creating a subselect that sorts the items 
> before performing the shape:
> 
> pgsql> select GeomUnion(shape) from (select shape from tablename order 
> by shape) as stupidPGalias;

Again will use btree index, not GiST.
It would be interesting to understand WHAT makes this difference exactly.

> I also think that there is room for some pretty dramatic performance 
> improvements in the PostGIS GeomUnion code.  Even assuming that the GEOS 
> Union operation for two shapes is optimal (a big assumption), there are 
> far too many Union operations being done that are unnecessary.  For 
> example, we (at the PostGIS level) generally have BBoxes that can be 
> used to cull the number of Unions that are needed.  I believe this could 
> make a HUGE difference in the random addition case, and a significant 
> improvement even in the ordered case.

Maybe the ordered case would rather ensure there would be always work
for the unioner as consecutively ordered geometries would probably
intersect.

> If it doesn't step on anyone's toes, I'll plan on putting together a 
> test that shows what I'm thinking.
> 
> Hope this helps, and let me know what you think.

Sure it will help, so we have 2 lines:

	- understanding what kind of order helps union 

	- adding short-cuts in postgis union aggregate
	  bypassing GEOS in the disjoint-bboxes case

--strk;