[postgis-users] GeomUnion Performance Info

Bill Binko bill at binko.net
Mon Jun 20 11:24:31 PDT 2005


Hi everyone,

(I posted this to -devel, but the traffic is awfully low there, so I 
decided to cross-post it here.  Sorry if you get this twice.)

So I was chatting on the #mapserver IRC channel the other night, and 
someone (Primer) asked how to speed up GeomUnion performance which was 
killing him (taking literally days).

My answer was: create a spatial union on the data.

Another kind soul (westside) explained to me that I was an idiot, and 
that the PostGIS functions didn't have access to the indexes, only 
operators did.

After a quick look at the source, I realized he was right.  However, 
this irritated me to no end, as I _knew_ I had gotten performance 
improvements in GeomUnion my adding an index to my county parcels data.

So (hating to be wrong), I started performing some basic tests, and 
found a few interesting tidbits, I'd like to share.  

First, the reason I had seen a performance increase was that the 
GeomUnion implementation is dramatically influenced by the _order_ the 
shapes are fed to it.  The query I had used had a where clause that used 
the index, so the shapes had returned "ordered by" the spatial index -- 
that is, clustered spatially.  I realised this when I looked at the 
source, and confirmed it by unioning the same shapes in the following ways:

1) As I received them from the county (which, as you will see has some 
order to it)
2) Ordered by the spatial index (order by parcel_shape)

I realized that I was adding overhead by making the sort happen on each 
iteration, so I removed the "order by" and did a "cluster on" to 
actually reorder the rows on disk, and added these tests:

3) Ordered by a random column that I added, indexed, and clustered on
4) Ordered by the spatial index, but again, clustered on disk

(I also tried the "MemGeomUnion" which doesn't try to load all of the 
shapes into memory, but the impact of that change was trivial -- and 
negative)

I've put up the results of the test (just an Excel export) on 
http://www.mapshine.com/PerfData/

I'd be interested in any feedback.  The basic results can be summarized as:

1) For todays PostGIS 1.x, the best GeomUnion results will on a table 
clustered on a GIST Spatial Index.  that is:
pgsql> create index shape_idx on tablename GIST (shape); -- vaccum as needed
pgsql> cluster shape_idx on tablename; -- wierd syntax (it looks backwards)

If you cannot do that, then creating a subselect that sorts the items 
before performing the shape:

pgsql> select GeomUnion(shape) from (select shape from tablename order 
by shape) as stupidPGalias;

I also think that there is room for some pretty dramatic performance 
improvements in the PostGIS GeomUnion code.  Even assuming that the GEOS 
Union operation for two shapes is optimal (a big assumption), there are 
far too many Union operations being done that are unnecessary.  For 
example, we (at the PostGIS level) generally have BBoxes that can be 
used to cull the number of Unions that are needed.  I believe this could 
make a HUGE difference in the random addition case, and a significant 
improvement even in the ordered case.

If it doesn't step on anyone's toes, I'll plan on putting together a 
test that shows what I'm thinking.

Hope this helps, and let me know what you think.

Bill




More information about the postgis-users mailing list