[postgis-users] GeomUnion Performance Info
Bill Binko
bill at binko.net
Mon Jun 20 11:24:31 PDT 2005
Hi everyone,
(I posted this to -devel, but the traffic is awfully low there, so I
decided to cross-post it here. Sorry if you get this twice.)
So I was chatting on the #mapserver IRC channel the other night, and
someone (Primer) asked how to speed up GeomUnion performance which was
killing him (taking literally days).
My answer was: create a spatial union on the data.
Another kind soul (westside) explained to me that I was an idiot, and
that the PostGIS functions didn't have access to the indexes, only
operators did.
After a quick look at the source, I realized he was right. However,
this irritated me to no end, as I _knew_ I had gotten performance
improvements in GeomUnion my adding an index to my county parcels data.
So (hating to be wrong), I started performing some basic tests, and
found a few interesting tidbits, I'd like to share.
First, the reason I had seen a performance increase was that the
GeomUnion implementation is dramatically influenced by the _order_ the
shapes are fed to it. The query I had used had a where clause that used
the index, so the shapes had returned "ordered by" the spatial index --
that is, clustered spatially. I realised this when I looked at the
source, and confirmed it by unioning the same shapes in the following ways:
1) As I received them from the county (which, as you will see has some
order to it)
2) Ordered by the spatial index (order by parcel_shape)
I realized that I was adding overhead by making the sort happen on each
iteration, so I removed the "order by" and did a "cluster on" to
actually reorder the rows on disk, and added these tests:
3) Ordered by a random column that I added, indexed, and clustered on
4) Ordered by the spatial index, but again, clustered on disk
(I also tried the "MemGeomUnion" which doesn't try to load all of the
shapes into memory, but the impact of that change was trivial -- and
negative)
I've put up the results of the test (just an Excel export) on
http://www.mapshine.com/PerfData/
I'd be interested in any feedback. The basic results can be summarized as:
1) For todays PostGIS 1.x, the best GeomUnion results will on a table
clustered on a GIST Spatial Index. that is:
pgsql> create index shape_idx on tablename GIST (shape); -- vaccum as needed
pgsql> cluster shape_idx on tablename; -- wierd syntax (it looks backwards)
If you cannot do that, then creating a subselect that sorts the items
before performing the shape:
pgsql> select GeomUnion(shape) from (select shape from tablename order
by shape) as stupidPGalias;
I also think that there is room for some pretty dramatic performance
improvements in the PostGIS GeomUnion code. Even assuming that the GEOS
Union operation for two shapes is optimal (a big assumption), there are
far too many Union operations being done that are unnecessary. For
example, we (at the PostGIS level) generally have BBoxes that can be
used to cull the number of Unions that are needed. I believe this could
make a HUGE difference in the random addition case, and a significant
improvement even in the ordered case.
If it doesn't step on anyone's toes, I'll plan on putting together a
test that shows what I'm thinking.
Hope this helps, and let me know what you think.
Bill
More information about the postgis-users
mailing list