[postgis-devel] GeomUnion Performance Info

Tue Jun 21 19:59:43 PDT 2005

strk at refractions.net wrote:

>On Sun, Jun 19, 2005 at 04:46:48PM -0400, bill wrote:
>...
>  
>
>>First, the reason I had seen a performance increase was that the 
>>GeomUnion implementation is dramatically influenced by the _order_ the 
>>shapes are fed to it.  The query I had used had a where clause that used 
>>the index, so the shapes had returned "ordered by" the spatial index -- 
>>that is, clustered spatially.  I realised this when I looked at the 
>>source, and confirmed it by unioning the same shapes in the following ways:
>>
>>1) As I received them from the county (which, as you will see has some 
>>order to it)
>>2) Ordered by the spatial index (order by parcel_shape)
>>    
>>
>
>Watch out: ORDER BY will use btree index, not the GiST one.
>
>  
>
Actually, I do not have a btree index on parcel_shape.  Does having a 
GiST index implicitely create one?  I can't imagine that a manual sort 
of the stored strings (without an index) would speed things up.

>>pgsql> select GeomUnion(shape) from (select shape from tablename order 
>>by shape) as stupidPGalias;
>>    
>>
>
>Again will use btree index, not GiST.
>It would be interesting to understand WHAT makes this difference exactly.
>
>  
>
When I manually do a 'cluster on', it does order by the GiST, correct?

>>I also think that there is room for some pretty dramatic performance 
>>improvements in the PostGIS GeomUnion code.  Even assuming that the GEOS 
>>Union operation for two shapes is optimal (a big assumption), there are 
>>far too many Union operations being done that are unnecessary.  For 
>>example, we (at the PostGIS level) generally have BBoxes that can be 
>>used to cull the number of Unions that are needed.  I believe this could 
>>make a HUGE difference in the random addition case, and a significant 
>>improvement even in the ordered case.
>>    
>>
>
>Maybe the ordered case would rather ensure there would be always work
>for the unioner as consecutively ordered geometries would probably
>intersect.
>
>  
>
That is what I think: by growing the union in clustered stages, you have 
fewer intermediate "islands", so you have fewere union operations.

You also eliminate internal points early, meaning you don't have to test 
their segments later.

>>If it doesn't step on anyone's toes, I'll plan on putting together a 
>>test that shows what I'm thinking.
>>
>>Hope this helps, and let me know what you think.
>>    
>>
>
>Sure it will help, so we have 2 lines:
>
>	- understanding what kind of order helps union 
>
>  
>
One thought is that using ANY index eliminates the need to load the rest 
of the columns (or is only the bbox stored in the index?).  I'm not sure 
about that on TOAST tables, but it seems to make sense on non-spatial 
data, perhaps it translates.

>	- adding short-cuts in postgis union aggregate
>	  bypassing GEOS in the disjoint-bboxes case
>
>  
>
Just on a lark, I've colorized two maps based on the order they were 
being sorted (black are drawn first, white last): the first is after a 
cluster on parcel_shape_idx (at GiST index), and the second is on the 
random order.  You can see that the GiST clustered order is clearly 
geographically clustered.  You can access it at my map testing page at 
http://www.mapshine.com/mapsurfer/  You can use the username/password 
mapdev/GDAL if you'd like to play around with it.  Just turn on the 
layers "GIST Cluster" and "Random Cluster" at the left.

BTW: I tried to create a btree index on my parcels, but got this:

gis=# create index parcel_shape_btree ON parcels (parcel_shape);
ERROR:  index row requires 8452 bytes, maximum size is 8191

go figure.

Do I need to update GEOS or Postgis (or both) from CVS to get the 
bounding box shortcuts?

Thanks
Bill