[postgis-devel] GeomUnion Performance Info

Wed Jun 22 13:12:05 PDT 2005

On Wed, 22 Jun 2005 strk at refractions.net wrote:

> On Tue, Jun 21, 2005 at 10:59:43PM -0400, bill wrote:
> > strk at refractions.net wrote:
> 
> ...
> 
> > >Watch out: ORDER BY will use btree index, not the GiST one.
> > 
> > Actually, I do not have a btree index on parcel_shape.  Does having a 
> > GiST index implicitely create one? 
> 
> Index is not required, the ORDER BY will use operators defined by
> the btree operator class.

I see: so there are two "spatial" sets of operators: one is the r-tree 
(GiST), and the other is the "just test the 4 corners in order" one that 
you implemented to be compliant on 8.x.  I understand what you mean now.  
Interestingly, sorting by either seems to improve performance.  
Unfortunately, I cannot create a btree index to see what clustering by the 
btree algorithm does.
> 
> > >Again will use btree index, not GiST.
> > >It would be interesting to understand WHAT makes this difference exactly.
> > 
> > When I manually do a 'cluster on', it does order by the GiST, correct?
> 
> cluster on btree_index <-- order by btree
> cluster on gist_index <-- order by GiST
> 
> You might try both cases.

I'd love to, but 8192 seems to be a hard limit on the size of the btree 
index row and Postgresql seems to be storing the whole geometry (even 
though it only uses the 4 corners for comparison.  Is there some way 
around this?

> This is my guess as well, anyway the kind of intersection also
> makes the difference. I think geoms that intersect in  many places
> take more time.
> Imagine a geometry containing all the others (geom A):
> 	- output would be geom A
> 	- union of geom A with any geometry would not take much
> 	  (this is a guess, no intersections are found but only
> 	   full containment)
> 	- other geometries might have a lot of intersection which
> 	  would take more time to compute (again, a guess)
> 	- The sooner geom A is considered, the better the query goes.
> 
> You could make tests for this as well:
> 
> 	- define a geometry being
> 	  the envelope() of full table extent and assign it a
> 	  very high gid (geom A)
> 	- geomunion the table so that geom A comes last
> 	- geomunion the table so that geom A comes first

I thought about this (the fact that one large geometry would make the 
bbox short circuit less useful -- in fact, there are degenerate cases that 
would make it useless).  However, I think that on average it's still worth 
doing.

Also, your short-circuit test only checks the bbox of the entire target
geometry.. I think it would be worthwhile to check each of the geometries
in the collection until you hit one.  In other words, test each "island" 
until you actually hit one.  I think that would reduce the case where two 
small (but distant) geometries create one large bbox for the collection.

> ...
> > One thought is that using ANY index eliminates the need to load the rest 
> > of the columns (or is only the bbox stored in the index?).  I'm not sure 
> > about that on TOAST tables, but it seems to make sense on non-spatial 
> > data, perhaps it translates.
> 
> I think we should not mix access costs and union costs.
> We are analyzing union costs related to order of feed.

It's interesting that you say that :)

My initial approach to this was to assume that GEOS Union was already
optimal and see how we could use the extra information at the PostGIS
layer to make it faster.  Ordering was one, another was the fact that
often we had the bbox for the shapes pre-calculated.

Another was changing the access method to not load all the geometries into
memory at once (causing disk thrashing).  I saw that someone had tread
there before (MemGeomUnion), but that it seemed the overhead of pg<->geos 
was found to be too high.  There are other ways around that (pass back an 
opaque handle to the GEOS result as the sfunc result -- convert to pg in 
the ffunc).

Remember, all this started when I was sure that indexes helped, but was 
shown that they shouldn't... :)

> 
> > BTW: I tried to create a btree index on my parcels, but got this:
> > 
> > gis=# create index parcel_shape_btree ON parcels (parcel_shape);
> > ERROR:  index row requires 8452 bytes, maximum size is 8191
> > 
> > go figure.
> 
> Ach! I think we should visually see that btree order!
> Can you avoid clustering and just use ORDER BY ?
> 
> Actually... first thing would be evaluating difference
> between the two (GiST and btree) in terms of performance.

I have figured out a way to get this: it is running now.  I will post when 
it's done.

Bill