[postgis-devel] More work on Cascade Union

Wed Aug 20 04:48:13 PDT 2008

Martin,

Evidently I didn't look at enough of the code in the JTS code base.  I
was focusing on CascadedUnion class rather than the UnaryUnionOp class.

On closer inspection I see the CascadedUnion class is only called for
Polygons and not Points and Lines and for points and lines it does
presumably what it has always done - some sort of Overlay operation.

So in theory if unite_garray is doing what JTS is doing, performance on
points and lines shouldn't be any crummier than my cascaded union
approach.

But look at these stats, - doing the same in JTS is faster than both
standard Postgis union and cascade union thingy and too fast for me to
tell, but there is a point at which I can successfully crash JTS with
this exercise :) - just feed it the whole tmp_points and watch it
quickly crash in about 11 seconds.  I didn't wait for Postgis cascade
union to finish though so that wasn't a fair test.

One thing I will say is the ST_Union will internally order the
geometries where as the cascade approach I'm using will only internally
order if given a preordered set.  I presume having an internally sorted
geometry is a desirable thing.

I was going to try to add some sort of bubblesort algorithm as the items
are coming into the array in the aggregation process.  Trying to do the
sort after the fact with a generate_series hack seems to sometimes add
overhead although a lot of the time with the revised algorithm using
envelop check the overhead added by the pre sorting is significantly
offset by the gain in iterating over an ordered set so most of the times
I am seeing speed gains. So short-end of the stick - I can definitely
get better timings than the last algorithm set I submitted.

So I'll play around with array bubble sort vs. generate series ordered
by hack to see which performs better.

--Here are stats with just simple point geometries

CREATE TABLE tmp_points(the_geom geometry);
INSERT INTO tmp_points
SELECT ST_MakePoint(x,y) As the_geom FROM generate_series(1,1000) x,
generate_series(1,1000) y
order by random();

--734 ms (without order by hack)
 281 ms (with generate series ordered by hack in
st_unitecascade_garray_sort it performs even better - by a factor of 3)

SELECT ST_CascadeUnion(the_geom)
FROM (SELECT * FROM tmp_points limit 1000) as foo

-- 2594 ms, (with order hack 1359 ms) - 
SELECT ST_CascadeUnion(the_geom)
FROM (SELECT * FROM tmp_points limit 2000) as foo

--9610 ms
SELECT ST_Union(the_geom)
FROM (SELECT * FROM tmp_points limit 1000) as foo

--41625 ms
SELECT ST_Union(the_geom)
FROM (SELECT * FROM tmp_points limit 2000) as foo

Any rate I'm still trying to figure out what the Geos unite_garray is
doing. As far as I can tell its doing a combination of iterated unions
and collect buffers.  But that's just doing a very cursory glance at the
3.0, 3.1 code base.

Thanks,
Regina

-----Original Message-----
From: postgis-devel-bounces at postgis.refractions.net
[mailto:postgis-devel-bounces at postgis.refractions.net] On Behalf Of
Martin Davis
Sent: Tuesday, August 19, 2008 8:13 PM
To: PostGIS Development Discussion
Subject: Re: [postgis-devel] More work on Cascade Union

It looks like OJ *is* using the new JTS UnaryUnionOp (aka 
CascadedUnion).  This was a change made on July 26 by Michael Michaud.

I'll save looking at your pgsql code for a slow afternoon sometime - I'm

sure it will be educational.  In the meantime, good work on 
porting/modifying/enhancing the CascadedUnion concept for PostGIS!

Martin

Obe, Regina wrote:
>
> Martin,
>
> Do you know if OpenJump finally fully integrated the cascade union?  I

> downloaded the openjump nightly build.  I did timings and it was able 
> to complete usstatebounds in about 12 secs (whereas before remember it

> gave a memory overflow) and the original samply_poly I think it did in

> about 22 seconds.  I wasn't sure if this is because I am running on a 
> Vista pc (and probably more beefed up one than I ran last time).  I 
> also don't see the count down with the union thing anymore. 
>
>
> Attached are my latest efforts. This includes the envelop check.  The 
> bool_and aggregate was disappointing so I fell on good old EXISTS (in 
> this scenario especially when you can't really rely on indexes, exists

> performs better than a LEFT JOIN). 
>
>  Unfortunately in PostgreSQL since the geometries get swept into an 
> array, I can't really count on indexes to help out so I limited the 
> max check to be 1000 records and also do iterated checks.  Which I 
> think is fairly different from the JTS approach.
>
>
>  This time I changed the core of the cascade union function into a 
> plpgsql function since I figured people would have an easier time 
> reading it.  Not that I couldn't do it all with sql functions, but sql

> functions had the disadvantage of not allowing declaration of 
> variables or using pretty names for input variables.
>
> I have the st99_d00 set down to below 8 secs - but still sucks 
> compared to JTS cascade union (3-4 secs), but still much better than 
> the 48-55 second timings I get with the postgis union function.
>
> I fear I can't do much better than this without delving into GEOS 
> land, which I am not quite ready for.
>
> One thing I did discover is that the dump collect method is a very 
> important thing to have and would make the whole ST_Collect thing much

> more useful for people if it actually existed since I think that is 
> what people actually expect from that function.  I think part of the 
> frustration that people have is that st_collect returns geometry 
> collections more often than not.  I created a helper function that 
> does it since I had to use that technique so many times in the code, 
> but I suspect this would be done much more efficiently in the 
> lwpostgis code since it would then require only one trip instead of 
> many.  I haven't calculated to see how much of the time is wasted in 
> there.  I know that most of the lost time is still in the 
> ST_Unite_Garray function and the non-intersect check did not help 
> quite as much as I had hoped it would.
>
>
>
>
> --7855 ms, 7874 ms
> SELECT ST_CascadeUnion(the_geom) FROM st99_d00;
>
> This syntax which escapes the aggregate penalty gives me these timings
> --7195 ms, 6804 ms, 6996 ms
> SELECT st_unitecascade_garray_sort(ARRAY(SELECT the_geom FROM
st99_d00));
>
>
> --114282 ms -- SELECT 114282/1000.0/60 -- 1.90 minutes (I didn't 
> bother running this with ST_Union since I would be waiting till the 
> cows came home)
> SELECT st_unitecascade_garray_sort(ARRAY(SELECT the_geom FROM 
> usstatebounds));
>
> --152438 ms -- SELECT 152438/1000.0/60 -- 2.55 minutes
> SELECT ST_CascadeUnion(the_geom) FROM usstatebounds;
>
> My cursory glance of the stats looks like I am getting the right 
> results although the internal sorting may not be optimal.
>
> Any feedback would be appreciated.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
------------------------------------------------------------------------
>
> *The substance of this message, including any attachments, may be 
> confidential, legally privileged and/or exempt from disclosure 
> pursuant to Massachusetts law. It is intended solely for the 
> addressee. If you received this in error, please contact the sender 
> and delete the material from any computer. *
>
>
------------------------------------------------------------------------
>
> * Help make the earth a greener place. If at all possible resist 
> printing this email and join us in saving paper. *
>
> * *
>
> * *
>
>
------------------------------------------------------------------------
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>   

-- 
Martin Davis
Senior Technical Architect
Refractions Research, Inc.
(250) 383-3022

_______________________________________________
postgis-devel mailing list
postgis-devel at postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-devel