[postgis-devel] ST_Union() performance problem (with possible funding)
maplabs at light42.com
Wed Oct 16 19:16:18 PDT 2013
Yes, true... but on a thorough read you might notice that the
gdal_retile.py experiment was largely ineffective.
If you click the link at the top to the *next post*,
"Variable Buffers in PostGIS",
you will find the one that really worked well. In fact, we used that
2nd post's approach in production for months, to great effect.
The trick on one machine was to split the work by some constant, and
then make a psycopg2 connection for each "bucket."
This worked very well..
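Schematically, something like this (the table name, column names &
DSN are placeholders, not our real schema):

# split the union work into N buckets, one psycopg2 connection each
from multiprocessing import Pool

import psycopg2

N_BUCKETS = 8  # roughly one bucket per core

def union_bucket(bucket):
    # each worker unions only its share of the cells
    conn = psycopg2.connect("dbname=gisdb")  # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT ST_Union(geom)
                FROM cell_footprints
                WHERE cell_id %% %s = %s
                """,
                (N_BUCKETS, bucket),
            )
            return cur.fetchone()[0]  # hex WKB of the partial union
    finally:
        conn.close()

if __name__ == "__main__":
    with Pool(N_BUCKETS) as pool:
        partials = pool.map(union_bucket, range(N_BUCKETS))
    # one final ST_Union pass still has to merge the partials,
    # but over N_BUCKETS geometries instead of thousands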
Since then I have experimented only a tiny bit with Spark from the
Berkeley AMPLab for a distributed workload on a Hadoop file system,
but that world has no GEOS (yet)
--
Brian M Hamlin
OSGeo California Chapter
blog.light42.com
On Wed, 16 Oct 2013 17:28:27 -0700, Bborie Park wrote:
Your best bet is to consider splitting the workload among
several postgresql connections.
darkblueb had a blog post about this...
http://blog.light42.com/wordpress/?p=23
On Wed, Oct 16, 2013 at 5:21 PM, Brent Wood wrote:
Hi,
Any advice appreciated!!
I'm undertaking a spatial analysis using PostGIS (what else would I
use!!!). The first part works well.
I take a large number (potentially millions) of lines defined by start
& end points & buffer them to create polygons. (I'm working in lat/long
EPSG:4326 but transforming to a custom equal area projection for the
buffering operation).
I generate a grid of 5x5km cells (polygons) covering the region of interest.
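For concreteness, the first two steps look roughly like this (the
table names, the custom SRID 990001, the 1km buffer width & the grid
extent are all placeholders, not my real values):

import psycopg2

SETUP_SQL = """
-- buffer each start/end line in the custom equal-area projection
CREATE TABLE line_polys AS
SELECT id,
       ST_Buffer(
           ST_Transform(ST_MakeLine(start_pt, end_pt), 990001),
           1000.0) AS geom  -- width in projection units (metres)
FROM tracks;

-- 5x5km grid of cells over the region of interest
CREATE TABLE grid AS
SELECT row_number() OVER () AS cell_id,
       ST_MakeEnvelope(x, y, x + 5000, y + 5000, 990001) AS geom
FROM generate_series(0, 995000, 5000) AS x,
     generate_series(0, 995000, 5000) AS y;
"""

with psycopg2.connect("dbname=gisdb") as conn:  # placeholder DSN
    conn.cursor().execute(SETUP_SQL)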
I clip the line-based polygons to the grid, so I can generate
statistics for each cell describing the lines that intersect with it:
various quantitative measures, such as ST_Union() of the clipped line
polygons to generate a footprint in each cell (to work out how much
is/is not covered), or the sum of ST_Area() of the clipped polygons,
grouped by cell, to calculate an aggregate cover, which can be several
times the actual cell area.
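In SQL terms, the per-cell statistics are roughly (same placeholder
table names as above):

import psycopg2

PER_CELL_SQL = """
SELECT g.cell_id,
       -- footprint: union of the clipped polygons, overlaps collapsed
       ST_Area(ST_Union(ST_Intersection(p.geom, g.geom))) AS covered_area,
       -- aggregate cover: overlaps counted every time, so this can be
       -- several times the 25 km^2 cell area
       SUM(ST_Area(ST_Intersection(p.geom, g.geom))) AS aggregate_area
FROM grid g
JOIN line_polys p ON ST_Intersects(p.geom, g.geom)
GROUP BY g.cell_id, g.geom;
"""

with psycopg2.connect("dbname=gisdb") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(PER_CELL_SQL)
        per_cell_stats = cur.fetchall()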
So far so good: it works well, the code is clear & transparent, & it
provides a good result - at least as good as any commercial software
can do. My test data subset is processed from scratch in about 30
minutes.
Now I want to ST_Union() all the cell based polygons into an overall
single multipolygon representing the footprint. The code is simple. The
performance, even with my subset, is a problem.
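The whole query is essentially just this (cell_footprints being a
placeholder table holding each cell's unioned footprint):

import psycopg2

# one ST_Union() aggregate over every per-cell footprint - simple to
# write, but this is the step that runs for weeks
OVERALL_SQL = """
SELECT ST_Union(geom) AS total_footprint
FROM cell_footprints;
"""

with psycopg2.connect("dbname=gisdb") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(OVERALL_SQL)
        total_footprint = cur.fetchone()[0]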
I have thousands of cell-based footprint multipolygons, each
potentially with thousands of vertices, to be ST_Union()ed. Runtime is
weeks for an iteration. If I need separate total footprints for 20
different species annually for 5 years, that is 100 iterations. Memory
& I/O use is minimal - it is totally CPU-bound.
I am looking at simplifying the polygons before they are unioned, to
reduce the number of vertices (& hence the processing) involved, but
to achieve any significant benefit I'm having to change the shape of
the polygons to ST_Union() too much.
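The simplification experiment, for reference (the 50m tolerance is a
placeholder; ST_SimplifyPreserveTopology() at least keeps the polygons
valid, but any tolerance big enough to cut the runtime also visibly
distorts the shapes):

import psycopg2

SIMPLIFIED_SQL = """
SELECT ST_Union(ST_SimplifyPreserveTopology(geom, 50.0))
FROM cell_footprints;  -- tolerance in projection units (metres)
"""

with psycopg2.connect("dbname=gisdb") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(SIMPLIFIED_SQL)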
Does anyone have any suggestions as to how this could be made
significantly faster?
If I had $$ to throw at developers to work on the codebase (presumably
GEOS?) could performance be significantly improved?
Thanks,
Brent Wood
_______________________________________________
postgis-devel mailing list
postgis-devel at lists.osgeo.org
http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel