[postgis-devel] ST_Union() performance problem (with possible funding)

Rémi Cura remi.cura at gmail.com
Thu Oct 17 00:52:54 PDT 2013


Hey,
I'm not sure I fully understand your problem; for instance, why exactly do
you need a unioned footprint?
Could you not stick to functions where f(footprint) = sum(f(cells))? This
works for area, cover analysis, etc., meaning you could aggregate your
results cell by cell.
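As a hedged sketch of that per-cell aggregation (the table and column names here are made up, not from this thread):

```sql
-- Additive statistics never need the global union: aggregate inside each cell.
-- "cells" and "clipped" are hypothetical tables; adjust to your schema.
SELECT c.cell_id,
       SUM(ST_Area(p.geom))      AS aggregate_cover,  -- additive across cells
       ST_Area(ST_Union(p.geom)) AS footprint_area    -- union stays per-cell
FROM   cells   c
JOIN   clipped p USING (cell_id)
GROUP  BY c.cell_id;
```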

Another thing: what about your input data's precision? I'm guessing it is
low precision, so you may be able to cut corners during processing (for
instance, using a buffer with quad_segs=2 to lower the number of vertices
in the generated buffers, and hence in the generated footprint).
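For example (the distance and table name are placeholders; PostGIS defaults to quad_segs=8, i.e. eight segments per quarter circle):

```sql
-- quad_segs=2 approximates each quarter circle with 2 segments instead of 8,
-- roughly quartering the arc vertices of every generated buffer.
SELECT ST_Buffer(geom, 50.0, 'quad_segs=2') AS buf FROM lines;
```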


If not, and for *very* big data, you may need to change your way of working.
The ideal candidate (lots of memory, less CPU) is image processing!
Fortunately you can switch to the image world, where operations are *way
way* faster and easily parallelized.
In the image world you don't have to care about topology and exact
computation; you can just compute per pixel.
(I won't even talk about GPU processing, but it may be usable depending on
your operations.)

(This assumes you don't require too much precision and can cope with
aliasing effects.)
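A toy sketch of this pixel-world idea (the shapes and resolution are invented for illustration): once footprints are rasterized, ST_Union() degenerates into a plain set union of burned pixels, with no topology involved.

```python
def rasterize_disc(cx, cy, r):
    """Stand-in for rasterizing a buffered line: the set of integer pixels
    inside a disc of radius r centred on (cx, cy)."""
    return {(x, y)
            for x in range(cx - r, cx + r + 1)
            for y in range(cy - r, cy + r + 1)
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r}

def footprint_area(cell_rasters, pixel_area=1.0):
    """Union per-cell rasters; covered area is pixel count times pixel area."""
    covered = set()
    for pixels in cell_rasters:
        covered |= pixels          # set union replaces ST_Union()
    return len(covered) * pixel_area

# Two overlapping "buffers" in neighbouring cells:
cells = [rasterize_disc(0, 0, 3), rasterize_disc(2, 0, 3)]
area = footprint_area(cells)
```

Overlap is handled for free: a pixel burned by several polygons is counted once.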

Your data process would look like this: using gdal <http://www.gdal.org/>
directly (or several connections to PostGIS), you rasterize your polygons
cell by cell, in parallel. You should have one image file per cell as
output.
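A hedged command-line sketch of that step (the layer name, connection string, resolution, and modulo bucketing are all assumptions):

```sh
# Rasterize 8 subsets of a hypothetical "buffers" layer in parallel,
# burning value 1 into a Byte raster at 25 m resolution.
for i in 0 1 2 3 4 5 6 7; do
  gdal_rasterize -burn 1 -ot Byte -tr 25 25 \
      -where "cell_id % 8 = $i" \
      -l buffers PG:"dbname=mydb" cells_$i.tif &
done
wait
```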

These images can be assembled into one big image, depending on memory
limitations.
If it doesn't fit in memory, you can use very fast, very efficient
out-of-core parallel processing software like Orfeo ToolBox (included in
the QGIS Sextante toolbox), depending on what you want to do.

What is cool with images is that you can do things you can't with pure
geometry. I'm guessing you are working on data that are observations of
animals. Instead of using a plain surface like "area where animals are
likely to live", you could use fuzzy models
<http://gitta.info/Suitabilityi/en/html/fuzzy_learningObject2.html>
(within the area, the probability of an animal living there would decrease
with distance from the observation), and then generate more "precise" and
interesting results.
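A minimal sketch of such a distance-decay membership function (the linear form and the 100-unit radius are assumptions; any monotone decreasing function would do):

```python
def membership(distance, decay_radius=100.0):
    """Fuzzy membership: certainty 1.0 at the observation point, falling
    linearly to 0.0 at decay_radius (an arbitrary example radius)."""
    return max(0.0, 1.0 - distance / decay_radius)
```

Evaluated per pixel over the rasterized grid, this yields a suitability surface instead of a hard in/out footprint.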


I made a lot of guesses, so my answer may be completely useless.
I can clarify anything that is unclear.

Cheers,
Rémi-C



2013/10/17 Brent Wood <pcreso at pcreso.com>

> These tend to assume each operation is constrained to a single cell, hence
> parallelizable, hence undertaking an operation on multiple cells
> concurrently.
>
> While I'm dealing with many cells - they are all being merged into a
> single multipolygon feature. I can't load two cells into the same
> multipolygon at the same time - two writes to the same record - so can't
> run concurrently.
>
> What may be possible is to perhaps run multiple processes on subsets to
> create intermediate (larger) merged polygons which can then be merged
> themselves to create the final single feature.
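That subset-then-merge pattern can be sketched in Python with a commutative stand-in for ST_Union() (set union); in production each worker would instead run ST_Union() over its subset in its own database connection. The names and the thread-based pool are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def merge(chunk):
    """Union one subset into a single intermediate result."""
    out = set()
    for s in chunk:
        out |= s          # stand-in for ST_Union() of two geometries
    return out

def parallel_union(parts, workers=4):
    # Round 1: each worker unions one round-robin subset...
    chunks = [parts[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        partials = list(ex.map(merge, chunks))
    # ...then one cheap final merge of the few intermediates.
    return merge(partials)
```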
>
> This would probably allow better use of resources on a multi-core
> system... at present I'm using 1.7% of memory on a 100% CPU process, so
> I'll look into this approach - 8 cores running concurrently giving
> close to 8x faster is useful.
>
> Brent
>
>
>   ------------------------------
>  *From:* "maplabs at light42.com" <maplabs at light42.com>
> *To:* Brent Wood <pcreso at pcreso.com>; PostGIS Development Discussion <
> postgis-devel at lists.osgeo.org>; Bborie Park <dustymugs at gmail.com>
> *Cc:* PostGIS Users Discussion <postgis-users at lists.osgeo.org>
> *Sent:* Thursday, October 17, 2013 3:16 PM
> *Subject:* Re: [postgis-devel] ST_Union() performance problem (with
> possible funding)
>
> yes, true.. but with a thorough read you might notice that the
> gdal_retile.py experiment was largely ineffective,
> but if you click on the link at the top to the *next post*
>       Variable Buffers in PostGIS
> you will find the one that really worked well.. in fact, we used that 2nd
> post in production for months, to great effect.
> The trick on one machine was to split the work by some constant, and then
> make psycopg2 connections for each "bucket."
>
> This worked very well..
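The bucketing itself is trivial to sketch (a round-robin split; the helper name is made up, and each returned list would drive one psycopg2 connection's WHERE clause):

```python
def make_buckets(ids, n_buckets):
    """Split row ids round-robin into n_buckets lists, one per worker
    connection, so every worker gets a similar amount of work."""
    return [ids[i::n_buckets] for i in range(n_buckets)]
```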
>
> Since then I have experimented only a tiny bit with Spark from the
> Berkeley AMP Lab for distributed workloads on a Hadoop file system, but
> that world has no GEOS (yet).
>
> --
> Brian M Hamlin
> OSGeo California Chapter
> blog.light42.com
>
>
>
> On Wed, 16 Oct 2013 17:28:27 -0700, Bborie Park <dustymugs at gmail.com>
> wrote:
>
> Your best bet is to consider splitting the workload among several
> postgresql connections.
>
> darkblueb had a blog post about this...
>
> http://blog.light42.com/wordpress/?p=23
>
>
>
> On Wed, Oct 16, 2013 at 5:21 PM, Brent Wood <pcreso at pcreso.com> wrote:
>
> Hi,
>
> Any advice appreciated!!
>
> I'm undertaking a spatial analysis using Postgis (what else would I
> use!!!). The first part works well.
>
> I take a large number (potentially millions) of lines defined by start &
> end points & buffer them to create polygons. (I'm working in lat/long
> EPSG:4326 but transforming to a custom equal area projection for the
> buffering operation).
>
> I generate a grid of 5x5km cells (polygons) covering the region of
> interest.
>
> I clip the line-based polygons to the grid, so I can generate statistics
> for each cell describing the lines that intersect with it - various
> quantitative measures, such as ST_Union()ing the clipped line polygons to
> generate a footprint in each cell to work out how much is/is not covered,
> or summing the ST_Area() of the clipped polygons grouped by cell to
> calculate an aggregate cover, which can be several times the actual cell
> area.
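A hedged sketch of that clipping step (the table and column names are guesses, not from this thread):

```sql
-- Cut each buffered line to the 5x5 km cells it touches; the ST_Intersects
-- join keeps ST_Intersection from being evaluated on disjoint pairs.
SELECT g.cell_id,
       ST_Intersection(b.geom, g.geom) AS geom
FROM   grid    g
JOIN   buffers b ON ST_Intersects(b.geom, g.geom);
```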
>
>
> So far so good, it works well, the code is clear & transparent & provides
> a good result. At least as good as any commercial software can do. My test
> data subset is processed from scratch in about 30 minutes.
>
> Now I want to ST_Union() all the cell-based polygons into a single
> multipolygon representing the overall footprint. The code is simple. The
> performance, even with my subset, is a problem.
>
> I have thousands of cell based footprint multipolygons, each potentially
> with thousands of vertices to be ST_Union()ed. Runtime is weeks for an
> iteration. If I need separate total footprints for 20 different species
> annually for 5 years, that is 100 iterations. Memory & I/O use is minimal -
> it is totally cpu bound.
>
> I am looking at simplifying the polygons to be unioned, to reduce the
> number of vertices (& hence processing) involved, but to achieve any
> significant benefit I'm having to change the shape of the polygons too
> much.
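One way to shave vertices while keeping rings from collapsing or crossing is topology-preserving simplification (the table name and the tolerance are placeholders):

```sql
-- ST_SimplifyPreserveTopology drops vertices within the tolerance but never
-- produces invalid rings; 100 (map units) is an arbitrary example tolerance.
SELECT ST_Union(ST_SimplifyPreserveTopology(geom, 100)) AS footprint
FROM   cell_footprints;
```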
>
>
>
> Does anyone have any suggestions as to how this could be made
> significantly faster?
> If I had $$ to throw at developers to work on the codebase (presumably
> GEOS?) could performance be significantly improved?
>
>
> Thanks,
>
>    Brent Wood
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-devel
>
>
>
>


More information about the postgis-devel mailing list