[postgis-users] ST_Intersection very slow.

Thu Feb 19 17:52:09 PST 2015

> So I've was running this query for 866000 s (10 days) before I decided to kill it:

> One potential thing I've realized is that a few of the geometries in tazjan2 are multipolygons, not single polygons.  But it's only a few.  There are a few very large and complex polygons in lancover_polygons_snap, but again, most of the 998031 are simple small polygons, about half would be ST_CoveredBy the polygons in tazjan2 and most of the rest would only overlap two or three of the polygons in tazjan2.

A common offender - the multipolygons.

You can get an immediate performance improvement by ST_Dump(wkb_geometry).geom into a new table, with a new serial id, but retaining the original poly_id so you can don’t lose the multipolygon provenance of each polygon.
Complex operations on the multipolygon table, like ST_Intersection(), are extremely slow - Because the query has to access and compute the operation on each of the geometry elements contained within the multi.

Another massive performance gain can be achieved by “Tiling” your dumped polygon table - which has the effect of breaking your geometries into smaller features, which you can then “union” back together again, after running ST_Intersection on your vector tiles.  Don’t try to tile your multi polygon table - you’ll still be waiting days.

Take the coastline of Australia - a single multi polygon

SELECT Count(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11;
 num_rows | num_geoms | num_points 
-------------------+--------------
        1 |      6718 |    1622598

If you need to query the intersection of a polygon with land, and the water, its near impossible.  Many, many days.

Hence we need to take a different approach.

The first thing you can do is to dump the multi polygon table.
Then create a regular vector grid.
Then tile the both your Polygon tables using the vector grid.
Put indexes on the resultant tiled tables.
Then run your ST_Intersects.
The result is many more geometries, and points, but now we get the benefit of geometry indexes.

As a result of tiling Australia coastline, we now get

SELECT Count(tid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_tiled;
 num_rows | num_geoms | num_points 
-------------------+--------------
    17222 |     29020 |    6513770

Each geometry also has a tile_id, and the original poly_id.  The tile_id’s are really useful for subsetting your queries, and for using tile-id’s as an additional join condition.  At any time you can rapidly rebuild your original geometries doing ST_Union() GROUP BY poly_id.

Using the approach of dumping and tiling, queries that once took days now takes minutes at most.  And as Remi mentioned, you can parallelise the process.  The concept of vector tiling in PostGIS is analogous to Map Reduce and assigning Key Value Pairs.  Its about chunking your data, doing lots of fast operations on the smaller bits, and then consolidating your outputs.  You don’t necessarily have to write a SQL function to do the SQL parallelisation.   My preference is Bash and GNU Parallel because you can write vanilla SQL and execute via psql in parallel.   For me its now about the speed I can write the SQL.

So we’re now starting to apply Big Data concepts within PostGIS….    Holy smoke… and you can apply the same concepts to a wide range of PostGIS operations - scale up, scale out….

I will write up a tutorial explaining the benefits of vector tiling in PostGIS, with examples and parallelisation code patterns.  If not today, hopefully over the next week.  

Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20150220/749ae60e/attachment.html>