[postgis-users] ST_Intersection very slow.

John Abraham jea at hbaspecto.com
Tue Feb 24 11:36:19 PST 2015


Thanks Mark.  I will indeed do the st_dump, but I don't think it will help much (very few multis).  I think the tiling will help a lot.  What I wonder, though, is how long the tiling itself will take?  After all, it's still an st_intersection operation to clip each geometry against each tile.

Is there a quad-tree-style tiling algorithm available as a function?  If I do 256 x 256 tiles, doing it all at once would be 65536 operations of st_intersection(complex_geom, square).  With a quad tree I'd only have 4 operations of st_intersection(complex_geom, square), then 16 of (a_little_less_complex_geom, square), then 64 of (even_less_complex_geom, square), and so on, for 8 levels of nesting.  The last level would still be 65536 operations, but the complex geoms should be a lot simpler by the time I get down to that level.  What do you think, is this worth trying?  Or is intersecting with a square fairly cheap regardless of how complex the other geometry is?
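
Here's roughly what I have in mind -- a sketch only, with made-up table and column names (complex_polys, id, geom): split each geometry's bounding box into quadrants, clip, and recurse 8 levels deep, pruning branches that come back empty.

WITH RECURSIVE quadtree AS (
    SELECT id, 0 AS depth,
           ST_Envelope(geom) AS tile,
           geom AS clipped
    FROM complex_polys
  UNION ALL
    SELECT q.id, q.depth + 1,
           sub.tile,
           ST_Intersection(q.clipped, sub.tile)
    FROM quadtree q
    CROSS JOIN LATERAL (
        -- the four quadrants of the current tile
        SELECT ST_MakeEnvelope(
                   ST_XMin(q.tile) + i * (ST_XMax(q.tile) - ST_XMin(q.tile)) / 2,
                   ST_YMin(q.tile) + j * (ST_YMax(q.tile) - ST_YMin(q.tile)) / 2,
                   ST_XMin(q.tile) + (i + 1) * (ST_XMax(q.tile) - ST_XMin(q.tile)) / 2,
                   ST_YMin(q.tile) + (j + 1) * (ST_YMax(q.tile) - ST_YMin(q.tile)) / 2,
                   ST_SRID(q.tile)) AS tile
        FROM generate_series(0, 1) AS i,
             generate_series(0, 1) AS j
    ) sub
    WHERE q.depth < 8
      AND NOT ST_IsEmpty(q.clipped)   -- prune empty branches early
)
SELECT id, depth, tile, clipped
FROM quadtree
WHERE depth = 8                       -- 8 levels -> up to 256 x 256 leaf tiles
  AND NOT ST_IsEmpty(clipped);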

(Your reply initially fell into my junk mail, so I missed it, and also haven't tried these things yet, partly because I missed your email.)

--
John Abraham
jea at hbaspecto.com
403-232-1060

On Feb 19, 2015, at 6:52 PM, Mark Wynter <mark at dimensionaledge.com> wrote:

> 
>> So I was running this query for 866000 s (10 days) before I decided to kill it:
> 
> 
>> One potential thing I've realized is that a few of the geometries in tazjan2 are multipolygons, not single polygons.  But it's only a few.  There are a few very large and complex polygons in lancover_polygons_snap, but again, most of the 998031 are simple small polygons; about half would be ST_CoveredBy one of the polygons in tazjan2, and most of the rest would only overlap two or three of the polygons in tazjan2.
> 
> 
> A common offender - the multipolygons.
> 
> You can get an immediate performance improvement by dumping (ST_Dump(wkb_geometry)).geom into a new table, with a new serial id but retaining the original poly_id, so you don’t lose the multipolygon provenance of each polygon.
> Complex operations on the multipolygon table, like ST_Intersection(), are extremely slow, because the query has to access and compute the operation on each of the geometry elements contained within the multi.
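> 
> Roughly, the dump step could look like this (a sketch only, shown for tazjan2 - same pattern for the landcover table; column names are assumed):
> 
> CREATE TABLE tazjan2_dump AS
> SELECT row_number() OVER () AS id,   -- the new serial-style id, one per dumped polygon
>        poly_id,                      -- keeps the multipolygon provenance
>        geom AS wkb_geometry
> FROM (
>     SELECT poly_id, (ST_Dump(wkb_geometry)).geom AS geom
>     FROM tazjan2
> ) AS dumped;
> 
> CREATE INDEX ON tazjan2_dump USING GIST (wkb_geometry);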
> 
> Another massive performance gain can be achieved by “tiling” your dumped polygon table - which breaks your geometries into smaller features that you can then “union” back together again after running ST_Intersection on your vector tiles.  Don’t try to tile your multipolygon table - you’ll still be waiting days.
> 
> Take the coastline of Australia - a single multipolygon:
> 
> SELECT Count(ogc_fid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11;
>  num_rows | num_geoms | num_points 
> ----------+-----------+------------
>         1 |      6718 |    1622598
> 
> If you need to query the intersection of a polygon with the land, and with the water, it’s near impossible.  Many, many days.
> 
> Hence we need to take a different approach.
> 
> The first thing you can do is to dump the multipolygon table.
> Then create a regular vector grid.
> Then tile both your polygon tables using the vector grid.
> Put indexes on the resultant tiled tables.
> Then run your ST_Intersects.
> The result is many more geometries, and points, but now we get the benefit of geometry indexes.  (A sketch of the grid, tiling and indexing steps in SQL follows below.)
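> 
> A minimal sketch of the grid, tiling and indexing steps (grid resolution, SRID handling and table names are assumptions to adapt to your own data):
> 
> -- A regular 256 x 256 vector grid over the extent of the dumped polygons.
> CREATE TABLE grid AS
> WITH ext AS (
>     SELECT ST_XMin(ST_Extent(wkb_geometry)) AS x0,
>            ST_YMin(ST_Extent(wkb_geometry)) AS y0,
>            ST_XMax(ST_Extent(wkb_geometry)) AS x1,
>            ST_YMax(ST_Extent(wkb_geometry)) AS y1,
>            MIN(ST_SRID(wkb_geometry))       AS srid
>     FROM tazjan2_dump
> )
> SELECT j * 256 + i + 1 AS tile_id,
>        ST_MakeEnvelope(x0 + i       * (x1 - x0) / 256, y0 + j       * (y1 - y0) / 256,
>                        x0 + (i + 1) * (x1 - x0) / 256, y0 + (j + 1) * (y1 - y0) / 256,
>                        srid) AS geom
> FROM ext, generate_series(0, 255) AS i, generate_series(0, 255) AS j;
> 
> -- Tile (clip) the dumped polygons against the grid; repeat for the other dumped table.
> CREATE TABLE tazjan2_tiled AS
> SELECT g.tile_id, p.poly_id,
>        ST_Intersection(p.wkb_geometry, g.geom) AS wkb_geometry
> FROM tazjan2_dump p
> JOIN grid g ON ST_Intersects(p.wkb_geometry, g.geom);
> 
> -- Indexes on the tiled output.
> CREATE INDEX ON tazjan2_tiled USING GIST (wkb_geometry);
> CREATE INDEX ON tazjan2_tiled (tile_id);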
> 
> As a result of tiling the Australian coastline, we now get:
> 
> SELECT Count(tid) as num_rows, SUM(ST_NumGeometries(wkb_geometry)) as num_geoms, SUM(ST_NPoints(wkb_geometry)) as num_points FROM abs_aus11_tiled;
>  num_rows | num_geoms | num_points 
> ----------+-----------+------------
>     17222 |     29020 |    6513770
> 
> Each geometry also has a tile_id, and the original poly_id.  The tile_id’s are really useful for subsetting your queries, and for use as an additional join condition.  At any time you can rapidly rebuild your original geometries by doing ST_Union() GROUP BY poly_id.
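> 
> For example (a sketch - landcover_tiled stands in for your second tiled table, built the same way as tazjan2_tiled above):
> 
> -- The intersection itself, with tile_id as an extra join key.
> CREATE TABLE lc_x_taz AS
> SELECT a.poly_id AS lc_poly_id, b.poly_id AS taz_poly_id, a.tile_id,
>        ST_Intersection(a.wkb_geometry, b.wkb_geometry) AS wkb_geometry
> FROM landcover_tiled a
> JOIN tazjan2_tiled b
>   ON a.tile_id = b.tile_id
>  AND ST_Intersects(a.wkb_geometry, b.wkb_geometry);
> 
> -- Rebuild untiled results per original polygon pair whenever you need them.
> SELECT lc_poly_id, taz_poly_id, ST_Union(wkb_geometry) AS wkb_geometry
> FROM lc_x_taz
> GROUP BY lc_poly_id, taz_poly_id;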
> 
> Using the approach of dumping and tiling, queries that once took days now take minutes at most.  And as Remi mentioned, you can parallelise the process.  The concept of vector tiling in PostGIS is analogous to MapReduce and assigning key-value pairs.  It’s about chunking your data, doing lots of fast operations on the smaller bits, and then consolidating your outputs.  You don’t necessarily have to write a SQL function to do the SQL parallelisation.  My preference is Bash and GNU Parallel, because you can write vanilla SQL and execute it via psql in parallel.  For me it’s now about the speed at which I can write the SQL.
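> 
> The tile_id also makes the chunking trivial: give each worker a slice of the tiles and run the same statement in several psql sessions at once (launched however you like, e.g. with GNU Parallel).  A sketch, using psql variables for the worker number, assuming lc_x_taz was created empty beforehand (e.g. CREATE TABLE ... AS ... WITH NO DATA):
> 
> -- chunk.sql: one worker's share of the intersection, run e.g. as
> --   psql -v workers=8 -v worker=0 -f chunk.sql   (and so on up to worker=7)
> INSERT INTO lc_x_taz
> SELECT a.poly_id, b.poly_id, a.tile_id,
>        ST_Intersection(a.wkb_geometry, b.wkb_geometry)
> FROM landcover_tiled a
> JOIN tazjan2_tiled b
>   ON a.tile_id = b.tile_id
>  AND ST_Intersects(a.wkb_geometry, b.wkb_geometry)
> WHERE a.tile_id % :workers = :worker;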
> 
> So we’re now starting to apply Big Data concepts within PostGIS….    Holy smoke… and you can apply the same concepts to a wide range of PostGIS operations - scale up, scale out….
> 
> I will write up a tutorial explaining the benefits of vector tiling in PostGIS, with examples and parallelisation code patterns.  If not today, hopefully over the next week.  
> 
> Mark
> 
> 
