[postgis-users] Parallelisation provides powerful postgis performance perks (script + ppt slides) [x-posted: pgsql-performance]

Graeme B. Bell graeme.bell at nibio.no
Thu Jul 23 09:10:14 PDT 2015


> Hello,
> 
> I am a researcher at the University of Minnesota, US, working on a
> project that uses PostGIS as its platform. I have been researching a number
> of PostgreSQL platforms that claim to support parallelizing PostGIS or
> geographic analysis, and I am interested to learn more about the wrapper you
> have written. Have you looked into other platforms such as pg_shard
> (CitusDB) or Postgres-XL? We have not found them effective for actually
> distributing a spatial query. However, I will test your code out in our use
> case.
> 
> Looking briefly at the text you have written in the "Quick Example", it
> seems that you are distributing your query by an ID field. I am wondering
> how your method would apply to raster datasets. Distributing geographic
> data by an ID can get you into problems because of the dependencies of
> certain analytical functions.
> 
> This sounds great, hope to hear back from you soon.

Hi there,

I don't have time to give much advice just now; all I can say is 'read the slides and see what you think' (http://graemebell.net/foss4gcomo.pdf), particularly the par_psql slides.
You probably want to check out par_psql (and the slides I mentioned) rather than the fast_map_intersection code, which is just an example of metaprogramming to get parallelism.
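
To give a flavour of the syntax: par_psql runs ordinary psql scripts, and you mark the statements that are safe to run concurrently. A rough sketch (the table names are placeholders, and do check the par_psql README for the exact marker and invocation details):

    -- Statements carrying the parallel marker run concurrently on
    -- separate connections:
    UPDATE fields    SET geom = ST_MakeValid(geom);   --&
    UPDATE buildings SET geom = ST_MakeValid(geom);   --&
    UPDATE roads     SET geom = ST_MakeValid(geom);   --&
    -- An unmarked statement acts as a synchronisation point and waits
    -- for everything above to finish:
    SELECT count(*) FROM fields;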

In my work here, I am not looking to get parallelism in a clever way on the server side (because generally speaking, I know much better than the server where the best parallelism opportunities are, how they're structured, and where things can't be parallelised). 

Also, the problem we have here isn't about making the DB scale out horizontally. The problem is simply getting as much value as possible out of the 16-core / 32-thread (HT) server we have here - it has a few SSDs in it and 128GB RAM. That's already more than enough to take problem run time down by more than an order of magnitude. If you have problems where you need improvements of 2-3 orders of magnitude, sorry, I can't help much with these tools. I just want to take the pretty much endless opportunities for super-easy parallelism you get with huge geometry/raster data sets and make my everyday work run a lot faster than it would on my desktop.

For maps with dependencies between data items - my colleague Lars Opsahl is doing other work with parallel algorithms for topology maps, where he isolates the data into two parts: things that don't overlap and things that do. Anything that doesn't overlap is embarrassingly parallel, and we can scale it up to about 20x quite easily with parallel tiles; anything that DOES overlap can still be dealt with quite quickly (since it's usually <10% of the map).
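
Not Lars's actual code, but the split is easy to sketch in SQL (hypothetical table 'polys' with columns gid and geom; ST_Intersects leans on the spatial index):

    -- Rows whose geometry touches some other row's geometry.
    CREATE TABLE overlapping AS
      SELECT DISTINCT a.gid
      FROM polys a
      JOIN polys b
        ON a.gid <> b.gid
       AND ST_Intersects(a.geom, b.geom);

    -- Everything else has no neighbours to worry about, so it is
    -- embarrassingly parallel and can be processed tile by tile.
    CREATE TABLE independent AS
      SELECT p.gid
      FROM polys p
      WHERE NOT EXISTS (SELECT 1 FROM overlapping o WHERE o.gid = p.gid);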

"Distributing geographic data by an ID can get you into problems because of the dependency for certain analytical functions."

I'm unsure what you mean by this; I have never encountered any problems whatsoever from using IDs this way. In our work with large geometry sets we get great scaling from it. We're very dependent on spatial indices as a magic box that picks out only the bits we need (e.g. intersections). In fact, if we're processing all the rows, then the ID/modulo method is great because it means incoming IO pages are split between the available processors in a pretty balanced way.
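
Concretely, the ID/modulo method just means carving one big query into N congruence classes and handing each class to its own psql session (hypothetical table 'mytable' with a serial key 'gid', split 4 ways here):

    -- psql session 1:
    SELECT ST_Buffer(geom, 10) FROM mytable WHERE gid % 4 = 0;
    -- psql session 2:
    SELECT ST_Buffer(geom, 10) FROM mytable WHERE gid % 4 = 1;
    -- psql session 3:
    SELECT ST_Buffer(geom, 10) FROM mytable WHERE gid % 4 = 2;
    -- psql session 4:
    SELECT ST_Buffer(geom, 10) FROM mytable WHERE gid % 4 = 3;

With a serial key the four classes come out nearly equal in size, which is why the IO pages end up balanced across the processors.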

For rasters: PostGIS raster queries should parallelise much like any other query with par_psql, unless there are hidden internal locks I don't know about, or underlying IO contention. Otherwise, for GDAL stuff, take a look at rbuild (http://github.com/gbb/rbuild). I wrote it as a kind of framework for tiling/parallelism when we were processing vector maps via an intermediate stage of rasters.

The main trick I've found with parallel raster processing is to use a sensible size & number of tiles, good lossless compression, and the tiniest datatype possible, because when parallelising raster tiles it's the IO that will kill your performance, not the compute cost. Also, parallelisation has its overheads, so going beyond e.g. 400 tiles per map was counterproductive. I gave a presentation on this here: http://graemebell.net/foss4g2013.pdf. Hope that helps with your raster work.
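
As a sketch of the in-database side (table names are placeholders): PostGIS's standard ST_Tile re-chunks a raster column, and ST_Reclass lets you squeeze a band down to a smaller pixel type where the values allow it.

    -- Re-chunk into 256x256 tiles.
    CREATE TABLE retiled AS
      SELECT ST_Tile(rast, 256, 256) AS rast
      FROM bigrast;

    -- Band 1 squeezed to 8-bit unsigned (only valid if the values fit);
    -- smaller pixels mean less IO, which is the real bottleneck.
    UPDATE retiled
    SET rast = ST_Reclass(rast, 1, '0-255:0-255', '8BUI', 0);

When loading from disk instead, raster2pgsql's -t WIDTHxHEIGHT flag does the tiling at import time.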

The tools I'm making are simply handy tools for people who don't know parallel programming but want their GIS code to run 16x faster with very little work on a single powerful server, or 4x faster on their local machine. They're not going to scale you out horizontally over a million AWS instances, or get you much more than one order of magnitude of improvement in run time. But for most people, a 4-32x improvement is still a huge improvement. For very large projects, it won't be enough.

Graeme. 
