[GRASS-dev] Spatial clustering of vector objects?

Benjamin Ducke benducke at fastmail.fm
Thu May 4 11:18:12 PDT 2017


On 04/05/17 19:22, Markus Neteler wrote:
> Hi,
> 
> in order to parallelize some heavy computation I was wondering how to
> do spatial clustering of vector objects, i.e. building footprints
> (vector polygons).
> 
> I have to perform zonal statistics on thousands of buildings and would
> like to split them up into "tiles" and then run the computation in
> parallel for each tile.
> 
> The examples in v.cluster look somehow promising
> https://grass.osgeo.org/grass72/manuals/v.cluster.html
> 
> but in the best case each "tile" would contain a similar amount of
> buildings in order to balance the computation across the CPUs.

Hi,

I think that you would need to partition
space into overlapping tiles, with the
amount of overlap depending on the maximum
distance parameter of the clustering algorithm.
Otherwise you would get a serious edge effect
in each tile.

Prior to spatial clustering, you could use a cluster
algorithm that aims to produce clusters with a
(nearly) equal number of points for "tiling":

https://stats.stackexchange.com/questions/8744/clustering-procedure-where-each-cluster-has-an-equal-number-of-points
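
One simple way to get groups of (nearly) equal size is recursive
median splitting, as used when building a k-d tree. This is not
necessarily what the linked thread recommends; it is only a minimal
sketch, assuming each building has already been reduced to a single
(x, y) point. The function name balanced_tiles is made up for this
example.

```python
def balanced_tiles(points, n_tiles):
    """Split (x, y) points into n_tiles groups of nearly equal size.

    Hypothetical helper: recursive median cuts along the longer axis,
    not an equal-size k-means variant.
    """
    if n_tiles <= 1 or len(points) <= 1:
        return [list(points)]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Cut across the axis with the larger extent to keep tiles compact.
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    # Give each side a share of points proportional to its tile count.
    left_tiles = n_tiles // 2
    right_tiles = n_tiles - left_tiles
    cut = len(ordered) * left_tiles // n_tiles
    return (balanced_tiles(ordered[:cut], left_tiles)
            + balanced_tiles(ordered[cut:], right_tiles))
```

Calling balanced_tiles(points, 8) on 1000 points returns eight lists
of 125 points each. When the count does not divide evenly, group
sizes differ by a small amount.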

You would then select the points for each
cluster, buffer their convex hull by the max
distance of your spatial cluster algorithm
and set the working region for each "tile" to
be the bounding box of the buffered convex
hull (don't forget to catch all points from
all other clusters that fall within the "tile"
and add them to the working region's set).
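
The selection step could be sketched as follows. Since only the
bounding box of the buffered hull is needed, the sketch skips the
hull itself: buffering a convex hull by a distance d grows its
bounding box by d on every side. The names buffered_bbox and
points_in_bbox are hypothetical, and max_dist stands for the maximum
distance of whatever spatial cluster algorithm is used.

```python
def buffered_bbox(tile_points, max_dist):
    """Return (west, south, east, north) of the tile, grown by max_dist."""
    xs = [p[0] for p in tile_points]
    ys = [p[1] for p in tile_points]
    return (min(xs) - max_dist, min(ys) - max_dist,
            max(xs) + max_dist, max(ys) + max_dist)


def points_in_bbox(all_points, bbox):
    """Collect every point inside bbox, including points of other tiles."""
    west, south, east, north = bbox
    return [p for p in all_points
            if west <= p[0] <= east and south <= p[1] <= north]


tile = [(0.0, 0.0), (10.0, 10.0)]
everything = tile + [(11.0, 11.0), (13.0, 5.0), (5.0, 5.0)]
bbox = buffered_bbox(tile, max_dist=2.0)
working_set = points_in_bbox(everything, bbox)
```

The four values of bbox are what one would hand to g.region as
w=, s=, e= and n= before processing the tile.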

If that works, please make it a GRASS add-on...

Regarding building footprints, I guess another
tricky part is how to represent them as
points: Centroids? Outer edge vertices? Both?

Oh, by the way: A fellow computer scientist
who works a lot with concurrent processing
once told me that the frequently used

number of processes = number of CPUs/cores

is actually not ideal! Apparently, modern
CPU schedulers are optimized to handle many
more processes than there are CPUs/cores,
and if the two counts match, then you can
get fringe situations where processes keep
getting transferred between cores, which
incurs a huge performance penalty. His
recommendation was to use about 2.5 times
as many processes as cores.

I never got around to testing his theory,
but if you have the time, I'd love to know!
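
A rough way to try this out, using only the Python standard library.
The 2.5 factor is the unverified recommendation quoted above, the
busy() function is only a CPU-bound stand-in for the real per-tile
work, and worker_count and time_pool are names invented for this
sketch. Real zonal statistics also involve disk I/O, so results may
differ.

```python
import math
import os
import time
from multiprocessing import Pool


def worker_count(cores=None, factor=2.5):
    """Number of processes for a given core count (factor is an assumption)."""
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, math.ceil(cores * factor))


def busy(n):
    """CPU-bound placeholder for the work done on one tile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_pool(n_procs, tasks):
    """Run all tasks on a pool of n_procs processes; return elapsed seconds."""
    start = time.perf_counter()
    with Pool(processes=n_procs) as pool:
        pool.map(busy, tasks)
    return time.perf_counter() - start


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    tasks = [200_000] * 64
    # Compare "one process per core" against the 2.5x suggestion.
    for n_procs in (cores, worker_count(cores)):
        print(n_procs, "processes:", round(time_pool(n_procs, tasks), 2), "s")
```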

Best,

Ben

> 
> Any idea?
> 
> thanks,
> Markus

-- 
Dr. Benjamin Ducke
{*} Geospatial Consultant
{*} GIS Developer

Spatial technology for the masses, not the classes:
experience free and open source GIS at http://gvsigce.org

