[GRASS-dev] How to calculate mean coordinates from big point datasets?

Markus Metz markus.metz.giswork at gmail.com
Fri Sep 20 08:38:23 PDT 2013


Glynn Clements wrote:
>
> Luca Delucchi wrote:
>
>> maybe v.median [0] could help?
>
> Not for large datasets. First, it requires that the data fit into
> RAM. Second, numpy.median() sorts the entire array and takes the
> middle value, which is somewhere between O(n log n) for the typical
> case and O(n^2) for the worst case (numpy.median uses the default
> sorting algorithm, which is quicksort).
>
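
(As an aside: a full sort is not required to get a median. numpy's
np.partition() places only the requested order statistics in their
sorted positions, at average O(n) cost. A minimal sketch of a
partition-based median, still assuming the data fit in RAM:

    import numpy as np

    def median_partition(a):
        # Put only the middle element(s) into sorted position:
        # average O(n) instead of a full O(n log n) sort.
        n = a.size
        mid = n // 2
        if n % 2:
            return np.partition(a, mid)[mid]
        p = np.partition(a, [mid - 1, mid])
        return 0.5 * (p[mid - 1] + p[mid])

This only addresses the sorting cost, not the memory limit.)
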
> See r.quantile for a more efficient approach for large datasets. The
> first pass computes a histogram which allows upper and lower bounds to
> be determined for each quantile. The second pass extracts values which
> lie within those bounds and sorts them.
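
For illustration, the two-pass idea can be sketched in a few lines of
numpy (a simplification of what r.quantile does in C; the nearest-rank
convention and the bin handling are my own assumptions, and a real
implementation would stream the data twice instead of holding it in
memory):

    import numpy as np

    def quantile_two_pass(values, q, nbins=1000):
        # Pass 1: histogram to locate the bin containing the target rank.
        n = values.size
        k = int(q * (n - 1))                  # nearest rank, no interpolation
        counts, edges = np.histogram(values, bins=nbins)
        cum = np.cumsum(counts)
        b = int(np.searchsorted(cum, k + 1))  # first bin with cum >= k + 1
        # Pass 2: sort only the values within that bin's bounds.
        lo, hi = edges[b], edges[b + 1]
        below = int((values < lo).sum())      # ranks falling below the bin
        inside = np.sort(values[(values >= lo) & (values <= hi)])
        return inside[k - below]

Only the values inside one histogram bin get sorted, which is a small
fraction of the data unless they all pile up in a single bin.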

The approach of v.median and r.quantile treats each dimension
separately, which is IMHO not correct in a spatial context. The median
of a point cloud would be a multivariate median for 2 or 3 dimensions.
You would need to sort the points first by the first dimension, then
by the second dimension, etc., and then pick the coordinates of the
midpoint. Alternatively, you would need to find the point with the
smallest summed distance to all other points, which is a nightmare to
calculate by brute force (n x (n - 1) / 2 pairwise distance
calculations).
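
(For completeness: a closely related quantity, the geometric median,
i.e. the point minimizing the summed distance without being restricted
to the input points, can be approximated without the quadratic
pairwise search. Weiszfeld's algorithm needs only O(n) work per
iteration; a minimal sketch:

    import numpy as np

    def geometric_median(points, eps=1e-7, max_iter=100):
        # Weiszfeld's algorithm: an iteratively re-weighted mean,
        # O(n) per iteration instead of O(n^2) pairwise distances.
        p = points.mean(axis=0)        # start at the center of gravity
        for _ in range(max_iter):
            d = np.linalg.norm(points - p, axis=1)
            d = np.maximum(d, eps)     # avoid division by zero
            w = 1.0 / d
            p_new = (points * w[:, None]).sum(axis=0) / w.sum()
            if np.linalg.norm(p_new - p) < eps:
                break
            p = p_new
        return p

Here points is an (n, 2) or (n, 3) array of coordinates.)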

> Except for pathological cases
> (where the majority of the data lie within a tiny proportion of the
> overall range), only a small fraction of the data are sorted.

Coordinates of large point clouds can easily be such pathological cases.

>
> In any case: is this question about the mean or the median?

I guess the median could make sense because it is less sensitive to
outliers.
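
The mean, in contrast, is cheap even for huge datasets: a single
streaming pass with running sums needs constant memory. A trivial
sketch, reading coordinate pairs one at a time:

    def mean_coordinates(points):
        # Single pass, O(1) memory; points can be any iterable,
        # e.g. coordinates streamed from a file or database cursor.
        n = sx = sy = 0.0
        for x, y in points:
            sx += x
            sy += y
            n += 1
        return sx / n, sy / n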

A v.centerpoint module could thus have options for point mode (mean,
median, distance) and line mode (midpoint, center of gravity). Area
centroids are already calculated by v.centroids, even though they are
not centers of gravity.

Markus M

