[postgis-devel] Bias in ClusterKmeans

Thu Dec 28 11:15:51 PST 2017

I’ll review a patch, but I won’t do anything about a ticket 😉
The seeding problem is very fiddly, as I recall, and not straightforward by any means. Not all inputs are uniform, for example, so one size very much does not fit all.
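
Before anyone writes a patch, it might help to put numbers on the unevenness. Here is a minimal sketch, reusing the synthetic input from the quoted query below, that counts the points in each cluster and summarizes the spread. The CTE names (clustered, cluster_sizes) and the output column names are made up for illustration; only the ST_* functions and the standard aggregates are real.

```sql
-- Sketch: summarize how many points land in each k-means cluster.
-- Same synthetic input as the quoted query; the CTE and column names
-- below are illustrative, not part of PostGIS.
WITH points AS (
    SELECT (ST_DumpPoints(ST_GeneratePoints(ST_MakeEnvelope(0, 0, 1000, 1000), 100000))).geom AS geom
),
clustered AS (
    -- One window over the whole set, as in the original query.
    SELECT ST_ClusterKMeans(geom, 1000) OVER () AS cid, geom
    FROM points
),
cluster_sizes AS (
    SELECT cid, count(*) AS n_points
    FROM clustered
    GROUP BY cid
)
SELECT min(n_points)    AS smallest,
       max(n_points)    AS largest,
       avg(n_points)    AS mean_size,
       stddev(n_points) AS size_stddev
FROM cluster_sizes;
```

With 100000 points over 1000 clusters the mean should sit near 100, so a wide gap between smallest and largest, or a large standard deviation, would back up the visual impression. ST_GeneratePoints is random, so the exact figures will differ from run to run.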

P

> On Dec 28, 2017, at 7:24 AM, Tom van Tilburg <tom.van.tilburg at gmail.com> wrote:
> 
> When running ST_ClusterKMeans with a large number (>100) of clusters, it becomes clear that the clusters are unevenly sized, even when the points are evenly distributed.
> 
> Consider the following query:
> WITH 
> points AS (
>     SELECT (ST_DumpPoints(ST_GeneratePoints(ST_MakeEnvelope(0,0,1000,1000),100000))).geom AS geom
> )
> SELECT ST_ClusterKMeans(geom,1000) over () AS cid, geom
> FROM points;
> 
> This will generate the following clusters:
> <image.png>
> 
> Clearly, clusters along the lower-left to upper-right diagonal are smaller than clusters farther from this diagonal. Could this be an issue with the initial (random?) seeding?
> If people agree this is undesired behaviour (for me it is), I can file a report.
> 
> Best,
>  Tom