[postgis-devel] Bias in ClusterKmeans

Thu Dec 28 07:24:37 PST 2017

When running ST_ClusterKMeans with a large number of clusters (>100), it
becomes clear that the cluster sizes are unevenly distributed, even
when the points themselves are evenly distributed.

Consider the following query:
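WITH
points AS (
    -- 100000 random points inside a 1000 x 1000 square
    SELECT (ST_DumpPoints(
               ST_GeneratePoints(ST_MakeEnvelope(0, 0, 1000, 1000), 100000)
           )).geom AS geom
)
-- assign every point to one of 1000 k-means clusters
SELECT ST_ClusterKMeans(geom, 1000) OVER () AS cid, geom
FROM points;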

This will generate the following clusters:
[image: Inline image 1]

Clusters along the lower-left to upper-right diagonal are clearly smaller
than clusters further from this diagonal. Could this be an issue with the
starting (random?) seeding?
If people agree this is undesired behaviour (for me it is), I can file a
report.
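
For anyone who wants to check this without the image, here is a rough sketch
that puts numbers on it. It reuses the query above; the names clustered,
n_points, hull_area and dist_to_diagonal are my own, and convex hull area is
only an assumed stand-in for "cluster size". If the bias is real, both
n_points and hull_area should grow with dist_to_diagonal.

```sql
WITH
points AS (
    -- same test data as above: 100000 random points in a 1000 x 1000 square
    SELECT (ST_DumpPoints(
               ST_GeneratePoints(ST_MakeEnvelope(0, 0, 1000, 1000), 100000)
           )).geom AS geom
),
clustered AS (
    SELECT ST_ClusterKMeans(geom, 1000) OVER () AS cid, geom
    FROM points
)
SELECT
    cid,
    count(*) AS n_points,
    -- area covered by the cluster, approximated by its convex hull
    ST_Area(ST_ConvexHull(ST_Collect(geom))) AS hull_area,
    -- distance from the cluster centroid to the line y = x
    abs(ST_X(ST_Centroid(ST_Collect(geom)))
      - ST_Y(ST_Centroid(ST_Collect(geom)))) / sqrt(2) AS dist_to_diagonal
FROM clustered
GROUP BY cid
ORDER BY dist_to_diagonal;
```

Since ST_GeneratePoints is random, the exact numbers will differ from run to
run; only the overall trend is meaningful.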

Best,
 Tom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 186158 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20171228/0512497e/attachment-0001.png>