[postgis-devel] Bias in ClusterKmeans

Thu Dec 28 13:47:35 PST 2017

What do you think of the following seeding strategy:

 - make an array that tracks original order of elements O(N);
 - push nulls and zeros to the end of points buffer O(N);
 - sort points buffer by z-curve O(N logN);
 - move duplicates away to the start/end O(N);
 - if number of unique points left less than k fail like we do now;
 - seed cluster centers each round(i*N/k) elements O(k);
 - iterate KMeans (with N less by N_empty+N_duplicates);
 - assign duplicates clusters of their twins O(N), if there are duplicates;
 - sort back to original order O(N logN).

This way we get seed point in O(4N + 2NlogN + k), and distribution roughly
matches input points distribution.
Currently it's best case is o(k*N), if closest doesn't need a recheck.

By the way, shouldn't we j=0 here? It looks like we can seed a point twice.
Or maybe a comment why we don't need to recheck new point from start of
list?
https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L203

Also, why do we need to calculate centroids of all clusters separately,
instead of one pass over all the clusters summing to array of centroids?
https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L45

Tom, the line of code that does the diagonal seeding is this:
https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187
At the very least to fix your case it may need to seed in a grid, so you
need to make it two loops for row and column and estimate range for them
from min/max height and width.

чт, 28 дек. 2017 г. в 23:22, Regina Obe <lr at pcorp.us>:

> But the patch should be in a ticket.
>
>
> -----Original Message-----
> From: postgis-devel [mailto:postgis-devel-bounces at lists.osgeo.org] On
> Behalf Of Paul Ramsey
> Sent: Thursday, December 28, 2017 2:16 PM
> To: PostGIS Development Discussion <postgis-devel at lists.osgeo.org>
> Subject: Re: [postgis-devel] Bias in ClusterKmeans
>
> I ll review a patch, but I won t do anything about a ticket ??
> The seeding problem is very fiddly as I recall, not a straightforward
> problem, by any means. Not all inputs are uniform, for example, one size
> very much does not fit all.
>
> P
>
> > On Dec 28, 2017, at 7:24 AM, Tom van Tilburg <tom.van.tilburg at gmail.com>
> wrote:
> >
> > When running ST_ClusterKmeans on a large amount (>100) of clusters it
> becomes clear that there is a uneven distribution in the clustering, even
> when the points are evenly distributed.
> >
> > Consider the following query:
> > WITH
> > points AS (
> >     SELECT
> > (ST_DumpPoints(ST_generatePoints(ST_MakeEnvelope(0,0,1000,1000),100000
> > ))).geom geom
> > )
> > SELECT ST_ClusterKMeans(geom,1000) over () AS cid, geom FROM points;
> >
> > This will generate the following clusters:
> > <image.png>
> >
> > Obviously, clusters on the lowleft, uppright diagonal are smaller then
> clusters further from this diagonal. Could this be an issue with the
> starting (random?) seeding?
> > If people agree this is undesired behaviour (for me it is), I can file a
> report.
> >
> > Best,
> >  Tom
> > _______________________________________________
> > postgis-devel mailing list
> > postgis-devel at lists.osgeo.org
> > https://lists.osgeo.org/mailman/listinfo/postgis-devel
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20171228/7daaec34/attachment.html>