<div dir="ltr">What do you think of the following seeding strategy:<div><br><div> - make an array that tracks original order of elements O(N);<br><div> - push nulls and zeros to the end of points buffer O(N);</div></div><div> - sort points buffer by z-curve O(N logN);</div><div> - move duplicates away to the start/end O(N);</div><div> - if number of unique points left less than k fail like we do now;</div><div> - seed cluster centers each round(i*N/k) elements O(k);</div><div> - iterate KMeans (with N less by N_empty+N_duplicates);<br> - assign duplicates clusters of their twins O(N), if there are duplicates;<br> - sort back to original order O(N logN).</div><div><br></div><div>This way we get seed point in O(4N + 2NlogN + k), and distribution roughly matches input points distribution. </div><div>Currently it's best case is o(k*N), if closest doesn't need a recheck.</div><div><br></div><div>By the way, shouldn't we j=0 here? It looks like we can seed a point twice.</div><div>Or maybe a comment why we don't need to recheck new point from start of list?</div><div><a href="https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L203">https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L203</a> <br><br>Also, why do we need to calculate centroids of all clusters separately, instead of one pass over all the clusters summing to array of centroids?<br><a href="https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L45">https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L45</a> <br></div><div><br></div><div>Tom, the line of code that does the diagonal seeding is this:</div></div><div><a href="https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187">https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187</a> <br></div><div>At the very least to fix your case it may need to seed in a grid, so you need to make it two loops for row and column and estimate range for them from min/max height and width.</div></div><br><div class="gmail_quote"><div dir="ltr">чт, 28 дек. 2017 г. в 23:22, Regina Obe <<a href="mailto:lr@pcorp.us">lr@pcorp.us</a>>:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">But the patch should be in a ticket.<br>

<br>

<br>

-----Original Message-----<br>

From: postgis-devel [mailto:<a href="mailto:postgis-devel-bounces@lists.osgeo.org" target="_blank">postgis-devel-bounces@lists.osgeo.org</a>] On Behalf Of Paul Ramsey<br>

Sent: Thursday, December 28, 2017 2:16 PM<br>

To: PostGIS Development Discussion <<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a>><br>

Subject: Re: [postgis-devel] Bias in ClusterKmeans<br>

<br>

I ll review a patch, but I won t do anything about a ticket ??<br>

The seeding problem is very fiddly as I recall, not a straightforward problem, by any means. Not all inputs are uniform, for example, one size very much does not fit all.<br>

<br>

P<br>

<br>

> On Dec 28, 2017, at 7:24 AM, Tom van Tilburg <<a href="mailto:tom.van.tilburg@gmail.com" target="_blank">tom.van.tilburg@gmail.com</a>> wrote:<br>

><br>

> When running ST_ClusterKmeans on a large amount (>100) of clusters it becomes clear that there is a uneven distribution in the clustering, even when the points are evenly distributed.<br>

><br>

> Consider the following query:<br>

> WITH<br>

> points AS (<br>

>     SELECT<br>

> (ST_DumpPoints(ST_generatePoints(ST_MakeEnvelope(0,0,1000,1000),100000<br>

> ))).geom geom<br>

> )<br>

> SELECT ST_ClusterKMeans(geom,1000) over () AS cid, geom FROM points;<br>

><br>

> This will generate the following clusters:<br>

> <image.png><br>

><br>

> Obviously, clusters on the lowleft, uppright diagonal are smaller then clusters further from this diagonal. Could this be an issue with the starting (random?) seeding?<br>

> If people agree this is undesired behaviour (for me it is), I can file a report.<br>

><br>

> Best,<br>

>  Tom<br>

> _______________________________________________<br>

> postgis-devel mailing list<br>

> <a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

> <a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

_______________________________________________<br>

postgis-devel mailing list<br>

<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

<a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

<br>

_______________________________________________<br>

postgis-devel mailing list<br>

<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>

<a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a></blockquote></div>