<div dir="ltr">Here's a test that I'd say our KMeans definitely fails - on a list of 2205 non-repeating points when asked to find 2205 clusters it only finds 1859, so K is not really honored:<div><br></div><div><div>select count(distinct cid) from </div><div>(WITH</div><div>points AS (</div><div> SELECT ST_MakePoint(x,y) geom from generate_series(-10,10) x, generate_series(-10,10) y</div><div> union all</div><div> SELECT ST_MakePoint(x+0.1,y) geom from generate_series(-10,10) x, generate_series(-10,10) y</div><div> union all</div><div> SELECT ST_MakePoint(x+0.2,y) geom from generate_series(-10,10) x, generate_series(-10,10) y</div><div> union all</div><div> SELECT ST_MakePoint(x,y+0.1) geom from generate_series(-10,10) x, generate_series(-10,10) y</div><div> union all</div><div> SELECT ST_MakePoint(x,y+0.2) geom from generate_series(-10,10) x, generate_series(-10,10) y</div><div>)</div><div>SELECT ST_ClusterKMeans(geom,2204) over () AS cid, geom</div><div>FROM points) z;</div></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr">пт, 29 дек. 2017 г. в 8:07, Regina Obe <<a href="mailto:lr@pcorp.us">lr@pcorp.us</a>>:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div lang="EN-US" link="blue" vlink="purple"><div class="m_2710172806220812398WordSection1"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">While you guys are grokking the code, perhaps you can figure out what is causing this test to fail once in a while.<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Fails more on 32-bit more than 64-bit. Now I saw a similar failure on travis, might be Darafei's specially added testing sauce triggering that one </span><span style="font-size:11.0pt;font-family:Wingdings;color:#1f497d">J</span><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><a href="https://trac.osgeo.org/postgis/ticket/3700" target="_blank">https://trac.osgeo.org/postgis/ticket/3700</a><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Anyway this failure is not repeatable unfortunately<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Last one happened just recently on winnie's 64-bit run.<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><a href="https://trac.osgeo.org/postgis/ticket/3700#comment:18" target="_blank">https://trac.osgeo.org/postgis/ticket/3700#comment:18</a><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Darafei if you want to experiment with logbt as you mentioned in that ticket +1 from me for 2.5 branch. Don't wait for any more +1s, just do and ask for forgiveness later if anyone screams </span><span style="font-size:11.0pt;font-family:Wingdings;color:#1f497d">J</span><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Thanks,<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d">Regina<u></u><u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal" style="margin-left:.5in"><b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"> postgis-devel [mailto:<a href="mailto:postgis-devel-bounces@lists.osgeo.org" target="_blank">postgis-devel-bounces@lists.osgeo.org</a>] <b>On Behalf Of </b>Tom van Tilburg<br><b>Sent:</b> Thursday, December 28, 2017 5:21 PM</span></p></div></div><div lang="EN-US" link="blue" vlink="purple"><div class="m_2710172806220812398WordSection1"><p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"><br><b>To:</b> PostGIS Development Discussion <<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a>><br><b>Subject:</b> Re: [postgis-devel] Bias in ClusterKmeans<u></u><u></u></span></p></div></div><div lang="EN-US" link="blue" vlink="purple"><div class="m_2710172806220812398WordSection1"><p class="MsoNormal" style="margin-left:.5in"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"></span></p><p class="MsoNormal" style="margin-left:.5in"><u></u> <u></u></p><div><div><div><p class="MsoNormal" style="margin-left:.5in">I was just looking at that code and trying to wrap my head around the rationale. Why a diagonal seed?<u></u><u></u></p></div><p class="MsoNormal" style="margin-left:.5in"><u></u> <u></u></p></div></div></div></div><div lang="EN-US" link="blue" vlink="purple"><div class="m_2710172806220812398WordSection1"><div><div><p class="MsoNormal" style="margin-left:.5in">A grid would indeed solve my example but I have more irregular input data. <u></u><u></u></p></div><p class="MsoNormal" style="margin-left:.5in">Would a quick & dirty solution be to spread K points randomly in the convex-hull of the the input set?<u></u><u></u></p><div><div><div><p class="MsoNormal" style="margin-left:.5in"><u></u> <u></u></p></div><p class="MsoNormal" style="margin-left:.5in">Your proposed strategy looks much more stable than the current and as far as I can grasp it would give denser seeding where there is higher density of geometries. I wonder though why it wouldn't be enough to just randomly pick points from the input set, apart from having a different outcome every time. <u></u><u></u></p></div><div><p class="MsoNormal" style="margin-left:.5in"><u></u> <u></u></p></div><div><p class="MsoNormal" style="margin-left:.5in">Tom<u></u><u></u></p></div><div><div><p class="MsoNormal" style="margin-left:.5in"><br>On Thu, Dec 28, 2017 at 10:47 PM, Darafei "Komяpa" Praliaskouski <<a href="mailto:me@komzpa.net" target="_blank">me@komzpa.net</a>> wrote:<u></u><u></u></p><div><div><div><blockquote style="border:none;border-left:solid #cccccc 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in"><div><p class="MsoNormal" style="margin-left:.5in"><u></u> <u></u></p><div><div><p class="MsoNormal" style="margin-left:.5in">Tom, the line of code that does the diagonal seeding is this:<u></u><u></u></p></div></div><div><p class="MsoNormal" style="margin-left:.5in"><a href="https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187" target="_blank">https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187</a> <u></u><u></u></p></div><div><p class="MsoNormal" style="margin-left:.5in">At the very least to fix your case it may need to seed in a grid, so you need to make it two loops for row and column and estimate range for them from min/max height and width.<u></u><u></u></p></div></div><p class="MsoNormal" style="margin-left:.5in"><u></u> <u></u></p></blockquote></div></div></div></div></div></div></div></div></div>_______________________________________________<br>
postgis-devel mailing list<br>
<a href="mailto:postgis-devel@lists.osgeo.org" target="_blank">postgis-devel@lists.osgeo.org</a><br>
<a href="https://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-devel</a></blockquote></div>