[postgis-devel] Bias in ClusterKmeans

Darafei "Komяpa" Praliaskouski me at komzpa.net
Fri Dec 29 03:07:56 PST 2017


Here's a test that I'd say our KMeans definitely fails - on a list of 2205
non-repeating points when asked to find 2205 clusters it only finds 1859,
so K is not really honored:

select count(distinct cid) from
(WITH
points AS (
    SELECT ST_MakePoint(x,y) geom from generate_series(-10,10) x,
generate_series(-10,10) y
    union all
    SELECT ST_MakePoint(x+0.1,y) geom from generate_series(-10,10) x,
generate_series(-10,10) y
    union all
    SELECT ST_MakePoint(x+0.2,y) geom from generate_series(-10,10) x,
generate_series(-10,10) y
    union all
    SELECT ST_MakePoint(x,y+0.1) geom from generate_series(-10,10) x,
generate_series(-10,10) y
    union all
    SELECT ST_MakePoint(x,y+0.2) geom from generate_series(-10,10) x,
generate_series(-10,10) y
)
SELECT ST_ClusterKMeans(geom,2204) over () AS cid, geom
FROM points) z;


пт, 29 дек. 2017 г. в 8:07, Regina Obe <lr at pcorp.us>:

> While you guys are grokking the code, perhaps you can figure out what is
> causing this test to fail once in a while.
>
>
>
> Fails more on 32-bit more than 64-bit.  Now I saw a similar failure on
> travis, might be Darafei's specially added testing sauce triggering that
> one J
>
>
>
> https://trac.osgeo.org/postgis/ticket/3700
>
>
>
> Anyway this failure is not repeatable unfortunately
>
>
>
> Last one happened just recently on winnie's 64-bit run.
>
>
>
> https://trac.osgeo.org/postgis/ticket/3700#comment:18
>
>
>
> Darafei if you want to experiment with logbt as you mentioned in that
> ticket +1 from me for 2.5 branch.  Don't wait for any more +1s, just do and
> ask for forgiveness later if anyone screams J
>
>
>
>
>
> Thanks,
>
> Regina
>
>
>
>
>
>
>
>
>
>
>
> *From:* postgis-devel [mailto:postgis-devel-bounces at lists.osgeo.org] *On
> Behalf Of *Tom van Tilburg
> *Sent:* Thursday, December 28, 2017 5:21 PM
>
>
> *To:* PostGIS Development Discussion <postgis-devel at lists.osgeo.org>
> *Subject:* Re: [postgis-devel] Bias in ClusterKmeans
>
>
>
> I was just looking at that code and trying to wrap my head around the
> rationale. Why a diagonal seed?
>
>
>
> A grid would indeed solve my example but I have more irregular input data.
>
> Would a quick & dirty solution be to spread K points randomly in the
> convex-hull of the the input set?
>
>
>
> Your proposed strategy looks much more stable than the current and as far
> as I can grasp it would give denser seeding where there is higher density
> of geometries. I wonder though why it wouldn't be enough to just randomly
> pick points from the input set, apart from having a different outcome every
> time.
>
>
>
> Tom
>
>
> On Thu, Dec 28, 2017 at 10:47 PM, Darafei "Komяpa" Praliaskouski <
> me at komzpa.net> wrote:
>
>
>
> Tom, the line of code that does the diagonal seeding is this:
>
> https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187
>
>
> At the very least to fix your case it may need to seed in a grid, so you
> need to make it two loops for row and column and estimate range for them
> from min/max height and width.
>
>
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20171229/6b16c642/attachment-0001.html>


More information about the postgis-devel mailing list