[postgis-devel] Bias in ClusterKmeans

Darafei "Komяpa" Praliaskouski me at komzpa.net
Thu Jan 4 10:38:01 PST 2018


Hi,

I fixed the lost clusters issue,
https://trac.osgeo.org/postgis/changeset/16212.

I also made an init algorithm that does it supposedly with less skew.
Patch: https://github.com/postgis/postgis/pull/181

Tom, can you please file a formal ticket to http://trac.osgeo.org/postgis/?
Would you be able to test it on your dataset to see if that solves your
issue?
We would also need a test case to freeze the behavior.

Paul, can you review it please and find a case that may be broken by such
fix? :)

пт, 29 дек. 2017 г. в 14:07, Darafei "Komяpa" Praliaskouski <me at komzpa.net>:

> Here's a test that I'd say our KMeans definitely fails - on a list of 2205
> non-repeating points when asked to find 2205 clusters it only finds 1859,
> so K is not really honored:
>
> select count(distinct cid) from
> (WITH
> points AS (
>     SELECT ST_MakePoint(x,y) geom from generate_series(-10,10) x,
> generate_series(-10,10) y
>     union all
>     SELECT ST_MakePoint(x+0.1,y) geom from generate_series(-10,10) x,
> generate_series(-10,10) y
>     union all
>     SELECT ST_MakePoint(x+0.2,y) geom from generate_series(-10,10) x,
> generate_series(-10,10) y
>     union all
>     SELECT ST_MakePoint(x,y+0.1) geom from generate_series(-10,10) x,
> generate_series(-10,10) y
>     union all
>     SELECT ST_MakePoint(x,y+0.2) geom from generate_series(-10,10) x,
> generate_series(-10,10) y
> )
> SELECT ST_ClusterKMeans(geom,2204) over () AS cid, geom
> FROM points) z;
>
>
> пт, 29 дек. 2017 г. в 8:07, Regina Obe <lr at pcorp.us>:
>
>> While you guys are grokking the code, perhaps you can figure out what is
>> causing this test to fail once in a while.
>>
>>
>>
>> Fails more on 32-bit more than 64-bit.  Now I saw a similar failure on
>> travis, might be Darafei's specially added testing sauce triggering that
>> one J
>>
>>
>>
>> https://trac.osgeo.org/postgis/ticket/3700
>>
>>
>>
>> Anyway this failure is not repeatable unfortunately
>>
>>
>>
>> Last one happened just recently on winnie's 64-bit run.
>>
>>
>>
>> https://trac.osgeo.org/postgis/ticket/3700#comment:18
>>
>>
>>
>> Darafei if you want to experiment with logbt as you mentioned in that
>> ticket +1 from me for 2.5 branch.  Don't wait for any more +1s, just do and
>> ask for forgiveness later if anyone screams J
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Regina
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* postgis-devel [mailto:postgis-devel-bounces at lists.osgeo.org] *On
>> Behalf Of *Tom van Tilburg
>> *Sent:* Thursday, December 28, 2017 5:21 PM
>>
>>
>> *To:* PostGIS Development Discussion <postgis-devel at lists.osgeo.org>
>> *Subject:* Re: [postgis-devel] Bias in ClusterKmeans
>>
>>
>>
>> I was just looking at that code and trying to wrap my head around the
>> rationale. Why a diagonal seed?
>>
>>
>>
>> A grid would indeed solve my example but I have more irregular input
>> data.
>>
>> Would a quick & dirty solution be to spread K points randomly in the
>> convex-hull of the the input set?
>>
>>
>>
>> Your proposed strategy looks much more stable than the current and as far
>> as I can grasp it would give denser seeding where there is higher density
>> of geometries. I wonder though why it wouldn't be enough to just randomly
>> pick points from the input set, apart from having a different outcome every
>> time.
>>
>>
>>
>> Tom
>>
>>
>> On Thu, Dec 28, 2017 at 10:47 PM, Darafei "Komяpa" Praliaskouski <
>> me at komzpa.net> wrote:
>>
>>
>>
>> Tom, the line of code that does the diagonal seeding is this:
>>
>>
>> https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwkmeans.c#L187
>>
>>
>> At the very least to fix your case it may need to seed in a grid, so you
>> need to make it two loops for row and column and estimate range for them
>> from min/max height and width.
>>
>>
>>
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-devel
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20180104/9a31fd1a/attachment.html>


More information about the postgis-devel mailing list