[postgis-users] Determining clusters of points

Wed Dec 8 18:41:39 PST 2010

On 12/8/2010 1:35 AM, Sébastien Lorion wrote:
> To give a specific number, assume there are about 1 *billion* points,
> not a million. The total number of clusters should over around 10
> millions with about 25% having a high concentration of points in them.
> It is not a query that will be run repeatedly (once every half hour),
> and the hardware can be scaled up without problem (up to a point of
> course ..).
>
> Sébastien

You can roll your own kmeans directly in the database (this is probably
what you would have done with R), but with that many clusters, it might
take many cycles to converge.

Here is some pseudo-code that will do kmeans, but you would have to
implement the function/query to return the closest cluster.

> create table tempdata (
>   position_id integer,
>   pos geometry,
>   cluster int
> );
>
> create table centriods (
>   cluster_id int,
>   pos geometry,
>   x double precision,
>   y double precision
> );
>
> CREATE OR REPLACE FUNCTION kmeans()
>   RETURNS void AS
> $BODY$
> DECLARE
>   row_changed integer;
>   cycles integer;
> BEGIN
>   delete from centriods;
>   //put in x random points...
>   for i in 1..x LOOP
>   insert into centriods values (i, POINT);
>
>   row_changed := 1;
>   cycles := 0;
>   while (row_changed > 0) LOOP
>     cycles := cycles + 1;
>     --Map ids to cluster
>     update tempdata set cluster = closestid(cost);  --Need a function
> to find closest point
>
>     --build new clusters
>     update centriods set pos = geomfromtext('POINT ('||avgy||' '||avgx')')
>     FROM (select avg(x) as avgx, avg(y) as avgy,cluster from tempdata
> group by cluster) as foo
>     where id = foo.cluster and (x <> avgx and y <> avgy);
>
>     GET DIAGNOSTICS row_changed = ROW_COUNT;
>     RAISE NOTICE 'Changed %',row_changed;
>   END LOOP;
> RAISE NOTICE 'Cycles %',cycles;
> END; $BODY$
>   LANGUAGE 'plpgsql' VOLATILE
>   COST 100;

But... if you are really handling that much data (like vehicle traffic),
you probably don't want to use circles anyway.  Every new posit (unless
it is in the exact center) will cause the cluster to move & need to be
reevaluated.

Ideally, you might want to consider using ellipses. Not necessarily the
postgis definition of an ellipse, but a polygon that can wrap close
positions with buffered rounding.  It helps if you have some direction
with positions.  Under this design, each ellipse should expand to
natural boundaries.  Any further positions, might expand the polygon, or
just fall within.  If too many fall within the same ellipse, you can
re-evaluate and have smaller ellipse in the ellipse for heavier
densities.  But, since they aren't moving, we aren't re-clustering every
half an hour.
Just another idea to think about.