[postgis-users] Determining clusters of points

Wed Dec 8 19:27:51 PST 2010

Hello Steve,

Thanks, I will check that solution as well.

I can add one more property of my use case: consider that not only the
points are moving without predictable direction, but that part of the
processing (the clustering) is about knowing if they can detect each other
as if they had a radar or some other similar instrument. So using ellipses
or rectangles would have a very big effect on my calculations and is
unacceptable in my scenario.

Sébastien

On Wed, Dec 8, 2010 at 21:41, Steve <steve at subwest.com> wrote:

> On 12/8/2010 1:35 AM, Sébastien Lorion wrote:
> > To give a specific number, assume there are about 1 *billion* points,
> > not a million. The total number of clusters should over around 10
> > millions with about 25% having a high concentration of points in them.
> > It is not a query that will be run repeatedly (once every half hour),
> > and the hardware can be scaled up without problem (up to a point of
> > course ..).
> >
> > Sébastien
>
> You can roll your own kmeans directly in the database (this is probably
> what you would have done with R), but with that many clusters, it might
> take many cycles to converge.
>
> Here is some pseudo-code that will do kmeans, but you would have to
> implement the function/query to return the closest cluster.
>
>
> > create table tempdata (
> >   position_id integer,
> >   pos geometry,
> >   cluster int
> > );
> >
> > create table centriods (
> >   cluster_id int,
> >   pos geometry,
> >   x double precision,
> >   y double precision
> > );
> >
> > CREATE OR REPLACE FUNCTION kmeans()
> >   RETURNS void AS
> > $BODY$
> > DECLARE
> >   row_changed integer;
> >   cycles integer;
> > BEGIN
> >   delete from centriods;
> >   //put in x random points...
> >   for i in 1..x LOOP
> >   insert into centriods values (i, POINT);
> >
> >   row_changed := 1;
> >   cycles := 0;
> >   while (row_changed > 0) LOOP
> >     cycles := cycles + 1;
> >     --Map ids to cluster
> >     update tempdata set cluster = closestid(cost);  --Need a function
> > to find closest point
> >
> >     --build new clusters
> >     update centriods set pos = geomfromtext('POINT ('||avgy||'
> '||avgx')')
> >     FROM (select avg(x) as avgx, avg(y) as avgy,cluster from tempdata
> > group by cluster) as foo
> >     where id = foo.cluster and (x <> avgx and y <> avgy);
> >
> >     GET DIAGNOSTICS row_changed = ROW_COUNT;
> >     RAISE NOTICE 'Changed %',row_changed;
> >   END LOOP;
> > RAISE NOTICE 'Cycles %',cycles;
> > END; $BODY$
> >   LANGUAGE 'plpgsql' VOLATILE
> >   COST 100;
>
> But... if you are really handling that much data (like vehicle traffic),
> you probably don't want to use circles anyway.  Every new posit (unless
> it is in the exact center) will cause the cluster to move & need to be
> reevaluated.
>
> Ideally, you might want to consider using ellipses. Not necessarily the
> postgis definition of an ellipse, but a polygon that can wrap close
> positions with buffered rounding.  It helps if you have some direction
> with positions.  Under this design, each ellipse should expand to
> natural boundaries.  Any further positions, might expand the polygon, or
> just fall within.  If too many fall within the same ellipse, you can
> re-evaluate and have smaller ellipse in the ellipse for heavier
> densities.  But, since they aren't moving, we aren't re-clustering every
> half an hour.
> Just another idea to think about.
>
>
>
>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20101208/c3a0df61/attachment.html>