[postgis-users] Determining clusters of points

Sébastien Lorion sl at thestrangefactory.com
Wed Dec 8 19:46:09 PST 2010


I forgot to mention that to optimize the processing (to which point, I am
not sure), I will certainly mark the points that moved and try to
recalculate clusters only for those.

Sébastien

On Wed, Dec 8, 2010 at 22:27, Sébastien Lorion <sl at thestrangefactory.com>wrote:

> Hello Steve,
>
> Thanks, I will check that solution as well.
>
> I can add one more property of my use case: consider that not only the
> points are moving without predictable direction, but that part of the
> processing (the clustering) is about knowing if they can detect each other
> as if they had a radar or some other similar instrument. So using ellipses
> or rectangles would have a very big effect on my calculations and is
> unacceptable in my scenario.
>
> Sébastien
>
> On Wed, Dec 8, 2010 at 21:41, Steve <steve at subwest.com> wrote:
>
>> On 12/8/2010 1:35 AM, Sébastien Lorion wrote:
>> > To give a specific number, assume there are about 1 *billion* points,
>> > not a million. The total number of clusters should over around 10
>> > millions with about 25% having a high concentration of points in them.
>> > It is not a query that will be run repeatedly (once every half hour),
>> > and the hardware can be scaled up without problem (up to a point of
>> > course ..).
>> >
>> > Sébastien
>>
>> You can roll your own kmeans directly in the database (this is probably
>> what you would have done with R), but with that many clusters, it might
>> take many cycles to converge.
>>
>> Here is some pseudo-code that will do kmeans, but you would have to
>> implement the function/query to return the closest cluster.
>>
>>
>> > create table tempdata (
>> >   position_id integer,
>> >   pos geometry,
>> >   cluster int
>> > );
>> >
>> > create table centriods (
>> >   cluster_id int,
>> >   pos geometry,
>> >   x double precision,
>> >   y double precision
>> > );
>> >
>> > CREATE OR REPLACE FUNCTION kmeans()
>> >   RETURNS void AS
>> > $BODY$
>> > DECLARE
>> >   row_changed integer;
>> >   cycles integer;
>> > BEGIN
>> >   delete from centriods;
>> >   //put in x random points...
>> >   for i in 1..x LOOP
>> >   insert into centriods values (i, POINT);
>> >
>> >   row_changed := 1;
>> >   cycles := 0;
>> >   while (row_changed > 0) LOOP
>> >     cycles := cycles + 1;
>> >     --Map ids to cluster
>> >     update tempdata set cluster = closestid(cost);  --Need a function
>> > to find closest point
>> >
>> >     --build new clusters
>> >     update centriods set pos = geomfromtext('POINT ('||avgy||'
>> '||avgx')')
>> >     FROM (select avg(x) as avgx, avg(y) as avgy,cluster from tempdata
>> > group by cluster) as foo
>> >     where id = foo.cluster and (x <> avgx and y <> avgy);
>> >
>> >     GET DIAGNOSTICS row_changed = ROW_COUNT;
>> >     RAISE NOTICE 'Changed %',row_changed;
>> >   END LOOP;
>> > RAISE NOTICE 'Cycles %',cycles;
>> > END; $BODY$
>> >   LANGUAGE 'plpgsql' VOLATILE
>> >   COST 100;
>>
>> But... if you are really handling that much data (like vehicle traffic),
>> you probably don't want to use circles anyway.  Every new posit (unless
>> it is in the exact center) will cause the cluster to move & need to be
>> reevaluated.
>>
>> Ideally, you might want to consider using ellipses. Not necessarily the
>> postgis definition of an ellipse, but a polygon that can wrap close
>> positions with buffered rounding.  It helps if you have some direction
>> with positions.  Under this design, each ellipse should expand to
>> natural boundaries.  Any further positions, might expand the polygon, or
>> just fall within.  If too many fall within the same ellipse, you can
>> re-evaluate and have smaller ellipse in the ellipse for heavier
>> densities.  But, since they aren't moving, we aren't re-clustering every
>> half an hour.
>> Just another idea to think about.
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20101208/8f3ca712/attachment.html>


More information about the postgis-users mailing list