[postgis-users] Determining clusters of points
sl at thestrangefactory.com
Wed Dec 8 19:46:09 PST 2010
I forgot to mention that to optimize the processing (to which point, I am
not sure), I will certainly mark the points that moved and try to
recalculate clusters only for those.
On Wed, Dec 8, 2010 at 22:27, Sébastien Lorion <sl at thestrangefactory.com>wrote:
> Hello Steve,
> Thanks, I will check that solution as well.
> I can add one more property of my use case: consider that not only the
> points are moving without predictable direction, but that part of the
> processing (the clustering) is about knowing if they can detect each other
> as if they had a radar or some other similar instrument. So using ellipses
> or rectangles would have a very big effect on my calculations and is
> unacceptable in my scenario.
> On Wed, Dec 8, 2010 at 21:41, Steve <steve at subwest.com> wrote:
>> On 12/8/2010 1:35 AM, Sébastien Lorion wrote:
>> > To give a specific number, assume there are about 1 *billion* points,
>> > not a million. The total number of clusters should over around 10
>> > millions with about 25% having a high concentration of points in them.
>> > It is not a query that will be run repeatedly (once every half hour),
>> > and the hardware can be scaled up without problem (up to a point of
>> > course ..).
>> > Sébastien
>> You can roll your own kmeans directly in the database (this is probably
>> what you would have done with R), but with that many clusters, it might
>> take many cycles to converge.
>> Here is some pseudo-code that will do kmeans, but you would have to
>> implement the function/query to return the closest cluster.
>> > create table tempdata (
>> > position_id integer,
>> > pos geometry,
>> > cluster int
>> > );
>> > create table centriods (
>> > cluster_id int,
>> > pos geometry,
>> > x double precision,
>> > y double precision
>> > );
>> > CREATE OR REPLACE FUNCTION kmeans()
>> > RETURNS void AS
>> > $BODY$
>> > DECLARE
>> > row_changed integer;
>> > cycles integer;
>> > BEGIN
>> > delete from centriods;
>> > //put in x random points...
>> > for i in 1..x LOOP
>> > insert into centriods values (i, POINT);
>> > row_changed := 1;
>> > cycles := 0;
>> > while (row_changed > 0) LOOP
>> > cycles := cycles + 1;
>> > --Map ids to cluster
>> > update tempdata set cluster = closestid(cost); --Need a function
>> > to find closest point
>> > --build new clusters
>> > update centriods set pos = geomfromtext('POINT ('||avgy||'
>> > FROM (select avg(x) as avgx, avg(y) as avgy,cluster from tempdata
>> > group by cluster) as foo
>> > where id = foo.cluster and (x <> avgx and y <> avgy);
>> > GET DIAGNOSTICS row_changed = ROW_COUNT;
>> > RAISE NOTICE 'Changed %',row_changed;
>> > END LOOP;
>> > RAISE NOTICE 'Cycles %',cycles;
>> > END; $BODY$
>> > LANGUAGE 'plpgsql' VOLATILE
>> > COST 100;
>> But... if you are really handling that much data (like vehicle traffic),
>> you probably don't want to use circles anyway. Every new posit (unless
>> it is in the exact center) will cause the cluster to move & need to be
>> Ideally, you might want to consider using ellipses. Not necessarily the
>> postgis definition of an ellipse, but a polygon that can wrap close
>> positions with buffered rounding. It helps if you have some direction
>> with positions. Under this design, each ellipse should expand to
>> natural boundaries. Any further positions, might expand the polygon, or
>> just fall within. If too many fall within the same ellipse, you can
>> re-evaluate and have smaller ellipse in the ellipse for heavier
>> densities. But, since they aren't moving, we aren't re-clustering every
>> half an hour.
>> Just another idea to think about.
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the postgis-users