[postgis-users] Determining clusters of points
Sébastien Lorion
sl at thestrangefactory.com
Tue Dec 7 22:35:32 PST 2010
Thanks! I will check both your posts tomorrow (getting late here ...). To
give a specific number, assume there are about 1 *billion* points, not a
million. The total number of clusters should over around 10 millions with
about 25% having a high concentration of points in them. It is not a query
that will be run repeatedly (once every half hour), and the hardware can be
scaled up without problem (up to a point of course ..).
Sébastien
On Wed, Dec 8, 2010 at 00:20, Kevin Neufeld <kneufeld.ca at gmail.com> wrote:
> On 12/7/2010 5:15 PM, pcreso at pcreso.com wrote:
>
>> You might look at the newish (v8.4+ ?) RECURSIVE/WITH capabilities of
>> Postgres.
>>
>> Given Postgis allows you to create arbitrary buffer zones around your
>> points, the RECURSIVE query capability allows you to write an SQL that will
>> recursively track points which can interact with each other
>>
>
> So, just playing around a bit, I came up with this RECURSIVE query
> statement that will find all points in a particular "cluster". Every time I
> run the UPDATE statement, another cluster of points is identified and
> updated to have the same group id. I have no idea how this would perform
> over a table of a million points ...
>
>
> CREATE TABLE pts (id serial, grp int, geom geometry);
> CREATE SEQUENCE pts_grp_seq; -- to label cluster groups
>
> -- Sample data similiar to Sébastien's original post
> INSERT INTO pts (geom)
> SELECT (ST_Dump(
> 'MULTIPOINT (
> 6 129,
> 21 200,
> 30 66,
> 33 109,
> 41 137,
> 48 191,
> 76 119,
> 90 197,
> 175 212,
> 178 230,
> 182 53,
> 196 209,
> 259 124,
> 259 148
> )')).geom;
>
> -- Simple wrapper around an UPDATE statement that identifies a
> -- point cluster everytime its run
> CREATE OR REPLACE FUNCTION cluster_pts() RETURNS text AS
> $body$
> DECLARE
> results int;
> BEGIN
> LOOP
> -- Assign all records in the same cluster, the same group id
> UPDATE pts
> SET grp = f.grp
> FROM (
> WITH RECURSIVE t(id, geom, grp, path, cycle) AS (
> -- Grab the first record that doesn't have a group id
> (SELECT id, geom, nextval('pts_grp_seq'), ARRAY[id], false
> FROM pts
> WHERE grp IS NULL
> LIMIT 1
> )
> UNION ALL
> -- Find all the other records that are iteratively within some
> -- distance to the starting point.
> -- This includes A to B to C paths, but not A to B to A cyclic
> paths
> SELECT a.id, a.geom, b.grp, path || ARRAY[a.id], a.id =
> ANY(path)
> FROM pts a, t b
> WHERE a.id != b.id
> AND ST_DWithin(a.geom, b.geom, 80)
> AND NOT cycle
> )
> SELECT DISTINCT id, grp FROM t
> ) AS f
> WHERE pts.id = f.id;
>
> -- repeat until no more updates are performed.
> GET DIAGNOSTICS results = ROW_COUNT;
> EXIT WHEN results = 0;
> END LOOP;
>
> RETURN 'done.';
> END;
> $body$ LANGUAGE plpgsql;
>
> SELECT cluster_pts();
>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20101208/8879729d/attachment.html>
More information about the postgis-users
mailing list