<div>Thanks! I will check both your posts tomorrow (getting late here ...). To give a specific number, assume there are about 1 *billion* points, not a million. The total number of clusters should over around 10 millions with about 25% having a high concentration of points in them. It is not a query that will be run repeatedly (once every half hour), and the hardware can be scaled up without problem (up to a point of course ..).<br>
</div><div><br></div><div>Sébastien</div><br><div class="gmail_quote">On Wed, Dec 8, 2010 at 00:20, Kevin Neufeld <span dir="ltr"><<a href="http://kneufeld.ca">kneufeld.ca</a>@<a href="http://gmail.com">gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">On 12/7/2010 5:15 PM, <a href="mailto:pcreso@pcreso.com" target="_blank">pcreso@pcreso.com</a> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
You might look at the newish (v8.4+ ?) RECURSIVE/WITH capabilities of Postgres.<br>
<br>
Given Postgis allows you to create arbitrary buffer zones around your points, the RECURSIVE query capability allows you to write an SQL that will recursively track points which can interact with each other<br>
</blockquote>
<br></div>
So, just playing around a bit, I came up with this RECURSIVE query statement that will find all points in a particular "cluster". Every time I run the UPDATE statement, another cluster of points is identified and updated to have the same group id. I have no idea how this would perform over a table of a million points ...<br>
<br>
<br>
CREATE TABLE pts (id serial, grp int, geom geometry);<br>
CREATE SEQUENCE pts_grp_seq; -- to label cluster groups<br>
<br>
-- Sample data similiar to Sébastien's original post<br>
INSERT INTO pts (geom)<br>
SELECT (ST_Dump(<br>
'MULTIPOINT (<br>
6 129,<br>
21 200,<br>
30 66,<br>
33 109,<br>
41 137,<br>
48 191,<br>
76 119,<br>
90 197,<br>
175 212,<br>
178 230,<br>
182 53,<br>
196 209,<br>
259 124,<br>
259 148<br>
)')).geom;<br>
<br>
-- Simple wrapper around an UPDATE statement that identifies a<br>
-- point cluster everytime its run<br>
CREATE OR REPLACE FUNCTION cluster_pts() RETURNS text AS<br>
$body$<br>
DECLARE<br>
results int;<br>
BEGIN<br>
LOOP<br>
-- Assign all records in the same cluster, the same group id<br>
UPDATE pts<br>
SET grp = f.grp<br>
FROM (<br>
WITH RECURSIVE t(id, geom, grp, path, cycle) AS (<br>
-- Grab the first record that doesn't have a group id<br>
(SELECT id, geom, nextval('pts_grp_seq'), ARRAY[id], false<br>
FROM pts<br>
WHERE grp IS NULL<br>
LIMIT 1<br>
)<br>
UNION ALL<br>
-- Find all the other records that are iteratively within some<br>
-- distance to the starting point.<br>
-- This includes A to B to C paths, but not A to B to A cyclic paths<br>
SELECT <a href="http://a.id" target="_blank">a.id</a>, a.geom, b.grp, path || ARRAY[<a href="http://a.id" target="_blank">a.id</a>], <a href="http://a.id" target="_blank">a.id</a> = ANY(path)<br>
FROM pts a, t b<br>
WHERE <a href="http://a.id" target="_blank">a.id</a> != <a href="http://b.id" target="_blank">b.id</a><br>
AND ST_DWithin(a.geom, b.geom, 80)<br>
AND NOT cycle<br>
)<br>
SELECT DISTINCT id, grp FROM t<br>
) AS f<br>
WHERE <a href="http://pts.id" target="_blank">pts.id</a> = <a href="http://f.id" target="_blank">f.id</a>;<br>
<br>
-- repeat until no more updates are performed.<br>
GET DIAGNOSTICS results = ROW_COUNT;<br>
EXIT WHEN results = 0;<br>
END LOOP;<br>
<br>
RETURN 'done.';<br>
END;<br>
$body$ LANGUAGE plpgsql;<br>
<br>
SELECT cluster_pts();<br>
<br>
<br>
<br>_______________________________________________<br>
postgis-users mailing list<br>
<a href="mailto:postgis-users@postgis.refractions.net">postgis-users@postgis.refractions.net</a><br>
<a href="http://postgis.refractions.net/mailman/listinfo/postgis-users" target="_blank">http://postgis.refractions.net/mailman/listinfo/postgis-users</a><br>
<br></blockquote></div><br>