<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;"><div>
<div>Hi,</div>
<div>I try to implement a simple distance based clustering in PostgreSQL.</div>
<div>It's working the following way:</div>
<div>1. loop as long as there are non clustered points</div>
<div> 2. took one point that is not clustered</div>
<div> 3. define this as a new cluster</div>
<div> 4. get all surrounding points in a defined distance and add them into the cluster</div>
<div>5. calculate the centroid for each cluster and count cluster points</div>
<div> </div>
<div>The calculation speed depends mostly on the count of points and the scale used. In the moment i'am using a python script and it took around 1sec for 1000 points for TMS zoomlevel 5 (distance = 4891.96981025m * 40 pixel) and around 3sec at zoomlevel 15 (distance = <span style="color: rgb(42, 42, 42); font-family: 'Segoe UI', 'Lucida Grande', Verdana, Arial, Helvetica, sans-serif; font-size: 13px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 18px; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; display: inline !important; float: none;">4.78m</span> * 40 pixel). Implementing it with pgsql doesn't speeds it up.</div>
<div>Is it possible to do the work completely with SQL and maybe with just one query? I'am not sure if I understood the WITH RECURSIVE concept completely...but shouldn't it by possible to replace the iteration(2.-4.) with a WITH RECURSIVE statement? Are there any other suggestion to speed it up or a better solution?</div>
<div> </div>
<div>Greetings</div>
<div> </div>
<div> </div>
<div>CREATE TABLE points<br/>
(<br/>
id_point serial PRIMARY KEY,<br/>
geom geometry(Point,3857),<br/>
cluster_nr integer<br/>
);<br/>
INSERT INTO points (geom)<br/>
SELECT ST_SetSrid(ST_MakePoint(12.5 + random()*5, 51 + random()*5), 3857)<br/>
FROM generate_series(0, 1000);</div>
<div> </div>
<div>CREATE TABLE pg_clusters<br/>
(<br/>
gid integer,<br/>
cluster_feature_ids integer[],<br/>
count integer,<br/>
geom geometry<br/>
);</div>
<div> </div>
<div>-- select first point, which isn't clustered yet<br/>
SELECT id_point<br/>
FROM points<br/>
WHERE cluster_nr IS NULL<br/>
LIMIT 1;</div>
<div>-- use first point and get all surrounding points in defined distance including the point itself</div>
<div>-- use them as first cluster<br/>
WITH first_cluster AS (<br/>
SELECT a.id_point AS cluster_nr,<br/>
b.id_point,<br/>
b.geom AS geom<br/>
FROM points a,<br/>
points b<br/>
WHERE ST_Distance(a.geom, b.geom) <= 4891.96981025 * 40<br/>
AND a.id_point = 185<br/>
AND a.cluster_nr IS NULL<br/>
)<br/>
UPDATE points AS a<br/>
SET cluster_nr = b.cluster_nr<br/>
FROM first_cluster AS b<br/>
WHERE a.id_point = b.id_point;</div>
<div>--> choose next id<br/>
--> calculate next cluster<br/>
etc.</div>
<div>-- calculate the centroid of a cluster and count cluster points<br/>
DELETE FROM pg_clusters;<br/>
WITH clusters AS (<br/>
SELECT cluster_nr, array_agg(id_point) AS feature_ids, count(id_point), ST_Centroid(ST_Collect(geom)) AS geom<br/>
FROM points<br/>
GROUP BY cluster_nr<br/>
)<br/>
INSERT INTO pg_clusters(gid, cluster_feature_ids, count, geom)<br/>
SELECT * FROM clusters;</div>
</div></div></body></html>