[postgis-users] Determining clusters of points

Fri Jan 7 13:02:30 PST 2011

Sorry, I forgot to also include the SQL for the 2 temporary tables
used by the cluster_points_buffer.sql script.

create table clusters1(id serial, geom geometry);
create index clusters1_geom_gist on clusters1 using gist(geom);

create table clusters2(id serial, geom geometry);
create index clusters2_geom_gist on clusters2 using gist(geom);

Sébastien

On Fri, Jan 7, 2011 at 15:54, Sébastien Lorion <sl at thestrangefactory.com> wrote:
> Happy New Year everyone!
>
> I got time to test out the st_buffer+st_intersect solution (which I
> will name "buffer") and the st_dwithin+with recursive solution
> ("recursive"). I also came up with another one based on nested cursor
> FOR loops and st_dwithin ("cursor"). I have not tested the kmean
> solution yet.
>
> To quickly recap the problem at hand: I need to determine all the
> clusters of points that are within range of each other. For example,
> say the points represent vehicles with a radar: if one can be seen by
> the other, then together they form a cluster. Over time, the points
> are moving, so I need to be able to update the results with only the
> point that moved. My data set size is about 100M+ points. One thing to
> note is that I do not expect to process 100M points at once, rather it
> will be a gradual process with points added over time. There may be
> sharp peaks of changes thought, so speed is still very important.
>
> Briefly, the running time of all 3 solutions *seems* to be
> exponentially related to the average number of points inside a cluster
> and memory does not seems to be a bottleneck for any. The "recursive"
> solution has a very steep curve while the "buffer" solution scales
> better, but it will also choke on semi-large data sets (1M or more).
> The "cursor" solution fares much better and seems to be able to handle
> any data set size in a reasonable time (1M points in about 2h30 on my
> Lenovo T61p with a SSD). Once the cloud cluster is fully in place, I
> will be able to test with larger data sets.
>
> Some limitations of the "buffer" solution are that it requires
> additional processing to get the points that are part of each cluster
> afterward, it is rather difficult to update only a part of the points
> (only those that moved) and also, I did not came up with a way to
> handle points with different ranges (but did not spend a lot of time
> thinking about it either).
>
> I made only a small tweak to the "recursive" solution, so that
> isolated points are quickly discarded.
>
> For the "buffer" solution, I first create a buffer (ST_Buffer) around
> all points with a radius equals to half the unit range and then
> iteratively merge (ST_Union) all intersecting buffers (ST_Intersects).
> This requires 2 additional tables containing temporary result
> geometries. First the buffers are created in table1, then each
> iteration puts the result of the merging operation in the other unused
> table (truncating it first), i.e. points->table1, table1->table2,
> table2->table1, and so on. The function returns the name of table
> where the final result is stored.
>
> As for the "cursor" solution, here is the algorithm:
>
> for each ptA in points
>  for each ptB in reachable points from ptA
>    if ptA has no group and ptB has no group
>      ptA.grp = ptB.grp = new_id()
>    else if ptA has no group and ptB has a group
>      ptA.grp = ptB.grp
>    else if ptA has a group and ptB has no group
>      ptB.grp = ptA.grp
>    else if ptA.grp != ptB.grp
>      update all points with a group equals to ptB.grp with ptA.grp
>
> One might further optimize it by first dividing the points in a grid
> and for the inner loop, process only the points that are in adjacent
> cells of the point of origin. The size of cells should be so that
> reachable points are never located 2 cells away (ie they are in
> immediately adjacent cells), to ease/speed up the processing.
>
> I am currently working on a Scala version which will allow me to add
> multi-threading and offload the processing to multiple nodes instead
> of being restrained only to the DB server itself if I do it in pl/sql.
> With the grid optimization, it is already about an order of magnitude
> faster.
>
> I have attached the code of the 3 functions I used for testing. You
> will also need to create a sequence named "seq_pts_clusterId". I also
> included a data set with 100k points if you want to play around with
> the fonctions. Update the "range" column to form bigger clusters.
>
> You will notice that the "cursor" code (dwithin2) have some commented
> out code. It is because I tried to update the data using the current
> position of the cursors, instead of doing it via a standard WHERE
> query, but I had data refresh problems, ie the outer loop cursor was
> still using old data when I reached a record that was previously
> updated by the inner loop. I am not sure why, I tried a couple of
> different syntaxes, but to no avail. If someone has an idea, feel free
> to share it ...
>
> Sébastien
>