[postgis-users] Determining clusters of points

Fri Jan 7 12:54:10 PST 2011

Happy New Year everyone!

I got time to test out the st_buffer+st_intersect solution (which I
will name "buffer") and the st_dwithin+with recursive solution
("recursive"). I also came up with another one based on nested cursor
FOR loops and st_dwithin ("cursor"). I have not tested the kmean
solution yet.

To quickly recap the problem at hand: I need to determine all the
clusters of points that are within range of each other. For example,
say the points represent vehicles with a radar: if one can be seen by
the other, then together they form a cluster. Over time, the points
are moving, so I need to be able to update the results with only the
point that moved. My data set size is about 100M+ points. One thing to
note is that I do not expect to process 100M points at once, rather it
will be a gradual process with points added over time. There may be
sharp peaks of changes thought, so speed is still very important.

Briefly, the running time of all 3 solutions *seems* to be
exponentially related to the average number of points inside a cluster
and memory does not seems to be a bottleneck for any. The "recursive"
solution has a very steep curve while the "buffer" solution scales
better, but it will also choke on semi-large data sets (1M or more).
The "cursor" solution fares much better and seems to be able to handle
any data set size in a reasonable time (1M points in about 2h30 on my
Lenovo T61p with a SSD). Once the cloud cluster is fully in place, I
will be able to test with larger data sets.

Some limitations of the "buffer" solution are that it requires
additional processing to get the points that are part of each cluster
afterward, it is rather difficult to update only a part of the points
(only those that moved) and also, I did not came up with a way to
handle points with different ranges (but did not spend a lot of time
thinking about it either).

I made only a small tweak to the "recursive" solution, so that
isolated points are quickly discarded.

For the "buffer" solution, I first create a buffer (ST_Buffer) around
all points with a radius equals to half the unit range and then
iteratively merge (ST_Union) all intersecting buffers (ST_Intersects).
This requires 2 additional tables containing temporary result
geometries. First the buffers are created in table1, then each
iteration puts the result of the merging operation in the other unused
table (truncating it first), i.e. points->table1, table1->table2,
table2->table1, and so on. The function returns the name of table
where the final result is stored.

As for the "cursor" solution, here is the algorithm:

for each ptA in points
  for each ptB in reachable points from ptA
    if ptA has no group and ptB has no group
      ptA.grp = ptB.grp = new_id()
    else if ptA has no group and ptB has a group
      ptA.grp = ptB.grp
    else if ptA has a group and ptB has no group
      ptB.grp = ptA.grp
    else if ptA.grp != ptB.grp
      update all points with a group equals to ptB.grp with ptA.grp

One might further optimize it by first dividing the points in a grid
and for the inner loop, process only the points that are in adjacent
cells of the point of origin. The size of cells should be so that
reachable points are never located 2 cells away (ie they are in
immediately adjacent cells), to ease/speed up the processing.

I am currently working on a Scala version which will allow me to add
multi-threading and offload the processing to multiple nodes instead
of being restrained only to the DB server itself if I do it in pl/sql.
With the grid optimization, it is already about an order of magnitude
faster.

I have attached the code of the 3 functions I used for testing. You
will also need to create a sequence named "seq_pts_clusterId". I also
included a data set with 100k points if you want to play around with
the fonctions. Update the "range" column to form bigger clusters.

You will notice that the "cursor" code (dwithin2) have some commented
out code. It is because I tried to update the data using the current
position of the cursors, instead of doing it via a standard WHERE
query, but I had data refresh problems, ie the outer loop cursor was
still using old data when I reached a record that was previously
updated by the inner loop. I am not sure why, I tried a couple of
different syntaxes, but to no avail. If someone has an idea, feel free
to share it ...

Sébastien
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster_points_buffer.sql
Type: application/octet-stream
Size: 2148 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20110107/c09ab9ee/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster_points_dwithin.sql
Type: application/octet-stream
Size: 1506 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20110107/c09ab9ee/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cluster_points_dwithin2.sql
Type: application/octet-stream
Size: 3186 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20110107/c09ab9ee/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pts.backup
Type: application/octet-stream
Size: 1589579 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20110107/c09ab9ee/attachment-0003.obj>