[postgis-users] Determining clusters of points

Tue Dec 7 22:22:04 PST 2010

Well, unfortunately I cannot really talk about the specific details of what
I am trying to do, but I will try to give more information though.

I have 2 master DB, one for GIS objects, another for data, both with a
failover hot standby (used for replication to a highly distributed read-only
slave cluster). The data master is sharded by the PK modulo of a main
entity. My plan is(was) to shard the GIS DB the same way, but scaling up is
also possible. If I go with the buffer merging solution, it appears to me
that I will have no choice but to scale up. If there is a solution that
allows me to still scale out instead, it would have a strong point in its
favor.

For the duration of the task at hand, I need to lock the rows on the GIS DB
(ie repeatable reads isolation level at least). I have control on the
frequency of writes done on the GIS DB, by way of a scheduler that runs jobs
at fixed intervals, but that said, the task should not take more than 20-25
min. to allow me to run it every 30 min.

Concerning a recursive query, I only need one clustering level for my use
case, but that said, the idea will prove useful when users do zoom out
(those queries are done on a read-only slave cluster, so no real bottleneck
there). Same with the geomunion suggestion: the actual density does not
really concern me, except to help me finding the clusters. If a probability
function gives me 99.x% success rate with a really good speed, I can deal
with the error margin.

I think it is important to mention that the data points cover the whole
world, so I am using the new geography types. I really don't want to have to
handle projections ...

My use case has this particularity: if a point does not interact with any
other, I can ignore it totally. I am not aggregating data, I am finding all
zones of interaction which later are processed individually and
independently from each other, but still within the same original GIS DB
transaction. The row locking I was talking before is first done on the GIS
DB, then after I am done finding the zones and processing them all, then I
start another transaction on the data DB and do the updates on both DB and
commit both transactions. I could do finer grained transactions (say one for
each zone I found), but it would not really help with the time limit and
since I have control on the updates done to the GIS DB, concurrency is not a
concern.

I hope this additional info helps a bit more ... It is a bit hard to give
details without going into the specifics ;)

Sébastien

On Tue, Dec 7, 2010 at 20:15, <pcreso at pcreso.com> wrote:

> Hi Sébastien,
>
> Sounds like an interesting situation :-)
>
> You might look at the newish (v8.4+ ?) RECURSIVE/WITH capabilities of
> Postgres.
>
> Given Postgis allows you to create arbitrary buffer zones around your
> points, the RECURSIVE query capability allows you to write an SQL that will
> recursively track points which can interact with each other, so you should
> be able to
>
> select points which interact with points which interact with points
> which....
> (ie: creates an "interaction cluster")
>
> by using a Postgis buffer or a distance expression to represent "interact
> with". Or a similar effect to a predetermined level with subqueries in an
> SQL & cut-and-paste.
>
> However I'm not sure this will give the performance you need, and an
> in-memory tool like R may be more appropriate than a disk based RDBMS
> solution.
>
> A bit left field, but, depending on your use case, you could generate
> polygons for an "immediate neighbour" interaction as the distinct geomunion
> of all the point buffers which intersect with each other, then for each,
> select the count() of points contained in each to "rank" the clusters, or
> the count/area to estimate density within clusters, or assign points to
> clusters or vice versa.
>
> There are lots of things you can do with Postgis in this arena, you need to
> really clarify the actual use case to see if it can be done in SQL, how it
> could be done and if it should be done in SQL :-)
>
>
>
> Cheers,
>
>
>   Brent Wood
>
> --- On *Wed, 12/8/10, Sébastien Lorion <sl at thestrangefactory.com>* wrote:
>
>
> From: Sébastien Lorion <sl at thestrangefactory.com>
> Subject: Re: [postgis-users] Determining clusters of points
> To: pcreso at pcreso.com
> Cc: "PostGIS Users Discussion" <postgis-users at postgis.refractions.net>
> Date: Wednesday, December 8, 2010, 10:53 AM
>
>
> Hello Brent,
>
> Thanks for your help also. I saw this method, but I cannot use it in my
> scenario as it is too imprecise.
>
> Consider all points as agents that interact with each other, each up to a
> certain range. I need to find interaction clusters and cannot afford to have
> those split in many parts, though it is ok to miss including some points.
>
> Sébastien
>
> On Tue, Dec 7, 2010 at 15:06, <pcreso at pcreso.com<http://mc/compose?to=pcreso@pcreso.com>
> > wrote:
>
>  Hi Sébastien ,
>
> One way I "clustered" points in the (distant) past for a zoom-layer mapping
> solution was to create multiple geometries, by reducing the number of
> decimal points in the X & Y coords and grouping by these locations.
> Simplistic & not especially statistical, but given a random point
> distribution you get an order of magnitude reduction in the number of
> positions of points at each step. With a non-random distribution, the
> reduction in number of these clusters is often greater.
>
> I have also done a similar thing by creating grids of square cells of
> appropriate sizes, then assigning points to the cells using Postgis queries,
> which allows a regular clustered sampling of the points by grid cell.
>
> If you are after a more statistical approach, I'd look at using R, and once
> you have your R clustering capability coded, perhaps look at PL/R to embed
> the clustering capability within your database.
>
> What isn't clear from your question is whether the clustering is something
> done often, perhaps at various resolutions, or is done once, to assign
> points to clusters, then used as these groups from then on. This would have
> an impact on performance & the approach you might take.
>
> Also note that with both Postgis & R, any use of multiple cores or hardware
> clusters will require some work, as these are not applications that parallel
> process particularly well, but you could perhaps develop code that assigns
> individual analyses of data clusters to a load balanced hardware cluster.
>
> If you have a clusterin algorithm, it may also be feaqsibl to implement it
> as a native user defined Postgis/Postgres function, but I think PL/R is an
> easier approach for embedded SQL clustering capability.
>
>
> HTH,
>
> Brent Wood
>
> --- On *Wed, 12/8/10, Sébastien Lorion <sl at thestrangefactory.com<http://mc/compose?to=sl@thestrangefactory.com>
> >* wrote:
>
>
> From: Sébastien Lorion <sl at thestrangefactory.com<http://mc/compose?to=sl@thestrangefactory.com>
> >
> Subject: Re: [postgis-users] Determining clusters of points
> To: "PostGIS Users Discussion" <postgis-users at postgis.refractions.net<http://mc/compose?to=postgis-users@postgis.refractions.net>
> >
> Date: Wednesday, December 8, 2010, 8:03 AM
>
>
> I like your idea very much, especially since something I did not say is
> that the points have a range attribute which determines how far they can
> interact with other points. So the circle buffer you talk about would have a
> diameter equals to the range of each point.
>
> What would be fastest as the last step : bounding box, minimum bounding
> circle or minimum convex hull ? I am guessing the BB, but is the difference
> significant enough that one should be chosen over another ?
>
> Sébastien
>
> On Tue, Dec 7, 2010 at 13:42, Emilie Laffray <emilie.laffray at gmail.com<http://mc/compose?to=emilie.laffray@gmail.com>
> > wrote:
>
>
>
> On 7 December 2010 17:01, Sébastien Lorion <sl at thestrangefactory.com<http://mc/compose?to=sl@thestrangefactory.com>
> > wrote:
>
> Hello,
>
> I am trying to find an efficient way to find clusters of points as shown in
> the attached image. The only clustering criteria is the distance between the
> points. The dataset can be very large (millions of points) and point
> distribution is mostly clustered with some sparse points in the gaps.
>
> I searched the net and this mailing list and found two promising solution
> paths:
>
> - use a statistical tools such as R with a density function (
> http://www.r-project.org)
> - use a clustering algorithm like those explained here
> http://www.med.nyu.edu/biostatistics/people/Ilana%20Belitskaya-Levy/Courses/MAS/Handouts/clustering.pdf(agnes seems the most promising for my purposes)
>
> I would like your advice to help me find which approach would be best
> suited with PostGIS (maybe there is even something already made that I can
> use?). Whatever solution I pick, it must be efficient and the workload must
> be able to be distributed on a cluster of commodity hardware.
>
> I am new to GIS and this mailing list, so please excuse me if I am not
> using the right vocabulary.
>
> Thank you very much!
>
>
> Hello,
>
> Some time ago I have worked on something similar, except that I was using
> circles instead of boxes which should not be a problem. I am just giving the
> logic as I don't have access to the code right now.
> You can start by creating a buffer around each of your points of the
> distance you want.
> The next step is to create an UNION of all the buffers that intersect.
> You get the list of points included in each of the resulting polygons and
> then you create either a bounding box around them or use a minimum bounding
> circle (Postgis 1.5 and above).
>
> Emily Laffray
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net<http://mc/compose?to=postgis-users@postgis.refractions.net>
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
>
> -----Inline Attachment Follows-----
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net<http://mc/compose?to=postgis-users@postgis.refractions.net>
> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20101208/52895d21/attachment.html>