[postgis-devel] Windowing Functions for Clustering

Daniel Baston dbaston at gmail.com
Sat Dec 19 12:36:42 PST 2015


Having just tested this out, I think the window function provides a much
better user experience.  And in hindsight, the implementation turns out to
be far simpler too, with less aggregate boilerplate.  Adding a window
version of ST_ClusterIntersecting doesn't even take 50 lines...maybe double
that with various error and null-checking branches added.

https://gist.github.com/dbaston/6f76c3462e40a69518cd

Are there any caveats we're missing?  Performance penalties, memory
consumption, anything else?

On Sat, Dec 19, 2015 at 12:23 PM, Paul Ramsey <pramsey at cleverelephant.ca>
wrote:

> Yes, the only advantage to doing a st_clusterkmeans(geometry) is
> convenience. PostGIS users would not have to install an extra package,
> and they could use it directly on geometry w/o mucking around w/
> extracting coordinates, etc.
>
> In general, do you like the window function approach?
>
> P.
>
> On Fri, Dec 18, 2015 at 4:00 PM, Paul Norman <penorman at mac.com> wrote:
> > On 12/18/2015 11:02 AM, Paul Ramsey wrote:
> >>
> >> Hey Dan,
> >> I've been reading up on the k-means cluster implementation already out
> >> there, thinking about adding one to PostGIS (makes sense, I figure)
> >> and one thing I've been trying to figure out is what the right API for
> >> a clustering function is.
> >>
> >> The k-means guy decided to do a windowing function, which I kind of
> >> like...
> >>
> >>
> https://github.com/umitanuki/kmeans-postgresql/blob/master/kmeans.c#L298
> >>
> >> So we'd put do something like,
> >>
> >>    select gid, st_clusterkmeans(geom, 4) from geotable;
> >>
> >> and get back a list of unique ids and cluster ids. If the user wanted
> >> to so something after that in terms of unioning, or collecting, or
> >> whatever, that would be up to the user to decide.
> >
> >
> > I've been working on clustering of 1d data and functions which can be
> used
> > to group data into different bins.
> >
> > The different ways of having a function like this are aggregate
> functions,
> > ordered set aggregate functions[1], and window functions[2]. Ordered-set
> > aggregates are new in PostgreSQL 9.4 and built-in ones are used for
> > computations like rank or percentile.
> >
> > Both types of aggregate functions are not well suited for this, as you'd
> > have to stuff the return value into an array and unnest it. This also
> makes
> > it difficult to do a more complicated cluster, such as cities where you
> want
> > the total population of the cluster.
> >
> > Window functions work well for clustering, but are infrequently enough
> used
> > so that I need to look up the syntax.[3]
> >
> > For kmeans, it is used as kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER
> ()
> > and returns the number of the group, allowing an outer select to perform
> > aggregates[4] like ST_Collect. A window function with the default window
> > frame and no filtering seems odd to me, but I can't point out anything
> > inherently wrong, and even the PostgreSQL examples do this,[5] and
> overall
> > the window functions seem better to me.
> >
> > As far as I can tell, postgresql-kmeans only works with points, but this
> > might be because the algorithms used don't clearly apply to areas. If
> this
> > is so, is the only advantage of of ST_Clusterkmeans(geom, N) over
> > kmeans(ARRAY[ST_X(ST_Centroid(way)), ST_Y(ST_Centroid(way))], N) that of
> > verbosity?
> >
> > Performance-wise, the postgresql-kmeans implementation is fast,
> particularly
> > compared to a SQL implementation of kmeans for the 1-d case.[6]
> >
> > [1]: http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
> > [2]: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
> > [3]:
> >
> http://www.postgresql.org/docs/9.4/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
> > [4]:
> >
> http://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis
> > [5]: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
> > [6]: https://github.com/CartoDB/cartodb-postgresql/issues/183
> >
> > _______________________________________________
> > postgis-devel mailing list
> > postgis-devel at lists.osgeo.org
> > http://lists.osgeo.org/mailman/listinfo/postgis-devel
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/postgis-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-devel/attachments/20151219/50267691/attachment.html>


More information about the postgis-devel mailing list