[postgis-devel] Windowing Functions for Clustering

Paul Norman penorman at mac.com
Fri Dec 18 16:00:51 PST 2015


On 12/18/2015 11:02 AM, Paul Ramsey wrote:
> Hey Dan,
> I've been reading up on the k-means cluster implementation already out
> there, thinking about adding one to PostGIS (makes sense, I figure)
> and one thing I've been trying to figure out is what the right API for
> a clustering function is.
>
> The k-means guy decided to do a windowing function, which I kind of like...
>
> https://github.com/umitanuki/kmeans-postgresql/blob/master/kmeans.c#L298
>
> So we'd put do something like,
>
>    select gid, st_clusterkmeans(geom, 4) from geotable;
>
> and get back a list of unique ids and cluster ids. If the user wanted
> to so something after that in terms of unioning, or collecting, or
> whatever, that would be up to the user to decide.

I've been working on clustering of 1d data and functions which can be 
used to group data into different bins.

The different ways of having a function like this are aggregate 
functions, ordered set aggregate functions[1], and window functions[2]. 
Ordered-set aggregates are new in PostgreSQL 9.4 and built-in ones are 
used for computations like rank or percentile.

Both types of aggregate functions are not well suited for this, as you'd 
have to stuff the return value into an array and unnest it. This also 
makes it difficult to do a more complicated cluster, such as cities 
where you want the total population of the cluster.

Window functions work well for clustering, but are infrequently enough 
used so that I need to look up the syntax.[3]

For kmeans, it is used as kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER 
() and returns the number of the group, allowing an outer select to 
perform aggregates[4] like ST_Collect. A window function with the 
default window frame and no filtering seems odd to me, but I can't point 
out anything inherently wrong, and even the PostgreSQL examples do 
this,[5] and overall the window functions seem better to me.

As far as I can tell, postgresql-kmeans only works with points, but this 
might be because the algorithms used don't clearly apply to areas. If 
this is so, is the only advantage of of ST_Clusterkmeans(geom, N) over 
kmeans(ARRAY[ST_X(ST_Centroid(way)), ST_Y(ST_Centroid(way))], N) that of 
verbosity?

Performance-wise, the postgresql-kmeans implementation is fast, 
particularly compared to a SQL implementation of kmeans for the 1-d case.[6]

[1]: http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
[2]: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
[3]: 
http://www.postgresql.org/docs/9.4/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
[4]: 
http://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis
[5]: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
[6]: https://github.com/CartoDB/cartodb-postgresql/issues/183



More information about the postgis-devel mailing list