[postgis-devel] Windowing Functions for Clustering

Sat Dec 19 09:23:14 PST 2015

Yes, the only advantage to doing a st_clusterkmeans(geometry) is
convenience. PostGIS users would not have to install an extra package,
and they could use it directly on geometry w/o mucking around w/
extracting coordinates, etc.

In general, do you like the window function approach?

P.

On Fri, Dec 18, 2015 at 4:00 PM, Paul Norman <penorman at mac.com> wrote:
> On 12/18/2015 11:02 AM, Paul Ramsey wrote:
>>
>> Hey Dan,
>> I've been reading up on the k-means cluster implementation already out
>> there, thinking about adding one to PostGIS (makes sense, I figure)
>> and one thing I've been trying to figure out is what the right API for
>> a clustering function is.
>>
>> The k-means guy decided to do a windowing function, which I kind of
>> like...
>>
>> https://github.com/umitanuki/kmeans-postgresql/blob/master/kmeans.c#L298
>>
>> So we'd put do something like,
>>
>>    select gid, st_clusterkmeans(geom, 4) from geotable;
>>
>> and get back a list of unique ids and cluster ids. If the user wanted
>> to so something after that in terms of unioning, or collecting, or
>> whatever, that would be up to the user to decide.
>
>
> I've been working on clustering of 1d data and functions which can be used
> to group data into different bins.
>
> The different ways of having a function like this are aggregate functions,
> ordered set aggregate functions[1], and window functions[2]. Ordered-set
> aggregates are new in PostgreSQL 9.4 and built-in ones are used for
> computations like rank or percentile.
>
> Both types of aggregate functions are not well suited for this, as you'd
> have to stuff the return value into an array and unnest it. This also makes
> it difficult to do a more complicated cluster, such as cities where you want
> the total population of the cluster.
>
> Window functions work well for clustering, but are infrequently enough used
> so that I need to look up the syntax.[3]
>
> For kmeans, it is used as kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER ()
> and returns the number of the group, allowing an outer select to perform
> aggregates[4] like ST_Collect. A window function with the default window
> frame and no filtering seems odd to me, but I can't point out anything
> inherently wrong, and even the PostgreSQL examples do this,[5] and overall
> the window functions seem better to me.
>
> As far as I can tell, postgresql-kmeans only works with points, but this
> might be because the algorithms used don't clearly apply to areas. If this
> is so, is the only advantage of of ST_Clusterkmeans(geom, N) over
> kmeans(ARRAY[ST_X(ST_Centroid(way)), ST_Y(ST_Centroid(way))], N) that of
> verbosity?
>
> Performance-wise, the postgresql-kmeans implementation is fast, particularly
> compared to a SQL implementation of kmeans for the 1-d case.[6]
>
> [1]: http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
> [2]: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
> [3]:
> http://www.postgresql.org/docs/9.4/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS
> [4]:
> http://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis
> [5]: http://www.postgresql.org/docs/9.4/static/tutorial-window.html
> [6]: https://github.com/CartoDB/cartodb-postgresql/issues/183
>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/postgis-devel