<div dir="ltr">Having just tested this out, I think the window function provides a much better user experience.  And in hindsight, the implementation turns out to be far simpler too, with less aggregate boilerplate.  Adding a window version of ST_ClusterIntersecting doesn't even take 50 lines...maybe double that with various error and null-checking branches added.<div><br></div><div><a href="https://gist.github.com/dbaston/6f76c3462e40a69518cd">https://gist.github.com/dbaston/6f76c3462e40a69518cd</a></div><div><br></div><div>Are there any caveats we're missing?  Performance penalties, memory consumption, anything else?</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Dec 19, 2015 at 12:23 PM, Paul Ramsey <span dir="ltr"><<a href="mailto:pramsey@cleverelephant.ca" target="_blank">pramsey@cleverelephant.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Yes, the only advantage to doing a st_clusterkmeans(geometry) is<br>

convenience. PostGIS users would not have to install an extra package,<br>

and they could use it directly on geometry w/o mucking around w/<br>

extracting coordinates, etc.<br>

<br>

In general, do you like the window function approach?<br>

<span class="HOEnZb"><font color="#888888"><br>

P.<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Fri, Dec 18, 2015 at 4:00 PM, Paul Norman <<a href="mailto:penorman@mac.com">penorman@mac.com</a>> wrote:<br>

> On 12/18/2015 11:02 AM, Paul Ramsey wrote:<br>

>><br>

>> Hey Dan,<br>

>> I've been reading up on the k-means cluster implementation already out<br>

>> there, thinking about adding one to PostGIS (makes sense, I figure)<br>

>> and one thing I've been trying to figure out is what the right API for<br>

>> a clustering function is.<br>

>><br>

>> The k-means guy decided to do a windowing function, which I kind of<br>

>> like...<br>

>><br>

>> <a href="https://github.com/umitanuki/kmeans-postgresql/blob/master/kmeans.c#L298" rel="noreferrer" target="_blank">https://github.com/umitanuki/kmeans-postgresql/blob/master/kmeans.c#L298</a><br>

>><br>

>> So we'd put do something like,<br>

>><br>

>>    select gid, st_clusterkmeans(geom, 4) from geotable;<br>

>><br>

>> and get back a list of unique ids and cluster ids. If the user wanted<br>

>> to so something after that in terms of unioning, or collecting, or<br>

>> whatever, that would be up to the user to decide.<br>

><br>

><br>

> I've been working on clustering of 1d data and functions which can be used<br>

> to group data into different bins.<br>

><br>

> The different ways of having a function like this are aggregate functions,<br>

> ordered set aggregate functions[1], and window functions[2]. Ordered-set<br>

> aggregates are new in PostgreSQL 9.4 and built-in ones are used for<br>

> computations like rank or percentile.<br>

><br>

> Both types of aggregate functions are not well suited for this, as you'd<br>

> have to stuff the return value into an array and unnest it. This also makes<br>

> it difficult to do a more complicated cluster, such as cities where you want<br>

> the total population of the cluster.<br>

><br>

> Window functions work well for clustering, but are infrequently enough used<br>

> so that I need to look up the syntax.[3]<br>

><br>

> For kmeans, it is used as kmeans(ARRAY[ST_X(geom), ST_Y(geom)], 5) OVER ()<br>

> and returns the number of the group, allowing an outer select to perform<br>

> aggregates[4] like ST_Collect. A window function with the default window<br>

> frame and no filtering seems odd to me, but I can't point out anything<br>

> inherently wrong, and even the PostgreSQL examples do this,[5] and overall<br>

> the window functions seem better to me.<br>

><br>

> As far as I can tell, postgresql-kmeans only works with points, but this<br>

> might be because the algorithms used don't clearly apply to areas. If this<br>

> is so, is the only advantage of of ST_Clusterkmeans(geom, N) over<br>

> kmeans(ARRAY[ST_X(ST_Centroid(way)), ST_Y(ST_Centroid(way))], N) that of<br>

> verbosity?<br>

><br>

> Performance-wise, the postgresql-kmeans implementation is fast, particularly<br>

> compared to a SQL implementation of kmeans for the 1-d case.[6]<br>

><br>

> [1]: <a href="http://www.postgresql.org/docs/9.4/static/functions-aggregate.html" rel="noreferrer" target="_blank">http://www.postgresql.org/docs/9.4/static/functions-aggregate.html</a><br>

> [2]: <a href="http://www.postgresql.org/docs/9.4/static/tutorial-window.html" rel="noreferrer" target="_blank">http://www.postgresql.org/docs/9.4/static/tutorial-window.html</a><br>

> [3]:<br>

> <a href="http://www.postgresql.org/docs/9.4/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS" rel="noreferrer" target="_blank">http://www.postgresql.org/docs/9.4/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS</a><br>

> [4]:<br>

> <a href="http://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis" rel="noreferrer" target="_blank">http://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis</a><br>

> [5]: <a href="http://www.postgresql.org/docs/9.4/static/tutorial-window.html" rel="noreferrer" target="_blank">http://www.postgresql.org/docs/9.4/static/tutorial-window.html</a><br>

> [6]: <a href="https://github.com/CartoDB/cartodb-postgresql/issues/183" rel="noreferrer" target="_blank">https://github.com/CartoDB/cartodb-postgresql/issues/183</a><br>

><br>

> _______________________________________________<br>

> postgis-devel mailing list<br>

> <a href="mailto:postgis-devel@lists.osgeo.org">postgis-devel@lists.osgeo.org</a><br>

> <a href="http://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">http://lists.osgeo.org/mailman/listinfo/postgis-devel</a><br>

_______________________________________________<br>

postgis-devel mailing list<br>

<a href="mailto:postgis-devel@lists.osgeo.org">postgis-devel@lists.osgeo.org</a><br>

<a href="http://lists.osgeo.org/mailman/listinfo/postgis-devel" rel="noreferrer" target="_blank">http://lists.osgeo.org/mailman/listinfo/postgis-devel</a></div></div></blockquote></div><br></div>