[postgis-devel] Windowing Functions for Clustering

Mon Dec 21 07:17:33 PST 2015

Glad to know window sounds right to you, Remi.
Using a library is good, and in any event I’m not planning on re-inventing any wheels, just copying them. I was just going to transcribe the k-means code right into PostGIS with the minimal changes necessary. Don’t suppose you know of any pure-C clustering libs? Otherwise the C++ lib would need a C shim around it, à la SFCGAL.
P

-- 
Paul Ramsey
http://cleverelephant.ca
http://postgis.net

On December 21, 2015 at 2:29:16 AM, Rémi Cura (remi.cura at gmail.com) wrote:

Hey guys, sorry to hijack,
just to give some testimony about clustering.

I use a lot of clustering / learning (I'm a PhD student).
At the beginning I used plpgsql to write functions,
then I grew tired and resorted to plpython coupled with scikit-learn.

A major missing function was connected components,
which I ended up implementing in SQL, plpgsql, and through Python (networkx).

In all cases, the functions return a table with at least (point_id, cluster_id)
(in your discussion it would be the window function style, much better imo).
For the moment the input has to be an array of features (a limitation of plpython).
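
To make the (point_id, cluster_id) output shape concrete, here is a minimal sketch of connected components in plain Python. It is not the plpgsql or networkx code mentioned above; the function name, the distance-threshold linking rule, and the sample points are all assumptions made for illustration.

```python
from math import hypot


def connected_components(points, max_dist):
    """Label points by connected component.

    points   -- dict {point_id: (x, y)} (hypothetical input format)
    max_dist -- two points are linked when their distance is <= max_dist
    Returns a list of (point_id, cluster_id) rows sorted by point_id, where
    cluster_id is the smallest point_id in the component.
    """
    parent = {pid: pid for pid in points}

    def find(pid):
        # Follow parent links to the root, halving the path as we go.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    ids = sorted(points)
    # Brute-force O(n^2) pair check; a real implementation would use a spatial index.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            (ax, ay), (bx, by) = points[a], points[b]
            if hypot(ax - bx, ay - by) <= max_dist:
                ra, rb = find(a), find(b)
                if ra != rb:
                    # Attach the larger root to the smaller one, so the root
                    # is always the smallest id in its component.
                    parent[max(ra, rb)] = min(ra, rb)
    return [(pid, find(pid)) for pid in ids]


sample = {1: (0, 0), 2: (0, 1), 3: (10, 10), 4: (10, 11), 5: (50, 50)}
rows = connected_components(sample, max_dist=1.5)
```

Each input point yields exactly one output row, which is the same shape a window function would return.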

I know I have more advanced needs than most,
and I would definitely find it very useful to have simple clustering algorithms
directly embedded within PostGIS.

But please consider that all those advanced clustering functions
already exist, work, scale well, are being maintained, and so on.
So if you want to add real clustering capabilities
(like DBSCAN, a much more advanced method than k-means),
wouldn't it be better to create a postgis-clustering extension with a wrapper
around a dedicated clustering lib (a bit like sfcgal wraps a dedicated 3D tool)?
There is of course scikit-learn in Python, but also mlpack in C++,
both with permissive licensing, sane dependencies, etc.

Cheers,
Rémi-C

2015-12-19 23:31 GMT+01:00 Paul Norman <penorman at mac.com>:
On 12/19/2015 12:36 PM, Daniel Baston wrote:

Are there any caveats we're missing?  Performance penalties, memory consumption, anything else?

I would expect a window function to be better, since it doesn't need unnest(). In practice, I'd also expect the main cause of memory or CPU use to be the algorithm having to go over all of the rows, which is the same for window functions and aggregates.

But, I haven't benchmarked any of this.
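
The difference in output shape between the two styles can be sketched in plain Python. This is only an illustration of the argument above, not PostGIS or PostgreSQL code: the function names, the toy bucketing rule standing in for a real clustering algorithm, and the sample rows are all assumptions.

```python
def cluster_aggregate(rows, key):
    """Aggregate style: all rows collapse into one nested result (a list of
    clusters, each a list of row ids), which the caller must flatten again."""
    clusters = {}
    for row_id, value in rows:
        clusters.setdefault(key(value), []).append(row_id)
    return list(clusters.values())


def cluster_window(rows, key):
    """Window style: one (row_id, cluster_id) pair per input row, in input order."""
    labels = {}
    return [(row_id, labels.setdefault(key(value), len(labels)))
            for row_id, value in rows]


def bucket(value):
    # Hypothetical stand-in for a clustering rule: bucket values by width 5.
    return int(value // 5)


rows = [(1, 0.2), (2, 9.7), (3, 0.4), (4, 9.9)]

nested = cluster_aggregate(rows, bucket)
# The "unnest" step: turn the nested result back into per-row labels.
unnested = [(row_id, cluster_id)
            for cluster_id, members in enumerate(nested)
            for row_id in members]

windowed = cluster_window(rows, bucket)
```

Both functions make a single pass over every row, so the clustering work is the same; only the aggregate form needs the extra flattening step to get back to one label per row.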

_______________________________________________
postgis-devel mailing list
postgis-devel at lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/postgis-devel
