[postgis-devel] Windowing Functions for Clustering

Daniel Baston dbaston at gmail.com
Wed Dec 23 07:01:59 PST 2015


I posted a branch with window implementations of ST_ClusterIntersecting and
ST_ClusterWithin in case anyone wants to test out the API.  I didn't do any
serious performance testing, but if anything, it seems like these versions
are 5-10% faster than the aggregate versions.

https://github.com/postgis/postgis/pull/81
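
For anyone comparing the two API styles, the difference in output shape can be sketched outside the database. This is only an illustration in Python, not the code in the branch: `cluster_ids` and `clusters_as_groups` are hypothetical names, plain (x, y) points stand in for geometries, and a simple distance threshold is assumed as a stand-in for the ST_ClusterWithin test.

```python
from math import hypot


def cluster_ids(points, max_dist):
    """Window-style result: one cluster id per input row, in input order.

    Hypothetical helper; clusters are connected components of the graph
    linking points that lie within max_dist of each other.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # The pairwise scan touches every row, whichever output style is used.
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if hypot(xi - xj, yi - yj) <= max_dist:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Renumber roots to dense ids in order of first appearance.
    dense = {}
    return [dense.setdefault(find(i), len(dense)) for i in range(len(points))]


def clusters_as_groups(points, max_dist):
    """Aggregate-style result: a list of groups the caller must unnest."""
    groups = {}
    for point, cid in zip(points, cluster_ids(points, max_dist)):
        groups.setdefault(cid, []).append(point)
    return list(groups.values())


points = [(0, 0), (0, 1), (10, 10), (0, 2), (10, 11)]
ids = cluster_ids(points, 1.5)
groups = clusters_as_groups(points, 1.5)
```

In the window style each row keeps its own identity and simply gains a cluster id, while the aggregate style hands back grouped collections that have to be taken apart again to get back to individual rows.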

Dan

On Mon, Dec 21, 2015 at 10:17 AM, Paul Ramsey <pramsey at cleverelephant.ca>
wrote:

> Glad to know window sounds right to you, Remi.
> Using a library is good, and in any event I’m not planning on re-inventing
> any wheels, just copying them. Was just going to transcribe the k-means
> code right into PostGIS w/ the minimal changes necessary. Don’t suppose you
> know any pure-C clustering libs? Otherwise the cpp lib would need a C shim
> around it, à la sfcgal.
> P
>
> --
> Paul Ramsey
> http://cleverelephant.ca
> http://postgis.net
>
> On December 21, 2015 at 2:29:16 AM, Rémi Cura (remi.cura at gmail.com) wrote:
>
> Hey guys sorry to hijack,
> just to give a testimony about clustering.
>
> I use a lot of clustering / learning (I'm a PhD student).
> At the beginning I used plpgsql to write functions,
> then I grew tired and resorted to plpython coupled with scikit-learn
> <http://scikit-learn.org/stable/modules/clustering.html#clustering>.
>
> A major missing function was connected components,
> which I ended up implementing in sql, plpgsql, and through python
> (networkx).
>
>
> In all cases, the functions return a table with at least
> (point_id, cluster_id)
> (in your discussion that would be the window function style, much better
> imo).
> For the moment the input has to be an array of features (a limitation of
> plpython).
>
> I know I have more advanced needs than most,
> and I would definitely find it very useful to have simple clustering
> algorithms
> directly embedded within PostGIS.
>
> But please consider that all those advanced clustering functions
> already exist, work, scale well, are being maintained, and so on.
> So if you want to add real clustering capabilities
> (like DBSCAN, a much more advanced method than k-means),
> wouldn't it be better to create a postgis-clustering extension with a wrapper
> around a dedicated clustering lib (a bit like sfcgal wraps a dedicated 3D
> tool)?
> There is of course scikit-learn
> <http://scikit-learn.org/stable/modules/clustering.html#clustering> in
> python, but also mlpack <https://github.com/mlpack/mlpack> in cpp,
> both with permissive licensing, sane dependencies, etc.
>
>
> Cheers,
> Rémi-C
>
>
>
>
> 2015-12-19 23:31 GMT+01:00 Paul Norman <penorman at mac.com>:
>
>> On 12/19/2015 12:36 PM, Daniel Baston wrote:
>>
>>>
>>> Are there any caveats we're missing?  Performance penalties, memory
>>> consumption, anything else?
>>>
>>
>> I would expect a window function to be better, since it doesn't need
>> unnest(). I'd also expect that in practice the main cost in memory or CPU
>> is the algorithm having to go over all of the rows, which is the same for
>> window functions and aggregates.
>>
>> But, I haven't benchmarked any of this.
>>
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at lists.osgeo.org
>> http://lists.osgeo.org/mailman/listinfo/postgis-devel
>>