[postgis-users] hi, kevin, I'll explain my initial motivation for clustering.

Kevin Neufeld kneufeld at refractions.net
Wed Feb 10 13:44:21 PST 2010


You want to extend the SQL language by adding the CLUSTERING keyword and then extend PostGIS to somehow implement the 
clustering?

Sorry, but I think this would be nothing less than an enormous/colossal/massive undertaking.

I'm totally guessing here, but at a very high level I think the steps might be

1. Add the keyword CLUSTERING to PostgreSQL's Parser
      http://www.postgresql.org/docs/8.3/static/parser-stage.html
      (the Postgresql development lists should be able to help you out here)

2. Somehow add an operator (or set of operators) that define CLUSTERING.  As I understand it, ORDER BY using the typical 
ordering operators (<, >, =, <=, etc) to perform it's operations.  Similiarly, GROUP BY uses the = operator.  PostGIS 
implements these operators as function calls that operate on the bounding box of geometries.  This ultimately allows you 
to issue a "ORDER BY" or "GROUP BY" types of queries against geometries.  I would think you would need to do something 
similar for your CLUSTERING idea.  Create some operator[s] that only the CLUSTERING keyword understands. Again, try the 
Postgresql development lists for help here.

3. Dive into PostGIS source and implement the CLUSTERING operator[s] you defined in step 2, which I assume your going to 
implement something like k-means clustering (which by itself is not trivial).

4. Polygonize your clustered points using PostGIS.  This is the easiest as it's a single sql query using the 
ST_ConvexHull function.


Sorry to burst your enthusiasm, but I think you'd be far better off to try to implement k-means (or the clustering 
algorithm of your choice) external to the database using your favorite language (Java, C, Python, ...).  Extract data 
from the database, cluster, save the results back in the database, and polygonize.

-- Kevin

On 2/10/2010 12:36 PM, sunpeng wrote:
>
>      hi,Kevin,thanks for your help.
>
>      Now, I'll explain my initial motivation.
>
>      Suppose we have a table with:
>      create table houses  (
>         NAME              VARCHAR(128)                     not null,
>      );
>      SELECT AddGeometryColumn('houses', 'location', 4214, 'POINT', 2);
>
>      if we want to do clustering(in data mining environment) the houses
> on location, that is, clustering those house near as a cluster(or a
> group), and then calculate each cluster's shape, we can not use the
> following sql:
>
>      select ST_Boundary(*)
>      from houses
>      group by location
>
>      I would like to extend the postgresql or postgis to support the
> following sql:
>
>      select ST_Boundary(*)
>      from houses
>      clustering by location
>
>      in which the following steps are considered:
>      1. cluster the houses into several goups
>      2. calculate each goup's shape
>
>      I know the second step is usually related to aggregation functions
> like sum,count,and so on which i don't care too much now. All what I
> want to do is to add this sql semantics into postgresql or postgis and
> then add a clustering algorithm like dbscan in the executor in
> postgresql(or postgis?).
>      How could I do? Any detailed steps?Like I should modify kwlist.h to
> support "CLUSTERING" keyword, and the following steps?
>
>      Thanks!
>      peng
>
>     ---------------------------------------------------------------------------
>
>     we all know
>
>     I'm not sure I follow.  Can you explain what exactly you want to do?
>
>     The following query will collect points into clusters (multipoints),
>     clustered on a 100x100 grid.
>
>     -- generate a sample random point dataset
>     CREATE TABLE points AS
>     SELECT ST_MakePoint(random()*1000, random()*1000) AS geom
>     FROM generate_series(1, 100000);
>
>     -- create point clusters
>     SELECT st_collect(geom)
>     FROM points
>     GROUP BY
>        round(st_x(geom)/100)*100,
>        round(st_y(geom)/100)*100;
>
>
>     Kevin
>
>
>     On 2/4/2010 11:26 PM, sunpeng wrote:
>      > I want to write "cluster by" instead of "group by" on geospatial
>     point
>      > data,should I write the code at postgresql or postgis ?
>      > thanks
>      >
>      > peng
>
>
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users



More information about the postgis-users mailing list