[postgis-users] hi, kevin, I'll explain my initial motivation for clustering.

Dan Putler dan.putler at sauder.ubc.ca
Wed Feb 10 14:09:10 PST 2010


I have to agree with Kevin. My advice would be to do the clustering in
R. Here is a link on how to read a PostGIS table into R (either via ODBC
or GDAL/OGR). I haven't done it, but my guess is that doing it via
GDAL/OGR (via the rgdal package) will be less painful since it will read
the data into an sp class object in R. Once the data is clustered you
can get its "shape" via creating a polygon by taking the convex hull of
the points.

Having done this type of thing a fair amount, I can tell you that
K-Means is really a poor choice for this type of analysis (it is biased
in two dimensional space to creating circular clusters). I would look at
a clustering algorithm called DBSCAN for this type of application. In
addition, you will want to look into cluster validation techniques to
determine the appropriate clustering parameters (e.g., the number of
clusters in K-Means or the epsilon radius used in DBSCAN).

Instead of reading the data into R, you can try using PL/R as David
suggests, but I think this is not the way to go since the ability to
determine the appropriate clustering parameters using cluster validation
methods would be very cumbersome following this route.

Dan

On Wed, 2010-02-10 at 13:44 -0800, Kevin Neufeld wrote:
> You want to extend the SQL language by adding the CLUSTERING keyword and then extend PostGIS to somehow implement the 
> clustering?
> 
> Sorry, but I think this would be nothing less than an enormous/colossal/massive undertaking.
> 
> I'm totally guessing here, but at a very high level I think the steps might be
> 
> 1. Add the keyword CLUSTERING to PostgreSQL's Parser
>       http://www.postgresql.org/docs/8.3/static/parser-stage.html
>       (the Postgresql development lists should be able to help you out here)
> 
> 2. Somehow add an operator (or set of operators) that define CLUSTERING.  As I understand it, ORDER BY using the typical 
> ordering operators (<, >, =, <=, etc) to perform it's operations.  Similiarly, GROUP BY uses the = operator.  PostGIS 
> implements these operators as function calls that operate on the bounding box of geometries.  This ultimately allows you 
> to issue a "ORDER BY" or "GROUP BY" types of queries against geometries.  I would think you would need to do something 
> similar for your CLUSTERING idea.  Create some operator[s] that only the CLUSTERING keyword understands. Again, try the 
> Postgresql development lists for help here.
> 
> 3. Dive into PostGIS source and implement the CLUSTERING operator[s] you defined in step 2, which I assume your going to 
> implement something like k-means clustering (which by itself is not trivial).
> 
> 4. Polygonize your clustered points using PostGIS.  This is the easiest as it's a single sql query using the 
> ST_ConvexHull function.
> 
> 
> Sorry to burst your enthusiasm, but I think you'd be far better off to try to implement k-means (or the clustering 
> algorithm of your choice) external to the database using your favorite language (Java, C, Python, ...).  Extract data 
> from the database, cluster, save the results back in the database, and polygonize.
> 
> -- Kevin
> 
> On 2/10/2010 12:36 PM, sunpeng wrote:
> >
> >      hi,Kevin,thanks for your help.
> >
> >      Now, I'll explain my initial motivation.
> >
> >      Suppose we have a table with:
> >      create table houses  (
> >         NAME              VARCHAR(128)                     not null,
> >      );
> >      SELECT AddGeometryColumn('houses', 'location', 4214, 'POINT', 2);
> >
> >      if we want to do clustering(in data mining environment) the houses
> > on location, that is, clustering those house near as a cluster(or a
> > group), and then calculate each cluster's shape, we can not use the
> > following sql:
> >
> >      select ST_Boundary(*)
> >      from houses
> >      group by location
> >
> >      I would like to extend the postgresql or postgis to support the
> > following sql:
> >
> >      select ST_Boundary(*)
> >      from houses
> >      clustering by location
> >
> >      in which the following steps are considered:
> >      1. cluster the houses into several goups
> >      2. calculate each goup's shape
> >
> >      I know the second step is usually related to aggregation functions
> > like sum,count,and so on which i don't care too much now. All what I
> > want to do is to add this sql semantics into postgresql or postgis and
> > then add a clustering algorithm like dbscan in the executor in
> > postgresql(or postgis?).
> >      How could I do? Any detailed steps?Like I should modify kwlist.h to
> > support "CLUSTERING" keyword, and the following steps?
> >
> >      Thanks!
> >      peng
> >
> >     ---------------------------------------------------------------------------
> >
> >     we all know
> >
> >     I'm not sure I follow.  Can you explain what exactly you want to do?
> >
> >     The following query will collect points into clusters (multipoints),
> >     clustered on a 100x100 grid.
> >
> >     -- generate a sample random point dataset
> >     CREATE TABLE points AS
> >     SELECT ST_MakePoint(random()*1000, random()*1000) AS geom
> >     FROM generate_series(1, 100000);
> >
> >     -- create point clusters
> >     SELECT st_collect(geom)
> >     FROM points
> >     GROUP BY
> >        round(st_x(geom)/100)*100,
> >        round(st_y(geom)/100)*100;
> >
> >
> >     Kevin
> >
> >
> >     On 2/4/2010 11:26 PM, sunpeng wrote:
> >      > I want to write "cluster by" instead of "group by" on geospatial
> >     point
> >      > data,should I write the code at postgresql or postgis ?
> >      > thanks
> >      >
> >      > peng
> >
> >
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at postgis.refractions.net
> > http://postgis.refractions.net/mailman/listinfo/postgis-users
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
-- 
Dan Putler
Sauder School of Business
University of British Columbia




More information about the postgis-users mailing list