[GRASS-dev] how to determine best k in a set of unsupervised classifications?

Nikos Alexandris nik at nikosalexandris.net
Wed Oct 31 04:19:55 PDT 2018


* Veronica Andreo <veroandreo at gmail.com> [2018-10-31 00:23:57 +0100]:

>Hi devs,
>

Hi Vero,

(not a real dev, but I'll share what I think)

>I'm writing to ask how does one determine the best number of classes/clusters
>in a set of unsupervised classifications with different k in GRASS?

You already know this better than I do, I guess, but I'd like to
refresh my memory on all this a bit.

I guess the only way to tell whether a given number of classes is
"best" is to judge for yourself by inspecting the "quality" of the
clusters returned.

One way would be to compute the overall "clustering error": the sum,
over all clusters, of the (squared) distances between each point and
the center of the cluster it was assigned to.  Comparing this error
across different clustering settings (or even algorithms?) would give
an idea of how tightly the points sit around their cluster centers.
Maybe we could implement something like this.
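
For instance, a minimal sketch in Python (numpy plus scikit-learn's
KMeans, on random data just for illustration -- nothing GRASS-specific):

    import numpy as np
    from sklearn.cluster import KMeans

    def clustering_error(points, labels, centers):
        """Sum of squared distances of points to their assigned centers."""
        error = 0.0
        for k, center in enumerate(centers):
            error += np.sum((points[labels == k] - center) ** 2)
        return error

    X = np.random.rand(500, 3)  # stand-in for per-pixel band values
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # km.inertia_ is the same quantity, precomputed by scikit-learn
        print(k, clustering_error(X, km.labels_, km.cluster_centers_))

Note that the error always shrinks as k grows, so one looks for the
"elbow" where adding another cluster stops paying off, not for the
minimum itself.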

(I practiced all of this in a generic Algorithmic Thinking course.  I
guess it applies in our "domain" too.)


>I use i.cluster (which, according to the manual page, uses a modified
>version of k-means) with different numbers of classes, and then i.maxlik.
>Now, I would like to know which unsup classif is the best within the set.

Sorry, I guess I have to read up:  what is "unsup classif"?
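
Scripting the runs over different k should be straightforward, in any
case.  An untested sketch with GRASS's Python scripting library (the
group/subgroup names are placeholders; parameter names as I read them
in the i.cluster and i.maxlik manual pages):

    # one unsupervised classification per candidate k
    import grass.script as gs

    for k in range(3, 11):
        sig = "sig_k{}".format(k)
        gs.run_command("i.cluster", group="mygroup", subgroup="mysubgroup",
                       signaturefile=sig, classes=k,
                       reportfile="report_k{}.txt".format(k))
        gs.run_command("i.maxlik", group="mygroup", subgroup="mysubgroup",
                       signaturefile=sig,
                       output="unsup_k{}".format(k),
                       reject="reject_k{}".format(k))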

>I checked the i.cluster reports (looking at separability) and then explored
>the rejection maps, but none of those works as a crisp and clear indicator.
>BTW, does anyone know which separability index i.cluster uses?


I am interested in learning about the distance measure too.  Looking at
the source code of `i.cluster` and searching around, I think it's this
file:

grasstrunk/lib/cluster/c_sep.c

and we just need to identify which distance it measures.
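
For comparison, a common textbook separability measure relates the
distance between two class means to the spread of the classes.  Whether
c_sep.c computes something like this, or e.g. a transformed divergence,
is exactly what needs checking.  An illustration only -- not a reading
of the GRASS source:

    import numpy as np

    # NOT necessarily what c_sep.c computes: distance between class
    # means, normalized by the summed per-band standard deviations
    def separability(mean_a, std_a, mean_b, std_b):
        d = np.linalg.norm(np.asarray(mean_a) - np.asarray(mean_b))
        spread = np.sum(std_a) + np.sum(std_b)
        return d / spread if spread > 0 else np.inf

    # two hypothetical class signatures (per-band means and stddevs)
    print(separability([10, 20], [1, 1], [40, 55], [2, 3]))

With a measure like this, larger values mean better separated classes.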

Nikos

>In any case, I have seen indices elsewhere (mainly in R and Python) that
>are used to choose the best clustering results (coming from the same or
>from different clustering methods). Examples of those indices are
>Silhouette, Dunn, etc. Some are called internal, as they do not require
>test data and just characterize the compactness of the clusters; the ones
>requiring test data are called external. I have seen them in the dtwclust
>R package [0] (the package is oriented to time series clustering, but the
>validation indices are more general) and in scikit-learn in Python [1].
>Do any of you have something already implemented in this direction? Or
>how do you assess your unsup classification (clustering) results?
>
>Any ideas or suggestions within GRASS?
>
>Thanks much in advance!
>Vero
>
>[0] https://rdrr.io/cran/dtwclust/man/cvi.html
>[1] http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
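
P.S. The internal indices mentioned above are one-liners in
scikit-learn [1]; e.g. the silhouette score (a minimal sketch on random
data -- values range from -1 to 1, higher is better):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(500, 3)
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))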



-- 
Nikos Alexandris | Remote Sensing & Geomatics
GPG Key Fingerprint 6F9D4506F3CA28380974D31A9053534B693C4FB3 