[GRASS-dev] how to determine best k in a set of unsupervised classifications?

Veronica Andreo veroandreo at gmail.com
Fri Nov 2 05:36:47 PDT 2018


Hi Nikos and Moritz,

Thanks for your replies and for the recipe :)

Cheers,
Vero

El mié., 31 oct. 2018 a las 22:09, Moritz Lennert (<
mlennert at club.worldonline.be>) escribió:

> On 31/10/18 12:19, Nikos Alexandris wrote:
> > * Veronica Andreo <veroandreo at gmail.com> [2018-10-31 00:23:57 +0100]:
> >
> >> Hi devs,
> >>
> >
> > Hi Vero,
> >
> > (not a real dev, but I'll share what I think)
> >
> >> I'm writing to ask how do one determine the best number of
> classes/clusters
> >> in a set of unsupervised classifications with different k in GRASS?
> >
> > You already know better than me I guess, but I'd like to refresh my mind
> > on all this a bit.
> >
> > I guess the only way to tell if the number of classes is "best", is to
> > judge yourself by inspecting the "quality" of the clusters returned.
> >
> > One way to tell would be to compute the "error of clusters" which would
> > be the overall distance between the points that are assigned to a
> > cluster and its center.  I guess comparing the overall errors between
> > different clustering settings (or even algorithms?), would give an idea
> > about how close points are around the centers of clusters.
> > Maybe we could implement something like this.
> >
> > (All this I practiced during an generic Algorithmic Thinking course.  I
> > guess it's applicable in our "domain" too.)
> >
> >
> >> I use i.cluster with different number of classes and then i.maxlik that
> uses a
> >> modified version of k-means according to the manual page. Now, I would
> like
> >> to know which unsup classif is the best within the set.
> >
> > Sorry, I guess I have to read up:  what is "unsup classif"?
> >
> >> I check the i.cluster reports (looking for separability) and then
> explored the
> >> rejection maps. But none of those seems to work as a crisp and clear
> >> indicator. BTW, does anyone know which separability index does i.cluster
> >> use?
> >
> >
> > I am interested to learn about the distance measure too.  I am looking
> > at the source code of `i.cluster`.  And then, searching around, I think
> > it's this file:
> >
> > grasstrunk/lib/cluster/c_sep.c
> >
> > and I/we just need to identify which distance it measures.
>
> i.cluster uses a simple k-means approach based on the spectral euclidean
> distance between pixels or between pixels and existing clusters. By
> including a min cluster size and a min cluster separation parameter,
> total number of clusters might change which is different from a
> classical k-means.
>
> i.cluster also works on a sample of the image pixels to define the
> clusters, so there is no guarantee that the clusters it identifies would
> be those one would find if using all pixels, but AFAIK it is generally
> reasonable close to justify the pay-off as it provides greater speed.
>
> i.maxlik does not interfere in the clustering part. It uses the
> signatures of classes provided as input (possibly the signatures of the
> clusters if the input is the output of i.cluster) to then assign each
> pixel to one of the classes. The reject map of i.maxlik allows you see
> the probability of a pixels membership in the chosen class. It does not
> really allow you to measure cluster "quality", nor ideal number of
> clusters (well you could try with many different cluster numbers and
> then chose the one where the reject map values are the lowest on average).
>
> If you want to use a very simple approach to Nikos' suggestion of
> calculating the error, you could use something like this:
>
> - For each i.cluster + i.maxlik result:
>         - For each original band
>                 - Create new pseudo band with mean values of the
>                         original band per cluster (r.stats.zonal)
>         - Calculate euclidean distance in spectral space of each pixel
>                         to its cluster (r.mapcalc):
>
>                 (pixel_value_band1 - r.stats.zonal result on band 1)^2 +
>                 (pixel_value_band2 - r.stats.zonal result on band 2)^2 +
>                 etc
>
>         - Calculate mean euclidean distance on the result of (or median,
>                 or whatever you are looking for) (r.univar)
>
> - Identify the i.cluster + i.maxlik result that reaches the best score
>
>
> Moritz
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/grass-dev/attachments/20181102/2ac192db/attachment-0001.html>


More information about the grass-dev mailing list