[GRASS-dev] how to determine best k in a set of unsupervised classifications?

Wed Oct 31 08:44:09 PDT 2018

On 31/10/18 12:19, Nikos Alexandris wrote:
> * Veronica Andreo <veroandreo at gmail.com> [2018-10-31 00:23:57 +0100]:
> 
>> Hi devs,
>>
> 
> Hi Vero,
> 
> (not a real dev, but I'll share what I think)
> 
>> I'm writing to ask how do one determine the best number of classes/clusters
>> in a set of unsupervised classifications with different k in GRASS?
> 
> You already know better than me I guess, but I'd like to refresh my mind
> on all this a bit.
> 
> I guess the only way to tell if the number of classes is "best", is to
> judge yourself by inspecting the "quality" of the clusters returned.
> 
> One way to tell would be to compute the "error of clusters" which would
> be the overall distance between the points that are assigned to a
> cluster and its center.  I guess comparing the overall errors between
> different clustering settings (or even algorithms?), would give an idea
> about how close points are around the centers of clusters.
> Maybe we could implement something like this.
> 
> (All this I practiced during an generic Algorithmic Thinking course.  I
> guess it's applicable in our "domain" too.)
> 
> 
>> I use i.cluster with different number of classes and then i.maxlik that uses a
>> modified version of k-means according to the manual page. Now, I would like
>> to know which unsup classif is the best within the set.
> 
> Sorry, I guess I have to read up:  what is "unsup classif"?
> 
>> I check the i.cluster reports (looking for separability) and then explored the
>> rejection maps. But none of those seems to work as a crisp and clear
>> indicator. BTW, does anyone know which separability index does i.cluster
>> use?
> 
> 
> I am interested to learn about the distance measure too.  I am looking
> at the source code of `i.cluster`.  And then, searching around, I think
> it's this file:
> 
> grasstrunk/lib/cluster/c_sep.c
> 
> and I/we just need to identify which distance it measures.

i.cluster uses a simple k-means approach based on the spectral euclidean 
distance between pixels or between pixels and existing clusters. By 
including a min cluster size and a min cluster separation parameter, 
total number of clusters might change which is different from a 
classical k-means.

i.cluster also works on a sample of the image pixels to define the 
clusters, so there is no guarantee that the clusters it identifies would 
be those one would find if using all pixels, but AFAIK it is generally 
reasonable close to justify the pay-off as it provides greater speed.

i.maxlik does not interfere in the clustering part. It uses the 
signatures of classes provided as input (possibly the signatures of the 
clusters if the input is the output of i.cluster) to then assign each 
pixel to one of the classes. The reject map of i.maxlik allows you see 
the probability of a pixels membership in the chosen class. It does not 
really allow you to measure cluster "quality", nor ideal number of 
clusters (well you could try with many different cluster numbers and 
then chose the one where the reject map values are the lowest on average).

If you want to use a very simple approach to Nikos' suggestion of 
calculating the error, you could use something like this:

- For each i.cluster + i.maxlik result:
	- For each original band
		- Create new pseudo band with mean values of the
			original band per cluster (r.stats.zonal)
	- Calculate euclidean distance in spectral space of each pixel
			to its cluster (r.mapcalc):

		(pixel_value_band1 - r.stats.zonal result on band 1)^2 +
		(pixel_value_band2 - r.stats.zonal result on band 2)^2 +
		etc

	- Calculate mean euclidean distance on the result of (or median,
		or whatever you are looking for) (r.univar)

- Identify the i.cluster + i.maxlik result that reaches the best score

Moritz