[GRASS-SVN] r68576 - grass-addons/grass7/vector/v.class.mlR

Thu Jun 2 08:07:49 PDT 2016

Author: mlennert
Date: 2016-06-02 08:07:49 -0700 (Thu, 02 Jun 2016)
New Revision: 68576

Modified:
   grass-addons/grass7/vector/v.class.mlR/v.class.mlR.html
   grass-addons/grass7/vector/v.class.mlR/v.class.mlR.py
Log:
v.class.mlR: added choice of weighting metric and improved documentation


Modified: grass-addons/grass7/vector/v.class.mlR/v.class.mlR.html
===================================================================

--- grass-addons/grass7/vector/v.class.mlR/v.class.mlR.html	2016-06-02 10:40:29 UTC (rev 68575)
+++ grass-addons/grass7/vector/v.class.mlR/v.class.mlR.html	2016-06-02 15:07:49 UTC (rev 68576)
@@ -5,37 +5,55 @@
 for machine learning in R to classify features using training features 
 by supervised learning.
 
-<p>The user can provide input either as vector maps, or as csv files, or 
-a combination of both. Output can consist of either additional columns in
-the vector input map of features, a text file or reclassed raster maps.
+<p>The user can provide input either as vector maps (<em>segments_map</em>
+and <em>training_map</em>), as csv files (<em>segments_file</em> and
+<em>training_file</em>), or as a combination of both. Csv files have to be
+formatted in line with the default output of 
+<a href="v.db.select.html">v.db.select</a>, i.e. with a header and the
+pipe character as field separator.  Output can consist of either 
+additional columns in the vector input map of features, a text file 
+(<em>classification_results</em>) or reclassed raster maps 
+(<em>classified_map</em>).
 
+<p>The user has to provide the name of the column in the training data
+that contains the class values (<em>train_class_column</em>), the prefix
+of the columns that will contain the final class after classification
+(<em>output_class_column</em>) as well as the prefix of the columns that
+will contain the probability values linked to these classifications 
+(<em>output_prob_column</em> - see below).
+
 <p>Different classifiers are proposed: k-nearest neighbor (knn and knn1 
 for k=1), support vector machine with a radial kernel (svmRadial), random 
 forest (rf) and recursive partitioning (rpart). Each of these classifiers
-is tuned automatically throught repeated cross-validation. See the 
-<a href="https://topepo.github.io/caret/index.html">caret webpage</a> for
+is tuned automatically through repeated cross-validation. caret will 
+automatically determine a reasonable set of values for tuning. See the 
+<a href="http://topepo.github.io/caret/modelList.html">caret webpage</a> for
 more information about the tuning parameters for each classifier, and
 more generally for the information about how caret works.
 
 <p>The user can chose to include the individual classifiers results in
 the output using the <em>i</em> flag, but by default the output will be
 the result of a voting scheme merging the results of the different 
-classifiers. The voting schemes available are: simple majority vote without 
-weighting (smv), simple weighted majority vote (swv), best-worst weighted 
-vote (bwwv) and quadratic best-worst weighted vote (qbwwv). For more details
-about these voting schemes see [TODO: include reference].
+classifiers. Votes can be weighted according to a user-defined mode 
+(<em>weighting_modes</em>): simple majority vote without weighting, i.e. 
+all weights are equal (smv), simple weighted majority vote (swv), 
+best-worst weighted vote (bwwv) and quadratic best-worst weighted vote 
+(qbwwv). For more details about these voting modes see [TODO: include 
+reference]. By default, the weights are calculated based on the accuracy 
+metric, but the user can choose the kappa value as an alternative 
+(<em>weighting_metric</em>).
 
 <p>In the output (as attribute columns or text file) each weighting schemes 
-result is provided accompanied by an estimation of the probability of the
-classification, based on the equation used in [TODO: include reference].
+result is provided accompanied by a value that can be considered as an
+estimation of the probability of the classification after weighted vote, 
+based on the equation used in [TODO: include reference].
 
 <p>Optional output of the module include a box-and-whisker plot indicating
-the variance of the cross-validation results for each classifier 
+the resampling variance based on the cross-validation for each classifier 
 (<em>bw_plot_file</em>) and a csv file containing accuracy measures (overall
 accuracy and kappa) for each classifier (<em>accuracy_file</em>). The user
 can also chose to write the R script constructed and used internally to a text
 file for study or further modification.
-
 <h2>NOTES</h2>
 
 <p>
@@ -50,7 +68,12 @@
 
 <h2>TODO</h2>
 
-Add automagic installation of missing R packages.
+<ul>
+	<li>Add automagic installation of missing R packages.</li>
+	<li>Add output with confusion matrix</li>
+	<li>Add option to manually define grid of tuning parameters</li>
+</ul>
+
 
 <h2>EXAMPLE</h2>
 

Modified: grass-addons/grass7/vector/v.class.mlR/v.class.mlR.py
===================================================================
--- grass-addons/grass7/vector/v.class.mlR/v.class.mlR.py	2016-06-02 10:40:29 UTC (rev 68575)
+++ grass-addons/grass7/vector/v.class.mlR/v.class.mlR.py	2016-06-02 15:07:49 UTC (rev 68576)
@@ -100,6 +100,14 @@
 #% options: smv,swv,bwwv,qbwwv
 #% answer: smv
 #%end
+#%option
+#% key: weighting_metric
+#% type: string
+#% description: Metric to use for weighting
+#% required: yes
+#% options: accuracy,kappa
+#% answer: accuracy
+#%end
 #%option G_OPT_F_OUTPUT
 #% key: classification_results
 #% description: File for saving results of all classifiers
@@ -185,10 +193,10 @@
     voting_function += "return(list(maj_class=maj_class, prob=prob))\n}"
 
     weighting_functions = {}
-    weighting_functions['smv'] = "weights <- rep(1/length(accuracy_means), length(accuracy_means))"
-    weighting_functions['swv'] = "weights <- accuracy_means/sum(accuracy_means)"
-    weighting_functions['bwwv'] = "weights <- 1-(max(accuracy_means) - accuracy_means)/(max(accuracy_means) - min(accuracy_means))"
-    weighting_functions['qbwwv'] = "weights <- ((min(accuracy_means) - accuracy_means)/(max(accuracy_means) - min(accuracy_means)))**2"
+    weighting_functions['smv'] = "weights <- rep(1/length(weighting_base), length(weighting_base))"
+    weighting_functions['swv'] = "weights <- weighting_base/sum(weighting_base)"
+    weighting_functions['bwwv'] = "weights <- 1-(max(weighting_base) - weighting_base)/(max(weighting_base) - min(weighting_base))"
+    weighting_functions['qbwwv'] = "weights <- ((min(weighting_base) - weighting_base)/(max(weighting_base) - min(weighting_base)))**2"
 
     if options['segments_map']:
         allfeatures = options['segments_map']
@@ -211,6 +219,7 @@
         output_probcol = options['output_prob_column']
     classifiers = options['classifiers'].split(',')
     weighting_modes = options['weighting_modes'].split(',')
+    weighting_metric = options['weighting_metric']
 
     classification_results = None
     if options['classification_results']:
@@ -320,6 +329,11 @@
     r_file.write(voting_function)
     r_file.write("\n")
 
+    if weighting_metric == 'kappa':
+        r_file.write("weighting_base <- kappa_means")
+    else:
+        r_file.write("weighting_base <- accuracy_means")
+    r_file.write("\n")
     for weighting_mode in weighting_modes:
         r_file.write(weighting_functions[weighting_mode])
         r_file.write("\n")
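
Note on the weighting modes: the four R expressions stored in weighting_functions can be hard to compare as one-line strings. The following is a minimal Python sketch of the same arithmetic, intended only as an illustration and not as part of the module. The function name vote_weights and the ValueError raised when all scores are equal are assumptions made for this sketch; the R code in the diff performs the division without such a check. The scores argument stands in for weighting_base, i.e. either accuracy_means or kappa_means depending on weighting_metric.

```python
def vote_weights(scores, mode="smv"):
    """Return one weight per classifier from its mean cross-validation score.

    Hypothetical helper mirroring the R expressions in weighting_functions;
    `scores` plays the role of weighting_base.
    """
    n = len(scores)
    lo, hi = min(scores), max(scores)

    if mode == "smv":
        # simple majority vote: all weights are equal
        return [1.0 / n] * n
    if mode == "swv":
        # simple weighted vote: scores normalised to sum to 1
        total = sum(scores)
        return [s / total for s in scores]
    if mode in ("bwwv", "qbwwv"):
        if hi == lo:
            # assumption of this sketch: fail loudly instead of dividing by zero
            raise ValueError("all scores are equal; best-worst weights are undefined")
        if mode == "bwwv":
            # best-worst weighted vote: best classifier gets 1, worst gets 0
            return [1.0 - (hi - s) / (hi - lo) for s in scores]
        # quadratic best-worst weighted vote: squared rescaled distance from the worst
        return [((lo - s) / (hi - lo)) ** 2 for s in scores]
    raise ValueError("unknown weighting mode: %s" % mode)


if __name__ == "__main__":
    example_scores = [0.5, 1.0, 0.75]  # made-up values, not real results
    print(vote_weights(example_scores, "bwwv"))   # [0.0, 1.0, 0.5]
    print(vote_weights(example_scores, "qbwwv"))  # [0.0, 1.0, 0.25]
```

With bwwv and qbwwv the weakest classifier always receives a weight of 0, so its vote is ignored entirely. That holds whether the accuracy or the kappa metric is selected.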