[GRASS-SVN] r70328 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org
Mon Jan 9 15:14:31 PST 2017


Author: spawley
Date: 2017-01-09 15:14:31 -0800 (Mon, 09 Jan 2017)
New Revision: 70328

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
Log:
'allow a different number of kmeans partitions to be used in k-fold cross-validation'

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-09 17:21:56 UTC (rev 70327)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-09 23:14:31 UTC (rev 70328)
@@ -38,7 +38,7 @@
 
 <p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed is a custom permutation-based procedure that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and permuted data; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
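
A minimal sketch of this permutation procedure, assuming a fitted scikit-learn estimator and a held-out test partition (the function and variable names below are illustrative, not the module's actual implementation):

    import numpy as np

    def permutation_importances(estimator, X_test, y_test, metric,
                                n_permutations=25, random_state=None):
        # step 1: baseline performance on the unpermuted test partition
        rng = np.random.RandomState(random_state)
        baseline = metric(y_test, estimator.predict(X_test))
        fimp = np.zeros((n_permutations, X_test.shape[1]))
        for rep in range(n_permutations):
            for i in range(X_test.shape[1]):
                X_perm = X_test.copy()
                rng.shuffle(X_perm[:, i])  # step 2: permute one variable
                permuted = metric(y_test, estimator.predict(X_perm))
                fimp[rep, i] = baseline - permuted  # decrease in performance
        # step 4: average the results; steps 1-4 repeat on each k partition
        return fimp.mean(axis=0)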
 
-<p>Cross validation can be performed by setting the <em>cv</em> parameters to &gt 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <em>cvtype</em> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
+<p>Cross-validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced. The <em>cvtype</em> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons; cross-validation will then effectively be performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient, then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
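
The kmeans option amounts to clustering the x,y coordinates of the training pixels and using the cluster labels as cross-validation groups. A hedged sketch of the idea (coords is a stand-in array of pixel coordinates; the names are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.model_selection import GroupKFold

    coords = np.random.rand(200, 2) * 1000    # stand-in pixel x,y coordinates
    group_id = KMeans(n_clusters=10,           # i.e. n_partitions
                      random_state=0).fit_predict(coords)

    # each fold now holds out whole spatial clusters rather than single pixels
    for train_idx, test_idx in GroupKFold(n_splits=3).split(coords, groups=group_id):
        pass  # fit and score the estimator on each spatially disjoint split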
 
 <p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (i.e. decision trees) but generally, categorical predictors need to be converted to a suite of binary variables using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training data sample all of the categories in the rasters, otherwise the onehot encoding will fail when it comes to the prediction. An alternative approach is to onehot-encode the categorical rasters manually (i.e. create a series of raster maps coded to 0 and 1 for each category value) and use these in the imagery group.</p>
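
As an illustration of that manual alternative in scikit-learn terms, a categorical column can be onehot-encoded ahead of time and its binary expansion used in place of the original values (a sketch with illustrative data, not the module's code):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([[1, 3.2], [2, 5.1], [1, 7.4]])   # column 0 is categorical
    categorical_idx = [0]

    enc = OneHotEncoder()
    X_cat = enc.fit_transform(X[:, categorical_idx]).toarray()

    # replace the categorical column with its 0/1 binary expansion
    X_encoded = np.hstack([X_cat, np.delete(X, categorical_idx, axis=1)])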
 

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-09 17:21:56 UTC (rev 70327)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-09 23:14:31 UTC (rev 70328)
@@ -159,6 +159,14 @@
 #% options: non-spatial,clumped,kmeans
 #%end
 
+#%option
+#% key: n_partitions
+#% type: integer
+#% description: Number of kmeans partitions
+#% answer: 10
+#% guisection: Optional
+#%end
+
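
For context, the new option would be exercised along these lines from the GRASS Python scripting API (only cv, cvtype and n_partitions are confirmed by this commit; the other parameter names and map names are placeholders):

    # hypothetical invocation: 3-fold spatial CV over 10 kmeans partitions
    import grass.script as grass

    grass.run_command('r.learn.ml',
                      group='predictors',       # placeholder imagery group
                      trainingmap='labels',     # placeholder training raster
                      output='classification',  # placeholder output name
                      cv=3,
                      cvtype='kmeans',
                      n_partitions=10)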
 #%option G_OPT_R_INPUT
 #% key: group_raster
 #% label: Custom group ids for labelled pixels from GRASS raster
@@ -363,8 +371,7 @@
         self.X = self.enc.transform(self.X)    
 
 
-    def fit(self, param_distribution=None, n_iter=3, scorers='multiclass',
-            cv=3, tune_cv=3, feature_importances=False, n_permutations=1,
+    def fit(self, param_distribution=None, n_iter=3, cv=3,
             random_state=None):
 
         """
@@ -376,42 +383,38 @@
         param_distribution: continuous parameter distribution to be used in a
         RandomizedSearchCV
         n_iter: Number of randomized search iterations
-        scorers: Suite of metrics to obtain
-        cv: Number of cross-validation folds
-        tune_cv: Number of cross-validation folds for parameter tuning
-        feature_importances: Boolean to perform permuatation-based importances
-        during cross-validation
-        n_permutations: Number of random permutations during feature importance
+        cv: Number of cross-validation folds for parameter tuning
         random_state: seed to be used during random number generation
         """
+        
         from sklearn.model_selection import RandomizedSearchCV
         from sklearn.model_selection import GroupKFold
 
+        # use RandomizedSearchCV if a parameter distribution is supplied
         if param_distribution is not None and n_iter > 1:
             
             # use groupkfold for hyperparameter search if groups are present
             if self.groups is not None:
-                cv_search = GroupKFold(n_splits=tune_cv)
+                cv_search = GroupKFold(n_splits=cv)
             else:
-                cv_search = tune_cv
+                cv_search = cv
                 
             self.estimator = RandomizedSearchCV(
                 estimator=self.estimator,
                 param_distributions=param_distribution,
                 n_iter=n_iter, cv=cv_search, random_state=random_state)
-        
+
+            # if groups are present, RandomizedSearchCV.fit requires the groups param
             if self.groups is None:
                 self.estimator.fit(self.X, self.y)
             else:
                 self.estimator.fit(self.X, self.y, groups=self.groups)
+        
+        # Fitting without parameter search
         else:
             self.estimator.fit(self.X, self.y)
 
-        if cv > 1:
-            self.cross_val(
-                scorers, cv, feature_importances, n_permutations, random_state)
 
-
     def standardization(self):
         """
         Transforms the train objects X data using standardization
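
To make the refactored fit() path concrete, here is a self-contained sketch of the hyperparameter search with groups (the estimator and parameter distribution are illustrative stand-ins, not the module's defaults):

    import numpy as np
    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, RandomizedSearchCV

    X = np.random.rand(60, 4)
    y = np.random.randint(0, 2, 60)
    groups = np.repeat(np.arange(6), 10)      # e.g. kmeans partition ids

    # use GroupKFold for the search whenever group ids are present
    cv_search = GroupKFold(n_splits=3) if groups is not None else 3

    search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                param_distributions={'max_depth': randint(1, 8)},
                                n_iter=3, cv=cv_search, random_state=0)
    search.fit(X, y, groups=groups)           # groups must be passed to fit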
@@ -527,8 +530,8 @@
         return (specificity)
 
 
-    def cross_val(self, scorers, cv, feature_importances, n_permutations,
-                  random_state):
+    def cross_val(self, scorers='binary', cv=3, feature_importances=False,
+                  n_permutations=25, random_state=None):
 
         from sklearn.model_selection import StratifiedKFold
         from sklearn.model_selection import GroupKFold
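
The defaults above pair with a fold generator chosen from these two imports; a minimal sketch of that choice (the helper name is illustrative, and whether the module shuffles within StratifiedKFold is an assumption here):

    from sklearn.model_selection import GroupKFold, StratifiedKFold

    def choose_folds(cv, groups=None, random_state=None):
        # pixel-based CV preserves class proportions in each fold;
        # grouped (clumped/kmeans) CV holds out whole partitions instead
        if groups is None:
            return StratifiedKFold(n_splits=cv, shuffle=True,
                                   random_state=random_state)
        return GroupKFold(n_splits=cv)
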
@@ -696,6 +699,7 @@
             for index in self.categorical_var:
                 self.fimp = np.insert(self.fimp, np.array(index), ohe_sum[0], axis=1)
 
+
     def predict(self, predictors, output, class_probabilities=False,
                rowincr=25):
 
@@ -1343,9 +1347,10 @@
     classifier = options['classifier']
     norm_data = flags['s']
     cv = int(options['cv'])
+    cvtype = options['cvtype']
     group_raster = options['group_raster']
     categorymaps = options['categorymaps']    
-    cvtype = options['cvtype']
+    n_partitions = int(options['n_partitions'])
     modelonly = flags['m']
     probability = flags['p']
     rowincr = int(options['lines'])
@@ -1436,7 +1442,7 @@
             X, y, group_id = load_training_data(load_training)
         else:
             X, y, group_id = sample_training_data(
-                response, maplist, group_raster, cv, cvtype,
+                response, maplist, group_raster, n_partitions, cvtype,
                 lowmem, random_state)
 
         # option to save extracted data to .csv file
@@ -1480,10 +1486,8 @@
         ----------------
         """
 
-        # fit, search and cross-validate the training object
-        learn_m.fit(param_grid, n_iter, scorers, cv, tune_cv,
-                    feature_importances=importances,
-                    n_permutations=n_permutations,
+        # fit and parameter search
+        learn_m.fit(param_grid, n_iter, tune_cv,
                     random_state=random_state)
 
         if n_iter > 1:
@@ -1496,6 +1500,10 @@
             grass.message('\r\n')
             grass.message(
                 "Cross validation global performance measures......:")
+            
+            # cross-validate the training object
+            learn_m.cross_val(scorers, cv, importances, n_permutations=n_permutations,
+                              random_state=random_state)
 
             if mode == 'classification':
                 if scorers == 'binary':


