[GRASS-SVN] r70333 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org svn_grass at osgeo.org
Tue Jan 10 11:51:10 PST 2017


Author: spawley
Date: 2017-01-10 11:51:10 -0800 (Tue, 10 Jan 2017)
New Revision: 70333

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
Log:
'Tweaks to parameter tuning settings'

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-10 11:41:24 UTC (rev 70332)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-10 19:51:10 UTC (rev 70333)
@@ -20,27 +20,27 @@
 	<li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> is a python-based version of Friedman's multivariate adaptive regression splines. This classifier depends on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which optionally can be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
 </ul>
 
-<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The classifier defaults are supplied but these parameters can be automatically tuning using a randomized search by setting the <em>n_iter</em> option to &gt 1. Parameter tuning can be accomplished simultaneously with nested cross-validation by also settings the <em>cv</em> option to &gt > 1. The parameters consist of:</p>
+<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be automatically tuned using a randomized search by setting the <em>n_iter</em> option to &gt; 1. The strategy used in <em>r.learn.ml</em> is not to attempt to tune all possible parameters, because this is too computationally expensive for typical remote sensing and spatial models. Instead, only the parameters that most strongly affect model performance are automatically tuned. This tuning can also be accomplished simultaneously with nested cross-validation by setting the <em>cv</em> option to &gt; 1 (a sketch of the tuning strategy follows the list below). The parameters and their tuning strategies consist of:</p>
 
 <ul>
 	<li><em>C</em> is the inverse of the regularization strength, i.e. of the penalty that is applied to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models. Tuning occurs over the range of 1-1000.</li>
 	
-	<li><em>n_estimators</em> represents the number of trees in Random Forest model, and the number of trees used in each model step during Gradient Boosting. Tuning occurs over 50-500 for gradient boosting, whereas <em>n_estimators</em> is not tuning for random forests.</li>
+	<li><em>n_estimators</em> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. Tuning occurs over 40-100 for gradient boosting, whereas <em>n_estimators</em> is not tuned for random forests because a large number of trees never adversely affects model accuracy; it only adds unnecessary computation time, and is therefore better set manually.</li>
 	
 	<li><em>max_features</em> controls the number of variables that are available for selection at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. Tuning occurs over the range of 1 to all available features for random forests and gradient boosting. Single decision trees are not tuned on this parameter.</li>
 	
-	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node, or form a leaf node, respectively. Tuning varies these parameters by allowing up to 2% of the samples to be required form a node split or leaf node.</li>
+	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively. These parameters are not tuned automatically, as most models are not sensitive to these settings.</li>
 	
-	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. <em>learning_rate</em> is tuning over 0.01-0.1, and <em>subsample</em> is tuned over 0-1.0.</li>
+	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. These parameters are not tuned automatically; the best strategy is to vary them manually to determine how <em>learning_rate</em> influences the optimal value of <em>n_estimators</em>, and how <em>subsample</em> affects accuracy.</li>
 	
-	<li>Parameters relating to the Earth classifier consist of: <em>max_degree</em> which is the maximum degree of terms generated by the forward pass; <em>penalty</em> is a smoothing parameter; and <em>minspan_alpha</em> is the probability between 0 and 1 that controls the number of data points between knots. These are tuned over 1-5 for <em>max_degree</em>, 0.5-2.0 for <em>penalty</em>, and 0.05-1.0 for <em>minspan_alpha</em>. Note that the Earth classifier is slow when using max_degree > 1, although performance is generally improved with max_degree between 2-3.</li>
+	<li>The main control on accuracy in the Earth classifier is <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass. <em>max_degree</em> is available for tuning over the range 1-3, although note that the Earth classifier is slow when using max_degree &gt; 1.</li>
 </ul>
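+
+<p>As a rough illustration, the tuning strategy described above corresponds to a randomized parameter search in scikit-learn. The following is a minimal sketch only, assuming scikit-learn and scipy are installed; the estimator, data and iteration count are placeholders rather than the module's internal code:</p>
+
+<div class="code"><pre>
+# Sketch of randomized parameter tuning, assuming scikit-learn and scipy;
+# X, y and n_iter are placeholder values, not the module's internals
+from scipy.stats import randint
+from sklearn.datasets import make_classification
+from sklearn.ensemble import GradientBoostingClassifier
+from sklearn.model_selection import RandomizedSearchCV
+
+X, y = make_classification(n_samples=500, n_features=8, random_state=0)
+
+# distributions mirroring the tuned ranges described above
+param_distributions = {'max_depth': randint(3, 10),
+                       'n_estimators': randint(40, 100)}
+
+search = RandomizedSearchCV(GradientBoostingClassifier(),
+                            param_distributions=param_distributions,
+                            n_iter=25,  # random combinations to try
+                            cv=3)       # folds used to score each combination
+search.fit(X, y)
+print(search.best_params_)
+</pre></div>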
 
 <p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method consists of a custom permutation-based procedure that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
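+
+<p>A minimal sketch of this permutation procedure, assuming scikit-learn; the data, classifier, and fold and permutation counts are placeholders, and the snippet illustrates the numbered steps rather than the module's exact implementation:</p>
+
+<div class="code"><pre>
+# Sketch of the permutation-based importances, assuming scikit-learn;
+# data, classifier and fold/permutation counts are placeholders
+import numpy as np
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.metrics import accuracy_score
+from sklearn.model_selection import KFold
+
+X, y = make_classification(n_samples=500, n_features=6, random_state=0)
+n_splits, n_permutations = 3, 10
+rng = np.random.RandomState(0)
+importances = np.zeros(X.shape[1])
+
+for train, test in KFold(n_splits=n_splits).split(X):
+    clf = RandomForestClassifier(n_estimators=100, random_state=0)
+    clf.fit(X[train], y[train])
+    baseline = accuracy_score(y[test], clf.predict(X[test]))  # step 1
+    for i in range(X.shape[1]):
+        diffs = []
+        for _ in range(n_permutations):                       # step 3
+            Xp = X[test].copy()
+            rng.shuffle(Xp[:, i])                             # step 2
+            diffs.append(baseline -
+                         accuracy_score(y[test], clf.predict(Xp)))
+        importances[i] += np.mean(diffs) / n_splits           # step 4
+
+# average decrease in performance when each variable is permuted
+print(importances)
+</pre></div>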
 
-<p>Cross validation can be performed by setting the <em>cv</em> parameters to &gt 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced. The <em>cvtype</em> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
+<p>Cross-validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or the classifier is for regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and cross-validation will then effectively be performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient, then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
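+
+<p>The 'kmeans' option can be pictured as clustering the pixel coordinates and then holding out whole clusters during cross-validation. A minimal sketch, assuming scikit-learn, with random placeholder coordinates and predictors:</p>
+
+<div class="code"><pre>
+# Sketch of kmeans spatial cross-validation, assuming scikit-learn;
+# coords, X and y are random placeholders for real training pixels
+import numpy as np
+from sklearn.cluster import KMeans
+from sklearn.model_selection import GroupKFold
+
+rng = np.random.RandomState(0)
+coords = rng.rand(200, 2)            # x, y coordinates of training pixels
+X = rng.rand(200, 5)                 # predictor values at those pixels
+y = rng.randint(0, 2, 200)
+
+# partition the pixels spatially by clustering their coordinates
+groups = KMeans(n_clusters=10, random_state=0).fit_predict(coords)
+
+# each fold then holds out whole spatial clusters rather than random pixels
+for train, test in GroupKFold(n_splits=5).split(X, y, groups=groups):
+    pass  # fit on X[train], y[train]; score on X[test], y[test]
+</pre></div>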
 
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (i.e. decision trees) but generally, categorical predictors need to be converted to a suite of binary using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training samples all of the categories in the rasters, otherwise the onehot-encoding will fail when it comes to the prediction. An alterative approach is to onehot-encode the categorical rasters manually (i.e. create a series of raster maps coded to 0 and 1 for each category value) and use these in the imagery group.</p>
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (e.g. decision trees), but generally, categorical predictors need to be converted to a suite of binary variables using one-hot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause one-hot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the one-hot-encoded variables. Note: it is important that the training samples include all of the categories in the rasters, otherwise the one-hot encoding will fail when it comes to the prediction.</p>
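+
+<p>A minimal sketch of the one-hot encoding step, assuming scikit-learn; the array and the categorical column index are placeholders, and the snippet is an illustration rather than the module's internal code:</p>
+
+<div class="code"><pre>
+# Sketch of one-hot encoding a categorical column, assuming scikit-learn;
+# the array and the categorical column index are placeholders
+import numpy as np
+from sklearn.preprocessing import OneHotEncoder
+
+X = np.array([[1.2, 3],
+              [0.7, 1],
+              [2.4, 3],
+              [1.9, 2]])             # column 1 holds category values
+
+enc = OneHotEncoder(categorical_features=[1], sparse=False)
+X_enc = enc.fit_transform(X)         # category column becomes binary columns
+print(X_enc.shape)                   # (4, 4): 3 indicators + 1 numeric column
+
+# a category absent from training makes transform() fail at prediction time,
+# which is why the training samples should cover all raster categories
+</pre></div>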
 
 <p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
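+
+<p>Persistence of scikit-learn estimators is commonly handled by pickling the fitted model, e.g. via joblib. The following sketch illustrates the save/load idea only and makes no claim about the module's exact file format:</p>
+
+<div class="code"><pre>
+# Sketch of saving and reloading a fitted model, assuming scikit-learn;
+# the classifier, data and file name are placeholders
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.externals import joblib
+
+X, y = make_classification(n_samples=100, n_features=4, random_state=0)
+model = RandomForestClassifier(n_estimators=100).fit(X, y)
+
+joblib.dump(model, 'model.pkl')      # fit on one imagery group...
+model = joblib.load('model.pkl')     # ...reload to predict on another
+</pre></div>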
 
@@ -56,7 +56,7 @@
 
 <h2>TODO</h2>
 
-<p>The balancing option in scikit-learn, which seeks to reduce class imbalances using weights that are inversely proportional to class frequencies, only applies to a few of the classifiers (LogisticRegression, DecisionTree, RandomForest, and GradientBoostingClassifiers). An separate python package called imbalanced-learn provides more sophisticated up- and down-sampling methods, e.g. using SMOTE, ROSE, etc. The option to balance the training data using this optionally installed package will be added in the future.</p>
+<p>The balancing option in scikit-learn, which seeks to reduce class imbalances using weights that are inversely proportional to class frequencies, only applies to a few of the classifiers (LogisticRegression, DecisionTree and RandomForest). A separate Python package called imbalanced-learn provides more sophisticated up- and down-sampling methods, e.g. using SMOTE, ROSE, etc. The option to balance the training data using this optionally installed package will be added in the future.</p>
 
 <h2>EXAMPLE</h2>
 

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-10 11:41:24 UTC (rev 70332)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-10 19:51:10 UTC (rev 70333)
@@ -94,7 +94,7 @@
 #% key: n_estimators
 #% type: integer
 #% description: Number of estimators for tree-based classifiers
-#% answer: 500
+#% answer: 100
 #% guisection: Classifier Parameters
 #%end
 
@@ -121,20 +121,6 @@
 #% guisection: Classifier Parameters
 #%end
 
-#%option double
-#% key: penalty
-#% description: A smoothing parameter used to calculate GCV and GRSQ in Earth
-#% answer: 3.0
-#% guisection: Classifier Parameters
-#%end
-
-#%option double
-#% key: minspan_alpha
-#% description: probability between 0 and 1 controlling the number of data points between knots in Earth
-#% answer: 0.05
-#% guisection: Classifier Parameters
-#%end
-
 # General options
 
 #%flag
@@ -162,7 +148,7 @@
 #%option
 #% key: n_partitions
 #% type: integer
-#% description: Number of kmeans partitions
+#% description: Number of kmeans spatial partitions
 #% answer: 10
 #% guisection: Optional
 #%end
@@ -178,7 +164,7 @@
 #%option
 #% key: cv
 #% type: integer
-#% description: Number of cross-validation folds to perform in cv > 1
+#% description: Number of cross-validation folds for performance evaluation
 #% answer: 1
 #% guisection: Optional
 #%end
@@ -297,7 +283,8 @@
 
 #%rules
 #% exclusive: trainingmap,load_model
-#% exclusive: save_training,load_training
+#% exclusive: load_training,save_training
+
 #%end
 
 import atexit
@@ -884,8 +871,7 @@
                       class_weight=None, C=1, max_depth=None,
                       max_features='auto', min_samples_split=2,
                       min_samples_leaf=1, n_estimators=100, subsample=1.0,
-                      learning_rate=0.1, max_degree=1, penalty=3.0,
-                      minspan_alpha=0.05):
+                      learning_rate=0.1, max_degree=1):
 
     """
     Provides the classifiers and parameters used by the module
@@ -903,8 +889,6 @@
     subsample: Controls randomization in gradient boosting
     learning_rate: Used in gradient boosting
     max_degree: For earth
-    penalty: For earth
-    minspan_alpha: For earth
 
     Returns
     -------
@@ -934,14 +918,10 @@
 
             # Combine Earth with LogisticRegression in a pipeline to do classification
             earth_classifier = Pipeline([('Earth',
-                Earth(max_degree=max_degree,
-                     penalty=penalty,
-                     minspan_alpha=minspan_alpha)), ('Logistic', LogisticRegression())])
+                Earth(max_degree=max_degree)), ('Logistic', LogisticRegression())])
 
             classifiers = {'EarthClassifier': earth_classifier,
-                           'EarthRegressor': Earth(max_degree=max_degree,
-                                                   penalty=penalty,
-                                                   minspan_alpha=minspan_alpha)}
+                           'EarthRegressor': Earth(max_degree=max_degree)}
         except:
             grass.fatal('Py-earth package not installed')
     else:
@@ -949,7 +929,6 @@
         classifiers = {
             'SVC': SVC(C=C, probability=True, random_state=random_state),
             'LogisticRegression':
-
                 LogisticRegression(C=C, class_weight=class_weight,
                                   random_state=random_state, n_jobs=-1),
             'DecisionTreeClassifier':
@@ -1005,23 +984,13 @@
         }
 
     LogisticRegressionOpts = {'C': randint(1, 1000)}
-    DecisionTreeOpts = {'max_depth': randint(2, 20),
-                        'min_samples_split': uniform(0, 0.02)}
+    DecisionTreeOpts = {'max_depth': randint(2, 20)}
     RandomForestOpts = {'max_features': uniform()}
-    GradientBoostingOpts = {'learning_rate': uniform(0.01, 0.1),
-                            'max_depth': randint(3, 10),
-                            'max_features': uniform(),
-                            'n_estimators': randint(50, 500),
-                            'min_samples_split': uniform(0, 0.02),
-                            'min_samples_leaf': uniform(0, 0.02),
-                            'subsample': uniform()}
-    SVCOpts = {'C': randint(1, 100), 'shrinking': [True, False]}
-    EarthOpts = {'max_degree': randint(1,5),
-                 'penalty': uniform(0.5, 2),
-                 'minspan_alpha': uniform(0.05, 1.0)}
-    EarthClassifierOpts = {'Earth__max_degree': randint(1,5),
-                           'Earth__penalty': uniform(0.5, 2),
-                           'Earth__minspan_alpha': uniform(0.05, 1.0)}
+    GradientBoostingOpts = {'max_depth': randint(3, 10),
+                            'n_estimators': randint(40, 100)}
+    SVCOpts = {'C': randint(1, 100)}
+    EarthOpts = {'max_degree': randint(1,3)}
+    EarthClassifierOpts = {'Earth__max_degree': randint(1,3)}
 
     param_grids = {
         'SVC': SVCOpts,
@@ -1350,7 +1319,6 @@
     cvtype = options['cvtype']
     group_raster = options['group_raster']
     categorymaps = options['categorymaps']    
-    n_partitions = options['n_partitions']
     n_partitions = int(options['n_partitions'])
     modelonly = flags['m']
     probability = flags['p']
@@ -1382,8 +1350,6 @@
 
     # classifier options
     max_degree = int(options['max_degree'])
-    penalty = float(options['penalty'])
-    minspan_alpha = float(options['minspan_alpha'])
     C = float(options['c'])
     min_samples_split = int(options['min_samples_split'])
     min_samples_leaf = int(options['min_samples_leaf'])
@@ -1457,8 +1423,7 @@
                               class_weight, C, max_depth,
                               max_features, min_samples_split,
                               min_samples_leaf, n_estimators,
-                              subsample, learning_rate, max_degree, penalty,
-                              minspan_alpha)
+                              subsample, learning_rate, max_degree)
 
         # Decide on scoring metric scheme
         if mode == 'classification':
@@ -1508,41 +1473,41 @@
             if mode == 'classification':
                 if scorers == 'binary':
                     grass.message(
-                        "Accuracy   :\t%0.2f\t+/-SD\t%0.2f" %
+                        "Accuracy   :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['accuracy'].mean(),
                          learn_m.scores['accuracy'].std()))
                     grass.message(
-                        "AUC        :\t%0.2f\t+/-SD\t%0.2f" %
+                        "AUC        :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['auc'].mean(),
                          learn_m.scores['auc'].std()))
                     grass.message(
-                        "Kappa      :\t%0.2f\t+/-SD\t%0.2f" %
+                        "Kappa      :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['kappa'].mean(),
                          learn_m.scores['kappa'].std()))
                     grass.message(
-                        "Precision  :\t%0.2f\t+/-SD\t%0.2f" %
+                        "Precision  :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['precision'].mean(),
                          learn_m.scores['precision'].std()))
                     grass.message(
-                        "Recall     :\t%0.2f\t+/-SD\t%0.2f" %
+                        "Recall     :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['recall'].mean(),
                          learn_m.scores['recall'].std()))
                     grass.message(
-                        "Specificity:\t%0.2f\t+/-SD\t%0.2f" %
+                        "Specificity:\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['specificity'].mean(),
                          learn_m.scores['specificity'].std()))
                     grass.message(
-                        "F1         :\t%0.2f\t+/-SD\t%0.2f" %
+                        "F1         :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['f1'].mean(),
                          learn_m.scores['f1'].std()))
 
                 if scorers == 'multiclass':
                     grass.message(
-                        "Accuracy:\t%0.2f\t+/-SD\t%0.2f" %
+                        "Accuracy:\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['accuracy'].mean(),
                          learn_m.scores['accuracy'].std()))
                     grass.message(
-                        "Kappa   :\t%0.2f\t+/-SD\t%0.2f" %
+                        "Kappa   :\t%0.3f\t+/-SD\t%0.3f" %
                         (learn_m.scores['kappa'].mean(),
                          learn_m.scores['kappa'].std()))
 
@@ -1552,7 +1517,7 @@
                 grass.message(learn_m.scores_cm)
 
             else:
-                grass.message("R2:\t%0.2f\t+/-\t%0.2f" %
+                grass.message("R2:\t%0.3f\t+/-\t%0.3f" %
                               (learn_m.scores['r2'].mean(),
                                learn_m.scores['r2'].std()))
 


