[GRASS-SVN] r70494 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org
Mon Feb 6 23:18:59 PST 2017


Author: spawley
Date: 2017-02-06 23:18:59 -0800 (Mon, 06 Feb 2017)
New Revision: 70494

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
Log:
r.learn.ml: minor change to the scoring metric used to search for optimal hyperparameters; additional error checking; fixed a bug in the performance metrics for regressors; manual update for the XGBoost classifier

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-02-07 03:54:29 UTC (rev 70493)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-02-07 07:18:59 UTC (rev 70494)
@@ -13,7 +13,7 @@
 	
 	<li>The <em>RandomForestsClassifier</em> and <em>RandomForestsRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data, and only a random subset of the predictors based on <em>max_features</em> is made available during each node split. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees.</li>
 	
-	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals.</li>
+	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, the <em>XGBClassifier</em> and <em>XGBRegressor</em> models represent an accelerated version of gradient boosting, which can optionally be installed from the XGBoost Python package.</li>
 	
 	<li>The <em>SVC</em> model is a C-Support Vector Classifier. Only a linear kernel is supported because, for typical remote sensing and spatial analysis datasets with large numbers of samples, non-linear kernels in scikit-learn are too slow to be practical. This classifier can still be slow for large datasets.</li>
 	
@@ -31,7 +31,7 @@
 	
 	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively.</li>
 	
-	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> generally improves accuracy in gradient boosting but requires a much larger <em>n_estimators</em> setting, which lowers computational performance.</li>
+	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting and the XGBClassifier or XGBRegressor. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> generally improves accuracy in gradient boosting but requires a much larger <em>n_estimators</em> setting, which lowers computational performance.</li>
 	
 	<li>The main control on accuracy in the Earth classifier is <em>max_degree</em>, the maximum degree of terms generated by the forward pass. Settings of <em>max_degree</em> = 1 or 2 offer a good balance between accuracy and computational performance.</li>
 </ul>
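
As an aside for readers of the manual change above: a minimal sketch of how the
XGBoost estimators named here are constructed with the hyperparameters the
module exposes (the values below are illustrative, not the module's defaults,
and assume the optional xgboost package is installed):

    from xgboost import XGBClassifier

    clf = XGBClassifier(learning_rate=0.05,  # lower rate shrinks each tree's contribution
                        n_estimators=500,    # more boosting rounds offset a low rate
                        max_depth=3,         # depth of each individual tree
                        subsample=0.8)       # fraction of samples drawn per tree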

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-02-07 03:54:29 UTC (rev 70493)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-02-07 07:18:59 UTC (rev 70494)
@@ -417,8 +417,8 @@
         self.enc.fit(self.X)
         self.X = self.enc.transform(self.X)
 
-    def fit(self, param_distributions=None, param_grid=None, n_iter=3, cv=3,
-            random_state=None):
+    def fit(self, param_distributions=None, param_grid=None,
+            scoring=None, n_iter=3, cv=3, random_state=None):
 
         """
         Main fit method for the train object. Performs fitting, hyperparameter
@@ -466,14 +466,15 @@
                 self.estimator = RandomizedSearchCV(
                     estimator=self.estimator,
                     param_distributions=param_distributions,
-                    n_iter=n_iter,
+                    n_iter=n_iter, scoring=scoring,
                     cv=cv_search)
 
             # Grid Search
             if param_grid is not None:
                 self.estimator = GridSearchCV(self.estimator,
                                               param_grid,
-                                              n_jobs=-1, cv=cv_search)
+                                              n_jobs=-1, cv=cv_search,
+                                              scoring=scoring)
 
             # if groups are present, RandomizedSearchCV.fit requires the groups param
             if self.groups is None:
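
The hunk above threads a scoring argument through to scikit-learn's parameter
search objects. A minimal, self-contained sketch of the same pattern (the
estimator, grid, and data below are illustrative, not taken from the module):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, GroupKFold

    X = np.random.rand(60, 8)
    y = np.random.randint(0, 2, 60)
    group_id = np.repeat(np.arange(6), 10)   # six spatial groups

    search = GridSearchCV(RandomForestClassifier(),
                          param_grid={'max_features': [2, 4, 8]},
                          scoring='accuracy',  # or a make_scorer() callable
                          cv=GroupKFold(n_splits=3),
                          n_jobs=-1)
    search.fit(X, y, groups=group_id)        # group-aware splitters need groups
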
@@ -745,6 +746,9 @@
                     self.scores['kappa'],
                     metrics.cohen_kappa_score(y_test, y_pred))
 
+                self.scores_cm = metrics.classification_report(
+                        y_test_agg, y_pred_agg)
+
             elif scorers == 'multiclass':
 
                 self.scores['accuracy'] = np.append(
@@ -755,13 +759,16 @@
                     self.scores['kappa'],
                     metrics.cohen_kappa_score(y_test, y_pred))
 
+                self.scores_cm = metrics.classification_report(
+                        y_test_agg, y_pred_agg)
+
             elif scorers == 'regression':
                 self.scores['r2'] = np.append(
                     self.scores['r2'], metrics.r2_score(y_test, y_pred))
 
             # feature importances using permutation
             if feature_importances is True:
-                if (self.fimp==0).all() == True:
+                if bool((self.fimp == 0).all()) is True:
                     self.fimp = self.varImp_permutation(
                         fit_train, X_test, y_test, n_permutations, scorers,
                         random_state)
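
The two hunks above move the classification report inside the binary and
multiclass scorer branches. For reference, a tiny illustration of the two
metrics involved (the data are made up):

    from sklearn import metrics

    y_true = [0, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 0, 0, 1, 1]

    print(metrics.classification_report(y_true, y_pred))  # per-class precision/recall/f1
    print(metrics.cohen_kappa_score(y_true, y_pred))      # chance-corrected agreement
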
@@ -771,8 +778,6 @@
                             fit_train, X_test, y_test,
                             n_permutations, scorers, random_state)))
 
-        self.scores_cm = metrics.classification_report(y_test_agg, y_pred_agg)
-
         # convert onehot-encoded feature importances back to original vars
         if self.fimp is not None and self.enc is not None:
 
@@ -819,7 +824,7 @@
         # determine output data type and nodata
         predicted = self.estimator.predict(self.X)
 
-        if (predicted % 1 == 0).all() == True:
+        if bool((predicted % 1 == 0).all()) is True:
             ftype = 'CELL'
             nodata = -2147483648
         else:
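
The hunk above picks the GRASS output type from the predictions: if every
predicted value is a whole number, the map is written as integer (CELL),
otherwise as floating point. A sketch of that decision (the float branch is an
assumption, since the hunk is truncated before the module's actual values):

    import numpy as np

    predicted = np.array([1.0, 2.0, 3.0])
    if (predicted % 1 == 0).all():
        ftype, nodata = 'CELL', -2147483648   # GRASS 32-bit integer nodata
    else:
        ftype, nodata = 'FCELL', np.nan       # assumed float branch
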
@@ -1045,7 +1050,7 @@
                            'EarthRegressor': Earth(max_degree=max_degree)}
         except:
             grass.fatal('Py-earth package not installed')
-            
+
     elif estimator == 'XGBClassifier' or estimator == 'XGBRegressor':
         try:
             from xgboost import XGBClassifier, XGBRegressor
@@ -1053,14 +1058,17 @@
             if max_depth is None:
                 max_depth = int(3)
 
-            classifiers = {'XGBClassifier': XGBClassifier(learning_rate=learning_rate,
-                                                          n_estimators=n_estimators,
-                                                          max_depth=max_depth,
-                                                          subsample=subsample),
-                           'XGBRegressor': XGBRegressor(learning_rate=learning_rate,
-                                                        n_estimators=n_estimators,
-                                                        max_depth=max_depth,
-                                                        subsample=subsample)}
+            classifiers = {
+                'XGBClassifier':
+                    XGBClassifier(learning_rate=learning_rate,
+                                  n_estimators=n_estimators,
+                                  max_depth=max_depth,
+                                  subsample=subsample),
+                'XGBRegressor':
+                    XGBRegressor(learning_rate=learning_rate,
+                                 n_estimators=n_estimators,
+                                 max_depth=max_depth,
+                                 subsample=subsample)}
         except:
             grass.fatal('XGBoost package not installed')
     else:
@@ -1187,7 +1195,7 @@
     groups = training_data[:, -1]
 
     # if all nans then set groups to None
-    if np.isnan(groups).all() == True:
+    if bool(np.isnan(groups).all()) is True:
         groups = None
 
     # fetch X and y
@@ -1307,91 +1315,6 @@
     return(X, y, y_indexes)
 
 
-def sample_training_data(response, maplist, group_raster='', n_partitions=3,
-                         cvtype='', impute=False, lowmem=False,
-                         random_state=None):
-
-    """
-    Samples predictor and optional group id raster for cross-val
-
-    Args
-    ----
-    roi: String; GRASS raster with labelled pixels
-    maplist: List of GRASS rasters containing explanatory variables
-    group_raster: GRASS raster containing group ids of labelled pixels
-    n_partitions: Number of spatial partitions
-    cvtype: Type of spatial clustering
-    save_training: Save extracted training data to .csv file
-    lowmem: Boolean to use numpy memmap during extraction
-    random_state: Seed
-
-    Returns
-    -------
-    X: Numpy array of extracted raster values
-    y: Numpy array of labels
-    group_id: Group ids of labels
-    """
-
-    from sklearn.cluster import KMeans
-
-    # clump the labelled pixel raster if labels represent polygons
-    # then set the group_raster to the clumped raster to extract the group_ids
-    # used in the GroupKFold cross-validation
-    # ------------------------------------------------------------------------
-    if cvtype == 'clumped' and group_raster == '':
-        r.clump(input=response, output='tmp_roi_clumped',
-                overwrite=True, quiet=True)
-        group_raster = 'tmp_roi_clumped'
-
-    # extract training data from maplist and take group ids from
-    # group_raster. Shuffle=False so that group ids and labels align
-    # because cross-validation will be performed spatially
-    # ---------------------------------------------------------------
-    if group_raster != '':
-        maplist2 = deepcopy(maplist)
-        maplist2.append(group_raster)
-        X, y, sample_coords = sample_predictors(response=response,
-                                                predictors=maplist2,
-                                                impute=impute,
-                                                shuffle_data=False,
-                                                lowmem=False,
-                                                random_state=random_state)
-        # take group id from last column and remove column from predictors
-        group_id = X[:, -1]
-        X = np.delete(X, -1, axis=1)
-
-        # remove the clumped raster
-        try:
-            grass.run_command(
-                "g.remove", name='tmp_roi_clumped', flags="f",
-                type="raster", quiet=True)
-        except:
-            pass
-
-    # extract training data from maplist without group Ids
-    # shuffle this data by default
-    # ----------------------------------------------------
-    else:
-        X, y, sample_coords = sample_predictors(
-            response=response, predictors=maplist,
-            impute=impute,
-            shuffle_data=True,
-            lowmem=lowmem,
-            random_state=random_state)
-
-        group_id = None
-
-        if cvtype == 'kmeans':
-            clusters = KMeans(n_clusters=n_partitions,
-                              random_state=random_state,
-                              n_jobs=-1)
-
-            clusters.fit(sample_coords)
-            group_id = clusters.labels_
-
-    return (X, y, group_id)
-
-
 def maps_from_group(group):
     """
     Parse individual rasters into a list from an imagery group
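
The function removed above (and inlined further down) builds the spatial
cross-validation groups: either from a clumped label raster or, for
cvtype='kmeans', by clustering the sample coordinates. A self-contained sketch
of the kmeans path feeding GroupKFold (synthetic data; note the module also
passes n_jobs=-1 to KMeans, an argument that newer scikit-learn releases
dropped):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.model_selection import GroupKFold

    sample_coords = np.random.rand(90, 2)    # x, y of the labelled pixels
    group_id = KMeans(n_clusters=3, random_state=0).fit(sample_coords).labels_

    for train_idx, test_idx in GroupKFold(n_splits=3).split(
            sample_coords, groups=group_id):
        pass  # each fold holds out whole spatial clusters
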
@@ -1420,7 +1343,8 @@
 
     try:
         from sklearn.externals import joblib
-
+        from sklearn.cluster import KMeans
+        from sklearn.metrics import make_scorer, cohen_kappa_score
     except:
         grass.fatal("Scikit learn 0.18 or newer is not installed")
 
@@ -1553,28 +1477,81 @@
     maplist, map_names = maps_from_group(group)
 
     """
-    Train the classifier
+    Sample training data and group ids
     --------------------
     """
 
-    # Sample training data and group ids
-    # Perform parameter tuning and cross-validation
-    # Unless a previously fitted model is to be loaded
-    # ------------------------------------------------
     if model_load == '':
 
         # Sample training data and group id
         if load_training != '':
             X, y, group_id = load_training_data(load_training)
         else:
-            X, y, group_id = sample_training_data(
-                response, maplist, group_raster, n_partitions, cvtype,
-                impute, lowmem, random_state)
+            # clump the labelled pixel raster if labels represent polygons
+            # then set the group_raster to the clumped raster to extract the
+            # group_ids used in the GroupKFold cross-validation
+            if cvtype == 'clumped' and group_raster == '':
+                r.clump(input=response, output='tmp_roi_clumped',
+                        overwrite=True, quiet=True)
+                group_raster = 'tmp_roi_clumped'
 
+            # extract training data from maplist and take group ids from
+            # group_raster. Shuffle=False so that group ids and labels align
+            # because cross-validation will be performed spatially
+            if group_raster != '':
+                maplist2 = deepcopy(maplist)
+                maplist2.append(group_raster)
+                X, y, sample_coords = sample_predictors(
+                        response=response, predictors=maplist2,
+                        impute=impute, shuffle_data=False, lowmem=False,
+                        random_state=random_state)
+
+                # take group id from last column and remove from predictors
+                group_id = X[:, -1]
+                X = np.delete(X, -1, axis=1)
+
+                # remove the clumped raster
+                try:
+                    grass.run_command(
+                        "g.remove", name='tmp_roi_clumped', flags="f",
+                        type="raster", quiet=True)
+                except:
+                    pass
+
+            else:
+                # extract training data from maplist without group ids
+                # shuffle this data by default
+                X, y, sample_coords = sample_predictors(
+                    response=response, predictors=maplist,
+                    impute=impute,
+                    shuffle_data=True,
+                    lowmem=lowmem,
+                    random_state=random_state)
+
+                group_id = None
+
+                if cvtype == 'kmeans':
+                    clusters = KMeans(n_clusters=n_partitions,
+                                      random_state=random_state,
+                                      n_jobs=-1)
+
+                    clusters.fit(sample_coords)
+                    group_id = clusters.labels_
+
+            # check for labelled pixels and training data
+            if y.shape[0] == 0 or X.shape[0] == 0:
+                grass.fatal(('No training pixels or pixels in imagery group '
+                             '...check computational region'))
+
         # option to save extracted data to .csv file
         if save_training != '':
             save_training_data(X, y, group_id, save_training)
 
+        """
+        Train the classifier
+        --------------------
+        """
+
         # retrieve sklearn classifier object and parameters
         grass.message("Classifier = " + classifier)
 
@@ -1600,14 +1577,16 @@
         if any(param_grid) is not True:
             param_grid = None
 
-        # Decide on scoring metric scheme
+        # Decide on the scoring metric scheme and the scorer to use for the grid search
         if mode == 'classification':
             if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
                 scorers = 'binary'
             else:
                 scorers = 'multiclass'
+            search_scorer = make_scorer(cohen_kappa_score)
         else:
             scorers = 'regression'
+            search_scorer = 'r2'
 
         if mode == 'regression' and probability is True:
             grass.warning(
@@ -1625,7 +1604,7 @@
         """
 
         # fit and parameter search
-        learn_m.fit(param_grid=param_grid, cv=tune_cv,
+        learn_m.fit(param_grid=param_grid, cv=tune_cv, scoring=search_scorer,
                     random_state=random_state)
 
         if param_grid is not None:
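
The final hunk selects the scorer used during the hyperparameter search:
Cohen's kappa (wrapped with make_scorer) for classification and 'r2' for
regression. A minimal sketch of that pattern (the estimator, grid, and data
are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import make_scorer, cohen_kappa_score
    from sklearn.model_selection import GridSearchCV

    X = np.random.rand(40, 3)
    y = np.random.randint(0, 2, 40)

    search_scorer = make_scorer(cohen_kappa_score)
    search = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0, 10.0]},
                          scoring=search_scorer, cv=3)
    search.fit(X, y)
    print(search.best_params_)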


