[GRASS-SVN] r70356 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org
Thu Jan 12 16:41:55 PST 2017


Author: spawley
Date: 2017-01-12 16:41:55 -0800 (Thu, 12 Jan 2017)
New Revision: 70356

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
Log:
'added balancing option to balance class numbers by oversampling the minority classes in the training partitions; fixed bug with standardization; added better control of hyperparameter search using grid search when user enters comma-separated list of parameters'

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-12 23:30:02 UTC (rev 70355)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-13 00:41:55 UTC (rev 70356)
@@ -20,28 +20,30 @@
 	<li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which optionally can be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
 </ul>
 
-<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be automatically tuning using a randomized search by setting the <em>n_iter</em> option to &gt 1. The strategy used in <em>r.learn.ml</em> is not to attempt to tune all possible parameters because this is too computationally expensive for typical remote sensing and spatial models. Instead, only the parameters that most strongly affect model performance are automatically tuned. This tuning can also be accomplished simultaneously with nested cross-validation by also settings the <em>cv</em> option to &gt > 1. The parameters and their tuning strategies consist of:</p>
+<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuned using a grid search by entering multiple parameter settings as a comma-separated list. This tuning can also be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to &gt; 1. The parameters consist of:</p>
 
 <ul>
-	<li><em>C</em> is the inverse of the regularization strength, which is when a penalty is applied to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models. Tuning occurs over the range of 1-1000. </li>
+	<li><em>C</em> is the inverse of the regularization strength, i.e. a smaller value applies a stronger penalty to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models.</li>
 	
-	<li><em>n_estimators</em> represents the number of trees in Random Forest model, and the number of trees used in each model step during Gradient Boosting. Tuning occurs over 40-100 for gradient boosting, whereas <em>n_estimators</em> is not tuned for random forests because having a large number never adversely affects model accuracy, but it can cause unneccessary computational time, and is therefore better set manually.</li>
+	<li><em>n_estimators</em> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. For random forests, a larger number of trees will never adversely affect accuracy, although this comes at the expense of computational performance. In contrast, Gradient Boosting will start to overfit if <em>n_estimators</em> is too high, which will reduce model accuracy.</li>
 	
-	<li><em>max_features</em> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. Tuning occurs over 1 to all of the features being available for random forests and gradient boosting. Single decision trees are not tuned on this parameter.</li>
+	<li><em>max_features</em> controls the number of variables from which a split can be chosen at each node in the tree-based models, and effectively controls the degree of correlation between the trees in ensemble tree methods.</li>
 	
-	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively. These parameters are not tuned automatically and most models are not sensitive to these settings.</li>
+	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively.</li>
 	
-	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. These parameters are not tuned automatically and the best strategy is to vary these manually and determine how <em>learning_rate</em> influences the optimal number of <em>n_estimators</em>, and how <em>subsample</em> affects accuracy.</li>
+	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> generally improves accuracy in gradient boosting, but it requires a correspondingly larger <em>n_estimators</em> setting, which lowers computational performance.</li>
 	
-	<li>The main control on accuracy in the Earth classifier consists <em>max_degree</em> which is the maximum degree of terms generated by the forward pass. <em>max_degree</em> is available for tuning from 1-3, although note that the Earth classifier is slow when using max_degree > 1.</li>
+	<li>The main control on accuracy in the Earth classifier is <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass. Settings of <em>max_degree</em> = 1 or 2 offer a good trade-off between accuracy and computational performance.</li>
 </ul>
 
 <p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permuted data; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
 
 <p>Cross-validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or the classifier is for regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
 
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (i.e. decision trees) but generally, categorical predictors need to be converted to a suite of binary using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training samples all of the categories in the rasters, otherwise the onehot-encoding will fail when it comes to the prediction.</p>
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the <em>-b</em> flag balances the training data by oversampling the minority classes relative to the majority class. This only applies to classification models.</p> 
 
+<p>Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (e.g. decision trees) but generally, categorical predictors need to be converted to a suite of binary variables using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training samples include all of the categories in the rasters, otherwise the onehot-encoding will fail when it comes to the prediction.</p>
+
 <p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
 
 <p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <em>save_training</em> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
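
As a minimal sketch of the comma-separated grid-search tuning described above (assuming scikit-learn >= 0.18, and substituting a toy dataset for the training pixels queried from the imagery group), entering several values for a parameter amounts to passing them to GridSearchCV:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # toy stand-in for the training pixels extracted from the imagery group
    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    # equivalent of entering n_estimators=50,100,200 and max_features=2,4
    param_grid = {'n_estimators': [50, 100, 200], 'max_features': [2, 4]}

    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid=param_grid, cv=3)
    search.fit(X, y)
    print(search.best_params_)

Setting the module's cv option to > 1 wraps this search inside an outer cross-validation, which is the nested scheme referred to in the documentation.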
@@ -54,10 +56,6 @@
 
 <p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>randst</em> parameter.</p>
 
-<h2>TODO</h2>
-
-<p>The balancing option in scikit-learn, which seeks to reduce class imbalances using weights that are inversely proportional to class frequencies, only applies to a few of the classifiers (LogisticRegression, DecisionTree and RandomForest). An separate python package called imbalanced-learn provides more sophisticated up- and down-sampling methods, e.g. using SMOTE, ROSE, etc. The option to balance the training data using this optionally installed package will be added in the future.</p>
-
 <h2>EXAMPLE</h2>
 
 <p>Here we are going to use the GRASS GIS sample North Carolina data set as a basis to perform a landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.</p>
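
A plain-numpy illustration of the onehot encoding applied to rasters listed in categorymaps (the module itself fits and stores an encoder so the same transform can be reused at prediction time); a hypothetical categorical column with three classes expands into three binary columns:

    import numpy as np

    # toy categorical raster values for four training pixels
    cat = np.array([1, 2, 1, 3])

    # one binary column per class observed in the training data
    classes = np.unique(cat)
    onehot = (cat[:, None] == classes[None, :]).astype(int)
    print(onehot)
    # [[1 0 0]
    #  [0 1 0]
    #  [1 0 0]
    #  [0 0 1]]

A category that only appears at prediction time has no corresponding column, which is why the documentation warns that the training samples must cover all categories in the rasters.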

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-12 23:30:02 UTC (rev 70355)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-13 00:41:55 UTC (rev 70356)
@@ -51,42 +51,50 @@
 #% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,GaussianNB,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,GradientBoostingClassifier,GradientBoostingRegressor,SVC,EarthClassifier,EarthRegressor
 #%end
 
-#%option double
+#%option
 #% key: c
+#% type: double
#% description: Inverse of regularization strength (logistic regression and SVC)
 #% answer: 1.0
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
 #%option
 #% key: max_features
 #% type: integer
-#% description: Number of features to consider during splitting for tree-based classifiers. Default -1 is sqrt(n_features) for classification, and n_features for regression
-#% answer: -1
+#% description: Number of features to consider during splitting for tree based classifiers. Default is sqrt(n_features) for classification, and n_features for regression
+#% required: no
+#% answer:
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
 #%option
 #% key: max_depth
 #% type: integer
-#% description: Maximum tree depth for tree-based classifiers. Value of -1 uses classifier defaults
-#% answer: -1
+#% description: Optionally specify maximum tree depth. Otherwise full-growing occurs for decision trees and random forests, and max_depth=3 for gradient boosting
+#% required: no
+#% answer:
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
 #%option
 #% key: min_samples_split
-#% type: integer
-#% description: The minimum number of samples required for node splitting in tree-based classifiers
+#% type: double
+#% description: The minimum number of samples required for node splitting in tree based classifiers
 #% answer: 2
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
 #%option
 #% key: min_samples_leaf
 #% type: integer
-#% description: The minimum number of samples required to form a leaf node for tree-based classifiers
+#% description: The minimum number of samples required to form a leaf node for tree based classifiers
 #% answer: 1
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
@@ -95,6 +103,7 @@
 #% type: integer
 #% description: Number of estimators for tree-based classifiers
 #% answer: 100
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
@@ -103,6 +112,7 @@
 #% type: double
 #% description: learning rate for gradient boosting
 #% answer: 0.1
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
@@ -111,6 +121,7 @@
 #% type: double
 #% description: The fraction of samples to be used for fitting for gradient boosting
 #% answer: 1.0
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
@@ -118,6 +129,7 @@
 #% key: max_degree
 #% description: The maximum degree of terms generated by the forward pass in Earth
 #% answer: 1
+#% multiple: yes
 #% guisection: Classifier Parameters
 #%end
 
@@ -132,6 +144,7 @@
 #%option string
 #% key: categorymaps
 #% required: no
+#% multiple: yes
 #% label: Indices of categorical rasters within the imagery group (0..n)
 #% description: Indices of categorical rasters within the imagery group (0..n)
 #%end
@@ -243,7 +256,7 @@
 
 #%flag
 #% key: b
-#% description: Balance number of observations by weighting for logistic regression, CART and RF methods
+#% description: Balance training data by random oversampling
 #% guisection: Optional
 #%end
 
@@ -303,7 +316,8 @@
 
 class train():
 
-    def __init__(self, estimator, X, y, groups=None, categorical_var=None):
+    def __init__(self, estimator, X, y, groups=None, categorical_var=None,
+                 standardize=False, balance=False):
         """
         Train class to perform preprocessing, fitting, parameter search and
         cross-validation in a single step
@@ -314,12 +328,16 @@
         X, y: training data and labels as numpy arrays
         groups: groups to be used for cross-validation
         categorical_var: 1D list containing indices of categorical predictors
+        standardize: boolean to standardize (center and scale) the continuous predictors
+        balance: boolean to balance number of classes
         """
 
+        # fitting data
         self.estimator = estimator
         self.X = X
         self.y = y
         self.groups = groups
+        self.balance = balance
 
         # for onehot-encoding
         self.enc = None
@@ -330,7 +348,10 @@
             self.onehotencode()
         
         # for standardization
-        self.scaler = None
+        if standardize == True:
+            self.standardization()
+        else:
+            self.scaler = None
 
         # for cross-validation scores
         self.scores = None
@@ -338,6 +359,49 @@
         self.fimp = None
 
 
+    def random_oversampling(self, X, y, random_state=None):
+        """
+        Balances X, y observations using simple oversampling
+        
+        Args
+        ----
+        X: numpy array of training data
+        y: 1D numpy array of response data
+        random_state: Seed to pass onto random number generator
+        
+        Returns
+        -------
+        X_resampled: Numpy array of resampled training data
+        y_resampled: Numpy array of resampled response data
+        """
+        
+        np.random.seed(seed=random_state)
+        
+        # count the number of observations per class
+        y_classes = np.unique(y)
+        class_counts = np.histogram(y, bins=len(y_classes))[0]
+        maj_counts = class_counts.max()
+ 
+        y_resampled = y
+        X_resampled = X
+        
+        for cla, counts in zip(y_classes, class_counts):
+            # get the number of samples needed to balance minority class
+            num_samples = maj_counts - counts
+            
+            # get the indices of the ith class
+            indx = np.nonzero(y==cla)
+            
+            # create some new indices         
+            oversamp_indx = np.random.choice(indx[0], size=num_samples)
+    
+            # concatenate to the original X and y
+            y_resampled = np.concatenate((y[oversamp_indx], y_resampled))
+            X_resampled = np.concatenate((X[oversamp_indx], X_resampled))
+            
+        return (X_resampled, y_resampled)
+
+
     def onehotencode(self):
         """
         Method to convert a list of categorical arrays in X into a suite of
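
A standalone sketch of the random oversampling behind the new -b flag, using toy arrays rather than the module's class internals: each minority class is resampled with replacement until it matches the size of the largest class.

    import numpy as np

    np.random.seed(0)
    y = np.array([0] * 10 + [1] * 3)       # imbalanced labels
    X = np.arange(len(y)).reshape(-1, 1)   # toy predictor

    classes, counts = np.unique(y, return_counts=True)
    n_major = counts.max()

    X_res, y_res = X, y
    for cls, n in zip(classes, counts):
        # draw (with replacement) enough extra samples of this class
        extra = np.random.choice(np.nonzero(y == cls)[0], size=n_major - n)
        X_res = np.concatenate((X_res, X[extra]))
        y_res = np.concatenate((y_res, y[extra]))

    print(np.unique(y_res, return_counts=True))   # both classes now have 10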
@@ -358,7 +422,7 @@
         self.X = self.enc.transform(self.X)    
 
 
-    def fit(self, param_distribution=None, n_iter=3, cv=3,
+    def fit(self, param_distributions=None, param_grid=None, n_iter=3, cv=3,
             random_state=None):
 
         """
@@ -367,50 +431,85 @@
 
         Args
         ----
-        param_distribution: continuous parameter distribution to be used in a
+        param_distributions: continuous parameter distribution to be used in a 
         randomizedCVsearch
+        param_grid: Dict of non-continuous parameters to grid search
         n_iter: Number of randomized search iterations
         cv: Number of cross-validation folds for parameter tuning
         random_state: seed to be used during random number generation
         """
+
+        from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
+        from sklearn.model_selection import GroupKFold
         
-        from sklearn.model_selection import RandomizedSearchCV
-        from sklearn.model_selection import GroupKFold
+        # Balance classes
+        if self.balance == True:
+            X, y = self.random_oversampling(self.X, self.y, random_state=random_state)
+            
+            if self.groups is not None:
+                groups, _ = self.random_oversampling(
+                    self.groups, self.y, random_state=random_state)
+            else:
+                groups = None
+        else:
+            X = self.X
+            y = self.y
+            groups = self.groups
 
-        # RandomizedSearchCV if parameter_distributions
-        if param_distribution is not None and n_iter > 1:
+        # Randomized or grid search
+        if param_distributions is not None or param_grid is not None:
             
             # use groupkfold for hyperparameter search if groups are present
             if self.groups is not None:
                 cv_search = GroupKFold(n_splits=cv)
             else:
                 cv_search = cv
-                
-            self.estimator = RandomizedSearchCV(
-                estimator=self.estimator,
-                param_distributions=param_distribution,
-                n_iter=n_iter, cv=cv_search, random_state=random_state)
-
+        
+            # Randomized search
+            if param_distributions is not None:
+                self.estimator = RandomizedSearchCV(
+                    estimator=self.estimator,
+                    param_distributions=param_distributions,
+                    n_iter=n_iter,
+                    cv=cv_search)
+            
+            # Grid Search
+            if param_grid is not None:
+                self.estimator = GridSearchCV(self.estimator,
+                                              param_grid,
+                                              n_jobs=-1, cv=cv_search)
+                        
             # if groups then fit RandomizedSearchCV.fit requires groups param
             if self.groups is None:
-                self.estimator.fit(self.X, self.y)
+                self.estimator.fit(X, y)
             else:
-                self.estimator.fit(self.X, self.y, groups=self.groups)
+                self.estimator.fit(X, y, groups=groups)
         
         # Fitting without parameter search
         else:
-            self.estimator.fit(self.X, self.y)
+            self.estimator.fit(X, y)
 
 
     def standardization(self):
         """
-        Transforms the train objects X data using standardization
+        Transforms the non-categorical X
         """
 
         from sklearn.preprocessing import StandardScaler
+        
+        # create mask so that indices that represent categorical
+        # predictors are not selected
+        if self.categorical_var is not None:
+            idx = np.arange(self.X.shape[1])
+            mask = np.ones(len(idx), dtype=bool)
+            mask[self.categorical_var] = False
+        else:
+            mask = np.arange(self.X.shape[1])
 
-        scaler = StandardScaler().fit(self.X)
-        self.X = scaler.transform(self.X)
+        X_continuous = self.X[:, mask]    
+        self.scaler = StandardScaler()
+        self.scaler.fit(X_continuous)
+        self.X[:, mask] =  self.scaler.transform(X_continuous)
 
 
     def pred_func(self, estimator, X_test, y_true, scorers):
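
The standardization fix can be illustrated with a small sketch (StandardScaler from scikit-learn; the column layout is hypothetical): only the continuous columns are centred and scaled, and the fitted scaler is kept so the identical transform can be reapplied to the prediction data.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1, 200.0, 0.5],
                  [2,  50.0, 0.1],
                  [1, 120.0, 0.9]])
    categorical_var = [0]                       # index of the categorical column

    mask = np.ones(X.shape[1], dtype=bool)
    mask[categorical_var] = False               # True only for continuous columns

    scaler = StandardScaler().fit(X[:, mask])
    X[:, mask] = scaler.transform(X[:, mask])   # categorical column is untouched
    print(X[:, 0])                              # still [1. 2. 1.]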
@@ -582,14 +681,23 @@
 
             # get indices for train and test partitions
             X_train, X_test = self.X[train_indices], self.X[test_indices]
-            y_train, y_test = self.y[train_indices], self.y[test_indices]
+            y_train, y_test = self.y[train_indices], self.y[test_indices]         
             
             # also get indices of groups for the training partition
+            if self.groups is not None:
+                groups_train = self.groups[train_indices]
+
+            # balance the fold
+            if self.balance == True:
+                # resample groups first, using the original y_train, so that
+                # the oversampled indices line up with X_train and y_train
+                if self.groups is not None:
+                    groups_train, _ = self.random_oversampling(
+                        groups_train, y_train, random_state=random_state)
+                X_train, y_train = self.random_oversampling(
+                    X_train, y_train, random_state=random_state)
+                
             # fit the model on the training data and predict the test data
             # need the groups parameter because the estimator can be a 
             # RandomizedSearchCV estimator where cv=GroupKFold
             if self.groups is not None and isinstance(self.estimator, RandomizedSearchCV):
-                groups_train = self.groups[train_indices]
                 fit = self.estimator.fit(X_train, y_train, groups=groups_train)            
             else:
                 fit = self.estimator.fit(X_train, y_train)   
@@ -803,7 +911,16 @@
             
             # rescale
             if self.scaler is not None:
-                flat_pixels = self.scaler.transform(flat_pixels)
+                # create mask so that indices that represent categorical
+                # predictors are not selected
+                if self.categorical_var is not None:
+                    idx = np.arange(self.X.shape[1])
+                    mask = np.ones(len(idx), dtype=bool)
+                    mask[self.categorical_var] = False
+                else:
+                    mask = np.arange(self.X.shape[1])
+                flat_pixels_continuous = flat_pixels[:, mask]        
+                flat_pixels[:, mask] =  self.scaler.transform(flat_pixels_continuous)
 
             # perform prediction
             result = self.estimator.predict(flat_pixels)
@@ -868,9 +985,9 @@
 
 
 def model_classifiers(estimator='LogisticRegression', random_state=None,
-                      class_weight=None, C=1, max_depth=None,
-                      max_features='auto', min_samples_split=2,
-                      min_samples_leaf=1, n_estimators=100, subsample=1.0,
+                      C=1, max_depth=None, max_features='auto',
+                      min_samples_split=2, min_samples_leaf=1,
+                      n_estimators=100, subsample=1.0,
                       learning_rate=0.1, max_degree=1):
 
     """
@@ -880,7 +997,6 @@
     ----
     estimator: Name of estimator
     random_state: Seed to use in randomized components
-    class_weight: Option to balance classes using weighting
     C: Inverse of regularization strength
     max_depth: Maximum depth for tree-based methods
     min_samples_split: Minimum number of samples to split a node
@@ -893,13 +1009,10 @@
     Returns
     -------
     clf: Scikit-learn classifier object
-    params: Parameters to use for object
     mode: Flag to indicate whether classifier performs classification or
           regression
     """
 
-    from scipy.stats import randint, uniform
-
     from sklearn.linear_model import LogisticRegression
     from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
     from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
@@ -929,15 +1042,13 @@
         classifiers = {
             'SVC': SVC(C=C, probability=True, random_state=random_state),
             'LogisticRegression':
-                LogisticRegression(C=C, class_weight=class_weight,
-                                  random_state=random_state, n_jobs=-1),
+                LogisticRegression(C=C, random_state=random_state, n_jobs=-1),
             'DecisionTreeClassifier':
                 DecisionTreeClassifier(max_depth=max_depth,
                                       max_features=max_features,
                                       min_samples_split=min_samples_split,
                                       min_samples_leaf=min_samples_leaf,
-                                      random_state=random_state,
-                                      class_weight=class_weight),
+                                      random_state=random_state),
             'DecisionTreeRegressor':
                 DecisionTreeRegressor(max_features=max_features,
                                       min_samples_split=min_samples_split,
@@ -945,7 +1056,6 @@
                                       random_state=random_state),
             'RandomForestClassifier':
                 RandomForestClassifier(n_estimators=n_estimators,
-                                       class_weight=class_weight,
                                        max_features=max_features,
                                        min_samples_split=min_samples_split,
                                        min_samples_leaf=min_samples_leaf,
@@ -983,34 +1093,8 @@
             'QuadraticDiscriminantAnalysis': QuadraticDiscriminantAnalysis(),
         }
 
-    LogisticRegressionOpts = {'C': randint(1, 1000)}
-    DecisionTreeOpts = {'max_depth': randint(2, 20)}
-    RandomForestOpts = {'max_features': uniform()}
-    GradientBoostingOpts = {'max_depth': randint(3, 10),
-                            'n_estimators': randint(40, 100)}
-    SVCOpts = {'C': randint(1, 100)}
-    EarthOpts = {'max_degree': randint(1,3)}
-    EarthClassifierOpts = {'Earth__max_degree': randint(1,3)}
-
-    param_grids = {
-        'SVC': SVCOpts,
-        'LogisticRegression': LogisticRegressionOpts,
-        'DecisionTreeClassifier': DecisionTreeOpts,
-        'DecisionTreeRegressor': DecisionTreeOpts,
-        'RandomForestClassifier': RandomForestOpts,
-        'RandomForestRegressor': RandomForestOpts,
-        'GradientBoostingClassifier': GradientBoostingOpts,
-        'GradientBoostingRegressor': GradientBoostingOpts,
-        'GaussianNB': {},
-        'LinearDiscriminantAnalysis': {},
-        'QuadraticDiscriminantAnalysis': {},
-        'EarthClassifier': EarthClassifierOpts,
-        'EarthRegressor': EarthOpts
-    }
-
     # define classifier
     clf = classifiers[estimator]
-    params = param_grids[estimator]
 
     # classification or regression
     if estimator == 'LogisticRegression' \
@@ -1026,7 +1110,7 @@
     else:
         mode = 'regression'
 
-    return (clf, params, mode)
+    return (clf, mode)
 
 
 def save_training_data(X, y, groups, file):
@@ -1278,7 +1362,6 @@
     Args
     ----
     group: String; GRASS imagery group
-
     Returns
     -------
     maplist: Python list containing individual GRASS raster maps
@@ -1329,69 +1412,104 @@
     load_training = options['load_training']
     save_training = options['save_training']
     importances = flags['f']
-    n_iter = int(options['n_iter'])
     tune_cv = int(options['tune_cv'])
     n_permutations = int(options['n_permutations'])
     lowmem = flags['l']
     errors_file = options['errors_file']
     fimp_file = options['fimp_file']
-
-    if flags['b'] is True:
-        class_weight = 'balanced'
+    balance = flags['b']
+   
+    if categorymaps == '':
+        categorymaps = None
     else:
-        class_weight = None
+        categorymaps = [int(i) for i in categorymaps.split(',')]
+        
+    param_grid = {'C': None,
+                'min_samples_split': None,
+                'min_samples_leaf': None,
+                'n_estimators': None,
+                'learning_rate': None,
+                'subsample': None,
+                'max_depth': None,
+                'max_features': None,
+                'max_degree': None}
     
-    # convert comma-delimited string into int list
-    if categorymaps != '':
-        categorymaps = categorymaps.split(',')
-        for i in range(len(categorymaps)): categorymaps[i] = int(categorymaps[i])
+    # classifier options
+    C = options['c']
+    if ',' in C:
+        param_grid['C'] = [float(i) for i in C.split(',')]
+        C = None
     else:
-        categorymaps = None
+        C = float(C)
+    
+    min_samples_split = options['min_samples_split']
+    if ',' in min_samples_split:
+        param_grid['min_samples_split'] = [float(i) for i in min_samples_split.split(',')]
+        min_samples_split = None                
+    else:
+        min_samples_split = int(min_samples_split)
+    
+    min_samples_leaf = options['min_samples_leaf']
+    if ',' in min_samples_leaf:
+        param_grid['min_samples_leaf'] = [int(i) for i in min_samples_leaf.split(',')]
+        min_samples_leaf = None
+    else:
+        min_samples_leaf = int(min_samples_leaf)
 
-    # classifier options
-    max_degree = int(options['max_degree'])
-    C = float(options['c'])
-    min_samples_split = int(options['min_samples_split'])
-    min_samples_leaf = int(options['min_samples_leaf'])
-    n_estimators = int(options['n_estimators'])
-    learning_rate = float(options['learning_rate'])
-    subsample = float(options['subsample'])
-    max_depth = int(options['max_depth'])
-    max_features = int(options['max_features'])
+    n_estimators = options['n_estimators']
+    if ',' in n_estimators:
+        param_grid['n_estimators'] = [int(i) for i in n_estimators.split(',')]
+        n_estimators = None
+    else:
+        n_estimators = int(n_estimators)
 
-    """
-    Error checking of options and flags
-    -----------------------------------
-    """
+    learning_rate = options['learning_rate']
+    if ',' in learning_rate:
+        param_grid['learning_rate'] = [float(i) for i in learning_rate.split(',')]
+        learning_rate = None
+    else:
+        learning_rate = float(learning_rate)
 
-    if max_features == -1:
-        max_features = str('auto')
-    if max_depth == -1:
+    subsample = options['subsample']
+    if ',' in subsample:
+        param_grid['subsample'] = [float(i) for i in subsample.split(',')]
+        subsample = None
+    else:
+        subsample = float(subsample)
+
+    max_depth = options['max_depth']
+    if max_depth == '':
         max_depth = None
-
-    if n_iter > 1:
-        if (classifier == 'LinearDiscriminantAnalysis' or
-        classifier == 'QuadraticDiscriminantAnalysis' or
-        classifier == 'GaussianNB'):
-            grass.warning('No parameters to tune for selected model...ignoring')
-            n_iter = 1
+    else:
+        if ',' in max_depth:
+            param_grid['max_depth'] = [int(i) for i in max_depth.split(',')]
+            max_depth = None
+        else:
+            max_depth = int(max_depth)
     
+    max_features = options['max_features']
+    if max_features == '':
+        max_features = 'auto'
+    else:
+        if ',' in max_features:
+            param_grid['max_features'] = [int(i) for i in max_features.split(',')]
+            max_features = None
+        else:
+            max_features = int(max_features)
+    
+    max_degree = options['max_degree']
+    if ',' in max_degree:
+        param_grid['max_degree'] = [int(i) for i in max_degree.split(',')]
+        max_degree = None
+    else:
+        max_degree = int(max_degree)
+    
     if importances is True and cv == 1:
         grass.fatal('Feature importances require cross-validation cv > 1')
-        
-    """
-    Obtain information about GRASS rasters to be classified
-    -------------------------------------------------------
-    """
 
     # fetch individual raster names from group
     maplist, map_names = maps_from_group(group)
-    n_features = len(maplist)
 
-    # Error checking for m_features settings
-    if max_features > n_features:
-        max_features = n_features
-
     """
     Train the classifier
     --------------------
@@ -1418,13 +1536,27 @@
         # retrieve sklearn classifier object and parameters
         grass.message("Classifier = " + classifier)
 
-        clf, param_grid, mode = \
+        clf, mode = \
             model_classifiers(classifier, random_state,
-                              class_weight, C, max_depth,
-                              max_features, min_samples_split,
+                              C, max_depth, max_features, min_samples_split,
                               min_samples_leaf, n_estimators,
                               subsample, learning_rate, max_degree)
+        
+        # turn off balancing if mode = regression
+        if mode == 'regression' and balance == True:
+            balance = False
 
+        # remove empty items from the param_grid dict
+        param_grid = {k: v for k, v in param_grid.iteritems() if v != None}
+        
+        # check that dict keys are compatible for the selected classifier
+        clf_params = clf.get_params()
+        param_grid = { key: value for key, value in param_grid.iteritems() if key in clf_params}
+        
+        # check if dict contains any keys, otherwise set it to None
+        # so that the train object will not perform GridSearchCV
+        if any(param_grid) != True: param_grid = None
+        
         # Decide on scoring metric scheme
         if mode == 'classification':
             if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
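
The pruning of the user-supplied grid against the chosen estimator, performed above with get_params(), can be sketched as follows (written with Python 3 dict methods; the module itself uses iteritems):

    from sklearn.ensemble import RandomForestClassifier

    # e.g. the user entered n_estimators=50,100 and c=1,10 but chose a random forest
    param_grid = {'n_estimators': [50, 100], 'C': [1.0, 10.0]}

    clf = RandomForestClassifier()
    param_grid = {k: v for k, v in param_grid.items()
                  if k in clf.get_params()}
    print(param_grid)    # {'n_estimators': [50, 100]} -- 'C' is dropped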
@@ -1433,29 +1565,25 @@
                 scorers = 'multiclass'
         else:
             scorers = 'regression'
-
+        
         if mode == 'regression' and probability is True:
             grass.warning(
                 'Class probabilities only valid for classifications...ignoring')
             probability = False
 
         # create training object - onehot-encoded on-the-fly
-        learn_m = train(clf, X, y, group_id, categorical_var=categorymaps)
+        learn_m = train(clf, X, y, group_id, categorical_var=categorymaps,
+                        standardize=norm_data, balance=balance)
 
-        # preprocessing
-        if norm_data is True:
-            learn_m.standardization()
-
         """
         Fitting, parameter search and cross-validation
         ----------------
         """
 
         # fit and parameter search
-        learn_m.fit(param_grid, n_iter, tune_cv,
-                    random_state=random_state)
+        learn_m.fit(param_grid=param_grid, cv=tune_cv, random_state=random_state)
 
-        if n_iter > 1:
+        if param_grid is not None:
             grass.message('\n')
             grass.message('Best parameters:')
             grass.message(str(learn_m.estimator.best_params_))
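
Finally, a condensed sketch of the permutation-based feature importance described in the module documentation, reduced to a single train/test split and one permutation per variable (the module averages over cross-validation folds and n_permutations, and scores with AUC, accuracy or R2 depending on the problem):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    baseline = accuracy_score(y_test, clf.predict(X_test))

    rng = np.random.RandomState(0)
    importances = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        X_perm = X_test.copy()
        X_perm[:, i] = rng.permutation(X_perm[:, i])   # permute one predictor
        importances[i] = baseline - accuracy_score(y_test, clf.predict(X_perm))

    print(importances)   # decrease in accuracy when each predictor is permuted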


