[GRASS-SVN] r70889 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org svn_grass at osgeo.org
Sun Apr 16 21:56:22 PDT 2017


Author: spawley
Date: 2017-04-16 21:56:22 -0700 (Sun, 16 Apr 2017)
New Revision: 70889

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
   grass-addons/grass7/raster/r.learn.ml/raster_learning.py
Log:
Update to r.learn.ml. Refactoring and removal of superfluous classes. Included option to use a vector points map as the training data. Added ExtraTreesClassifier option. Reverted to using class weights to balance training data for all classifiers that support them. Added options to only output probabilities for specific classes without predicting the classification raster.

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-04-16 20:37:08 UTC (rev 70888)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-04-17 04:56:22 UTC (rev 70889)
@@ -11,7 +11,7 @@
 	
 	<li>The <em>DecisionTreeClassifier</em> and <em>DecisionTreeRegressor</em> map observations to a response variable using a hierarchy of splits and branches. The terminus of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric and can model non-linear relationships between a response and predictor variables, and are insensitive the scaling of the predictors.</li>
 	
-	<li>The <em>RandomForestsClassifier</em> and <em>RandomForestsRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data and only a random subset of the predictors based on <em>max_features</em> is made available during each node split. Each tree produces a prediction probability and the final classification result is obtained by averaging of the prediction probabilities across all of the trees.</li>
+	<li>The <em>RandomForestsClassifier</em> and <em>RandomForestsRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data and only a random subset of the predictors based on <em>max_features</em> is made available during each node split. Each tree produces a prediction probability and the final classification result is obtained by averaging of the prediction probabilities across all of the trees. The <em>ExtraTreesClassifier</em> is a variant on random forests where during each node split, the splitting rule that is selected is based on the best of a collection of randomly-geneated thresholds that were assigned to the features.</li>
 	
 	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning processes is additive in a forward step-wise fashion whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, the <em>XGBClassifier</em> and <em>XGBRegressor</em> models represent an accelerated version of gradient boosting which can optionally be installed from the XGBoost python package.</li>
 	
@@ -40,7 +40,7 @@
 
 <p>Cross validation can be performed by setting the <em>cv</em> parameters to &gt 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or the classifier is for regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partioning schemes are not sufficient then a raster containing the gr
 oup_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
 
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the <em>-b</em> flag balances the training data by oversampling the minority classes relative to the majority class. This only applies to classification models.</p> 
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the <em>-b</em> flag balances the training data by weighting of the minority classes relative to the majority class. This does not apply to the Naive Bayes or LinearDiscriminantAnalysis classifiers.</p> 
 
 <p>Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (i.e. decision trees) but generally, categorical predictors need to be converted to a suite of binary using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training samples all of the categories in the rasters, otherwise the onehot-encoding will fail when it comes to the prediction.</p>
 
@@ -52,7 +52,7 @@
 
 <p><em>r.learn.ml</em> uses the "scikit-learn" machine learning python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the linux package manager. For MS-Windows users using a 64 bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32 bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
 
-<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>Lines=25</em> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the availa
 ble memory.</p>
+<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>Lines=25</em> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on ras
 ters of any size irrespective of the available memory.</p>
 
 <p>Many of the classifiers involve a random process which can causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>randst</em> parameter.</p>
 

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-04-16 20:37:08 UTC (rev 70888)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-04-17 04:56:22 UTC (rev 70889)
@@ -5,7 +5,7 @@
 # PURPOSE:       Supervised classification and regression of GRASS rasters
 #                using the python scikit-learn package
 #
-# COPYRIGHT: (c) 2016 Steven Pawley, and the GRASS Development Team
+# COPYRIGHT: (c) 2017 Steven Pawley, and the GRASS Development Team
 #                This program is free software under the GNU General Public
 #                for details.
 #
@@ -35,6 +35,22 @@
 #% guisection: Required
 #%end
 
+#%option G_OPT_V_INPUT
+#% key: trainingpoints
+#% label: Training point vector
+#% description: Vector points map for training
+#% required: no
+#% guisection: Required
+#%end
+
+#%option G_OPT_DB_COLUMN
+#% key: field
+#% label: Response attribute column
+#% description: Name of attribute column containing response value
+#% required: no
+#% guisection: Required
+#%end
+
 #%option G_OPT_R_OUTPUT
 #% key: output
 #% required: yes
@@ -48,7 +64,7 @@
 #% label: Classifier
 #% description: Supervised learning model to use
 #% answer: RandomForestClassifier
-#% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,GaussianNB,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,GradientBoostingClassifier,GradientBoostingRegressor,SVC,EarthClassifier,EarthRegressor,XGBClassifier,XGBRegressor
+#% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,GaussianNB,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,ExtraTreesClassifier,ExtraTreesRegressor,GradientBoostingClassifier,GradientBoostingRegressor,SVC,EarthClassifier,EarthRegressor,XGBClassifier,XGBRegressor
 #%end
 
 #%option
@@ -82,7 +98,7 @@
 
 #%option
 #% key: min_samples_split
-#% type: double
+#% type: integer
 #% description: The minimum number of samples required for node splitting in tree based classifiers
 #% answer: 2
 #% multiple: yes
@@ -133,8 +149,6 @@
 #% guisection: Classifier Parameters
 #%end
 
-# General options
-
 #%flag
 #% key: s
 #% label: Standardization preprocessing
@@ -147,7 +161,7 @@
 #% guisection: Optional
 #%end
 
-#%option string
+#%option integer
 #% key: categorymaps
 #% required: no
 #% multiple: yes
@@ -218,6 +232,12 @@
 #%end
 
 #%flag
+#% key: z
+#% label: Only predict class probabilities
+#% guisection: Optional
+#%end
+
+#%flag
 #% key: m
 #% description: Build model only - do not perform prediction
 #% guisection: Optional
@@ -230,6 +250,15 @@
 #%end
 
 #%option
+#% key: indexes
+#% type: integer
+#% description: Indexes of class probabilities to predict. Default -1 predicts all classes
+#% answer: -1
+#% guisection: Optional
+#% multiple: yes
+#%end
+
+#%option
 #% key: n_permutations
 #% type: integer
 #% description: Number of permutations to perform for feature importances
@@ -246,7 +275,7 @@
 
 #%flag
 #% key: b
-#% description: Balance training data by random oversampling
+#% description: Balance training data using class weights
 #% guisection: Optional
 #%end
 
@@ -288,6 +317,8 @@
 #% exclusive: trainingmap,load_model
 #% exclusive: load_training,save_training
 #% exclusive: trainingmap,load_training
+#% exclusive: trainingpoints,trainingmap
+#% exclusive: trainingpoints,load_training
 #%end
 
 import atexit
@@ -299,44 +330,38 @@
 from grass.pygrass.utils import set_path
 
 set_path('r.learn.ml')
-from raster_learning import train, model_classifiers
-from raster_learning import save_training_data, load_training_data
-from raster_learning import extract, maps_from_group, random_oversampling
+from raster_learning import (
+    model_classifiers, save_training_data, load_training_data, extract,
+    maps_from_group, predict, cross_val_scores, extract_points)
 
 tmp_rast = []
 
-
 def cleanup():
     for rast in tmp_rast:
-        grass.run_command("g.remove", rast=rast, quiet=True)
+        grass.run_command("g.remove", name=rast, type='raster', flags='f', quiet=True)
 
-
 def main():
-
-    """
-    Lazy imports for main------------------------------------------------------
-    """
-
     try:
         from sklearn.externals import joblib
         from sklearn.cluster import KMeans
-        from sklearn.metrics import make_scorer, cohen_kappa_score
         from sklearn.model_selection import StratifiedKFold, GroupKFold
-        from sklearn.preprocessing import StandardScaler
+        from sklearn.preprocessing import StandardScaler, Imputer
         from sklearn.model_selection import GridSearchCV
+        from sklearn.preprocessing import OneHotEncoder
+        from sklearn.pipeline import Pipeline
+        from sklearn.utils import shuffle
         import warnings
         warnings.filterwarnings('ignore')  # turn off UndefinedMetricWarning
     except:
         grass.fatal("Scikit learn 0.18 or newer is not installed")
 
-    """
-    GRASS options and flags----------------------------------------------------
-    """
-
-    # General options and flags -----------------------------------------------
     group = options['group']
-    response = options['trainingmap']
+    trainingmap = options['trainingmap']
+    trainingpoints = options['trainingpoints']
+    field = options['field']
     output = options['output']
+    if '@' in output:
+        output = output.split('@')[0]
     classifier = options['classifier']
     norm_data = flags['s']
     cv = int(options['cv'])
@@ -353,115 +378,78 @@
     load_training = options['load_training']
     save_training = options['save_training']
     importances = flags['f']
+    indexes = options['indexes']
+    if ',' in indexes:
+        indexes = [int(i) for i in indexes.split(',')]
+    else:
+        indexes = int(indexes)
+    if indexes == -1:
+        indexes = None
     n_permutations = int(options['n_permutations'])
     lowmem = flags['l']
     impute = flags['i']
+    prob_only = flags['z']
     errors_file = options['errors_file']
     fimp_file = options['fimp_file']
     balance = flags['b']
-
+    if balance is True:
+        balance = 'balanced'
+    else: balance = None
     if ',' in categorymaps:
         categorymaps = [int(i) for i in categorymaps.split(',')]
-    else:
-        categorymaps = None
+    else: categorymaps = None
+    if importances is True and cv == 1:
+        grass.fatal('Feature importances require cross-validation cv > 1')
 
-    # classifier options and parameter grid settings --------------------------
-    param_grid = {'C': None,
-                  'min_samples_split': None,
-                  'min_samples_leaf': None,
-                  'n_estimators': None,
-                  'learning_rate': None,
-                  'subsample': None,
-                  'max_depth': None,
-                  'max_features': None,
-                  'max_degree': None}
+    # make dicts for hyperparameters, datatypes and parameters for tuning
+    hyperparams = {'C': options['c'],
+                   'min_samples_split': options['min_samples_split'],
+                   'min_samples_leaf': options['min_samples_leaf'],
+                   'n_estimators': options['n_estimators'],
+                   'learning_rate': options['learning_rate'],
+                   'subsample': options['subsample'],
+                   'max_depth': options['max_depth'],
+                   'max_features': options['max_features'],
+                   'max_degree': options['max_degree']}
+    hyperparams_type = dict.fromkeys(hyperparams, int)
+    hyperparams_type['C'] = float
+    hyperparams_type['learning_rate'] = float
+    hyperparams_type['subsample'] = float
+    param_grid = deepcopy(hyperparams_type)
+    param_grid = dict.fromkeys(param_grid, None)
 
-    C = options['c']
-    if ',' in C:
-        param_grid['C'] = [float(i) for i in C.split(',')]
-        C = None
-    else:
-        C = float(C)
+    for key, val in hyperparams.iteritems():
+        # split any comma separated strings and add them to the param_grid
+        if ',' in val: param_grid[key] = [hyperparams_type[key](i) for i in val.split(',')]
+        # else convert the single strings to int or float
+        else: hyperparams[key] = hyperparams_type[key](val)
 
-    min_samples_split = options['min_samples_split']
-    if ',' in min_samples_split:
-        param_grid['min_samples_split'] = \
-            [float(i) for i in min_samples_split.split(',')]
-        min_samples_split = None
-    else:
-        min_samples_split = int(min_samples_split)
+    if hyperparams['max_depth'] == 0: hyperparams['max_depth'] = None
+    if hyperparams['max_features'] == 0: hyperparams['max_features'] = 'auto'
+    param_grid = {k: v for k, v in param_grid.iteritems() if v is not None}
 
-    min_samples_leaf = options['min_samples_leaf']
-    if ',' in min_samples_leaf:
-        param_grid['min_samples_leaf'] = \
-            [int(i) for i in min_samples_leaf.split(',')]
-        min_samples_leaf = None
-    else:
-        min_samples_leaf = int(min_samples_leaf)
+    # retrieve sklearn classifier object and parameters
+    clf, mode = model_classifiers(
+        classifier, random_state, hyperparams, balance)
 
-    n_estimators = options['n_estimators']
-    if ',' in n_estimators:
-        param_grid['n_estimators'] = [int(i) for i in n_estimators.split(',')]
-        n_estimators = None
-    else:
-        n_estimators = int(n_estimators)
+    # remove dict keys that are incompatible for the selected classifier
+    clf_params = clf.get_params()
+    param_grid = {
+        key: value for key, value in param_grid.iteritems()
+        if key in clf_params}
 
-    learning_rate = options['learning_rate']
-    if ',' in learning_rate:
-        param_grid['learning_rate'] = \
-            [float(i) for i in learning_rate.split(',')]
-        learning_rate = None
+    # scoring metrics
+    if mode == 'classification':
+        scoring = ['accuracy', 'precision', 'recall', 'f1', 'kappa', 'balanced_accuracy']
     else:
-        learning_rate = float(learning_rate)
+        scoring = ['r2', 'neg_mean_squared_error']
 
-    subsample = options['subsample']
-    if ',' in subsample:
-        param_grid['subsample'] = [float(i) for i in subsample.split(',')]
-        subsample = None
-    else:
-        subsample = float(subsample)
+    # Sample training data and group ids
+    # ----------------------------------
 
-    max_depth = options['max_depth']
-    if max_depth == '0':
-        max_depth = None
-    else:
-        if ',' in max_depth:
-            param_grid['max_depth'] = [int(i) for i in max_depth.split(',')]
-            max_depth = None
-        else:
-            max_depth = int(max_depth)
-
-    max_features = options['max_features']
-    if max_features == '0':
-        max_features = 'auto'
-    else:
-        if ',' in max_features:
-            param_grid['max_features'] = \
-                [int(i) for i in max_features.split(',')]
-            max_features = None
-        else:
-            max_features = int(max_features)
-
-    max_degree = options['max_degree']
-    if ',' in max_degree:
-        param_grid['max_degree'] = [int(i) for i in max_degree.split(',')]
-        max_degree = None
-    else:
-        max_degree = int(max_degree)
-
-    # remove empty items from the param_grid dict
-    param_grid = {k: v for k, v in param_grid.iteritems() if v is not None}
-
-    if importances is True and cv == 1:
-        grass.fatal('Feature importances require cross-validation cv > 1')
-
-    # fetch individual raster names from group --------------------------------
+    # fetch individual raster names from group
     maplist, map_names = maps_from_group(group)
 
-    """
-    Sample training data and group ids ----------------------------------------
-    """
-
     if model_load == '':
 
         # Sample training data and group id
@@ -476,7 +464,7 @@
             if cvtype == 'clumped' and group_raster == '':
                 clumped_trainingmap = 'tmp_clumped_trainingmap'
                 tmp_rast.append(clumped_trainingmap)
-                r.clump(input=response, output=clumped_trainingmap,
+                r.clump(input=trainingmap, output=clumped_trainingmap,
                         overwrite=True, quiet=True)
                 group_raster = clumped_trainingmap
 
@@ -486,10 +474,12 @@
             if group_raster != '':
                 maplist2 = deepcopy(maplist)
                 maplist2.append(group_raster)
-                X, y, sample_coords = extract(
-                        response=response, predictors=maplist2,
-                        impute=impute, shuffle_data=False, lowmem=False,
-                        random_state=random_state)
+                if trainingmap != '':
+                    X, y, sample_coords = extract(
+                        response=trainingmap, predictors=maplist2, lowmem=False)
+                elif trainingpoints != '':
+                    X, y, sample_coords = extract_points(
+                        trainingpoints, maplist, field)
 
                 # take group id from last column and remove from predictors
                 group_id = X[:, -1]
@@ -497,19 +487,18 @@
             else:
                 # extract training data from maplist without group Ids
                 # shuffle this data by default
-                X, y, sample_coords = extract(
-                    response=response, predictors=maplist,
-                    impute=impute,
-                    shuffle_data=True,
-                    lowmem=lowmem,
-                    random_state=random_state)
-
+                if trainingmap != '':
+                    X, y, sample_coords = extract(
+                        response=trainingmap, predictors=maplist, lowmem=lowmem)
+                elif trainingpoints != '':
+                    X, y, sample_coords = extract_points(
+                        trainingpoints, maplist, field)
                 group_id = None
 
                 if cvtype == 'kmeans':
-                    clusters = KMeans(n_clusters=n_partitions,
-                                      random_state=random_state,
-                                      n_jobs=-1)
+                    clusters = KMeans(
+                        n_clusters=n_partitions,
+                        random_state=random_state, n_jobs=-1)
 
                     clusters.fit(sample_coords)
                     group_id = clusters.labels_
@@ -519,97 +508,124 @@
                 grass.fatal('No training pixels or pixels in imagery group '
                             '...check computational region')
 
+            # impute or remove NaNs
+            if impute is False:
+                y = y[~np.isnan(X).any(axis=1)]
+                sample_coords = sample_coords[~np.isnan(X).any(axis=1)]
+                if group_id is not None:
+                    group_id = group_id[~np.isnan(X).any(axis=1)]
+                X = X[~np.isnan(X).any(axis=1)]
+            else:
+                missing = Imputer(strategy='median')
+                X = missing.fit_transform(X)
+
+            # shuffle data
+            if group_id is None:
+                X, y, sample_coords = shuffle(X, y, sample_coords, random_state=random_state)
+            if group_id is not None:
+                X, y, sample_coords, group_id = shuffle(
+                    X, y, sample_coords, group_id, random_state=random_state)
+
         # option to save extracted data to .csv file
         if save_training != '':
             save_training_data(X, y, group_id, save_training)
 
-        """
-        Train the classifier --------------------------------------------------
-        """
-
-        # retrieve sklearn classifier object and parameters -------------------
-        clf, mode = \
-            model_classifiers(classifier, random_state,
-                              C, max_depth, max_features, min_samples_split,
-                              min_samples_leaf, n_estimators,
-                              subsample, learning_rate, max_degree)
-
-        # set other parameters based on classification or regression ----------
-        if mode == 'classification':
-            if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
-                scorers = 'binary'
-            else:
-                scorers = 'multiclass'
-            search_scorer = make_scorer(cohen_kappa_score)
-            labels = np.unique(y)
-        else:
-            scorers = 'regression'
-            search_scorer = 'r2'
-            labels = None  # no classes
-            balance = False  # no balancing for regression
-            if probability is True:
-                grass.warning(
-                        'Class probabilities only valid for classifications...'
-                        'ignoring')
-                probability = False
-
-        # setup model selection model -----------------------------------------
+        # define model selection cross-validation method
         if any(param_grid) is True and cv == 1:
             grass.fatal('Hyperparameter search requires cv > 1')
         if any(param_grid) is True or cv > 1:
             if group_id is None:
-                search_cv_method = StratifiedKFold(
-                        n_splits=cv, random_state=random_state)
+                resampling = StratifiedKFold(
+                    n_splits=cv, random_state=random_state)
             else:
-                search_cv_method = GroupKFold(n_splits=cv)
+                resampling = GroupKFold(n_splits=cv)
+        else:
+            resampling = None
 
-        # set-up parameter grid for hyperparameter search ---------------------
-        # check that dict keys are compatible for the selected classifier
-        clf_params = clf.get_params()
-        param_grid = { key: value for key, value in param_grid.iteritems() if key in clf_params}
+        # define preprocessing pipeline
+        # -----------------------------
 
-        # check if dict contains and keys, otherwise set it to None
-        # so that the train object will not perform GridSearchCV
-        if any(param_grid) is not True:
-            param_grid = None
+        # sample weights for GradientBoosting or XGBClassifier
+        if balance == 'balanced' and mode == 'classification' and classifier in (
+                'GradientBoostingClassifier', 'XGBClassifier'):
+            from sklearn.utils import compute_class_weight
+            class_weights = compute_class_weight(
+                class_weight='balanced', classes=(y), y=y)
         else:
-            clf = GridSearchCV(estimator=clf, param_grid=param_grid,
-                               scoring=search_scorer, n_jobs=-1,
-                               cv=search_cv_method)
+            class_weights = None
 
-        # preprocessing options -----------------------------------------------
-        if balance is True:
-            sampling = random_oversampling(random_state)
-        else:
-            sampling = None
+        # scaling and onehot encoding
+        if norm_data is True and categorymaps is None:
+            clf = Pipeline([('scaling', StandardScaler()),
+                            ('classifier', clf)])
+        if categorymaps is not None and norm_data is False:
+            enc = OneHotEncoder(categorical_features=categorymaps)
+            enc.fit(X)
+            clf = Pipeline([('onehot', OneHotEncoder(
+                categorical_features=categorymaps,
+                n_values=enc.n_values_, handle_unknown='ignore', # prevent failure due to new categorical vars
+                sparse=False)),  # dense because not all clf can use sparse
+                            ('classifier', clf)])
+        if norm_data is True and categorymaps is not None:
+            enc = OneHotEncoder(categorical_features=categorymaps)
+            enc.fit(X)
+            clf = Pipeline([('onehot', OneHotEncoder(
+                categorical_features=categorymaps,
+                n_values=enc.n_values_, handle_unknown='ignore',
+                sparse=False)),
+                            ('scaling', StandardScaler()),
+                            ('classifier', clf)])
 
-        if norm_data is True:
-            scaler = StandardScaler()
-        else:
-            scaler = None
+        # define hyperparameter grid search
+        # ---------------------------------
 
-        # create training object ----------------------------------------------
-        learn_m = train(clf, categorical_var=categorymaps,
-                        preprocessing=scaler, sampling=sampling)
+        # check if dict contains and keys - perform GridSearchCV
+        if any(param_grid) is True:
 
-        """
-        Fitting, parameter search and cross-validation ------------------------
-        """
+            # if Pipeline then change param_grid keys to named_step
+            if isinstance(clf, Pipeline):
+                for key in param_grid.keys():
+                    newkey = 'classifier__' + key
+                    param_grid[newkey] = param_grid.pop(key)
 
+            # create grid search method
+            clf = GridSearchCV(
+                estimator=clf, param_grid=param_grid, scoring=scoring[0],
+                n_jobs=-1, cv=resampling)
+
+        # classifier training
+        # -------------------
+
         # fit and parameter search
         grass.message(os.linesep)
         grass.message(('Fitting model using ' + classifier))
-        learn_m.fit(X, y, group_id)
 
-        if param_grid is not None:
+        # use groups if GroupKFold and param_grid are present
+        if isinstance(resampling, GroupKFold) and any(param_grid) is True:
+            if balance == 'balanced' and classifier in (
+                    'GradientBoostingClassifier', 'XGBClassifier'):
+                clf.fit(X=X, y=y, groups=group_id, sample_weight=class_weights)
+            else:
+                clf.fit(X=X, y=y, groups=group_id)
+        else:
+            if balance == 'balanced' and classifier in (
+                    'GradientBoostingClassifier', 'XGBClassifier'):
+                clf.fit(X=X, y=y, sample_weight=class_weights)
+            else:
+                clf.fit(X, y)
+
+        if any(param_grid) is True:
             grass.message(os.linesep)
             grass.message('Best parameters:')
-            grass.message(str(learn_m.estimator.best_params_))
-            
+            grass.message(str(clf.best_params_))
+
+        # cross-validation
+        # -----------------
+
         # If cv > 1 then use cross-validation to generate performance measures
         if cv > 1:
-            # check that a sufficient number of samples are present per class
-            if cv > np.histogram(y, bins=len(labels))[0].min():
+            if mode == 'classification' and cv > np.histogram(
+                    y, bins=len(np.unique(y)))[0].min():
                 grass.message(os.linesep)
                 grass.message('Number of cv folds is greater than number of '
                               'samples in some classes. Cross-validation is being'
@@ -620,98 +636,34 @@
                     "Cross validation global performance measures......:")
 
                 # cross-validate the training object
-                learn_m.cross_val(search_cv_method, X, y, group_id,
-                                  scorers, importances,
-                                  n_permutations=n_permutations,
-                                  random_state=random_state)
+                if mode == 'classification' and \
+                    len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
+                    scoring.append('roc_auc')
+                scores, cscores, fimp = cross_val_scores(
+                    clf, X, y, group_id, class_weights, resampling, scoring,
+                    importances, n_permutations, random_state)
 
-                scores = learn_m.get_cross_val_scores()
+                # global scores
+                for method, val in scores.iteritems():
+                    grass.message(
+                        method+":\t%0.3f\t+/-SD\t%0.3f" %
+                        (val.mean(), val.std()))
 
-                if mode == 'classification':
-                    if scorers == 'binary':
-                        grass.message(
-                            "Accuracy   :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['accuracy'].mean(),
-                             scores['accuracy'].std()))
-                        grass.message(
-                            "AUC        :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['auc'].mean(),
-                             scores['auc'].std()))
-                        grass.message(
-                            "Kappa      :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['kappa'].mean(),
-                             scores['kappa'].std()))
-                        grass.message(
-                            "Precision  :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['precision'].mean(),
-                             scores['precision'].std()))
-                        grass.message(
-                            "Recall     :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['recall'].mean(),
-                             scores['recall'].std()))
-                        grass.message(
-                            "Specificity:\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['specificity'].mean(),
-                             scores['specificity'].std()))
-                        grass.message(
-                            "F1         :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['f1'].mean(),
-                             scores['f1'].std()))
+                # individual class scores
+                if mode == 'classification' and len(np.unique(y)) != 2:
+                    grass.message(os.linesep)
+                    grass.message('Cross validation class performance measures......:')
+                    grass.message('Class \t' + '\t'.join(map(str, np.unique(y))))
 
-                    if scorers == 'multiclass':
-                        # global scores
+                    for method, val in cscores.iteritems():
+                        mat_cscores = np.matrix(val)
                         grass.message(
-                            "Accuracy:\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['accuracy'].mean(),
-                             scores['accuracy'].std()))
+                            method+':\t' + '\t'.join(
+                                map(str, np.round(mat_cscores.mean(axis=0), 2)[0])))
                         grass.message(
-                            "Kappa   :\t%0.3f\t+/-SD\t%0.3f" %
-                            (scores['kappa'].mean(),
-                             scores['kappa'].std()))
+                            method+' std:\t' + '\t'.join(
+                                map(str, np.round(mat_cscores.std(axis=0), 2)[0])))
 
-                        # per class scores
-                        grass.message(os.linesep)
-                        grass.message('Cross validation class performance measures......:')
-                        mat_precision = np.matrix(scores['precision'])
-                        mat_recall = np.matrix(scores['recall'])
-                        mat_f1 = np.matrix(scores['f1'])
-
-                        grass.message('Class \t' + '\t'.join(map(str, labels)))
-                        grass.message(
-                            'Precision mean \t' + '\t'.join(
-                                    map(str, np.round(
-                                            mat_precision.mean(axis=0), 2)[0])))
-                        grass.message(
-                            'Precision std \t' + '\t'.join(
-                                    map(str, np.round(
-                                            mat_precision.std(axis=0), 2)[0])))
-                        grass.message(
-                            'Recall mean \t' + '\t'.join(
-                                    map(str, np.round(
-                                            mat_recall.mean(axis=0), 2)[0])))
-                        grass.message(
-                            'Recall std \t' + '\t'.join(
-                                    map(str, np.round(
-                                            mat_recall.std(axis=0), 2)[0])))
-                        grass.message(
-                            'F1 score mean \t' + '\t'.join(
-                                    map(str, np.round(
-                                            mat_f1.mean(axis=0), 2)[0])))
-                        grass.message(
-                            'F1 score std \t' + '\t'.join(
-                                    map(str, np.round(
-                                            mat_f1.std(axis=0), 2)[0])))
-
-                        # remove perclass scores from dict
-                        del scores['precision']
-                        del scores['recall']
-                        del scores['f1']
-
-                else:
-                    grass.message("R2:\t%0.3f\t+/-\t%0.3f" %
-                                  (scores['r2'].mean(),
-                                   scores['r2'].std()))
-
                 # write cross-validation results for csv file
                 if errors_file != '':
                     try:
@@ -730,33 +682,44 @@
                     grass.message("id" + "\t" + "Raster" + "\t" + "Importance")
 
                     # mean of cross-validation feature importances
-                    for i in range(len(learn_m.fimp.mean(axis=0))):
+                    for i in range(len(fimp.mean(axis=0))):
                         grass.message(
                             str(i) + "\t" + maplist[i] +
-                            "\t" + str(round(learn_m.fimp.mean(axis=0)[i], 4)))
+                            "\t" + str(round(fimp.mean(axis=0)[i], 4)))
 
                     if fimp_file != '':
-                        np.savetxt(fname=fimp_file, X=learn_m.fimp, delimiter=',',
+                        np.savetxt(fname=fimp_file, X=fimp, delimiter=',',
                                    header=','.join(maplist), comments='')
     else:
         # load a previously fitted train object
-        # -------------------------------------
         if model_load != '':
             # load a previously fitted model
-            learn_m = joblib.load(model_load)
+            clf = joblib.load(model_load)
 
     # Optionally save the fitted model
     if model_save != '':
-        joblib.dump(learn_m, model_save)
+        joblib.dump(clf, model_save)
 
-    """
-    Prediction on the rest of the GRASS rasters in the imagery group ----------
-    """
+    # prediction on the rest of the GRASS rasters in the imagery group
+    # ----------------------------------------------------------------
 
     if modelonly is not True:
         grass.message(os.linesep)
-        grass.message('Predicting raster...')
-        learn_m.predict(maplist, output, labels, probability, rowincr)
+        if mode == 'classification':
+            if prob_only is False:
+                grass.message('Predicting classification raster...')
+                predict(estimator=clf, predictors=maplist, output=output, predict_type='raw',
+                        labels=np.unique(y), rowincr=rowincr)
+
+            if probability is True:
+                grass.message('Predicting class probabilities...')
+                predict(estimator=clf, predictors=maplist, output=output, predict_type='prob',
+                        labels=np.unique(y), index=indexes, rowincr=rowincr)
+
+        elif mode == 'regression':
+            grass.message('Predicting regression raster...')
+            predict(estimator=clf, predictors=maplist, output=output, predict_type='raw',
+                    labels=None, rowincr=rowincr)
     else:
         grass.message("Model built and now exiting")
 

Modified: grass-addons/grass7/raster/r.learn.ml/raster_learning.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/raster_learning.py	2017-04-16 20:37:08 UTC (rev 70888)
+++ grass-addons/grass7/raster/r.learn.ml/raster_learning.py	2017-04-17 04:56:22 UTC (rev 70889)
@@ -1,735 +1,419 @@
-# -*- coding: utf-8 -*-
-
 import os
-import scipy
 import numpy as np
 from numpy.random import RandomState
-from copy import deepcopy
 import tempfile
+from copy import deepcopy
 import grass.script as grass
 from grass.pygrass.raster import RasterRow
 from grass.pygrass.gis.region import Region
 from grass.pygrass.raster.buffer import Buffer
 from grass.pygrass.modules.shortcuts import imagery as im
+from grass.pygrass.vector import VectorTopo
+from grass.pygrass.vector.table import Link
+from grass.pygrass.utils import get_raster_for_points
 from subprocess import PIPE
 
+def specificity_score(y_true, y_pred):
+    """
+    Function to calculate specificity score
 
-class random_oversampling():
+    Args
+    ----
+    y_true: 1D numpy array of truth values
+    y_pred: 1D numpy array of predicted classes
 
-    def __init__(self, random_state):
-        """
-        Balances X, y observations using simple oversampling
+    Returns
+    -------
+    specificity: specificity score
+    """
+    from sklearn.metrics import confusion_matrix
 
-        Args
-        ----
-        random_state: Seed to pass onto random number generator
-        """
+    cm = confusion_matrix(y_true, y_pred)
+    tn = float(cm[0][0])
+    fp = float(cm[0][1])
 
-        self.random_state = random_state
+    return (tn/(tn+fp))
 
-    def fit_sample(self, X, y):
-        """
-        Performs equal balancing of response and explanatory variances
 
-        Args
-        ----
-        X: numpy array of training data
-        y: 1D numpy array of response data
+def varimp_permutation(estimator, X_test, y_true,
+                       n_permutations, scorer,
+                       random_state):
 
-        Returns
-        -------
-        X_resampled: Numpy array of resampled training data
-        y_resampled: Numpy array of resampled response data
-        """
+    """
+    Method to perform permutation-based feature importance during
+    cross-validation (cross-validation is applied externally to this
+    method)
 
-        np.random.seed(seed=self.random_state)
+    Procedure is:
+    1. Pass fitted estimator and test partition X y
+    2. Assess AUC on the test partition (bestauc)
+    3. Permute each variable and assess the difference between bestauc and
+       the messed-up variable
+    4. Repeat (3) for many random permutations
+    5. Average the repeats
 
-        # count the number of observations per class
-        y_classes = np.unique(y)
-        class_counts = np.histogram(y, bins=len(y_classes))[0]
-        maj_counts = class_counts.max()
+    Args
+    ----
+    estimator: estimator that has been fitted to a training partition
+    X_test, y_true: data and labels from a test partition
+    n_permutations: number of random permutations to apply
+    scorer: scikit-learn metric function to use
+    random_state: seed to pass to the numpy random.seed
 
-        y_resampled = y
-        X_resampled = X
+    Returns
+    -------
+    scores: scores for each predictor following permutation
+    """
 
-        for cla, counts in zip(y_classes, class_counts):
-            # get the number of samples needed to balance minority class
-            num_samples = maj_counts - counts
+    # calculate score on original variables without permutation
+    # determine best metric type for binary/multiclass/regression scenarios
+    y_pred = estimator.predict(X_test)
+    best_score = scorer(y_true, y_pred)
 
-            # get the indices of the ith class
-            indx = np.nonzero(y == cla)
+    np.random.seed(seed=random_state)
+    rstate = RandomState(random_state)
+    scores = np.zeros((n_permutations, X_test.shape[1]))
 
-            # create some new indices
-            oversamp_indx = np.random.choice(indx[0], size=num_samples)
+    # outer loop to repeat the pemutation rep times
+    for rep in range(n_permutations):
 
-            # concatenate to the original X and y
-            y_resampled = np.concatenate((y[oversamp_indx], y_resampled))
-            X_resampled = np.concatenate((X[oversamp_indx], X_resampled))
+        # inner loop to permute each predictor variable and assess
+        # difference in auc
+        for i in range(X_test.shape[1]):
+            Xscram = np.copy(X_test)
+            Xscram[:, i] = rstate.choice(X_test[:, i], X_test.shape[0])
 
-        return (X_resampled, y_resampled)
+            # fit the model on the training data and predict the test data
+            y_pred = estimator.predict(Xscram)
+            scores[rep, i] = best_score-scorer(y_true, y_pred)
+            if scores[rep, i] < 0:
+                scores[rep, i] = 0
 
+    # average the repetitions
+    scores = scores.mean(axis=0)
 
-class train():
+    return(scores)
 
-    def __init__(self, estimator, categorical_var=None,
-                 preprocessing=None, sampling=None):
-        """
-        Train class to perform preprocessing, fitting, parameter search and
-        cross-validation in a single step
 
-        Args
-        ----
-        estimator: Scikit-learn compatible estimator object
-        categorical_var: 1D list containing indices of categorical predictors
-        preprocessing: Sklearn preprocessing scaler
-        sampling: Balancing object e.g. from imbalance-learn
-        """
+def cross_val_scores(estimator, X, y, groups=None, sample_weight=None, cv=3,
+                     scoring=['accuracy'], feature_importances=False,
+                     n_permutations=25, random_state=None):
 
-        # fitting data
-        self.estimator = estimator
+    """
+    Stratified Kfold and GroupFold cross-validation using multiple
+    scoring metrics and permutation feature importances
 
-        # for onehot-encoding
-        self.enc = None
-        self.categorical_var = categorical_var
-        self.category_values = None
+    Args
+    ----
+    estimator: Scikit learn estimator
+    X: 2D numpy array of training data
+    y: 1D numpy array representing response variable
+    groups: 1D numpy array containing group labels
+    sample_weight: 1D numpy array[n_samples,] of sample weights
+    cv: Integer of cross-validation folds or sklearn.model_selection object
+    sampling: Over- or under-sampling object with fit_sample method
+    scoring: List of performance metrics to use
+    feature_importances: Boolean to perform permutation-based importances
+    n_permutations: Number of permutations during feature importance
+    random_state: Seed to pass to the random number generator
+    """
 
-        # for preprocessing of data
-        self.sampling = sampling
-        self.preprocessing = preprocessing
+    from sklearn import metrics
+    from sklearn.model_selection import (
+        RandomizedSearchCV, GridSearchCV, StratifiedKFold)
 
-        # for cross-validation scores
-        self.scores = None
-        self.fimp = None
-        self.mean_tpr = None
-        self.mean_fpr = None
+    estimator = deepcopy(estimator)
 
-    def __onehotencode(self, X):
+    # create model_selection method
+    if isinstance(cv, int):
+        cv = StratifiedKFold(n_splits=cv)
 
-        """
-        Method to convert a list of categorical arrays in X into a suite of
-        binary predictors which are added to the end of the array
+    # fit the model on the training data and predict the test data
+    # need the groups parameter because the estimator can be a
+    # RandomizedSearchCV or GridSearchCV estimator where cv=GroupKFold
+    if isinstance(estimator, RandomizedSearchCV) is True \
+            or isinstance(estimator, GridSearchCV):
+        param_search = True
+    else:
+        param_search = False
 
-        Args
-        ----
-        X: 2D numpy array containing training data
-        """
+    # create dictionary of lists to store metrics
+    scores = dict.fromkeys(scoring)
+    scores = { key: [] for key, value in scores.iteritems()}
+    scoring_methods = {'accuracy': metrics.accuracy_score,
+                       'balanced_accuracy': metrics.recall_score,
+                       'average_precision': metrics.average_precision_score,
+                       'brier_loss': metrics.brier_score_loss,
+                       'kappa': metrics.cohen_kappa_score,
+                       'f1': metrics.f1_score,
+                       'fbeta': metrics.fbeta_score,
+                       'hamming_loss': metrics.hamming_loss,
+                       'jaccard_similarity': metrics.jaccard_similarity_score,
+                       'log_loss': metrics.log_loss,
+                       'matthews_corrcoef': metrics.matthews_corrcoef,
+                       'precision': metrics.precision_score,
+                       'recall': metrics.recall_score,
+                       'specificity': specificity_score,
+                       'roc_auc': metrics.roc_auc_score,
+                       'zero_one_loss': metrics.zero_one_loss,
+                       'r2': metrics.r2_score,
+                       'neg_mean_squared_error': metrics.mean_squared_error}
 
-        from sklearn.preprocessing import OneHotEncoder
+    byclass_methods = {'f1': metrics.f1_score,
+                      'fbeta': metrics.fbeta_score,
+                      'precision': metrics.precision_score,
+                      'recall': metrics.recall_score}
 
-        # store original range of values
-        self.category_values = [0] * len(self.categorical_var)
-        for i, cat in enumerate(self.categorical_var):
-            self.category_values[i] = np.unique(X[:, cat])
+    # create diction to store byclass metrics results
+    n_classes = len(np.unique(y))
+    labels = np.unique(y)
+    byclass_scores = dict.fromkeys(byclass_methods)
+    byclass_scores = { key: np.zeros((0, n_classes)) for key, value in byclass_scores.iteritems()}
 
-        # fit and transform categorical grids to a suite of binary features
-        self.enc = OneHotEncoder(categorical_features=self.categorical_var,
-                                 sparse=False)
-        self.enc.fit(X)
-        X = self.enc.transform(X)
+    # remove any byclass_scorers that are not in the scoring list
+    byclass_scores = {key: value for key, value in byclass_scores.iteritems() if key in scores}
 
-        return(X)
+    # set averaging type for global binary or multiclass scores
+    if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
+        average='binary'
+    else:
+        average='macro'
 
-    def fit(self, X, y, groups=None):
+    # check to see if scoring is a valid sklearn metric
+    for i in scores.keys():
+        try:
+            list(scoring_methods.keys()).index(i)
+        except:
+            print('Scoring ' + i + ' is not a valid scoring method')
+            print('Valid methods are:')
+            print(scoring_methods.keys())
 
-        """
-        Main fit method for the train object
+    # create np array to store feature importance scores
+    if feature_importances is True:
+        fimp = np.zeros((cv.get_n_splits(), X.shape[1]))
+        fimp[:] = np.nan
+    else:
+        fimp = None
 
-        Args
-        ----
-        X, y: training data and labels as numpy arrays
-        groups: groups to be used for cross-validation
-        """
+    # generate Kfold indices
+    if groups is None:
+        k_fold = cv.split(X, y)
+    else:
+        k_fold = cv.split(X, y, groups=groups)
 
-        from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
+    # train on k-1 folds and test of k folds
+    for train_indices, test_indices in k_fold:
 
-        # Balance classes prior to fitting
-        if self.sampling is not None:
-            if groups is None:
-                X, y = self.sampling.fit_sample(X, y)
-            else:
-                X = np.hstack((X, groups.reshape(-1, 1)))
-                X, y = self.sampling.fit_sample(X, y)
-                groups = X[:, -1]
-                X = X[:, :-1]
+        # subset training and test fold data
+        X_train, X_test = X[train_indices], X[test_indices]
+        y_train, y_test = y[train_indices], y[test_indices]
 
-        if self.preprocessing is not None:
-            X = self.__preprocessor(X)
+        # subset training and test fold group ids
+        if groups is not None: groups_train = groups[train_indices]
+        else: groups_train = None
 
-        if self.categorical_var is not None:
-            X = self.__onehotencode(X)
+        # subset training and test fold sample_weight
+        if sample_weight is not None: weights = sample_weight[train_indices]
 
-        # fit the model on the training data and predict the test data
-        # need the groups parameter because the estimator can be a
-        # RandomizedSearchCV or GridSearchCV estimator where cv=GroupKFold
-        if isinstance(self.estimator, RandomizedSearchCV) \
-                or isinstance(self.estimator, GridSearchCV):
-            param_search = True
-        else:
-            param_search = False
-
+        # train estimator on training fold
         if groups is not None and param_search is True:
-            self.estimator.fit(X, y, groups=groups)
+            if sample_weight is None: estimator.fit(X_train, y_train, groups=groups_train)
+            else: estimator.fit(X_train, y_train, groups=groups_train, sample_weight=weights)
         else:
-            self.estimator.fit(X, y)
+            if sample_weight is None: estimator.fit(X_train, y_train)
+            else: estimator.fit(X_train, y_train, sample_weight=weights)
 
-    def __preprocessor(self, X):
-        """
-        Transforms the non-categorical X
-
-        Args
-        ----
-        X; 2D numpy array to transform
-        """
-
-        # create mask so that indices that represent categorical
-        # predictors are not selected
-        if self.categorical_var is not None:
-            idx = np.arange(X.shape[1])
-            mask = np.ones(len(idx), dtype=bool)
-            mask[self.categorical_var] = False
-        else:
-            mask = np.arange(X.shape[1])
-
-        X_continuous = X[:, mask]
-        self.preprocessing.fit(X=X_continuous)
-        X[:, mask] = self.preprocessing.transform(X_continuous)
-
-        return(X)
-
-    def varImp_permutation(self, estimator, X_test, y_true,
-                           n_permutations, scorers,
-                           random_state):
-
-        """
-        Method to perform permutation-based feature importance during
-        cross-validation (cross-validation is applied externally to this
-        method)
-
-        Procedure is:
-        1. Pass fitted estimator and test partition X y
-        2. Assess AUC on the test partition (bestauc)
-        3. Permute each variable and assess the difference between bestauc and
-           the messed-up variable
-        4. Repeat (3) for many random permutations
-        5. Average the repeats
-
-        Args
-        ----
-        estimator: estimator that has been fitted to a training partition
-        X_test, y_true: data and labels from a test partition
-        n_permutations: number of random permutations to apply
-        random_state: seed to pass to the numpy random.seed
-
-        Returns
-        -------
-        scores: AUC scores for each predictor following permutation
-        """
-
-        from sklearn import metrics
-        if scorers == 'binary' or scorers == 'multiclass':
-            scorer = metrics.accuracy_score
-        if scorers == 'regression':
-            scorer = metrics.r2_score
-
-        # calculate score on original variables without permutation
-        # determine best metric type for binary/multiclass/regression scenarios
+        # prediction of test fold
         y_pred = estimator.predict(X_test)
-        best_score = scorer(y_true, y_pred)
 
-        np.random.seed(seed=random_state)
-        rstate = RandomState(random_state)
-        scores = np.zeros((n_permutations, X_test.shape[1]))
+        # calculate global performance metrics
+        for m in scores.keys():
+            # metrics that require probabilties
+            if m == 'brier_loss' or m == 'roc_auc':
+                y_prob = estimator.predict_proba(X_test)[:, 1]
+                scores[m] = np.append(
+                    scores[m], scoring_methods[m](y_test, y_prob))
 
-        # outer loop to repeat the pemutation rep times
-        for rep in range(n_permutations):
+            # metrics that have no averaging for multiclass
+            elif m == 'kappa' or m == 'specificity' or m == 'accuracy' \
+            or m == 'hamming_loss' or m == 'jaccard_similarity' \
+            or m == 'log_loss' or m == 'zero_one_loss':
+                scores[m] = np.append(
+                    scores[m], scoring_methods[m](y_test, y_pred))
 
-            # inner loop to permute each predictor variable and assess
-            # difference in auc
-            for i in range(X_test.shape[1]):
-                Xscram = np.copy(X_test)
-                Xscram[:, i] = rstate.choice(X_test[:, i], X_test.shape[0])
+            # balanced accuracy
+            elif m == 'balanced_accuracy':
+                scores[m] = np.append(
+                    scores[m], scoring_methods[m](
+                        y_test, y_pred, average='macro'))
 
-                # fit the model on the training data and predict the test data
-                y_pred = estimator.predict(Xscram)
-                scores[rep, i] = best_score-scorer(y_true, y_pred)
-                if scores[rep, i] < 0:
-                    scores[rep, i] = 0
+            # metrics that have averaging for multiclass
+            else:
+                scores[m] = np.append(
+                    scores[m], scoring_methods[m](
+                        y_test, y_pred, average=average))
 
-        # average the repetitions
-        scores = scores.mean(axis=0)
+        # calculate per-class performance metrics
+        for key in byclass_scores.keys():
+            byclass_scores[key] = np.vstack((
+                byclass_scores[key], byclass_methods[key](
+                    y_test, y_pred, labels=labels, average=None)))
 
-        return(scores)
-
-    def specificity_score(self, y_true, y_pred):
-
-        """
-        Simple method to calculate specificity score
-
-        Args
-        ----
-        y_true: 1D numpy array of truth values
-        y_pred: 1D numpy array of predicted classes
-
-        Returns
-        -------
-        specificity: specificity score
-        """
-
-        from sklearn.metrics import confusion_matrix
-
-        cm = confusion_matrix(y_true, y_pred)
-
-        tn = float(cm[0][0])
-        # fn = float(cm[1][0])
-        # tp = float(cm[1][1])
-        fp = float(cm[0][1])
-
-        specificity = tn/(tn+fp)
-
-        return (specificity)
-
-    def cross_val(self, splitter, X, y, groups=None, scorers='binary',
-                  feature_importances=False, n_permutations=25,
-                  random_state=None):
-
-        from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
-        from sklearn import metrics
-
-        """
-        Stratified Kfold and GroupFold cross-validation
-        Generates suites of scoring_metrics for binary classification,
-
-        multiclass classification and regression scenarios
-
-        Args
-        ----
-        splitter: Scikit learn model_selection object, e.g. StratifiedKFold
-        X, y: 2D numpy array of training data and 1D array of labels
-        groups: 1D numpy array of groups to be used for cross-validation
-        scorers: String specifying suite of performance metrics to use
-        feature_importances: Boolean to perform permutation-based importances
-        n_permutations: Number of permutations during feature importance
-        random_state: Seed to pass to the random number generator
-        """
-
-        # preprocessing -------------------------------------------------------
-        if self.preprocessing is not None:
-            X = self.__preprocessor(X)
-
-        if self.categorical_var is not None:
-            X = self.__onehotencode(X)
-
-        # create copy of fitting estimator for cross-val fitting
-        fit_train = deepcopy(self.estimator)
-
-        # create dictionary of lists to store metrics -------------------------
-        n_classes = len(np.unique(y))
-
-        if scorers == 'accuracy':
-            self.scores = {
-                'accuracy': []
-            }
-
-        if scorers == 'binary':
-            self.scores = {
-                'accuracy': [],
-                'precision': [],
-                'recall': [],
-                'specificity': [],
-                'f1': [],
-                'kappa': [],
-                'auc': []
-            }
-
-        if scorers == 'multiclass':
-            self.scores = {
-                'accuracy': [],
-                'kappa': [],
-                'precision': np.zeros((0, n_classes)),  # scores per sample
-                'recall': np.zeros((0, n_classes)),
-                'f1': np.zeros((0, n_classes))
-                }
-
-        if scorers == 'regression':
-            self.scores = {
-                'r2': []
-            }
-
-        self.mean_tpr = 0
-        self.mean_fpr = np.linspace(0, 1, 100)
-
-        # create np array to store feature importance scores
-        # for each predictor per fold
+        # feature importances using permutation
         if feature_importances is True:
-            self.fimp = np.zeros((splitter.get_n_splits(), X.shape[1]))
-            self.fimp[:] = np.nan
-
-        # generate Kfold indices ----------------------------------------------
-
-        if groups is None:
-            k_fold = splitter.split(X, y)
-        else:
-            k_fold = splitter.split(
-                X, y, groups=groups)
-
-        # train on k-1 folds and test of k folds ------------------------------
-
-        for train_indices, test_indices in k_fold:
-
-            # get indices for train and test partitions
-            X_train, X_test = X[train_indices], X[test_indices]
-            y_train, y_test = y[train_indices], y[test_indices]
-            if groups is not None:
-                groups_train = groups[train_indices]
-
-            # balance the training fold
-            if self.sampling is not None:
-                if groups is None:
-                    X_train, y_train = self.sampling.fit_sample(
-                            X_train, y_train)
-                else:
-                    X_train = np.hstack((X_train, groups_train.reshape(-1, 1)))
-                    X_train, y_train = self.sampling.fit_sample(
-                            X_train, y_train)
-                    groups_train = X_train[:, -1]
-                    X_train = X_train[:, :-1]
-
+            if bool((np.isnan(fimp)).all()) is True:
+                fimp = varimp_permutation(
+                    estimator, X_test, y_test, n_permutations,
+                    scoring_methods[scoring[0]],
+                    random_state)
             else:
-                # also get indices of groups for the training partition
-                if groups is not None:
-                    groups_train = groups[train_indices]
+                fimp = np.row_stack(
+                    (fimp, varimp_permutation(
+                        estimator, X_test, y_test,
+                        n_permutations, scoring_methods[scoring[0]],
+                        random_state)))
 
-            # fit the model on the training data and predict the test data
-            # need the groups parameter because the estimator can be a
-            # RandomizedSearchCV or GridSearchCV estimator where cv=GroupKFold
-            if isinstance(fit_train, RandomizedSearchCV) is True \
-                    or isinstance(fit_train, GridSearchCV):
-                param_search = True
-            else:
-                param_search = False
+    return(scores, byclass_scores, fimp)
 
-            # train fit_train on training fold
-            if groups is not None and param_search is True:
-                fit_train.fit(X_train, y_train, groups=groups_train)
-            else:
-                fit_train.fit(X_train, y_train)
 
-            # prediction of test fold
-            y_pred = fit_train.predict(X_test)
-            labels = np.unique(y_pred)
+def predict(estimator, predictors, output, predict_type='raw', labels=None,
+            index=None, rowincr=25):
 
-            # calculate metrics
-            if scorers == 'accuracy':
-                self.scores['accuracy'] = np.append(
-                    self.scores['accuracy'],
-                    metrics.accuracy_score(y_test, y_pred))
+    """
+    Prediction on list of GRASS rasters using a fitted scikit learn model
 
-            elif scorers == 'binary':
-                self.scores['accuracy'] = np.append(
-                    self.scores['accuracy'],
-                    metrics.accuracy_score(y_test, y_pred))
+    Args
+    ----
+    estimator: scikit-learn estimator object
+    predictors: list of GRASS rasters
+    output: Name of GRASS raster to output classification results
+    predict_type: character, 'raw' for classification/regression;
+                  'prob' for class probabilities
+    index: Optional, list of class indices to export
+    rowincr: Integer of raster rows to process at one time
+    """
 
-                y_pred_proba = fit_train.predict_proba(X_test)[:, 1]
-                self.scores['auc'] = np.append(
-                    self.scores['auc'],
-                    metrics.roc_auc_score(y_test, y_pred_proba))
+    # current region
+    current = Region()
 
-                self.scores['precision'] = np.append(
-                    self.scores['precision'], metrics.precision_score(
-                        y_test, y_pred, labels, average='binary'))
+    # determine output data type and nodata
+    if labels is not None:
+        ftype = 'CELL'
+        nodata = -2147483648
+    else:
+        ftype = 'FCELL'
+        nodata = np.nan
 
-                self.scores['recall'] = np.append(
-                    self.scores['recall'], metrics.recall_score(
-                        y_test, y_pred, labels, average='binary'))
+    # open predictors as list of rasterrow objects
+    n_features = len(predictors)
+    rasstack = [0] * n_features
 
-                self.scores['specificity'] = np.append(
-                    self.scores['specificity'], self.specificity_score(
-                        y_test, y_pred))
-
-                self.scores['f1'] = np.append(
-                    self.scores['f1'], metrics.f1_score(
-                        y_test, y_pred, labels, average='binary'))
-
-                self.scores['kappa'] = np.append(
-                    self.scores['kappa'],
-                    metrics.cohen_kappa_score(y_test, y_pred))
-
-                fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_proba)
-                self.mean_tpr += scipy.interp(self.mean_fpr, fpr, tpr)
-                self.mean_tpr[0] = 0.0
-
-            elif scorers == 'multiclass':
-
-                self.scores['accuracy'] = np.append(
-                    self.scores['accuracy'],
-                    metrics.accuracy_score(y_test, y_pred))
-
-                self.scores['kappa'] = np.append(
-                    self.scores['kappa'],
-                    metrics.cohen_kappa_score(y_test, y_pred))
-
-                self.scores['precision'] = np.vstack((
-                    self.scores['precision'],
-                    np.array(metrics.precision_score(
-                        y_test, y_pred, average=None))))
-
-                self.scores['recall'] = np.vstack((
-                    self.scores['recall'],
-                    np.array(metrics.recall_score(
-                        y_test, y_pred, average=None))))
-
-                self.scores['f1'] = np.vstack((
-                    self.scores['f1'],
-                    np.array(metrics.f1_score(
-                        y_test, y_pred, average=None))))
-
-            elif scorers == 'regression':
-                self.scores['r2'] = np.append(
-                    self.scores['r2'], metrics.r2_score(y_test, y_pred))
-
-            # feature importances using permutation
-            if feature_importances is True:
-                if bool((np.isnan(self.fimp)).all()) is True:
-                    self.fimp = self.varImp_permutation(
-                        fit_train, X_test, y_test, n_permutations, scorers,
-                        random_state)
-                else:
-                    self.fimp = np.row_stack(
-                        (self.fimp, self.varImp_permutation(
-                            fit_train, X_test, y_test,
-                            n_permutations, scorers, random_state)))
-
-        # summarize data ------------------------------------------------------
-
-        # convert onehot-encoded feature importances back to original vars
-        if self.fimp is not None and self.enc is not None:
-
-            # get start,end positions of each suite of onehot-encoded vars
-            feature_ranges = deepcopy(self.enc.feature_indices_)
-            for i in range(0, len(self.enc.feature_indices_)-1):
-                feature_ranges[i+1] = feature_ranges[i] + \
-                              len(self.category_values[i])
-
-            # take sum of each onehot-encoded feature
-            ohe_feature = [0] * len(self.categorical_var)
-            ohe_sum = [0] * len(self.categorical_var)
-
-            for i in range(len(self.categorical_var)):
-                ohe_feature[i] = \
-                           self.fimp[:, feature_ranges[i]:feature_ranges[i+1]]
-                ohe_sum[i] = ohe_feature[i].sum(axis=1)
-
-            # remove onehot-encoded features from the importances array
-            features_for_removal = np.array(range(feature_ranges[-1]))
-            self.fimp = np.delete(self.fimp, features_for_removal, axis=1)
-
-            # insert summed importances into original positions
-            for index in self.categorical_var:
-                self.fimp = np.insert(self.fimp, np.array(index),
-                                      ohe_sum[0], axis=1)
-
-        if scorers == 'binary':
-            self.mean_tpr /= splitter.get_n_splits(X, y)
-            self.mean_tpr[-1] = 1.0
-
-    def get_roc_curve(self):
-        return (self.mean_tpr, self.mean_fpr)
-
-    def get_cross_val_scores(self):
-        return (self.scores)
-
-    def predict(self, predictors, output, labels=None,
-                class_probabilities=False, rowincr=25):
-
-        """
-        Prediction on list of GRASS rasters using a fitted scikit learn model
-
-        Parameters
-        ----------
-        predictors: List of GRASS rasters
-
-        class_probabilties: Predict class probabilities
-
-        rowincr: Integer of raster rows to process at one time
-        output: Name of GRASS raster to output classification results
-        """
-
-        # determine output data type and nodata
-        if labels is not None:
-            ftype = 'CELL'
-            nodata = -2147483648
+    for i in range(n_features):
+        rasstack[i] = RasterRow(predictors[i])
+        if rasstack[i].exist() is True:
+            rasstack[i].open('r')
         else:
-            ftype = 'FCELL'
-            nodata = np.nan
+            grass.fatal("GRASS raster " + predictors[i] +
+                        " does not exist.... exiting")
 
-        # create a list of rasterrow objects for predictors
-        n_features = len(predictors)
-        rasstack = [0] * n_features
-
-        for i in range(n_features):
-            rasstack[i] = RasterRow(predictors[i])
-            if rasstack[i].exist() is True:
-                rasstack[i].open('r')
-            else:
-                grass.fatal("GRASS raster " + predictors[i] +
-                            " does not exist.... exiting")
-
-        current = Region()
-
-        # create a imagery mask
-        # the input rasters might have different dimensions and null pixels.
-        # r.series used to automatically create a mask by propagating the nulls
-        grass.run_command("r.series", output='tmp_clfmask',
-                          input=predictors, method='count', flags='n',
-                          overwrite=True)
-
-        mask_raster = RasterRow('tmp_clfmask')
-        mask_raster.open('r')
-
-        # create and open RasterRow objects for classification
+    # create and open RasterRow object for writing of classification result
+    if predict_type == 'raw':
         classification = RasterRow(output)
         classification.open('w', ftype, overwrite=True)
 
-        # create and open RasterRow objects for  probabilities if enabled
-        if class_probabilities is True:
+    # Prediction using row blocks
+    for rowblock in range(0, current.rows, rowincr):
+        grass.percent(rowblock, current.rows, rowincr)
 
-            # determine number of classes
-            nclasses = len(labels)
+        # check that the row increment does not exceed the number of rows
+        if rowblock+rowincr > current.rows:
+            rowincr = current.rows - rowblock
+        img_np_row = np.zeros((rowincr, current.cols, n_features))
 
-            prob_out_raster = [0] * nclasses
-            prob = [0] * nclasses
+        # loop through each row, and each band
+        # and add these values to the 2D array img_np_row
+        for row in range(rowblock, rowblock+rowincr, 1):
+            for band in range(n_features):
+                img_np_row[row-rowblock, :, band] = \
+                    np.array(rasstack[band][row])
 
-            for iclass in range(nclasses):
-                prob_out_raster[iclass] = output + \
-                    '_classPr' + str(int(labels[iclass]))
-                prob[iclass] = RasterRow(prob_out_raster[iclass])
-                prob[iclass].open('w', 'FCELL', overwrite=True)
+        # create mask
+        img_np_row[img_np_row == -2147483648] = np.nan
+        mask = np.zeros((img_np_row.shape[0], img_np_row.shape[1]))
+        for feature in range(n_features):
+            invalid_indexes = np.nonzero(np.isnan(img_np_row[:, :, feature]))
+            mask[invalid_indexes] = np.nan
 
-        """
-        Prediction using row blocks
-        """
-        try:
-            for rowblock in range(0, current.rows, rowincr):
-                grass.percent(rowblock, current.rows, rowincr)
-                # check that the row increment does not exceed the number of rows
-                if rowblock+rowincr > current.rows:
-                    rowincr = current.rows - rowblock
-                img_np_row = np.zeros((rowincr, current.cols, n_features))
-                mask_np_row = np.zeros((rowincr, current.cols))
+        # reshape each row-band matrix into a n*m array
+        nsamples = rowincr * current.cols
+        flat_pixels = img_np_row.reshape((nsamples, n_features))
 
-                # loop through each row, and each band
-                # and add these values to the 2D array img_np_row
-                for row in range(rowblock, rowblock+rowincr, 1):
-                    mask_np_row[row-rowblock, :] = np.array(mask_raster[row])
+        # remove NaNs prior to passing to scikit-learn predict
+        flat_pixels = np.nan_to_num(flat_pixels)
 
-                    for band in range(n_features):
-                        img_np_row[row-rowblock, :, band] = \
-                            np.array(rasstack[band][row])
+        # perform prediction for classification/regression
+        if predict_type == 'raw':
+            result = estimator.predict(flat_pixels)
+            result = result.reshape((rowincr, current.cols))
 
-                mask_np_row[mask_np_row == -2147483648] = np.nan
-                nanmask = np.isnan(mask_np_row)  # True in mask means invalid data
+            # replace NaN values so that the prediction does not have a border
+            result[np.nonzero(np.isnan(mask))] = nodata
 
-                # reshape each row-band matrix into a n*m array
-                nsamples = rowincr * current.cols
-                flat_pixels = img_np_row.reshape((nsamples, n_features))
+            # for each row we can perform computation, and write the result
+            for row in range(rowincr):
+                newrow = Buffer((result.shape[1],), mtype=ftype)
+                newrow[:] = result[row, :]
+                classification.put_row(newrow)
 
-                # remove NaN values and GRASS CELL nodata vals
-                flat_pixels[flat_pixels == -2147483648] = np.nan
-                flat_pixels = np.nan_to_num(flat_pixels)
+        # perform prediction for class probabilities
+        if predict_type == 'prob':
+            result_proba = estimator.predict_proba(flat_pixels)
 
-                # rescale
-                if self.preprocessing is not None:
-                    # create mask so that indices that represent categorical
-                    # predictors are not selected
-                    if self.categorical_var is not None:
-                        idx = np.arange(n_features)
-                        mask = np.ones(len(idx), dtype=bool)
-                        mask[self.categorical_var] = False
-                    else:
-                        mask = np.arange(n_features)
-                    flat_pixels_continuous = flat_pixels[:, mask]
-                    flat_pixels[:, mask] = self.preprocessing.transform(
-                            flat_pixels_continuous)
+            # on first loop determine number of probability classes
+            # and open rasterrow objects for writing
+            if rowblock == 0:
+                if index is None:
+                    index = range(result_proba.shape[1])
+                    n_classes = len(index)
+                else:
+                    n_classes = len(np.unique(index))
 
-                # onehot-encoding
-                if self.enc is not None:
-                    try:
-                        flat_pixels = self.enc.transform(flat_pixels)
-                    except:
-                        # if this fails it is because the onehot-encoder was fitted
-                        # on the training samples, but the prediction data contains
-                        # new values, i.e. the training data has not sampled all of
-                        # categories
-                        grass.fatal('There are values in the categorical rasters '
-                                    'that are not present in the training data '
-                                    'set, i.e. the training data has not sampled '
-                                    'all of the categories')
+                # create and open RasterRow objects for probabilities
+                prob_out_raster = [0] * n_classes
+                prob = [0] * n_classes
+                for iclass, label in enumerate(index):
+                    prob_out_raster[iclass] = output + '_classPr' + str(label)
+                    prob[iclass] = RasterRow(prob_out_raster[iclass])
+                    prob[iclass].open('w', 'FCELL', overwrite=True)
 
-                # perform prediction
-                result = self.estimator.predict(flat_pixels)
-                result = result.reshape((rowincr, current.cols))
+            for iclass, label in enumerate(index):
+                result_proba_class = result_proba[:, label]
+                result_proba_class = result_proba_class.reshape((rowincr, current.cols))
 
                 # replace NaN values so that the prediction does not have a border
-                result = np.ma.masked_array(
-                    result, mask=nanmask, fill_value=-99999)
+                result_proba_class[np.nonzero(np.isnan(mask))] = np.nan
 
-                # return a copy of result, with masked values filled with a value
-                result = result.filled([nodata])
-
-                # for each row we can perform computation, and write the result
                 for row in range(rowincr):
-                    newrow = Buffer((result.shape[1],), mtype=ftype)
-                    newrow[:] = result[row, :]
-                    classification.put_row(newrow)
+                    newrow = Buffer((result_proba_class.shape[1],), mtype='FCELL')
+                    newrow[:] = result_proba_class[row, :]
+                    prob[iclass].put_row(newrow)
 
-                # same for probabilities
-                if class_probabilities is True:
-                    result_proba = self.estimator.predict_proba(flat_pixels)
+    # close all maps
+    for i in range(n_features): rasstack[i].close()
 
-                    for iclass in range(result_proba.shape[1]):
+    # close all class probability maps
+    if predict_type == 'raw':
+        classification.close()
+    if predict_type == 'prob':
+        try:
+            for iclass in range(n_classes):
+                prob[iclass].close()
+        except:
+            pass
 
-                        result_proba_class = result_proba[:, iclass]
-                        result_proba_class = result_proba_class.reshape(
-                                                (rowincr, current.cols))
 
-                        result_proba_class = np.ma.masked_array(
-                            result_proba_class, mask=nanmask, fill_value=np.nan)
+def model_classifiers(estimator, random_state, p, weights=None):
 
-                        result_proba_class = result_proba_class.filled([np.nan])
-
-                        for row in range(rowincr):
-                            newrow = Buffer((
-                                        result_proba_class.shape[1],),
-                                        mtype='FCELL')
-
-                            newrow[:] = result_proba_class[row, :]
-                            prob[iclass].put_row(newrow)
-        finally:
-            # close all predictors
-            for i in range(n_features):
-                rasstack[i].close()
-
-            # close classification and mask maps
-            classification.close()
-            mask_raster.close()
-
-            grass.run_command("g.remove", name='tmp_clfmask',
-                              flags="f", type="raster", quiet=True)
-
-            # close all class probability maps
-            try:
-                for iclass in range(nclasses):
-                    prob[iclass].close()
-            except:
-                pass
-
-
-def model_classifiers(estimator='LogisticRegression', random_state=None,
-                      C=1, max_depth=None, max_features='auto',
-                      min_samples_split=2, min_samples_leaf=1,
-                      n_estimators=100, subsample=1.0,
-                      learning_rate=0.1, max_degree=1):
-
     """
     Provides the classifiers and parameters using by the module
 
@@ -737,14 +421,8 @@
     ----
     estimator: Name of estimator
     random_state: Seed to use in randomized components
-    C: Inverse of regularization strength
-    max_depth: Maximum depth for tree-based methods
-    min_samples_split: Minimum number of samples to split a node
-    min_samples_leaf: Minimum number of samples to form a leaf
-    n_estimators: Number of trees
-    subsample: Controls randomization in gradient boosting
-    learning_rate: Used in gradient boosting
-    max_degree: For earth
+    p: Dict, containing classifier setttings
+    weights: None, or 'balanced' to add class_weights
 
     Returns
     -------
@@ -758,7 +436,9 @@
     from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
     from sklearn.naive_bayes import GaussianNB
     from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
-    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
+    from sklearn.ensemble import (
+        RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier,
+        ExtraTreesRegressor)
     from sklearn.ensemble import GradientBoostingClassifier
     from sklearn.ensemble import GradientBoostingRegressor
     from sklearn.svm import SVC
@@ -770,11 +450,11 @@
             from pyearth import Earth
 
             earth_classifier = Pipeline([('Earth',
-                                          Earth(max_degree=max_degree)),
-                                        ('Logistic', LogisticRegression())])
+                                          Earth(max_degree=p['max_degree'])),
+                                         ('Logistic', LogisticRegression())])
 
             classifiers = {'EarthClassifier': earth_classifier,
-                           'EarthRegressor': Earth(max_degree=max_degree)}
+                           'EarthRegressor': Earth(max_degree=p['max_degree'])}
         except:
             grass.fatal('Py-earth package not installed')
 
@@ -782,73 +462,99 @@
         try:
             from xgboost import XGBClassifier, XGBRegressor
 
-            if max_depth is None:
-                max_depth = int(3)
+            if p['max_depth'] is None:
+                p['max_depth'] = int(3)
 
             classifiers = {
                 'XGBClassifier':
-                    XGBClassifier(learning_rate=learning_rate,
-                                  n_estimators=n_estimators,
-                                  max_depth=max_depth,
-                                  subsample=subsample),
+                    XGBClassifier(learning_rate=p['learning_rate'],
+                                  n_estimators=p['n_estimators'],
+                                  max_depth=p['max_depth'],
+                                  subsample=p['subsample']),
                 'XGBRegressor':
-                    XGBRegressor(learning_rate=learning_rate,
-                                 n_estimators=n_estimators,
-                                 max_depth=max_depth,
-                                 subsample=subsample)}
+                    XGBRegressor(learning_rate=p['learning_rate'],
+                                 n_estimators=p['n_estimators'],
+                                 max_depth=p['max_depth'],
+                                 subsample=p['subsample'])}
         except:
             grass.fatal('XGBoost package not installed')
     else:
         # core sklearn classifiers go here
         classifiers = {
-            'SVC': SVC(C=C, probability=True, random_state=random_state),
+            'SVC': SVC(C=p['C'],
+                       class_weight=weights,
+                       probability=True,
+                       random_state=random_state),
             'LogisticRegression':
-                LogisticRegression(C=C, random_state=random_state, n_jobs=-1,
+                LogisticRegression(C=p['C'],
+                                   class_weight=weights,
+                                   random_state=random_state,
+                                   n_jobs=-1,
                                    fit_intercept=True),
             'DecisionTreeClassifier':
-                DecisionTreeClassifier(max_depth=max_depth,
-                                       max_features=max_features,
-                                       min_samples_split=min_samples_split,
-                                       min_samples_leaf=min_samples_leaf,
+                DecisionTreeClassifier(max_depth=p['max_depth'],
+                                       max_features=p['max_features'],
+                                       min_samples_split=p['min_samples_split'],
+                                       min_samples_leaf=p['min_samples_leaf'],
+                                       class_weight=weights,
                                        random_state=random_state),
             'DecisionTreeRegressor':
-                DecisionTreeRegressor(max_features=max_features,
-                                      min_samples_split=min_samples_split,
-                                      min_samples_leaf=min_samples_leaf,
+                DecisionTreeRegressor(max_features=p['max_features'],
+                                      min_samples_split=p['min_samples_split'],
+                                      min_samples_leaf=p['min_samples_leaf'],
                                       random_state=random_state),
             'RandomForestClassifier':
-                RandomForestClassifier(n_estimators=n_estimators,
-                                       max_features=max_features,
-                                       min_samples_split=min_samples_split,
-                                       min_samples_leaf=min_samples_leaf,
+                RandomForestClassifier(n_estimators=p['n_estimators'],
+                                       max_features=p['max_features'],
+                                       min_samples_split=p['min_samples_split'],
+                                       min_samples_leaf=p['min_samples_leaf'],
+                                       class_weight=weights,
                                        random_state=random_state,
                                        n_jobs=-1,
                                        oob_score=False),
             'RandomForestRegressor':
-                RandomForestRegressor(n_estimators=n_estimators,
-                                      max_features=max_features,
-                                      min_samples_split=min_samples_split,
-                                      min_samples_leaf=min_samples_leaf,
+                RandomForestRegressor(n_estimators=p['n_estimators'],
+                                      max_features=p['max_features'],
+                                      min_samples_split=p['min_samples_split'],
+                                      min_samples_leaf=p['min_samples_leaf'],
                                       random_state=random_state,
                                       n_jobs=-1,
                                       oob_score=False),
+            'ExtraTreesClassifier':
+                ExtraTreesClassifier(n_estimators=p['n_estimators'],
+                                     max_features=p['max_features'],
+                                     min_samples_split=p['min_samples_split'],
+                                     min_samples_leaf=p['min_samples_leaf'],
+                                     class_weight=weights,
+                                     random_state=random_state,
+                                     n_jobs=-1,
+                                     oob_score=False),
+            'ExtraTreesRegressor':
+                ExtraTreesRegressor(n_estimators=p['n_estimators'],
+                                    max_features=p['max_features'],
+                                    min_samples_split=p['min_samples_split'],
+                                    min_samples_leaf=p['min_samples_leaf'],
+                                    random_state=random_state,
+                                    n_jobs=-1,
+                                    oob_score=False),
+
             'GradientBoostingClassifier':
-                GradientBoostingClassifier(learning_rate=learning_rate,
-                                           n_estimators=n_estimators,
-                                           max_depth=max_depth,
-                                           min_samples_split=min_samples_split,
-                                           min_samples_leaf=min_samples_leaf,
-                                           subsample=subsample,
-                                           max_features=max_features,
+                GradientBoostingClassifier(learning_rate=p['learning_rate'],
+                                           n_estimators=p['n_estimators'],
+                                           max_depth=p['max_depth'],
+                                           min_samples_split=p['min_samples_split'],
+                                           min_samples_leaf=p['min_samples_leaf'],
+                                           subsample=p['subsample'],
+                                           max_features=p['max_features'],
                                            random_state=random_state),
             'GradientBoostingRegressor':
-                GradientBoostingRegressor(learning_rate=learning_rate,
-                                          n_estimators=n_estimators,
-                                          max_depth=max_depth,
-                                          min_samples_split=min_samples_split,
-                                          min_samples_leaf=min_samples_leaf,
-                                          subsample=subsample,
-                                          max_features=max_features,
+                GradientBoostingRegressor(learning_rate=p['learning_rate'],
+                                          n_estimators=p['n_estimators'],
+                                          max_depth=p['max_depth'],
+                                          min_samples_split=p['min_samples_split'],
+                                          min_samples_leaf=p['min_samples_leaf'],
+                                          subsample=p['subsample'],
+                                          max_features=p['max_features'],
                                           random_state=random_state),
             'GaussianNB': GaussianNB(),
             'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
@@ -862,14 +568,15 @@
     if estimator == 'LogisticRegression' \
         or estimator == 'DecisionTreeClassifier' \
         or estimator == 'RandomForestClassifier' \
+        or estimator == 'ExtraTreesClassifier' \
         or estimator == 'GradientBoostingClassifier' \
         or estimator == 'GaussianNB' \
         or estimator == 'LinearDiscriminantAnalysis' \
         or estimator == 'QuadraticDiscriminantAnalysis' \
         or estimator == 'EarthClassifier' \
         or estimator == 'XGBClassifier' \
-            or estimator == 'SVC':
-            mode = 'classification'
+        or estimator == 'SVC':
+        mode = 'classification'
     else:
         mode = 'regression'
 
@@ -932,8 +639,7 @@
     return(X, y, groups)
 
 
-def extract(response, predictors, impute=False, shuffle_data=True,
-            lowmem=False, random_state=None):
+def extract(response, predictors, lowmem=False):
 
     """
     Samples a list of GRASS rasters using a labelled raster
@@ -943,6 +649,7 @@
     ----
     response: String; GRASS raster with labelled pixels
     predictors: List of GRASS rasters containing explanatory variables
+    lowmem: Boolean, use numpy memmap to query predictors
 
     Returns
     -------
@@ -952,9 +659,6 @@
     y_indexes: Row and Columns of label positions
     """
 
-    from sklearn.utils import shuffle
-    from sklearn.preprocessing import Imputer
-
     current = Region()
 
     # open response raster as rasterrow and read as np array
@@ -1022,24 +726,10 @@
     # convert indexes of training pixels from tuple to n*2 np array
     is_train = np.array(is_train).T
 
-    # impute missing values
-    if impute is True:
-        missing = Imputer(strategy='median')
-        training_data = missing.fit_transform(training_data)
-
-    # Remove nan rows from training data
-    X = training_data[~np.isnan(training_data).any(axis=1)]
-    y = training_labels[~np.isnan(training_data).any(axis=1)]
-    y_indexes = is_train[~np.isnan(training_data).any(axis=1)]
-
-    # shuffle the training data
-    if shuffle_data is True:
-        X, y, y_indexes = shuffle(X, y, y_indexes, random_state=random_state)
-
     # close the response map
     roi_gr.close()
 
-    return(X, y, y_indexes)
+    return(training_data, training_labels, y_indexes)
 
 
 def maps_from_group(group):
@@ -1065,3 +755,55 @@
         map_names.append(rastername.split('@')[0])
 
     return(maplist, map_names)
+
+
+def extract_points(gvector, grasters, field):
+    """
+    Extract values from grass rasters using vector points input
+
+    Args
+    ----
+    gvector: character, name of grass points vector
+    grasters: list of names of grass raster to query
+    field: character, name of field in table to use as response variable
+
+    Returns
+    -------
+    X: 2D numpy array of training data
+    y: 1D numpy array with the response variable
+    coordinates: 2D numpy array of sample coordinates
+    """
+
+    import pandas as pd
+
+    # open grass vector
+    points = VectorTopo(gvector.split('@')[0])
+    points.open('r')
+
+    # create link to attribute table
+    points.dblinks.by_name(name=gvector)
+    link = points.dblinks[0]
+
+    # convert to pandas array
+    gvector_df = pd.DataFrame(points.table_to_dict()).T
+    gvector_df.columns = points.table.columns
+    y = gvector_df.loc[:, field].as_matrix()
+    y = y.astype(float)
+
+    # extract training data
+    X = np.zeros((points.num_primitives()['point'], len(grasters)), dtype=float)
+    for i, raster in enumerate(grasters):
+        rio = RasterRow(raster)
+        values = np.asarray(get_raster_for_points(points, rio))
+        coordinates = values[:, 1:3]
+        X[:, i] = values[:, 3]
+
+    # remove missing response data
+    X = X[~np.isnan(y)]
+    coordinates = coordinates[~np.isnan(y)]
+    y = y[~np.isnan(y)]
+
+    # close
+    points.close()
+
+    return(X, y, coordinates)



More information about the grass-commit mailing list