[GRASS-SVN] r69660 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org svn_grass at osgeo.org
Mon Oct 3 14:51:25 PDT 2016


Author: spawley
Date: 2016-10-03 14:51:25 -0700 (Mon, 03 Oct 2016)
New Revision: 69660

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
   grass-addons/grass7/raster/r.randomforest/r.randomforest.py
Log:
Major update to the r.randomforest addon. Several other classifiers have been included. Also fixes a bug in the ROC_AUC option, which previously was just displaying classification accuracy

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-10-03 21:11:31 UTC (rev 69659)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-10-03 21:51:25 UTC (rev 69660)
@@ -1,33 +1,39 @@
 <h2>DESCRIPTION</h2>
 
-<em><b>r.randomforest</b></em> performs Random forests classification and regression on a suite of predictors within a GRASS imagery group. Random forest (Breiman, 2001) is an ensemble classification tree method which constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a prediction probability, and the final classification result is obtained by averaging of the prediction probabilities across all of the trees. The probability of membership to the individual classes (<i>class_probabilities</i> flag) can also be output using the -p flag. The scikit-learn (python-sklearn) randomforest implementation differs from the original Breiman (2001) reference which uses majority voting rather than averaging.
+<em><b>r.randomforest</b></em> represents a front-end to the scikit-learn machine learning python package, for the purpose of performing classification and regression on a suite of predictors within a GRASS imagery group. The module provides access to random forest classification and to several other classifiers that are commonly used in remote sensing and spatial modelling. For more information concerning the details of any of the algorithms, consult the scikit-learn documentation directly. The choice of classifier is set using the <i>model</i> parameter.
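+<br><br>For example, choosing a different classifier only requires changing the <i>model</i> parameter. The following illustrative command uses the imagery group and training map from the example section below; the output name is arbitrary:
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=lr_classification \
+  model=LogisticRegression
+</pre></div>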
 
-<br><br>Random forests offers a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships and  categorical data, and no rescaling is required. Random forests also require relatively few user-specified parameter choices, principally consisting of the number of trees in the forest (<i>ntrees</i>), and the number of variables that are allowed to be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. There is no accuracy penalty in having a large number of trees apart from increased computational time. For randomforest classification, the default <i>ntrees</i> is 500 and the default setting of <i>mfeatures</i> is equal to the square root of the number of predictors.
+<br><br>The RandomForestClassifier and RandomForestRegressor (Breiman, 2001) options represent ensemble classification and regression tree methods, respectively. These methods construct a forest of uncorrelated decision trees based on a random subset of predictor variables, which is selected independently at every node split in each tree. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees. Random forests require relatively few user-specified parameter choices, principally the number of trees in the forest (<i>ntrees_rf</i>) and the number of variables that are allowed to be chosen from at each node split (<i>m_features_rf</i>), which controls the degree of correlation between the trees. Random forests also include a built-in accuracy assessment, termed the 'out-of-bag' (OOB) error. This is computed through bagging, where approximately one-third of the training data are held out during the construction of each tree, and the OOB data are then used to evaluate the prediction accuracy.
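+<br><br>A regression run follows the same pattern. The sketch below assumes a hypothetical FCELL training raster, <i>biomass_samples</i>, holding continuous values at the sampled locations:
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=biomass_samples output=biomass_prediction \
+  model=RandomForestRegressor ntrees_rf=500 m_features_rf=-1
+</pre></div>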
 
-<br><br>Random forests includes built-in accuracy assessment and variable selection. Random forests uses bagging, where a randomly selected 66% subset of the training data are held-out 'out-of-bag' (OOB) in the construction of each tree, and then OOB data are used to evaluate the prediction accuracy. The OOB error has been shown to provide a reliable estimate of classification error. However, the <i>-cv</i> can be set to > 1 to perform k-fold cross validation as an external method of accuracy assessment, and a number of global performance metrics are used.
+<br><br>LogisticRegression, despite its name, represents a linear model for classification rather than regression. The module provides access to two parameters: <i>c_lr</i>, the inverse of the regularization strength, and the <i>i</i> flag, which specifies whether a constant (a.k.a. bias or intercept) should be added to the decision function.
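+<br><br>For instance, a run with stronger regularization (a smaller <i>c_lr</i>) and an intercept term might look like the following (map names as in the example section):
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=lr_classification \
+  model=LogisticRegression c_lr=0.5 -i
+</pre></div>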
 
-<br><br><b>Note:</b> the performance are applied on a pixel-basis in the module and assume that the training data are not spatially correlated. This assumption is unlikely to be true of the training data represent rasterized polygons rather than points. In this case, an independent set of polygons should be used to test the accuracy of the classification using <i>i.kappa</i>.
+<br><br>LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis are two classifiers with a linear and a quadratic decision surface, respectively. These classifiers do not take any parameters.
 
-<br><br>The random forests scikit-learn implementation includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are displayed in the command output.
+<br><br>GaussianNB implements the Gaussian Naive Bayes algorithm for classification. Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features. This classifier does not take any parameters.
 
-<br><br>Random forests does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by predicting the majority class with a very high accuracy, but at the expense of the minority class. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit-learn implementation balanced mode then automatically adjust weights inversely proportional to class frequencies.
+<br><br>The DecisionTreeClassifier and DecisionTreeRegressor represent non-parametric supervised learning methods used for classification and regression, respectively. Several parameter choices are available, relating to the node splitting method, the number of features to consider at each split, and the minimum number of samples in a split or leaf node.
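+<br><br>As an illustration, a single decision tree could be grown with a random splitting strategy and a larger minimum leaf size (map names as in the example section; the parameter values are not tuned recommendations):
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=dt_classification \
+  model=DecisionTreeClassifier splitter_dt=random min_samples_split_dt=2 min_samples_leaf_dt=5
+</pre></div>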
 
-<br><br>Random forest can also be run in regression mode by setting the <i>mode</i> to the regression option. In this case, the mean square error (mse) is used to measure the quality of each decision split in the tree. Additionally, the the default <i>mfeatures</i> is equal to the number of predictors, and the coefficient of determination R^2 of the prediction is used as the performance measure. The generalization ability of the classifier can also be increased using <i>minsplit</i>, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression. 
+<br><br>The GradientBoostingClassifier and GradientBoostingRegressor use an ensemble of boosted decision trees for classification and regression, respectively. Gradient tree boosting produces a prediction model in the form of an ensemble of weak prediction models (decision trees) that is built in a stage-wise fashion. Gradient tree boosting includes many parameter choices, although the module provides access to the most common parameters that may require tuning for optimal performance.
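+<br><br>For example, a boosted model with more, shallower trees, a reduced learning rate and subsampling might be specified as follows (again, the values are illustrative rather than tuned recommendations):
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=gtb_classification \
+  model=GradientBoostingClassifier n_estimators_gtb=200 learning_rate_gtb=0.05 \
+  max_depth_gtb=2 subsample_gtb=0.75
+</pre></div>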
 
-<br><br>The module also offers the ability to save and load a random forests model. The model is saved as a list of filenames (starting with the extension .pkl which is added automatically) for each numpy array. This list can involve a large number of files, so it makes sense to save each model in a separate directory. To load the model, you need to select the .pkl file that was saved. Saving and loading a model represents a useful feature because it allows a model to be built on one imagery group (ie. set of predictor variables), and then the prediction can be performed on other imagery groups. This approach is commonly employed in species prediction modelling, or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. which include present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables. The names of the GRASS rasters in the imagery groups do not matter because scikit learn saves the model as a series of numpy arrays. However, the new imagery group must contain the same number of rasters, and they should be in the same order as in the imagery group upon which the model was built. As an example, the new imagery group may have a raster named 'mean_precipitation_2050' which substitutes the 'mean_precipitation_2016' in the imagery group that was used to build the model. 
+<br><br>The tree-based classifiers include a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are displayed in the command output.
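+<br><br>For example, the importance scores can be requested with the <i>f</i> flag (using the maps from the example section):
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+  model=RandomForestClassifier -f
+</pre></div>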
 
+<br><br>Most machine learning algorithms do not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this occurs by predicting the majority class with very high accuracy at the expense of the minority class. If you have a highly imbalanced dataset, the <i>b</i> ('balanced') flag can be set. The scikit-learn balanced mode then automatically adjusts weights inversely proportional to class frequencies. This only applies to the LogisticRegression, DecisionTree, RandomForest, and GradientBoosting classifiers.
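+<br><br>For example, to apply balanced class weighting to a random forest classification (using the maps from the example section):
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+  model=RandomForestClassifier -b
+</pre></div>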
+
+<br><br>The module also offers the ability to save and load a classification or regression model. The model is saved as a list of filenames (starting with the extension .pkl, which is added automatically) for each numpy array. This list can involve a large number of files, so it makes sense to save each model in a separate directory. To load the model, you need to select the .pkl file that was saved. Saving and loading a model represents a useful feature because it allows a model to be built on one imagery group (i.e. set of predictor variables), and the prediction can then be performed on other imagery groups. This approach is commonly employed in species prediction modelling, or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.
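+<br><br>A sketch of this workflow, using hypothetical imagery groups <i>predictors_2016</i> and <i>predictors_2050</i> (containing the same number of rasters in the same order) and a hypothetical training map <i>occurrence_roi</i>, might be:
+<div class="code"><pre>
+# build and save the model on the present-day predictors, without predicting
+r.randomforest igroup=predictors_2016 roi=occurrence_roi output=occurrence_2016 \
+  model=RandomForestClassifier savefile=rf_model -m
+
+# load the saved model and predict on the forecasted predictors
+r.randomforest igroup=predictors_2050 output=occurrence_2050 \
+  model=RandomForestClassifier loadfile=rf_model.pkl
+</pre></div>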
+
+<br><br>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <i>save_training</i> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.
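+<br><br>For instance, the predictors could be queried once and the training data re-used with a different classifier (maps from the example section; the csv filename is arbitrary):
+<div class="code"><pre>
+# sample the predictors at the labelled pixels and save the training data
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+  model=RandomForestClassifier save_training=training.csv
+
+# re-use the saved training data with a different classifier
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=gtb_classification \
+  model=GradientBoostingClassifier load_training=training.csv
+</pre></div>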
+
 <h2>NOTES</h2>
 
 <em><b>r.randomforest</b></em> uses the "scikit-learn" machine learning python package. This python package needs to be installed within your GRASS GIS Python environment for <em><b>r.randomforest</b></em> to work.
 <br>
 For Linux users, this package should be available through the linux package manager in most distributions (named for example "python-scikit-learn").
 <br>
-For MS-Windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'.
+For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.
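+<br>For example (the .whl filenames below are examples only and should be replaced with the files matching your Python version and architecture):
+<div class="code"><pre>
+easy_install pip
+pip install numpy-1.10.4+mkl-cp27-cp27m-win_amd64.whl
+pip install scikit_learn-0.18-cp27-cp27m-win_amd64.whl
+</pre></div>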
 
 <p>
-<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size (including to a value that is equal or greater than the number of rows in the current region setting) in which case the entire image will be loaded into memory to classification.
+<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of RAM. However, sampling the training data on a row-by-row basis is slow, so by default the module loads one whole predictor raster into memory at a time. If this still exceeds the available system memory, the <i>l</i> flag can be set to fall back to row-by-row sampling.
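+<br><br>For example, on a machine with limited memory the block size could be reduced and the low memory sampling enabled (using the maps from the example section):
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+  model=RandomForestClassifier lines=10 -l
+</pre></div>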
 
-<br><br> The bootstrapping process involved within random forests also causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.
+<br><br> Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.
 
 <h2>TODO</h2>
 
@@ -54,7 +60,7 @@
 <br><br>Then we can use these training pixels to perform a classification on the more recently obtained landsat 7 image:
 <div class="code"><pre>
 r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
-  mode=classification ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=25
+  model=RandomForestClassifier ntrees_rf=500 m_features_rf=-1 minsplit_rf=2 randst=1 lines=25
 
 # copy category labels from landclass training map to result
 r.category rf_classification raster=landclass96_roi

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.py
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-10-03 21:11:31 UTC (rev 69659)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-10-03 21:51:25 UTC (rev 69660)
@@ -1,29 +1,29 @@
 #!/usr/bin/env python
 ############################################################################
 #
-# MODULE:       r.randomforest
+# MODULE:       r.scikit.learn
 # AUTHOR:       Steven Pawley
-# PURPOSE:    Provides supervised random forest classification and regression
-#                      using python scikit-learn)
+# PURPOSE:      Supervised classification and regression of GRASS rasters using the 
+#               python scikit-learn package
 #
-# COPYRIGHT: (c) 2015 Steven Pawley, and the GRASS Development Team
-#                      This program is free software under the GNU General Public
-#                      License (>=v2). Read the file COPYING that comes with GRASS
-#                      for details.
+# COPYRIGHT: (c) 2016 Steven Pawley, and the GRASS Development Team
+#                This program is free software under the GNU General Public
+#                License (>=v2). Read the file COPYING that comes with GRASS
+#                for details.
 #
 #############################################################################
 
 #%module
-#% description: Provides supervised random forest classification
+#% description: Supervised classification and regression of GRASS rasters using the python scikit-learn package
 #% keyword: classification
+#% keyword: regression
 #% keyword: machine learning
 #% keyword: scikit-learn
-#% keyword: random forests
 #%end
 
 #%option G_OPT_I_GROUP
 #% key: igroup
-#% label: Imagery group to be classified (predictors)
+#% label: Imagery group to be classified
 #% description: Series of raster maps to be used in the random forest classification
 #% required: yes
 #% multiple: no
@@ -31,7 +31,7 @@
 
 #%option G_OPT_R_INPUT
 #% key: roi
-#% label: Raster map with labelled pixels
+#% label: Labelled pixels
 #% description: Raster map with labelled pixels
 #% required: no
 #% guisection: Required
@@ -41,50 +41,174 @@
 #% key: output
 #% required: yes
 #% label: Output Map
+#% description: Prediction surface result from classification or regression model
 #%end
 
 #%option string
-#% key: mode
+#% key: model
 #% required: yes
-#% label: Classification or regression mode
-#% answer: classification
-#% options: classification,regression
+#% label: Classifier
+#% description: Supervised learning model to use
+#% answer: RandomForestClassifier
+#% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,GradientBoostingClassifier,GradientBoostingRegressor,GaussianNB
 #%end
 
+# Logistic regression options
+
+#%option double
+#% key: c_lr
+#% description: Inverse of regularization strength
+#% answer: 1.0
+#% guisection: Logistic Regression
+#%end
+
+#%flag
+#% key: i
+#% description: Fit intercept in logistic regression
+#% guisection: Logistic Regression
+#%end
+
+# Decision tree options
+#%option string
+#% key: splitter_dt
+#% description: The strategy used to choose the split at each node
+#% answer: best
+#% options: best,random
+#% guisection: Decision Tree
+#%end
+
 #%option
-#% key: ntrees
+#% key: m_features_dt
 #% type: integer
+#% description: The number of features to consider when looking for the best split. Default -1 is sqrt(n_features) for classification, and n_features for regression
+#% answer: -1
+#% guisection: Decision Tree
+#%end
+
+#%option
+#% key: min_samples_split_dt
+#% type: integer
+#% description: The minimum number of samples required to split an internal node
+#% answer: 2
+#% guisection: Decision Tree
+#%end
+
+#%option
+#% key: min_samples_leaf_dt
+#% type: integer
+#% description: The minimum number of samples required to be at a leaf node
+#% answer: 1
+#% guisection: Decision Tree
+#%end
+
+#%option
+#% key: min_weight_fraction_leaf_dt
+#% type: double
+#% description: The minimum weighted fraction of the input samples required to be at a leaf node
+#% answer: 0
+#% guisection: Decision Tree
+#%end
+
+# Random Forest Options
+
+#%option
+#% key: ntrees_rf
+#% type: integer
 #% description: Number of trees in the forest
 #% answer: 500
-#% required: yes
-#% guisection: Random Forest Options
+#% guisection: Random Forest
 #%end
 
 #%option
-#% key: mfeatures
+#% key: m_features_rf
 #% type: integer
-#% description: The number of features allowed at each split. Sqrt(n_features) is used by default
+#% description: The number of features allowed at each split. Default -1 is sqrt(n_features) for classification, and n_features for regression
 #% answer: -1
-#% required: yes
-#% guisection: Random Forest Options
+#% guisection: Random Forest
 #%end
 
 #%option
-#% key: minsplit
+#% key: minsplit_rf
 #% type: integer
+#% description: The minimum number of samples required to split a node
+#% answer: 2
+#% guisection: Random Forest
+#%end
+
+# Gradient tree boosting options
+
+#%option
+#% key: learning_rate_gtb
+#% type: double
+#% description: Learning rate shrinks the contribution of each tree
+#% answer: 0.1
+#% guisection: Gradient Boosted Trees
+#%end
+
+#%option
+#% key: n_estimators_gtb
+#% type: integer
+#% description: The number of boosting stages to perform
+#% answer: 100
+#% guisection: Gradient Boosted Trees
+#%end
+
+#%option
+#% key: max_depth_gtb
+#% type: integer
+#% description: The maximum depth limits the number of nodes in the tree
+#% answer: 3
+#% guisection: Gradient Boosted Trees
+#%end
+
+#%option
+#% key: min_samples_split_gtb
+#% type: integer
 #% description: The minimum number of samples required to split an internal node
 #% answer: 2
-#% required: yes
-#% guisection: Random Forest Options
+#% guisection: Gradient Boosted Trees
 #%end
 
 #%option
+#% key: min_samples_leaf_gtb
+#% type: integer
+#% description: The minimum number of samples required to be at a leaf node
+#% answer: 1
+#% guisection: Gradient Boosted Trees
+#%end
+
+#%option
+#% key: min_weight_fraction_leaf_gtb
+#% type: double
+#% description: The minimum weighted fraction of the input samples required to be at a leaf node
+#% answer: 0.
+#% guisection: Gradient Boosted Trees
+#%end
+
+#%option
+#% key: subsample_gtb
+#% type: double
+#% description: The fraction of samples to be used for fitting the individual base learners
+#% answer: 1.0
+#% guisection: Gradient Boosted Trees
+#%end
+
+#%option
+#% key: max_features_gtb
+#% type: integer
+#% description: The number of features to consider during splitting. Default -1 is sqrt(n_features) for classification mode, and n_features for regression mode
+#% answer: -1
+#% guisection: Gradient Boosted Trees
+#%end
+
+# General options
+
+#%option
 #% key: cv
 #% type: integer
 #% description: Use k-fold cross-validation when cv > 1
 #% answer: 1
-#% required: yes
-#% guisection: Random Forest Options
+#% guisection: Optional
 #%end
 
 #%option
@@ -92,8 +216,7 @@
 #% type: integer
 #% description: Seed to pass onto the random state for reproducible results
 #% answer: 1
-#% required: yes
-#% guisection: Random Forest Options
+#% guisection: Optional
 #%end
 
 #%option
@@ -101,26 +224,37 @@
 #% type: integer
 #% description: Processing block size in terms of number of rows
 #% answer: 25
-#% required: yes
-#% guisection: Random Forest Options
+#% guisection: Optional
 #%end
 
 #%flag
 #% key: p
 #% label: Output class membership probabilities
-#% guisection: Random Forest Options
+#% guisection: Optional
 #%end
 
 #%flag
+#% key: m
+#% description: Build model only - do not perform prediction
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: f
+#% description: Output feature importances for ensemble tree-based models
+#% guisection: Optional
+#%end
+
+#%flag
 #% key: b
-#% description: Balance classes by weighting
-#% guisection: Random Forest Options
+#% description: Balance number of observations by weighting for logistic regression and tree-based classifiers
+#% guisection: Optional
 #%end
 
 #%flag
-#% key: m
-#% description: Build model only - do not perform prediction
-#% guisection: Random Forest Options
+#% key: l
+#% description: Low memory version - samples predictors row-by-row (slower)
+#% guisection: Optional
 #%end
 
 #%option G_OPT_F_OUTPUT
@@ -137,395 +271,198 @@
 #% guisection: Optional
 #%end
 
+#%option G_OPT_F_OUTPUT
+#% key: save_training
+#% label: Save training data to csv file
+#% required: no
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_INPUT
+#% key: load_training
+#% label: Load training data from file
+#% required: no
+#% guisection: Optional
+#%end
+
 #%rules
 #% exclusive: roi,loadfile
+#% exclusive: save_training,load_training
 #%end
 
 # import standard modules
 import atexit, random, string, re, os
+import numpy as np
+from subprocess import PIPE
+
+import grass.script as grass
 from grass.pygrass.raster import RasterRow
 from grass.pygrass.gis.region import Region
 from grass.pygrass.raster.buffer import Buffer
-import grass.script as grass
-import numpy as np
+from grass.pygrass.modules.shortcuts import imagery as im
        
 def cleanup():
-    # close the GRASS raster objects and the roi raster object
-    for i in range(nbands): rasstack[i].close()
-    roi_raster.close()
-    if rfmask != '': grass.run_command("g.remove", name=rfmask, flags="f",
-                                       type="raster")
 
-def score_classification_results(X, y, clf, kfolds, rstate):
-    # PURPOSE: custom function performs cross validation on a classification model,
-    # RETURNS: a 1D list representing accuracy, AUROC, precision, recall, kappa and specificity
+    grass.run_command("g.remove", name='clfmasktmp', flags="f", type="raster", quiet=True)
+
+def sample_predictors_byrow(response, predictors):
+
+    # create response rasterrow and open
+    roi_raster = RasterRow(response)
+    roi_raster.open('r')
     
-    # lazy import of sklearn
-    try:
-        from sklearn import cross_validation, metrics
-    except:
-        grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting")
+    # create a list of rasterrow objects
+    # Each GRASS rasterrow can then be referred to by rasstack[band][row]:
+    n_features = len(predictors)
+    rasstack = [0] * n_features
+    for i in range(n_features):
+        rasstack[i] = RasterRow(predictors[i])
+        if rasstack[i].exist() == True:
+            rasstack[i].open('r')
+        else:
+            grass.fatal("GRASS raster " + maplist[i] + " does not exist.... exiting")
         
-    kf = cross_validation.KFold(len(y), n_folds=kfolds, shuffle=True, random_state=rstate)
+    # use grass.pygrass.gis.region to get information about the current region, particularly the number of rows and columns
+    current = Region()
 
-    acc, auc, kappa = [], [], []
+    # determine cell storage type of training roi raster
+    roi_type = grass.read_command("r.info", map=response, flags='g')
+    roi_list = str(roi_type).split(os.linesep)
+    dtype = roi_list[9].split('=')[1]
 
-    for train_index, test_index in kf:
-        X_train, X_test =  X[train_index], X[test_index]
-        y_train, y_test = y[train_index], y[test_index]
-        fit = clf.fit(X_train, y_train)
-        y_pred = fit.predict(X_test)
+    # Count number of labelled pixels
+    roi_stats = str(grass.read_command("r.univar", flags=("g"), map=response))
+    roi_stats = roi_stats.split(os.linesep)[0]
+    nlabel_pixels = int(str(roi_stats).split('=')[1])
 
-        acc = np.append(acc, metrics.accuracy_score(y_test, y_pred))
-        classes_in_fold = np.unique(y_test)
+    # Create an empty numpy array with a row for each labelled pixel
+    # and a column for each band, plus an additional column to attach the labels
+    tindex = 0
+    training_labels = []
+    training_data = np.zeros((nlabel_pixels, n_features+1))
+    training_data[:] = np.NAN
 
-        # check if the response variable is binary, then calculate roc_auc and set average to binary
-        if len(np.unique(y)) == 2:
-            statmethod = 'binary'
-            auc = np.append(auc, metrics.roc_auc_score(y_test, y_pred))
-        else:
-            auc=0
-            statmethod = 'micro'
+    # Loop through each row of the raster
+    for row in range(current.rows):
+        # get the pixels from that row in the ROI raster
+        roi_row_np = roi_raster[row]
 
-        kappa = np.append(kappa, metrics.cohen_kappa_score(y_test, y_pred))
-        
-    return(acc, auc, kappa)
+        # check if any of those pixels are labelled (not equal to nodata)
+        # can use even if roi is FCELL because nodata will be nan and this is not returned anyway
+        is_train = np.nonzero(roi_row_np > -2147483648)
+        training_labels = np.append(training_labels, roi_row_np[is_train])
+        nlabels_in_row = np.array(is_train).shape[1]
 
-def score_regression_results(X, y, clf, kfolds, rstate):
-    # PURPOSE: performs cross validation on a regression model
-    # RETURNS: single value of R2
-    
-    try:
-        from sklearn import cross_validation, metrics
-    except:
-        grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting") 
-        
-    kf = cross_validation.KFold(len(y), n_folds=kfolds, shuffle=True, random_state=rstate)
+        # if there are any labelled pixels
+        # loop through each imagery band for that row and put the data into the img_row_band_np array
+        if np.isnan(roi_row_np).all() != True:
 
-    pr = []
+            for band in range(n_features):
+                imagerow_np = rasstack[band][row]
 
-    for train_index, test_index in kf:
-        X_train, X_test =  X[train_index], X[test_index]
-        y_train, y_test = y[train_index], y[test_index]
-        fit = clf.fit(X_train, y_train)
-        y_pred = fit.predict(X_test)
-        
-        pr = np.append(pr, metrics.r2_score(y_test, y_pred))
+                # attach the label values onto the last column
+                training_data[tindex : tindex+nlabels_in_row, band] = imagerow_np[is_train]
 
-    return(pr)
+            tindex = tindex + nlabels_in_row
 
-def cv_performance_byClass(X, y, clf, kfolds, rstate):
-    # PURPOSE: k-fold cross-validation performance measures by class
-    # RETURNS: 2D list of CLASS and mean performance measure result
-    # RETURNS: 2D list of CLASS and standard deviation of performance measure result
-    # Performance measures are sensitivity, recall
-    
-    # lazy import of sklearn
-    try:
-        from sklearn import cross_validation, metrics
-    except:
-        grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting")
+    # attach training label values onto last dimension of numpy array
+    training_data[0:nlabel_pixels, n_features] = training_labels
 
-    class_list = np.unique(y)
-    nclasses = len(np.unique(y))
+    # convert any CELL maps no data vals to NaN in the training data
+    for i in range(n_features):
+        training_data[training_data[:, i] == -2147483648] = np.nan
 
-    kf = cross_validation.KFold(len(y), n_folds=kfolds, shuffle=True,
-                                random_state=rstate)
+    # Remove nan rows from training data
+    training_data = training_data[~np.isnan(training_data).any(axis=1)]
 
-    # Store performance measures per class
-    recall = np.zeros((nclasses, kfolds))
-    precision = np.zeros((nclasses, kfolds))
+    # Split the numpy array into training_labels and training_data arrays
+    training_labels = training_data[:, n_features]
+    training_data = training_data[:, 0:n_features]
 
-    classmeans = np.zeros((nclasses, 2))
-    classstds = np.zeros((nclasses, 2))
+    # close maps
+    roi_raster.close()
+    
+    for i in range(n_features):
+        rasstack[i].close()
+    
+    return(training_data, training_labels)
 
-    # Loop through CV partitions
-    fold=0
-    for train_index, test_index in kf:
-        X_train, X_test =  X[train_index], X[test_index]
-        y_train, y_test = y[train_index], y[test_index]
-        fit = clf.fit(X_train, y_train)
-        y_pred = fit.predict(X_test)
+def sample_predictors(response, predictors):
+    
+    # open response raster as rasterrow and read as np array
+    if RasterRow(response).exist() == True:
+        roi_gr = RasterRow(response)
+        roi_gr.open('r')
+        response_np = np.array(roi_gr)
+    else:
+        grass.fatal("GRASS response raster does not exist.... exiting")
 
-        # Get performance measures by class     
-        for cindex in range(nclasses):
-            recall[cindex, fold] = metrics.recall_score(y_test, y_pred, labels = [class_list[cindex]], average = 'macro')
-            precision[cindex, fold] = metrics.precision_score(y_test, y_pred, labels = [class_list[cindex]], average = 'macro')
-        fold+= 1
+    # check to see if all predictors exist
+    n_features = len(predictors)
+    
+    for i in range(n_features):
+        if RasterRow(predictors[i]).exist() != True:
+            grass.fatal("GRASS raster " + predictors[i] + " does not exist.... exiting")
 
-    for cindex in range(nclasses):
-        classmeans[cindex, 0] = recall[cindex, :].mean()
-        classmeans[cindex, 1] = precision[cindex, :].mean()
+    # check if any of those pixels are labelled (not equal to nodata)
+    # can use even if roi is FCELL because nodata will be nan and this is not returned anyway
+    is_train = np.nonzero(response_np > -2147483648)
+    training_labels = response_np[is_train]
+    n_labels = np.array(is_train).shape[1]
 
-        classstds[cindex, 0] = recall[cindex, :].std()
-        classstds[cindex, 1] = precision[cindex,:].std()
+    # Create a zero numpy array of len training labels and n_features+1 for y
+    training_data = np.zeros((n_labels, n_features+1))
+
+    # Loop through each raster and sample pixel values at training indexes
+    for f in range(n_features):
+        predictor_gr = RasterRow(predictors[f])
+        predictor_gr.open('r')
+        feature_np = np.array(predictor_gr)
+        training_data[0:n_labels, f] = feature_np[is_train]
+        predictor_gr.close()
     
-    return(classmeans, classstds)
+    # attach training labels to last column
+    training_data[0:n_labels, n_features] = training_labels
 
-def main():
-    igroup = options['igroup']
-    roi = options['roi']
-    output = options['output']
-    mode = options['mode']
-    ntrees = options['ntrees']
-    cv = int(options['cv'])
-    balanced = flags['b']
-    modelonly = flags['m']
-    class_probabilities = flags['p']
-    rowincr = int(options['lines'])
-    mfeatures = int(options['mfeatures'])
-    minsplit = int(options['minsplit'])
-    randst = int(options['randst'])
-    model_save = options['savefile']
-    model_load = options['loadfile']
+    # convert any CELL maps no datavals to NaN in the training data
+    for i in range(n_features):
+        training_data[training_data[:, i] == -2147483648] = np.nan
 
-    """
-    Error checking for valid input parameters
-    """
-   
-    if mfeatures == -1:
-        mfeatures = str('auto')
-    if mfeatures == 0:
-        grass.fatal("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures)...exiting")
-        exit()
-    if minsplit == 0:
-        grass.fatal("minsplit must be greater than zero.....exiting")
-        exit()
-    if rowincr <= 0:
-        grass.fatal("rowincr must be greater than zero....exiting")
-        exit()
-    if ntrees < 1:
-        grass.fatal("ntrees must be greater than zero.....exiting")
-        exit()
-    if mode == 'regression' and balanced == True:
-        grass.warning(_("balanced mode is ignored in Random Forests in regression mode....continuing"))
-    if mode == 'regression' and class_probabilities == True:
-        grass.warning(_("option to output class probabiltities is ignored in regression mode....continuing"))
-    if model_save != '' and model_load != '':
-        grass.fatal("Cannot save and load a model at the same time.....exiting")
-    if model_load == '' and roi == '':
-        grass.fatal("Require labelled pixels regions of interest.....exiting")
+    # Remove nan rows from training data
+    training_data = training_data[~np.isnan(training_data).any(axis=1)]
 
-    """
-    Obtain information about GRASS rasters to be classified
-    """
+    roi_gr.close()
+    
+    # return X and y data
+    return(training_data[:, 0:n_features], training_data[:, n_features])
 
-   # fetch individual raster names from group
+def prediction(clf, predictors, class_probabilities, rowincr, output, mode, labels):
     
-    groupmaps = grass.read_command("i.group", group=igroup, flags="g")
-    maplist = groupmaps.split(os.linesep)
-    maplist = maplist[0:len(maplist)-1]
-
-    # determine number of bands and then create a list of GRASS rasterrow objects
-    global nbands
-    nbands = len(maplist)
-
-    global rasstack
-    rasstack = [0] * nbands
-    for i in range(nbands):
-        rasstack[i] = RasterRow(maplist[i])
-
-    # check to see if each raster in the list exists
-    for i in range(nbands):
+    class_list = np.unique(labels)
+    nclasses = len(class_list)
+    
+    # create a list of rasterrow objects for predictors
+    n_features = len(predictors)
+    rasstack = [0] * n_features
+    for i in range(n_features):
+        rasstack[i] = RasterRow(predictors[i])
         if rasstack[i].exist() == True:
             rasstack[i].open('r')
         else:
             grass.fatal("GRASS raster " + maplist[i] + " does not exist.... exiting")
-
+    
     # use grass.pygrass.gis.region to get information about the current region, particularly
     current = Region()
-
-    # Define name of mask raster
-    global rfmask
-    rfmask = ''
-
-    # lazy import of sklearn
-    try:
-        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
-        from sklearn.externals import joblib
-        from sklearn import cross_validation, metrics
-        import warnings
-        warnings.filterwarnings("ignore")
-    except:
-        grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting")
     
-    """
-    Sample training data using training ROI
-    """
-    
-    global roi_raster
-    roi_raster = RasterRow(roi)
-
-    # load the model
-    if model_load != '':
-        rf = joblib.load(model_load)
-    else:
-        # Check if ROI raster exists if model is not loaded
-        if roi_raster.exist() == True:
-            roi_raster.open('r')
-        else:
-            grass.fatal("ROI raster does not exist.... exiting")
-            
-        # determine cell storage type of training roi raster
-        roi_type = grass.read_command("r.info", map=roi, flags='g')
-        roi_list = str(roi_type).split(os.linesep)
-        roi_type.split('\n')
-        dtype = roi_list[9].split('=')[1]
-
-        # check if training rois are valid for classification and regression
-        if mode == 'classification' and dtype != 'CELL':
-            grass.fatal("Classification mode requires an integer CELL type training roi map.....exiting")
-
-        # Count number of labelled pixels
-        roi_stats = str(grass.read_command("r.univar", flags=("g"), map=roi))
-        roi_stats = roi_stats.split(os.linesep)[0]
-        ncells = str(roi_stats).split('=')[1]
-        nlabel_pixels = int(ncells)
-
-        # Create a zero numpy array with the dimensions of the number of columns in the region
-        # and the number of bands plus an additional band to attach the labels
-        tindex = 0
-        training_labels = []
-        training_data = np.zeros((nlabel_pixels, nbands+1))
-        training_data[:] = np.NAN
-
-        # Loop through each row of the raster, get the pixels from that row in the ROI raster
-        # and check if any of those pixels are labelled (i.e. they are not nan).
-        # If labelled pixels are encountered, loop through each band for that row and put the data
-        # into the img_row_band_np array. Also attach the label values onto the last column
-
-        for row in range(current.rows):
-            roi_row_np = roi_raster[row]
-            is_train = np.nonzero(roi_row_np > -2147483648)
-            training_labels = np.append(training_labels, roi_row_np[is_train])
-            nlabels_in_row = np.array(is_train).shape[1]
-            if np.isnan(roi_row_np).all() != True:
-                for band in range(nbands):
-                    imagerow_np = rasstack[band][row]
-                    training_data[tindex : tindex+nlabels_in_row, band] = imagerow_np[is_train]
-                tindex = tindex + nlabels_in_row
-
-        # Determine the number of class labels using np.unique
-        nclasses = len(np.unique(training_labels))
-        class_list = np.unique(training_labels)
-
-        # attach training label values onto last dimension of numpy array
-        training_data[0:nlabel_pixels, nbands] = training_labels
-
-        # Remove nan rows from numpy array
-        training_data = training_data[~np.isnan(training_data).any(axis=1)]
-
-        # Split the numpy array into training_labels and training_data arrays
-        training_labels = training_data[:, nbands]
-        training_data = training_data[:, 0:nbands]
-
-    """
-    Train the randomforest classifier and perform cross-validation
-    """
-
-    # define classifier for classification or regression mode, and balanced or unbalanced datasets
-    if model_load == '':
-        if mode == 'classification':
-            if balanced == True:
-                rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-                                            class_weight='balanced', max_features=mfeatures, min_samples_split=minsplit, \
-                                            random_state=randst)
-            else:
-                rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-                                            max_features=mfeatures, min_samples_split=minsplit, random_state=randst)
-        else:
-            rf = RandomForestRegressor(n_jobs=-1, n_estimators=int(ntrees), oob_score=False, \
-                                       max_features=mfeatures, min_samples_split=minsplit, random_state=randst)
-
-        # train classifier
-        rf = rf.fit(training_data, training_labels)
-        
-        # output internal performance measures
-        grass.message(_("\r\n"))
-        if mode == 'classification':
-            grass.message(_('RF OOB prediction  accuracy: \t {oob}%'.format(oob=rf.oob_score_ * 100)))
-        else:
-            grass.message(_('Rf coefficient of determination R^2: \t {r2}%'.format \
-                            (r2=rf.score(X=training_data, y=training_labels))))
-       
-        # use cross-validation
-        if cv > 1:
-            grass.message(_('\r\n'))
-            grass.message(_("Cross validation global performance measures......:"))
-        
-            if mode == 'classification':
-                acc, auc, kappa  = score_classification_results(training_data, training_labels, rf, cv, randst)
-                
-                grass.message(_("Acc:\t{mean}\t+/-2SD\t{error}".format(mean=round(acc.mean(),3), error=round(acc.std() * 2,3))))
-                
-                if len(np.unique(training_labels)) == 2:
-                    grass.message(_("AUROC:\t{mean}\t+/-2SD\t{error}".format(mean=round(auc.mean(),3), error=round(auc.std() * 2,3))))
-                
-                grass.message(_("Kappa:\t{mean}\t+/-2SD\t{error}".format(mean=round(kappa.mean(),3), error=round(kappa.std() * 2,3))))
-                        
-                # calculate scores by class
-                Cmean, Cstd = cv_performance_byClass(training_data, training_labels, rf, cv, randst)
-    
-                grass.message(_("\r\n"))
-                grass.message(_("Performance measures by class......:"))
-                grass.message(_("CLASS\tRecall\t2SD\tPrecision\t2SD"))
-                for i in range(nclasses):
-                    row = str(int(class_list[i])) + '\t'
-                    for j in range(2):
-                        row += str( round(Cmean[i,j],3) ) + '\t' +  str( round(Cstd[i,j]*2,3) ) + '\t'
-                    grass.message(_(row))
-                
-            else:
-                scores = score_regression_results(training_data, training_labels, rf, cv, randst)
-                grass.message(_("R2:\t{mean}\t+/-\t{error}".format(mean=scores[:].mean(), error=scores[:].std() * 2)))
-        
-        # diagnostics
-        rfimp = rf.feature_importances_
-        grass.message(_("\r\n"))
-        grass.message(_("Random forest feature importance"))
-        grass.message(_("id" + "\t" + "Raster" + "\t" + "Importance"))
-        for i in range(len(rfimp)):
-             grass.message(_(str(i) + "\t" + maplist[i] + "\t" + str(rfimp[i])))
-        
-        # save the model
-        if model_save != '':
-            joblib.dump(rf, model_save + ".pkl")
-        
-        if modelonly == True:
-            grass.fatal("Model built and now exiting")
-
-    """
-    Prediction on the rest of the GRASS rasters in the imagery group
-    """
-
     # create a imagery mask
     # the input rasters might have different dimensions in terms of value and non-value pixels.
     # r.series used to automatically create a mask by propagating the null values
-    rfmask = 'tmp_' + ''.join([random.choice(string.ascii_letters + string.digits) \
-                               for n in xrange(8)])
-    grass.run_command("r.series", output=rfmask, input=maplist, method='count', flags='n')
+    clfmask = 'clfmasktmp'
+    grass.run_command("r.series", output=clfmask, input=predictors, method='count', flags='n')
 
-    global mask_raster
-    mask_raster = RasterRow(rfmask)
+    mask_raster = RasterRow(clfmask)
     mask_raster.open('r')
     
-    """
-    PROCEDURE
-    1. Create a np.array that can store each raster row for all of the bands
-    2. Loop through the raster, row-by-row and get the row values for each band
-        adding these to the img_np_row np.array,
-        which bundles rowincr rows together to pass to the classifier
-    3. The scikit learn predict function expects a list of pixels, not an NxM matrix.
-        We therefore need to reshape each row matrix into a list.
-        The total matrix size = cols * nbands.
-        Therefore we can use the np.reshape function to convert the image into a list
-        with the number of rows = n_samples, and the number of columns = the number of bands.
-    4. Then we remove any NaN values because the scikit-learn predict function cannot handle NaNs.
-        Here we replace them with a small value using the np.nan_to_num function.
-    5. The flat_pixels is then passed onto the prediction function.
-    6. After the prediction is performed on the row, to save keeping
-        anything in memory, we save it to a GRASS raster object, row-by-row.
-    """
     # create and open RasterRow objects for classification
     classification = RasterRow(output)
     if mode == 'classification':
@@ -535,9 +472,9 @@
         ftype = 'FCELL'
         nodata = np.nan
     classification.open('w', ftype, overwrite=True)
-
+    
     # create and open RasterRow objects for  probabilities if enabled
-    if class_probabilities == True and mode == 'classification':
+    if class_probabilities == True:
         prob_out_raster = [0] * nclasses
         prob = [0] * nclasses
         for iclass in range(nclasses):
@@ -548,17 +485,18 @@
     """
     Prediction using row blocks
     """
-
+            
     for rowblock in range(0, current.rows, rowincr):
         # check that the row increment does not exceed the number of rows
-        if rowblock+rowincr > current.rows: rowincr = current.rows - rowblock
-        img_np_row = np.zeros((rowincr, current.cols, nbands))
+        if rowblock+rowincr > current.rows:
+            rowincr = current.rows - rowblock
+        img_np_row = np.zeros((rowincr, current.cols, n_features))
         mask_np_row = np.zeros((rowincr, current.cols))
 
         # loop through each row, and each band and add these values to the 2D array img_np_row
         for row in range(rowblock, rowblock+rowincr, 1):
             mask_np_row[row-rowblock, :] = np.array(mask_raster[row])
-            for band in range(nbands):
+            for band in range(n_features):
                 img_np_row[row-rowblock, :, band] = np.array(rasstack[band][row])
 
         mask_np_row[mask_np_row == -2147483648] = np.nan
@@ -566,11 +504,11 @@
 
         # reshape each row-band matrix into a list
         nsamples = rowincr * current.cols
-        flat_pixels = img_np_row.reshape((nsamples, nbands))
+        flat_pixels = img_np_row.reshape((nsamples, n_features))
 
         # remove NaN values and perform the prediction
         flat_pixels = np.nan_to_num(flat_pixels)
-        result = rf.predict(flat_pixels)
+        result = clf.predict(flat_pixels)
         result = result.reshape((rowincr, current.cols))
 
         # replace NaN values so that the prediction surface does not have a border
@@ -587,7 +525,7 @@
 
         # same for probabilities
         if class_probabilities == True and mode == 'classification':
-            result_proba = rf.predict_proba(flat_pixels)
+            result_proba = clf.predict_proba(flat_pixels)
             for iclass in range(nclasses):
                 result_proba_class = result_proba[:, iclass]
                 result_proba_class = result_proba_class.reshape((rowincr, current.cols))
@@ -604,6 +542,355 @@
     if class_probabilities == True and mode == 'classification':
         for iclass in range(nclasses): prob[iclass].close()
 
+def shuffle_data(X, y, rstate):
+
+    from sklearn.utils import shuffle
+
+    # combine XY data into a single numpy array
+    XY = np.empty((X.shape[0], X.shape[1]+1))
+    XY[:,0] = y
+    XY[:,1:] = X
+    
+    XY = shuffle(XY, random_state=rstate)
+
+    # split XY into train_xs and train_y
+    X = XY[:,1:]
+    y = XY[:,0]
+    
+    return(X, y)
+
+def save_training_data(X, y, file):
+    # append X and y and save to csv
+    training_data = np.zeros((y.shape[0], X.shape[1]+1))
+    training_data[:, 0:X.shape[1]] = X
+    training_data[:, X.shape[1]] = y
+    np.savetxt(file, training_data, delimiter = ',')
+
+def load_training_data(file):
+    training_data = np.loadtxt(file, delimiter = ',')
+    n_features = training_data.shape[1]
+    X = training_data[:, 0:n_features-1]
+    y = training_data[:, n_features-1]
+    return(X, y)
+
+def main():
+    """
+    GRASS options and flags
+    """
+    # General options and flags
+    igroup = options['igroup']
+    roi = options['roi']
+    output = options['output']
+    model = options['model']
+    cv = int(options['cv'])
+    modelonly = flags['m']
+    class_probabilities = flags['p']
+    rowincr = int(options['lines'])
+    randst = int(options['randst'])
+    model_save = options['savefile']
+    model_load = options['loadfile']
+    save_training = options['save_training']
+    load_training = options['load_training']
+    importances = flags['f']
+    weighting = flags['b']
+    lowmem = flags['l']
+
+    # logistic regression
+    c_lr = float(options['c_lr'])
+    fi = flags['i']
+    
+    # decision trees
+    splitter_dt = options['splitter_dt']
+    m_features_dt = int(options['m_features_dt'])
+    min_samples_split_dt = int(options['min_samples_split_dt'])
+    min_samples_leaf_dt = int(options['min_samples_leaf_dt'])
+    min_weight_fraction_leaf_dt = float(options['min_weight_fraction_leaf_dt'])
+    
+    # random forests
+    ntrees_rf = int(options['ntrees_rf'])
+    m_features_rf = int(options['m_features_rf'])
+    minsplit_rf = int(options['minsplit_rf'])
+    
+    # gradient tree boosting
+    learning_rate_gtb = float(options['learning_rate_gtb'])
+    n_estimators_gtb = int(options['n_estimators_gtb'])
+    max_depth_gtb = int(options['max_depth_gtb'])
+    min_samples_split_gtb = int(options['min_samples_split_gtb'])
+    min_samples_leaf_gtb = int(options['min_samples_leaf_gtb'])
+    min_weight_fraction_leaf_gtb = float(options['min_weight_fraction_leaf_gtb'])
+    subsample_gtb = float(options['subsample_gtb'])
+    max_features_gtb = int(options['max_features_gtb'])
+    
+    # classification or regression
+    if model == 'LogisticRegression' \
+    or model == 'DecisionTreeClassifier' \
+    or model == 'RandomForestClassifier' \
+    or model == 'GradientBoostingClassifier' \
+    or model == 'GaussianNB' \
+    or model == 'LinearDiscriminantAnalysis' \
+    or model == 'QuadraticDiscriminantAnalysis':
+
+        mode = 'classification'
+    else:
+        mode = 'regression'
+
+    """
+    Error checking for valid input parameters
+    """
+
+    # decision trees
+    if model == 'DecisionTreeClassifier' or model == 'DecisionTreeRegressor':
+        if m_features_dt == -1:
+            m_features_dt = str('auto')
+        if m_features_dt == 0:
+            grass.fatal("m_features_dt must be greater than zero, or -1 which uses the sqrt(nfeatures)...exiting")
+        if min_samples_split_dt < 1:
+            grass.fatal("min_samples_split_dt must be >=1.....exiting")
+        if min_samples_leaf_dt < 1:
+            grass.fatal("min_samples_leaf_dt must be >=1.....exiting")
+
+    # random forests
+    if model == 'RandomForestClassifier' or model == 'RandomForestRegressor':
+        if m_features_rf == -1:
+            m_features_rf = str('auto')
+        if m_features_rf == 0:
+            grass.fatal("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures)...exiting")
+        if minsplit_rf < 1:
+            grass.fatal("minsplit must be greater than zero.....exiting")
+        if ntrees_rf < 1:
+            grass.fatal("ntrees must be greater than zero.....exiting")
+
+    # gradient tree boosting
+    if model == 'GradientBoostingClassifier' or model == 'GradientBoostingRegressor':
+        if n_estimators_gtb < 1:
+            grass.fatal("n_estimators_gtb must be greater than zero...exiting")
+        if max_depth_gtb < 1:
+            grass.fatal("max_depth_gtb must be greater than zero...exiting")
+        if min_samples_split_gtb < 1:
+            grass.fatal("min_samples_split_gtb must be greater than zero...exiting")
+        if min_samples_leaf_gtb < 1:
+            grass.fatal("min_samples_leaf_gtb must be greater than zero...exiting")
+        if max_features_gtb == -1:
+            max_features_gtb = str('auto')
+        if max_features_gtb == 0:
+            grass.fatal("max_features_gtb must be greater than zero, or -1 which uses the sqrt(nfeatures)...exiting")
+
+    # general options
+    if rowincr <= 0:
+        grass.fatal("rowincr must be greater than zero....exiting")
+    if model_save != '' and model_load != '':
+        grass.fatal("Cannot save and load a model at the same time.....exiting")
+    if model_load == '' and load_training == '' and roi == '':
+        grass.fatal("Labelled pixels (roi) are required unless a model or training data are loaded.....exiting")
+    if weighting == True:
+        weighting = 'balanced'
+    else:
+        weighting = None
+
+    """
+    Obtain information about GRASS rasters to be classified
+    """
+    
+    # fetch individual raster names from group
+    groupmaps = im.group(group=igroup, flags="g", quiet = True, stdout_=PIPE).outputs.stdout
+    maplist = groupmaps.split(os.linesep)
+    maplist = maplist[0:len(maplist)-1]
+    n_features = len(maplist)
+
+    # use grass.pygrass.gis.region to get information about the current region
+    current = Region()
+
+    # lazy import of sklearn
+    try:
+        from sklearn.externals import joblib
+        import warnings
+        warnings.filterwarnings("ignore")
+    except ImportError:
+        grass.fatal("Scikit-learn (python-sklearn) is not installed...exiting")
+    
+    """
+    Sample training data using training ROI
+    """
+    
+    # load the model or training data
+    if model_load != '':
+        clf = joblib.load(model_load)
+    else:
+        if load_training != '':
+            X, y = load_training_data(load_training)
+        else:
+            # query predictor rasters with training features
+            if not lowmem:
+                X, y = sample_predictors(roi, maplist)
+            else:
+                X, y = sample_predictors_byrow(roi, maplist)
+            
+            # shuffle the training data
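+            # so that later k-fold cross-validation is not biased by the sample order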
+            X, y = shuffle_data(X, y, rstate=randst)
+            
+            # use GRASS option to save the training data
+            if save_training != '': save_training_data(X, y, save_training)
+
+        # determine the class labels and their number using np.unique
+        class_list = np.unique(y)
+        nclasses = len(class_list)
+    
+    # cap the max_features settings at the number of predictors,
+    # skipping the check when the setting is 'auto'
+    if m_features_dt != 'auto' and m_features_dt > n_features:
+        m_features_dt = n_features
+    if m_features_rf != 'auto' and m_features_rf > n_features:
+        m_features_rf = n_features
+    if max_features_gtb != 'auto' and max_features_gtb > n_features:
+        max_features_gtb = n_features
+    
+    """
+    Train the classifier
+    """
+
+    # define classifier unless model is to be loaded from file
+    if model_load == '':
+        
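+        # import the scikit-learn estimators only when a new model is being trained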
+        # classifiers
+        from sklearn.linear_model import LogisticRegression
+        from sklearn.tree import DecisionTreeClassifier
+        from sklearn.ensemble import RandomForestClassifier
+        from sklearn.ensemble import GradientBoostingClassifier
+        from sklearn.naive_bayes import GaussianNB
+        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+        from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
+        
+        from sklearn.neighbors import KNeighborsRegressor
+        from sklearn.tree import DecisionTreeRegressor
+        from sklearn.ensemble import RandomForestRegressor
+        from sklearn.ensemble import GradientBoostingRegressor
+        
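+        # map the model option string onto an instantiated scikit-learn estimator
+        # configured with the user-supplied (or default) hyperparameters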
+        classifiers = {
+            'LogisticRegression': LogisticRegression(C=c_lr, fit_intercept=fi, n_jobs=-1, class_weight=weighting),
+            'DecisionTreeClassifier': DecisionTreeClassifier(splitter=splitter_dt,
+                                                             max_features=m_features_dt,
+                                                             min_samples_split=min_samples_split_dt,
+                                                             min_samples_leaf=min_samples_leaf_dt,
+                                                             min_weight_fraction_leaf=min_weight_fraction_leaf_dt,
+                                                             random_state=randst,
+                                                             class_weight=weighting),
+            'DecisionTreeRegressor': DecisionTreeRegressor(splitter=splitter_dt,
+                                                           max_features=m_features_dt,
+                                                           min_samples_split=min_samples_split_dt,
+                                                           min_samples_leaf=min_samples_leaf_dt,
+                                                           min_weight_fraction_leaf=min_weight_fraction_leaf_dt,
+                                                           random_state=randst),
+            'RandomForestClassifier': RandomForestClassifier(n_estimators=ntrees_rf,
+                                                             oob_score=True,
+                                                             max_features=m_features_rf,
+                                                             min_samples_split=minsplit_rf,
+                                                             random_state=randst,
+                                                             n_jobs=-1,
+                                                             class_weight=weighting),
+            'RandomForestRegressor': RandomForestRegressor(n_jobs=-1,
+                                                           n_estimators=ntrees_rf,
+                                                           oob_score=False,
+                                                           max_features=m_features_rf,
+                                                           min_samples_split=minsplit_rf,
+                                                           random_state=randst),
+            'GradientBoostingClassifier': GradientBoostingClassifier(learning_rate=learning_rate_gtb,
+                                                                     n_estimators=n_estimators_gtb,
+                                                                     max_depth=max_depth_gtb,
+                                                                     min_samples_split=min_samples_split_gtb,
+                                                                     min_samples_leaf=min_samples_leaf_gtb,
+                                                                     min_weight_fraction_leaf=min_weight_fraction_leaf_gtb,
+                                                                     subsample=subsample_gtb,
+                                                                     max_features=max_features_gtb,
+                                                                     random_state=randst),
+            'GradientBoostingRegressor': GradientBoostingRegressor(learning_rate=learning_rate_gtb,
+                                                                   n_estimators=n_estimators_gtb,
+                                                                   max_depth=max_depth_gtb,
+                                                                   min_samples_split=min_samples_split_gtb,
+                                                                   min_samples_leaf=min_samples_leaf_gtb,
+                                                                   min_weight_fraction_leaf=min_weight_fraction_leaf_gtb,
+                                                                   subsample=subsample_gtb,
+                                                                   max_features=max_features_gtb,
+                                                                   random_state=randst),
+            'GaussianNB': GaussianNB(),
+            'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
+            'QuadraticDiscriminantAnalysis': QuadraticDiscriminantAnalysis(),
+        }
+        
+        # define classifier
+        clf = classifiers[model]
+        
+        # train classifier
+        clf.fit(X, y)
+        grass.message(_("Model built with: " + model))
+        
+        """
+        Cross Validation
+        """
+
+        # output internal performance measures for random forests
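+        # the OOB score is computed from the samples left out of each bootstrap,
+        # giving an internal accuracy estimate without a separate test set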
+        if model == 'RandomForestClassifier':
+            grass.message(_("\r\n"))
+            grass.message(_('RF OOB prediction accuracy: \t %0.3f' %
+                            (clf.oob_score_ * 100)))
+        if model == 'RandomForestRegressor':
+            grass.message(_("\r\n"))
+            grass.message(_('Coefficient of determination R^2: \t %0.3f' %
+                            (clf.score(X, y))))
+       
+        # If cv > 1 then use cross-validation to generate performance measures
+        if cv > 1:
+            from sklearn.cross_validation import cross_val_score
+            from sklearn.metrics import classification_report
+            from sklearn.cross_validation import cross_val_predict
+
+            grass.message(_('\r\n'))
+            grass.message(_("Cross validation global performance measures......:"))
+
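+            # accuracy (and ROC AUC for binary problems) for classification, r2 for
+            # regression, each reported as mean +/- 2 standard deviations across the folds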
+            if mode == 'classification':
+                accuracy = cross_val_score(clf, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
+                grass.message(_("Accuracy   :\t%0.2f\t+/-2SD\t%0.2f" %(accuracy.mean(), accuracy.std() * 2)))
+                if len(np.unique(y)) == 2:
+                    auc = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
+                    grass.message(_("ROC AUC    :\t%0.2f\t+/-2SD\t%0.2f" % (auc.mean(), auc.std() * 2)))
+                
+                # calculate scores by class using classification report
+                y_pred = cross_val_predict(clf, X, y, cv=cv, n_jobs=-1)
+                grass.message(_("\n"))
+                grass.message(_("Classification report:"))
+                grass.message(_(classification_report(y, y_pred)))
+                
+            else:
+                r2 = cross_val_score(clf, X, y, cv=cv, scoring='r2', n_jobs=-1)
+                grass.message(_("R2:\t%0.2f\t+/-\t%0.2f" % (r2.mean(), r2.std() * 2)))
+                
+
+        # diagnostics
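+        # feature_importances_ is only exposed by the tree-based estimators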
+        if importances:
+            if model in ('RandomForestClassifier',
+                         'GradientBoostingClassifier',
+                         'RandomForestRegressor',
+                         'GradientBoostingRegressor',
+                         'DecisionTreeClassifier',
+                         'DecisionTreeRegressor'):
+
+                clfimp = clf.feature_importances_
+                grass.message(_("\r\n"))
+                grass.message(_("Feature importances"))
+                grass.message(_("id" + "\t" + "Raster" + "\t" + "Importance"))
+
+                for i in range(len(clfimp)):
+                    grass.message(_(str(i) + "\t" + maplist[i] + "\t" + str(round(clfimp[i], 4))))
+        
+        # save the model
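+        # joblib serialises the fitted estimator so it can later be reused via model_load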
+        if model_save != '':
+            joblib.dump(clf, model_save + ".pkl")
+        
+        if modelonly:
+            grass.fatal("Model built and now exiting")
+
+    """
+    Prediction on the rest of the GRASS rasters in the imagery group
+    """
+    
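+    # apply the trained model to the rasters in the imagery group and write the output map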
+    prediction(clf, maplist, class_probabilities, rowincr, output, mode, y)
+    
+
 if __name__ == "__main__":
     options, flags = grass.parser()
     atexit.register(cleanup)

