[GRASS-SVN] r70250 - in grass-addons/grass7/raster: . r.learn.ml

svn_grass at osgeo.org svn_grass at osgeo.org
Wed Jan 4 14:00:23 PST 2017


Author: spawley
Date: 2017-01-04 14:00:23 -0800 (Wed, 04 Jan 2017)
New Revision: 70250

Added:
   grass-addons/grass7/raster/r.learn.ml/
   grass-addons/grass7/raster/r.learn.ml/Makefile
   grass-addons/grass7/raster/r.learn.ml/lsat7_2000_b742.png
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
   grass-addons/grass7/raster/r.learn.ml/rfclassification.png
Log:
'Renaming of r.randomforest to r.learn.ml. Added automatic parameter tuning through nested cross-validation, and the ability to supply a grouping raster to perform spatial cross-validation'

Added: grass-addons/grass7/raster/r.learn.ml/Makefile
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/Makefile	                        (rev 0)
+++ grass-addons/grass7/raster/r.learn.ml/Makefile	2017-01-04 22:00:23 UTC (rev 70250)
@@ -0,0 +1,7 @@
+MODULE_TOPDIR = ../..
+
+PGM = r.learn.ml
+
+include $(MODULE_TOPDIR)/include/Make/Script.make
+
+default: script

Added: grass-addons/grass7/raster/r.learn.ml/lsat7_2000_b742.png
===================================================================
(Binary files differ)


Property changes on: grass-addons/grass7/raster/r.learn.ml/lsat7_2000_b742.png
___________________________________________________________________
Added: svn:mime-type
   + application/octet-stream

Added: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	                        (rev 0)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-04 22:00:23 UTC (rev 70250)
@@ -0,0 +1,103 @@
+<h2>DESCRIPTION</h2>
+
+<p><em>r.learn.ml</em> represents a front-end to the scikit-learn python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using several classifiers that are commonly used in remote sensing and spatial modelling. The choice of classifier is set using the <em>classifier</em> parameter; a minimal invocation is sketched after the list below. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>. The following classification and regression methods are available:</p>
+
+<ul>
+	<li><em>LogisticRegression</em> represents a linear model for classification. It is a modification of linear regression that applies the logistic function, which enables the use of a categorical response variable. If the <em>response</em> raster is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
+	
+	<li><em>LinearDiscriminantAnalysis</em> and <em>QuadraticDiscriminantAnalysis</em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification.</li>
+	
+	<li><em>GaussianNB</em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters.</li>
+	
+	<li>The <em>DecisionTreeClassifier</em> and <em>DecisionTreeRegressor</em> map observations to a response variable using a hierarchy of splits and branches. The terminal nodes of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric, can model non-linear relationships between a response and predictor variables, and are insensitive to the scaling of the predictors.</li>
+	
+	<li>The <em>RandomForestClassifier</em> and <em>RandomForestRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data, and only a random subset of the predictors, based on <em>max_features</em>, is made available during each node split. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees.</li>
+	
+	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion, whereby <i>n_estimators</i> trees are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals.</li>
+	
+	<li>The <em>SVC</em> model is a C-Support Vector Classifier. Only a linear kernel is supported, because the non-linear kernels in scikit-learn are too slow to be practical for typical remote sensing and spatial analysis datasets, which consist of large numbers of samples. Even with a linear kernel, this classifier can still be slow for large datasets.</li>
+	
+	<li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which optionally can be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression, which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
+</ul>
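+
+<p>As a minimal sketch (the imagery group and training raster names below are placeholders), a classifier is selected through the <em>classifier</em> parameter:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=classification \
+  classifier=LogisticRegression
+</pre></div>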
+
+<p>The Classifier Parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. Sensible defaults are supplied, but these parameters can be automatically tuned using a randomized search by setting the <em>n_iter</em> option to &gt; 1. Parameter tuning can be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to &gt; 1 (see the sketch after the list below). The parameters consist of:</p>
+
+<ul>
+	<li><em>C</em> is the inverse of the regularization strength, i.e. the strength of the penalty that is applied to avoid overfitting. <i>C</i> applies to the LogisticRegression and SVC models. Tuning occurs over the range of 1-1000.</li>
+	
+	<li><em>n_estimators</em> represents the number of trees in the random forest model, and the number of trees used in each model step during gradient boosting. Tuning occurs over 50-500 for gradient boosting, whereas <em>n_estimators</em> is not tuned for random forests.</li>
+	
+	<li><em>max_features</em> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. Tuning occurs over the range of one to all of the available features for random forests and gradient boosting. Single decision trees are not tuned on this parameter.</li>
+	
+	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node, or to form a leaf node, respectively. Tuning varies these parameters by allowing up to 20% of the samples to be required to form a node split or leaf node.</li>
+	
+	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to gradient boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. <em>learning_rate</em> is tuned over 0.001-0.1, and <em>subsample</em> is tuned over 0-1.0.</li>
+	
+	<li>Parameters relating to the Earth classifier consist of: <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass; <em>penalty</em>, which is a smoothing parameter; and <em>minspan_alpha</em>, which is a probability between 0 and 1 that controls the number of data points between knots. These are tuned over 1-10 for <em>max_degree</em>, 0.5-5 for <em>penalty</em>, and 0.05-1 for <em>minspan_alpha</em>.</li>
+</ul>
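+
+<p>For example, the following sketch (raster names are placeholders) tunes a gradient boosting model using 25 randomized search iterations inside a 5-fold nested cross-validation:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=classification \
+  classifier=GradientBoostingClassifier n_iter=25 cv=5
+</pre></div>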
+
+<p>In addition to model fitting and prediction, feature selection can be performed using the <i>f</i> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <i>n_permutations</i>; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
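+
+<p>As a sketch (raster and file names are placeholders), feature importances can be calculated during cross-validation and written to a csv file:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=classification \
+  classifier=RandomForestClassifier cv=5 n_permutations=10 fimp_file=fimp.csv -f
+</pre></div>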
+
+<p>Cross validation can be performed by setting the <i>cv</i> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons), then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <i>cvtype</i> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and cross-validation will then be performed effectively on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient, then a raster containing the group_ids of the partitions can be supplied using the <i>group_raster</i> option.</p>
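+
+<p>For example, spatial cross-validation on a polygon basis could be performed as follows (raster and file names are placeholders), with the global accuracy measures saved to a csv file:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_polygons output=classification \
+  classifier=RandomForestClassifier cv=5 cvtype=clumped errors_file=accuracy.csv
+</pre></div>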
+
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as LogisticRegression and SVC may not perform optimally if some predictors have variances that are orders of magnitude larger than others, and these predictors will therefore dominate the objective function. The <i>s</i> flag can be used to add a standardization preprocessing step to the classification and prediction, which will center and scale each predictor by its mean and standard deviation.</p>
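+
+<p>For example, a sketch of standardizing the predictors when using a linear model (raster names are placeholders):</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=classification \
+  classifier=LogisticRegression -s
+</pre></div>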
+
+<p>The module also offers the ability to save and load a classification or regression model. This allows a model fitted on one imagery group to be applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling, whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions are performed on other imagery groups containing forecasted climatic variables.</p>
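+
+<p>As a sketch (group and file names are placeholders), a model could be fitted and saved on one imagery group using the <i>m</i> flag to skip the prediction, and then loaded for prediction on a second imagery group:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors_1996 trainingmap=training_pixels output=classification \
+  classifier=LogisticRegression save_model=model.gz -m
+
+r.learn.ml group=predictors_2000 output=classification_2000 load_model=model.gz
+</pre></div>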
+
+<p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <i>save_training</i> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
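+
+<p>For example (raster and file names are placeholders), the training data could be extracted and saved during a first run and then reused with a different classifier:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=rf_classification \
+  classifier=RandomForestClassifier save_training=training.csv
+
+r.learn.ml group=predictors output=svc_classification \
+  classifier=SVC load_training=training.csv
+</pre></div>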
+
+<h2>NOTES</h2>
+
+<p><em>r.learn.ml</em> uses the "scikit-learn" machine learning python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
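+
+<p>For example, on 64-bit MS-Windows the installation could proceed as follows. The .whl file names below are illustrative only; use the files downloaded from the Gohlke repository that match your python version:</p>
+<div class="code"><pre>
+easy_install pip
+pip install numpy-1.10.4+mkl-cp27-none-win_amd64.whl
+pip install scikit_learn-0.18.1-cp27-cp27m-win_amd64.whl
+pip install pandas
+</pre></div>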
+
+<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of RAM. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read each predictor into memory one at a time. If this still exceeds the system memory, then the <i>l</i> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
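+
+<p>For example, a sketch of reducing memory consumption on a large region (raster names are placeholders) by using smaller row blocks and memory-mapped sampling:</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=classification \
+  classifier=RandomForestClassifier lines=10 -l
+</pre></div>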
+
+<p>Many of the classifiers involve a random process, which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>random_state</i> parameter.</p>
+
+<p>The balancing option in scikit-learn, which seeks to reduce class imbalances using weights that are inversely proportional to class frequencies, only applies to a few of the classifiers (LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier). A separate python package called imbalanced-learn provides more sophisticated up- and down-sampling methods, e.g. using SMOTE, ROSE, etc. The option to balance the training data using this optionally installed package will be added in the future.</p>
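+
+<p>For example, a sketch of enabling class weights balancing for a supported classifier (raster names are placeholders):</p>
+<div class="code"><pre>
+r.learn.ml group=predictors trainingmap=training_pixels output=classification \
+  classifier=LogisticRegression -b
+</pre></div>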
+
+<h2>EXAMPLE</h2>
+
+<p>Here we are going to use the GRASS GIS North Carolina sample data set as a basis to perform a Landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.</p>
+
+<p>Landsat 7 (2000) bands 7,4,2 color composite example:</p>
+<center>
+<img src="lsat7_2000_b742.png" alt="Landsat 7 (2000) bands 7,4,2 color composite example">
+</center>
+
+<p>Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.</p>
+
+<p>First, we are going to generate some training pixels from an older (1996) land cover classification:</p>
+<div class="code"><pre>
+g.region raster=landclass96 -p
+r.random input=landclass96 npoints=1000 raster=landclass96_roi
+</pre></div>
+
+<p>Then we can use these training pixels to perform a classification on the more recently obtained landsat 7 image:</p>
+<div class="code"><pre>
+r.learn.ml group=lsat7_2000 trainingmap=landclass96_roi output=rf_classification \
+  classifier=RandomForestClassifier n_estimators=500 max_features=-1 \
+  min_samples_split=2 random_state=1 lines=25
+
+# copy category labels from landclass training map to result
+r.category rf_classification raster=landclass96_roi
+
+# copy color scheme from landclass training map to result
+r.colors rf_classification raster=landclass96_roi
+
+# check the copied category labels
+r.category rf_classification
+</pre></div>
+
+<p>Random forest classification result:</p>
+<center>
+<img src="rfclassification.png" alt="Random forest classification result">
+</center>
+
+<h2>ACKNOWLEDGEMENTS</h2>
+
+<p>Thanks to Paulo van Breugel and Vaclav Petras for general testing, and to Paulo for the suggestion to enable saving of the fitted models.</p>
+
+<h2>AUTHOR</h2>
+
+Steven Pawley
+
+<p><i>Last changed: $Date: 2017-01-02 16:05:45 -0700 (Mon, 02 Jan 2017) $</i></p>
\ No newline at end of file

Added: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	                        (rev 0)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-04 22:00:23 UTC (rev 70250)
@@ -0,0 +1,1508 @@
+#!/usr/bin/env python
+############################################################################
+# MODULE:        r.learn.ml
+# AUTHOR:        Steven Pawley
+# PURPOSE:       Supervised classification and regression of GRASS rasters
+#                using the python scikit-learn package
+#
+# COPYRIGHT: (c) 2016 Steven Pawley, and the GRASS Development Team
+#                This program is free software under the GNU General Public
+#                License (>=v2). Read the file COPYING that comes with GRASS
+#                for details.
+#
+#############################################################################
+
+#%module
+#% description: Supervised classification and regression of GRASS rasters using the python scikit-learn package
+#% keyword: classification
+#% keyword: regression
+#% keyword: machine learning
+#% keyword: scikit-learn
+#%end
+
+#%option G_OPT_I_GROUP
+#% key: group
+#% label: Imagery group to be classified
+#% description: Series of raster maps to be used in the classification or regression
+#% required: yes
+#% multiple: no
+#%end
+
+#%option G_OPT_R_INPUT
+#% key: trainingmap
+#% label: Labelled pixels
+#% description: Raster map with labelled pixels for training
+#% required: no
+#% guisection: Required
+#%end
+
+#%option G_OPT_R_OUTPUT
+#% key: output
+#% required: yes
+#% label: Output Map
+#% description: Prediction surface result from classification or regression model
+#%end
+
+#%option string
+#% key: classifier
+#% required: yes
+#% label: Classifier
+#% description: Supervised learning model to use
+#% answer: RandomForestClassifier
+#% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,GaussianNB,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,GradientBoostingClassifier,GradientBoostingRegressor,SVC,EarthClassifier,EarthRegressor
+#%end
+
+#%option double
+#% key: c
+#% description: Inverse of regularization strength (logistic regression and SVC)
+#% answer: 1.0
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: max_features
+#% type: integer
+#% description: Number of features to consider during splitting for tree-based classifiers. Default -1 is sqrt(n_features) for classification, and n_features for regression
+#% answer: -1
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: max_depth
+#% type: integer
+#% description: Maximum tree depth for tree-based classifiers. Value of -1 uses classifier defaults
+#% answer: -1
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: min_samples_split
+#% type: integer
+#% description: The minimum number of samples required for node splitting in tree-based classifiers
+#% answer: 2
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: min_samples_leaf
+#% type: integer
+#% description: The minimum number of samples required to form a leaf node for tree-based classifiers
+#% answer: 1
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: n_estimators
+#% type: integer
+#% description: Number of estimators for tree-based classifiers
+#% answer: 500
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: learning_rate
+#% type: double
+#% description: learning rate for gradient boosting
+#% answer: 0.1
+#% guisection: Classifier Parameters
+#%end
+
+#%option
+#% key: subsample
+#% type: double
+#% description: The fraction of samples to be used for fitting for gradient boosting
+#% answer: 1.0
+#% guisection: Classifier Parameters
+#%end
+
+#%option integer
+#% key: max_degree
+#% description: The maximum degree of terms generated by the forward pass in Earth
+#% answer: 1
+#% guisection: Classifier Parameters
+#%end
+
+#%option double
+#% key: penalty
+#% description: A smoothing parameter used to calculate GCV and GRSQ in Earth
+#% answer: 3.0
+#% guisection: Classifier Parameters
+#%end
+
+#%option double
+#% key: minspan_alpha
+#% description: probability between 0 and 1 controlling the number of data points between knots in Earth
+#% answer: 0.05
+#% guisection: Classifier Parameters
+#%end
+
+# General options
+
+#%flag
+#% key: s
+#% label: Standardization preprocessing
+#% guisection: Optional
+#%end
+
+#%option string
+#% key: cvtype
+#% required: no
+#% label: Non-spatial or spatial cross-validation
+#% description: Non-spatial, clumped or clustered k-fold cross-validation
+#% answer: non-spatial
+#% options: non-spatial,clumped,kmeans
+#%end
+
+#%option G_OPT_R_INPUT
+#% key: group_raster
+#% label: Custom group ids for labelled pixels from GRASS raster
+#% description: GRASS raster containing group ids for labelled pixels
+#% required: no
+#% guisection: Optional
+#%end
+
+#%option
+#% key: cv
+#% type: integer
+#% description: Number of cross-validation folds to perform if cv > 1
+#% answer: 1
+#% guisection: Optional
+#%end
+
+#%option
+#% key: random_state
+#% type: integer
+#% description: Seed to pass onto the random state for reproducible results
+#% answer: 1
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_OUTPUT
+#% key: errors_file
+#% label: Save cross-validation global accuracy results to csv
+#% required: no
+#% guisection: Optional
+#%end
+
+#%option
+#% key: lines
+#% type: integer
+#% description: Processing block size in terms of number of rows
+#% answer: 25
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: p
+#% label: Output class membership probabilities
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: m
+#% description: Build model only - do not perform prediction
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: f
+#% description: Calculate feature importances using permutation
+#% guisection: Optional
+#%end
+
+#%option
+#% key: n_iter
+#% type: integer
+#% description: Number of randomized parameter tuning steps
+#% answer: 1
+#% guisection: Optional
+#%end
+
+#%option
+#% key: n_permutations
+#% type: integer
+#% description: Number of permutations to perform for feature importances
+#% answer: 10
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_OUTPUT
+#% key: fimp_file
+#% label: Save feature importances to csv
+#% required: no
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: b
+#% description: Balance number of observations by weighting for logistic regression, CART and RF methods
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_OUTPUT
+#% key: save_training
+#% label: Save training data to csv
+#% required: no
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_INPUT
+#% key: load_training
+#% label: Load training data from csv
+#% required: no
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_OUTPUT
+#% key: save_model
+#% label: Save model to file
+#% required: no
+#% guisection: Optional
+#%end
+
+#%option G_OPT_F_INPUT
+#% key: load_model
+#% label: Load model from file
+#% required: no
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: l
+#% label: Use memory swap
+#% guisection: Optional
+#%end
+
+#%rules
+#% exclusive: trainingmap,load_model
+#% exclusive: save_training,load_training
+#%end
+
+import atexit
+import copy
+import os
+import tempfile
+from subprocess import PIPE
+
+import numpy as np
+
+import grass.script as grass
+from grass.pygrass.modules.shortcuts import imagery as im
+from grass.pygrass.modules.shortcuts import raster as r
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.raster.buffer import Buffer
+from grass.pygrass.gis.region import Region
+
+
+class train():
+
+    def __init__(self, estimator, X, y, groups=None):
+        """
+        Train class to perform preprocessing, fitting, parameter search and
+        cross-validation in a single step
+
+        Args
+        ----
+        estimator: Scikit-learn compatible estimator object
+        X, y: training data and labels as numpy arrays
+        groups: groups to be used for cross-validation
+        """
+
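+        # a hypothetical usage sketch (names are illustrative only):
+        #   clf, params, mode = model_classifiers('RandomForestClassifier')
+        #   learn_m = train(clf, X, y)
+        #   learn_m.fit(param_distribution=params, n_iter=10, cv=3)
+        #   learn_m.predict(maplist, 'output_raster')
+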
+        self.estimator = estimator
+        self.X = X
+        self.y = y
+        self.groups = groups
+
+        self.scaler = None
+
+        self.scores = None
+        self.scores_cm = None
+        self.fimp = None
+
+
+    def fit(self, param_distribution=None, n_iter=3, scorers='multiclass',
+            cv=3, feature_importances=False, n_permutations=1,
+            random_state=None):
+
+        """
+        Main fit method for the train object. Performs fitting, hyperparameter
+        search and cross-validation in one step (inspired by R's caret package)
+
+        Args
+        ----
+        param_distribution: continuous parameter distribution to be used in a
+        randomized CV search
+        n_iter: Number of randomized search iterations
+        scorers: Suite of metrics to obtain
+        cv: Number of cross-validation folds
+        feature_importances: Boolean to perform permutation-based importances
+        during cross-validation
+        n_permutations: Number of random permutations during feature importance
+        random_state: seed to be used during random number generation
+        """
+
+        from sklearn.model_selection import RandomizedSearchCV
+
+        if param_distribution is not None and n_iter > 1:
+            self.estimator = RandomizedSearchCV(
+                estimator=self.estimator,
+                param_distributions=param_distribution,
+                n_iter=n_iter, cv=n_iter, random_state=random_state)
+
+        self.estimator.fit(self.X, self.y)
+
+        if cv > 1:
+            self.cross_val(
+                scorers, cv, feature_importances, n_permutations, random_state)
+
+
+    def standardization(self):
+        """
+        Transforms the train object's X data using standardization
+        """
+
+        from sklearn.preprocessing import StandardScaler
+
+        # keep the fitted scaler so that prediction data can be
+        # transformed in the same way
+        self.scaler = StandardScaler().fit(self.X)
+        self.X = self.scaler.transform(self.X)
+
+
+    def pred_func(self, estimator, X_test, y_true, scorers):
+        """
+        Calculates a single performance metric depending on if scorer type
+        is binary, multiclass or regression
+
+        To be called from the varImp_permutation
+
+        Args
+        ----
+        estimator: fitted estimator on training set
+        X_test: Predictor data from a test partition
+        y_true: Labels from a test partition
+        scorers: String indicating which type of scorer to be used
+        """
+
+        from sklearn import metrics
+
+        if scorers == 'binary':
+            scorer = metrics.roc_auc_score
+            y_pred = estimator.predict_proba(X_test)[:, 1]
+        if scorers == 'multiclass':
+            scorer = metrics.accuracy_score
+            y_pred = estimator.predict(X_test)
+        if scorers == 'regression':
+            scorer = metrics.r2_score
+            y_pred = estimator.predict(X_test)
+
+        score = scorer(y_true, y_pred)
+
+        return (score)
+
+
+    def varImp_permutation(self, estimator, X_test, y_true,
+                           n_permutations, scorers,
+                           random_state):
+
+        """
+        Method to perform permutation-based feature importance during
+        cross-validation (cross-validation is applied externally to this
+        method)
+
+        Procedure is:
+        1. Pass a fitted estimator and a test partition X, y
+        2. Assess the performance metric on the test partition (best_score)
+        3. Permute each variable and assess the difference between best_score
+           and the score obtained on the permuted variable
+        4. Repeat (3) for n_permutations random permutations
+        5. Average the repeats
+
+        Args
+        ----
+        estimator: estimator that has been fitted to a training partition
+        X_test, y_true: data and labels from a test partition
+        n_permutations: number of random permutations to apply
+        scorers: string indicating the type of scorer to be used
+        random_state: seed to pass to numpy random.seed
+
+        Returns
+        -------
+        scores: mean decrease in the performance metric for each predictor
+                following permutation
+        """
+
+        # calculate score on original variables without permutation
+        # determine best metric type for binary/multiclass/regression scenarios
+        best_score = self.pred_func(estimator, X_test, y_true, scorers)
+
+        np.random.seed(seed=random_state)
+        scores = np.zeros((X_test.shape[1], n_permutations))
+
+        # outer loop to repeat the permutation n_permutations times
+        for rep in range(n_permutations):
+
+            # inner loop to permute each predictor variable and assess the
+            # difference in the performance metric
+            for i in range(X_test.shape[1]):
+                Xscram = np.copy(X_test)
+                Xscram[:, i] = np.random.choice(X_test[:, i], X_test.shape[0])
+
+                # score the permuted variable and record the decrease from best_score
+                scores[i, rep] = best_score-self.pred_func(
+                    estimator, Xscram, y_true, scorers)
+                if scores[i, rep] < 0: scores[i, rep] = 0
+
+        # average the repetitions
+        scores = scores.mean(axis=1)
+
+        return(scores)
+
+
+    def specificity_score(self, y_true, y_pred):
+        """
+        Calculates specificity (the true negative rate) as tn / (tn + fp)
+        from a binary confusion matrix
+        """
+
+        from sklearn.metrics import confusion_matrix
+
+        cm = confusion_matrix(y_true, y_pred)
+
+        tn = float(cm[0][0])
+        fp = float(cm[0][1])
+
+        specificity = tn/(tn+fp)
+
+        return (specificity)
+
+
+    def cross_val(self, scorers, cv, feature_importances, n_permutations,
+                  random_state):
+
+        """
+        Stratified Kfold and GroupKFold cross-validation. Generates suites
+        of scoring metrics for binary classification, multiclass
+        classification and regression scenarios
+
+        Args
+        ----
+        scorers: Suite of performance metrics to use
+        cv: Integer of cross-validation folds
+        feature_importances: Boolean to perform permutation-based importances
+        n_permutations: Number of permutations during feature importance
+        random_state: Seed to pass to the random number generator
+        """
+
+        from sklearn.model_selection import StratifiedKFold
+        from sklearn.model_selection import GroupKFold
+        from sklearn import metrics
+
+        # dictionary of lists to store metrics
+        if scorers == 'accuracy':
+            self.scores = {
+                'accuracy': []
+            }
+
+        if scorers == 'binary':
+            self.scores = {
+                'accuracy': [],
+                'precision': [],
+                'recall': [],
+                'specificity': [],
+                'f1': [],
+                'kappa': [],
+                'auc': []
+            }
+
+        if scorers == 'multiclass':
+            self.scores = {
+                'accuracy': [],
+                'f1': [],
+                'kappa': []
+            }
+
+        if scorers == 'regression':
+            self.scores = {
+                'r2': []
+            }
+
+        y_test_agg = []
+        y_pred_agg = []
+        self.fimp = np.zeros((self.X.shape[1], cv))
+
+        # generate Kfold indices
+        if self.groups is None:
+            k_fold = StratifiedKFold(
+                n_splits=cv,
+                shuffle=False,
+                random_state=random_state).split(self.X, self.y)
+        else:
+            k_fold = GroupKFold(n_splits=cv).split(
+                self.X, self.y, groups=self.groups)
+
+        for train_indices, test_indices in k_fold:
+
+            X_train, X_test = self.X[train_indices], self.X[test_indices]
+            y_train, y_test = self.y[train_indices], self.y[test_indices]
+
+            # fit the model on the training data and predict the test data
+            fit = self.estimator.fit(X_train, y_train)
+            y_pred = fit.predict(X_test)
+
+            y_test_agg = np.append(y_test_agg, y_test)
+            y_pred_agg = np.append(y_pred_agg, y_pred)
+
+            labels = np.unique(y_pred)
+
+            # calculate metrics
+            if scorers == 'accuracy':
+                self.scores['accuracy'] = np.append(
+                    self.scores['accuracy'],
+                    metrics.accuracy_score(y_test, y_pred))
+
+            if scorers == 'binary':
+                self.scores['accuracy'] = np.append(
+                    self.scores['accuracy'],
+                    metrics.accuracy_score(y_test, y_pred))
+
+                if len(np.unique(self.y)) == 2 and \
+                    all([0, 1] == np.unique(self.y)):
+
+                    y_pred_proba = fit.predict_proba(X_test)[:, 1]
+                    self.scores['auc'] = np.append(
+                        self.scores['auc'],
+                        metrics.roc_auc_score(y_test, y_pred_proba))
+
+                self.scores['precision'] = np.append(
+                    self.scores['precision'], metrics.precision_score(
+                        y_test, y_pred, labels, average='binary'))
+
+                self.scores['recall'] = np.append(
+                    self.scores['recall'], metrics.recall_score(
+                        y_test, y_pred, labels, average='binary'))
+
+                self.scores['specificity'] = np.append(
+                    self.scores['specificity'], self.specificity_score(
+                        y_test, y_pred))
+
+                self.scores['f1'] = np.append(
+                    self.scores['f1'], metrics.f1_score(
+                        y_test, y_pred, labels, average='binary'))
+
+                self.scores['kappa'] = np.append(
+                    self.scores['kappa'],
+                    metrics.cohen_kappa_score(y_test, y_pred))
+
+            elif scorers == 'multiclass':
+
+                self.scores['accuracy'] = np.append(
+                    self.scores['accuracy'],
+                    metrics.accuracy_score(y_test, y_pred))
+
+                self.scores['kappa'] = np.append(
+                    self.scores['kappa'],
+                    metrics.cohen_kappa_score(y_test, y_pred))
+
+                self.scores['f1'] = np.append(
+                    self.scores['f1'], metrics.f1_score(
+                        y_test, y_pred, labels, average='weighted'))
+
+            elif scorers == 'regression':
+                self.scores['r2'] = np.append(
+                    self.scores['r2'], metrics.r2_score(y_test, y_pred))
+
+            # feature importances using permutation
+            if feature_importances:
+                if (self.fimp == 0).all():
+                    self.fimp = self.varImp_permutation(
+                        fit, X_test, y_test, n_permutations, scorers,
+                        random_state)
+                else:
+                    self.fimp = np.column_stack(
+                        (self.fimp, self.varImp_permutation(
+                            fit, X_test, y_test,
+                            n_permutations, scorers, random_state)))
+
+        if feature_importances:
+            self.fimp = self.fimp.mean(axis=1)
+
+        self.scores_cm = metrics.classification_report(y_test_agg, y_pred_agg)
+
+
+    def predict(self, predictors, output, class_probabilities=False,
+               rowincr=25):
+
+        """
+        Prediction on a list of GRASS rasters using a fitted scikit-learn model
+
+        Parameters
+        ----------
+        predictors: List of GRASS rasters
+        output: Name of GRASS raster to output classification results
+        class_probabilities: Predict class probabilities
+        rowincr: Integer of raster rows to process at one time
+        """
+
+        # determine output data type and nodata
+        predicted = self.estimator.predict(self.X)
+
+        if (predicted % 1 == 0).all():
+            ftype = 'CELL'
+            nodata = -2147483648
+        else:
+            ftype = 'FCELL'
+            nodata = np.nan
+
+        # create a list of rasterrow objects for predictors
+        n_features = len(predictors)
+        rasstack = [0] * n_features
+
+        for i in range(n_features):
+            rasstack[i] = RasterRow(predictors[i])
+            if rasstack[i].exist() is True:
+                rasstack[i].open('r')
+            else:
+                grass.fatal("GRASS raster " + predictors[i] +
+                            " does not exist.... exiting")
+
+        # use grass.pygrass.gis.region to get information about the current region
+        current = Region()
+
+        # create a imagery mask
+        # the input rasters might have different dimensions and non-value pixels.
+        # r.series used to automatically create a mask by propagating the nulls
+        grass.run_command("r.series", output='tmp_clfmask',
+                          input=predictors, method='count', flags='n',
+                          overwrite=True)
+
+        mask_raster = RasterRow('tmp_clfmask')
+        mask_raster.open('r')
+
+        # create and open RasterRow objects for classification
+        classification = RasterRow(output)
+        classification.open('w', ftype, overwrite=True)
+
+        # create and open RasterRow objects for probabilities if enabled
+        if class_probabilities is True:
+
+            # determine number of classes
+            labels = np.unique(self.y)
+            nclasses = len(labels)
+
+            prob_out_raster = [0] * nclasses
+            prob = [0] * nclasses
+
+            for iclass in range(nclasses):
+                prob_out_raster[iclass] = output + \
+                    '_classPr' + str(int(labels[iclass]))
+                prob[iclass] = RasterRow(prob_out_raster[iclass])
+                prob[iclass].open('w', 'FCELL', overwrite=True)
+
+        # prediction using row blocks
+
+        for rowblock in range(0, current.rows, rowincr):
+
+            # check that the row increment does not exceed the number of rows
+            if rowblock+rowincr > current.rows:
+                rowincr = current.rows - rowblock
+            img_np_row = np.zeros((rowincr, current.cols, n_features))
+            mask_np_row = np.zeros((rowincr, current.cols))
+
+            # loop through each row, and each band
+            # and add these values to the 2D array img_np_row
+            for row in range(rowblock, rowblock+rowincr, 1):
+                mask_np_row[row-rowblock, :] = np.array(mask_raster[row])
+
+                for band in range(n_features):
+                    img_np_row[row-rowblock, :, band] = \
+                        np.array(rasstack[band][row])
+
+            mask_np_row[mask_np_row == -2147483648] = np.nan
+            nanmask = np.isnan(mask_np_row)  # True in mask means invalid data
+
+            # reshape each row-band matrix into a n*m array
+            nsamples = rowincr * current.cols
+            flat_pixels = img_np_row.reshape((nsamples, n_features))
+
+            # remove NaN values
+            flat_pixels = np.nan_to_num(flat_pixels)
+
+            # rescale
+            if self.scaler is not None:
+                flat_pixels = self.scaler.transform(flat_pixels)
+
+            # perform prediction
+            result = self.estimator.predict(flat_pixels)
+            result = result.reshape((rowincr, current.cols))
+
+            # replace NaN values so that the prediction does not have a border
+            result = np.ma.masked_array(
+                result, mask=nanmask, fill_value=-99999)
+
+            # return a copy of result, with masked values filled with a value
+            result = result.filled([nodata])
+
+            # for each row we can perform computation, and write the result
+            for row in range(rowincr):
+                newrow = Buffer((result.shape[1],), mtype=ftype)
+                newrow[:] = result[row, :]
+                classification.put_row(newrow)
+
+            # same for probabilities
+            if class_probabilities is True:
+                result_proba = self.estimator.predict_proba(flat_pixels)
+
+                for iclass in range(result_proba.shape[1]):
+
+                    result_proba_class = result_proba[:, iclass]
+                    result_proba_class = result_proba_class.reshape(
+                                            (rowincr, current.cols))
+
+                    result_proba_class = np.ma.masked_array(
+                        result_proba_class, mask=nanmask, fill_value=np.nan)
+
+                    result_proba_class = result_proba_class.filled([np.nan])
+
+                    for row in range(rowincr):
+
+                        newrow = Buffer((
+                                    result_proba_class.shape[1],),
+                                    mtype='FCELL')
+
+                        newrow[:] = result_proba_class[row, :]
+                        prob[iclass].put_row(newrow)
+
+        # close all maps
+        for i in range(n_features):
+            rasstack[i].close()
+
+        classification.close()
+        mask_raster.close()
+
+        if class_probabilities is True:
+            for iclass in range(nclasses):
+                prob[iclass].close()
+
+
+def cleanup():
+    grass.run_command("g.remove", name='tmp_clfmask',
+                      flags="f", type="raster", quiet=True)
+    grass.run_command("g.remove", name='tmp_roi_clumped',
+                      flags="f", type="raster", quiet=True)
+
+
+def model_classifiers(estimator='LogisticRegression', random_state=None,
+                      class_weight=None, C=1, max_depth=None,
+                      max_features='auto', min_samples_split=2,
+                      min_samples_leaf=1, n_estimators=100, subsample=1.0,
+                      learning_rate=0.1, max_degree=1, penalty=3.0,
+                      minspan_alpha=0.05):
+
+    """
+    Provides the classifiers and parameters used by the module
+
+    Args
+    ----
+    estimator: Name of estimator
+    random_state: Seed to use in randomized components
+    class_weight: Option to balance classes using weighting
+    C: Inverse of regularization strength
+    max_depth: Maximum depth for tree-based methods
+    max_features: Number of features available during node splitting
+    min_samples_split: Minimum number of samples to split a node
+    min_samples_leaf: Minimum number of samples to form a leaf
+    n_estimators: Number of trees
+    subsample: Controls randomization in gradient boosting
+    learning_rate: Used in gradient boosting
+    max_degree: For earth
+    penalty: For earth
+    minspan_alpha: For earth
+
+    Returns
+    -------
+    clf: Scikit-learn classifier object
+    params: Parameters to use for object
+    mode: Flag to indicate whether classifier performs classification or
+          regression
+    """
+
+    from scipy.stats import randint, uniform
+
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
+    from sklearn.naive_bayes import GaussianNB
+    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
+    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
+    from sklearn.ensemble import GradientBoostingClassifier
+    from sklearn.ensemble import GradientBoostingRegressor
+    from sklearn.svm import SVC
+
+    # optional packages that add additional classifiers here
+    if estimator == 'EarthClassifier' or estimator == 'EarthRegressor':
+        try:
+            from sklearn.pipeline import Pipeline
+            from pyearth import Earth
+
+            # Combine Earth with LogisticRegression in a pipeline to do classification
+            earth_classifier = Pipeline([('Earth',
+                Earth()), ('Logistic', LogisticRegression())])
+
+            classifiers = {'EarthClassifier': earth_classifier,
+                           'EarthRegressor': Earth(max_degree=max_degree,
+                                                   penalty=penalty,
+                                                   minspan_alpha=minspan_alpha)}
+        except ImportError:
+            grass.fatal('Py-earth package not installed')
+    else:
+        # core sklearn classifiers go here
+        classifiers = {
+            'SVC': SVC(C=C, probability=True, random_state=random_state),
+            'LogisticRegression':
+                LogisticRegression(C=C, class_weight=class_weight,
+                                   random_state=random_state, n_jobs=-1),
+            'DecisionTreeClassifier':
+                DecisionTreeClassifier(max_depth=max_depth,
+                                      max_features=max_features,
+                                      min_samples_split=min_samples_split,
+                                      min_samples_leaf=min_samples_leaf,
+                                      random_state=random_state,
+                                      class_weight=class_weight),
+            'DecisionTreeRegressor':
+                DecisionTreeRegressor(max_features=max_features,
+                                      min_samples_split=min_samples_split,
+                                      min_samples_leaf=min_samples_leaf,
+                                      random_state=random_state),
+            'RandomForestClassifier':
+                RandomForestClassifier(n_estimators=n_estimators,
+                                       class_weight=class_weight,
+                                       max_features=max_features,
+                                       min_samples_split=min_samples_split,
+                                       random_state=random_state,
+                                       n_jobs=-1,
+                                       oob_score=False),
+            'RandomForestRegressor':
+                RandomForestRegressor(n_estimators=n_estimators,
+                                      max_features=max_features,
+                                      min_samples_split=min_samples_split,
+                                      random_state=random_state,
+                                      n_jobs=-1,
+                                      oob_score=False),
+            'GradientBoostingClassifier':
+                GradientBoostingClassifier(learning_rate=learning_rate,
+                                           n_estimators=n_estimators,
+                                           max_depth=max_depth,
+                                           min_samples_split=min_samples_split,
+                                           min_samples_leaf=min_samples_leaf,
+                                           subsample=subsample,
+                                           max_features=max_features,
+                                           random_state=random_state),
+            'GradientBoostingRegressor':
+                GradientBoostingRegressor(learning_rate=learning_rate,
+                                          n_estimators=n_estimators,
+                                          max_depth=max_depth,
+                                          min_samples_split=min_samples_split,
+                                          min_samples_leaf=min_samples_leaf,
+                                          subsample=subsample,
+                                          max_features=max_features,
+                                          random_state=random_state),
+            'GaussianNB': GaussianNB(),
+            'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
+            'QuadraticDiscriminantAnalysis': QuadraticDiscriminantAnalysis(),
+        }
+
+    LogisticRegressionOpts = {'C': randint(1, 1000)}
+    DecisionTreeOpts = {'max_depth': randint(2, 20),
+                        'min_samples_split': uniform(0, 0.2)}
+    RandomForestOpts = {'max_features': uniform()}
+    GradientBoostingOpts = {'learning_rate': uniform(0.001, 0.1),
+                            'max_depth': randint(3, 10),
+                            'max_features': uniform(),
+                            'n_estimators': randint(50, 500),
+                            'min_samples_split': uniform(0, 0.2),
+                            'min_samples_leaf': uniform(0, 0.2),
+                            'subsample': uniform()}
+    SVCOpts = {'C': randint(1, 100), 'shrinking': [True, False]}
+    EarthOpts = {'max_degree': randint(1,10),
+                 'penalty': uniform(0.5, 5),
+                 'minspan_alpha': uniform(0.05, 1)}
+
+    param_grids = {
+        'SVC': SVCOpts,
+        'LogisticRegression': LogisticRegressionOpts,
+        'DecisionTreeClassifier': DecisionTreeOpts,
+        'DecisionTreeRegressor': DecisionTreeOpts,
+        'RandomForestClassifier': RandomForestOpts,
+        'RandomForestRegressor': RandomForestOpts,
+        'GradientBoostingClassifier': GradientBoostingOpts,
+        'GradientBoostingRegressor': GradientBoostingOpts,
+        'GaussianNB': {},
+        'LinearDiscriminantAnalysis': {},
+        'QuadraticDiscriminantAnalysis': {},
+        'EarthClassifier': EarthOpts,
+        'EarthRegressor': EarthOpts
+    }
+
+    # define classifier
+    clf = classifiers[estimator]
+    params = param_grids[estimator]
+
+    # classification or regression
+    if estimator in ('LogisticRegression', 'DecisionTreeClassifier',
+                     'RandomForestClassifier', 'GradientBoostingClassifier',
+                     'GaussianNB', 'LinearDiscriminantAnalysis',
+                     'QuadraticDiscriminantAnalysis', 'EarthClassifier',
+                     'SVC'):
+        mode = 'classification'
+    else:
+        mode = 'regression'
+
+    return (clf, params, mode)
+
+
+def save_training_data(X, y, groups, file):
+
+    """
+    Saves any extracted training data to a csv file
+
+    Args
+    ----
+    X: Numpy array containing predictor values
+    y: Numpy array containing labels
+    groups: Numpy array of group labels
+    file: Path to a csv file to save data to
+    """
+
+    # if there are no group labels, create a nan filled array
+    if groups is None:
+        groups = np.empty((y.shape[0]))
+        groups[:] = np.nan
+
+    training_data = np.column_stack([X, y, groups])
+    np.savetxt(file, training_data, delimiter=',')
+
+
+def load_training_data(file):
+
+    """
+    Loads training data and labels from a csv file
+
+    Args
+    ----
+    file: Path to a csv file to load data from
+
+    Returns
+    -------
+    X: Numpy array containing predictor values
+    y: Numpy array containing labels
+    groups: Numpy array of group labels, or None
+    """
+
+    training_data = np.loadtxt(file, delimiter=',')
+    n_features = training_data.shape[1]-1
+
+    # check to see if last column contains group labels or nans
+    groups = training_data[:, -1]
+    training_data = training_data[:, 0:n_features]
+
+    if np.isnan(groups).all():
+        # if all nans then ignore last column
+        groups = None
+
+    # fetch X and y
+    X = training_data[:, 0:n_features-1]
+    y = training_data[:, -1]
+
+    return(X, y, groups)
+
+
+def sample_predictors(response, predictors, shuffle_data=True, lowmem=False,
+                      random_state=None):
+
+    """
+    Samples a list of GRASS rasters using a labelled raster
+    Per raster sampling
+
+    Args
+    ----
+    response: String; GRASS raster with labelled pixels
+    predictors: List of GRASS rasters containing explanatory variables
+    shuffle_data: Boolean to shuffle the extracted data
+    lowmem: Boolean to use numpy memmap during extraction
+    random_state: Seed used for shuffling
+
+    Returns
+    -------
+    X: Numpy array of extracted raster values
+    y: Numpy array of labels
+    y_indexes: Rows and columns of the label positions
+    """
+
+    from sklearn.utils import shuffle
+
+    current = Region()
+
+    # open response raster as rasterrow and read as np array
+    if RasterRow(response).exist() is True:
+        roi_gr = RasterRow(response)
+        roi_gr.open('r')
+
+        if lowmem is False:
+            response_np = np.array(roi_gr)
+        else:
+            response_np = np.memmap(tempfile.NamedTemporaryFile(),
+                                    dtype='float32', mode='w+',
+                                    shape=(current.rows, current.cols))
+            response_np[:] = np.array(roi_gr)[:]
+    else:
+        grass.fatal("GRASS response raster does not exist.... exiting")
+
+    # determine number of predictor rasters
+    n_features = len(predictors)
+
+    # check to see if all predictors exist
+    for i in range(n_features):
+        if RasterRow(predictors[i]).exist() is not True:
+            grass.fatal("GRASS raster " + predictors[i] +
+                        " does not exist.... exiting")
+
+    # check if any of those pixels are labelled (not equal to nodata)
+    # can use even if roi is FCELL because nodata will be nan
+    is_train = np.nonzero(response_np > -2147483648)
+    training_labels = response_np[is_train]
+    n_labels = np.array(is_train).shape[1]
+
+    # Create a zero numpy array of len training labels
+    if lowmem is False:
+        training_data = np.zeros((n_labels, n_features))
+    else:
+        training_data = np.memmap(tempfile.NamedTemporaryFile(),
+                                  dtype='float32', mode='w+',
+                                  shape=(n_labels, n_features))
+
+    # Loop through each raster and sample pixel values at training indexes
+    if lowmem is True:
+        feature_np = np.memmap(tempfile.NamedTemporaryFile(),
+                               dtype='float32', mode='w+',
+                               shape=(current.rows, current.cols))
+
+    for f in range(n_features):
+        predictor_gr = RasterRow(predictors[f])
+        predictor_gr.open('r')
+
+        if lowmem is False:
+            feature_np = np.array(predictor_gr)
+        else:
+            feature_np[:] = np.array(predictor_gr)[:]
+
+        training_data[0:n_labels, f] = feature_np[is_train]
+
+        # close each predictor map
+        predictor_gr.close()
+
+    # convert the nodata values of any CELL maps to NaN in the training data
+    for i in range(n_features):
+        training_data[training_data[:, i] == -2147483648] = np.nan
+
+    # convert indexes of training pixels from tuple to n*2 np array
+    is_train = np.array(is_train).T
+
+    # Remove nan rows from training data
+    X = training_data[~np.isnan(training_data).any(axis=1)]
+    y = training_labels[~np.isnan(training_data).any(axis=1)]
+    y_indexes = is_train[~np.isnan(training_data).any(axis=1)]
+
+    # shuffle the training data
+    if shuffle_data is True:
+        X, y, y_indexes = shuffle(X, y, y_indexes, random_state=random_state)
+
+    # close the response map
+    roi_gr.close()
+
+    return(X, y, y_indexes)
+
+
+def sample_training_data(response, maplist, group_raster='', n_partitions=3,
+                         cvtype='', lowmem=False, random_state=None):
+
+    """
+    Samples predictor and optional group id rasters for cross-validation
+
+    Args
+    ----
+    response: String; GRASS raster with labelled pixels
+    maplist: List of GRASS rasters containing explanatory variables
+    group_raster: GRASS raster containing group ids of labelled pixels
+    n_partitions: Number of spatial partitions
+    cvtype: Type of spatial clustering
+    lowmem: Boolean to use numpy memmap during extraction
+    random_state: Seed
+
+    Returns
+    -------
+    X: Numpy array of extracted raster values
+    y: Numpy array of labels
+    group_id: Group ids of labels
+    """
+
+    from sklearn.cluster import KMeans
+
+    # clump the labelled pixel raster if labels represent polygons
+    # then set the group_raster to the clumped raster to extract the group_ids
+    # used in the GroupKFold cross-validation
+    # ------------------------------------------------------------------------
+    if cvtype == 'clumped' and group_raster == '':
+        r.clump(input=response, output='tmp_roi_clumped',
+                overwrite=True, quiet=True)
+        group_raster = 'tmp_roi_clumped'
+
+    # extract training data from maplist and take group ids from
+    # group_raster. Shuffle=False so that group ids and labels align
+    # because cross-validation will be performed spatially
+    # ---------------------------------------------------------------
+    if group_raster != '':
+        maplist2 = copy.deepcopy(maplist)
+        maplist2.append(group_raster)
+        X, y, sample_coords = sample_predictors(response=response,
+                                                predictors=maplist2,
+                                                shuffle_data=False,
+                                                lowmem=lowmem,
+                                                random_state=random_state)
+        # take group id from last column and remove column from predictors
+        group_id = X[:, -1]
+        X = np.delete(X, -1, axis=1)
+
+        # remove the temporary clumped raster if one was created
+        try:
+            grass.run_command(
+                "g.remove", name='tmp_roi_clumped', flags="f",
+                type="raster", quiet=True)
+        except:
+            pass
+
+    # extract training data from maplist without group ids
+    # shuffle this data by default
+    # ----------------------------------------------------
+    else:
+        X, y, sample_coords = sample_predictors(
+            response=response, predictors=maplist,
+            shuffle_data=True,
+            lowmem=lowmem,
+            random_state=random_state)
+
+        group_id = None
+
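+        # for 'kmeans' spatial cross-validation, cluster the sampled pixel
+        # coordinates into n_partitions groups; the cluster labels then act
+        # as group ids so that cross-validation folds are spatially separated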
+        if cvtype == 'kmeans':
+            clusters = KMeans(n_clusters=n_partitions,
+                              random_state=random_state,
+                              n_jobs=-1)
+
+            clusters.fit(sample_coords)
+            group_id = clusters.labels_
+
+    return (X, y, group_id)
+
+
+def maps_from_group(group):
+    """
+    Parse individual rasters into a list from an imagery group
+
+    Args
+    ----
+    group: String; GRASS imagery group
+
+    Returns
+    -------
+    maplist: Python list containing individual GRASS raster maps
+    """
+    groupmaps = im.group(group=group, flags="g",
+                         quiet=True, stdout_=PIPE).outputs.stdout
+
+    # drop the trailing empty string left by the final line ending
+    maplist = groupmaps.split(os.linesep)[:-1]
+    map_names = [rastername.split('@')[0] for rastername in maplist]
+
+    return(maplist, map_names)
+
+
+def main():
+
+    try:
+        from sklearn.externals import joblib
+    except ImportError:
+        grass.fatal("Scikit-learn 0.18 or newer is required but not installed")
+
+    """
+    GRASS options and flags
+    -----------------------
+    """
+
+    # General options and flags
+    group = options['group']
+    response = options['trainingmap']
+    output = options['output']
+    classifier = options['classifier']
+    norm_data = flags['s']
+    cv = int(options['cv'])
+    group_raster = options['group_raster']
+    cvtype = options['cvtype']
+    modelonly = flags['m']
+    probability = flags['p']
+    rowincr = int(options['lines'])
+    random_state = int(options['random_state'])
+    model_save = options['save_model']
+    model_load = options['load_model']
+    load_training = options['load_training']
+    save_training = options['save_training']
+    importances = flags['f']
+    n_iter = int(options['n_iter'])
+    n_permutations = int(options['n_permutations'])
+    lowmem = flags['l']
+    errors_file = options['errors_file']
+    fimp_file = options['fimp_file']
+
+    if flags['b'] is True:
+        class_weight = 'balanced'
+    else:
+        class_weight = None
+
+    # classifier options
+    max_degree = int(options['max_degree'])
+    penalty = float(options['penalty'])
+    minspan_alpha = float(options['minspan_alpha'])
+    C = float(options['c'])
+    min_samples_split = int(options['min_samples_split'])
+    min_samples_leaf = int(options['min_samples_leaf'])
+    n_estimators = int(options['n_estimators'])
+    learning_rate = float(options['learning_rate'])
+    subsample = float(options['subsample'])
+    max_depth = int(options['max_depth'])
+    max_features = int(options['max_features'])
+
+    """
+    Error checking of options and flags
+    -----------------------------------
+    """
+
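+    # -1 acts as a sentinel in the GRASS interface, standing in for
+    # 'auto' max_features and an unrestricted max_depth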
+    if max_features == -1:
+        max_features = 'auto'
+    if max_depth == -1:
+        max_depth = None
+
+    if n_iter > 1:
+        if classifier in ('LinearDiscriminantAnalysis',
+                          'QuadraticDiscriminantAnalysis', 'GaussianNB'):
+            grass.warning('No parameters to tune for the selected model; '
+                          'skipping parameter search')
+            n_iter = 1
+    
+    if importances is True and cv == 1:
+        grass.fatal('Feature importances require cross-validation cv > 1')
+        
+    """
+    Obtain information about GRASS rasters to be classified
+    -------------------------------------------------------
+    """
+
+    # fetch individual raster names from group
+    maplist, map_names = maps_from_group(group)
+    n_features = len(maplist)
+
+    # cap max_features at the number of available predictors
+    if max_features > n_features:
+        max_features = n_features
+
+    """
+    Train the classifier
+    --------------------
+    """
+
+    # Sample training data and group ids
+    # Perform parameter tuning and cross-validation
+    # Unless a previously fitted model is to be loaded
+    # ------------------------------------------------
+    if model_load == '':
+
+        # Sample training data and group id
+        if load_training != '':
+            X, y, group_id = load_training_data(load_training)
+        else:
+            X, y, group_id = sample_training_data(
+                response, maplist, group_raster, cv, cvtype,
+                lowmem, random_state)
+
+        # option to save extracted data to .csv file
+        if save_training != '':
+            save_training_data(X, y, group_id, save_training)
+
+        # retrieve sklearn classifier object and parameters
+        grass.message("Classifier = " + classifier)
+
+        clf, param_grid, mode = \
+            model_classifiers(classifier, random_state,
+                              class_weight, C, max_depth,
+                              max_features, min_samples_split,
+                              min_samples_leaf, n_estimators,
+                              subsample, learning_rate, max_degree, penalty,
+                              minspan_alpha)
+
+        # Decide on scoring metric scheme
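+        # a response coded strictly as 0 and 1 is scored as binary, which
+        # additionally enables threshold-based measures such as AUC,
+        # precision, recall and specificity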
+        if mode == 'classification':
+            if np.array_equal(np.unique(y), [0, 1]):
+                scorers = 'binary'
+            else:
+                scorers = 'multiclass'
+        else:
+            scorers = 'regression'
+
+        if mode == 'regression' and probability is True:
+            grass.warning(
+                'Class probabilities are only valid for classification; '
+                'ignoring the probability flag')
+            probability = False
+
+        # create training object
+        learn_m = train(clf, X, y, group_id)
+
+        # preprocessing
+        if norm_data is True:
+            learn_m.standardization()
+
+        """
+        Fitting, parameter search and cross-validation
+        ----------------
+        """
+
+        # fit, search and cross-validate the training object
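+        # n_permutations controls the permutation-based feature importance
+        # estimates produced when feature_importances is requested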
+        learn_m.fit(param_grid, n_iter, scorers, cv,
+                    feature_importances=importances,
+                    n_permutations=n_permutations,
+                    random_state=random_state)
+
+        if n_iter > 1:
+            grass.message('\n')
+            grass.message('Best parameters:')
+            grass.message(str(learn_m.estimator.best_params_))
+
+        # If cv > 1 then use cross-validation to generate performance measures
+        if cv > 1:
+            grass.message('\r\n')
+            grass.message(
+                "Cross validation global performance measures......:")
+
+            if mode == 'classification':
+                if scorers == 'binary':
+                    grass.message(
+                        "Accuracy   :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['accuracy'].mean(),
+                         learn_m.scores['accuracy'].std()))
+                    grass.message(
+                        "AUC        :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['auc'].mean(),
+                         learn_m.scores['auc'].std()))
+                    grass.message(
+                        "Kappa      :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['kappa'].mean(),
+                         learn_m.scores['kappa'].std()))
+                    grass.message(
+                        "Precision  :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['precision'].mean(),
+                         learn_m.scores['precision'].std()))
+                    grass.message(
+                        "Recall     :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['recall'].mean(),
+                         learn_m.scores['recall'].std()))
+                    grass.message(
+                        "Specificity:\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['specificity'].mean(),
+                         learn_m.scores['specificity'].std()))
+                    grass.message(
+                        "F1         :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['f1'].mean(),
+                         learn_m.scores['f1'].std()))
+
+                elif scorers == 'multiclass':
+                    grass.message(
+                        "Accuracy:\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['accuracy'].mean(),
+                         learn_m.scores['accuracy'].std()))
+                    grass.message(
+                        "F1      :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['f1'].mean(),
+                         learn_m.scores['f1'].std()))
+                    grass.message(
+                        "Kappa   :\t%0.2f\t+/-SD\t%0.2f" %
+                        (learn_m.scores['kappa'].mean(),
+                         learn_m.scores['kappa'].std()))
+
+                # classification report
+                grass.message("\n")
+                grass.message("Classification report:")
+                grass.message(learn_m.scores_cm)
+
+            else:
+                grass.message("R2:\t%0.2f\t+/-\t%0.2f" %
+                              (learn_m.scores['r2'].mean(),
+                               learn_m.scores['r2'].std()))
+
+            # write cross-validation results to a csv file
+            if errors_file != '':
+                try:
+                    import pandas as pd
+                    errors = pd.DataFrame(learn_m.scores)
+                    errors.to_csv(errors_file, mode='w')
+                except ImportError:
+                    grass.warning("Pandas is required to write the "
+                                  "cross-validation results to file")
+
+            # feature importances
+            if importances is True:
+                grass.message("\r\n")
+                grass.message("Feature importances")
+                grass.message("id" + "\t" + "Raster" + "\t" + "Importance")
+
+                for i in range(len(learn_m.fimp)):
+                    grass.message(
+                        str(i) + "\t" + maplist[i] +
+                        "\t" + str(round(learn_m.fimp[i], 4)))
+
+                if fimp_file != '':
+                    try:
+                        import pandas as pd
+                        fimp_output = pd.DataFrame(
+                            {'grass raster': maplist,
+                             'importance': learn_m.fimp})
+                        fimp_output.to_csv(
+                            path_or_buf=fimp_file,
+                            header=['grass raster', 'importance'])
+                    except ImportError:
+                        grass.warning("Pandas is required to write the "
+                                      "feature importances to file")
+    else:
+        # load a previously fitted train object
+        # -------------------------------------
+        learn_m = joblib.load(model_load)
+
+    """
+    Optionally save the fitted model
+    ---------------------
+    """
+
+    if model_save != '':
+        joblib.dump(learn_m, model_save)
+
+    if modelonly is True:
+        grass.fatal("Model built and now exiting")
+
+    """
+    Prediction on the rest of the GRASS rasters in the imagery group
+    ----------------------------------------------------------------
+    """
+
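+    # the 'lines' option (rowincr) limits how many raster rows are read per
+    # prediction pass, bounding memory use for large regions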
+    learn_m.predict(maplist, output, probability, rowincr)
+
+if __name__ == "__main__":
+    options, flags = grass.parser()
+    atexit.register(cleanup)
+    main()

Added: grass-addons/grass7/raster/r.learn.ml/rfclassification.png
===================================================================
(Binary files differ)


Property changes on: grass-addons/grass7/raster/r.learn.ml/rfclassification.png
___________________________________________________________________
Added: svn:mime-type
   + application/octet-stream


