[GRASS-SVN] r71841 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org svn_grass at osgeo.org
Sat Nov 25 20:36:03 PST 2017


Author: spawley
Date: 2017-11-25 20:36:03 -0800 (Sat, 25 Nov 2017)
New Revision: 71841

Added:
   grass-addons/grass7/raster/r.learn.ml/rlearn_prediction.py
   grass-addons/grass7/raster/r.learn.ml/rlearn_sampling.py
Modified:
   grass-addons/grass7/raster/r.learn.ml/Makefile
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
   grass-addons/grass7/raster/r.learn.ml/rlearn_crossval.py
   grass-addons/grass7/raster/r.learn.ml/rlearn_utils.py
Log:
Bug fix for cross-validation. Better handling of categorical rasters. Removal of XGBoost classifier option

Modified: grass-addons/grass7/raster/r.learn.ml/Makefile
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/Makefile	2017-11-26 03:23:50 UTC (rev 71840)
+++ grass-addons/grass7/raster/r.learn.ml/Makefile	2017-11-26 04:36:03 UTC (rev 71841)
@@ -2,7 +2,7 @@
 
 PGM = r.learn.ml
 
-ETCFILES = rlearn_utils rlearn_crossval rlearn_rasters
+ETCFILES = rlearn_utils rlearn_crossval rlearn_prediction rlearn_sampling
 
 include $(MODULE_TOPDIR)/include/Make/Script.make
 include $(MODULE_TOPDIR)/include/Make/Python.make

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-11-26 03:23:50 UTC (rev 71840)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-11-26 04:36:03 UTC (rev 71841)
@@ -1,110 +1,66 @@
 <h2>DESCRIPTION</h2>
-
-<p><em>r.learn.ml</em> represents a front-end to the scikit learn python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using several commonly used classifiers in remote sensing and spatial modelling. The choice of classifier is set using the <em>classifier</em> parameter. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit learn documentation.</a> The following classification and regression methods are available:</p>
-
+<p><em>r.learn.ml</em> represents a front-end to the scikit-learn Python package. The module enables scikit-learn classification and regression models to be applied to GRASS GIS rasters that are stored as part of an imagery group (the <em>group</em> parameter) or specified as individual maps in the optional <em>raster</em> parameter. Several commonly used classifiers and regressors are exposed in the GUI, and the choice of classifier is set using the <em>classifier</em> parameter. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>. The following classification and regression methods are available:</p>
 <ul>
 	<li><em>LogisticRegression</em> represents a linear model for classification and is a modification of linear regression, but using the logistic function, which enables the use of a categorical response variable. If the <em>response</em> raster is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
-	
 	<li><em>LinearDiscriminantAnalysis</em> and <em>QuadraticDiscriminantAnalysis</em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification.</li>
-	
-	<li><em>KNeighborsClassifier</em> is a simple classification method based on closest distance to a predefined number of training samples. Two hyperparameters are exposed, with <em>n_neighbors</em> governing the number of neighbors to use to decide the prediction label, and <em>weights</em> specifying whether these neighbors should have equal weights or whether they should be inversely weighted by their distance.</li>
-	
+	<li><em>KNeighborsClassifier</em> classifies samples based on closest distance to a predefined number of training samples. Two hyperparameters are exposed, with <em>n_neighbors</em> governing the number of neighbors to use to decide the prediction label, and <em>weights</em> specifying whether these neighbors should have equal weights or whether they should be inversely weighted by their distance.</li>
 	<li><em>GaussianNB</em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes' theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters.</li>
-	
 	<li>The <em>DecisionTreeClassifier</em> and <em>DecisionTreeRegressor</em> map observations to a response variable using a hierarchy of splits and branches. The terminal nodes of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric, can model non-linear relationships between the response and predictor variables, and are insensitive to the scaling of the predictors.</li>
-	
 	<li>The <em>RandomForestClassifier</em> and <em>RandomForestRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data, and only a random subset of the predictors, based on <em>max_features</em>, is made available during each node split. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees. The <em>ExtraTreesClassifier</em> is a variant on random forests in which, during each node split, the splitting rule is selected as the best of a collection of randomly generated thresholds assigned to the features.</li>
-	
-	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning processes is additive in a forward step-wise fashion whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, the <em>XGBClassifier</em> and <em>XGBRegressor</em> models represent an accelerated version of gradient boosting which can optionally be installed from the XGBoost python package.</li>
-	
+	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, Microsoft's <em>LGBMClassifier</em> and <em>LGBMRegressor</em> models represent an accelerated version of gradient boosting, which can optionally be installed with 'pip install lightgbm'.</li>
 	<li>The <em>SVC</em> model is a C-Support Vector Classifier. Only a linear kernel is supported because, for the large sample sizes typical of remote sensing and spatial analysis datasets, non-linear kernels in scikit-learn are too slow to be practical. This classifier can still be slow for large datasets that include &gt; 10000 training samples.</li>
-	
 	<li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are Python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which optionally can be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression, which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
 </ul>
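+<p>For orientation, the following is a minimal scikit-learn sketch (using synthetic data rather than GRASS rasters) showing how one of these estimators is fitted and used to predict; the class and argument names come from scikit-learn itself:</p>
+<div class="code"><pre>
+import numpy as np
+from sklearn.ensemble import RandomForestClassifier
+
+# synthetic training data: 100 samples, 5 predictors, binary response
+X = np.random.rand(100, 5)
+y = np.random.randint(0, 2, 100)
+
+clf = RandomForestClassifier(n_estimators=500, random_state=1)
+clf.fit(X, y)
+pred = clf.predict(X)           # class labels
+proba = clf.predict_proba(X)    # per-class probabilities
+</pre></div>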
-
 <p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuned using a grid search by inputting multiple parameter settings as a comma-separated list. This tuning can also be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to &gt; 1 (see the sketch after this list). The parameters consist of:</p>
-
 <ul>
 	<li><em>C</em> is the inverse of the regularization strength; regularization applies a penalty to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models.</li>
-	
 	<li><em>n_estimators</em> represents the number of trees in the random forest model, and the number of trees used in each model step during gradient boosting. For random forests, a larger number of trees will never adversely affect accuracy, although this comes at the expense of computational performance. In contrast, gradient boosting will start to overfit if <em>n_estimators</em> is too high, which will reduce model accuracy.</li>
-	
 	<li><em>max_features</em> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods.</li>
-	
 	<li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively.</li>
-	
-	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting and XGBClassifier or XGBRegressor. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> always improves accuracy in gradient boosting but will require a much larger <em>n_estimators</em> setting which will lower computational performance.</li>
-	
+	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> always improves accuracy in gradient boosting but will require a much larger <em>n_estimators</em> setting, which will lower computational performance.</li>
 	<li>The main control on accuracy in the Earth classifier consists of <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass. Settings of <em>max_degree</em> = 1 or 2 offer a good trade-off between accuracy and computational performance.</li>
 </ul>
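+<p>As a sketch of what multiple comma-separated parameter settings amount to internally (the search itself is scikit-learn's GridSearchCV; the candidate values here are illustrative):</p>
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.model_selection import GridSearchCV
+
+# comma-separated module inputs become lists of candidate values
+param_grid = {'n_estimators': [100, 500], 'max_features': [2, 4]}
+
+search = GridSearchCV(RandomForestClassifier(random_state=1),
+                      param_grid=param_grid, cv=3)
+search.fit(X, y)  # X, y as in the previous sketch
+print(search.best_params_)
+</pre></div>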
-
 <p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed is based on Brenning (2012) and consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each k partition. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
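+<p>A simplified, self-contained sketch of the permutation step on a single test partition (the metric here is illustrative; the module chooses AUC, accuracy or R2 as described above):</p>
+<div class="code"><pre>
+import numpy as np
+from sklearn.metrics import accuracy_score
+
+def permutation_importances(fitted_clf, X_test, y_test, seed=1):
+    rng = np.random.RandomState(seed)
+    best = accuracy_score(y_test, fitted_clf.predict(X_test))
+    scores = np.zeros(X_test.shape[1])
+    for i in range(X_test.shape[1]):
+        Xp = X_test.copy()
+        rng.shuffle(Xp[:, i])  # permute a single predictor in-place
+        scores[i] = best - accuracy_score(y_test, fitted_clf.predict(Xp))
+    return scores  # decrease in performance per predictor
+</pre></div>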
-
 <p>Cross validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, and on whether the classifier performs regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, in which case cross-validation will effectively be performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by k-means clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient, then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
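+<p>Conceptually, the 'kmeans' option clusters the training pixel coordinates and then treats the clusters as cross-validation groups, roughly as follows (a sketch; the module's actual implementation lives in rlearn_crossval.py):</p>
+<div class="code"><pre>
+from sklearn.cluster import KMeans
+from sklearn.model_selection import GroupKFold
+
+# sample_coords: 2d array of the x, y coordinates of the training pixels
+group_id = KMeans(n_clusters=10, random_state=1).fit_predict(sample_coords)
+
+for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=group_id):
+    pass  # fit on X[train_idx], score on X[test_idx]
+</pre></div>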
-
 <p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the <em>-b</em> flag balances the training data by weighting the minority classes relative to the majority class. This does not apply to the Naive Bayes or LinearDiscriminantAnalysis classifiers.</p>
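+<p>Internally, the <em>-s</em> flag corresponds to wrapping the estimator in a scikit-learn pipeline with a StandardScaler step, as in this sketch:</p>
+<div class="code"><pre>
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
+
+clf = Pipeline([('scaling', StandardScaler()),
+                ('classifier', LogisticRegression(C=1.0))])
+clf.fit(X, y)  # predictors are standardized before fitting
+</pre></div>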
-
 <p>Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (i.e. decision trees), but generally, categorical predictors need to be converted to a suite of binary predictors using one-hot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the names (comma-separated) of the categorical rasters in the <em>categorymaps</em> option will cause one-hot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the one-hot encoded variables. Note: it is important that the training samples include all of the categories in the rasters, otherwise the one-hot encoding will fail when it comes to the prediction.</p>
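+<p>As an illustration of the one-hot encoding step (the module uses scikit-learn's OneHotEncoder; this sketch assumes the pre-0.20 API with the categorical_features argument and integer-coded categories):</p>
+<div class="code"><pre>
+import numpy as np
+from sklearn.preprocessing import OneHotEncoder
+
+# suppose column 0 of X holds a categorical raster coded as 0, 1, 2
+X = np.array([[0, 1.5], [1, 2.0], [2, 0.5], [1, 1.0]])
+
+enc = OneHotEncoder(categorical_features=[0], sparse=False)
+X_encoded = enc.fit_transform(X)  # column 0 expands to three binary columns
+print(X_encoded.shape)            # (4, 4)
+</pre></div>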
-
 <p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
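+<p>Persistence uses joblib; the saved file bundles the training data with the estimator so that the model can be refitted when loaded, along the lines of this simplified sketch:</p>
+<div class="code"><pre>
+from sklearn.externals import joblib
+
+# save: the training data are stored alongside the estimator
+joblib.dump((X, y, sample_coords, group_id, clf), 'model.gz')
+
+# load in a later session, refit, then predict on another imagery group
+X, y, sample_coords, group_id, clf = joblib.load('model.gz')
+clf.fit(X, y)
+</pre></div>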
-
 <p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <em>save_training</em> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
-
 <h2>NOTES</h2>
-
-<p><em>r.learn.ml</em> uses the "scikit-learn" machine learning python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the linux package manager. For MS-Windows users using a 64 bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32 bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
-
-<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>lines=25</em> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
-
+<p><em>r.learn.ml</em> uses the "scikit-learn" machine learning Python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users using a 64 bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32 bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
+<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This, however, does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>rowincr</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>rowincr=25</em> should be reasonable for most systems with 4-8 GB of RAM. The row-by-row access, however, results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read each predictor into memory one at a time. If this still exceeds the system memory, then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
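+<p>A sketch of the chunked row access pattern with PyGRASS (the raster name is a placeholder):</p>
+<div class="code"><pre>
+import numpy as np
+from grass.pygrass.raster import RasterRow
+
+rast = RasterRow('lsat7_2000_10')
+rast.open('r')
+rows = rast.info.rows
+
+rowincr = 25
+for start in range(0, rows, rowincr):
+    stop = min(start + rowincr, rows)
+    block = np.asarray([rast[i] for i in range(start, stop)])
+    # block (rowincr x ncols) is reshaped and passed to the estimator
+rast.close()
+</pre></div>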
 <p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>random_state</em> parameter.</p>
-
 <h2>EXAMPLE</h2>
-
 <p>Here we are going to use the GRASS GIS sample North Carolina data set as a basis to perform a Landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.</p>
-
 <p>Landsat 7 (2000) bands 7,4,2 color composite example:</p>
 <center>
 <img src="lsat7_2000_b742.png" alt="Landsat 7 (2000) bands 7,4,2 color composite example">
 </center>
-
 <p>Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.</p>
-
 <p>First, we are going to generate some training pixels from an older (1996) land cover classification:</p>
 <div class="code"><pre>
 g.region raster=landclass96 -p
 r.random input=landclass96 npoints=1000 raster=landclass96_roi
 </pre></div>
-
 <p>Then we can use these training pixels to perform a classification on the more recently obtained Landsat 7 image:</p>
 <div class="code"><pre>
 r.learn.ml group=lsat7_2000 trainingmap=landclass96_roi output=rf_classification \
-  classifier=RandomForestClassifier n_estimators=500 randst=1 lines=25
-
+  classifier=RandomForestClassifier n_estimators=500
 # copy category labels from landclass training map to result
 r.category rf_classification raster=landclass96_roi
-
 # copy color scheme from landclass training map to result
 r.colors rf_classification raster=landclass96_roi
 r.category rf_classification
 </pre></div>
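+<p>The same classification can also be scripted from Python through GRASS's scripting interface, optionally adding cross-validation (a sketch; cv=3 requests three-fold cross-validation):</p>
+<div class="code"><pre>
+import grass.script as gs
+
+gs.run_command('r.learn.ml', group='lsat7_2000',
+               trainingmap='landclass96_roi',
+               output='rf_classification',
+               classifier='RandomForestClassifier',
+               n_estimators=500, cv=3)
+</pre></div>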
-
 <p>Random forest classification result:</p>
 <center>
 <img src="rfclassification.png" alt="Random forest classification result">
 </center>
-
 <h2>ACKNOWLEDGEMENTS</h2>
-
-<p>Thanks for Paulo van Breugel and Vaclav Petras for general testing, and Paulo for the suggestion to enable saving of the fitted models.</p>
-
+<p>Thanks to Paulo van Breugel and Vaclav Petras for testing.</p>
 <h2>REFERENCES</h2>
-
 <p>Brenning, A. 2012. Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: the R package 'sperrorest'. 2012 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 23-27 July 2012, p. 5372-5375.</p>
-
 <p>Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.</p>
-
 <h2>AUTHOR</h2>
-
 Steven Pawley
-
 <p><em>Last changed: $Date$</em></p>
\ No newline at end of file

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-11-26 03:23:50 UTC (rev 71840)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-11-26 04:36:03 UTC (rev 71841)
@@ -14,7 +14,7 @@
 #############################################################################
 # July, 2017. Jaan Janno, Mait Lang. Bugfixes concerning crossvalidation failure
 # when class numeric IDs were not continuously increasing by +1.
-# Bugfix for processing index list of nominal layers.  
+# Bugfix for processing index list of nominal layers.
 
 #%module
 #% description: Supervised classification and regression of GRASS rasters using the python scikit-learn package
@@ -27,8 +27,8 @@
 
 #%option G_OPT_I_GROUP
 #% key: group
-#% label: Imagery group to be classified
-#% description: GRASS imagery group of raster maps to be used in the machine learning model
+#% label: Group of raster layers to be classified
+#% description: GRASS imagery group of raster maps representing feature variables to be used in the machine learning model
 #% required: yes
 #% multiple: no
 #%end
@@ -43,8 +43,8 @@
 
 #%option G_OPT_V_INPUT
 #% key: trainingpoints
-#% label: Training point vector
-#% description: Vector points map for training
+#% label: Vector file with training samples
+#% description: Vector points map where each point is used as a training sample. Handling of missing values in the training data can be chosen later.
 #% required: no
 #% guisection: Required
 #%end
@@ -52,7 +52,7 @@
 #%option G_OPT_DB_COLUMN
 #% key: field
 #% label: Response attribute column
-#% description: Name of attribute column in trainingpoints containing response value
+#% description: Name of attribute column in trainingpoints table containing response values
 #% required: no
 #% guisection: Required
 #%end
@@ -60,7 +60,7 @@
 #%option G_OPT_R_OUTPUT
 #% key: output
 #% label: Output Map
-#% description: Prediction surface result from classification or regression model
+#% description: Raster layer name to store the result from the classification or regression model. The name will also be used as a prefix if class probabilities or intermediate cross-validation results are output as maps.
 #% guisection: Required
 #% required: no
 #%end
@@ -70,7 +70,7 @@
 #% label: Classifier
 #% description: Supervised learning model to use
 #% answer: RandomForestClassifier
-#% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,KNeighborsClassifier,GaussianNB,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,ExtraTreesClassifier,ExtraTreesRegressor,GradientBoostingClassifier,GradientBoostingRegressor,SVC,EarthClassifier,EarthRegressor,XGBClassifier,XGBRegressor
+#% options: LogisticRegression,LinearDiscriminantAnalysis,QuadraticDiscriminantAnalysis,KNeighborsClassifier,GaussianNB,DecisionTreeClassifier,DecisionTreeRegressor,RandomForestClassifier,RandomForestRegressor,ExtraTreesClassifier,ExtraTreesRegressor,GradientBoostingClassifier,GradientBoostingRegressor,SVC,EarthClassifier,EarthRegressor
 #% guisection: Classifier settings
 #% required: no
 #%end
@@ -88,8 +88,8 @@
 #%option
 #% key: max_features
 #% type: integer
-#% label: Number of features avaiable during node splitting
-#% description: Number of features avaiable during node splitting (tree-based classifiers and regressors)
+#% label: Number of features available during node splitting; zero uses classifier defaults
+#% description: Number of features available during node splitting (tree-based classifiers and regressors)
 #% answer:0
 #% multiple: yes
 #% guisection: Classifier settings
@@ -99,7 +99,7 @@
 #% key: max_depth
 #% type: integer
 #% label: Maximum tree depth; zero uses classifier defaults
-#% description: Maximum tree depth for tree-based method; zero uses classifier defaults (full-growing for Decision trees and Randomforest, 3 for GBM and XGB)
+#% description: Maximum tree depth for tree-based method; zero uses classifier defaults (full-growing for Decision trees and Randomforest, 3 for GBM)
 #% answer:0
 #% multiple: yes
 #% guisection: Classifier settings
@@ -178,7 +178,7 @@
 #%option string
 #% key: weights
 #% label: weight function
-#% description: weight function for knn prediction
+#% description: Distance weight function for k-nearest neighbours model prediction
 #% answer: uniform
 #% options: uniform,distance
 #% multiple: yes
@@ -195,11 +195,12 @@
 #% guisection: Classifier settings
 #%end
 
-#%option integer
+#%option G_OPT_R_INPUT
 #% key: categorymaps
+#% required: no
 #% multiple: yes
-#% label: Indices of categorical rasters within the imagery group (0..n)
-#% description: Indices of categorical rasters within the imagery group (0..n) that will be one-hot encoded
+#% label: Names of categorical rasters within the imagery group
+#% description: Names of categorical rasters within the imagery group that will be one-hot encoded. Leave empty if none.
 #% guisection: Optional
 #%end
 
@@ -215,8 +216,8 @@
 #%option
 #% key: n_partitions
 #% type: integer
-#% label: Number of kmeans spatial partitions
-#% description: Number of kmeans spatial partitions for kmeans clustered cross-validation
+#% label: Number of k-means spatial partitions
+#% description: Number of k-means spatial partitions for k-means clustered cross-validation
 #% answer: 10
 #% guisection: Cross validation
 #%end
@@ -309,20 +310,20 @@
 #%end
 
 #%option
-#% key: lines
+#% key: indexes
 #% type: integer
-#% description: Processing block size in terms of number of rows
-#% answer: 25
+#% description: Indexes of class probabilities to predict. Default -1 predicts all classes
+#% answer: -1
 #% guisection: Optional
+#% multiple: yes
 #%end
 
 #%option
-#% key: indexes
+#% key: rowincr
 #% type: integer
-#% description: Indexes of class probabilities to predict. Default -1 predicts all classes
-#% answer: -1
+#% description: Maximum number of raster rows to read/write in a single chunk while performing prediction
+#% answer: 25
 #% guisection: Optional
-#% multiple: yes
 #%end
 
 #%option
@@ -336,18 +337,14 @@
 #%flag
 #% key: s
 #% label: Standardization preprocessing
+#% description: Standardize feature variables (convert values to have zero mean and unit variance).
 #% guisection: Optional
 #%end
 
 #%flag
-#% key: i
-#% label: Impute training data preprocessing
-#% guisection: Optional
-#%end
-
-#%flag
 #% key: p
 #% label: Output class membership probabilities
+#% description: A raster layer is created for each class. It is recommended to give a list of the particular classes of interest to avoid consuming large amounts of disk space.
 #% guisection: Optional
 #%end
 
@@ -391,7 +388,7 @@
 
 #%option G_OPT_F_OUTPUT
 #% key: save_model
-#% label: Save model from file
+#% label: Save model to file
 #% required: no
 #% guisection: Optional
 #%end
@@ -416,14 +413,14 @@
 import os
 from copy import deepcopy
 import numpy as np
-
 from grass.pygrass.utils import set_path
-import grass.script as gscript
+import grass.script as gs
 from grass.pygrass.modules.shortcuts import raster as r
 
 set_path('r.learn.ml')
 from rlearn_crossval import cross_val_scores
-from rlearn_rasters import predict, extract, extract_points
+from rlearn_sampling import extract_pixels, extract_points
+from rlearn_prediction import predict
 from rlearn_utils import (
     model_classifiers, save_training_data, load_training_data, maps_from_group)
 
@@ -431,7 +428,7 @@
 
 def cleanup():
     for rast in tmp_rast:
-        gscript.run_command(
+        gs.run_command(
             "g.remove", name=rast, type='raster', flags='f', quiet=True)
 
 def warn(*args, **kwargs):
@@ -444,7 +441,7 @@
     try:
         from sklearn.externals import joblib
         from sklearn.cluster import KMeans
-        from sklearn.preprocessing import StandardScaler, Imputer
+        from sklearn.preprocessing import StandardScaler
         from sklearn.model_selection import (
             GridSearchCV, GroupShuffleSplit, ShuffleSplit,
             StratifiedKFold, GroupKFold)
@@ -454,12 +451,12 @@
         from sklearn import metrics
         from sklearn.metrics import make_scorer
     except:
-        gscript.fatal("Scikit learn 0.18 or newer is not installed")
+        gs.fatal("Scikit learn 0.18 or newer is not installed")
 
     try:
         import pandas as pd
     except:
-        gscript.fatal("Pandas is not installed ")
+        gs.fatal("Pandas is not installed ")
 
     # required gui section
     group = options['group']
@@ -506,42 +503,35 @@
     model_only = flags['m']
     probability = flags['p']
     prob_only = flags['z']
-    rowincr = int(options['lines'])
     random_state = int(options['random_state'])
     model_save = options['save_model']
     model_load = options['load_model']
     load_training = options['load_training']
     save_training = options['save_training']
     indexes = options['indexes']
+    rowincr = int(options['rowincr'])
     n_jobs = int(options['n_jobs'])
     lowmem = flags['l']
-    impute = flags['i']
     balance = flags['b']
 
-    # convert to lists
-    """
-    if ',' in categorymaps:
-        categorymaps = [int(i) for i in categorymaps.split(',')]
-	print(categorymaps)
-    else: categorymaps = None
-    """
-    
+    # fetch individual raster names from group
+    maplist, mapnames = maps_from_group(group)
+
+    # extract indices of category maps
     if categorymaps.strip() == '':
         categorymaps = None
     else:
-        try:
-            categorymaps = [int(i.strip()) for i in categorymaps.split(',')]
-            # check for negative indices and the largest index against maplist, _ = maps_from_group(group); check for duplicates (unique)
-            nCategories = len(maps_from_group(group)[0])
-            if min(categorymaps) < 0:
-                gscript.fatal('Category map index can not be negative.')
-            if max(categorymaps) > nCategories - 1:
-                gscript.fatal('Category map index input can not exceed ' + str(nCategories - 1))
-            if not len(np.unique(categorymaps)) == len(categorymaps):
-                gscript.fatal('Duplicate indices in category map index list.')            
-        except:
-            gscript.fatal('Error in category map list input.')
-    
+        if isinstance(categorymaps, str):
+            categorymaps = [categorymaps]
+        cat_indexes = []
+        for cat in categorymaps:
+            try:
+                cat_indexes.append(maplist.index(cat))
+            except:
+                gs.fatal('Category map {0} not in the imagery group'.format(cat))
+        categorymaps = cat_indexes
+
+    # convert class probability indexes to list
     if ',' in indexes:
         indexes = [int(i) for i in indexes.split(',')]
     else:
@@ -555,16 +545,26 @@
 
     # feature importances selected by no cross-validation scheme used
     if importances is True and cv == 1:
-        gscript.fatal('Feature importances require cross-validation cv > 1')
+        gs.fatal('Feature importances require cross-validation cv > 1')
 
     # output map has not been entered and model_only is not set to True
-    if output == '' and model_only is True:
-        gscript.fatal('No output map specified')
+    if output == '' and model_only is not True:
+        gs.fatal('No output map specified')
 
-    # perform prediction only for class probabilities but probability flag is not set to True
+    # perform prediction only for class probabilities but probability flag
+    # is not set to True
     if prob_only is True:
         probability = True
 
+    # check for field attribute if trainingpoints are used
+    if trainingpoints != '' and field == '':
+        gs.fatal('No attribute column specified for training points')
+
+    # check that valid combination of training data input is present
+    if trainingpoints == '' and trainingmap == '' and load_training == '' \
+    and model_load == '':
+        gs.fatal('No training vector, raster or tabular data is present')
+
     # make dicts for hyperparameters, datatypes and parameters for tuning
     hyperparams_type = dict.fromkeys(hyperparams, int)
     hyperparams_type['C'] = float
@@ -574,8 +574,6 @@
     param_grid = deepcopy(hyperparams_type)
     param_grid = dict.fromkeys(param_grid, None)
 
-   
-
     for key, val in hyperparams.iteritems():
         # split any comma separated strings and add them to the param_grid
         if ',' in val:
@@ -588,7 +586,6 @@
     if hyperparams['max_depth'] == 0: hyperparams['max_depth'] = None
     if hyperparams['max_features'] == 0: hyperparams['max_features'] = 'auto'
     param_grid = {k: v for k, v in param_grid.iteritems() if v is not None}
- 
 
     # retrieve sklearn classifier object and parameters
     clf, mode = model_classifiers(
@@ -600,33 +597,32 @@
         key: value for key, value in param_grid.iteritems()
         if key in clf_params}
 
-
     # scoring metrics
     if mode == 'classification':
         scoring = ['accuracy', 'precision', 'recall', 'f1', 'kappa',\
                    'balanced_accuracy']
         search_scorer = make_scorer(metrics.matthews_corrcoef)
     else:
-        scoring = ['r2', 'neg_mean_squared_error']
+        scoring = ['r2', 'explained_variance', 'neg_mean_absolute_error',
+                   'neg_mean_squared_error', 'neg_mean_squared_log_error',
+                   'neg_median_absolute_error']
         search_scorer = 'r2'
 
     # -------------------------------------------------------------------------
     # Extract training data
     # -------------------------------------------------------------------------
 
-    # fetch individual raster names from group
-    maplist, _ = maps_from_group(group)
-    
     if model_load == '':
 
         # Sample training data and group id
         if load_training != '':
             X, y, group_id, sample_coords = load_training_data(load_training)
         else:
-            gscript.message('Extracting training data')
+            gs.message('Extracting training data')
 
             # generate spatial clump/patch partitions
-            # clump the labelled pixel raster and set the group_raster to the clumped raster
+            # clump the labelled pixel raster and set the group_raster
+            # to the clumped raster
             if trainingmap != '' and cvtype == 'clumped' and group_raster == '':
                 clumped_trainingmap = 'tmp_clumped_trainingmap'
                 tmp_rast.append(clumped_trainingmap)
@@ -634,7 +630,7 @@
                         overwrite=True, quiet=True)
                 group_raster = clumped_trainingmap
             elif trainingmap == '' and cvtype == 'clumped':
-                gscript.fatal('Cross-validation using clumped training areas ',
+                gs.fatal('Cross-validation using clumped training areas '
                               'requires raster-based training areas')
 
             # append spatial clumps or group raster to the predictors
@@ -646,13 +642,16 @@
 
             # extract training data
             if trainingmap != '':
-                X, y, sample_coords = extract(
-                    response=trainingmap, predictors=maplist2, lowmem=lowmem)
+                X, y, sample_coords = extract_pixels(
+                    response=trainingmap, predictors=maplist2, lowmem=lowmem, na_rm=True)
             elif trainingpoints != '':
                 X, y, sample_coords = extract_points(
-                    trainingpoints, maplist2, field)
+                    trainingpoints, maplist2, field, na_rm=True)
             group_id = None
 
+            if len(y) < 1 or X.shape[0] < 1:
+                gs.fatal('There are too few training features to perform classification')
+
             # take group id from last column and remove from predictors
             if group_raster != '':
                 group_id = X[:, -1]
@@ -667,37 +666,29 @@
 
             # check for labelled pixels and training data
             if y.shape[0] == 0 or X.shape[0] == 0:
-                gscript.fatal('No training pixels or pixels in imagery group '
+                gs.fatal('No training pixels or pixels in imagery group '
                               '...check computational region')
 
-            # impute or remove NaNs
-            if impute is False:
-                y = y[~np.isnan(X).any(axis=1)]
-                sample_coords = sample_coords[~np.isnan(X).any(axis=1)]
-                if group_id is not None:
-                    group_id = group_id[~np.isnan(X).any(axis=1)]
-                X = X[~np.isnan(X).any(axis=1)]
-            else:
-                missing = Imputer(strategy='median')
-                X = missing.fit_transform(X)
-
             # shuffle data
             if group_id is None:
-                X, y, sample_coords = shuffle(X, y, sample_coords, random_state=random_state)
+                X, y, sample_coords = shuffle(
+                    X, y, sample_coords, random_state=random_state)
             if group_id is not None:
                 X, y, sample_coords, group_id = shuffle(
                     X, y, sample_coords, group_id, random_state=random_state)
 
             # optionally save extracted data to .csv file
             if save_training != '':
-                save_training_data(X, y, group_id, sample_coords, save_training)
+                save_training_data(
+                    X, y, group_id, sample_coords, save_training)
 
         # ---------------------------------------------------------------------
         # define the inner search resampling method
         # ---------------------------------------------------------------------
 
         if any(param_grid) is True and cv == 1 and grid_search == 'cross-validation':
-            gscript.fatal('Hyperparameter search using cross validation requires cv > 1')
+            gs.fatal(
+                'Hyperparameter search using cross validation requires cv > 1')
 
         # define inner resampling using cross-validation method
         elif any(param_grid) is True and grid_search == 'cross-validation':
@@ -709,12 +700,13 @@
         # define inner resampling using the holdout method
         elif any(param_grid) is True and grid_search == 'holdout':
             if group_id is None:
-                inner = ShuffleSplit(n_splits=1, test_size=0.33, random_state=random_state)
+                inner = ShuffleSplit(
+                    n_splits=1, test_size=0.33, random_state=random_state)
             else:
-                inner = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=random_state)
+                inner = GroupShuffleSplit(
+                    n_splits=1, test_size=0.33, random_state=random_state)
         else:
             inner = None
-     
 
         # ---------------------------------------------------------------------
         # define the outer search resampling method
@@ -729,9 +721,9 @@
         # define sample weights for gradient boosting classifiers
         # ---------------------------------------------------------------------
 
-        # sample weights for GradientBoosting or XGBClassifier
+        # classifiers that take sample_weights
         if balance is True and mode == 'classification' and classifier in (
-                'GradientBoostingClassifier', 'XGBClassifier'):
+                'GradientBoostingClassifier', 'GaussianNB'):
             from sklearn.utils import compute_class_weight
             class_weights = compute_class_weight(
                 class_weight='balanced', classes=(y), y=y)
@@ -743,26 +735,21 @@
         # ---------------------------------------------------------------------
         # standardization
         if norm_data is True and categorymaps is None:
-	    gscript.message('norm_data is True and categorymaps is None:')
             clf = Pipeline([('scaling', StandardScaler()),
                             ('classifier', clf)])
 
         # onehot encoding
         if categorymaps is not None and norm_data is False:
-	    gscript.message('categorymaps is not None and norm_data is False:')		
             enc = OneHotEncoder(categorical_features=categorymaps)
-	    # print(enc.n_values)
             enc.fit(X)
             clf = Pipeline([('onehot', OneHotEncoder(
                 categorical_features=categorymaps,
                 n_values=enc.n_values_, handle_unknown='ignore',
-	        # handle_unknown='ignore',
                 sparse=False)),  # dense because not all clf can use sparse
                             ('classifier', clf)])
 
         # standardization and onehot encoding
         if norm_data is True and categorymaps is not None:
-	    gscript.message('norm_data is True and categorymaps is not None:')
             enc = OneHotEncoder(categorical_features=categorymaps)
             enc.fit(X)
             clf = Pipeline([('onehot', OneHotEncoder(
@@ -771,7 +758,7 @@
                 sparse=False)),
                             ('scaling', StandardScaler()),
                             ('classifier', clf)])
-        # print(clf)
+
         # ---------------------------------------------------------------------
         # create the hyperparameter grid search method
         # ---------------------------------------------------------------------
@@ -785,10 +772,6 @@
                     newkey = 'classifier__' + key
                     param_grid[newkey] = param_grid.pop(key)
 
-            # print(param_grid)
-            # print(inner)
-            # gscript.fatal('Põmm!')
-
             # create grid search method
             clf = GridSearchCV(
                 estimator=clf, param_grid=param_grid, scoring=search_scorer,
@@ -797,28 +780,29 @@
         # ---------------------------------------------------------------------
         # classifier training
         # ---------------------------------------------------------------------
-        gscript.message(os.linesep)
-        gscript.message(('Fitting model using ' + classifier))
 
-        # pass groups to fit parameter GroupKFold/GroupShuffleSplit and param_grid are present
-        if isinstance(inner, (GroupKFold, GroupShuffleSplit)):
-            if balance is True and classifier in (
-                    'GradientBoostingClassifier', 'XGBClassifier'):
-                clf.fit(X=X, y=y, groups=group_id, sample_weight=class_weights)
+        gs.message(os.linesep)
+        gs.message(('Fitting model using ' + classifier))
+
+        # fitting ensuring that all options are passed
+        if classifier in ('GradientBoostingClassifier', 'GaussianNB') and balance is True:
+            if isinstance(clf, Pipeline):
+                fit_params = {'classifier__sample_weight': class_weights}
             else:
-                clf.fit(X=X, y=y, groups=group_id)
+                fit_params = {'sample_weight': class_weights}
         else:
-            if balance is True and classifier in (
-                    'GradientBoostingClassifier', 'XGBClassifier'):
-                clf.fit(X=X, y=y, sample_weight=class_weights)
-            else:
-                clf.fit(X, y)
+            fit_params = {}
 
+        if isinstance(inner, (GroupKFold, GroupShuffleSplit)):
+            clf.fit(X, y, groups=group_id, **fit_params)
+        else:
+            clf.fit(X, y, **fit_params)
+
         # message best hyperparameter setup and optionally save using pandas
         if any(param_grid) is True:
-            gscript.message(os.linesep)
-            gscript.message('Best parameters:')
-            gscript.message(str(clf.best_params_))
+            gs.message(os.linesep)
+            gs.message('Best parameters:')
+            gs.message(str(clf.best_params_))
             if param_file != '':
                 param_df = pd.DataFrame(clf.cv_results_)
                 param_df.to_csv(param_file)
@@ -828,24 +812,18 @@
         # ---------------------------------------------------------------------
         # cross-validation
         # ---------------------------------------------------------------------
+
         # If cv > 1 then use cross-validation to generate performance measures
         if cv > 1 and tune_only is not True:
             if mode == 'classification' and cv > np.histogram(
-		    # This was the original version and caused problems when there were gaps larger than +2 in the class codes
-                    # y, bins=len(np.unique(y)))[0].min():
 		    y, bins=np.unique(y))[0].min():
-		# print(np.unique(y))
-		# print(len(np.unique(y)))
-		# print(np.histogram(y, bins=len(np.unique(y)))[0].min())
-		print(np.histogram(y, bins=len(np.unique(y))))
-		print(np.histogram(y, bins=np.unique(y)))
-                gscript.message(os.linesep)
-                gscript.message('Number of cv folds is greater than number of '
-                                'samples in some classes. Cross-validation is being'
-                                ' skipped')
+                gs.message(os.linesep)
+                gs.message('Number of cv folds is greater than number of '
+                            'samples in some classes. Cross-validation is being'
+                            ' skipped')
             else:
-                gscript.message(os.linesep)
-                gscript.message(
+                gs.message(os.linesep)
+                gs.message(
                     "Cross validation global performance measures......:")
 
                 # add auc and mcc as scorer if classification is binary
@@ -859,27 +837,38 @@
                     clf, X, y, group_id, class_weights, outer, scoring,
                     importances, n_permutations, random_state, n_jobs)
 
+                # from sklearn.model_selection import cross_validate
+                # scores = cross_validate(clf, X, y, group_id, scoring, outer, n_jobs, fit_params=fit_params)
+                # test_scoring = ['test_' + i for i in scoring]
+                # gs.message(os.linesep)
+                # gs.message(('Metric \t Mean \t Error'))
+                # for sc in test_scoring:
+                #     gs.message(sc + '\t' + str(scores[sc].mean()) + '\t' + str(scores[sc].std()))
+
                 preds = np.hstack((preds, sample_coords))
 
                 for method, val in scores.iteritems():
-                    gscript.message(
+                    gs.message(
                         method+":\t%0.3f\t+/-SD\t%0.3f" %
                         (val.mean(), val.std()))
 
                 # individual class scores
                 if mode == 'classification' and len(np.unique(y)) != 2:
-                    gscript.message(os.linesep)
-                    gscript.message('Cross validation class performance measures......:')
-                    gscript.message('Class \t' + '\t'.join(map(str, np.unique(y))))
+                    gs.message(os.linesep)
+                    gs.message(
+                        'Cross validation class performance measures......:')
+                    gs.message('Class \t' + '\t'.join(map(str, np.unique(y))))
 
                     for method, val in cscores.iteritems():
                         mat_cscores = np.matrix(val)
-                        gscript.message(
+                        gs.message(
                             method+':\t' + '\t'.join(
-                                map(str, np.round(mat_cscores.mean(axis=0), 2)[0])))
-                        gscript.message(
+                                map(str, np.round(
+                                        mat_cscores.mean(axis=0), 2)[0])))
+                        gs.message(
                             method+' std:\t' + '\t'.join(
-                                map(str, np.round(mat_cscores.std(axis=0), 2)[0])))
+                                map(str, np.round(
+                                        mat_cscores.std(axis=0), 2)[0])))
 
                 # write cross-validation results for csv file
                 if errors_file != '':
@@ -892,18 +881,19 @@
                     preds.columns = ['y_true', 'y_pred', 'fold', 'x', 'y']
                     preds.to_csv(preds_file, mode='w')
                     text_file = open(preds_file + 't', "w")
-                    text_file.write('"Integer","Real","Real","integer","Real","Real"')
+                    text_file.write(
+                        '"Integer","Real","Real","integer","Real","Real"')
                     text_file.close()
 
                 # feature importances
                 if importances is True:
-                    gscript.message(os.linesep)
-                    gscript.message("Feature importances")
-                    gscript.message("id" + "\t" + "Raster" + "\t" + "Importance")
+                    gs.message(os.linesep)
+                    gs.message("Feature importances")
+                    gs.message("id" + "\t" + "Raster" + "\t" + "Importance")
 
                     # mean of cross-validation feature importances
                     for i in range(len(fimp.mean(axis=0))):
-                        gscript.message(
+                        gs.message(
                             str(i) + "\t" + maplist[i] +
                             "\t" + str(round(fimp.mean(axis=0)[i], 4)))
 
@@ -914,48 +904,55 @@
         # load a previously fitted train object
         if model_load != '':
             # load a previously fitted model
-            clf = joblib.load(model_load)
+            X, y, sample_coords, group_id, clf = joblib.load(model_load)
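+            # the saved file bundles the training data; refit so the estimator is ready to predict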
+            clf.fit(X, y)
 
     # Optionally save the fitted model
     if model_save != '':
-        joblib.dump(clf, model_save)
+        joblib.dump((X, y, sample_coords, group_id, clf), model_save)
 
     # -------------------------------------------------------------------------
     # prediction on grass imagery group
     # -------------------------------------------------------------------------
 
     if model_only is not True:
-        gscript.message(os.linesep)
+        gs.message(os.linesep)
 
         # predict classification/regression raster
         if prob_only is False:
-            gscript.message('Predicting classification/regression raster...')
+            gs.message('Predicting classification/regression raster...')
             predict(estimator=clf, predictors=maplist, output=output,
-                    predict_type='raw', rowincr=rowincr, n_jobs=n_jobs)
+                    predict_type='raw', overwrite=gs.overwrite(),
+                    rowincr=rowincr, n_jobs=n_jobs)
 
             if predict_resamples is True:
                 for i in range(cv):
                     resample_name = output + '_Resample' + str(i)
                     predict(estimator=models[i], predictors=maplist,
                             output=resample_name, predict_type='raw',
+                            overwrite=gs.overwrite(),
                             rowincr=rowincr, n_jobs=n_jobs)
 
         # predict class probabilities
         if probability is True:
-            gscript.message('Predicting class probabilities...')
-            predict(estimator=clf, predictors=maplist, output=output, predict_type='prob',
-                    index=indexes, rowincr=rowincr, n_jobs=n_jobs)
+            gs.message('Predicting class probabilities...')
+            predict(estimator=clf, predictors=maplist, output=output,
+                    predict_type='prob', index=indexes,
+                    class_labels=np.unique(y), overwrite=gs.overwrite(),
+                    rowincr=rowincr, n_jobs=n_jobs)
 
             if predict_resamples is True:
                 for i in range(cv):
                     resample_name = output + '_Resample' + str(i)
                     predict(estimator=models[i], predictors=maplist,
                             output=resample_name, predict_type='prob',
-                            index=indexes, rowincr=rowincr, n_jobs=n_jobs)
+                            class_labels=np.unique(y), index=indexes,
+                            overwrite=gs.overwrite(),
+                            rowincr=rowincr, n_jobs=n_jobs)
     else:
-        gscript.message("Model built and now exiting")
+        gs.message("Model built and now exiting")
 
 if __name__ == "__main__":
-    options, flags = gscript.parser()
+    options, flags = gs.parser()
     atexit.register(cleanup)
     main()

Modified: grass-addons/grass7/raster/r.learn.ml/rlearn_crossval.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/rlearn_crossval.py	2017-11-26 03:23:50 UTC (rev 71840)
+++ grass-addons/grass7/raster/r.learn.ml/rlearn_crossval.py	2017-11-26 04:36:03 UTC (rev 71841)
@@ -9,26 +9,28 @@
 
 from __future__ import absolute_import
 from copy import deepcopy
-
 import numpy as np
 import os
 from numpy.random import RandomState
+import grass.script as gs
 
-import grass.script as gscript
 
 def specificity_score(y_true, y_pred):
     """
+
     Calculate specificity score
 
     Args
     ----
-    y_true: 1D numpy array of truth values
-    y_pred: 1D numpy array of predicted classes
+    y_true (1d numpy array): true values of class labels
+    y_pred (1d numpy array): predicted class labels
 
     Returns
     -------
-    specificity: specificity score
+    specificity (float): specificity score
+
     """
+
     from sklearn.metrics import confusion_matrix
 
     cm = confusion_matrix(y_true, y_pred)
@@ -41,6 +43,7 @@
 def varimp_permutation(estimator, X, y, n_permutations, scorer,
                        n_jobs, random_state):
     """
+
     Method to perform permutation-based feature importance during
     cross-validation (cross-validation is applied externally to this
     method)
@@ -55,16 +58,17 @@
 
     Args
     ----
-    estimator: estimator that has been fitted to a training partition
-    X, y: data and labels from a test partition
-    n_permutations: number of random permutations to apply
-    scorer: scikit-learn metric function to use
-    n_jobs: integer, number of processing cores
-    random_state: seed to pass to the numpy random.seed
+    estimator (object): estimator that has been fitted to a training partition
+    X, y: 2d and 1d numpy arrays of data and labels from a test partition
+    n_permutations (integer): number of random permutations to apply
+    scorer (object): scikit-learn metric function to use
+    n_jobs (integer): number of processing cores
+    random_state (integer): seed to pass to the numpy random.seed
 
     Returns
     -------
-    scores: scores for each predictor following permutation
+    scores (2d numpy array): scores for each predictor following permutation
+
     """
 
     from sklearn.externals.joblib import Parallel, delayed
@@ -89,19 +93,21 @@
 
 def __permute(estimator, X, y, best_score, scorer, random_state):
     """
+
     Permute each predictor and measure difference from best score
 
     Args
     ----
-    estimator: scikit learn estimator
-    X, y: data and labels from a test partition
-    best_score: best scorer obtained on unperturbed data
-    scorer: scoring method to use to measure importances
-    random_state: random seed
+    estimator (object): scikit learn estimator
+    X, y: 2d and 1d numpy arrays of data and labels from a test partition
+    best_score (float): best score obtained on unperturbed data
+    scorer (object): scoring method to use to measure importances
+    random_state (integer): seed for the numpy random number generator
 
     Returns
     -------
-    scores: 2D numpy array of scores for each predictor following permutation
+    scores (2D numpy array): scores for each predictor following permutation
+
     """
 
     rstate = RandomState(random_state)
@@ -122,72 +128,109 @@
     return scores
 
 
-def __parallel_fit(estimator, X, y, groups, train_indices, test_indices, sample_weight):
+def __parallel_fit(estimator, X, y, groups, train_indices, sample_weight):
     """
+
     Fit classifiers/regressors in parallel
 
     Args
     ----
-    estimator: scikit learn estimator
+    estimator (object): scikit learn estimator
     X, y: 2D and 1D numpy arrays of training data and labels
-    groups: 1D numpy array of len(y) containing group labels
-    train_indices, test_indices: 1D numpy arrays of indices to use for training/validation
-    sample_weight: 1D numpy array of len(y) containing weights to use during fitting
-                    applied only to XGBoost and Gradient Boosting classifiers
+    groups (1D numpy array): group labels of len(y)
+    train_indices (1D numpy array): indices to use for training
+    sample_weight (1D numpy array): weights of len(y) to use during fitting
+
+    Returns
+    -------
+    rs_estimator (object): estimator fitted on the training partition
+
     """
+    from sklearn.pipeline import Pipeline
 
+    rs_estimator = deepcopy(estimator)
+
     # create training and test folds
-    X_train, X_test = X[train_indices], X[test_indices]
-    y_train, y_test = y[train_indices], y[test_indices]
-    if groups is not None: groups_train = groups[train_indices]
-    else: groups_train = None
+    X_train, y_train = X[train_indices], y[train_indices]
 
+    if groups is not None:
+        groups_train = groups[train_indices]
+    else:
+        groups_train = None
+
     # subset training and test fold sample_weight
-    if sample_weight is not None: weights = sample_weight[train_indices]
+    if sample_weight is not None:
+        weights = sample_weight[train_indices]
 
-    # train estimator
+    # specify fit_params for sample_weights if required
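+    # (assumes any Pipeline names its estimator step 'classifier', as set
+    # up in rlearn_utils)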
+    if isinstance(estimator, Pipeline) and sample_weight is not None:
+        fit_params = {'classifier__sample_weight': weights}
+    elif not isinstance(estimator, Pipeline) and sample_weight is not None:
+        fit_params = {'sample_weight': weights}
+    else:
+        fit_params = {}
+
+    # fit estimator with/without groups
     if groups is not None and type(estimator).__name__ in ['RandomizedSearchCV', 'GridSearchCV']:
-        if sample_weight is None: estimator.fit(X_train, y_train, groups=groups_train)
-        else: estimator.fit(X_train, y_train, groups=groups_train, sample_weight=weights)
+        rs_estimator.fit(X_train, y_train, groups=groups_train, **fit_params)
     else:
-        if sample_weight is None: estimator.fit(X_train, y_train)
-        else: estimator.fit(X_train, y_train, sample_weight=weights)
+        rs_estimator.fit(X_train, y_train, **fit_params)
 
-    return estimator
+    return rs_estimator
 
 
 def cross_val_scores(estimator, X, y, groups=None, sample_weight=None, cv=3,
                      scoring='accuracy', feature_importances=False,
                      n_permutations=25, random_state=None, n_jobs=-1):
     """
+
     Stratified Kfold and GroupFold cross-validation using multiple
     scoring metrics and permutation feature importances
 
     Args
     ----
-    estimator: Scikit learn estimator
+    estimator (object): Scikit learn estimator
     X, y: 2D and 1D numpy array of training data and labels
-    groups: 1D numpy array containing group labels
-    sample_weight: 1D numpy array[n_samples,] of sample weights
-    cv: Integer of cross-validation folds or sklearn.model_selection object
-    scoring: List of performance metrics to use
-    feature_importances: Boolean to perform permutation-based importances
-    n_permutations: Number of permutations during feature importance
-    random_state: Seed to pass to the random number generator
+    groups (1D numpy array): group labels
+    sample_weight (1D numpy array): per-sample weights of shape [n_samples,]
+    cv (integer or object): Number of cross-validation folds or
+        sklearn.model_selection object
+    scoring (list): List of performance metrics to use
+    feature_importances (boolean): option to perform permutation-based importances
+    n_permutations (integer): Number of permutations during feature importance
+    random_state (integer): Seed to pass to the random number generator
 
     Returns
     -------
-    scores: Dict, containing lists of scores per cross-validation fold
-    byclass_scores: Dict, containing scores per class
-    fimp: 2D numpy array of permutation feature importances per feature
-    clf_resamples: List, fitted estimators
-    predictions: 2D numpy array with y_true, y_pred, fold
+    scores (dict): Containing lists of scores per cross-validation fold
+    byclass_scores (dict): Containing scores per class
+    fimp (2D numpy array): permutation feature importances per feature
+    clf_resamples (list): List of fitted estimators
+    predictions (2d numpy array): with y_true, y_pred, fold
+
     """
 
     from sklearn import metrics
     from sklearn.model_selection import StratifiedKFold
     from sklearn.externals.joblib import Parallel, delayed
 
+    # first unwrap the estimator from any potential pipelines or gridsearchCV
+    if type(estimator).__name__ == 'Pipeline':
+        clf_type = estimator.named_steps['classifier']
+    else:
+        clf_type = estimator
+
+    if type(clf_type).__name__ in ('GridSearchCV', 'RandomizedSearchCV'):
+        clf_type = clf_type.best_estimator_
+
+    # force single-threaded cross-validation for estimators that already
+    # parallelize internally, to avoid nested multiprocessing
+    if type(clf_type).__name__ in [
+            'RandomForestClassifier',
+            'RandomForestRegressor',
+            'ExtraTreesClassifier',
+            'ExtraTreesRegressor',
+            'KNeighborsClassifier']:
+        n_jobs = 1
+
     # -------------------------------------------------------------------------
     # create copies of estimator and create cross-validation iterator
     # -------------------------------------------------------------------------
@@ -224,7 +267,11 @@
                        'roc_auc': metrics.roc_auc_score,
                        'zero_one_loss': metrics.zero_one_loss,
                        'r2': metrics.r2_score,
-                       'neg_mean_squared_error': metrics.mean_squared_error}
+                       'explained_variance': metrics.explained_variance_score,
+                       'neg_mean_absolute_error': metrics.mean_absolute_error,
+                       'neg_mean_squared_error': metrics.mean_squared_error,
+                       'neg_mean_squared_log_error': metrics.mean_squared_log_error,
+                       'neg_median_absolute_error': metrics.median_absolute_error}
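+    # note: the 'neg_*' keys reuse sklearn's plain error metrics, so the
+    # stored score values are positive error magnitudes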
 
     byclass_methods = {'f1': metrics.f1_score,
                        'fbeta': metrics.fbeta_score,
@@ -245,9 +292,8 @@
         try:
             list(scoring_methods.keys()).index(i)
         except:
-            gscript.fatal(('Scoring ', i, ' is not a valid scoring method',
-                            os.linesep(),
-                            'Valid methods are: ', scoring_methods.keys()))
+            gs.fatal('Scoring {0} is not a valid scoring method{1}'
+                     'Valid methods are: {2}'.format(
+                         i, os.linesep, list(scoring_methods.keys())))
 
     # set averaging type for global binary or multiclass scores
     if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
@@ -279,11 +325,9 @@
     # -------------------------------------------------------------------------
     # Perform multiprocessing fitting of clf on each fold
     # -------------------------------------------------------------------------
-
     clf_resamples = Parallel(n_jobs=n_jobs)(
-        delayed(__parallel_fit)(clf, X, y, groups, train_indices,
-                              test_indices, sample_weight)
-        for train_indices, test_indices in zip(trains, tests))
+        delayed(__parallel_fit)(clf, X, y, groups, train_indices, sample_weight)
+        for train_indices in trains)
 
     # -------------------------------------------------------------------------
     # loop through each fold and calculate performance metrics
@@ -296,8 +340,7 @@
     for train_indices, test_indices in zip(trains, tests):
 
         # create training and test folds
-        X_train, X_test = X[train_indices], X[test_indices]
-        y_train, y_test = y[train_indices], y[test_indices]
+        X_test, y_test = X[test_indices], y[test_indices]
 
         # prediction of test fold
         y_pred = clf_resamples[fold].predict(X_test)
@@ -317,8 +360,13 @@
             elif m == 'kappa' or m == 'specificity' or m == 'accuracy' \
             or m == 'hamming_loss' or m == 'jaccard_similarity' \
             or m == 'log_loss' or m == 'zero_one_loss' \
-            or m == 'matthews_corrcoef' or m == 'r2' \
-            or m == 'neg_mean_squared_error':
+            or m == 'matthews_corrcoef' \
+            or m == 'r2' \
+            or m == 'explained_variance' \
+            or m == 'neg_mean_absolute_error' \
+            or m == 'neg_mean_squared_error' \
+            or m == 'neg_mean_squared_log_error' \
+            or m == 'neg_median_absolute_error':
                 scores[m] = np.append(
                     scores[m], scoring_methods[m](y_test, y_pred))
 
@@ -347,4 +395,4 @@
                 scoring_methods[scoring[0]], n_jobs, random_state)
         fold += 1
 
-    return(scores, byclass_scores, fimp, clf_resamples, predictions)
\ No newline at end of file
+    return(scores, byclass_scores, fimp, clf_resamples, predictions)
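
As a usage sketch (X and y here are assumed to be the arrays returned by the
sampling step; any scikit-learn estimator will do):

    from sklearn.ensemble import RandomForestClassifier
    from rlearn_crossval import cross_val_scores

    clf = RandomForestClassifier(n_estimators=100, random_state=1)
    scores, byclass, fimp, models, preds = cross_val_scores(
        clf, X, y, cv=3, scoring=['accuracy', 'kappa'],
        feature_importances=True, n_permutations=25,
        random_state=1, n_jobs=-1)
    # n_jobs is forced back to 1 internally here, because random forests
    # already parallelize their own fitting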

Added: grass-addons/grass7/raster/r.learn.ml/rlearn_prediction.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/rlearn_prediction.py	                        (rev 0)
+++ grass-addons/grass7/raster/r.learn.ml/rlearn_prediction.py	2017-11-26 04:36:03 UTC (rev 71841)
@@ -0,0 +1,284 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+The module rlearn_prediction contains functions to
+perform predictions on GRASS rasters
+"""
+
+from __future__ import absolute_import
+import numpy as np
+import grass.script as gs
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.gis.region import Region
+from grass.pygrass.raster import numpy2raster
+
+
+def predict(estimator, predictors, output, predict_type='raw', index=None,
+            class_labels=None, overwrite=False, rowincr=25, n_jobs=-2):
+    """
+
+    Prediction on list of GRASS rasters using a fitted scikit learn model
+
+    Args
+    ----
+    estimator (object): scikit-learn estimator object
+    predictors (list): Names of GRASS rasters
+    output (string): Name of GRASS raster to output classification results
+    predict_type (string): 'raw' for classification/regression;
+        'prob' for class probabilities
+    index (list): Optional, list of class indices to export
+    class_labels (1d numpy array): Optional, class labels
+    overwrite (boolean): enable overwriting of existing rasters
+    rowincr (integer): Number of raster rows to read and predict per chunk
+    n_jobs (integer): Number of processing cores;
+        -1 for all cores; -2 for all cores-1
+
+    """
+
+    from sklearn.externals.joblib import Parallel, delayed
+
+    # TODO
+    # better memory efficiency and use of memmap for parallel
+    # processing
+    #from sklearn.externals.joblib.pool import has_shareable_memory
+
+    # first unwrap the estimator from any potential pipelines or gridsearchCV
+    if type(estimator).__name__ == 'Pipeline':
+        clf_type = estimator.named_steps['classifier']
+    else:
+        clf_type = estimator
+
+    if type(clf_type).__name__ in ('GridSearchCV', 'RandomizedSearchCV'):
+        clf_type = clf_type.best_estimator_
+
+    # force single-threaded prediction for estimators that already
+    # parallelize internally, to avoid nested multiprocessing
+    if type(clf_type).__name__ in [
+            'RandomForestClassifier',
+            'RandomForestRegressor',
+            'ExtraTreesClassifier',
+            'ExtraTreesRegressor',
+            'KNeighborsClassifier']:
+        n_jobs = 1
+
+    # convert potential single index to list
+    if isinstance(index, int):
+        index = [index]
+
+    # get current computational region settings
+    current = Region()
+
+    # create lists of row increments
+    row_mins, row_maxs = [], []
+    for row in range(0, current.rows, rowincr):
+        if row+rowincr > current.rows:
+            rowincr = current.rows - row
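+            # the final chunk is shortened to end exactly at the region boundary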
+        row_mins.append(row)
+        row_maxs.append(row+rowincr)
+
+    # perform predictions on lists of row increments in parallel
+    prediction = Parallel(n_jobs=n_jobs, max_nbytes=None)(
+        delayed(__predict_parallel2)
+        (estimator, predictors, predict_type, current, row_min, row_max)
+        for row_min, row_max in zip(row_mins, row_maxs))
+    prediction = np.vstack(prediction)
+
+#    # perform predictions on lists of rows in parallel
+#    prediction = Parallel(n_jobs=n_jobs, max_nbytes=None)(
+#        delayed(__predict_parallel)
+#        (estimator, predictors, predict_type, current, row)
+#        for row in range(current.rows))
+#    prediction = np.asarray(prediction)
+
+    # determine raster dtype from the predictions
+    if prediction.dtype.kind == 'f':
+        ftype = 'FCELL'
+    else:
+        ftype = 'CELL'
+
+    # writing of predicted results for classification/regression
+    if predict_type == 'raw':
+        numpy2raster(array=prediction, mtype=ftype, rastname=output,
+                     overwrite=overwrite)
+
+    # writing of predicted results for probabilities
+    if predict_type == 'prob':
+
+        # use class labels if supplied
+        # else output predictions as 0,1,2...n
+        if class_labels is None:
+            class_labels = range(prediction.shape[2])
+
+        # output all class probabilities if subset is not specified
+        if index is None:
+            index = class_labels
+
+        # select indexes of predictions 3d numpy array to be exported to rasters
+        selected_prediction_indexes = [i for i, x in enumerate(class_labels) if x in index]
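+        # e.g. class_labels = [3, 5, 9] with index = [5, 9] selects
+        # prediction[:, :, 1] and prediction[:, :, 2], written out as
+        # '<output>_5' and '<output>_9' (hypothetical values)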
+
+    # write each band of the 3d numpy array as a probability raster
+        for pred_index, label in zip(selected_prediction_indexes, index):
+            rastername = output + '_' + str(label)
+            numpy2raster(array=prediction[:, :, pred_index], mtype='FCELL',
+                         rastname=rastername, overwrite=overwrite)
+
+
+def __predict_parallel(estimator, predictors, predict_type, current, row):
+    """
+
+    Performs prediction on a single row of a stack of GRASS rasters
+
+    Args
+    ----
+    estimator (object): Scikit-learn estimator object
+    predictors (list): Names of GRASS rasters
+    predict_type (string): 'raw' for classification/regression;
+        'prob' for class probabilities
+    current (dict): current region settings
+    row (integer): Row number to perform prediction on
+
+    Returns
+    -------
+    result (2d or 3d numpy array): Prediction results
+
+    """
+
+    # initialize output
+    result, mask = None, None
+
+    # open grass rasters
+    n_features = len(predictors)
+    rasstack = [0] * n_features
+
+    for i in range(n_features):
+        rasstack[i] = RasterRow(predictors[i])
+        if rasstack[i].exist() is True:
+            rasstack[i].open('r')
+        else:
+            gs.fatal("GRASS raster " + predictors[i] +
+                     " does not exist.... exiting")
+
+    # loop through each row, and each band and add to 2D img_np_row
+    img_np_row = np.zeros((current.cols, n_features))
+    for band in range(n_features):
+        img_np_row[:, band] = np.array(rasstack[band][row])
+
+    # create mask
+    img_np_row[img_np_row == -2147483648] = np.nan
+    mask = np.zeros((img_np_row.shape[0]))
+    for feature in range(n_features):
+        invalid_indexes = np.nonzero(np.isnan(img_np_row[:, feature]))
+        mask[invalid_indexes] = np.nan
+
+    # reshape each row-band matrix into a n*m array
+    nsamples = current.cols
+    flat_pixels = img_np_row.reshape((nsamples, n_features))
+
+    # remove NaNs prior to passing to scikit-learn predict
+    flat_pixels = np.nan_to_num(flat_pixels)
+
+    # perform prediction for classification/regression
+    if predict_type == 'raw':
+        result = estimator.predict(flat_pixels)
+        result = result.reshape((current.cols))
+
+        # determine nodata value and grass raster type
+        if result.dtype == 'float':
+            nodata = np.nan
+        else:
+            nodata = -2147483648
+
+        # replace NaN values so that the prediction does not have a border
+        result[np.nonzero(np.isnan(mask))] = nodata
+
+    # perform prediction for class probabilities
+    if predict_type == 'prob':
+        result = estimator.predict_proba(flat_pixels)
+        result = result.reshape((current.cols, result.shape[1]))
+        result[np.nonzero(np.isnan(mask))] = np.nan
+
+    # close maps
+    for i in range(n_features):
+        rasstack[i].close()
+
+    return result
+
+
+def __predict_parallel2(estimator, predictors, predict_type, current, row_min, row_max):
+    """
+    Performs prediction on range of rows in grass rasters
+
+    Args
+    ----
+    estimator (object): scikit-learn estimator object
+    predictors (list): Names of GRASS rasters
+    predict_type (string): 'raw' for classification/regression;
+        'prob' for class probabilities
+    current (dict): current region settings
+    row_min, row_max (integers): Range of rows to perform predictions on
+
+    Returns
+    -------
+    result (2d or 3d numpy array): 2D array of predictions for
+        classification/regression, or 3D array of class probabilities
+    """
+
+    # initialize output
+    result, mask = None, None
+
+    # open grass rasters
+    n_features = len(predictors)
+    rasstack = [0] * n_features
+
+    for i in range(n_features):
+        rasstack[i] = RasterRow(predictors[i])
+        if rasstack[i].exist() is True:
+            rasstack[i].open('r')
+        else:
+            gs.fatal("GRASS raster " + predictors[i] +
+                     " does not exist.... exiting")
+
+    # loop through each row, and each band and add to 2D img_np_row
+    img_np_row = np.zeros((row_max-row_min, current.cols, n_features))
+    for row in range(row_min, row_max):
+        for band in range(n_features):
+            img_np_row[row-row_min, :, band] = np.array(rasstack[band][row])
+
+    # create mask
+    img_np_row[img_np_row == -2147483648] = np.nan
+    mask = np.zeros((img_np_row.shape[0], img_np_row.shape[1]))
+    for feature in range(n_features):
+        invalid_indexes = np.nonzero(np.isnan(img_np_row[:, :, feature]))
+        mask[invalid_indexes] = np.nan
+
+    # reshape each row-band matrix into a n*m array
+    nsamples = (row_max-row_min) * current.cols
+    flat_pixels = img_np_row.reshape((nsamples, n_features))
+
+    # remove NaNs prior to passing to scikit-learn predict
+    flat_pixels = np.nan_to_num(flat_pixels)
+
+    # perform prediction for classification/regression
+    if predict_type == 'raw':
+        result = estimator.predict(flat_pixels)
+        result = result.reshape((row_max-row_min, current.cols))
+
+        # determine nodata value and grass raster type
+        if result.dtype == 'float':
+            nodata = np.nan
+        else:
+            nodata = -2147483648
+
+        # replace NaN values so that the prediction does not have a border
+        result[np.nonzero(np.isnan(mask))] = nodata
+
+    # perform prediction for class probabilities
+    if predict_type == 'prob':
+        result = estimator.predict_proba(flat_pixels)
+        result = result.reshape((row_max-row_min, current.cols, result.shape[1]))
+        result[np.nonzero(np.isnan(mask))] = np.nan
+
+    # close maps
+    for i in range(n_features):
+        rasstack[i].close()
+
+    return result
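
Both row helpers share the same nodata handling; in isolation the pattern
amounts to the following (illustrative values only):

    import numpy as np

    row = np.array([[1.0, -2147483648.0], [2.0, 3.0]])
    row[row == -2147483648] = np.nan            # GRASS CELL nodata -> NaN
    mask = np.zeros(row.shape[0])
    mask[np.isnan(row).any(axis=1)] = np.nan    # flag pixels with any NaN band
    flat = np.nan_to_num(row)                   # zeros are temporary fillers
    # predictions at np.isnan(mask) positions are later overwritten with nodata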

Added: grass-addons/grass7/raster/r.learn.ml/rlearn_sampling.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/rlearn_sampling.py	                        (rev 0)
+++ grass-addons/grass7/raster/r.learn.ml/rlearn_sampling.py	2017-11-26 04:36:03 UTC (rev 71841)
@@ -0,0 +1,192 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+The module rlearn_sampling contains functions to
+extract training data from GRASS rasters.
+"""
+
+from __future__ import absolute_import
+import numpy as np
+import tempfile
+import grass.script as gs
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.gis.region import Region
+from grass.pygrass.vector import VectorTopo
+from grass.pygrass.utils import get_raster_for_points, pixel2coor
+
+
+def extract_pixels(response, predictors, lowmem=False, na_rm=False):
+    """
+
+    Samples a list of GRASS rasters using a labelled raster,
+    reading one raster at a time
+
+    Args
+    ----
+    response (string): Name of GRASS raster with labelled pixels
+    predictors (list): List of GRASS raster names containing explanatory variables
+    lowmem (boolean): Use numpy memmap to query predictors
+    na_rm (boolean): Remove samples containing NaNs
+
+    Returns
+    -------
+    training_data (2d numpy array): Extracted raster values
+    training_labels (1d numpy array): Numpy array of labels
+    is_train (2d numpy array): Map coordinates of the labelled pixel positions
+
+    """
+
+    current = Region()
+
+    # open response raster as rasterrow and read as np array
+    if RasterRow(response).exist() is True:
+        roi_gr = RasterRow(response)
+        roi_gr.open('r')
+
+        if lowmem is False:
+            response_np = np.array(roi_gr)
+        else:
+            response_np = np.memmap(
+                tempfile.NamedTemporaryFile(),
+                dtype='float32', mode='w+',
+                shape=(current.rows, current.cols))
+            response_np[:] = np.array(roi_gr)[:]
+    else:
+        gs.fatal("GRASS response raster does not exist.... exiting")
+
+    # determine number of predictor rasters
+    n_features = len(predictors)
+
+    # check to see if all predictors exist
+    for i in range(n_features):
+        if RasterRow(predictors[i]).exist() is not True:
+            gs.fatal("GRASS raster " + predictors[i] +
+                          " does not exist.... exiting")
+
+    # check if any of those pixels are labelled (not equal to nodata)
+    # can use even if roi is FCELL because nodata will be nan
+    is_train = np.nonzero(response_np > -2147483648)
+    training_labels = response_np[is_train]
+    n_labels = np.array(is_train).shape[1]
+
+    # create a zero numpy array with one row per training label
+    if lowmem is False:
+        training_data = np.zeros((n_labels, n_features))
+    else:
+        training_data = np.memmap(tempfile.NamedTemporaryFile(),
+                                  dtype='float32', mode='w+',
+                                  shape=(n_labels, n_features))
+
+    # Loop through each raster and sample pixel values at training indexes
+    if lowmem is True:
+        feature_np = np.memmap(tempfile.NamedTemporaryFile(),
+                               dtype='float32', mode='w+',
+                               shape=(current.rows, current.cols))
+
+    for f in range(n_features):
+        predictor_gr = RasterRow(predictors[f])
+        predictor_gr.open('r')
+
+        if lowmem is False:
+            feature_np = np.array(predictor_gr)
+        else:
+            feature_np[:] = np.array(predictor_gr)[:]
+
+        training_data[0:n_labels, f] = feature_np[is_train]
+
+        # close each predictor map
+        predictor_gr.close()
+
+    # convert any CELL map nodata values to NaN in the training data
+    for i in range(n_features):
+        training_data[training_data[:, i] == -2147483648, i] = np.nan
+
+    # convert indexes of training pixels to an n*2 array of map coordinates
+    is_train = np.array(is_train).T
+    for i in range(is_train.shape[0]):
+        is_train[i, :] = np.array(pixel2coor(tuple(is_train[i]), current))
+
+    # close the response map
+    roi_gr.close()
+
+    # remove samples containing NaNs
+    if na_rm is True:
+        if np.isnan(training_data).any():
+            gs.message('Removing samples with NaN values in the raster feature variables...')
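+        # training_data must be filtered last: the NaN mask is derived from it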
+        training_labels = training_labels[~np.isnan(training_data).any(axis=1)]
+        is_train = is_train[~np.isnan(training_data).any(axis=1)]
+        training_data = training_data[~np.isnan(training_data).any(axis=1)]
+
+    return(training_data, training_labels, is_train)
+
+
+def extract_points(gvector, grasters, field, na_rm=False):
+    """
+
+    Extract values from grass rasters using vector points input
+
+    Args
+    ----
+    gvector (string): Name of grass points vector
+    grasters (list): Names of grass raster to query
+    field (string): Name of field in table to use as response variable
+    na_rm (boolean): Remove samples containing NaNs
+
+    Returns
+    -------
+    X (2d numpy array): Training data
+    y (1d numpy array): Array with the response variable
+    coordinates (2d numpy array): Sample coordinates
+
+    """
+
+    # open grass vector
+    points = VectorTopo(gvector.split('@')[0])
+    points.open('r')
+
+    # create link to attribute table
+    points.dblinks.by_name(name=gvector)
+
+    # extract table field to numpy array
+    table = points.table
+    cur = table.execute("SELECT {field} FROM {name}".format(field=field, name=table.name))
+    y = np.array([np.nan if c is None else c[0] for c in cur])
+    y = np.array(y, dtype='float')
+
+    # extract raster data
+    X = np.zeros((points.num_primitives()['point'], len(grasters)), dtype=float)
+    for i, raster in enumerate(grasters):
+        rio = RasterRow(raster)
+        if rio.exist() is False:
+            gs.fatal('Raster {x} does not exist....'.format(x=raster))
+        values = np.asarray(get_raster_for_points(points, rio))
+        coordinates = values[:, 1:3]
+        X[:, i] = values[:, 3]
+        rio.close()
+
+    # set any grass integer nodata values to NaN
+    X[X == -2147483648] = np.nan
+
+    # remove missing response data
+    X = X[~np.isnan(y)]
+    coordinates = coordinates[~np.isnan(y)]
+    y = y[~np.isnan(y)]
+
+    # cast to int type if the classes are represented by integers
+    if all(y % 1 == 0):
+        y = np.asarray(y, dtype='int')
+
+    # close
+    points.close()
+
+    # remove samples containing NaNs
+    if na_rm is True:
+        if np.isnan(X).any():
+            gs.message('Removing samples with NaN values in the raster feature variables...')
+
+        y = y[~np.isnan(X).any(axis=1)]
+        coordinates = coordinates[~np.isnan(X).any(axis=1)]
+        X = X[~np.isnan(X).any(axis=1)]
+
+    return(X, y, coordinates)
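
Typical use of the new sampling helpers then looks roughly like this (a
sketch; the vector, column and raster names below are placeholders):

    from rlearn_sampling import extract_points

    X, y, coords = extract_points(gvector='training_points',
                                  grasters=['elevation', 'slope'],
                                  field='landclass', na_rm=True)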

Modified: grass-addons/grass7/raster/r.learn.ml/rlearn_utils.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/rlearn_utils.py	2017-11-26 03:23:50 UTC (rev 71840)
+++ grass-addons/grass7/raster/r.learn.ml/rlearn_utils.py	2017-11-26 04:36:03 UTC (rev 71841)
@@ -5,36 +5,36 @@
 The module rlearn_utils contains functions to assist
 with passing pre-defined scikit learn classifiers
 and other utilities for loading/saving training data.
-
 """
 
 from __future__ import absolute_import
 from subprocess import PIPE
-
 import numpy as np
 import os
-from copy import deepcopy
-
-import grass.script as gscript
+import grass.script as gs
 from grass.pygrass.modules.shortcuts import imagery as im
 
+
 def model_classifiers(estimator, random_state, n_jobs, p, weights=None):
     """
+
     Provides the classifiers and parameters used by the module
 
     Args
     ----
-    estimator: Name of estimator
-    random_state: Seed to use in randomized components
-    n_jobs: Integer, number of processing cores to use
-    p: Dict, containing classifier setttings
-    weights: None, or 'balanced' to add class_weights
+    estimator (string): Name of scikit learn estimator
+    random_state (integer): Seed to use in randomized components
+    n_jobs (integer): Number of processing cores to use;
+        -1 for all cores; -2 for all cores-1
+    p (dict): Classifier settings (keys) and values
+    weights (string): None, or 'balanced' to add class_weights
 
     Returns
     -------
-    clf: Scikit-learn classifier object
-    mode: Flag to indicate whether classifier performs classification or
-          regression
+    clf (object): Scikit-learn classifier object
+    mode (string): Flag to indicate whether classifier performs classification
+        or regression
+
     """
 
     from sklearn.linear_model import LogisticRegression
@@ -60,37 +60,16 @@
             from sklearn.pipeline import Pipeline
             from pyearth import Earth
 
-            earth_classifier = Pipeline([('Earth',
-                                          Earth(max_degree=p['max_degree'])),
-                                         ('Logistic', LogisticRegression(n_jobs=n_jobs))])
+            earth_classifier = Pipeline(
+                [('classifier', Earth(max_degree=p['max_degree'])),
+                 ('Logistic', LogisticRegression(n_jobs=n_jobs))])
 
-            classifiers = {'EarthClassifier': earth_classifier,
-                           'EarthRegressor': Earth(max_degree=p['max_degree'])}
-        except:
-            gscript.fatal('Py-earth package not installed')
-
-    elif estimator == 'XGBClassifier' or estimator == 'XGBRegressor':
-        try:
-            from xgboost import XGBClassifier, XGBRegressor
-
-            if p['max_depth'] is None:
-                p['max_depth'] = int(3)
-
             classifiers = {
-                'XGBClassifier':
-                    XGBClassifier(learning_rate=p['learning_rate'],
-                                  n_estimators=p['n_estimators'],
-                                  max_depth=p['max_depth'],
-                                  subsample=p['subsample'],
-                                  nthread=n_jobs),
-                'XGBRegressor':
-                    XGBRegressor(learning_rate=p['learning_rate'],
-                                 n_estimators=p['n_estimators'],
-                                 max_depth=p['max_depth'],
-                                 subsample=p['subsample'],
-                                 nthread=n_jobs)}
+                'EarthClassifier': earth_classifier,
+                'EarthRegressor': Earth(max_degree=p['max_degree'])
+                }
         except:
-            gscript.fatal('XGBoost package not installed')
+            gs.fatal('Py-earth package not installed')
     else:
         # core sklearn classifiers go here
         classifiers = {
@@ -101,6 +80,7 @@
             'LogisticRegression':
                 LogisticRegression(C=p['C'],
                                    class_weight=weights,
+                                   solver='liblinear',
                                    random_state=random_state,
                                    n_jobs=n_jobs,
                                    fit_intercept=True),
@@ -189,7 +169,6 @@
         or estimator == 'LinearDiscriminantAnalysis' \
         or estimator == 'QuadraticDiscriminantAnalysis' \
         or estimator == 'EarthClassifier' \
-        or estimator == 'XGBClassifier' \
         or estimator == 'SVC' \
         or estimator == 'KNeighborsClassifier':
         mode = 'classification'
@@ -201,15 +180,17 @@
 
 def save_training_data(X, y, groups, coords, file):
     """
+
     Saves any extracted training data to a csv file
 
     Args
     ----
-    X: Numpy array containing predictor values
-    y: Numpy array containing labels
-    groups: Numpy array of group labels
-    coords: Numpy array containing xy coordinates of samples
-    file: Path to a csv file to save data to
+    X (2d numpy array): Numpy array containing predictor values
+    y (1d numpy array): Numpy array containing labels
+    groups (1d numpy array): Numpy array of group labels
+    coords (2d numpy array): Numpy array containing xy coordinates of samples
+    file (string): Path to a csv file to save data to
+
     """
 
     # if there are no group labels, create a nan filled array
@@ -223,18 +204,20 @@
 
 def load_training_data(file):
     """
+
     Loads training data and labels from a csv file
 
     Args
     ----
-    file: Path to a csv file to save data to
+    file (string): Path to a csv file to save data to
 
     Returns
     -------
-    X: Numpy array containing predictor values
-    y: Numpy array containing labels
-    groups: Numpy array of group labels, or None
-    coords: Numpy array containing x,y coordinates of samples
+    X (2d numpy array): Numpy array containing predictor values
+    y (1d numpy array): Numpy array containing labels
+    groups (1d numpy array): Numpy array of group labels, or None
+    coords (2d numpy array): Numpy array containing x,y coordinates of samples
+
     """
 
     training_data = np.loadtxt(file, delimiter=',')
@@ -258,16 +241,18 @@
 
 def maps_from_group(group):
     """
+
     Parse individual rasters into a list from an imagery group
 
     Args
     ----
-    group: String; GRASS imagery group
+    group (string): Name of GRASS imagery group
 
     Returns
     -------
-    maplist: List containing individual GRASS raster maps
-    map_names: List with print friendly map names
+    maplist (list): List containing individual GRASS raster maps
+    map_names (list): List with print friendly map names
+
     """
     groupmaps = im.group(group=group, flags="g",
                          quiet=True, stdout_=PIPE).outputs.stdout
@@ -280,3 +265,17 @@
         map_names.append(rastername.split('@')[0])
 
     return(maplist, map_names)
+
+
+def save_model(estimator, X, y, sample_coords, groups, filename):
+    """Pickle an estimator and its training data to file using joblib"""
+    from sklearn.externals import joblib
+
+    joblib.dump((estimator, X, y, sample_coords, groups), filename)
+
+
+def load_model(filename):
+    """Load a pickled estimator and its training data using joblib"""
+    from sklearn.externals import joblib
+
+    estimator, X, y, sample_coords, groups = joblib.load(filename)
+
+    return (estimator, X, y, sample_coords, groups)
\ No newline at end of file
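
The new persistence helpers are thin joblib wrappers, so a fitted model and
its training data round-trip as follows (a sketch; 'model.gz' is a
placeholder path):

    from rlearn_utils import save_model, load_model

    save_model(clf, X, y, coords, groups, 'model.gz')
    clf, X, y, coords, groups = load_model('model.gz')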



More information about the grass-commit mailing list