[GRASS-SVN] r71000 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org
Tue May 2 10:18:58 PDT 2017


Author: spawley
Date: 2017-05-02 10:18:58 -0700 (Tue, 02 May 2017)
New Revision: 71000

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
Log:
r.learn.ml update to manual

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-05-02 16:45:12 UTC (rev 70999)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-05-02 17:18:58 UTC (rev 71000)
@@ -3,11 +3,11 @@
 <p><em>r.learn.ml</em> represents a front-end to the scikit-learn Python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using several classifiers that are commonly used in remote sensing and spatial modelling. The choice of classifier is set using the <em>classifier</em> parameter. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>. The following classification and regression methods are available (a minimal usage sketch follows the list):</p>
 
 <ul>
-	<li><em>LogisticRegression</em> represents a linear model for classification and is a modification of linear regression, but using the logistic distribution, which enables the use of a categorical response variable. If the <em>response</em> raster is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
+	<li><em>LogisticRegression</em> represents a linear model for classification and is a modification of linear regression, but uses the logistic distribution, which enables the use of a categorical response variable. If the <em>response</em> raster is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
 	
 	<li><em>LinearDiscriminantAnalysis</em> and <em>QuadraticDiscriminantAnalysis</em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification.</li>
 	
-	<li><em>KNeighborsClassifier</em> is a simple classification method based on closest distance to a predefined number of training samples and making the prediction from this. Two hyperparameters are exposed in the gui, with <em>n_neighbors</em> governing the number of neighbors to use to decide the prediction label, and <em>weights</em> specifying whether these neighbors should have equal weights or whether they should be inversely weighted by their distance.</li>
+	<li><em>KNeighborsClassifier</em> is a simple classification method that predicts the label from the closest distances to a predefined number of training samples. Two hyperparameters are exposed, with <em>n_neighbors</em> governing the number of neighbors used to decide the prediction label, and <em>weights</em> specifying whether these neighbors should have equal weights or whether they should be inversely weighted by their distance.</li>
 	
 	<li><em>GaussianNB</em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes' theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters.</li>
 	
@@ -17,12 +17,12 @@
 	
 	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion, whereby <em>n_estimators</em> models are fitted in successive steps, with each step designed to better fit samples that are not currently well predicted by the previous steps. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, the <em>XGBClassifier</em> and <em>XGBRegressor</em> models represent an accelerated version of gradient boosting, which can optionally be installed from the XGBoost Python package.</li>
 	
-	<li>The <em>SVC</em> model is C-Support Vector Classifier. Only a linear kernel is supported because non-linear kernels using scikit learn for typical remote sensing and spatial analysis datasets which consist of large numbers of samples are too slow to be practical. This classifier can still be slow for large datasets.</li>
+	<li>The <em>SVC</em> model is the C-Support Vector Classifier. Only a linear kernel is supported, because non-linear kernels in scikit-learn are too slow to be practical for typical remote sensing and spatial analysis datasets, which consist of large numbers of samples. This classifier can still be slow for large datasets that include &gt; 10,000 training samples.</li>
 	
 	<li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are Python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which can optionally be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression, which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
 </ul>
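 
 <p>As a minimal usage sketch only (the map names here are hypothetical, and the <em>group</em> and <em>output</em> parameter names are assumed rather than taken from this manual section), a basic classification could be run as follows:</p>
 
 <div class="code"><pre>
 # sketch: fit a Gaussian Naive Bayes classifier to the rasters in an
 # imagery group using labelled training pixels, and predict a new raster
 r.learn.ml group=lsat7 trainingmap=training_pixels output=landclass \
   classifier=GaussianNB
 </pre></div>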
 
-<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuning using a grid-search by inputting multiple parameter settings as a comma-separated list. This tuning can also be accomplished simultaneously with nested cross-validation by also settings the <em>cv</em> option to &gt > 1. The parameters consist of:</p>
+<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuned using a grid-search by inputting multiple parameter settings as a comma-separated list. This tuning can also be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to &gt; 1 (see the example following the list below). The parameters consist of:</p>
 
 <ul>
 	<li><em>C</em> is the inverse of the regularization strength; regularization applies a penalty to the model in order to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models.</li>
@@ -38,7 +38,7 @@
 	<li>The main control on accuracy in the Earth classifier consists of <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass. Settings of <em>max_degree</em> = 1 or 2 offer a good trade-off between accuracy and computational performance.</li>
 </ul>
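 
 <p>As a sketch of the tuning syntax described above (map names are hypothetical, and the lowercase command-line spelling <em>c</em> of the <em>C</em> option is an assumption), a grid-search over two regularization settings with nested cross-validation might look like:</p>
 
 <div class="code"><pre>
 # sketch: tune the inverse regularization strength of a linear SVC by
 # supplying a comma-separated list, evaluated simultaneously with
 # 3-fold nested cross-validation (cv &gt; 1)
 r.learn.ml group=lsat7 trainingmap=training_pixels output=landclass \
   classifier=SVC c=1,10 cv=3
 </pre></div>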
 
-<p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each k partition. The feature importance represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
+<p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed is based on Brenning (2012) and consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each k partition. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
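 
 <p>A minimal sketch of the permutation-based feature selection described above (map names hypothetical):</p>
 
 <div class="code"><pre>
 # sketch: estimate permutation-based feature importances (-f) as part
 # of a 5-fold cross-validation
 r.learn.ml -f group=lsat7 trainingmap=training_pixels output=landclass \
   classifier=LogisticRegression cv=5
 </pre></div>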
 
 <p>Cross-validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or the classifier is for regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, in which case cross-validation will effectively be performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient, then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
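 
 <p>A sketch of spatial cross-validation using the options described above (map names hypothetical; <em>randst</em> is included so that the kmeans partitioning is reproducible, as noted further below):</p>
 
 <div class="code"><pre>
 # sketch: 5-fold spatial cross-validation, partitioning the training
 # pixels into 5 groups by kmeans clustering of the pixel coordinates
 r.learn.ml group=lsat7 trainingmap=training_pixels output=landclass \
   classifier=LogisticRegression cv=5 cvtype=kmeans n_partitions=5 randst=1
 </pre></div>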
 
@@ -54,7 +54,7 @@
 
 <p><em>r.learn.ml</em> uses the "scikit-learn" machine learning Python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is to use the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
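 
 <p>For MS-Windows with a 64-bit GRASS, the steps described above amount to the following sketch (the .whl filenames are placeholders for the actual files downloaded from the Gohlke site):</p>
 
 <div class="code"><pre>
 # sketch: install pip via setuptools, then install the downloaded wheels
 easy_install pip
 # replace the filenames below with the actual downloaded .whl files
 pip install numpy_mkl_package.whl
 pip install scikit_learn_package.whl
 </pre></div>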
 
-<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>Lines=25</em> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
+<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>lines=25</em> should be reasonable for most systems with 4-8 GB of RAM. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read one predictor into memory at a time. If this still exceeds the system memory, then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
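 
 <p>A sketch of the memory-related options described above (map names hypothetical):</p>
 
 <div class="code"><pre>
 # sketch: pass 25 rows at a time to the classifier, and write the
 # predictors to numpy memmap files (-l) to minimise RAM usage
 r.learn.ml -l group=lsat7 trainingmap=training_pixels output=landclass \
   classifier=LogisticRegression lines=25
 </pre></div>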
 
 <p>Many of the classifiers involve a random process, which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>randst</em> parameter.</p>
 
@@ -97,6 +97,12 @@
 
 <p>Thanks to Paulo van Breugel and Vaclav Petras for general testing, and to Paulo for the suggestion to enable saving of the fitted models.</p>
 
+<h2>REFERENCES</h2>
+
+<p>Brenning, A. 2012. Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: the R package 'sperrorest'. 2012 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 23-27 July 2012, p. 5372-5375.</p>
+
+<p>Pedregosa, F., et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825-2830.</p>
+
 <h2>AUTHOR</h2>
 
 Steven Pawley


