[GRASS-SVN] r70217 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org
Mon Jan 2 15:05:45 PST 2017


Author: spawley
Date: 2017-01-02 15:05:45 -0800 (Mon, 02 Jan 2017)
New Revision: 70217

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
Log:
'fix manual page for r.randomforest'

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2017-01-02 22:42:57 UTC (rev 70216)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2017-01-02 23:05:45 UTC (rev 70217)
@@ -1,18 +1,32 @@
 <h2>DESCRIPTION</h2>
 
-<p><em><b>r.randomforest</b></em> represents a front-end to the scikit learn python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using random forests and several other classifiers that are commonly used in remote sensing and spatial modelling. The choice of classifier is set using the <i>model</i> parameter. The following classification and regression methods are available. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit learn documentation.</a></p>
+<p><em>r.randomforest</em> is a front-end to the scikit-learn Python package for performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using random forests and several other classifiers that are commonly used in remote sensing and spatial modelling. The choice of classifier is set using the <i>model</i> parameter. For more details on the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>. The following classification and regression methods are available (a sketch of how these names map onto scikit-learn estimators follows the list):</p>
 
-<p><em><b>LogisticRegression</b></em> represents a linear model for classification rather than regression. Logistic regression is a modification of linear regression but using the logistic distribution, which enables the use of a categorical response variable. If the response raster (roi) is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme. <em><b>LinearDiscriminantAnalysis</b></em> and <em><b>QuadraticDiscriminantAnalysis</b></em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification. Linear discriminant analysis can only separate groups using a linear decision boundary, while quadratic discriminant analysis can learn quadratic boundaries and therefore is more flexible.<em><b>GaussianNB</b></em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters. The Naive Bayes classifier is very fast and can be applied to high dimensional data because each predictor is assessed independently. However, the assumption of independence between predictors may not be appropriate for many datasets. The <em><b>DecisionTreeClassifier</b></em> and <em><b>DecisionTreeRegressor</b></em> models represent non-parametric supervised learning methods used for classification and regression. Decision tree classifiers map observations to a response variable using a hierarchy of splits and branches. The terminus of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric and can model non-linear relationships between a response and predictor variables, and are insensitive the scaling of the predictors. Furthermore, the resulting models represent an intuitive structure where relationships between the response and predictors are easily visualized. The <em><b>RandomForestsClassifier</b></em> and <em><b>RandomForestsRegressor</b></em> models represent ensemble classification and regression tree methods. A disadvantage of single decision trees is that they tend to overfit the model and therefore are weak predictors. Random forests overcome some of these disadvantages by constructing an ensemble of uncorrelated decision trees. The trees are forced to be uncorrelated because only a random subset of predictor variables (represented by the rasters in the imagery group) are available during each node split in the tree. Each tree produces a prediction probability and the final classification result is obtained by averaging of the prediction probabilities across all of the trees. The <em><b>GradientBoostingClassifier</b></em> and <em><b>GradientBoostingRegressor</b></em> also represent ensemble tree-based models. However, in a boosted model the learning processes is additive in a forward step-wise fashion, where <i>n_estimators</i> are fit during each model step and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. The <em><b>SVC</b></em> model is C-Support Vector Classification. Only a linear kernel is supported because non-linear kernels using scikit learn for typical remote sensing and spatial analysis datasets which consist of large numbers of samples are too slow to be practical.</p>
+<ol>
+	<li><em>LogisticRegression</em> is a linear model for classification rather than regression. Logistic regression modifies linear regression by using the logistic function, which enables a categorical response variable. If the response raster (<i>roi</i>) is coded to 0 and 1, a binary classification is performed, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
+	
+	<li><em>LinearDiscriminantAnalysis</em> and <em>QuadraticDiscriminantAnalysis</em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification. Linear discriminant analysis can only separate groups using a linear decision boundary, while quadratic discriminant analysis can learn quadratic boundaries and therefore is more flexible.</li>
+	
+	<li><em>GaussianNB</em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes' theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters. The Naive Bayes classifier is very fast and can be applied to high-dimensional data because each predictor is assessed independently. However, the assumption of independence between predictors may not be appropriate for many datasets.</li>
+	
+	<li>The <em>DecisionTreeClassifier</em> and <em>DecisionTreeRegressor</em> models represent non-parametric supervised learning methods used for classification and regression. Decision tree classifiers map observations to a response variable using a hierarchy of splits and branches. The termini of these branches, termed leaves, represent the predictions of the response variable. Decision trees are non-parametric, can model non-linear relationships between the response and predictor variables, and are insensitive to the scaling of the predictors. Furthermore, the resulting models represent an intuitive structure in which relationships between the response and predictors are easily visualized.</li>
+	
+	<li>The <em>RandomForestsClassifier</em> and <em>RandomForestsRegressor</em> models represent ensemble classification and regression tree methods. A disadvantage of single decision trees is that they tend to overfit the model and therefore are weak predictors. Random forests overcome some of these disadvantages by constructing an ensemble of uncorrelated decision trees. The trees are forced to be uncorrelated because only a random subset of the predictor variables (represented by the rasters in the imagery group) is available at each node split in the tree. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees.</li>
+	
+	<li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based models. However, in a boosted model the learning process is additive in a forward step-wise fashion, where <i>n_estimators</i> trees are fitted in successive model steps, and each step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals.</li>
+	
+	<li>The <em>SVC</em> model is C-Support Vector Classification. Only a linear kernel is supported because, for the large sample sizes typical of remote sensing and spatial analysis datasets, non-linear kernels in scikit-learn are too slow to be practical.</li>
+</ol>
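+
+<p>As a rough sketch of how these <i>model</i> names map onto scikit-learn, the
+following Python snippet instantiates the corresponding estimators (classification
+variants only; parameter values are illustrative, not the module defaults, and the
+module's 'RandomForests*' options correspond to scikit-learn's RandomForest* classes):</p>
+
+<div class="code"><pre>
+from sklearn.linear_model import LogisticRegression
+from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
+                                           QuadraticDiscriminantAnalysis)
+from sklearn.naive_bayes import GaussianNB
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
+from sklearn.svm import SVC
+
+models = {
+    'LogisticRegression': LogisticRegression(C=1.0),
+    'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
+    'QuadraticDiscriminantAnalysis': QuadraticDiscriminantAnalysis(),
+    'GaussianNB': GaussianNB(),
+    'DecisionTreeClassifier': DecisionTreeClassifier(),
+    'RandomForestsClassifier': RandomForestClassifier(n_estimators=500),
+    'GradientBoostingClassifier': GradientBoostingClassifier(),
+    'SVC': SVC(kernel='linear', C=1.0),  # linear kernel only, as noted above
+}
+</pre></div>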
 
-<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. <i>C</i> is the inverse of the regularization strength, which is when a penalty is applied to avoid overfitting. <i>C</i> applies to the LogisticRegression and SVC models. Most of the other parameters apply to the tree and ensemble-tree based classifiers. <i>n_estimators</i> represents the number of trees in Random Forest model, and the number of trees used in each model step during Gradient Boosting. <i>max_features</i> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. <i>min_samples_split</i> and <i>min_samples_leaf</i> control the number of samples required to split a node, or form a leaf node, respectively. The <i>learning_rate</i> and <i>subsample</i> parameters apply only to Gradient Boosting. <i>learning_rate</i> shrinks the contribution of each tree, and <i>subsample</i> is the fraction of randomly selected samples for each tree, and values of &lt 1 reduce the model variance resulting in Stochastic Gradient Boosting.</p>
+<p>The Classifier parameters tab provides access to the most pertinent parameters of the algorithms described above. <i>C</i> is the inverse of the regularization strength, i.e. of the penalty applied to avoid overfitting; it applies to the LogisticRegression and SVC models. Most of the other parameters apply to the tree-based and ensemble-tree-based classifiers. <i>n_estimators</i> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. <i>max_features</i> controls the number of variables that can be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. <i>min_samples_split</i> and <i>min_samples_leaf</i> control the number of samples required to split a node or to form a leaf node, respectively. The <i>learning_rate</i> and <i>subsample</i> parameters apply only to Gradient Boosting: <i>learning_rate</i> shrinks the contribution of each tree, and <i>subsample</i> is the fraction of randomly selected samples used to fit each tree.</p>
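+
+<p>A minimal sketch of how these parameters appear in the underlying scikit-learn
+estimators (the values shown are examples, not the module defaults):</p>
+
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
+
+rf = RandomForestClassifier(
+    n_estimators=500,      # number of trees in the forest
+    max_features='sqrt',   # predictors considered at each node split
+    min_samples_split=2,   # samples required to split an internal node
+    min_samples_leaf=1)    # samples required to form a leaf node
+
+gb = GradientBoostingClassifier(
+    n_estimators=100,      # number of boosting stages (trees)
+    learning_rate=0.1,     # shrinks the contribution of each tree
+    subsample=0.75)        # fractions below 1.0 give Stochastic Gradient Boosting
+</pre></div>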
 
 <p>In addition to model fitting and prediction, feature selection can be performed using the <i>f</i> flag. The tree-based classifiers include an intrinsic measure of variable importance based on the relative rank (depth) of a feature used as a decision node in a tree. For other classifiers, univariate feature selection is employed.</p>
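+
+<p>The two kinds of feature ranking can be illustrated with a small, self-contained
+scikit-learn sketch (synthetic data stands in for the sampled pixels):</p>
+
+<div class="code"><pre>
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.feature_selection import SelectKBest, f_classif
+
+X, y = make_classification(n_samples=200, n_features=6, random_state=0)
+
+# tree-based classifiers: intrinsic importances of the fitted ensemble
+rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
+print(rf.feature_importances_)
+
+# other classifiers: univariate scoring of each predictor against the response
+scores = SelectKBest(f_classif, k='all').fit(X, y).scores_
+print(scores)
+</pre></div>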
 
-<p>Cross validation can be performed by setting the <i>cv</i> parameters to > 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <i>cvtype</i> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated.</p>
+<p>Cross-validation can be performed by setting the <i>cv</i> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced. Note that this cross-validation is performed on a per-pixel basis. If there is strong autocorrelation between pixels (i.e. the pixels represent polygons), the training/test splits will not represent independent samples and the accuracy will be overestimated. In this case, the <i>cvtype</i> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons; cross-validation is then effectively performed on a polygon basis. Kmeans spatial cross-validation partitions the training pixels into groups by k-means clustering of the pixel coordinates. These partitions are then used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated.</p>
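+
+<p>The idea behind the non-spatial and 'kmeans' schemes can be sketched in
+scikit-learn as follows (synthetic arrays stand in for the sampled predictors,
+response and pixel coordinates):</p>
+
+<div class="code"><pre>
+import numpy as np
+from sklearn.datasets import make_classification
+from sklearn.cluster import KMeans
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score
+
+X, y = make_classification(n_samples=300, n_features=5, random_state=0)
+coords = np.random.RandomState(0).rand(300, 2)   # pixel x, y coordinates
+clf = RandomForestClassifier(n_estimators=100, random_state=0)
+
+# non-spatial: stratified k-folds on a per-pixel basis
+print(cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=3)))
+
+# 'kmeans': cluster the coordinates, keep whole clusters together in splits
+groups = KMeans(n_clusters=10, random_state=0).fit_predict(coords)
+print(cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=3)))
+</pre></div>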
 
 <p>Most machine learning algorithms do not perform well in the case of a large class imbalance. The classifier will then seek to reduce the overall model error, but it will do so by predicting the majority class with very high accuracy at the expense of the minority class. If you have a highly imbalanced dataset, the 'balanced' <i>b</i> flag can be set. The scikit-learn implementation of balanced mode then automatically adjusts weights inversely proportional to class frequencies. This only applies to the LogisticRegression, DecisionTree, RandomForest, and GradientBoosting classifiers.</p>
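+
+<p>In scikit-learn terms, the <i>b</i> flag corresponds to the 'balanced'
+class-weight mode, e.g. (a sketch, not the module's exact internals):</p>
+
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier
+
+# weights become inversely proportional to the class frequencies,
+# i.e. n_samples / (n_classes * class_counts)
+clf = RandomForestClassifier(n_estimators=100, class_weight='balanced')
+</pre></div>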
 
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as <b>LogisticRegression</b> and <b>SVC</b> may not perform optimally if some predictors have variances that are orders of magnitude larger than others, and will therefore dominate the objective function. The <i>s</i> flag can be used to add a standardization preprocessing step to the classification and prediction, which will standardize each predictor relative to its standard deviation.</p>
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as LogisticRegression and SVC may not perform optimally if some predictors have variances that are orders of magnitude larger than others; such predictors will then dominate the objective function. The <i>s</i> flag can be used to add a standardization preprocessing step to the classification and prediction, which standardizes each predictor relative to its standard deviation.</p>
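+
+<p>A sketch of such a preprocessing step in scikit-learn, where the scaling
+learned from the training samples is reapplied at prediction time:</p>
+
+<div class="code"><pre>
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.svm import SVC
+
+# StandardScaler centres each predictor and divides by its standard deviation
+clf = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
+</pre></div>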
 
 <p>The module also offers the ability to save and load a classification or regression model. The model is saved as a .pkl file. To load the model, select the .pkl file that was saved previously. Saving and loading a model is a useful feature because it allows a model to be built on one imagery group (i.e. one set of predictor variables), while the prediction is performed on other imagery groups. This approach is commonly employed in species distribution modelling or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. including present-day climatic variables) and predictions are then performed on imagery groups containing forecasted climatic variables.</p>
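+
+<p>In scikit-learn terms, saving and loading reduces to pickling the fitted
+estimator; a self-contained sketch (the file name is illustrative):</p>
+
+<div class="code"><pre>
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.externals import joblib   # plain 'import joblib' on newer stacks
+
+X, y = make_classification(random_state=0)
+clf = RandomForestClassifier(n_estimators=100).fit(X, y)
+
+joblib.dump(clf, 'model.pkl')      # save the fitted model
+clf2 = joblib.load('model.pkl')    # reload later, e.g. for another imagery group
+</pre></div>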
 
@@ -20,9 +34,9 @@
 
 <h2>NOTES</h2>
 
-<p><em><b>r.randomforest</b></em> uses the "scikit-learn" machine learning python package along with pandas. This python package needs to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the linux package manager. For MS-Windows users using a 64 bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32 bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
+<p><em>r.randomforest</em> uses the "scikit-learn" machine learning Python package along with pandas. These packages need to be installed within your GRASS GIS Python environment. For Linux users, they should be available through the Linux package manager. For MS-Windows users running a 64-bit GRASS, the easiest way of installing the packages is to use the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> together with the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager, download the NumPy-1.10+MKL and scikit-learn .whl files, and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
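+
+<p>A quick way to confirm that the packages are importable from the GRASS
+Python environment (run from the GRASS Python console):</p>
+
+<div class="code"><pre>
+import numpy, pandas, sklearn
+print(numpy.__version__, pandas.__version__, sklearn.__version__)
+</pre></div>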
 
-<p><em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <i>l</i> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
+<p><em>r.randomforest</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from disk row-by-row using the RasterRow method in PyGRASS. This, however, is not an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to disk. <i>lines=25</i> should be reasonable for most systems with 4-8 GB of RAM. The row-by-row access, however, results in slow performance when sampling the imagery group to build the training dataset. Instead, the default behaviour is to read each predictor into memory one at a time. If this still exceeds the system memory, the <i>l</i> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
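+
+<p>A sketch of the numpy memmap idea behind the <i>l</i> flag: a predictor too
+large for RAM is staged to a disk-backed array and indexed like an ordinary
+ndarray (dimensions and file name are illustrative):</p>
+
+<div class="code"><pre>
+import numpy as np
+
+rows, cols = 10000, 10000
+predictor = np.memmap('predictor.dat', dtype='float32',
+                      mode='w+', shape=(rows, cols))
+predictor[0:25, :] = 1.0   # access proceeds in blocks of rows, cf. lines=25
+predictor.flush()          # write pending changes to disk
+</pre></div>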
 
 <p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.</p>
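+
+<p>In scikit-learn this corresponds to fixing the estimator's random_state,
+e.g. (a sketch):</p>
+
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier
+
+# a fixed seed makes the stochastic parts of the fit reproducible
+clf = RandomForestClassifier(n_estimators=100, random_state=1)
+</pre></div>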
 
@@ -69,4 +83,4 @@
 
 Steven Pawley
 
-<p><i>Last changed: $Date$</i></p>
+<p><i>Last changed: $Date$</i></p>
\ No newline at end of file


