[GRASS-SVN] r70074 - grass-addons/grass7/raster/r.randomforest
svn_grass at osgeo.org
Tue Dec 13 21:30:03 PST 2016
Author: spawley
Date: 2016-12-13 21:30:02 -0800 (Tue, 13 Dec 2016)
New Revision: 70074
Modified:
grass-addons/grass7/raster/r.randomforest/r.randomforest.html
Log:
Update to r.randomforest manual
Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html 2016-12-12 21:04:41 UTC (rev 70073)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html 2016-12-14 05:30:02 UTC (rev 70074)
@@ -1,62 +1,49 @@
<h2>DESCRIPTION</h2>
-<em><b>r.randomforest</b></em> represents a front-end to the scikit learn python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using random forests and several other classifiers that are commonly used in remote sensing and spatial modelling. The choice of classifier is set using the <i>model</i> parameter. The following classification and regression methods are available. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit learn documentation.</a>
+<p><em><b>r.randomforest</b></em> is a front-end to the scikit-learn Python package for performing classification and regression on GRASS rasters that form part of an imagery group. The module enables classification and regression using random forests and several other classifiers that are commonly used in remote sensing and spatial modelling. The choice of classifier is set using the <i>model</i> parameter, and the following classification and regression methods are available. For more details on the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>.</p>
-<p>
-<em><b>LogisticRegression</b></em> represents a linear model for classification rather than regression. Logistic regression is a modification of linear regression but using the logistic distribution, which enables the use of a categorical response variable. If the response raster (roi) is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme. <em><b>LinearDiscriminantAnalysis</b></em> and <em><b>QuadraticDiscriminantAnalysis</b></em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification. Linear discriminant analysis can only separate groups using a linear decision boundary, while quadratic discriminant analysis can learn quadratic boundaries and therefore is more flexible.<em><b>GaussianNB</b></em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters. The Naive Bayes classifier is very fast and can be applied to high dimensional data because each predictor is assessed independently. However, the assumption of independence between predictors may not be appropriate for many datasets. The <em><b>DecisionTreeClassifier</b></em> and <em><b>DecisionTreeRegressor</b></em> models represent non-parametric supervised learning methods used for classification and regression. Decision tree classifiers map observations to a response variable using a hierarchy of splits and branches. The terminus of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric and can model non-linear relationships between a response and predictor variables, and are insensitive the scaling of the predictors. Furthermore, the resulting models represent an intuitive structure where relationships between the response and predictors are easily visualized. The <em><b>RandomForestsClassifier</b></em> and <em><b>RandomForestsRegressor</b></em> (Breiman, 2001) models represent ensemble classification and regression tree methods. A disadvantage of single decision trees is that they tend to overfit the model and therefore are weak predictors. Random forests overcome some of these disadvantages by constructing an ensemble of uncorrelated decision trees. The trees are forced to be uncorrelated because only a random subset of predictor variables (represented by the rasters in the imagery group) are available during each node split in the tree. Each tree produces a prediction probability and the final classification result is obtained by averaging of the prediction probabilities across all of the trees. The <em><b>GradientBoostingClassifier</b></em> and <em><b>GradientBoostingRegressor</b></em> also represent ensemble tree-based models. However, in a boosted model the learning processes is additive in a forward step-wise fashion, where <i>n_estimators</i> are fit during each model step and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. The <em><b>SVC</b></em> model is C-Support Vector Classification. Only a linear kernel is supported because non-linear kernels using scikit learn for typical remote sensing and spatial analysis datasets which consist of large numbers of samples are too slow to be practical.
+<p><em><b>LogisticRegression</b></em> represents a linear model for classification rather than regression. Logistic regression is a modification of linear regression that uses the logistic function, which enables the use of a categorical response variable. If the response raster (roi) is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme. <em><b>LinearDiscriminantAnalysis</b></em> and <em><b>QuadraticDiscriminantAnalysis</b></em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification. Linear discriminant analysis can only separate groups using a linear decision boundary, while quadratic discriminant analysis can learn quadratic boundaries and is therefore more flexible. <em><b>GaussianNB</b></em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes' theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters. The Naive Bayes classifier is very fast and can be applied to high-dimensional data because each predictor is assessed independently. However, the assumption of independence between predictors may not be appropriate for many datasets. The <em><b>DecisionTreeClassifier</b></em> and <em><b>DecisionTreeRegressor</b></em> models represent non-parametric supervised learning methods used for classification and regression. Decision tree classifiers map observations to a response variable using a hierarchy of splits and branches. The termini of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric, can model non-linear relationships between a response and predictor variables, and are insensitive to the scaling of the predictors. Furthermore, the resulting models represent an intuitive structure where relationships between the response and predictors are easily visualized. The <em><b>RandomForestClassifier</b></em> and <em><b>RandomForestRegressor</b></em> models represent ensemble classification and regression tree methods. A disadvantage of single decision trees is that they tend to overfit the model and therefore are weak predictors. Random forests overcome some of these disadvantages by constructing an ensemble of uncorrelated decision trees. The trees are forced to be uncorrelated because only a random subset of predictor variables (represented by the rasters in the imagery group) is available during each node split in the tree. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees. The <em><b>GradientBoostingClassifier</b></em> and <em><b>GradientBoostingRegressor</b></em> also represent ensemble tree-based models. However, in a boosted model the learning process is additive in a forward step-wise fashion, where <i>n_estimators</i> are fit during each model step and each step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. The <em><b>SVC</b></em> model is C-Support Vector Classification. Only a linear kernel is supported because, for the large sample sizes typical of remote sensing and spatial analysis datasets, non-linear kernels in scikit-learn are too slow to be practical.</p>
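+<p>As an illustrative sketch (the imagery group and training raster names are taken from the EXAMPLE section below), a classifier that takes no parameters, such as LinearDiscriminantAnalysis, could be selected through the <i>model</i> parameter alone:</p>
+<div class="code"><pre>
+# sketch only; igroup and roi are the example data sets used in the EXAMPLE section
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=lda_classification \
+    model=LinearDiscriminantAnalysis
+</pre></div>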
-<p>
-The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. <i>C</i> is the inverse of the regularization strength, which is when a penalty is applied to avoid overfitting. <i>C</i> applies to the LogisticRegression and SVC models. Most of the other parameters apply to the tree and ensemble-tree based classifiers. <i>n_estimators</i> represents the number of trees in Random Forest model, and the number of trees used in each model step during Gradient Boosting. <i>max_features</i> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. <i>min_samples_split</i> and <i>min_samples_leaf</i> control the number of samples required to split a node, or form a leaf node, respectively. The <i>learning_rate</i> and <i>subsample</i> parameters apply only to Gradient Boosting. <i>learning_rate</i> shrinks the contribution of each tree, and <i>subsample</i> is the fraction of randomly selected samples for each tree, and values of < 1 reduce the model variance resulting in Stochastic Gradient Boosting.
+<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. <i>C</i> is the inverse of the regularization strength; regularization applies a penalty to avoid overfitting. <i>C</i> applies to the LogisticRegression and SVC models. Most of the other parameters apply to the tree and ensemble-tree based classifiers. <i>n_estimators</i> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. <i>max_features</i> controls the number of variables that are available to choose from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods. <i>min_samples_split</i> and <i>min_samples_leaf</i> control the number of samples required to split a node or to form a leaf node, respectively. The <i>learning_rate</i> and <i>subsample</i> parameters apply only to Gradient Boosting. <i>learning_rate</i> shrinks the contribution of each tree, and <i>subsample</i> is the fraction of randomly selected samples used to fit each tree; values &lt; 1 reduce the model variance, resulting in Stochastic Gradient Boosting.</p>
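+<p>For illustration, a Stochastic Gradient Boosting run might combine the parameters described above as follows (a sketch only; the values shown are illustrative rather than recommendations):</p>
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=gb_classification \
+    model=GradientBoostingClassifier n_estimators=100 learning_rate=0.1 \
+    subsample=0.5 max_features=-1 min_samples_split=2 min_samples_leaf=1
+</pre></div>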
-<p>
-In addition to model fitting and prediction, <em><b>r.randomforest</b></em> can be used for feature selection using the <i>f</i> flag. The tree-based classifiers include an intrisic measure of variable importance based on the relative rank (depth) of a feature used as a decision node in a tree. For other classifiers, univariate feature selection is used to provide feature importance scores.
+<p>In addition to model fitting and prediction, feature selection can be performed using the <i>f</i> flag. The tree-based classifiers include an intrinsic measure of variable importance based on the relative rank (depth) of a feature used as a decision node in a tree. For other classifiers, univariate feature selection is employed.</p>
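+<p>A minimal sketch of a run with feature selection enabled via the <i>f</i> flag:</p>
+<div class="code"><pre>
+r.randomforest -f igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+    model=RandomForestClassifier n_estimators=500
+</pre></div>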
-<p>
-Cross validation can be performed by setting the <i>cv</i> parameters to > 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <i>cvtype</i> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated.
+<p>Cross-validation can be performed by setting the <i>cv</i> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <i>cvtype</i> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, in which case cross-validation is effectively performed on a polygon basis. K-means spatial cross-validation partitions the training pixels into groups by k-means clustering of the pixel coordinates. These partitions are then used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated.</p>
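+<p>For example, spatial cross-validation based on k-means clustering of the pixel coordinates might be requested as follows (a sketch assuming the <i>cv</i> and <i>cvtype</i> parameters described above):</p>
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+    model=RandomForestClassifier n_estimators=500 cv=5 cvtype=kmeans
+</pre></div>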
-<p>
-Most machine learning algorithms do not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by predicting the majority class with a very high accuracy, but at the expense of the minority class. If you have a highly imbalanced dataset, the 'balanced' <i>b</i> flag can be set. The scikit-learn implementation balanced mode then automatically adjust weights inversely proportional to class frequencies. This only applies to the LogisticRegression, DecisionTree, RandomForest, and GradientBoostingClassifiers.
+<p>Most machine learning algorithms do not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but it will do so by predicting the majority class with very high accuracy at the expense of the minority class. If you have a highly imbalanced dataset, the 'balanced' <i>b</i> flag can be set. In balanced mode, the scikit-learn implementation automatically adjusts class weights to be inversely proportional to the class frequencies. This only applies to the LogisticRegression, DecisionTree, RandomForest, and GradientBoosting classifiers.</p>
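+<p>A minimal sketch of enabling balanced class weights with the <i>b</i> flag:</p>
+<div class="code"><pre>
+r.randomforest -b igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+    model=RandomForestClassifier n_estimators=500
+</pre></div>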
-<p>
-Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as <b>LogisticRegression</b> and <b>SVC</b> may not perform optimally if some predictors have variances that are orders of magnitude larger than others, and will therefore dominate the objective function. The <i>s</i> flag can be used to add a standardization preprocessing step to the classification and prediction, which will standardize each predictor relative to its standard deviation.
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as <b>LogisticRegression</b> and <b>SVC</b> may not perform optimally if some predictors have variances that are orders of magnitude larger than others, as these predictors will then dominate the objective function. The <i>s</i> flag can be used to add a standardization preprocessing step to the classification and prediction, which will standardize each predictor relative to its standard deviation.</p>
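+<p>A minimal sketch of adding the standardization preprocessing step (the <i>s</i> flag) to a LogisticRegression run:</p>
+<div class="code"><pre>
+r.randomforest -s igroup=lsat7_2000 roi=landclass96_roi output=lr_classification \
+    model=LogisticRegression
+</pre></div>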
-<p>
-The module also offers the ability to save and load a classification or regression model. The model is saved as a list of filenames (starting with the extension .pkl which is added automatically) for each numpy array. This list can involve a large number of files, so it makes sense to save each model in a separate directory. To load the model, you need to select the .pkl file that was saved. Saving and loading a model represents a useful feature because it allows a model to be built on one imagery group (ie. set of predictor variables), and then the prediction can be performed on other imagery groups. This approach is commonly employed in species prediction modelling, or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. which include present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.
+<p>The module also offers the ability to save and load a classification or regression model. The model is saved as a .pkl file. To load the model, you need to select the .pkl file that was saved. Saving and loading a model is a useful feature because it allows a model to be built on one imagery group (i.e. one set of predictor variables), with the prediction then performed on other imagery groups. This approach is commonly employed in species prediction modelling or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and predictions are then performed on other imagery groups containing forecasted climatic variables.</p>
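+<p>A hypothetical sketch of saving a fitted model to a .pkl file and re-using it for prediction; the <i>save_model</i> and <i>load_model</i> parameter names below are placeholders used for illustration only and should be checked against the module's actual parameter list:</p>
+<div class="code"><pre>
+# 'save_model' and 'load_model' are hypothetical parameter names, for illustration only
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+    model=RandomForestClassifier n_estimators=500 save_model=rf_model.pkl
+# the saved model could then be applied to another imagery group containing the same predictors
+r.randomforest igroup=lsat7_2000 output=rf_prediction load_model=rf_model.pkl
+</pre></div>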
-<p>
-For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <i>save_training</i> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.
+<p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <i>save_training</i> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
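+<p>A sketch of saving the extracted training data to a csv file with the <i>save_training</i> option, so that later runs can re-use it without re-querying the predictors:</p>
+<div class="code"><pre>
+r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+    model=RandomForestClassifier n_estimators=500 save_training=training.csv
+</pre></div>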
<h2>NOTES</h2>
-<em><b>r.randomforest</b></em> uses the "scikit-learn" machine learning python package. This python package needs to be installed within your GRASS GIS Python environment for <em><b>r.randomforest</b></em> to work. It also needs the pandas python package. For Linux users, these packages should be available through the linux package manager in most distributions (named for example "python-scikit-learn"). For MS-Windows users using a 64 bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32 bit GRASS, scikit-learn is available in the OSGeo4W installer.
+<p><em><b>r.randomforest</b></em> uses the "scikit-learn" machine learning Python package along with pandas. These Python packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users with a 64-bit GRASS, the easiest way of installing the packages is to use the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> together with the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
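+<p>On a 64-bit MS-Windows installation, the steps described above might look like the following sketch (the .whl filenames are illustrative only and depend on the files downloaded for your Python version):</p>
+<div class="code"><pre>
+easy_install pip
+# filenames below are examples; use the NumPy-1.10+MKL, scikit-learn and pandas
+# .whl files downloaded from the link above
+pip install numpy-1.10.4+mkl-cp27-none-win_amd64.whl
+pip install scikit_learn-0.17.1-cp27-none-win_amd64.whl
+pip install pandas-0.18.1-cp27-cp27m-win_amd64.whl
+</pre></div>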
-<p>
-<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <i>l</i> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.
+<p><em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This, however, does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of RAM. Row-by-row access, however, results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read each predictor into memory one at a time. If this still exceeds the system memory then the <i>l</i> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
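+<p>A minimal sketch of a low-memory run, passing 25 rows at a time to the classifier and caching each predictor in a numpy memmap file via the <i>l</i> flag:</p>
+<div class="code"><pre>
+r.randomforest -l igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
+    model=RandomForestClassifier n_estimators=500 lines=25
+</pre></div>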
-<p>
-Many of the classifiers involve a random process which can causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.
+<p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.</p>
<h2>EXAMPLE</h2>
-Here we are going to use the GRASS GIS sample North Carolina data set as a basis to perform a landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.
+<p>Here we are going to use the GRASS GIS sample North Carolina data set as a basis to perform a Landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.</p>
-<p>
-Landsat 7 (2000) bands 7,4,2 color composite example:
+<p>Landsat 7 (2000) bands 7,4,2 color composite example:</p>
<center>
<img src="lsat7_2000_b742.png" alt="Landsat 7 (2000) bands 7,4,2 color composite example">
</center>
-Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.
+<p>Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.</p>
-<p>
-First, we are going to generate some training pixels from an older (1996) land cover classification:
+<p>First, we are going to generate some training pixels from an older (1996) land cover classification:</p>
<div class="code"><pre>
g.region raster=landclass96 -p
r.random input=landclass96 npoints=1000 raster=landclass96_roi
</pre></div>
-<p>
-Then we can use these training pixels to perform a classification on the more recently obtained landsat 7 image:
+<p>Then we can use these training pixels to perform a classification on the more recently obtained Landsat 7 image:</p>
<div class="code"><pre>
r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
model=RandomForestClassifier n_estimators=500 max_features=-1 min_samples_split=2 randst=1 lines=25
@@ -69,22 +56,17 @@
r.category rf_classification
</pre></div>
-<p>
-Random forest classification result:
+<p>Random forest classification result:</p>
<center>
<img src="rfclassification.png" alt="Random forest classification result">
</center>
-<h2>REFERENCES</h2>
-
-Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
-
<h2>ACKNOWLEDGEMENTS</h2>
-Thanks for Paulo van Breugel for general testing, and particularly the suggestion to enable random forest prediction of a different set of predictor variables.
+<p>Thanks to Paulo van Breugel for general testing, and particularly for the suggestion to enable random forest prediction on a different set of predictor variables.</p>
<h2>AUTHOR</h2>
Steven Pawley
-<p><i>Last changed: $Date$</i>
+<p><i>Last changed: $Date$</i></p>