[GRASS-SVN] r68286 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org
Tue Apr 19 09:56:52 PDT 2016


Author: spawley
Date: 2016-04-19 09:56:52 -0700 (Tue, 19 Apr 2016)
New Revision: 68286

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
Log:
r.randomforest minor update

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-04-19 16:50:42 UTC (rev 68285)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-04-19 16:56:52 UTC (rev 68286)
@@ -1,6 +1,6 @@
 <h2>DESCRIPTION</h2>
 
-<em><b>r.randomforest</b></em> performs Random forests classification and regression on a GRASS imagery group. Random forests (Breiman, 2001) represents an ensemble classification tree method which constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a classification and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership (<i>class_probabilities</i> flag) is based on the proportion of votes for each class.
+<em><b>r.randomforest</b></em> performs random forests classification and regression on a GRASS imagery group. Random forests (Breiman, 2001) is an ensemble decision tree method that constructs a forest of uncorrelated trees, using a random subset of predictor variables selected independently at every node split in each tree. In the scikit-learn implementation used by this module, each tree produces a prediction probability, and the final classification is obtained by averaging the prediction probabilities across all of the trees. This differs from the original Breiman (2001) reference, in which each tree produces a classification and the forest chooses the result with the most votes over all of the trees (i.e. probability averaging versus majority voting). The probability of membership (<i>class_probabilities</i> flag) is based on the averaged prediction probabilities for each class.
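
As an illustration of the averaging described above, the following minimal scikit-learn sketch (toy data, not part of r.randomforest) shows that the forest's predicted probabilities equal the mean of the per-tree probabilities, in contrast to Breiman-style majority voting:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy data standing in for the imagery group (hypothetical example)
    X, y = make_classification(n_samples=200, n_features=8, random_state=0)
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    # scikit-learn: average the per-tree probability estimates
    proba_avg = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
    assert np.allclose(proba_avg, rf.predict_proba(X))

    # Breiman (2001): each tree casts one vote and the majority wins
    votes = np.stack([t.predict(X) for t in rf.estimators_]).astype(int)
    majority_vote = (votes.mean(axis=0) > 0.5).astype(int)  # binary case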
 
 <br><br>Random forests offers a number of advantages over traditional statistical classifiers: it is non-parametric, can capture non-linear relationships, accepts both continuous and categorical data, and requires no rescaling. Another practical advantage is that it involves few user-specified parameters, principally the number of trees in the forest (<i>ntrees</i>) and the number of variables that may be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. There is no accuracy penalty in having a large number of trees apart from increased computational time, although the performance of RF models typically levels off at a certain number of trees, beyond which a larger forest brings no further error reduction. For random forest classification, the default <i>ntrees</i> is 500 and the default <i>mfeatures</i> is equal to the square root of the number of predictors.
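
For reference, these defaults correspond to the following scikit-learn keyword arguments (an assumed mapping based on the option names above; the module's internal call may differ):

    from sklearn.ensemble import RandomForestClassifier

    # ntrees -> n_estimators; mfeatures -> max_features ('sqrt' of predictors)
    rf = RandomForestClassifier(n_estimators=500, max_features="sqrt")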
 
@@ -10,7 +10,7 @@
 
 <br><br>Random forests classification, like most machine learning methods, does not perform well in the case of a large class imbalance. Here, the classifier seeks to reduce the overall model error, but does so by predicting the majority class with very high accuracy at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the <i>balanced</i> flag can be set. In scikit-learn's balanced mode, class weights are then automatically adjusted inversely proportional to class frequencies.
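
In scikit-learn terms, the balanced mode corresponds to the class_weight keyword, where the weight of class c scales as n_samples / (n_classes * n_c) (a sketch of the presumed call):

    from sklearn.ensemble import RandomForestClassifier

    # 'balanced': weight each class inversely to its frequency in the data
    rf = RandomForestClassifier(n_estimators=500, class_weight="balanced")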
 
-<br><br>Random forest can also be run in regression mode by setting the <i>mode</i> to the regression option. In this case, averaging is used instead of majority voting to obtain the prediction, and the mean square error (mse) is used to measure the quality of each decision split in the tree, versus the gini impurity for classification. Additionally, the the default <i>mfeatures</i> is equal to the number of predictors, and the coefficient of determination R^2 of the prediction is outputted as the performance measure in regression mode. You also can increase the generalization ability of the classifier by increasing <i>minsplit</i>, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression. 
+<br><br>Random forest can also be run in regression mode by setting the <i>mode</i> option to regression. In this case, the mean squared error (MSE) is used to measure the quality of each decision split in the tree, versus the Gini impurity for classification. Additionally, the default <i>mfeatures</i> is equal to the number of predictors, and the coefficient of determination (R^2) of the prediction is output as the performance measure in regression mode. You can also increase the generalization ability of the classifier by increasing <i>minsplit</i>, which represents the minimum number of samples required to split a node. The balanced and class_probabilities options are ignored for regression.
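
A minimal sketch of the corresponding scikit-learn regressor, assuming <i>minsplit</i> maps to min_samples_split (variable names here are placeholders):

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor(
        n_estimators=500,
        max_features=None,    # regression default: use all predictors
        min_samples_split=2,  # raise this (minsplit) to generalize more
    )
    # After rf.fit(X_train, y_train), rf.score(X_test, y_test) reports R^2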
 
 <br><br>The module also offers the ability to save and load a random forests model. The model is saved as a list of files (beginning with a .pkl file, whose extension is added automatically), one for each numpy array. This list can comprise a large number of files, so it makes sense to save each model in a separate directory. To load the model, select the .pkl file that was saved. Saving and loading a model is a useful feature because it allows a model to be built on one imagery group (i.e. one set of predictor variables) and the prediction to be performed on other imagery groups. This approach is commonly employed in species distribution modelling or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and predictions are then performed on other imagery groups containing forecasted climatic variables. The names of the GRASS rasters in the imagery groups do not matter because scikit-learn saves the model as a series of numpy arrays. However, the new imagery group must contain the same number of rasters, in the same order as in the imagery group upon which the model was built. For example, the new imagery group may have a raster named 'mean_precipitation_2050' substituting for 'mean_precipitation_2016' in the imagery group that was used to build the model.
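
For illustration, scikit-learn models are typically persisted with joblib, which writes a main .pkl file and may write companion files for large numpy arrays (a sketch; the module's exact file layout may differ, and 2016-era scikit-learn shipped joblib as sklearn.externals.joblib):

    import os
    from joblib import dump, load

    os.makedirs("rf_model", exist_ok=True)  # one directory per model, as advised above
    dump(rf, "rf_model/model.pkl")          # companion array files may appear alongside
    rf_loaded = load("rf_model/model.pkl")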
 


