[GRASS-SVN] r68185 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org svn_grass at osgeo.org
Wed Mar 30 20:59:16 PDT 2016


Author: spawley
Date: 2016-03-30 20:59:16 -0700 (Wed, 30 Mar 2016)
New Revision: 68185

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
Log:
minor update to r.randomforest documentation

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-30 18:53:54 UTC (rev 68184)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-31 03:59:16 UTC (rev 68185)
@@ -2,7 +2,7 @@
 
 <em><b>r.randomforest</b></em> performs Random forests classification and regression on a GRASS imagery group. Random forests (Breiman, 2001) is an ensemble classification tree method which constructs a forest of uncorrelated decision trees, each using a random subset of predictor variables drawn independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership (<i>class_probabilities</i> flag) is based on the proportion of votes for each class.
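
As a minimal, purely illustrative sketch of the voting and class-probability idea, using scikit-learn directly (the library the module relies on) on synthetic data rather than the module's own code:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # synthetic, illustrative data only
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # hard classification: the class receiving the most votes across the forest
    print(clf.predict(X[:3]))
    # class membership probabilities, averaged over the trees
    print(np.round(clf.predict_proba(X[:3]), 2))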
 
-<br><br>Random forests offers a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of random forests is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (<i>ntrees</i>), and the number of variables that are allowed to be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees apart from increased computational time. However, the performance of RF models typically level off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest.
+<br><br>Random forests offers a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of random forests is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (<i>ntrees</i>), and the number of variables that are allowed to be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees apart from increased computational time. However, the performance of RF models typically levels off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest. For random forest classification, the default <i>ntrees</i> is 500 and the default setting of <i>mfeatures</i> is equal to the square root of the number of predictors.
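
Expressed against scikit-learn's estimator interface, these classification defaults correspond roughly to the following sketch (n_estimators and max_features are scikit-learn's parameter names; ntrees and mfeatures are the module's options):

    from sklearn.ensemble import RandomForestClassifier

    # classification defaults described above (sketch, not the module's code):
    # ntrees = 500 trees, mfeatures = sqrt(number of predictors)
    clf = RandomForestClassifier(n_estimators=500, max_features='sqrt')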
 
 <br><br>An additional feature of random forests is that it includes built-in accuracy assessment and variable selection. Random forests uses the concept of bagging, where each tree is constructed from a random bootstrap sample of the training data, and the remaining (approximately one-third) 'out-of-bag' (OOB) samples are used to evaluate the prediction accuracy. The OOB error has been shown to provide a reliable estimate of classification error. Note that this is applied on a pixel basis in the module, and the OOB error assumes that the training data are not spatially correlated. If your training data represent individual pixels that are separated by x distance, then this assumption may hold true. However, if your training data represent rasterized polygons, then this assumption will likely be false, and you should use an independent set of polygons to test the accuracy of the classification using <i>i.kappa</i>.
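
A rough illustration of the OOB estimate, assuming the scikit-learn backend; oob_score=True scores the forest on the samples left out of each tree's bootstrap sample:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # synthetic, spatially uncorrelated data, so the OOB assumption holds
    X, y = make_classification(n_samples=500, n_features=8, random_state=1)
    clf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                 random_state=1).fit(X, y)
    print("OOB accuracy:", round(clf.oob_score_, 3))
    print("OOB error:   ", round(1 - clf.oob_score_, 3))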
 
@@ -10,7 +10,7 @@
 
 <br><br>Random forests classification, like most machine learning methods, does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but it will do so by predicting the majority class with very high accuracy at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit-learn implementation's balanced mode then automatically adjusts weights inversely proportional to class frequencies.
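
A minimal sketch of what the 'balanced' flag maps onto underneath, assuming the scikit-learn backend; class_weight='balanced' re-weights classes inversely proportional to their frequencies in the training data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # deliberately imbalanced synthetic data (roughly 90% / 10%)
    X, y = make_classification(n_samples=1000, n_features=8,
                               weights=[0.9, 0.1], random_state=2)
    clf = RandomForestClassifier(n_estimators=500, class_weight='balanced',
                                 random_state=2).fit(X, y)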
 
-<br><br>Random forest can also be run in regression mode by setting the <i>mode</i> to the regression option. You also can increase the generalization ability of the classifier by increasing minsplit, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression.
+<br><br>Random forests can also be run in regression mode by setting the <i>mode</i> to the regression option. In this case, averaging is used instead of majority voting to obtain the prediction, and the mean squared error (MSE) is used to measure the quality of each decision split in the tree, versus the Gini impurity for classification. Additionally, the default <i>mfeatures</i> is equal to the number of predictors, and the coefficient of determination R^2 of the prediction is output as the performance measure in regression mode. You can also increase the generalization ability of the classifier by increasing <i>minsplit</i>, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression.
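
In regression mode this corresponds roughly to the following scikit-learn sketch, where predictions are averaged over the trees, max_features=None uses all predictors at each split, min_samples_split maps to the module's minsplit option, and score() returns the coefficient of determination R^2:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # synthetic, illustrative data only
    X, y = make_regression(n_samples=300, n_features=6, noise=0.5, random_state=0)
    reg = RandomForestRegressor(n_estimators=500, max_features=None,
                                min_samples_split=2).fit(X, y)
    print("R^2 on the training data:", round(reg.score(X, y), 3))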
 
 <br><br>The module also offers the ability to save and load a random forests model. The model is saved as a list of files (with the .pkl extension, which is added automatically), one for each numpy array. This list can involve a large number of files, so it makes sense to save each model in a separate directory. To load the model, you need to select the .pkl file that was saved. Saving and loading a model represents a useful feature because it allows a model to be built on one imagery group (i.e. one set of predictor variables), and then the prediction can be performed on other imagery groups. This approach is commonly employed in species prediction modelling, or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables. The names of the GRASS rasters in the imagery groups do not matter because scikit-learn saves the model as a series of numpy arrays. However, the new imagery group must contain the same number of rasters, and they should be in the same order as in the imagery group upon which the model was built. As an example, the new imagery group may have a raster named 'mean_precipitation_2050' which substitutes for the 'mean_precipitation_2016' raster in the imagery group that was used to build the model.
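
A hedged sketch of this save/load workflow using joblib (the filename below is a placeholder; older joblib releases write each large numpy array to a companion file alongside the .pkl, which is why a saved model can consist of many files and is best kept in its own directory):

    import joblib   # bundled as sklearn.externals.joblib in older scikit-learn
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=500)
    # ... fit clf on the predictors extracted from the imagery group ...

    # save the model (placeholder filename, ideally inside its own directory)
    joblib.dump(clf, 'rf_model.pkl')

    # later, or after switching to a new imagery group with the same number
    # of rasters in the same order, reload the model and predict
    clf = joblib.load('rf_model.pkl')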
 


