[GRASS-SVN] r68164 - grass-addons/grass7/raster/r.randomforest

Sun Mar 27 08:17:43 PDT 2016

Author: spawley
Date: 2016-03-27 08:17:42 -0700 (Sun, 27 Mar 2016)
New Revision: 68164

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
Log:


Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================

--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-27 13:39:47 UTC (rev 68163)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-27 15:17:42 UTC (rev 68164)
@@ -1,21 +1,17 @@
 <h2>DESCRIPTION</h2>
 
-<em><b>r.randomforest</b></em> performs random forest classification and regression on a GRASS imagery group using the scikit learn machine learning python library. This python package, along with python pandas needs to be installed within your GRASS python environment for r.randomforest to work. For linux users, both of these packages are  available through the linux package manager in most distributions. For windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">Osgeo4W</a> installation method of GRASS, where the python setuptools can also be installed. Then, you can download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using easy_install, or pip (which you might have to install with easy_install pip)
+<em><b>r.randomforest</b></em> performs random forest classification and regression on a GRASS imagery group using the scikit learn machine learning python library. This python package, along with python pandas needs to be installed within your GRASS python environment for r.randomforest to work. For linux users, both of these packages are  available through the linux package manager in most distributions. For windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">Osgeo4W</a> installation method of GRASS, where the python setuptools can also be installed. Then, you can download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using easy_install, or pip (which you might have to install with easy_install pip).
 
-<h3>RANDOM FORESTS CLASSIFICATION</h3>
+<br><br>Random forests (RF) (Breiman, 2001)  represents an ensemble classification tree method. RF constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership is based on the proportion of votes for each class. RF parameters consisting of the number of trees (ntree) and the number of variables that are available at each node split (mtry) were chosen by assessing the OOB error using different parameter values.
 
-Random forests (RF) (Breiman, 2001)  represents an ensemble classification tree method. RF constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership is based on the proportion of votes for each class. RF parameters consisting of the number of trees (ntree) and the number of variables that are available at each node split (mtry) were chosen by assessing the OOB error using different parameter values.
-
 <br><br>RF provides a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships and non-monotonic responses. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of RF relative to many other machine learning algorithms is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (ntrees), and the number of variables that are allowed to be chosen from at each node split (mfeatures), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees, apart from the cost of increased computational time. However, the performance of RF models typically level off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest.
 
 <br><br>An additional feature of RF is that it includes built-in accuracy assessment and variable selection. RF uses the concept of bagging, where a randomly selected 66% subset of the training data are held-out 'out-of-bag' (OOB) in the construction of each tree, and then OOB data are used to evaluate the prediction accuracy. RF scikit learn implementation also includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are output to the command display.
 
 <br><br>Random forest classification like most machine learning methods does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by modelling the majority class with a very high accuracy, but at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit learn implementation balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
 
-<h3>RANDOM FORESTS REGRESSION</h3>
+<br><br>Random forest can also be run in regression mode by setting the <i>mode</i> to the regression option. You also can increase the generalization ability of the classifier by increasing minsplit, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression.
 
-Random forest can also be run in regression model. In this case, a number of classifying decision trees are fitted to sub-samples of the data, and averaging is used to improve the predictive accuracy. Regression mode can be used by setting the <i>mode</i> to the regression option. You also can increase the generalization ability of the classifier by increasing minsplit, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression.
-
 <h2>NOTES</h2>
 
 <em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default, and results in a stop-start behaviour. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=100</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size, including to a value that is equal or greater than the number of rows in the current region setting, in which case the entire image will be loaded into memory to classification.