[GRASS-SVN] r68162 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org
Sat Mar 26 23:40:09 PDT 2016


Author: spawley
Date: 2016-03-26 23:40:09 -0700 (Sat, 26 Mar 2016)
New Revision: 68162

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
   grass-addons/grass7/raster/r.randomforest/r.randomforest.py
Log:
Update to r.randomforest: add regression mode and improve error checking of the inputs supplied to the classifier

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-26 20:40:38 UTC (rev 68161)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-27 06:40:09 UTC (rev 68162)
@@ -2,14 +2,20 @@
 
 <em><b>r.randomforest</b></em> performs random forest classification and regression on a GRASS imagery group using the scikit-learn machine learning Python library. This package, along with pandas, needs to be installed within your GRASS Python environment for r.randomforest to work. For Linux users, both packages are available through the package manager in most distributions. For Windows users, the easiest way to install them is to use the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> together with the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. Then download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using easy_install or pip (which you might first have to install with easy_install pip).
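Where pip is already available, installation can be as simple as the following (a minimal example using the PyPI package names; on Windows the Gohlke wheel files mentioned above may be needed instead):

pip install numpy scikit-learn pandas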
 
-<br><br> Random forests (RF) (Breiman, 2001)  represents an ensemble classification and regression tree method. RF constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership is based on the proportion of votes for each class. RF parameters consisting of the number of trees (ntree) and the number of variables that are available at each node split (mtry) were chosen by assessing the OOB error using different parameter values.
+<h3>RANDOM FORESTS CLASSIFICATION</h3>
 
-<br><br>RF provides a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships and non-monotonic responses. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of RF relative to many other machine learning algorithms is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (ntrees), and the number of variables that are allowed to be chosen from at each node split (mfeatures), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees, apart from the cost of increased computational time. However, the performance of RF models typically level off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest. If using random forest in regression mode, i.e. a continuous type training data are supplied to the classifier, then you can increase the generalization ability of the classifier by increasing minsplit, which represents the minimum number of samples required in order to split a node.
+Random forests (RF) (Breiman, 2001) is an ensemble classification tree method. RF constructs a forest of uncorrelated decision trees, using a random subset of predictor variables that is drawn independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification that receives the most votes over all of the trees. The probability of membership is based on the proportion of votes for each class. The main RF parameters, the number of trees (ntrees) and the number of variables available at each node split (mfeatures), can be chosen by assessing the OOB error under different parameter values.
 
+<br><br>RF provides a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships and non-monotonic responses. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of RF relative to many other machine learning algorithms is that it involves few user-specified parameter choices, principally the number of trees in the forest (ntrees) and the number of variables that can be chosen from at each node split (mfeatures), which controls the degree of correlation between the trees. There is also no accuracy penalty in having a large number of trees, apart from the cost of increased computational time. However, the performance of RF models typically levels off at a certain number of trees, beyond which there is no further benefit in terms of error reduction from using a larger forest.
+
 <br><br>An additional feature of RF is that it includes built-in accuracy assessment and variable selection. RF uses the concept of bagging, where each tree is trained on a bootstrap sample containing roughly two-thirds of the training data, while the remaining held-out 'out-of-bag' (OOB) data are used to evaluate the prediction accuracy. The scikit-learn RF implementation also includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This allows the contributions of the individual predictors to be determined. The feature importance scores are output to the command display.
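As a rough illustration with synthetic data (not the module's own code), the scikit-learn estimator used here exposes both the OOB accuracy and the Gini-based importances after fitting; ntrees, mfeatures and minsplit map onto n_estimators, max_features and min_samples_split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for band values sampled at the labelled roi pixels
X, y = make_classification(n_samples=500, n_features=6, random_state=1)

rf = RandomForestClassifier(n_estimators=500, max_features='auto',
                            min_samples_split=2, oob_score=True,
                            n_jobs=-1, random_state=1)
rf = rf.fit(X, y)
print(rf.oob_score_)            # OOB estimate of prediction accuracy
print(rf.feature_importances_)  # Gini-based importance of each band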
 
 <br><br>Random forest classification, like most machine learning methods, does not perform well in the case of a large class imbalance. Here, the classifier will seek to reduce the overall model error, but it will do so by modelling the majority class with very high accuracy at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. In balanced mode, the scikit-learn implementation uses the values of y to automatically adjust weights inversely proportional to the class frequencies.
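In scikit-learn terms, setting the 'balanced' flag amounts to passing class_weight='balanced' to the classifier; a minimal sketch (other parameter values are illustrative):

from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' reweights samples inversely proportional
# to the class frequencies observed in the training data
rf = RandomForestClassifier(n_estimators=500, class_weight='balanced',
                            oob_score=True, n_jobs=-1, random_state=1)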
 
+<h3>RANDOM FORESTS REGRESSION</h3>
+
+Random forest can also be run in regression mode. In this case, a number of decision trees are fitted to sub-samples of the data and their predictions are averaged to improve the predictive accuracy. Regression mode is selected by setting <i>mode</i> to the regression option, and requires continuous training data. You can also increase the generalization ability of the model by increasing minsplit, which represents the minimum number of samples required to split a node. The balanced and class_probabilities options are ignored for regression.
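A sketch of the corresponding regression call with synthetic data; minsplit maps to min_samples_split, and the value used here is illustrative:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=1)

# a larger min_samples_split grows shallower trees, which can generalize better
rf = RandomForestRegressor(n_estimators=500, min_samples_split=10,
                           oob_score=True, n_jobs=-1, random_state=1)
rf = rf.fit(X, y)
print(rf.oob_score_)  # for regression, oob_score_ is the out-of-bag R-squared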
+
 <h2>NOTES</h2>
 
 <em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row using the RasterRow method in PyGRASS. A single row, however, is an inefficiently small volume of data to pass to the classifier, which is multithreaded by default, and results in stop-start behaviour. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written back to disk row-by-row. <i>Lines=100</i> should be reasonable for most systems with 4-8 GB of RAM. However, if you have a workstation with much larger resources, <i>lines</i> could be set to a much larger value, including one equal to or greater than the number of rows in the current region, in which case the entire image will be loaded into memory for classification.
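A simplified sketch of this chunking strategy, using a NumPy array as a stand-in for the PyGRASS RasterRow reads (array sizes are assumed, and the fitted estimator rf is omitted):

import numpy as np

rows, cols, nbands = 1000, 800, 6
rowincr = 100  # the lines parameter
img = np.random.rand(rows, nbands, cols)  # stand-in for the imagery group

for start in range(0, rows, rowincr):
    # accumulate rowincr rows into one block so that the multithreaded
    # predict() call receives a worthwhile amount of data at once
    block = img[start:start + rowincr]
    flat = block.transpose(0, 2, 1).reshape(-1, nbands)
    # result = rf.predict(flat)  # then reshape and write back row-by-row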
@@ -18,7 +24,7 @@
 
 <h2>EXAMPLE</h2>
 
-r.randomforest igroup=landsat output=classification roi=labelled_pixels ntrees=500 mfeatures=-1 minsplit=2 lines=100 randst = 1
+r.randomforest igroup=lsat7_2000@landsat roi=landcover_1m@PERMANENT output=rf_classification mode=classification ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
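Regression mode is invoked in the same way; a hypothetical example (the continuous training map soils_Kfactor is assumed to exist):

r.randomforest igroup=lsat7_2000@landsat roi=soils_Kfactor mode=regression output=rf_regression ntrees=500 mfeatures=-1 minsplit=10 randst=1 lines=100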
 
 <h2>REFERENCES</h2>
 
@@ -28,4 +34,4 @@
 
 Steven Pawley
 
-<p><i>Last changed: $Date: 2016-03-25 22:22:00 -0700 (Sat, 25 Mar 2016) $</i>
+<p><i>Last changed: $Date: 2016-03-25 23:45:00 -0700 (Sun, 26 Mar 2016) $</i>

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.py
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-03-26 20:40:38 UTC (rev 68161)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-03-27 06:40:09 UTC (rev 68162)
@@ -40,9 +40,17 @@
 #%option G_OPT_R_OUTPUT
 #% key: output
 #% required: yes
-#% label: Output Classification Map
+#% label: Output Map
 #%end
 
+#%option string
+#% key: mode
+#% required: yes
+#% label: Classification or regression mode
+#% answer: classification
+#% options: classification,regression
+#%end
+
 #%option
 #% key: ntrees
 #% type: integer
@@ -100,16 +108,32 @@
 #% guisection: Optional
 #%end
 
-# user set variables
-import atexit, os, random, string
+# standard modules
+import atexit, os, random, string, imp
 from grass.pygrass.raster import RasterRow
 from grass.pygrass.gis.region import Region
 from grass.pygrass.raster.buffer import Buffer
 import grass.script as grass
-from sklearn.ensemble import RandomForestClassifier
-import pandas as pd
 import numpy as np
 
+# non-standard modules
+def module_exists(module_name):
+    try:
+        imp.find_module(module_name)
+        return True
+    except ImportError:
+        print(module_name + " python package not installed....exiting")
+        return False
+
+if module_exists("sklearn"):
+    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
+else:
+    exit()
+if module_exists("pandas"):
+    import pandas as pd
+else:
+    exit()
+
 def cleanup():
     # We can then close the rasters and the roi image
     for i in range(nbands): rasstack[i].close()
@@ -121,6 +145,7 @@
     igroup = options['igroup']
     roi = options['roi']
     output = options['output']
+    mode = options['mode']
     ntrees = options['ntrees']
     balanced = flags['b']
     class_probabilities = flags['p']
@@ -129,9 +154,27 @@
     minsplit = int(options['minsplit'])
     randst = int(options['randst'])
 
-    if mfeatures == -1: mfeatures = str('auto')
-
-    # Fetch individual raster names from group
+    ##################### error checking for valid input parameters ############################################
+    if mfeatures == -1:
+        mfeatures = 'auto'
+    elif mfeatures < 1:
+        print("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures) setting.....exiting")
+        exit()
+    if minsplit < 1:
+        print("minsplit must be greater than zero.....exiting")
+        exit()
+    if rowincr <= 0:
+        print("rowincr must be greater than zero....exiting")
+        exit()
+    if ntrees < 1:
+        print("ntrees must be greater than zero.....exiting")
+        exit()
+    if mode == 'regression' and balanced:
+        print("balanced mode is ignored when Random Forests is run in regression mode....continuing")
+    if mode == 'regression' and class_probabilities:
+        print("option to output class probabilities is ignored in regression mode....continuing")
+    
+    ######################  Fetch individual raster names from group ###########################################
     groupmaps = grass.read_command("i.group", group = igroup, flags = "g")
 
     if os.name == "nt":
@@ -187,18 +230,25 @@
         print("ROI raster does not exist.... exiting")
         exit()
     
+    # determine cell storage type of training roi raster; parse_command avoids
+    # platform-dependent line-ending handling when reading the r.info output
+    roi_info = grass.parse_command("r.info", map = roi, flags = 'g')
+    dtype = roi_info['datatype']
+
+    # check if training rois are valid for classification and regression
+    if mode == 'classification' and dtype != 'CELL':
+        print ("Classification mode requires an integer CELL type training roi map.....exiting")
+        exit()
+    
     # Count number of labelled pixels
-    tdir = grass.tempdir()
-    tfile = tdir + '/' + 'rstats.csv'
+    roi_stats = grass.parse_command("r.univar", flags = "g", map = roi)
+    nlabel_pixels = int(roi_stats['n'])
     
-    grass.run_command("r.univar", flags=("gt"), map = roi, separator = 'comma', output = tfile)
-    roi_stats = pd.read_csv(tfile)
-    roi_stats
-    nlabel_pixels = roi_stats['non_null_cells'][0]
-    
     # Create a numpy array filled with zeros, with the dimensions of the number of columns in the region
-    # and the number of bands plus an additional band to attach the labels
-    
+    # and the number of bands plus an additional band to attach the labels    
     tindex=0
     training_labels = []
     training_data = np.zeros((nlabel_pixels, nbands+1))
@@ -238,11 +288,15 @@
     training_data = training_data[:, 0:nbands]
     
     ############################### Training the classifier #######################################
-    if balanced == True:
-        rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-        class_weight = "balanced", max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+    if mode == 'classification':
+        if balanced == True:
+            rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+            class_weight = 'balanced', max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+        else:
+            rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+            max_features = mfeatures, min_samples_split = minsplit, random_state = randst)            
     else:
-        rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+        rf = RandomForestRegressor(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
         max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
     rf = rf.fit(training_data, training_labels)
     print('Our OOB prediction of accuracy is: {oob}%'.format(oob=rf.oob_score_ * 100))
@@ -266,10 +320,16 @@
     # anything in memory, we save it to a GRASS raster object, row-by-row.
 
     classification = RasterRow(output)
-    classification.open('w', 'CELL',  overwrite = True)
-
+    if mode == 'classification':
+        ftype = 'CELL'
+        nodata = -2147483648
+    else:
+        ftype = 'FCELL'
+        nodata = np.nan
+    classification.open('w', ftype,  overwrite = True)
+    
     # create and open RasterRow objects for classification and probabilities if enabled    
-    if class_probabilities == True:
+    if class_probabilities == True and mode == 'classification':
         prob_out_raster = [0] * nclasses
         prob = [0] * nclasses
         for iclass in range(nclasses):
@@ -303,16 +363,16 @@
         
         # replace NaN values so that the prediction surface does not have a border
         result_NaN = np.ma.masked_array(result, mask=nanmask, fill_value=np.nan)
-        result_masked = result_NaN.filled([-2147483648]) #Return a copy of result, with masked values filled with a given value
+        result_masked = result_NaN.filled([nodata]) #Return a copy of result, with masked values filled with a given value
         
         # for each row in the block, write the computed result into the output raster
         for row in range(rowincr):
-            newrow = Buffer((result_masked.shape[1],), mtype='CELL')
+            newrow = Buffer((result_masked.shape[1],), mtype=ftype)
             newrow[:] = result_masked[row, :]
             classification.put_row(newrow)
         
         # same for probabilities
-        if class_probabilities == True:
+        if class_probabilities == True and mode == 'classification':
             result_proba = rf.predict_proba(flat_pixels_noNaN)
             for iclass in range(nclasses):
                 result_proba_class = result_proba[:, iclass]
@@ -326,7 +386,7 @@
     
     classification.close()
 
-    if class_probabilities == True:
+    if class_probabilities == True and mode == 'classification':
         for iclass in range(nclasses): prob[iclass].close()
     
 if __name__ == "__main__":


