[GRASS-SVN] r68167 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org
Mon Mar 28 09:42:54 PDT 2016


Author: spawley
Date: 2016-03-28 09:42:54 -0700 (Mon, 28 Mar 2016)
New Revision: 68167

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
   grass-addons/grass7/raster/r.randomforest/r.randomforest.py
Log:
Added features for randomforest

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-28 14:40:29 UTC (rev 68166)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-28 16:42:54 UTC (rev 68167)
@@ -1,23 +1,29 @@
 <h2>DESCRIPTION</h2>
 
-<em><b>r.randomforest</b></em> performs random forest classification and regression on a GRASS imagery group using the scikit learn machine learning python library. This python package, along with python pandas needs to be installed within your GRASS python environment for r.randomforest to work. For linux users, both of these packages are  available through the linux package manager in most distributions. For windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">Osgeo4W</a> installation method of GRASS, where the python setuptools can also be installed. Then, you can download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using easy_install, or pip (which you might have to install with easy_install pip).
+<em><b>r.randomforest</b></em> performs random forests classification and regression on a GRASS imagery group. Random forests (Breiman, 2001) is an ensemble classification tree method which constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, chosen independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership (<i>class_probabilities</i> flag) is based on the proportion of votes for each class.
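
In scikit-learn, which the module uses for prediction (see the Python changes below), these per-class vote proportions are exposed through predict_proba. A minimal sketch, using toy data in place of the sampled ROI pixels:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # toy stand-ins for the sampled training pixels (hypothetical data)
    X = np.random.rand(100, 3)           # 100 pixels, 3 predictor bands
    y = np.random.randint(0, 2, 100)     # two class labels

    rf = RandomForestClassifier(n_estimators=500).fit(X, y)
    proba = rf.predict_proba(X)          # proportion of tree votes per class
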
 
-<br><br>Random forests (RF) (Breiman, 2001)  represents an ensemble classification tree method. RF constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership is based on the proportion of votes for each class. RF parameters consisting of the number of trees (ntree) and the number of variables that are available at each node split (mtry) were chosen by assessing the OOB error using different parameter values.
+<br><br>Random forests offers a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of random forests is that it involves few user-specified parameter choices, principally the number of trees in the forest (<i>ntrees</i>) and the number of variables that are allowed to be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees, apart from increased computational time. However, the performance of RF models typically levels off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest.
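
In terms of the scikit-learn API used by the module, these options map onto the classifier constructor roughly as follows (values shown are the module defaults; mfeatures=-1 becomes 'auto', i.e. sqrt(n_features)):

    from sklearn.ensemble import RandomForestClassifier

    # ntrees -> n_estimators, mfeatures -> max_features, minsplit -> min_samples_split
    rf = RandomForestClassifier(n_jobs=-1, n_estimators=500, oob_score=True,
                                max_features='auto', min_samples_split=2,
                                random_state=1)
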
 
-<br><br>RF provides a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships and non-monotonic responses. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of RF relative to many other machine learning algorithms is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (ntrees), and the number of variables that are allowed to be chosen from at each node split (mfeatures), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees, apart from the cost of increased computational time. However, the performance of RF models typically level off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest.
+<br><br>An additional feature of random forests is that it includes built-in accuracy assessment and variable selection. Random forests uses the concept of bagging, where each tree is constructed from a bootstrapped sample of the training data, and the roughly one-third of the training data held out 'out-of-bag' (OOB) of each tree are then used to evaluate the prediction accuracy. The OOB error has been shown to provide a reliable estimate of classification error. Note that this is applied on a per-pixel basis in the module, and the OOB error assumes that the training data are not spatially correlated. If your training data represent individual pixels that are separated by some distance, then this assumption may hold true. However, if your training data represent rasterized polygons, then this assumption will likely be false, and you should use an independent set of polygons to test the accuracy of the classification using <i>i.kappa</i>.
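
The OOB estimate is what the module prints after fitting. A minimal sketch with toy data:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(100, 3)            # hypothetical predictor values
    y = np.random.randint(0, 2, 100)      # hypothetical class labels

    rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1).fit(X, y)
    print('Our OOB prediction of accuracy is: {oob}%'.format(oob=rf.oob_score_ * 100))
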
 
-<br><br>An additional feature of RF is that it includes built-in accuracy assessment and variable selection. RF uses the concept of bagging, where a randomly selected 66% subset of the training data are held-out 'out-of-bag' (OOB) in the construction of each tree, and then OOB data are used to evaluate the prediction accuracy. RF scikit learn implementation also includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are output to the command display.
+<br><br>The random forests scikit-learn implementation also includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are output to the command display.
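
The reporting code is essentially this fragment from r.randomforest.py, where rf is the fitted forest and maplist holds the raster names returned by i.group (toy values are used here so the snippet runs standalone):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    maplist = ['band1', 'band2', 'band3']          # raster names (hypothetical)
    X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)
    rf = RandomForestClassifier(n_estimators=500).fit(X, y)

    rfimp = pd.DataFrame(rf.feature_importances_)  # Gini importance per predictor
    rfimp.insert(loc=0, column='Raster', value=maplist)
    rfimp.columns = ['Raster', 'Importance']
    print(rfimp)
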
 
-<br><br>Random forest classification like most machine learning methods does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by modelling the majority class with a very high accuracy, but at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit learn implementation balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
+<br><br>Random forests classification, like most machine learning methods, does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by predicting the majority class with very high accuracy at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit-learn implementation's balanced mode then automatically adjusts weights inversely proportional to the class frequencies.
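
With the balanced flag set, the module constructs the classifier with scikit-learn's 'balanced' class-weight mode, as in the Python changes below:

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_jobs=-1, n_estimators=500, oob_score=True,
                                class_weight='balanced', max_features='auto',
                                min_samples_split=2, random_state=1)
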
 
 <br><br>Random forest can also be run in regression mode by setting the <i>mode</i> to the regression option. You also can increase the generalization ability of the classifier by increasing minsplit, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression.
 
+<br><br>The module also offers the ability to save and load a random forests model. The model is saved as a pickle file along with a separate file for each numpy array in the model. This can involve a large number of files, so it makes sense to save each model in a separate directory.
+
+<br><br>Saving a model represents a useful feature because it allows a model to be built on one imagery group (i.e. set of predictor variables), and then the prediction can be performed on other imagery groups. This approach is commonly employed in species distribution modelling or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions are performed on other imagery groups containing forecasted climatic variables. The names of the GRASS rasters in the imagery groups do not matter because scikit-learn saves the model as a series of numpy arrays. However, the new imagery group must contain the same number of rasters, and they should be in the same order as in the imagery group upon which the model was built. As an example, the new imagery group may have a raster named 'mean_precipitation_2050' which substitutes for the 'mean_precipitation_2016' raster in the imagery group that was used to build the model.
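
The save and load steps use joblib, as in the module. A sketch (the paths and the new_flat_pixels array are hypothetical; the new pixel matrix must have the same number and order of bands/columns as the training data):

    import numpy as np
    from sklearn.externals import joblib
    from sklearn.ensemble import RandomForestClassifier

    X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)  # toy training data
    rf = RandomForestClassifier(n_estimators=500).fit(X, y)

    joblib.dump(rf, '/models/rf2016/model.pkl')   # writes model.pkl plus companion .npy files
    rf = joblib.load('/models/rf2016/model.pkl')  # reload later, possibly in another session

    new_flat_pixels = np.random.rand(500, 3)      # e.g. forecasted predictors, same band order
    result = rf.predict(new_flat_pixels)
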
+
 <h2>NOTES</h2>
 
-<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default, and results in a stop-start behaviour. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=100</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size, including to a value that is equal or greater than the number of rows in the current region setting, in which case the entire image will be loaded into memory to classification.
+<em><b>r.randomforest</b></em> uses the scikit-learn machine learning python package, and also the pandas package. These python packages need to be installed within your GRASS python environment for <em><b>r.randomforest</b></em> to work. For Linux users, both of these packages should be available through the package manager in most distributions. For Windows users, the easiest way of installing the packages is to use the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using 'pip install packagename.whl'.
 
-<br><br> The bootstrapping process involved within random forest causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed uding the <i>randst</i> parameter.
+<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default, and it results in a stop-start behaviour. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=100</i> should be reasonable for most systems with 4-8 GB of RAM. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size (including a value equal to or greater than the number of rows in the current region setting), in which case the entire image will be loaded into memory for classification.
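
Condensed from the module's prediction loop further down in this diff, the row-block logic looks roughly like this (rasstack is the list of open RasterRow objects, rf the fitted forest, nbands the band count, current the PyGRASS Region, and numpy is imported as np):

    rowincr = 100                                  # the lines parameter
    for rowblock in range(0, current.rows, rowincr):
        if rowblock + rowincr > current.rows:
            rowincr = current.rows - rowblock      # shrink the final block
        img_np_row = np.zeros((rowincr, current.cols, nbands))
        for row in range(rowblock, rowblock + rowincr):
            for band in range(nbands):
                img_np_row[row - rowblock, :, band] = np.array(rasstack[band][row])
        # reshape the block into a pixel list and predict in one multithreaded call
        flat_pixels = img_np_row.reshape((rowincr * current.cols, nbands))
        result = rf.predict(np.nan_to_num(flat_pixels)).reshape((rowincr, current.cols))
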
 
+<br><br>The bootstrapping process involved in random forests also causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.
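
A quick illustration of the effect of the seed (random_state in scikit-learn): two forests fitted with the same randst value produce identical predictions on toy data:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)
    rf1 = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
    rf2 = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
    assert np.array_equal(rf1.predict(X), rf2.predict(X))  # same seed, same forest
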
+
 <h2>EXAMPLE</h2>
 
 r.randomforest igroup=lsat7_2000@landsat roi=landcover_1m@PERMANENT output=rf_classification mode=classification ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
@@ -26,8 +32,12 @@
 
 Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
 
+<h2>ACKNOWLEDGEMENTS</h2>
+
+Thanks to Paulo van Breugel for general testing, and particularly for the suggestion to enable random forests prediction on a different set of predictor variables.
+
 <h2>AUTHOR</h2>
 
 Steven Pawley
 
-<p><i>Last changed: $Date: 2016-03-25 23:45:00 -0700 (Sun, 26 Mar 2016) $</i>
+<p><i>Last changed: $Date: 2016-03-28 10:41:00 -0700 (Mon, 28 Mar 2016) $</i>

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.py
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-03-28 14:40:29 UTC (rev 68166)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-03-28 16:42:54 UTC (rev 68167)
@@ -25,9 +25,9 @@
 #%option G_OPT_I_GROUP
 #% key: igroup
 #% label: Imagery group to be classified (predictors)
-#% description: Series of raster maps (e.g. imagery bands) to be used in the random forest classification
+#% description: Series of raster maps to be used in the random forest classification
 #% required: yes
-#% multiple: yes
+#% multiple: no
 #%end
 
 #%option G_OPT_R_INPUT
@@ -57,34 +57,34 @@
 #% description: Number of trees in the forest
 #% answer: 500
 #% required: yes
-#% guisection: Optional
+#% guisection: Random Forest Options
 #%end
 
 #%option
 #% key: mfeatures
 #% type: integer
-#% description: The number of features to consider when looking for the best split. Sqrt(n_features) is used by default
+#% description: The number of features allowed at each split. Sqrt(n_features) is used by default
 #% answer: -1
 #% required: yes
-#% guisection: Optional
+#% guisection: Random Forest Options
 #%end
 
 #%option
-#% key: minsplit 
+#% key: minsplit
 #% type: integer
 #% description: The minimum number of samples required to split an internal node
 #% answer: 2
 #% required: yes
-#% guisection: Optional
+#% guisection: Random Forest Options
 #%end
 
 #%option
-#% key: randst 
+#% key: randst
 #% type: integer
 #% description: Seed to pass onto the random state for reproducible results
 #% answer: 1
 #% required: yes
-#% guisection: Optional
+#% guisection: Random Forest Options
 #%end
 
 #%option
@@ -99,15 +99,29 @@
 #%flag
 #% key: p
 #% label: Output class membership probabilities
-#% guisection: Optional
+#% guisection: Random Forest Options
 #%end
 
 #%flag
 #% key: b
 #% description: Balance classes by weighting
+#% guisection: Random Forest Options
+#%end
+
+#%option G_OPT_F_OUTPUT
+#% key: savefile
+#% label: Save model to file
+#% required: no
 #% guisection: Optional
 #%end
 
+#%option G_OPT_F_INPUT
+#% key: loadfile
+#% label: Load model from file
+#% required: no
+#% guisection: Optional
+#%end
+
 # standard modules
 import atexit, os, random, string, imp
 from grass.pygrass.raster import RasterRow
@@ -120,13 +134,14 @@
 def module_exists(module_name):
     try:
         imp.find_module(module_name)
-        return(True)
+        return True
     except ImportError:
         print(module_name + " python package not installed....exiting")
-        return(False)
+        return False
 
 if module_exists("sklearn") == True:
     from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
+    from sklearn.externals import joblib
 else:
     exit()
 if module_exists("pandas") == True:
@@ -160,12 +175,14 @@
     mfeatures = int(options['mfeatures'])
     minsplit = int(options['minsplit'])
     randst = int(options['randst'])
+    model_save = options['savefile']
+    model_load = options['loadfile']
 
-    ##################### error checking for valid input parameters ############################################
+    ##################### error checking for valid input parameters ########################################
     if mfeatures == -1:
         mfeatures = str('auto')
     if mfeatures == 0:
-        print("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures) setting.....exiting")
+        print("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures).....exiting")
         exit()
     if minsplit == 0:
         print("minsplit must be greater than zero.....exiting")
@@ -180,24 +197,27 @@
         print ("balanced mode is ignored in Random Forests in regression mode....continuing")
     if mode == 'regression' and class_probabilities == True:
         print ("option to output class probabiltities is ignored in regression mode....continuing")
-    
-    ######################  Fetch individual raster names from group ###########################################
+    if model_save != '' and model_load != '':
+        print("Cannot save and load a model at the same time")
+        exit()
+
+    ######################  Fetch individual raster names from group ###################################
     groupmaps = grass.read_command("i.group", group = igroup, flags = "g")
     groupmaps = normalize_newlines(groupmaps)
     maplist = groupmaps.split('\n')
     maplist = maplist[0:len(maplist)-1]
-    
-    ######################### Obtain information about GRASS rasters to be classified ######################
-    
-    # Determine number of bands (i.e. elements in the list) and then create a list of GRASS rasterrow objects
-    global nbands    
+
+    ######################### Obtain information about GRASS rasters to be classified ##################
+
+    # Determine number of bands and then create a list of GRASS rasterrow objects
+    global nbands
     nbands = len(maplist)
-    
+
     global rasstack
     rasstack = [0] * nbands
     for i in range(nbands):
         rasstack[i] = RasterRow(maplist[i])
-    
+
     # Check to see if each raster in the list exists
     for i in range(nbands):
         if rasstack[i].exist() == True:
@@ -205,124 +225,135 @@
         else:
             print("GRASS raster " + maplist[i] + " does not exist.... exiting")
             exit()
-    
+
     # Use grass.pygrass.gis.region to get information about the current region, particularly
     # the number of rows and columns. We are going to sample and classify the data row-by-row,
-    # and not load all of the rasters into memory in a numpy array 
+    # and not load all of the rasters into memory in a numpy array
     current = Region()
-    
-    ########################### Create a imagery mask ###################################################
+
+    ########################### Create a imagery mask ###############################################
     # The input rasters might have slightly different dimensions in terms of value and non-value pixels.
     # We will use the GRASS r.series module to automatically create a mask by propagating the null values
-    
+
     global rfmask
     rfmask = 'tmp_' + ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(8)])
     grass.run_command("r.series", output = rfmask, input = maplist, method = 'count', flags = 'n', overwrite=True)
 
-    global mask_raster    
+    global mask_raster
     mask_raster = RasterRow(rfmask)
     mask_raster.open('r')
-    
-    ######################### Sample training data using training ROI ##################################
-    global roi_raster    
+
+    ######################### Sample training data using training ROI ##############################
+    global roi_raster
     roi_raster = RasterRow(roi)
-    
+
     if roi_raster.exist() == True:
         roi_raster.open('r')
     else:
         print("ROI raster does not exist.... exiting")
         exit()
     
-    # determine cell storage type of training roi raster    
-    roi_type = grass.read_command("r.info", map = roi, flags = 'g')
-    roi_type = normalize_newlines(str(roi_type))
-    roi_list = roi_type.split('\n')
-    dtype = roi_list[9].split('=')[1]
+    # load the model
+    if model_load != '':
+        rf = joblib.load(model_load)
+    else:
+        # determine cell storage type of training roi raster
+        roi_type = grass.read_command("r.info", map = roi, flags = 'g')
+        roi_type = normalize_newlines(str(roi_type))
+        roi_list = roi_type.split('\n')
+        dtype = roi_list[9].split('=')[1]
     
-    # check if training rois are valid for classification and regression
-    if mode == 'classification' and dtype != 'CELL':
-        print ("Classification mode requires an integer CELL type training roi map.....exiting")
-        exit()
+        # check if training rois are valid for classification and regression
+        if mode == 'classification' and dtype != 'CELL':
+            print ("Classification mode requires an integer CELL type training roi map.....exiting")
+            exit()
     
-    # Count number of labelled pixels
-    roi_stats = str(grass.read_command("r.univar", flags=("g"), map = roi))
-    if os.name == "nt":
-        roi_stats = roi_stats[0:len(roi_stats)-2] # to remove the last line ending and return characters
-        roi_stats = roi_stats.split('\r\n')[0]
-    else:
-        roi_stats = roi_stats[0:len(roi_stats)-1] # to remove the last line ending and return characters
-        roi_stats = roi_stats.split('\n')[0]
-
-    ncells = str(roi_stats).split('=')[1]
-    nlabel_pixels = int(ncells)
+        # Count number of labelled pixels
+        roi_stats = str(grass.read_command("r.univar", flags=("g"), map = roi))
+        if os.name == "nt":
+            roi_stats = roi_stats[0:len(roi_stats)-2] # to remove the last line ending and return characters
+            roi_stats = roi_stats.split('\r\n')[0]
+        else:
+            roi_stats = roi_stats[0:len(roi_stats)-1] # to remove the last line ending and return characters
+            roi_stats = roi_stats.split('\n')[0]
     
-    # Create a numpy array filled with zeros, with the dimensions of the number of columns in the region
-    # and the number of bands plus an additional band to attach the labels    
-    tindex=0
-    training_labels = []
-    training_data = np.zeros((nlabel_pixels, nbands+1))
-    training_data[:] = np.NAN
+        ncells = str(roi_stats).split('=')[1]
+        nlabel_pixels = int(ncells)
     
-    # Loop through each row of the raster, get the pixels from that row in the ROI raster
-    # and check if any of those pixels are labelled (i.e. they are not nan).
-    # If labelled pixels are encountered, loop through each imagery band for that row and put the data into
-    # the img_row_band_np array. Also attach the label values onto the last column.
+        # Create a numpy array filled with zeros, with the dimensions of the number of columns in the region
+        # and the number of bands plus an additional band to attach the labels
+        tindex=0
+        training_labels = []
+        training_data = np.zeros((nlabel_pixels, nbands+1))
+        training_data[:] = np.NAN
     
-    for row in range(current.rows):
-        roi_row_np = roi_raster[row]
-        is_train = np.nonzero(roi_row_np > -2147483648)
-        training_labels = np.append(training_labels, roi_row_np[is_train])
-        nlabels_in_row = np.array(is_train).shape[1]
-        if np.isnan(roi_row_np).all() != True:
-            for band in range(nbands):
-                imagerow_np = rasstack[band][row]
-                training_data[tindex : tindex+nlabels_in_row, band] = imagerow_np[is_train]
-            tindex = tindex + nlabels_in_row
+        # Loop through each row of the raster, get the pixels from that row in the ROI raster
+        # and check if any of those pixels are labelled (i.e. they are not nan).
+        # If labelled pixels are encountered, loop through each imagery band for that row and put the data into
+        # the img_row_band_np array. Also attach the label values onto the last column.
     
-    # Determine the number of class labels using np.unique
-    nclasses = len(np.unique(training_labels))
+        for row in range(current.rows):
+            roi_row_np = roi_raster[row]
+            is_train = np.nonzero(roi_row_np > -2147483648)
+            training_labels = np.append(training_labels, roi_row_np[is_train])
+            nlabels_in_row = np.array(is_train).shape[1]
+            if np.isnan(roi_row_np).all() != True:
+                for band in range(nbands):
+                    imagerow_np = rasstack[band][row]
+                    training_data[tindex : tindex+nlabels_in_row, band] = imagerow_np[is_train]
+                tindex = tindex + nlabels_in_row
     
-    # attach training label values onto last dimension of numpy array    
-    training_data[0:nlabel_pixels, nbands] = training_labels
+        # Determine the number of class labels using np.unique
+        nclasses = len(np.unique(training_labels))
     
-    # Remove nan rows from numpy array. Explanation: np.isnan(a) returns a similar array with True
-    # where NaN, False elsewhere. .any(axis=1) reduces an m*n array to n with an logical or operation
-    # on the whole rows, ~ inverts True/False and a[  ] chooses just the rows from the original array,
-    # which have True within the brackets.
+        # attach training label values onto last dimension of numpy array
+        training_data[0:nlabel_pixels, nbands] = training_labels
     
-    training_data = training_data[~np.isnan(training_data).any(axis=1)]
+        # Remove nan rows from numpy array. Explanation: np.isnan(a) returns a similar array with True
+        # where NaN, False elsewhere. .any(axis=1) reduces an m*n array to n with an logical or operation
+        # on the whole rows, ~ inverts True/False and a[  ] chooses just the rows from the original array,
+        # which have True within the brackets.
     
-    # Split the numpy array into training_labels and training_data arrays to pass to the classifier
-    training_labels = training_data[:, nbands]
-    training_data = training_data[:, 0:nbands]
+        training_data = training_data[~np.isnan(training_data).any(axis=1)]
     
+        # Split the numpy array into training_labels and training_data arrays to pass to the classifier
+        training_labels = training_data[:, nbands]
+        training_data = training_data[:, 0:nbands]
+
     ############################### Training the classifier #######################################
-    if mode == 'classification':
-        if balanced == True:
-            rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-            class_weight = 'balanced', max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+    if model_load == '':
+        if mode == 'classification':
+            if balanced == True:
+                rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+                class_weight = 'balanced', max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+            else:
+                rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+                max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
         else:
-            rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-            max_features = mfeatures, min_samples_split = minsplit, random_state = randst)            
-    else:
-        rf = RandomForestRegressor(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-        max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
-    rf = rf.fit(training_data, training_labels)
-    print('Our OOB prediction of accuracy is: {oob}%'.format(oob=rf.oob_score_ * 100))
-    rfimp = pd.DataFrame(rf.feature_importances_)
-    rfimp.insert(loc=0, column='Raster', value = maplist)
-    rfimp.columns = ['Raster', 'Importance']
-    print(rfimp)
+            rf = RandomForestRegressor(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+            max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+        rf = rf.fit(training_data, training_labels)
+    
+        # diagnostics
+        print('Our OOB prediction of accuracy is: {oob}%'.format(oob=rf.oob_score_ * 100))
+        rfimp = pd.DataFrame(rf.feature_importances_)
+        rfimp.insert(loc=0, column='Raster', value = maplist)
+        rfimp.columns = ['Raster', 'Importance']
+        print(rfimp)
 
+        # save the model
+        if model_save != '':
+            joblib.dump(rf, model_save + ".pkl")
+
     ################################ Prediction on the rest of the raster stack ###################
     # Create a np.array that can store each raster row for all of the bands, i.e. it is a long as the current columns,
     # and as wide as the number of bands
     # Loop through the raster, row-by-row and get the row values for each band, adding these to the img_np_row np.array,
     # which adds rowincr rows together to pass to the classifier. Otherwise, row-by-row is too inefficient.
-    
+
     # The scikit learn predict function expects a list of pixels, not an NxM matrix. We therefore need to reshape each row
     # matrix into a list. The total matrix size = cols * nbands. Therefore we can use the np.reshape function to convert
-    # the image into a list with the number of rows equal to n_samples, and the number of columns equal to the number of bands. 
+    # the image into a list with the number of rows equal to n_samples, and the number of columns equal to the number of bands.
     # Then we remove any NaN values because the scikit-learn predict function cannot handle NaNs. Here we replace them with a
     # small value using the np.nan_to_num function.
     # The flat_pixels is then passed onto the prediction function. After the prediction is performed on the row, to save keeping
@@ -336,8 +367,8 @@
         ftype = 'FCELL'
         nodata = np.nan
     classification.open('w', ftype,  overwrite = True)
-    
-    # create and open RasterRow objects for classification and probabilities if enabled    
+
+    # create and open RasterRow objects for classification and probabilities if enabled
     if class_probabilities == True and mode == 'classification':
         prob_out_raster = [0] * nclasses
         prob = [0] * nclasses
@@ -351,35 +382,35 @@
         if rowblock+rowincr > current.rows: rowincr = current.rows - rowblock
         img_np_row = np.zeros((rowincr, current.cols, nbands))
         mask_np_row = np.zeros((rowincr, current.cols))
-        
+
         # loop through each row, and each band and add these values to the 2D array img_np_row
         for row in range(rowblock, rowblock+rowincr, 1):
             mask_np_row[row-rowblock, :] = np.array(mask_raster[row])
             for band in range(nbands):
                 img_np_row[row-rowblock, :, band] = np.array(rasstack[band][row])
-                
+
         mask_np_row[mask_np_row == -2147483648] = np.nan
         nanmask = np.isnan(mask_np_row) # True in the mask means invalid data
-        
+
         # reshape each row-band matrix into a list
         nsamples = rowincr * current.cols
         flat_pixels = img_np_row.reshape((nsamples, nbands))
-        
+
         # remove NaN values and perform the prediction
         flat_pixels_noNaN = np.nan_to_num(flat_pixels)
         result = rf.predict(flat_pixels_noNaN)
         result = result.reshape((rowincr, current.cols))
-        
+
         # replace NaN values so that the prediction surface does not have a border
         result_NaN = np.ma.masked_array(result, mask=nanmask, fill_value=np.nan)
         result_masked = result_NaN.filled([nodata]) #Return a copy of result, with masked values filled with a given value
-        
-        # for each row we can perform computation, and write the result into   
+
+        # for each row we can perform computation, and write the result into
         for row in range(rowincr):
             newrow = Buffer((result_masked.shape[1],), mtype=ftype)
             newrow[:] = result_masked[row, :]
             classification.put_row(newrow)
-        
+
         # same for probabilities
         if class_probabilities == True and mode == 'classification':
             result_proba = rf.predict_proba(flat_pixels_noNaN)
@@ -392,12 +423,12 @@
                     newrow = Buffer((result_proba_class_masked.shape[1],), mtype='FCELL')
                     newrow[:] = result_proba_class_masked[row, :]
                     prob[iclass].put_row(newrow)
-    
+
     classification.close()
 
     if class_probabilities == True and mode == 'classification':
         for iclass in range(nclasses): prob[iclass].close()
-    
+
 if __name__ == "__main__":
     options, flags = grass.parser()
     atexit.register(cleanup)


