[GRASS-SVN] r68299 - grass-addons/grass7/raster/r.randomforest

svn_grass at osgeo.org
Thu Apr 21 21:34:25 PDT 2016


Author: spawley
Date: 2016-04-21 21:34:25 -0700 (Thu, 21 Apr 2016)
New Revision: 68299

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
   grass-addons/grass7/raster/r.randomforest/r.randomforest.py
Log:
update to r.randomforest with cross validation ability

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-04-21 17:15:49 UTC (rev 68298)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-04-22 04:34:25 UTC (rev 68299)
@@ -1,27 +1,34 @@
 <h2>DESCRIPTION</h2>
 
-<em><b>r.randomforest</b></em> performs Random forests classification and regression on a GRASS imagery group. Random forests (Breiman, 2001) represents an ensemble classification tree method which constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a prediction probability, and the final classification result is obtained by averaging of the prediction probabilities across all of the trees. The scikit-learn randomforest implementation differs from the original Breiman (2001) reference in that each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees (i.e. majority voting versus averaging). The probability of membership (<i>class_probabilities</i> flag) is based on the proportion of votes for each class.
+<em><b>r.randomforest</b></em> performs random forests classification and regression on a suite of predictors within a GRASS imagery group. Random forests (Breiman, 2001) is an ensemble classification tree method which constructs a forest of uncorrelated decision trees, with a random subset of the predictor variables being selected independently at every node split in each tree. Each tree produces a prediction probability, and the final classification result is obtained by averaging the prediction probabilities across all of the trees. The probability of membership to the individual classes can also be output using the <i>-p</i> (<i>class_probabilities</i>) flag. The scikit-learn random forest implementation differs from the original Breiman (2001) reference, which uses majority voting rather than averaging.
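+
+<br><br>As a minimal illustration (outside of this module), the averaging behaviour can be reproduced directly in scikit-learn; the arrays X and y below are synthetic stand-ins for real training data:
+
+<div class="code"><pre>
+import numpy as np
+from sklearn.ensemble import RandomForestClassifier
+
+X = np.random.rand(100, 4)        # 100 samples, 4 predictors (synthetic)
+y = np.random.randint(0, 2, 100)  # binary class labels (synthetic)
+
+rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
+
+# the forest prediction is the mean of the per-tree class probabilities
+forest_proba = rf.predict_proba(X[:1])
+tree_proba = np.mean([t.predict_proba(X[:1]) for t in rf.estimators_], axis=0)
+assert np.allclose(forest_proba, tree_proba)
+</pre></div>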
 
-<br><br>Random forests offers a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of random forests is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (<i>ntrees</i>), and the number of variables that are allowed to be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees apart from increased computational time. However, the performance of RF models typically level off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest. For randomforest classification, the default <i>ntrees</i> is 500 and the default setting of <i>mfeatures</i> is equal to the square root of the number of predictors.
+<br><br>Random forests offers a number of advantages over traditional statistical classifiers: it is non-parametric, it can deal with non-linear relationships and categorical data, and no rescaling is required. Random forests also requires relatively few user-specified parameter choices, principally consisting of the number of trees in the forest (<i>ntrees</i>), and the number of variables that are allowed to be chosen from at each node split (<i>mfeatures</i>), which controls the degree of correlation between the trees. There is no accuracy penalty in having a large number of trees apart from increased computational time. For random forest classification, the default <i>ntrees</i> is 500 and the default setting of <i>mfeatures</i> is equal to the square root of the number of predictors.
 
-<br><br>An additional feature of random forests is that it includes built-in accuracy assessment and variable selection. Random forests uses the concept of bagging, where a randomly selected 66% subset of the training data are held-out 'out-of-bag' (OOB) in the construction of each tree, and then OOB data are used to evaluate the prediction accuracy. The OOB error has been shown to provide a reliable estimate of classification error. Note that this is applied on a pixel-basis in the module, and the OOB error assumes that the training data are not spatially correlated. If you training data represent individual pixels that are separated by x distance, then this assumption may hold true. However, if your training data represent rasterized polygons, then this assumption will likely be false, and you should use an independent set of polygons to test the accuracy of the classification using <i>i.kappa</i>.
+<br><br>Random forests includes built-in accuracy assessment and variable selection. Random forests uses bagging, whereby a randomly selected 66% subset of the training data is held out 'out-of-bag' (OOB) in the construction of each tree, and the OOB data are then used to evaluate the prediction accuracy. The OOB error has been shown to provide a reliable estimate of classification error. Alternatively, the <i>cv</i> option can be set to > 1 to perform k-fold cross validation as an external method of accuracy assessment, in which case a number of global performance metrics are reported.
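+
+<br><br>Both forms of assessment can be sketched directly in scikit-learn (synthetic X and y again stand in for real training data):
+
+<div class="code"><pre>
+import numpy as np
+from sklearn import cross_validation
+from sklearn.ensemble import RandomForestClassifier
+
+X = np.random.rand(100, 4)        # synthetic predictors
+y = np.random.randint(0, 2, 100)  # synthetic class labels
+
+# internal assessment: the out-of-bag score from bagging
+rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
+print(rf.oob_score_)
+
+# external assessment: k-fold cross validation (equivalent to cv=5)
+scores = cross_validation.cross_val_score(rf, X, y, cv=5)
+print(scores.mean(), scores.std() * 2)
+</pre></div>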
 
-The random forests scikit-learn implementation also includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are output to the command display.
+<br><br><b>Note:</b> the performance measures are computed on a per-pixel basis in the module, and assume that the training data are not spatially correlated. This assumption is unlikely to hold if the training data represent rasterized polygons rather than points. In this case, an independent set of polygons should be used to test the accuracy of the classification using <i>i.kappa</i>.
 
-<br><br>Random forests classification like most machine learning methods does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by predicting the majority class with a very high accuracy, but at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit-learn implementation balanced mode then automatically adjust weights inversely proportional to class frequencies.
+<br><br>The random forests scikit-learn implementation includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini impurity in successive node splits. This allows the contributions of the individual predictors to be determined. The feature importance scores are displayed in the command output.
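+
+<br><br>A self-contained sketch of how these scores appear in scikit-learn (synthetic data):
+
+<div class="code"><pre>
+import numpy as np
+from sklearn.ensemble import RandomForestClassifier
+
+X = np.random.rand(100, 4)
+y = np.random.randint(0, 2, 100)
+rf = RandomForestClassifier(n_estimators=500).fit(X, y)
+
+# one Gini-based importance score per predictor; the scores sum to 1
+print(rf.feature_importances_)
+</pre></div>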
 
-<br><br>Random forest can also be run in regression mode by setting the <i>mode</i> to the regression option. In this case, the mean square error (mse) is used to measure the quality of each decision split in the tree, versus the gini impurity for classification. Additionally, the the default <i>mfeatures</i> is equal to the number of predictors, and the coefficient of determination R^2 of the prediction is outputted as the performance measure in regression mode. You also can increase the generalization ability of the classifier by increasing <i>minsplit</i>, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression. 
+<br><br>Random forests does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but it will do so by predicting the majority class with very high accuracy at the expense of the minority class. If you have a highly imbalanced dataset, the <i>-b</i> (balanced) flag can be set. The scikit-learn implementation then automatically adjusts the class weights to be inversely proportional to the class frequencies.
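+
+<br><br>Internally the flag corresponds to the scikit-learn <i>class_weight</i> argument:
+
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier
+
+# weights are adjusted inversely proportional to the class frequencies
+rf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
+</pre></div>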
 
+<br><br>Random forests can also be run in regression mode by setting <i>mode</i> to the regression option. In this case, the mean squared error (MSE) is used to measure the quality of each decision split in the tree. Additionally, the default <i>mfeatures</i> is equal to the number of predictors, and the coefficient of determination R^2 of the prediction is used as the performance measure. The generalization ability of the model can also be increased using <i>minsplit</i>, which represents the minimum number of samples required in order to split a node. The balanced and class_probabilities options are ignored for regression.
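+
+<br><br>A regression-mode sketch (again with synthetic stand-in data):
+
+<div class="code"><pre>
+import numpy as np
+from sklearn.ensemble import RandomForestRegressor
+
+X = np.random.rand(100, 4)  # synthetic predictors
+y = np.random.rand(100)     # continuous response
+
+rf = RandomForestRegressor(n_estimators=500, oob_score=True).fit(X, y)
+print(rf.score(X, y))       # coefficient of determination R^2
+</pre></div>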
+
+<br><br>The module also offers the ability to save and load a random forests model. The model is saved as a .pkl file (the extension is added automatically), along with one companion file for each numpy array in the model. This can amount to a large number of files, so it makes sense to save each model in a separate directory. To load the model, select the .pkl file that was saved. Saving and loading a model is useful because it allows a model to be built on one imagery group (i.e. one set of predictor variables), and the prediction to then be performed on other imagery groups. This approach is commonly employed in species distribution modelling or landslide susceptibility modelling, where a classification or regression model is built with one set of predictors (e.g. including present-day climatic variables) and predictions are then performed on other imagery groups containing forecasted climatic variables. The names of the GRASS rasters in the imagery groups do not matter because scikit-learn saves the model as a series of numpy arrays. However, the new imagery group must contain the same number of rasters, and they should be in the same order as in the imagery group upon which the model was built. As an example, the new imagery group may have a raster named 'mean_precipitation_2050' which substitutes for the 'mean_precipitation_2016' raster in the imagery group that was used to build the model.
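+
+<br><br>The module uses joblib for persistence; a minimal sketch of the same mechanism (the path is hypothetical, and <i>rf</i> is a fitted model):
+
+<div class="code"><pre>
+from sklearn.externals import joblib
+
+# save a fitted model; companion .npy files are written alongside the .pkl
+joblib.dump(rf, '/path/to/model_dir/rf_model.pkl')
+
+# reload it later, e.g. to predict on a different imagery group
+rf = joblib.load('/path/to/model_dir/rf_model.pkl')
+</pre></div>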
 
 <h2>NOTES</h2>
 
-<em><b>r.randomforest</b></em> uses the scikit-learn machine learning python package, and the pandas package. These python packages need to be installed within your GRASS python environment for <em><b>r.randomforest</b></em> to work. For linux users, both of these packages should be available through the linux package manager in most distributions. For windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">Osgeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using 'pip install packagename.whl'.
+<em><b>r.randomforest</b></em> uses the scikit-learn machine learning Python package, which needs to be installed within your GRASS Python environment for <em><b>r.randomforest</b></em> to work. For Linux users, this package should be available through the package manager in most distributions. For Windows users, the easiest way of installing the package is to use the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> together with the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager, download the NumPy-1.10+MKL and scikit-learn .whl files, and install them using 'pip install packagename.whl'.
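+
+<br><br>A quick way to verify the installation from the GRASS Python console (sketch):
+
+<div class="code"><pre>
+import sklearn
+print(sklearn.__version__)
+</pre></div>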
 
-<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default, and results in a stop-start behaviour. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=100</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size (including to a value that is equal or greater than the number of rows in the current region setting) in which case the entire image will be loaded into memory to classification.
+<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from disk row-by-row, using the RasterRow method in PyGRASS. This, however, does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the classified image is reconstructed and written row-by-row back to disk. <i>lines=50</i> should be reasonable for most systems with 4-8 GB of RAM. However, if you have a workstation with much larger resources, <i>lines</i> could be set to a much larger size (including a value equal to or greater than the number of rows in the current region setting), in which case the entire image will be loaded into memory for classification.
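+
+<br><br>The row-block reading strategy can be sketched with PyGRASS as follows (the raster name and block size are hypothetical):
+
+<div class="code"><pre>
+import numpy as np
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.gis.region import Region
+
+current = Region()
+rast = RasterRow('elevation')
+rast.open('r')
+
+lines = 50  # number of rows per block passed to the classifier
+for rowblock in range(0, current.rows, lines):
+    nrows = min(lines, current.rows - rowblock)
+    # stack nrows rows into a single 2D array before classification
+    block = np.vstack([np.array(rast[rowblock + i]) for i in range(nrows)])
+
+rast.close()
+</pre></div>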
 
 <br><br> The bootstrapping process involved within random forests also causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.
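+
+<br><br>Reproducibility simply pins the random number generator, as in this sketch (the module passes <i>randst</i> in the same way):
+
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier
+
+# identical seeds yield identical forests, and therefore identical results
+rf = RandomForestClassifier(n_estimators=500, random_state=1)
+</pre></div>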
 
+<h2>TODO</h2>
+
+Provide an option to perform cross-validation on a polygon or region basis.
+<br>Provide an option to perform both spatial and non-spatial cross-validation.
+
 <h2>EXAMPLE</h2>
 
 Here we are going to use the GRASS GIS North Carolina sample data set as a basis to perform a Landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.py
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-04-21 17:15:49 UTC (rev 68298)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-04-22 04:34:25 UTC (rev 68299)
@@ -4,7 +4,7 @@
 # MODULE:       r.randomforest
 # AUTHOR:       Steven Pawley
 # PURPOSE:      Provides supervised random forest classification and regression
-#               (using python scikit-learn and pandas)
+#               (using python scikit-learn)
 #
 # COPYRIGHT:    (c) 2015 Steven Pawley, and the GRASS Development Team
 #               This program is free software under the GNU General Public
@@ -18,7 +18,6 @@
 #% keyword: classification
 #% keyword: machine learning
 #% keyword: scikit-learn
-#% keyword: pandas
 #% keyword: random forests
 #%end
 
@@ -80,6 +79,15 @@
 #%end
 
 #%option
+#% key: cv
+#% type: integer
+#% description: Use k-fold cross-validation when cv > 1
+#% answer: 1
+#% required: yes
+#% guisection: Random Forest Options
+#%end
+
+#%option
 #% key: randst
 #% type: integer
 #% description: Seed to pass onto the random state for reproducible results
@@ -92,9 +100,9 @@
 #% key: lines
 #% type: integer
 #% description: Processing block size in terms of number of rows
-#% answer: 20
+#% answer: 50
 #% required: yes
-#% guisection: Optional
+#% guisection: Random Forest Options
 #%end
 
 #%flag
@@ -129,11 +137,8 @@
 #% guisection: Optional
 #%end
 
-#%option G_OPT_F_OUTPUT
-#% key: fimportance
-#% label: Save feature importance and accuracy to csv
-#% required: no
-#% guisection: Optional
+#%rules
+#% exclusive: roi,loadfile
 #%end
 
 # standard modules
@@ -151,16 +156,14 @@
         imp.find_module(module_name)
         return True
     except ImportError:
-        grass.error(_("{} Python package not installed."
-                      " Exiting").format(module_name))
+        grass.fatal("{} Python package not installed. Exiting").format(module_name)
         return False
 
 def cleanup():
     # We can then close the rasters and the roi image
     for i in range(nbands): rasstack[i].close()
-    mask_raster.close()
     roi_raster.close()
-    grass.run_command("g.remove", name=rfmask, flags="f", type="raster")
+    if rfmask != '': grass.run_command("g.remove", name=rfmask, flags="f", type="raster")
 
 def normalize_newlines(string):
     # Windows uses carriage return and line feed ("\r\n") as a line ending
@@ -168,12 +171,42 @@
     import re
     return re.sub(r'(\r\n|\r|\n)', '\n', string)
 
+def scoreresults(X, y, clf, kfolds):
+    # k-fold cross validation of a classifier, returning a (kfolds x 5)
+    # array of accuracy, roc_auc, precision, recall and kappa per fold
+    from sklearn import cross_validation, metrics
+    kf = cross_validation.KFold(len(y), n_folds=kfolds, shuffle=True)
+
+    pr = np.zeros((kfolds, 5))
+    i = 0
+
+    for train_index, test_index in kf:
+        X_train, X_test = X[train_index], X[test_index]
+        y_train, y_test = y[train_index], y[test_index]
+        clf.fit(X_train, y_train)
+        y_pred = clf.predict(X_test)
+
+        pr[i, 0] = metrics.accuracy_score(y_test, y_pred)
+
+        # if the response variable is binary, also calculate roc_auc and
+        # use binary averaging for precision and recall
+        if len(np.unique(y)) == 2:
+            statmethod = 'binary'
+            pr[i, 1] = metrics.roc_auc_score(y_test, y_pred)
+        else:
+            statmethod = 'micro'
+
+        pr[i, 2] = metrics.precision_score(y_test, y_pred, average=statmethod)
+        pr[i, 3] = metrics.recall_score(y_test, y_pred, average=statmethod)
+        pr[i, 4] = metrics.cohen_kappa_score(y_test, y_pred)
+        i += 1
+    return pr
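+
+# Example usage (hypothetical names): 5-fold cross validation of a classifier,
+# averaging the per-fold scores:
+#   scores = scoreresults(training_data, training_labels, rf, 5)
+#   print(scores.mean(axis=0))  # accuracy, roc_auc, precision, recall, kappa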
+
 def main():
     igroup = options['igroup']
     roi = options['roi']
     output = options['output']
     mode = options['mode']
     ntrees = options['ntrees']
+    cv = int(options['cv'])
     balanced = flags['b']
     modelonly = flags['m']
     class_probabilities = flags['p']
@@ -183,42 +216,38 @@
     randst = int(options['randst'])
     model_save = options['savefile']
     model_load = options['loadfile']
-    fimportance = options['fimportance']
 
     ##################### error checking for valid input parameters ################################
     if mfeatures == -1:
         mfeatures = str('auto')
     if mfeatures == 0:
-        grass.fatal_error("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures)...exiting")
+        grass.fatal("mfeatures must be greater than zero, or -1 which uses the sqrt(nfeatures)...exiting")
         exit()
     if minsplit == 0:
-        grass.fatal_error("minsplit must be greater than zero.....exiting")
+        grass.fatal("minsplit must be greater than zero.....exiting")
         exit()
     if rowincr <= 0:
-        grass.fatal_error("rowincr must be greater than zero....exiting")
+        grass.fatal("rowincr must be greater than zero....exiting")
         exit()
     if ntrees < 1:
-        grass.fatal_error("ntrees must be greater than zero.....exiting")
+        grass.fatal("ntrees must be greater than zero.....exiting")
         exit()
     if mode == 'regression' and balanced == True:
         grass.warning(_("balanced mode is ignored in regression mode....continuing"))
     if mode == 'regression' and class_probabilities == True:
         grass.warning(_("option to output class probabilities is ignored in regression mode....continuing"))
     if model_save != '' and model_load != '':
-        grass.fatal_error("Cannot save and load a model at the same time.....exiting")
+        grass.fatal("Cannot save and load a model at the same time.....exiting")
     if model_load == '' and roi == '':
-        grass.fatal_error("Require labelled pixels regions of interest.....exiting")
+        grass.fatal("Require labelled pixels regions of interest.....exiting")
 
     # lazy imports
     if module_exists("sklearn") == True:
         from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
         from sklearn.externals import joblib
+        from sklearn import cross_validation, metrics
     else:
-        grass.fatal_error("Scikit-learn python module is not installed.....exiting")
-    if module_exists("pandas") == True:
-        import pandas as pd
-    else:
-        grass.fatal_error("Pandas python module is not installed.....exiting")
+        grass.fatal("Scikit-learn python module is not installed.....exiting")
 
     ######################  Fetch individual raster names from group ###############################
     groupmaps = grass.read_command("i.group", group=igroup, flags="g")
@@ -242,42 +271,31 @@
         if rasstack[i].exist() == True:
             rasstack[i].open('r')
         else:
-            grass.fatal_error("GRASS raster " + maplist[i] + " does not exist.... exiting")
+            grass.fatal("GRASS raster " + maplist[i] + " does not exist.... exiting")
 
     # Use grass.pygrass.gis.region to get information about the current region, particularly
     # the number of rows and columns. We are going to sample and classify the data row-by-row,
     # and not load all of the rasters into memory in a numpy array
     current = Region()
 
-    ########################### Create a imagery mask ##############################################
-    # The input rasters might have different dimensions in terms of value and non-value pixels.
-    # We will use the r.series module to automatically create a mask by propagating the null values
-
+    # Define name of mask raster
     global rfmask
-
-    rfmask = 'tmp_' + ''.join([random.choice(string.ascii_letters + string.digits) \
-    for n in xrange(8)])
-
-    grass.run_command("r.series", output=rfmask, input=maplist, method='count', flags='n', \
-    overwrite=True)
-
-    global mask_raster
-    mask_raster = RasterRow(rfmask)
-    mask_raster.open('r')
-
+    rfmask = ''
+    
     ######################### Sample training data using training ROI ##############################
     global roi_raster
     roi_raster = RasterRow(roi)
 
-    if roi_raster.exist() == True:
-        roi_raster.open('r')
-    else:
-        grass.fatal_error("ROI raster does not exist.... exiting")
-
     # load the model
     if model_load != '':
         rf = joblib.load(model_load)
     else:
+        # Check if ROI raster exists if model is not loaded
+        if roi_raster.exist() == True:
+            roi_raster.open('r')
+        else:
+            grass.fatal("ROI raster does not exist.... exiting")
+            
         # determine cell storage type of training roi raster
         roi_type = grass.read_command("r.info", map=roi, flags='g')
         roi_type = normalize_newlines(str(roi_type))
@@ -286,7 +304,7 @@
 
         # check if training rois are valid for classification and regression
         if mode == 'classification' and dtype != 'CELL':
-            grass.fatal_error("Classification mode requires an integer CELL type training roi map.....exiting")
+            grass.fatal("Classification mode requires an integer CELL type training roi map.....exiting")
 
         # Count number of labelled pixels
         roi_stats = str(grass.read_command("r.univar", flags=("g"), map=roi))
@@ -333,47 +351,77 @@
         training_data = training_data[:, 0:nbands]
 
     ############################### Training the classifier #######################################
+
+    # define classifier for classification or regression mode, and balanced or unbalanced datasets
     if model_load == '':
         if mode == 'classification':
             if balanced == True:
                 rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-                class_weight='balanced', max_features=mfeatures, min_samples_split=minsplit, \
-                random_state=randst)
+                                            class_weight='balanced', max_features=mfeatures, min_samples_split=minsplit, \
+                                            random_state=randst)
             else:
                 rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-                max_features=mfeatures, min_samples_split=minsplit, random_state=randst)
-            rf = rf.fit(training_data, training_labels)
+                                            max_features=mfeatures, min_samples_split=minsplit, random_state=randst)
+        else:
+            rf = RandomForestRegressor(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+                                       max_features=mfeatures, min_samples_split=minsplit, random_state=randst)
+
+        # train classifier
+        rf = rf.fit(training_data, training_labels)
+
+        # use cross-validation
+        if cv > 1:
+            grass.message(_("\r\n"))
+            grass.message(_("Global performance measures......:"))
+            if mode == 'classification':
+                scores = scoreresults(training_data, training_labels, rf, cv)
+                grass.message(_("accuracy: \t {mean} \t +/- \t {error}".format(mean=scores[:,0].mean(), error=scores[:,0].std() * 2)))
+                if len(np.unique(training_labels)) == 2:
+                    grass.message(_("roc_auc: \t {mean} \t +/- \t {error}".format(mean=scores[:,1].mean(), error=scores[:,1].std() * 2)))
+                grass.message(_("precision: \t {mean} \t +/- \t {error}".format(mean=scores[:,2].mean(), error=scores[:,2].std() * 2)))
+                grass.message(_("recall: \t {mean} \t +/- \t {error}".format(mean=scores[:,3].mean(), error=scores[:,3].std() * 2)))
+                grass.message(_("kappa: \t {mean} \t +/- \t {error}".format(mean=scores[:,4].mean(), error=scores[:,4].std() * 2)))
+            else:
+                scores = cross_validation.cross_val_score(rf, training_data, training_labels, cv=cv, scoring='r2')
+                grass.message(_("R2: \t {mean} \t +/- \t {error}".format(mean=scores.mean(), error=scores.std())))
+
+        # output internal performance measures
+        grass.message(_("\r\n"))
+        if mode == 'classification':
             acc = float(rf.oob_score_)
             grass.message(_('Our OOB prediction of accuracy is: {oob}%'.format(oob=rf.oob_score_ * 100)))
         else:
-            rf = RandomForestRegressor(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
-            max_features=mfeatures, min_samples_split=minsplit, random_state=randst)
-            rf = rf.fit(training_data, training_labels)
             acc = rf.score(X=training_data, y=training_labels)
             grass.message(_('Our coefficient of determination R^2 of the prediction is: {r2}%'.format \
-            (r2=rf.score(X=training_data, y=training_labels))))
+                            (r2=rf.score(X=training_data, y=training_labels))))
 
         # diagnostics
-        rfimp = pd.DataFrame(rf.feature_importances_)
-        rfimp.insert(loc=0, column='Raster', value=maplist)
-        rfimp.columns = ['Raster', 'Importance']
-        print(rfimp)
+        rfimp = rf.feature_importances_
+        grass.message(_("Random forest feature importance"))
+        grass.message(_("id" + "\t" + "Raster" + "\t" + "Importance"))
+        for i in range(len(rfimp)):
+            grass.message(_(str(i) + "\t" + maplist[i] + "\t" + str(rfimp[i])))
         
-        # save diagnostics to file        
-        if fimportance != '':
-            rfimp.to_csv(path_or_buf = fimportance)
-            fd = open(fimportance, 'a')
-            fd.write(str(rfimp.shape[0]) + ',' + 'Performance measure (OOB or R2)' + ',' + str(acc))
-            fd.close()
-        
         # save the model
         if model_save != '':
             joblib.dump(rf, model_save + ".pkl")
         
         if modelonly == True:
-            exit()
+            grass.fatal("Model built and now exiting")
 
     ################################ Prediction on the rest of the raster stack ###################
+
+    # Create a imagery mask
+    # The input rasters might have different dimensions in terms of value and non-value pixels.
+    # We will use the r.series module to automatically create a mask by propagating the null values
+    rfmask = 'tmp_' + ''.join([random.choice(string.ascii_letters + string.digits) \
+                               for n in xrange(8)])
+    grass.run_command("r.series", output=rfmask, input=maplist, method='count', flags='n')
+
+    global mask_raster
+    mask_raster = RasterRow(rfmask)
+    mask_raster.open('r')
+    
     # 1. Create a np.array that can store each raster row for all of the bands
     # 2. Loop through the raster, row-by-row and get the row values for each band
     #    adding these to the img_np_row np.array,
@@ -458,6 +506,7 @@
                     prob[iclass].put_row(newrow)
 
     classification.close()
+    mask_raster.close()
 
     if class_probabilities == True and mode == 'classification':
         for iclass in range(nclasses): prob[iclass].close()


