[GRASS-SVN] r68785 - grass-addons/grass7/raster/r.randomforest

Mon Jun 27 11:16:25 PDT 2016

Author: spawley
Date: 2016-06-27 11:16:25 -0700 (Mon, 27 Jun 2016)
New Revision: 68785

Modified:
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
   grass-addons/grass7/raster/r.randomforest/r.randomforest.py
Log:


Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================

--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-06-27 12:15:41 UTC (rev 68784)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-06-27 18:16:25 UTC (rev 68785)
@@ -25,7 +25,7 @@
 For MS-Windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'.
 
 <p>
-<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=50</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size (including to a value that is equal or greater than the number of rows in the current region setting) in which case the entire image will be loaded into memory to classification.
+<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size (including to a value that is equal or greater than the number of rows in the current region setting) in which case the entire image will be loaded into memory to classification.
 
 <br><br> The bootstrapping process involved within random forests also causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.
 
@@ -54,7 +54,7 @@
 <br><br>Then we can use these training pixels to perform a classification on the more recently obtained landsat 7 image:
 <div class="code"><pre>
 r.randomforest igroup=lsat7_2000 roi=landclass96_roi output=rf_classification \
-  mode=classification ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
+  mode=classification ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=25
 
 # copy category labels from landclass training map to result
 r.category rf_classification raster=landclass96_roi

Modified: grass-addons/grass7/raster/r.randomforest/r.randomforest.py
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-06-27 12:15:41 UTC (rev 68784)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-06-27 18:16:25 UTC (rev 68785)
@@ -3,13 +3,13 @@
 #
 # MODULE:       r.randomforest
 # AUTHOR:       Steven Pawley
-# PURPOSE:      Provides supervised random forest classification and regression
-#               (using python scikit-learn)
+# PURPOSE:    Provides supervised random forest classification and regression
+#                      using python scikit-learn)
 #
-# COPYRIGHT:    (c) 2015 Steven Pawley, and the GRASS Development Team
-#               This program is free software under the GNU General Public
-#               License (>=v2). Read the file COPYING that comes with GRASS
-#               for details.
+# COPYRIGHT: (c) 2015 Steven Pawley, and the GRASS Development Team
+#                      This program is free software under the GNU General Public
+#                      License (>=v2). Read the file COPYING that comes with GRASS
+#                      for details.
 #
 #############################################################################
 
@@ -100,7 +100,7 @@
 #% key: lines
 #% type: integer
 #% description: Processing block size in terms of number of rows
-#% answer: 50
+#% answer: 25
 #% required: yes
 #% guisection: Random Forest Options
 #%end
@@ -141,49 +141,24 @@
 #% exclusive: roi,loadfile
 #%end
 
-# standard modules
-import atexit, random, string, imp, re
+# import standard modules
+import atexit, random, string, re, os
 from grass.pygrass.raster import RasterRow
 from grass.pygrass.gis.region import Region
 from grass.pygrass.raster.buffer import Buffer
 import grass.script as grass
 import numpy as np
-
-# non-standard modules
-def module_exists(module_name):
-    try:
-        imp.find_module(module_name)
-        return True
-    except ImportError:
-        grass.fatal("Python package <%s> not installed (python-sklearn). Exiting" % module_name)
-        return False
-        
-# lazy imports
-if module_exists("sklearn") == True:
-    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
-    from sklearn.externals import joblib
-    from sklearn import cross_validation, metrics
-    import warnings
-    warnings.filterwarnings("ignore")
-else:
-    grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting")
-
+       
 def cleanup():
-    # We can then close the rasters and the roi image
+    # close the GRASS raster objects and the roi raster object
     for i in range(nbands): rasstack[i].close()
     roi_raster.close()
     if rfmask != '': grass.run_command("g.remove", name=rfmask, flags="f",
                                        type="raster")
 
-def normalize_newlines(string):
-    # Windows uses carriage return and line feed ("\r\n") as a line ending
-    # Unix uses just line feed ("\n"). This function normalizes strings to "\n"
-    return re.sub(r'(\r\n|\r|\n)', '\n', string)
-
 def score_classification_results(X, y, clf, kfolds, rstate):
-    # This custom function performs cross validation on a classification model,
+    # PURPOSE: custom function performs cross validation on a classification model,
     # RETURNS: a 1D list representing accuracy, AUROC, precision, recall, kappa and specificity
-    # (using our custom function)
         
     kf = cross_validation.KFold(len(y), n_folds=kfolds, shuffle=True, random_state=rstate)
 
@@ -211,10 +186,14 @@
     return(acc, auc, kappa)
 
 def score_regression_results(X, y, clf, kfolds, rstate):
-    # This custom function performs cross validation on a regression model,
+    # PURPOSE: performs cross validation on a regression model
     # RETURNS: single value of R2
     
-    from sklearn import cross_validation, metrics
+    try:
+        from sklearn import cross_validation, metrics
+    except:
+        grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting") 
+        
     kf = cross_validation.KFold(len(y), n_folds=kfolds, shuffle=True, random_state=rstate)
 
     pr = []
@@ -230,6 +209,7 @@
     return(pr)
 
 def cv_performance_byClass(X, y, clf, kfolds, rstate):
+    # PURPOSE: k-fold cross-validation performance measures by class
     # RETURNS: 2D list of CLASS and mean performance measure result
     # RETURNS: 2D list of CLASS and standard deviation of performance measure result
     # Performance measures are sensitivity, recall
@@ -287,7 +267,10 @@
     model_save = options['savefile']
     model_load = options['loadfile']
 
-    ##################### error checking for valid input parameters ################################
+    """
+    Error checking for valid input parameters
+    """
+   
     if mfeatures == -1:
         mfeatures = str('auto')
     if mfeatures == 0:
@@ -311,15 +294,17 @@
     if model_load == '' and roi == '':
         grass.fatal("Require labelled pixels regions of interest.....exiting")
 
-    ######################  Fetch individual raster names from group ###############################
+    """
+    Obtain information about GRASS rasters to be classified
+    """
+
+   # fetch individual raster names from group
+    
     groupmaps = grass.read_command("i.group", group=igroup, flags="g")
-    groupmaps = normalize_newlines(groupmaps)
-    maplist = groupmaps.split('\n')
+    maplist = groupmaps.split(os.linesep)
     maplist = maplist[0:len(maplist)-1]
 
-    ######################### Obtain information about GRASS rasters to be classified ##############
-
-    # Determine number of bands and then create a list of GRASS rasterrow objects
+    # determine number of bands and then create a list of GRASS rasterrow objects
     global nbands
     nbands = len(maplist)
 
@@ -328,23 +313,34 @@
     for i in range(nbands):
         rasstack[i] = RasterRow(maplist[i])
 
-    # Check to see if each raster in the list exists
+    # check to see if each raster in the list exists
     for i in range(nbands):
         if rasstack[i].exist() == True:
             rasstack[i].open('r')
         else:
             grass.fatal("GRASS raster " + maplist[i] + " does not exist.... exiting")
 
-    # Use grass.pygrass.gis.region to get information about the current region, particularly
-    # the number of rows and columns. We are going to sample and classify the data row-by-row,
-    # and not load all of the rasters into memory in a numpy array
+    # use grass.pygrass.gis.region to get information about the current region, particularly
     current = Region()
 
     # Define name of mask raster
     global rfmask
     rfmask = ''
+
+    # lazy import of sklearn
+    try:
+        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
+        from sklearn.externals import joblib
+        from sklearn import cross_validation, metrics
+        import warnings
+        warnings.filterwarnings("ignore")
+    except:
+        grass.fatal("Scikit-learn python module (python-sklearn) is not installed.....exiting")
     
-    ######################### Sample training data using training ROI ##############################
+    """
+    Sample training data using training ROI
+    """
+    
     global roi_raster
     roi_raster = RasterRow(roi)
 
@@ -360,8 +356,8 @@
             
         # determine cell storage type of training roi raster
         roi_type = grass.read_command("r.info", map=roi, flags='g')
-        roi_type = normalize_newlines(str(roi_type))
-        roi_list = roi_type.split('\n')
+        roi_list = str(roi_type).split(os.linesep)
+        roi_type.split('\n')
         dtype = roi_list[9].split('=')[1]
 
         # check if training rois are valid for classification and regression
@@ -370,8 +366,7 @@
 
         # Count number of labelled pixels
         roi_stats = str(grass.read_command("r.univar", flags=("g"), map=roi))
-        roi_stats = normalize_newlines(roi_stats)
-        roi_stats = roi_stats.split('\n')[0]
+        roi_stats = roi_stats.split(os.linesep)[0]
         ncells = str(roi_stats).split('=')[1]
         nlabel_pixels = int(ncells)
 
@@ -412,7 +407,9 @@
         training_labels = training_data[:, nbands]
         training_data = training_data[:, 0:nbands]
 
-    ############################### Training the classifier #######################################
+    """
+    Train the randomforest classifier and perform cross-validation
+    """
 
     # define classifier for classification or regression mode, and balanced or unbalanced datasets
     if model_load == '':
@@ -485,11 +482,13 @@
         if modelonly == True:
             grass.fatal("Model built and now exiting")
 
-    ################################ Prediction on the rest of the raster stack ###################
+    """
+    Prediction on the rest of the GRASS rasters in the imagery group
+    """
 
-    # Create a imagery mask
-    # The input rasters might have different dimensions in terms of value and non-value pixels.
-    # We will use the r.series module to automatically create a mask by propagating the null values
+    # create a imagery mask
+    # the input rasters might have different dimensions in terms of value and non-value pixels.
+    # r.series used to automatically create a mask by propagating the null values
     rfmask = 'tmp_' + ''.join([random.choice(string.ascii_letters + string.digits) \
                                for n in xrange(8)])
     grass.run_command("r.series", output=rfmask, input=maplist, method='count', flags='n')
@@ -498,22 +497,24 @@
     mask_raster = RasterRow(rfmask)
     mask_raster.open('r')
     
-    # 1. Create a np.array that can store each raster row for all of the bands
-    # 2. Loop through the raster, row-by-row and get the row values for each band
-    #    adding these to the img_np_row np.array,
-    #    which bundles rowincr rows together to pass to the classifier,
-    #    otherwise, row-by-row is too inefficient.
-    # 3. The scikit learn predict function expects a list of pixels, not an NxM matrix.
-    #    We therefore need to reshape each row matrix into a list.
-    #    The total matrix size = cols * nbands.
-    #    Therefore we can use the np.reshape function to convert the image into a list
-    #    with the number of rows = n_samples, and the number of columns = the number of bands.
-    # 4. Then we remove any NaN values because the scikit-learn predict function cannot handle NaNs.
-    #    Here we replace them with a small value using the np.nan_to_num function.
-    # 5. The flat_pixels is then passed onto the prediction function.
-    # 6. After the prediction is performed on the row, to save keeping
-    #    anything in memory, we save it to a GRASS raster object, row-by-row.
-
+    """
+    PROCEDURE
+    1. Create a np.array that can store each raster row for all of the bands
+    2. Loop through the raster, row-by-row and get the row values for each band
+        adding these to the img_np_row np.array,
+        which bundles rowincr rows together to pass to the classifier
+    3. The scikit learn predict function expects a list of pixels, not an NxM matrix.
+        We therefore need to reshape each row matrix into a list.
+        The total matrix size = cols * nbands.
+        Therefore we can use the np.reshape function to convert the image into a list
+        with the number of rows = n_samples, and the number of columns = the number of bands.
+    4. Then we remove any NaN values because the scikit-learn predict function cannot handle NaNs.
+        Here we replace them with a small value using the np.nan_to_num function.
+    5. The flat_pixels is then passed onto the prediction function.
+    6. After the prediction is performed on the row, to save keeping
+        anything in memory, we save it to a GRASS raster object, row-by-row.
+    """
+    # create and open RasterRow objects for classification
     classification = RasterRow(output)
     if mode == 'classification':
         ftype = 'CELL'
@@ -523,15 +524,19 @@
         nodata = np.nan
     classification.open('w', ftype, overwrite=True)
 
-    # create and open RasterRow objects for classification and probabilities if enabled
+    # create and open RasterRow objects for  probabilities if enabled
     if class_probabilities == True and mode == 'classification':
         prob_out_raster = [0] * nclasses
         prob = [0] * nclasses
         for iclass in range(nclasses):
-            prob_out_raster[iclass] = output + '_p' + str(class_list[iclass])
+            prob_out_raster[iclass] = output + '_classPr' + str(int(class_list[iclass]))
             prob[iclass] = RasterRow(prob_out_raster[iclass])
             prob[iclass].open('w', 'FCELL', overwrite=True)
 
+    """
+    Prediction using row blocks
+    """
+
     for rowblock in range(0, current.rows, rowincr):
         # check that the row increment does not exceed the number of rows
         if rowblock+rowincr > current.rows: rowincr = current.rows - rowblock