[GRASS-SVN] r68159 - in grass-addons/grass7/raster: . r.randomforest

Fri Mar 25 21:26:33 PDT 2016

Author: spawley
Date: 2016-03-25 21:26:32 -0700 (Fri, 25 Mar 2016)
New Revision: 68159

Added:
   grass-addons/grass7/raster/r.randomforest/
   grass-addons/grass7/raster/r.randomforest/Makefile
   grass-addons/grass7/raster/r.randomforest/r.randomforest.html
   grass-addons/grass7/raster/r.randomforest/r.randomforest.py
Log:
Commit of r.randomforest

Added: grass-addons/grass7/raster/r.randomforest/Makefile
===================================================================

--- grass-addons/grass7/raster/r.randomforest/Makefile	                        (rev 0)
+++ grass-addons/grass7/raster/r.randomforest/Makefile	2016-03-26 04:26:32 UTC (rev 68159)
@@ -0,0 +1,7 @@
+MODULE_TOPDIR = ../..
+
+PGM = r.randomforest
+
+include $(MODULE_TOPDIR)/include/Make/Script.make
+
+default: script

Added: grass-addons/grass7/raster/r.randomforest/r.randomforest.html
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.html	                        (rev 0)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.html	2016-03-26 04:26:32 UTC (rev 68159)
@@ -0,0 +1,31 @@
+<h2>DESCRIPTION</h2>
+
+<em><b>r.randomforest</b></em> performs random forest classification and regression on a GRASS imagery group using the scikit learn machine learning python library. This python package, along with python pandas needs to be installed within your GRASS python environment for r.randomforest to work. For linux users, both of these packages are  available through the linux package manager in most distributions. For windows users, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">Osgeo4W</a> installation method of GRASS, where the python setuptools can also be installed. Then, you can download the NumPy-1.10+MKL, scikit-learn and pandas .whl files and install them using easy_install, or pip (which you might have to install with easy_install pip)
+
+<br><br> Random forests (RF) (Breiman, 2001)  represents an ensemble classification and regression tree method. RF constructs a forest of uncorrelated decision trees based on a random subset of predictor variables, which occurs independently at every node split in each tree. Each tree produces a classification, and the forest chooses the classification result which has the most votes over all of the trees. The probability of membership is based on the proportion of votes for each class. RF parameters consisting of the number of trees (ntree) and the number of variables that are available at each node split (mtry) were chosen by assessing the OOB error using different parameter values.
+
+<br><br>RF provides a number of advantages over traditional statistical classifiers because it is non-parametric and can deal with non-linear relationships and non-monotonic responses. Furthermore, continuous and categorical data can be used, and no rescaling is required. Another practical advantage of RF relative to many other machine learning algorithms is that it involves few user-specified parameter choices, principally consisting of the number of trees in the forest (ntrees), and the number of variables that are allowed to be chosen from at each node split (mfeatures), which controls the degree of correlation between the trees. Furthermore, there is no accuracy penalty in having a large number of trees, apart from the cost of increased computational time. However, the performance of RF models typically level off at a certain number of trees, at which point there is no further benefit in terms of error reduction in using a larger forest. If using random forest in regression mode
 , i.e. a continuous type training data are supplied to the classifier, then you can increase the generalization ability of the classifier by increasing minsplit, which represents the minimum number of samples required in order to split a node.
+
+<br><br>An additional feature of RF is that it includes built-in accuracy assessment and variable selection. RF uses the concept of bagging, where a randomly selected 66% subset of the training data are held-out 'out-of-bag' (OOB) in the construction of each tree, and then OOB data are used to evaluate the prediction accuracy. RF scikit learn implementation also includes a measure of variable importance based on the Gini impurity criterion, which measures how each variable contributes to the homogeneity of the nodes, with important variables causing a larger decrease in the Gini coefficient in successive node splits. This variable importance allows the contributions of the individual predictors to be determined. The feature importance scores are output to the command display.
+
+<br><br>Random forest classification like most machine learning methods does not perform well in the case of a large class imbalance. In this case, the classifier will seek to reduce the overall model error, but this will occur by modelling the majority class with a very high accuracy, but at the expense of the minority class, i.e. high sensitivity but low specificity. If you have a highly imbalanced dataset, the 'balanced' flag can be set. The scikit learn implementation balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies.
+
+<h2>NOTES</h2>
+
+<em><b>r.randomforest</b></em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifier, which is multithreaded by default, and results in a stop-start behaviour. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=100</i> should be reasonable for most systems with 4-8 GB of ram. However, if you have a workstation with much larger resources, then <i>lines</i> could be set to a much larger size, including to a value that is equal or greater than the number of rows in the current region setting, in which case the entire image will be loaded into memory to classification.
+
+<br><br> The bootstrapping process involved within random forest causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed uding the <i>randst</i> parameter.
+
+<h2>EXAMPLE</h2>
+
+r.randomforest igroup=landsat output=classification roi=labelled_pixels ntrees=500 mfeatures=-1 minsplit=2 lines=100 randst = 1
+
+<h2>REFERENCES</h2>
+
+Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
+
+<h2>AUTHOR</h2>
+
+Steven Pawley
+
+<p><i>Last changed: $Date: 2016-03-25 22:22:00 -0700 (Sat, 25 Mar 2016) $</i>

Added: grass-addons/grass7/raster/r.randomforest/r.randomforest.py
===================================================================
--- grass-addons/grass7/raster/r.randomforest/r.randomforest.py	                        (rev 0)
+++ grass-addons/grass7/raster/r.randomforest/r.randomforest.py	2016-03-26 04:26:32 UTC (rev 68159)
@@ -0,0 +1,335 @@
+#!/usr/bin/env python
+############################################################################
+#
+# MODULE:       r.randomforest
+# AUTHOR:       Steven Pawley
+# PURPOSE:      Provides supervised random forest classification and regression
+#               (using python scikit-learn and pandas)
+#
+# COPYRIGHT:    (c) 2015 Steven Pawley, and the GRASS Development Team
+#               This program is free software under the GNU General Public
+#               License (>=v2). Read the file COPYING that comes with GRASS
+#               for details.
+#
+#############################################################################
+
+#%module
+#% description: Provides supervised random forest classification
+#% keyword: classification
+#% keyword: machine learning
+#% keyword: scikit-learn
+#% keyword: pandas
+#% keyword: random forests
+#%end
+
+#%option G_OPT_I_GROUP
+#% key: igroup
+#% label: Imagery group to be classified (predictors)
+#% description: Series of raster maps (e.g. imagery bands) to be used in the random forest classification
+#% required: yes
+#% multiple: yes
+#%end
+
+#%option G_OPT_R_INPUT
+#% key: roi
+#% label: Raster map with labelled pixels
+#% description: Raster map with labelled pixels
+#% required: yes
+#%end
+
+#%option G_OPT_R_OUTPUT
+#% key: output
+#% required: yes
+#% label: Output Classification Map
+#%end
+
+#%option
+#% key: ntrees
+#% type: integer
+#% description: Number of trees in the forest
+#% answer: 500
+#% required: yes
+#% guisection: Optional
+#%end
+
+#%option
+#% key: mfeatures
+#% type: integer
+#% description: The number of features to consider when looking for the best split. Sqrt(n_features) is used by default
+#% answer: -1
+#% required: yes
+#% guisection: Optional
+#%end
+
+#%option
+#% key: minsplit 
+#% type: integer
+#% description: The minimum number of samples required to split an internal node
+#% answer: 2
+#% required: yes
+#% guisection: Optional
+#%end
+
+#%option
+#% key: randst 
+#% type: integer
+#% description: Seed to pass onto the random state for reproducible results
+#% answer: 1
+#% required: yes
+#% guisection: Optional
+#%end
+
+#%option
+#% key: lines
+#% type: integer
+#% description: Processing block size in terms of number of rows
+#% answer: 100
+#% required: yes
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: p
+#% label: Output class membership probabilities
+#% guisection: Optional
+#%end
+
+#%flag
+#% key: b
+#% description: Balance classes by weighting
+#% guisection: Optional
+#%end
+
+# user set variables
+import atexit, os, random, string
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.gis.region import Region
+from grass.pygrass.raster.buffer import Buffer
+import grass.script as grass
+from sklearn.ensemble import RandomForestClassifier
+import pandas as pd
+import numpy as np
+
+def cleanup():
+    # We can then close the rasters and the roi image
+    for i in range(nbands): rasstack[i].close()
+    mask_raster.close()
+    roi_raster.close()
+    grass.run_command("g.remove", name = rfmask, flags = "f", type = "raster")
+
+def main():
+    igroup = options['igroup']
+    roi = options['roi']
+    output = options['output']
+    ntrees = options['ntrees']
+    balanced = flags['b']
+    class_probabilities = flags['p']
+    rowincr = int(options['lines'])
+    mfeatures = int(options['mfeatures'])
+    minsplit = int(options['minsplit'])
+    randst = int(options['randst'])
+
+    if mfeatures == -1: mfeatures = str('auto')
+
+    # Fetch individual raster names from group
+    groupmaps = grass.read_command("i.group", group = igroup, flags = "g")
+
+    if os.name == "nt":
+        groupmaps = groupmaps[0:len(groupmaps)-2] # to remove the last line ending and return characters
+        maplist = groupmaps.split('\r\n')
+    else:
+        groupmaps = groupmaps[0:len(groupmaps)-1] # to remove the last line ending and return characters
+        maplist = groupmaps.split('\n')
+    
+    ######################### Obtain information about GRASS rasters to be classified ######################
+    
+    # Determine number of bands (i.e. elements in the list) and then create a list of GRASS rasterrow objects
+    global nbands    
+    nbands = len(maplist)
+    
+    global rasstack
+    rasstack = [0] * nbands
+    for i in range(nbands):
+        rasstack[i] = RasterRow(maplist[i])
+    
+    # Check to see if each raster in the list exists
+    for i in range(nbands):
+        if rasstack[i].exist() == True:
+            rasstack[i].open('r')
+        else:
+            print("GRASS raster " + maplist[i] + " does not exist.... exiting")
+            exit()
+    
+    # Use grass.pygrass.gis.region to get information about the current region, particularly
+    # the number of rows and columns. We are going to sample and classify the data row-by-row,
+    # and not load all of the rasters into memory in a numpy array 
+    current = Region()
+    
+    ########################### Create a imagery mask ###################################################
+    # The input rasters might have slightly different dimensions in terms of value and non-value pixels.
+    # We will use the GRASS r.series module to automatically create a mask by propagating the null values
+    
+    global rfmask
+    rfmask = 'tmp_' + ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(8)])
+    grass.run_command("r.series", output = rfmask, input = maplist, method = 'count', flags = 'n', overwrite=True)
+
+    global mask_raster    
+    mask_raster = RasterRow(rfmask)
+    mask_raster.open('r')
+    
+    ######################### Sample training data using training ROI ##################################
+    global roi_raster    
+    roi_raster = RasterRow(roi)
+    
+    if roi_raster.exist() == True:
+        roi_raster.open('r')
+    else:
+        print("ROI raster does not exist.... exiting")
+        exit()
+    
+    # Count number of labelled pixels
+    tdir = grass.tempdir()
+    tfile = tdir + '/' + 'rstats.csv'
+    
+    grass.run_command("r.univar", flags=("gt"), map = roi, separator = 'comma', output = tfile)
+    roi_stats = pd.read_csv(tfile)
+    roi_stats
+    nlabel_pixels = roi_stats['non_null_cells'][0]
+    
+    # Create a numpy array filled with zeros, with the dimensions of the number of columns in the region
+    # and the number of bands plus an additional band to attach the labels
+    
+    tindex=0
+    training_labels = []
+    training_data = np.zeros((nlabel_pixels, nbands+1))
+    training_data[:] = np.NAN
+    
+    # Loop through each row of the raster, get the pixels from that row in the ROI raster
+    # and check if any of those pixels are labelled (i.e. they are not nan).
+    # If labelled pixels are encountered, loop through each imagery band for that row and put the data into
+    # the img_row_band_np array. Also attach the label values onto the last column.
+    
+    for row in range(current.rows):
+        roi_row_np = roi_raster[row]
+        is_train = np.nonzero(roi_row_np > -2147483648)
+        training_labels = np.append(training_labels, roi_row_np[is_train])
+        nlabels_in_row = np.array(is_train).shape[1]
+        if np.isnan(roi_row_np).all() != True:
+            for band in range(nbands):
+                imagerow_np = rasstack[band][row]
+                training_data[tindex : tindex+nlabels_in_row, band] = imagerow_np[is_train]
+            tindex = tindex + nlabels_in_row
+    
+    # Determine the number of class labels using np.unique
+    nclasses = len(np.unique(training_labels))
+    
+    # attach training label values onto last dimension of numpy array    
+    training_data[0:nlabel_pixels, nbands] = training_labels
+    
+    # Remove nan rows from numpy array. Explanation: np.isnan(a) returns a similar array with True
+    # where NaN, False elsewhere. .any(axis=1) reduces an m*n array to n with an logical or operation
+    # on the whole rows, ~ inverts True/False and a[  ] chooses just the rows from the original array,
+    # which have True within the brackets.
+    
+    training_data = training_data[~np.isnan(training_data).any(axis=1)]
+    
+    # Split the numpy array into training_labels and training_data arrays to pass to the classifier
+    training_labels = training_data[:, nbands]
+    training_data = training_data[:, 0:nbands]
+    
+    ############################### Training the classifier #######################################
+    if balanced == True:
+        rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+        class_weight = "balanced", max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+    else:
+        rf = RandomForestClassifier(n_jobs=-1, n_estimators=int(ntrees), oob_score=True, \
+        max_features = mfeatures, min_samples_split = minsplit, random_state = randst)
+    rf = rf.fit(training_data, training_labels)
+    print('Our OOB prediction of accuracy is: {oob}%'.format(oob=rf.oob_score_ * 100))
+    rfimp = pd.DataFrame(rf.feature_importances_)
+    rfimp.insert(loc=0, column='Raster', value = maplist)
+    rfimp.columns = ['Raster', 'Importance']
+    print(rfimp)
+
+    ################################ Prediction on the rest of the raster stack ###################
+    # Create a np.array that can store each raster row for all of the bands, i.e. it is a long as the current columns,
+    # and as wide as the number of bands
+    # Loop through the raster, row-by-row and get the row values for each band, adding these to the img_np_row np.array,
+    # which adds rowincr rows together to pass to the classifier. Otherwise, row-by-row is too inefficient.
+    
+    # The scikit learn predict function expects a list of pixels, not an NxM matrix. We therefore need to reshape each row
+    # matrix into a list. The total matrix size = cols * nbands. Therefore we can use the np.reshape function to convert
+    # the image into a list with the number of rows equal to n_samples, and the number of columns equal to the number of bands. 
+    # Then we remove any NaN values because the scikit-learn predict function cannot handle NaNs. Here we replace them with a
+    # small value using the np.nan_to_num function.
+    # The flat_pixels is then passed onto the prediction function. After the prediction is performed on the row, to save keeping
+    # anything in memory, we save it to a GRASS raster object, row-by-row.
+
+    classification = RasterRow(output)
+    classification.open('w', 'CELL',  overwrite = True)
+
+    # create and open RasterRow objects for classification and probabilities if enabled    
+    if class_probabilities == True:
+        prob_out_raster = [0] * nclasses
+        prob = [0] * nclasses
+        for iclass in range(nclasses):
+            prob_out_raster[iclass] = output + '_p' + str(iclass)
+            prob[iclass] = RasterRow(prob_out_raster[iclass])
+            prob[iclass].open('w', 'FCELL',  overwrite = True)
+
+    for rowblock in range(0, current.rows, rowincr):
+        # check that the row increment does not exceed the number of rows
+        if rowblock+rowincr > current.rows: rowincr = current.rows - rowblock
+        img_np_row = np.zeros((rowincr, current.cols, nbands))
+        mask_np_row = np.zeros((rowincr, current.cols))
+        
+        # loop through each row, and each band and add these values to the 2D array img_np_row
+        for row in range(rowblock, rowblock+rowincr, 1):
+            mask_np_row[row-rowblock, :] = np.array(mask_raster[row])
+            for band in range(nbands):
+                img_np_row[row-rowblock, :, band] = np.array(rasstack[band][row])
+                
+        mask_np_row[mask_np_row == -2147483648] = np.nan
+        nanmask = np.isnan(mask_np_row) # True in the mask means invalid data
+        
+        # reshape each row-band matrix into a list
+        nsamples = rowincr * current.cols
+        flat_pixels = img_np_row.reshape((nsamples, nbands))
+        
+        # remove NaN values and perform the prediction
+        flat_pixels_noNaN = np.nan_to_num(flat_pixels)
+        result = rf.predict(flat_pixels_noNaN)
+        result = result.reshape((rowincr, current.cols))
+        
+        # replace NaN values so that the prediction surface does not have a border
+        result_NaN = np.ma.masked_array(result, mask=nanmask, fill_value=np.nan)
+        result_masked = result_NaN.filled([-2147483648]) #Return a copy of result, with masked values filled with a given value
+        
+        # for each row we can perform computation, and write the result into   
+        for row in range(rowincr):
+            newrow = Buffer((result_masked.shape[1],), mtype='CELL')
+            newrow[:] = result_masked[row, :]
+            classification.put_row(newrow)
+        
+        # same for probabilities
+        if class_probabilities == True:
+            result_proba = rf.predict_proba(flat_pixels_noNaN)
+            for iclass in range(nclasses):
+                result_proba_class = result_proba[:, iclass]
+                result_proba_class = result_proba_class.reshape((rowincr, current.cols))
+                result_proba_class_NaN = np.ma.masked_array(result_proba_class, mask=nanmask, fill_value=np.nan)
+                result_proba_class_masked = result_proba_class_NaN.filled([np.nan]) #Return a copy of result, with masked values filled with a given value
+                for row in range(rowincr):
+                    newrow = Buffer((result_proba_class_masked.shape[1],), mtype='FCELL')
+                    newrow[:] = result_proba_class_masked[row, :]
+                    prob[iclass].put_row(newrow)
+    
+    classification.close()
+
+    if class_probabilities == True:
+        for iclass in range(nclasses): prob[iclass].close()
+    
+if __name__ == "__main__":
+    options, flags = grass.parser()
+    atexit.register(cleanup)
+    main()