[GRASS-SVN] r70253 - grass-addons/grass7/raster/r.learn.ml

svn_grass at osgeo.org
Wed Jan 4 15:11:03 PST 2017


Author: spawley
Date: 2017-01-04 15:11:03 -0800 (Wed, 04 Jan 2017)
New Revision: 70253

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
Log:
'fixed bug with earth classifier and manual page'

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-04 22:19:06 UTC (rev 70252)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-04 23:11:03 UTC (rev 70253)
@@ -18,12 +18,12 @@
 	<li>The <em>SVC</em> model is the C-Support Vector Classifier. Only a linear kernel is supported, because non-linear kernels in scikit-learn are too slow to be practical for typical remote sensing and spatial analysis datasets, which consist of large numbers of samples. This classifier can still be slow for large datasets.</li>
 	
 	<li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are Python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which can optionally be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression, which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
-</ol>
+</ul>
 
-<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The classifier defaults are supplied but these parameters can be automatically tuning using a randomized search by setting the <em>n_iter</em> option to &gt 1. Parameter tuning can be accomplished simultaneously with nested cross-validation by also settings the <em>cv</em> option to &gt > 1. The parameters consist of:
+<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The classifier defaults are supplied, but these parameters can be automatically tuned using a randomized search by setting the <em>n_iter</em> option to &gt; 1 (a scikit-learn sketch of this search follows the list below). Parameter tuning can be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to &gt; 1. The parameters consist of:</p>
 
 <ul>
-	<li><em>C</em> is the inverse of the regularization strength, which is when a penalty is applied to avoid overfitting. <i>C</i> applies to the LogisticRegression and SVC models. Tuning occurs over the range of 1-1000. </li>
+	<li><em>C</em> is the inverse of the regularization strength; regularization applies a penalty to the model to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models. Tuning occurs over the range of 1-1000.</li>
 	
 	<li><em>n_estimators</em> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. Tuning occurs over 50-500 for gradient boosting, whereas <em>n_estimators</em> is not tuned for random forests.</li>
 	
@@ -33,25 +33,26 @@
 	
 	<li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. <em>learning_rate</em> is tuned over 0.001-0.1, and <em>subsample</em> is tuned over 0-1.0.</li>
 	
-	<li>Parameters relating to the Earth classifier consist of: <em>max_degree</em> which is the maximum degree of terms generated by the forward pass; <em>penalty</em> is a smoothing parameter; and <em>minspan_alpha</em> is the probability between 0 and 1 that controls the number of data points between knots. These are tuned over 1-10 for <em>max_degree</em>, 0.5-5 for <em>penalty</em>, and 0.05-1 for <em>minspan_alpha.</em></li>
+	<li>Parameters relating to the Earth classifier consist of: <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass; <em>penalty</em>, which is a smoothing parameter; and <em>minspan_alpha</em>, the probability between 0 and 1 that controls the number of data points between knots. These are tuned over 1-5 for <em>max_degree</em>, 0.5-2 for <em>penalty</em>, and 0-1 for <em>minspan_alpha</em>.</li>
+</ul>
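
As noted above, the randomized tuning and nested cross-validation map onto scikit-learn's RandomizedSearchCV combined with cross_val_score. A minimal sketch, assuming scikit-learn is installed; the classifier, data and exact distributions here are illustrative, not the module's internals:

    # Randomized tuning (inner cv) nested inside an outer cross-validation
    import numpy as np
    from scipy.stats import randint, uniform
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import RandomizedSearchCV, cross_val_score

    X = np.random.rand(200, 5)           # synthetic predictors
    y = np.random.randint(0, 2, 200)     # synthetic binary labels

    param_distributions = {'n_estimators': randint(50, 500),
                           'learning_rate': uniform(0.001, 0.1),
                           'subsample': uniform(0.0, 1.0)}

    search = RandomizedSearchCV(GradientBoostingClassifier(),
                                param_distributions, n_iter=10, cv=3)
    scores = cross_val_score(search, X, y, cv=3)   # outer loop scores the tuned model
    print(search.fit(X, y).best_params_, scores.mean())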
 
-<p>In addition to model fitting and prediction, feature selection can be performed using the <i>f</i> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and permutation; (3) repeating step 2 for <i>n_permutations</i>; (4) averaging the results. Steps 1-4 are repeated on each k partition. The feature importance represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
+<p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each k partition. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
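
The permutation procedure is straightforward to express with scikit-learn primitives. A minimal sketch for a single train/test split (the module repeats this across the k partitions; the data and classifier here are illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X = np.random.rand(300, 4)
    y = np.random.randint(0, 3, 300)        # multiclass, so accuracy is the metric
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    baseline = accuracy_score(y_test, clf.predict(X_test))

    n_permutations = 25
    importances = np.zeros(X.shape[1])
    for var in range(X.shape[1]):
        scores = []
        for _ in range(n_permutations):
            X_perm = X_test.copy()
            np.random.shuffle(X_perm[:, var])    # permute one predictor
            scores.append(accuracy_score(y_test, clf.predict(X_perm)))
        importances[var] = baseline - np.mean(scores)   # mean decrease in performance
    print(importances)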
 
-<p>Cross validation can be performed by setting the <i>cv</i> parameters to &gt 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <i>cvtype</i> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <i>group_raster</i> option.</p>
+<p>Cross-validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <em>cvtype</em> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, in which case cross-validation will effectively be performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
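
One way to realise the 'kmeans' partitioning in scikit-learn is to cluster the pixel coordinates and feed the resulting group ids to a grouped cross-validation splitter. A sketch with illustrative coordinates and data (GroupKFold is one possible splitter, not necessarily the module's choice):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    coords = np.random.rand(300, 2) * 1000   # x, y coordinates of training pixels
    X = np.random.rand(300, 4)               # predictor values at those pixels
    y = np.random.randint(0, 2, 300)

    group_ids = KMeans(n_clusters=5).fit_predict(coords)   # spatial groups
    scores = cross_val_score(RandomForestClassifier(), X, y,
                             groups=group_ids, cv=GroupKFold(n_splits=5))
    print(scores.mean())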
 
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as LogisticRegression and SVC may not perform optimally if some predictors have variances that are orders of magnitude larger than others, and will therefore dominate the objective function. The <i>s</i> flag can be used to add a standardization preprocessing step to the classification and prediction, which will standardize each predictor relative to its standard deviation.</p>
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as LogisticRegression and SVC may not perform optimally if some predictors have variances that are orders of magnitude larger than others, and will therefore dominate the objective function. The <em>-s</em> flag can be used to add a standardization preprocessing step to the classification and prediction, which will standardize each predictor relative to its standard deviation.</p>
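
Conceptually, the -s flag corresponds to prepending a StandardScaler to the estimator in a scikit-learn Pipeline. A minimal sketch (illustrative data; not the module's exact code):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = np.random.rand(100, 3) * [1, 100, 10000]   # predictors on wildly different scales
    y = np.random.randint(0, 2, 100)

    model = Pipeline([('scaler', StandardScaler()),    # zero mean, unit variance
                      ('svc', SVC(kernel='linear'))])
    model.fit(X, y)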
 
 <p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
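
A standard way to persist a fitted scikit-learn estimator between runs is joblib, which is presumably what underlies the save/load options here; a sketch with an illustrative file name and data:

    import numpy as np
    from sklearn.externals import joblib            # bundled with 2017-era scikit-learn
    from sklearn.ensemble import RandomForestClassifier

    X = np.random.rand(100, 4)
    y = np.random.randint(0, 2, 100)

    clf = RandomForestClassifier().fit(X, y)
    joblib.dump(clf, 'model.pkl')    # fit on one imagery group, save

    clf = joblib.load('model.pkl')   # reload later to predict on another group
    print(clf.predict(np.random.rand(10, 4)))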
 
-<p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <i>save_training</i> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
+<p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <em>save_training</em> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
 
 <h2>NOTES</h2>
 
 <p><em>r.learn.ml</em> uses the "scikit-learn" machine learning Python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
 
-<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This however does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <i>lines</i> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <i>Lines=25</i> should be reasonable for most systems with 4-8 GB of ram. The row-by-row access however results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read each predictor into memory at a time. If this still exceeds the system memory then the <i>l</i> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
+<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This, however, does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>Lines=25</em> should be reasonable for most systems with 4-8 GB of RAM. The row-by-row access, however, results in slow performance when sampling the imagery group to build the training data set. Instead, the default behaviour is to read one predictor into memory at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
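
The memmap behaviour can be pictured with plain numpy: stage a predictor in a disk-backed array and slice it in lines-sized chunks without holding the whole raster in RAM. A sketch with illustrative shapes and file name:

    import numpy as np

    rows, cols = 50000, 1000   # a raster larger than comfortable for RAM
    data = np.memmap('predictor.dat', dtype='float32', mode='w+',
                     shape=(rows, cols))
    data[:] = 1.0              # in practice filled row-by-row from the raster
    data.flush()

    # reopen read-only; slices are paged in from disk on demand
    data = np.memmap('predictor.dat', dtype='float32', mode='r',
                     shape=(rows, cols))
    block = np.array(data[0:25, :])   # e.g. one lines=25 chunk for prediction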
 
-<p>Many of the classifiers involve a random process which can causes a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <i>randst</i> parameter.</p>
+<p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>randst</em> parameter.</p>
 
 <h2>NOTES</h2>
 
@@ -100,4 +101,4 @@
 
 Steven Pawley
 
-<p><i>Last changed: $Date: 2017-01-02 16:05:45 -0700 (Mon, 02 Jan 2017) $</i></p>
\ No newline at end of file
+<p><em>Last changed: $Date: 2017-01-04 15:34:42 -0700 (Wed, 04 Jan 2017) $</em></p>
\ No newline at end of file

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-04 22:19:06 UTC (rev 70252)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-04 23:11:03 UTC (rev 70253)
@@ -845,7 +845,9 @@
 
             # Combine Earth with LogisticRegression in a pipeline to do classification
             earth_classifier = Pipeline([('Earth',
-                Earth()), ('Logistic', LogisticRegression())])
+                Earth(max_degree=max_degree,
+                      penalty=penalty,
+                      minspan_alpha=minspan_alpha)), ('Logistic', LogisticRegression())])
 
             classifiers = {'EarthClassifier': earth_classifier,
                            'EarthRegressor': Earth(max_degree=max_degree,
@@ -925,7 +927,10 @@
     SVCOpts = {'C': randint(1, 100), 'shrinking': [True, False]}
     EarthOpts = {'max_degree': randint(1,10),
                  'penalty': uniform(0.5, 5),
-                 'minspan_alpha': uniform(0.05, 1)}
+                 'minspan_alpha': uniform(0.05, 1.0)}
+    EarthClassifierOpts = {'Earth__max_degree': randint(1,5),
+                         'Earth__penalty': uniform(0.5, 5),
+                         'Earth__minspan_alpha': uniform()}
 
     param_grids = {
         'SVC': SVCOpts,
@@ -939,7 +944,7 @@
         'GaussianNB': {},
         'LinearDiscriminantAnalysis': {},
         'QuadraticDiscriminantAnalysis': {},
-        'EarthClassifier': EarthOpts,
+        'EarthClassifier': EarthClassifierOpts,
         'EarthRegressor': EarthOpts
     }
 


