[GRASS-SVN] r70289 - grass-addons/grass7/raster/r.learn.ml

Fri Jan 6 13:34:55 PST 2017

Author: spawley
Date: 2017-01-06 13:34:54 -0800 (Fri, 06 Jan 2017)
New Revision: 70289

Modified:
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
   grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
Log:
'added caveats to manual re. onehot encoding'

Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================

--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-06 20:35:56 UTC (rev 70288)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html	2017-01-06 21:34:54 UTC (rev 70289)
@@ -40,7 +40,7 @@
 
 <p>Cross validation can be performed by setting the <em>cv</em> parameter to &gt; 1. Cross-validation is performed using stratified kfolds, and multiple global and per-class accuracy measures are produced. Also note that this cross-validation is performed on a pixel basis. If there is a strong autocorrelation between pixels (i.e. the pixels represent polygons) then the training/test splits will not represent independent samples and will overestimate the accuracy. In this case, the <em>cvtype</em> parameter can be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, and then cross-validation will be effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into groups by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
 
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as LogisticRegression and SVC may not perform optimally if some predictors have variances that are orders of magnitude larger than others, and will therefore dominate the objective function. The <em>-s</em> flag can be used to add a standardization preprocessing step to the classification and prediction, which will standardize each predictor relative to its standard deviation. Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (i.e. decision trees) but generally, categorical predictors need to be converted to a suite of binary using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables.</p>
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (e.g. decision trees) but generally, categorical predictors need to be converted to a suite of binary variables using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training data sample all of the categories in the rasters, otherwise the onehot-encoding will fail during prediction. An alternative approach is to onehot-encode the categorical rasters manually (i.e. create a series of raster maps coded to 0 and 1 for each category value) and use these in the imagery group.</p>
 
 <p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
 

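The manual alternative mentioned in the revised manual text (one 0/1 raster per category value) can be sketched in plain Python as below. This is only an illustration on nested lists: the function name onehot_layers and the sample landcover grid are hypothetical and are not part of r.learn.ml. In GRASS the equivalent maps would normally be produced with raster algebra (for example r.mapcalc) rather than in Python.

```python
def onehot_layers(raster, categories=None):
    """Split a categorical grid into one 0/1 grid per category value.

    raster is a list of rows; categories defaults to the values present.
    """
    if categories is None:
        categories = sorted({value for row in raster for value in row})
    return {
        cat: [[1 if value == cat else 0 for value in row] for row in raster]
        for cat in categories
    }


# Hypothetical 2 x 3 categorical raster with category values 1, 2 and 3
landcover = [
    [1, 1, 3],
    [2, 3, 3],
]

layers = onehot_layers(landcover)
for cat, grid in layers.items():
    print(cat, grid)
```

Because the category list can be passed explicitly, the same set of binary layers can be created for every imagery group, whether or not the training pixels happen to sample each category.
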
Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-06 20:35:56 UTC (rev 70288)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py	2017-01-06 21:34:54 UTC (rev 70289)
@@ -771,7 +771,14 @@
 
             # onehot-encoding
             if self.enc is not None:
-                flat_pixels = self.enc.transform(flat_pixels)
+                try:
+                    flat_pixels = self.enc.transform(flat_pixels)
+                except ValueError:
+                    # if this fails it is because the onehot-encoder was fitted
+                    # on the training samples, but the prediction data contains
+                    # new values, i.e. the training data has not sampled all of
+                    # the categories
+                    grass.fatal('There are values in the categorical rasters that are not present in the training data set, i.e. the training data has not sampled all of the categories')
             
             # rescale
             if self.scaler is not None:
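
The failure that this change reports can also be detected before prediction by comparing the category values seen in training with those present in the prediction data. The following is a minimal sketch under that assumption, not code from r.learn.ml: the function unseen_categories and the sample values are hypothetical, and flat lists stand in for the pixel values of one categorical raster.

```python
def unseen_categories(training_values, prediction_values):
    """Return the sorted category values found only in the prediction data."""
    return sorted(set(prediction_values) - set(training_values))


# Hypothetical category values sampled by the training pixels
training = [1, 2, 2, 3]
# Hypothetical category values present in the raster used for prediction
prediction = [1, 2, 3, 4, 4, 7]

missing = unseen_categories(training, prediction)
if missing:
    print("Categories not sampled by the training data:", missing)
```

A check of this kind could name the offending values in the error message instead of reporting only that unsampled categories exist.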