[GRASS-SVN] r70989 - in grass-addons/grass7: raster/r.findtheriver raster/r.green/r.green.hydro/r.green.hydro.planning raster/r.learn.ml raster/r.sim.water.mp raster/r.sun.mp raster3d/r3.to.group vector/v.profile
svn_grass at osgeo.org
Sat Apr 29 14:42:18 PDT 2017
Author: neteler
Date: 2017-04-29 14:42:18 -0700 (Sat, 29 Apr 2017)
New Revision: 70989
Modified:
grass-addons/grass7/raster/r.findtheriver/find_stream.c
grass-addons/grass7/raster/r.findtheriver/global.h
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/Makefile
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r.green.hydro.planning.html
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r.green.hydro.planning.py
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_input.png
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_input_PNAM.png
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_output_PNAM3.png
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_output_park.png
grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_output_points.png
grass-addons/grass7/raster/r.learn.ml/Makefile
grass-addons/grass7/raster/r.learn.ml/lsat7_2000_b742.png
grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
grass-addons/grass7/raster/r.learn.ml/r_learn_utils.py
grass-addons/grass7/raster/r.learn.ml/rfclassification.png
grass-addons/grass7/raster/r.sim.water.mp/Makefile
grass-addons/grass7/raster/r.sim.water.mp/erod.c
grass-addons/grass7/raster/r.sim.water.mp/hydro.c
grass-addons/grass7/raster/r.sim.water.mp/input.c
grass-addons/grass7/raster/r.sim.water.mp/main.c
grass-addons/grass7/raster/r.sim.water.mp/observation_points.c
grass-addons/grass7/raster/r.sim.water.mp/output.c
grass-addons/grass7/raster/r.sim.water.mp/r.sim.water.mp.html
grass-addons/grass7/raster/r.sim.water.mp/r_sim_water.png
grass-addons/grass7/raster/r.sim.water.mp/random.c
grass-addons/grass7/raster/r.sim.water.mp/simlib.h
grass-addons/grass7/raster/r.sim.water.mp/spearfish.sh
grass-addons/grass7/raster/r.sim.water.mp/utils.c
grass-addons/grass7/raster/r.sim.water.mp/waterglobs.h
grass-addons/grass7/raster/r.sun.mp/Makefile
grass-addons/grass7/raster/r.sun.mp/README
grass-addons/grass7/raster/r.sun.mp/TODO
grass-addons/grass7/raster/r.sun.mp/dalloc.c
grass-addons/grass7/raster/r.sun.mp/local_proto.h
grass-addons/grass7/raster/r.sun.mp/main.c
grass-addons/grass7/raster/r.sun.mp/r.sun.mp.html
grass-addons/grass7/raster/r.sun.mp/rsunglobals.h
grass-addons/grass7/raster/r.sun.mp/rsunlib.c
grass-addons/grass7/raster/r.sun.mp/sunradstruct.h
grass-addons/grass7/raster3d/r3.to.group/Makefile
grass-addons/grass7/raster3d/r3.to.group/r3.to.group.html
grass-addons/grass7/raster3d/r3.to.group/r3.to.group.py
grass-addons/grass7/vector/v.profile/local_proto.h
grass-addons/grass7/vector/v.profile/processors.c
Log:
addons: svn propset
Property changes on: grass-addons/grass7/raster/r.findtheriver/find_stream.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.findtheriver/global.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/Makefile
___________________________________________________________________
Added: svn:mime-type
+ text/x-makefile
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r.green.hydro.planning.html
___________________________________________________________________
Added: svn:mime-type
+ text/html
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r.green.hydro.planning.py
___________________________________________________________________
Added: svn:mime-type
+ text/x-python
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_input.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_input_PNAM.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_output_PNAM3.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_output_park.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Property changes on: grass-addons/grass7/raster/r.green/r.green.hydro/r.green.hydro.planning/r_green_hydro_planning_output_points.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Modified: grass-addons/grass7/raster/r.learn.ml/Makefile
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/Makefile 2017-04-29 20:09:01 UTC (rev 70988)
+++ grass-addons/grass7/raster/r.learn.ml/Makefile 2017-04-29 21:42:18 UTC (rev 70989)
@@ -1,10 +1,10 @@
-MODULE_TOPDIR = ../..
-
-PGM = r.learn.ml
-
-ETCFILES = r_learn_utils
-
-include $(MODULE_TOPDIR)/include/Make/Script.make
-include $(MODULE_TOPDIR)/include/Make/Python.make
-
-default: script
+MODULE_TOPDIR = ../..
+
+PGM = r.learn.ml
+
+ETCFILES = r_learn_utils
+
+include $(MODULE_TOPDIR)/include/Make/Script.make
+include $(MODULE_TOPDIR)/include/Make/Python.make
+
+default: script
Property changes on: grass-addons/grass7/raster/r.learn.ml/Makefile
___________________________________________________________________
Added: svn:mime-type
+ text/x-makefile
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.learn.ml/lsat7_2000_b742.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Modified: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html 2017-04-29 20:09:01 UTC (rev 70988)
+++ grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html 2017-04-29 21:42:18 UTC (rev 70989)
@@ -1,104 +1,104 @@
-<h2>DESCRIPTION</h2>
-
-<p><em>r.learn.ml</em> represents a front-end to the scikit-learn Python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using several commonly used classifiers in remote sensing and spatial modelling. The choice of classifier is set using the <em>classifier</em> parameter. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>. The following classification and regression methods are available:</p>
-
-<ul>
- <li><em>LogisticRegression</em> represents a linear model for classification and is a modification of linear regression, but using the logistic function, which enables the use of a categorical response variable. If the <em>response</em> raster is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
-
- <li><em>LinearDiscriminantAnalysis</em> and <em>QuadraticDiscriminantAnalysis</em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification.</li>
-
- <li><em>KNeighborsClassifier</em> is a simple classification method that bases its prediction on the closest distance to a predefined number of training samples. Two hyperparameters are exposed in the GUI, with <em>n_neighbors</em> governing the number of neighbors to use to decide the prediction label, and <em>weights</em> specifying whether these neighbors should have equal weights or whether they should be inversely weighted by their distance.</li>
-
- <li><em>GaussianNB</em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes' theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters.</li>
-
- <li>The <em>DecisionTreeClassifier</em> and <em>DecisionTreeRegressor</em> map observations to a response variable using a hierarchy of splits and branches. The termini of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric and can model non-linear relationships between a response and predictor variables, and are insensitive to the scaling of the predictors.</li>
-
- <li>The <em>RandomForestClassifier</em> and <em>RandomForestRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data and only a random subset of the predictors based on <em>max_features</em> is made available during each node split. Each tree produces a prediction probability and the final classification result is obtained by averaging the prediction probabilities across all of the trees. The <em>ExtraTreesClassifier</em> is a variant on random forests where during each node split, the splitting rule that is selected is based on the best of a collection of randomly-generated thresholds that were assigned to the features.</li>
-
- <li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion, whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, the <em>XGBClassifier</em> and <em>XGBRegressor</em> models represent an accelerated version of gradient boosting, which becomes available if the optional XGBoost Python package is installed.</li>
-
- <li>The <em>SVC</em> model is a C-Support Vector Classifier. Only a linear kernel is supported because, for the large numbers of samples typical of remote sensing and spatial analysis datasets, non-linear kernels in scikit-learn are too slow to be practical. This classifier can still be slow for large datasets.</li>
-
- <li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are Python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which can optionally be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
-</ul>
-
-<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuned using a grid search by inputting multiple parameter settings as a comma-separated list. This tuning can also be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to > 1. The parameters consist of:</p>
-
-<ul>
- <li><em>C</em> is the inverse of the regularization strength; regularization applies a penalty to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models.</li>
-
- <li><em>n_estimators</em> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. For random forests, a larger number of trees will never adversely affect accuracy, although this comes at the expense of computational performance. In contrast, gradient boosting will start to overfit if <em>n_estimators</em> is too high, which will reduce model accuracy.</li>
-
- <li><em>max_features</em> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods.</li>
-
- <li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively.</li>
-
- <li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting and XGBClassifier or XGBRegressor. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> generally improves accuracy in gradient boosting but will require a much larger <em>n_estimators</em> setting, which will lower computational performance.</li>
-
- <li>The main control on accuracy in the Earth classifier is <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass. Settings of <em>max_degree</em> = 1 or 2 offer a good trade-off between accuracy and computational performance.</li>
-</ul>
-
-<p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
-
-<p>Cross validation can be performed by setting the <em>cv</em> parameter to > 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or whether the classifier performs regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, in which case cross-validation is effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
-
-<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the <em>-b</em> flag balances the training data by weighting the minority classes relative to the majority class. This does not apply to the Naive Bayes or LinearDiscriminantAnalysis classifiers.</p>
-
-<p>Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (e.g. decision trees) but generally, categorical predictors need to be converted to a suite of binary variables using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training data include all of the categories in the rasters, otherwise the onehot encoding will fail when it comes to the prediction.</p>
-
-<p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
-
-<p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <em>save_training</em> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
-
-<h2>NOTES</h2>
-
-<p><em>r.learn.ml</em> uses the "scikit-learn" machine learning Python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
-
-<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This, however, does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>Lines=25</em> should be reasonable for most systems with 4-8 GB of RAM. However, the row-by-row access results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read each predictor into memory one at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
-
-<p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>randst</em> parameter.</p>
-
-<h2>EXAMPLE</h2>
-
-<p>Here we are going to use the GRASS GIS sample North Carolina data set as a basis to perform a Landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.</p>
-
-<p>Landsat 7 (2000) bands 7,4,2 color composite example:</p>
-<center>
-<img src="lsat7_2000_b742.png" alt="Landsat 7 (2000) bands 7,4,2 color composite example">
-</center>
-
-<p>Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.</p>
-
-<p>First, we are going to generate some training pixels from an older (1996) land cover classification:</p>
-<div class="code"><pre>
-g.region raster=landclass96 -p
-r.random input=landclass96 npoints=1000 raster=landclass96_roi
-</pre></div>
-
-<p>Then we can use these training pixels to perform a classification on the more recently obtained landsat 7 image:</p>
-<div class="code"><pre>
-r.learn.ml group=lsat7_2000 trainingmap=landclass96_roi output=rf_classification \
- classifier=RandomForestClassifier n_estimators=500 max_features=-1 min_samples_split=2 randst=1 lines=25
-
-# copy category labels from landclass training map to result
-r.category rf_classification raster=landclass96_roi
-
-# copy color scheme from landclass training map to result
-r.colors rf_classification raster=landclass96_roi
-r.category rf_classification
-</pre></div>
-
-<p>Random forest classification result:</p>
-<center>
-<img src="rfclassification.png" alt="Random forest classification result">
-</center>
-
-<h2>ACKNOWLEDGEMENTS</h2>
-
-<p>Thanks to Paulo van Breugel and Vaclav Petras for general testing, and to Paulo for the suggestion to enable saving of the fitted models.</p>
-
-<h2>AUTHOR</h2>
-
-Steven Pawley
-
-<p><em>Last changed: $Date: 2017-01-04 15:34:42 -0700 (Wed, 04 Jan 2017) $</em></p>
\ No newline at end of file
+<h2>DESCRIPTION</h2>
+
+<p><em>r.learn.ml</em> represents a front-end to the scikit-learn Python package for the purpose of performing classification and regression on GRASS rasters as part of an imagery group. The module enables classification and regression using several commonly used classifiers in remote sensing and spatial modelling. The choice of classifier is set using the <em>classifier</em> parameter. For more details relating to the classifiers, refer to the <a href="http://scikit-learn.org/stable/">scikit-learn documentation</a>. The following classification and regression methods are available:</p>
+
+<ul>
+ <li><em>LogisticRegression</em> represents a linear model for classification and is a modification of linear regression, but using the logistic function, which enables the use of a categorical response variable. If the <em>response</em> raster is coded to 0 and 1, then a binary classification occurs, but the scikit-learn logistic regression can also perform a multiclass classification using a one-versus-rest scheme.</li>
+
+ <li><em>LinearDiscriminantAnalysis</em> and <em>QuadraticDiscriminantAnalysis</em> are classifiers with linear and quadratic decision surfaces. These classifiers do not take any parameters and are inherently multiclass. They can only be used for classification.</li>
+
+ <li><em>KNeighborsClassifier</em> is a simple classification method that bases its prediction on the closest distance to a predefined number of training samples. Two hyperparameters are exposed in the GUI, with <em>n_neighbors</em> governing the number of neighbors to use to decide the prediction label, and <em>weights</em> specifying whether these neighbors should have equal weights or whether they should be inversely weighted by their distance.</li>
+
+ <li><em>GaussianNB</em> is the Gaussian Naive Bayes algorithm and can be used for classification only. Naive Bayes is a supervised learning algorithm based on applying Bayes' theorem with the naive assumption of independence between every pair of features. This classifier does not take any parameters.</li>
+
+ <li>The <em>DecisionTreeClassifier</em> and <em>DecisionTreeRegressor</em> map observations to a response variable using a hierarchy of splits and branches. The termini of these branches, termed leaves, represent the prediction of the response variable. Decision trees are non-parametric and can model non-linear relationships between a response and predictor variables, and are insensitive to the scaling of the predictors.</li>
+
+ <li>The <em>RandomForestClassifier</em> and <em>RandomForestRegressor</em> represent ensemble classification and regression tree methods. Random forests overcome some of the disadvantages of single decision trees by constructing an ensemble of uncorrelated trees. Each tree is constructed from a random subsample of the training data and only a random subset of the predictors based on <em>max_features</em> is made available during each node split. Each tree produces a prediction probability and the final classification result is obtained by averaging the prediction probabilities across all of the trees. The <em>ExtraTreesClassifier</em> is a variant on random forests where during each node split, the splitting rule that is selected is based on the best of a collection of randomly-generated thresholds that were assigned to the features (a short construction sketch follows this list).</li>
+
+ <li>The <em>GradientBoostingClassifier</em> and <em>GradientBoostingRegressor</em> also represent ensemble tree-based methods. However, in a boosted model the learning process is additive in a forward step-wise fashion, whereby <i>n_estimators</i> are fit during each model step, and each model step is designed to better fit samples that are not currently well predicted by the previous step. This incrementally improves the performance of the entire model ensemble by fitting to the model residuals. Additionally, the <em>XGBClassifier</em> and <em>XGBRegressor</em> models represent an accelerated version of gradient boosting, which becomes available if the optional XGBoost Python package is installed.</li>
+
+ <li>The <em>SVC</em> model is a C-Support Vector Classifier. Only a linear kernel is supported because, for the large numbers of samples typical of remote sensing and spatial analysis datasets, non-linear kernels in scikit-learn are too slow to be practical. This classifier can still be slow for large datasets.</li>
+
+ <li>The <em>EarthClassifier</em> and <em>EarthRegressor</em> are Python-based versions of Friedman's multivariate adaptive regression splines. These classifiers depend on the <a href="https://github.com/scikit-learn-contrib/py-earth">py-earth package</a>, which can optionally be installed in addition to scikit-learn. Earth represents a non-parametric extension to linear models such as logistic regression which improves model fit by partitioning the data into subregions, with each region being fitted by a separate regression term.</li>
+</ul>
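+
+<p>As a simple illustration (with hypothetical parameter values), two of the above estimators can be constructed directly in scikit-learn as follows; internally, the module builds the chosen estimator from the <em>classifier</em> parameter:</p>
+<div class="code"><pre>
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.linear_model import LogisticRegression
+
+# ensemble of 500 trees; max_features limits the predictors tried per split
+rf = RandomForestClassifier(n_estimators=500, max_features='sqrt', random_state=1)
+
+# linear classifier; C is the inverse regularization strength
+lr = LogisticRegression(C=1.0, random_state=1)
+</pre></div>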
+
+<p>The Classifier parameters tab provides access to the most pertinent parameters that affect the previously described algorithms. The scikit-learn classifier defaults are generally supplied, and some of these parameters can be tuned using a grid search by inputting multiple parameter settings as a comma-separated list. This tuning can also be accomplished simultaneously with nested cross-validation by also setting the <em>cv</em> option to > 1. The parameters consist of:</p>
+
+<ul>
+ <li><em>C</em> is the inverse of the regularization strength; regularization applies a penalty to avoid overfitting. <em>C</em> applies to the LogisticRegression and SVC models.</li>
+
+ <li><em>n_estimators</em> represents the number of trees in the Random Forest model, and the number of trees used in each model step during Gradient Boosting. For random forests, a larger number of trees will never adversely affect accuracy, although this comes at the expense of computational performance. In contrast, gradient boosting will start to overfit if <em>n_estimators</em> is too high, which will reduce model accuracy.</li>
+
+ <li><em>max_features</em> controls the number of variables that are allowed to be chosen from at each node split in the tree-based models, and can be considered to control the degree of correlation between the trees in ensemble tree methods.</li>
+
+ <li><em>min_samples_split</em> and <em>min_samples_leaf</em> control the number of samples required to split a node or form a leaf node, respectively.</li>
+
+ <li>The <em>learning_rate</em> and <em>subsample</em> parameters apply only to Gradient Boosting and XGBClassifier or XGBRegressor. <em>learning_rate</em> shrinks the contribution of each tree, and <em>subsample</em> is the fraction of randomly selected samples for each tree. A lower <em>learning_rate</em> generally improves accuracy in gradient boosting but will require a much larger <em>n_estimators</em> setting, which will lower computational performance.</li>
+
+ <li>The main control on accuracy in the Earth classifier is <em>max_degree</em>, which is the maximum degree of terms generated by the forward pass. Settings of <em>max_degree</em> = 1 or 2 offer a good trade-off between accuracy and computational performance.</li>
+</ul>
+
+<p>In addition to model fitting and prediction, feature selection can be performed using the <em>-f</em> flag. The feature selection method employed consists of a custom permutation-based method that can be applied to all of the classifiers as part of a cross-validation. The method consists of: (1) determining a performance metric on a test partition of the data; (2) permuting each variable and assessing the difference in performance between the original and the permutation; (3) repeating step 2 for <em>n_permutations</em>; (4) averaging the results. Steps 1-4 are repeated on each of the k partitions. The feature importances represent the average decrease in performance of each variable when permuted. For binary classifications, the AUC is used as the metric. Multiclass classifications use accuracy, and regressions use R2.</p>
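+
+<p>The sketch below is a simplified, standalone version of the module's <em>varimp_permutation</em> routine (see <em>r_learn_utils.py</em>), assuming an already fitted estimator and using accuracy as the scorer:</p>
+<div class="code"><pre>
+import numpy as np
+from sklearn.metrics import accuracy_score
+
+def permutation_importance(estimator, X_test, y_test, n_permutations=25, seed=1):
+    rng = np.random.RandomState(seed)
+    best = accuracy_score(y_test, estimator.predict(X_test))
+    scores = np.zeros((n_permutations, X_test.shape[1]))
+    for rep in range(n_permutations):
+        for i in range(X_test.shape[1]):
+            Xscram = np.copy(X_test)
+            rng.shuffle(Xscram[:, i])  # break the association for variable i
+            scores[rep, i] = best - accuracy_score(y_test, estimator.predict(Xscram))
+    return scores.mean(axis=0)  # mean decrease in performance per variable
+</pre></div>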
+
+<p>Cross validation can be performed by setting the <em>cv</em> parameter to > 1. Cross-validation is performed using stratified k-folds, and multiple global and per-class accuracy measures are produced depending on whether the response variable is binary or multiclass, or whether the classifier performs regression or classification. The <em>cvtype</em> parameter can also be changed from 'non-spatial' to either 'clumped' or 'kmeans' to perform spatial cross-validation. Clumped spatial cross-validation is used if the training pixels represent polygons, in which case cross-validation is effectively performed on a polygon basis. Kmeans spatial cross-validation will partition the training pixels into <em>n_partitions</em> by kmeans clustering of the pixel coordinates. These partitions will then be used for cross-validation, which should provide more realistic performance measures if the data are spatially correlated. If these partitioning schemes are not sufficient then a raster containing the group_ids of the partitions can be supplied using the <em>group_raster</em> option.</p>
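+
+<p>A minimal sketch of the kmeans partitioning idea (the <em>coords</em> array of pixel x, y coordinates is assumed to have been extracted already):</p>
+<div class="code"><pre>
+from sklearn.cluster import KMeans
+
+# cluster the (n_samples, 2) array of pixel coordinates into spatial groups
+groups = KMeans(n_clusters=10, random_state=1).fit_predict(coords)
+# pixels sharing a group label are then kept within the same fold
+</pre></div>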
+
+<p>Although tree-based classifiers are insensitive to the scaling of the input data, other classifiers such as linear models may not perform optimally if some predictors have variances that are orders of magnitude larger than others. The <em>-s</em> flag adds a standardization preprocessing step to the classification and prediction to reduce this effect. Additionally, most of the classifiers do not perform well if there is a large class imbalance in the training data. Using the <em>-b</em> flag balances the training data by weighting the minority classes relative to the majority class. This does not apply to the Naive Bayes or LinearDiscriminantAnalysis classifiers.</p>
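+
+<p>In scikit-learn terms, the two options correspond roughly to the following (illustrative only, with <em>X</em> as the predictor matrix; the flags apply these steps automatically):</p>
+<div class="code"><pre>
+from sklearn.preprocessing import StandardScaler
+from sklearn.ensemble import RandomForestClassifier
+
+# -s flag: zero-mean, unit-variance scaling of the predictors
+X_scaled = StandardScaler().fit_transform(X)
+
+# -b flag: weight classes inversely to their frequency
+rf = RandomForestClassifier(n_estimators=500, class_weight='balanced')
+</pre></div>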
+
+<p>Non-ordinal, categorical predictors are also not specifically recognized by scikit-learn. Some classifiers are not very sensitive to this (e.g. decision trees) but generally, categorical predictors need to be converted to a suite of binary variables using onehot encoding (i.e. where each value in a categorical raster is parsed into a separate binary grid). Entering the indices (comma-separated) of the categorical rasters as they are listed in the imagery group as 0...n in the <em>categorymaps</em> option will cause onehot encoding to be performed on the fly during training and prediction. The feature importances are returned as per the original imagery group and represent the sum of the feature importances of the onehot-encoded variables. Note: it is important that the training data include all of the categories in the rasters, otherwise the onehot encoding will fail when it comes to the prediction.</p>
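+
+<p>A minimal sketch of onehot encoding a single categorical predictor, assuming integer category codes in a 1D numpy array:</p>
+<div class="code"><pre>
+import numpy as np
+
+landuse = np.array([1, 3, 3, 2, 1])        # hypothetical category codes
+categories = np.unique(landuse)
+# one binary column per category value
+onehot = (landuse[:, None] == categories[None, :]).astype(int)
+</pre></div>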
+
+<p>The module also offers the ability to save and load a classification or regression model. Saving and loading a model allows a model to be fitted on one imagery group, with the prediction applied to additional imagery groups. This approach is commonly employed in species distribution or landslide susceptibility modelling whereby a classification or regression model is built with one set of predictors (e.g. present-day climatic variables) and then predictions can be performed on other imagery groups containing forecasted climatic variables.</p>
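+
+<p>Conceptually this amounts to serializing the fitted estimator; a generic scikit-learn illustration of the idea (not necessarily the module's exact mechanism) uses Python pickling:</p>
+<div class="code"><pre>
+import pickle
+
+# save a fitted estimator to disk
+with open('model.pkl', 'wb') as f:
+    pickle.dump(rf, f)
+
+# reload it later to predict on other imagery groups
+with open('model.pkl', 'rb') as f:
+    rf = pickle.load(f)
+</pre></div>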
+
+<p>For convenience when performing repeated classifications using different classifiers or parameters, the training data can be saved to a csv file using the <em>save_training</em> option. This data can then be loaded into subsequent classification runs, saving time by avoiding the need to repeatedly query the predictors.</p>
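+
+<p>The saved training data are a plain comma-separated table of predictors, labels and group ids; the equivalent numpy operation (with <em>X</em>, <em>y</em> and <em>groups</em> as the extracted arrays) is:</p>
+<div class="code"><pre>
+import numpy as np
+
+training_data = np.column_stack([X, y, groups])
+np.savetxt('training.csv', training_data, delimiter=',')
+</pre></div>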
+
+<h2>NOTES</h2>
+
+<p><em>r.learn.ml</em> uses the "scikit-learn" machine learning Python package along with the "pandas" package. These packages need to be installed within your GRASS GIS Python environment. For Linux users, these packages should be available through the Linux package manager. For MS-Windows users using a 64-bit GRASS, the easiest way of installing the packages is by using the precompiled binaries from <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/">Christoph Gohlke</a> and by using the <a href="https://grass.osgeo.org/download/software/ms-windows/">OSGeo4W</a> installation method of GRASS, where the Python setuptools can also be installed. You can then use 'easy_install pip' to install the pip package manager. Then, you can download the NumPy-1.10+MKL and scikit-learn .whl files and install them using 'pip install packagename.whl'. For MS-Windows with a 32-bit GRASS, scikit-learn is available in the OSGeo4W installer.</p>
+
+<p><em>r.learn.ml</em> is designed to keep system memory requirements relatively low. For this purpose, the rasters are read from the disk row-by-row, using the RasterRow method in PyGRASS. This, however, does not represent an efficient volume of data to pass to the classifiers, which are mostly multithreaded. Therefore, groups of rows specified by the <em>lines</em> parameter are passed to the classifier, and the reclassified image is reconstructed and written row-by-row back to the disk. <em>Lines=25</em> should be reasonable for most systems with 4-8 GB of RAM. However, the row-by-row access results in slow performance when sampling the imagery group to build the training data set when providing a raster as the trainingmap. Instead, the default behaviour is to read each predictor into memory one at a time. If this still exceeds the system memory then the <em>-l</em> flag can be set to write each predictor to a numpy memmap file, and classification/regression can then be performed on rasters of any size irrespective of the available memory.</p>
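+
+<p>In outline, the row-block access pattern used for prediction reads a window of rows from each predictor into a numpy array before passing it to the estimator (<em>elevation</em> is a hypothetical raster name):</p>
+<div class="code"><pre>
+import numpy as np
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.gis.region import Region
+
+region = Region()
+raster = RasterRow('elevation')
+raster.open('r')
+
+rowincr = 25  # the 'lines' parameter
+for rowblock in range(0, region.rows, rowincr):
+    n = min(rowincr, region.rows - rowblock)
+    block = np.asarray([raster[row] for row in range(rowblock, rowblock + n)])
+    # block has shape (n, cols); stack the predictors, then estimator.predict()
+raster.close()
+</pre></div>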
+
+<p>Many of the classifiers involve a random process which can cause a small amount of variation in the classification results, out-of-bag error, and feature importances. To enable reproducible results, a seed is supplied to the classifier. This can be changed using the <em>randst</em> parameter.</p>
+
+<h2>EXAMPLE</h2>
+
+<p>Here we are going to use the GRASS GIS sample North Carolina data set as a basis to perform a Landsat classification. We are going to classify a Landsat 7 scene from 2000, using training information from an older (1996) land cover dataset.</p>
+
+<p>Landsat 7 (2000) bands 7,4,2 color composite example:</p>
+<center>
+<img src="lsat7_2000_b742.png" alt="Landsat 7 (2000) bands 7,4,2 color composite example">
+</center>
+
+<p>Note that this example must be run in the "landsat" mapset of the North Carolina sample data set location.</p>
+
+<p>First, we are going to generate some training pixels from an older (1996) land cover classification:</p>
+<div class="code"><pre>
+g.region raster=landclass96 -p
+r.random input=landclass96 npoints=1000 raster=landclass96_roi
+</pre></div>
+
+<p>Then we can use these training pixels to perform a classification on the more recently obtained landsat 7 image:</p>
+<div class="code"><pre>
+r.learn.ml group=lsat7_2000 trainingmap=landclass96_roi output=rf_classification \
+ classifier=RandomForestClassifier n_estimators=500 max_features=-1 min_samples_split=2 randst=1 lines=25
+
+# copy category labels from landclass training map to result
+r.category rf_classification raster=landclass96_roi
+
+# copy color scheme from landclass training map to result
+r.colors rf_classification raster=landclass96_roi
+r.category rf_classification
+</pre></div>
+
+<p>Random forest classification result:</p>
+<center>
+<img src="rfclassification.png" alt="Random forest classification result">
+</center>
+
+<h2>ACKNOWLEDGEMENTS</h2>
+
+<p>Thanks to Paulo van Breugel and Vaclav Petras for general testing, and to Paulo for the suggestion to enable saving of the fitted models.</p>
+
+<h2>AUTHOR</h2>
+
+Steven Pawley
+
+<p><em>Last changed: $Date$</em></p>
\ No newline at end of file
Property changes on: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.html
___________________________________________________________________
Added: svn:mime-type
+ text/html
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.learn.ml/r.learn.ml.py
___________________________________________________________________
Added: svn:mime-type
+ text/x-python
Added: svn:eol-style
+ native
Modified: grass-addons/grass7/raster/r.learn.ml/r_learn_utils.py
===================================================================
--- grass-addons/grass7/raster/r.learn.ml/r_learn_utils.py 2017-04-29 20:09:01 UTC (rev 70988)
+++ grass-addons/grass7/raster/r.learn.ml/r_learn_utils.py 2017-04-29 21:42:18 UTC (rev 70989)
@@ -1,810 +1,810 @@
-#!/usr/bin/env python
-# -- coding: utf-8 --
-
-from __future__ import absolute_import
-import numpy as np
-import os
-import tempfile
-from copy import deepcopy
-from numpy.random import RandomState
-from grass.pygrass.modules.shortcuts import raster as r
-from grass.pygrass.raster import RasterRow
-from grass.pygrass.gis.region import Region
-from grass.pygrass.raster.buffer import Buffer
-from grass.pygrass.modules.shortcuts import imagery as im
-from grass.pygrass.vector import VectorTopo
-from grass.pygrass.vector.table import Link
-from grass.pygrass.utils import get_raster_for_points
-import grass.script as gscript
-from subprocess import PIPE
-
-def specificity_score(y_true, y_pred):
- """
- Calculate specificity score
-
- Args
- ----
- y_true: 1D numpy array of truth values
- y_pred: 1D numpy array of predicted classes
-
- Returns
- -------
- specificity: specificity score
- """
- from sklearn.metrics import confusion_matrix
-
- cm = confusion_matrix(y_true, y_pred)
- tn = float(cm[0][0])
- fp = float(cm[0][1])
-
- return tn/(tn+fp)
-
-
-def varimp_permutation(estimator, X_test, y_true,
- n_permutations, scorer,
- random_state):
-
- """
- Method to perform permutation-based feature importance during
- cross-validation (cross-validation is applied externally to this
- method)
-
- Procedure is:
- 1. Pass fitted estimator and test partition X y
- 2. Assess AUC on the test partition (bestauc)
- 3. Permute each variable and assess the difference between bestauc and
- the messed-up variable
- 4. Repeat (3) for many random permutations
- 5. Average the repeats
-
- Args
- ----
- estimator: estimator that has been fitted to a training partition
- X_test, y_true: data and labels from a test partition
- n_permutations: number of random permutations to apply
- scorer: scikit-learn metric function to use
- random_state: seed to pass to the numpy random.seed
-
- Returns
- -------
- scores: scores for each predictor following permutation
- """
-
- # calculate score on original variables without permutation
- # determine best metric type for binary/multiclass/regression scenarios
- y_pred = estimator.predict(X_test)
- best_score = scorer(y_true, y_pred)
-
- np.random.seed(seed=random_state)
- rstate = RandomState(random_state)
- scores = np.zeros((n_permutations, X_test.shape[1]))
-
- # outer loop to repeat the permutation rep times
- for rep in range(n_permutations):
-
- # inner loop to permute each predictor variable and assess
- # difference in auc
- for i in range(X_test.shape[1]):
- Xscram = np.copy(X_test)
- Xscram[:, i] = rstate.choice(X_test[:, i], X_test.shape[0])
-
- # fit the model on the training data and predict the test data
- y_pred = estimator.predict(Xscram)
- scores[rep, i] = best_score-scorer(y_true, y_pred)
- if scores[rep, i] < 0:
- scores[rep, i] = 0
-
- # average the repetitions
- scores = scores.mean(axis=0)
-
- return scores
-
-
-def cross_val_scores(estimator, X, y, groups=None, sample_weight=None, cv=3,
- scoring='accuracy', feature_importances=False,
- n_permutations=25, models=False, random_state=None):
- """
- Stratified Kfold and GroupFold cross-validation using multiple
- scoring metrics and permutation feature importances
-
- Args
- ----
- estimator: Scikit learn estimator
- X: 2D numpy array of training data
- y: 1D numpy array representing response variable
- groups: 1D numpy array containing group labels
- sample_weight: 1D numpy array[n_samples,] of sample weights
- cv: Integer of cross-validation folds or sklearn.model_selection object
- scoring: List of performance metrics to use
- feature_importances: Boolean to perform permutation-based importances
- n_permutations: Number of permutations during feature importance
- models: Boolean, return a list of the fitted models
- random_state: Seed to pass to the random number generator
- """
-
- from sklearn import metrics
- from sklearn.model_selection import (
- RandomizedSearchCV, GridSearchCV, StratifiedKFold)
-
- # deepcopy estimator
- estimator = deepcopy(estimator)
- fitted_models = []
-
- # create model_selection method
- if isinstance(cv, int): cv = StratifiedKFold(n_splits=cv)
-
- # create dictionary of lists to store metrics
- if isinstance(scoring, basestring): scoring = [scoring]
- scores = dict.fromkeys(scoring)
- scores = {key: [] for key, value in scores.iteritems()}
- scoring_methods = {'accuracy': metrics.accuracy_score,
- 'balanced_accuracy': metrics.recall_score,
- 'average_precision': metrics.average_precision_score,
- 'brier_loss': metrics.brier_score_loss,
- 'kappa': metrics.cohen_kappa_score,
- 'f1': metrics.f1_score,
- 'fbeta': metrics.fbeta_score,
- 'hamming_loss': metrics.hamming_loss,
- 'jaccard_similarity': metrics.jaccard_similarity_score,
- 'log_loss': metrics.log_loss,
- 'matthews_corrcoef': metrics.matthews_corrcoef,
- 'precision': metrics.precision_score,
- 'recall': metrics.recall_score,
- 'specificity': specificity_score,
- 'roc_auc': metrics.roc_auc_score,
- 'zero_one_loss': metrics.zero_one_loss,
- 'r2': metrics.r2_score,
- 'neg_mean_squared_error': metrics.mean_squared_error}
-
- byclass_methods = {'f1': metrics.f1_score,
- 'fbeta': metrics.fbeta_score,
- 'precision': metrics.precision_score,
- 'recall': metrics.recall_score}
-
- # create dictionary to store byclass metrics results
- n_classes = len(np.unique(y))
- labels = np.unique(y)
- byclass_scores = dict.fromkeys(byclass_methods)
- byclass_scores = {key: np.zeros((0, n_classes)) for key, value in byclass_scores.iteritems()}
-
- # remove any byclass_scorers that are not in the scoring list
- byclass_scores = {key: value for key, value in byclass_scores.iteritems() if key in scores}
-
- # check if remaining scorers are valid sklearn metrics
- for i in scores.keys():
- try:
- list(scoring_methods.keys()).index(i)
- except:
- print('Scoring ' + i + ' is not a valid scoring method')
- print('Valid methods are:')
- print(scoring_methods.keys())
-
- # set averaging type for global binary or multiclass scores
- if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
- average = 'binary'
- else:
- average = 'macro'
-
- # create np array to store feature importance scores
- if feature_importances is True:
- fimp = np.zeros((cv.get_n_splits(), X.shape[1]))
- fimp[:] = np.nan
- else:
- fimp = None
-
- # generate Kfold indices
- if groups is None:
- k_fold = cv.split(X, y)
- else:
- k_fold = cv.split(X, y, groups=groups)
-
- # train on k-1 folds and test on the k-th fold
- for train_indices, test_indices in k_fold:
-
- # create training and test folds
- X_train, X_test = X[train_indices], X[test_indices]
- y_train, y_test = y[train_indices], y[test_indices]
- if groups is not None: groups_train = groups[train_indices]
- else: groups_train = None
-
- # subset training and test fold sample_weight
- if sample_weight is not None: weights = sample_weight[train_indices]
-
- # train estimator
- if groups is not None and isinstance(estimator, (RandomizedSearchCV, GridSearchCV)) is True:
- if sample_weight is None: estimator.fit(X_train, y_train, groups=groups_train)
- else: estimator.fit(X_train, y_train, groups=groups_train, sample_weight=weights)
- else:
- if sample_weight is None: estimator.fit(X_train, y_train)
- else: estimator.fit(X_train, y_train, sample_weight=weights)
-
- # optionally store the fitted models on each resample
- if models is True: fitted_models.append(deepcopy(estimator))
-
- # prediction of test fold
- y_pred = estimator.predict(X_test)
-
- # calculate global performance metrics
- for m in scores.keys():
- # metrics that require probabilities
- if m == 'brier_loss' or m == 'roc_auc':
- y_prob = estimator.predict_proba(X_test)[:, 1]
- scores[m] = np.append(
- scores[m], scoring_methods[m](y_test, y_prob))
-
- # metrics that have no averaging for multiclass
- elif m == 'kappa' or m == 'specificity' or m == 'accuracy' \
- or m == 'hamming_loss' or m == 'jaccard_similarity' \
- or m == 'log_loss' or m == 'zero_one_loss' or m == 'matthews_corrcoef':
- scores[m] = np.append(
- scores[m], scoring_methods[m](y_test, y_pred))
-
- # balanced accuracy
- elif m == 'balanced_accuracy':
- scores[m] = np.append(
- scores[m], scoring_methods[m](
- y_test, y_pred, average='macro'))
-
- # metrics that have averaging for multiclass
- else:
- scores[m] = np.append(
- scores[m], scoring_methods[m](
- y_test, y_pred, average=average))
-
- # calculate per-class performance metrics
- for key in byclass_scores.keys():
- byclass_scores[key] = np.vstack((
- byclass_scores[key], byclass_methods[key](
- y_test, y_pred, labels=labels, average=None)))
-
- # feature importances using permutation
- if feature_importances is True:
- if bool((np.isnan(fimp)).all()) is True:
- fimp = varimp_permutation(
- estimator, X_test, y_test, n_permutations,
- scoring_methods[scoring[0]],
- random_state)
- else:
- fimp = np.row_stack(
- (fimp, varimp_permutation(
- estimator, X_test, y_test,
- n_permutations, scoring_methods[scoring[0]],
- random_state)))
-
- return(scores, byclass_scores, fimp, fitted_models)
-
-
-def predict(estimator, predictors, output, predict_type='raw',
- index=None, rowincr=25):
- """
- Prediction on list of GRASS rasters using a fitted scikit learn model
-
- Args
- ----
- estimator: scikit-learn estimator object
- predictors: list of GRASS rasters
- output: Name of GRASS raster to output classification results
- predict_type: character, 'raw' for classification/regression;
- 'prob' for class probabilities
- index: Optional, list of class indices to export
- rowincr: Integer of raster rows to process at one time
- """
-
- # convert potential single index to list
- if isinstance(index, int): index = [index]
-
- # open predictors as list of rasterrow objects
- current = Region()
- n_features = len(predictors)
- rasstack = [0] * n_features
-
- for i in range(n_features):
- rasstack[i] = RasterRow(predictors[i])
- if rasstack[i].exist() is True:
- rasstack[i].open('r')
- else:
- gscript.fatal("GRASS raster " + predictors[i] +
- " does not exist.... exiting")
-
- # Prediction using blocks of rows per iteration
- for rowblock in range(0, current.rows, rowincr):
- gscript.percent(rowblock, current.rows, rowincr)
-
- # check that the row increment does not exceed the number of rows
- if rowblock+rowincr > current.rows:
- rowincr = current.rows - rowblock
- img_np_row = np.zeros((rowincr, current.cols, n_features))
-
- # loop through each row, and each band and add to 2D img_np_row
- for row in range(rowblock, rowblock+rowincr, 1):
- for band in range(n_features):
- img_np_row[row-rowblock, :, band] = \
- np.array(rasstack[band][row])
-
- # create mask
- img_np_row[img_np_row == -2147483648] = np.nan
- mask = np.zeros((img_np_row.shape[0], img_np_row.shape[1]))
- for feature in range(n_features):
- invalid_indexes = np.nonzero(np.isnan(img_np_row[:, :, feature]))
- mask[invalid_indexes] = np.nan
-
- # reshape each row-band matrix into a n*m array
- nsamples = rowincr * current.cols
- flat_pixels = img_np_row.reshape((nsamples, n_features))
-
- # remove NaNs prior to passing to scikit-learn predict
- flat_pixels = np.nan_to_num(flat_pixels)
-
- # perform prediction for classification/regression
- if predict_type == 'raw':
- result = estimator.predict(flat_pixels)
- result = result.reshape((rowincr, current.cols))
-
- # determine nodata value and grass raster type
- if result.dtype == 'float':
- nodata = np.nan
- ftype = 'FCELL'
- else:
- nodata = -2147483648
- ftype = 'CELL'
-
- # replace NaN values so that the prediction does not have a border
- result[np.nonzero(np.isnan(mask))] = nodata
-
- # on first iteration create the RasterRow object
- if rowblock == 0:
- if predict_type == 'raw':
- classification = RasterRow(output)
- classification.open('w', ftype, overwrite=True)
-
- # write the classification result
- for row in range(rowincr):
- newrow = Buffer((result.shape[1],), mtype=ftype)
- newrow[:] = result[row, :]
- classification.put_row(newrow)
-
- # perform prediction for class probabilities
- if predict_type == 'prob':
- result_proba = estimator.predict_proba(flat_pixels)
-
- # on first loop determine number of probability classes
- # and open rasterrow objects for writing
- if rowblock == 0:
- if index is None:
- index = range(result_proba.shape[1])
- n_classes = len(index)
- else:
- n_classes = len(np.unique(index))
-
- # create and open RasterRow objects for probabilities
- prob_out_raster = [0] * n_classes
- prob = [0] * n_classes
- for iclass, label in enumerate(index):
- prob_out_raster[iclass] = output + '_classPr' + str(label)
- prob[iclass] = RasterRow(prob_out_raster[iclass])
- prob[iclass].open('w', 'FCELL', overwrite=True)
-
- for iclass, label in enumerate(index):
- result_proba_class = result_proba[:, label]
- result_proba_class = result_proba_class.reshape((rowincr, current.cols))
- result_proba_class[np.nonzero(np.isnan(mask))] = np.nan
-
- for row in range(rowincr):
- newrow = Buffer((result_proba_class.shape[1],), mtype='FCELL')
- newrow[:] = result_proba_class[row, :]
- prob[iclass].put_row(newrow)
-
- # close all maps
- for i in range(n_features): rasstack[i].close()
- if predict_type == 'raw': classification.close()
- if predict_type == 'prob':
- try:
- for iclass in range(n_classes):
- prob[iclass].close()
- except:
- pass
-
-
-def model_classifiers(estimator, random_state, n_jobs, p, weights=None):
- """
- Provides the classifiers and parameters used by the module
-
- Args
- ----
- estimator: Name of estimator
- random_state: Seed to use in randomized components
- n_jobs: Integer, number of processing cores to use
- p: Dict, containing classifier settings
- weights: None, or 'balanced' to add class_weights
-
- Returns
- -------
- clf: Scikit-learn classifier object
- mode: Flag to indicate whether classifier performs classification or
- regression
- """
-
- from sklearn.linear_model import LogisticRegression
- from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
- from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
- from sklearn.naive_bayes import GaussianNB
- from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
- from sklearn.ensemble import (
- RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier,
- ExtraTreesRegressor)
- from sklearn.ensemble import GradientBoostingClassifier
- from sklearn.ensemble import GradientBoostingRegressor
- from sklearn.svm import SVC
- from sklearn.neighbors import KNeighborsClassifier
-
- # convert balanced boolean to scikit learn method
- if weights is True:
- weights = 'balanced'
- else: weights = None
-
- # optional packages that add additional classifiers here
- if estimator == 'EarthClassifier' or estimator == 'EarthRegressor':
- try:
- from sklearn.pipeline import Pipeline
- from pyearth import Earth
-
- earth_classifier = Pipeline([('Earth',
- Earth(max_degree=p['max_degree'])),
- ('Logistic', LogisticRegression(n_jobs=n_jobs))])
-
- classifiers = {'EarthClassifier': earth_classifier,
- 'EarthRegressor': Earth(max_degree=p['max_degree'])}
- except:
- gscript.fatal('Py-earth package not installed')
-
- elif estimator == 'XGBClassifier' or estimator == 'XGBRegressor':
- try:
- from xgboost import XGBClassifier, XGBRegressor
-
- if p['max_depth'] is None:
- p['max_depth'] = int(3)
-
- classifiers = {
- 'XGBClassifier':
- XGBClassifier(learning_rate=p['learning_rate'],
- n_estimators=p['n_estimators'],
- max_depth=p['max_depth'],
- subsample=p['subsample'],
- nthread=n_jobs),
- 'XGBRegressor':
- XGBRegressor(learning_rate=p['learning_rate'],
- n_estimators=p['n_estimators'],
- max_depth=p['max_depth'],
- subsample=p['subsample'],
- nthread=n_jobs)}
- except:
- gscript.fatal('XGBoost package not installed')
- else:
- # core sklearn classifiers go here
- classifiers = {
- 'SVC': SVC(C=p['C'],
- class_weight=weights,
- probability=True,
- random_state=random_state),
- 'LogisticRegression':
- LogisticRegression(C=p['C'],
- class_weight=weights,
- random_state=random_state,
- n_jobs=n_jobs,
- fit_intercept=True),
- 'DecisionTreeClassifier':
- DecisionTreeClassifier(max_depth=p['max_depth'],
- max_features=p['max_features'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- class_weight=weights,
- random_state=random_state),
- 'DecisionTreeRegressor':
- DecisionTreeRegressor(max_features=p['max_features'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- random_state=random_state),
- 'RandomForestClassifier':
- RandomForestClassifier(n_estimators=p['n_estimators'],
- max_features=p['max_features'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- class_weight=weights,
- random_state=random_state,
- n_jobs=n_jobs,
- oob_score=False),
- 'RandomForestRegressor':
- RandomForestRegressor(n_estimators=p['n_estimators'],
- max_features=p['max_features'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- random_state=random_state,
- n_jobs=n_jobs,
- oob_score=False),
- 'ExtraTreesClassifier':
- ExtraTreesClassifier(n_estimators=p['n_estimators'],
- max_features=p['max_features'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- class_weight=weights,
- random_state=random_state,
- n_jobs=n_jobs,
- oob_score=False),
- 'ExtraTreesRegressor':
- ExtraTreesRegressor(n_estimators=p['n_estimators'],
- max_features=p['max_features'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- random_state=random_state,
- n_jobs=n_jobs,
- oob_score=False),
- 'GradientBoostingClassifier':
- GradientBoostingClassifier(learning_rate=p['learning_rate'],
- n_estimators=p['n_estimators'],
- max_depth=p['max_depth'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- subsample=p['subsample'],
- max_features=p['max_features'],
- random_state=random_state),
- 'GradientBoostingRegressor':
- GradientBoostingRegressor(learning_rate=p['learning_rate'],
- n_estimators=p['n_estimators'],
- max_depth=p['max_depth'],
- min_samples_split=p['min_samples_split'],
- min_samples_leaf=p['min_samples_leaf'],
- subsample=p['subsample'],
- max_features=p['max_features'],
- random_state=random_state),
- 'GaussianNB': GaussianNB(),
- 'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
- 'QuadraticDiscriminantAnalysis': QuadraticDiscriminantAnalysis(),
- 'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=p['n_neighbors'],
- weights=p['weights'],
- n_jobs=n_jobs)
- }
-
- # define classifier
- clf = classifiers[estimator]
-
- # classification or regression
- if estimator in (
- 'LogisticRegression', 'DecisionTreeClassifier',
- 'RandomForestClassifier', 'ExtraTreesClassifier',
- 'GradientBoostingClassifier', 'GaussianNB',
- 'LinearDiscriminantAnalysis', 'QuadraticDiscriminantAnalysis',
- 'EarthClassifier', 'XGBClassifier', 'SVC',
- 'KNeighborsClassifier'):
- mode = 'classification'
- else:
- mode = 'regression'
-
- return (clf, mode)
-
-
-def save_training_data(X, y, groups, file):
- """
- Saves any extracted training data to a csv file
-
- Args
- ----
- X: Numpy array containing predictor values
- y: Numpy array containing labels
- groups: Numpy array of group labels
- file: Path to a csv file to save data to
- """
-
- # if there are no group labels, create a nan filled array
- if groups is None:
- groups = np.empty((y.shape[0]))
- groups[:] = np.nan
-
- training_data = np.column_stack([X, y, groups])
- np.savetxt(file, training_data, delimiter=',')
-
-
-def load_training_data(file):
- """
- Loads training data and labels from a csv file
-
- Args
- ----
- file: Path to a csv file to load data from
-
- Returns
- -------
- X: Numpy array containing predictor values
- y: Numpy array containing labels
- groups: Numpy array of group labels, or None
- """
-
- training_data = np.loadtxt(file, delimiter=',')
- n_cols = training_data.shape[1]
- last_Xcol = n_cols-2
-
- # check to see if last column contains group labels or nans
- groups = training_data[:, -1]
-
- # if all nans then set groups to None
- if bool(np.isnan(groups).all()) is True:
- groups = None
-
- # fetch X and y
- X = training_data[:, 0:last_Xcol]
- y = training_data[:, -2]
-
- return(X, y, groups)
-
-
-def extract(response, predictors, lowmem=False):
- """
- Samples a list of GRASS rasters using a labelled raster.
- Sampling is performed one raster at a time.
-
- Args
- ----
- response: String; GRASS raster with labelled pixels
- predictors: List of GRASS rasters containing explanatory variables
- lowmem: Boolean, use numpy memmap to query predictors
-
- Returns
- -------
- training_data: Numpy array of extracted raster values
- training_labels: Numpy array of labels
- is_train: Row and Columns of label positions
- """
-
- current = Region()
-
- # open response raster as rasterrow and read as np array
- if RasterRow(response).exist() is True:
- roi_gr = RasterRow(response)
- roi_gr.open('r')
-
- if lowmem is False:
- response_np = np.array(roi_gr)
- else:
- response_np = np.memmap(
- tempfile.NamedTemporaryFile(),
- dtype='float32', mode='w+',
- shape=(current.rows, current.cols))
- response_np[:] = np.array(roi_gr)[:]
- else:
- gscript.fatal("GRASS response raster does not exist.... exiting")
-
- # determine number of predictor rasters
- n_features = len(predictors)
-
- # check to see if all predictors exist
- for i in range(n_features):
- if RasterRow(predictors[i]).exist() is not True:
- gscript.fatal("GRASS raster " + predictors[i] +
- " does not exist.... exiting")
-
- # check if any of those pixels are labelled (not equal to nodata)
- # can use even if roi is FCELL because nodata will be nan
- is_train = np.nonzero(response_np > -2147483648)
- training_labels = response_np[is_train]
- n_labels = np.array(is_train).shape[1]
-
- # Create a zero numpy array of len training labels
- if lowmem is False:
- training_data = np.zeros((n_labels, n_features))
- else:
- training_data = np.memmap(tempfile.NamedTemporaryFile(),
- dtype='float32', mode='w+',
- shape=(n_labels, n_features))
-
- # Loop through each raster and sample pixel values at training indexes
- if lowmem is True:
- feature_np = np.memmap(tempfile.NamedTemporaryFile(),
- dtype='float32', mode='w+',
- shape=(current.rows, current.cols))
-
- for f in range(n_features):
- predictor_gr = RasterRow(predictors[f])
- predictor_gr.open('r')
-
- if lowmem is False:
- feature_np = np.array(predictor_gr)
- else:
- feature_np[:] = np.array(predictor_gr)[:]
-
- training_data[0:n_labels, f] = feature_np[is_train]
-
- # close each predictor map
- predictor_gr.close()
-
- # convert any CELL map nodata values to NaN in the training data
- for i in range(n_features):
- training_data[training_data[:, i] == -2147483648, i] = np.nan
-
- # convert indexes of training pixels from tuple to n*2 np array
- is_train = np.array(is_train).T
-
- # close the response map
- roi_gr.close()
-
- return(training_data, training_labels, is_train)
-
-
-def maps_from_group(group):
- """
- Parse individual rasters into a list from an imagery group
-
- Args
- ----
- group: String; GRASS imagery group
-
- Returns
- -------
- maplist: List containing individual GRASS raster maps
- map_names: List with print friendly map names
- """
- groupmaps = im.group(group=group, flags="g",
- quiet=True, stdout_=PIPE).outputs.stdout
-
- maplist = groupmaps.split(os.linesep)
- maplist = maplist[:-1]
- map_names = []
-
- for rastername in maplist:
- map_names.append(rastername.split('@')[0])
-
- return(maplist, map_names)
-
-
-def extract_points(gvector, grasters, field):
- """
- Extract values from grass rasters using vector points input
-
- Args
- ----
- gvector: string, name of GRASS points vector
- grasters: list of names of GRASS rasters to query
- field: string, name of field in table to use as response variable
-
- Returns
- -------
- X: 2D numpy array of training data
- y: 1D numpy array with the response variable
- coordinates: 2D numpy array of sample coordinates
- """
- # open grass vector
- points = VectorTopo(gvector.split('@')[0])
- points.open('r')
-
- # create link to attribute table
- points.dblinks.by_name(name=gvector)
-
- # extract table field to numpy array
- table = points.table
- cur = table.execute("SELECT {field} FROM {name}".format(field=field, name=table.name))
- y = np.array([np.nan if c[0] is None else c[0] for c in cur])
-
- # extract raster data
- X = np.zeros((points.num_primitives()['point'], len(grasters)), dtype=float)
- for i, raster in enumerate(grasters):
- rio = RasterRow(raster)
- values = np.asarray(get_raster_for_points(points, rio))
- coordinates = values[:, 1:3]
- X[:, i] = values[:, 3]
-
- # set any grass integer nodata values to NaN
- X[X == -2147483648] = np.nan
-
- # remove missing response data
- X = X[~np.isnan(y)]
- coordinates = coordinates[~np.isnan(y)]
- y = y[~np.isnan(y)]
-
- # close
- points.close()
-
- return(X, y, coordinates)
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+from __future__ import absolute_import
+import numpy as np
+import os
+import tempfile
+from copy import deepcopy
+from numpy.random import RandomState
+from grass.pygrass.modules.shortcuts import raster as r
+from grass.pygrass.raster import RasterRow
+from grass.pygrass.gis.region import Region
+from grass.pygrass.raster.buffer import Buffer
+from grass.pygrass.modules.shortcuts import imagery as im
+from grass.pygrass.vector import VectorTopo
+from grass.pygrass.vector.table import Link
+from grass.pygrass.utils import get_raster_for_points
+import grass.script as gscript
+from subprocess import PIPE
+
+def specificity_score(y_true, y_pred):
+ """
+ Calculate specificity score
+
+ Args
+ ----
+ y_true: 1D numpy array of truth values
+ y_pred: 1D numpy array of predicted classes
+
+ Returns
+ -------
+ specificity: specificity score
+ """
+ from sklearn.metrics import confusion_matrix
+
+ cm = confusion_matrix(y_true, y_pred)
+ tn = float(cm[0][0])
+ fp = float(cm[0][1])
+
+ return tn/(tn+fp)
+
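+# Illustrative usage (a sketch only; the arrays below are made-up):
+#
+#     >>> y_true = np.array([0, 0, 1, 1])
+#     >>> y_pred = np.array([0, 1, 1, 1])
+#     >>> specificity_score(y_true, y_pred)  # tn = 1, fp = 1
+#     0.5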
+
+def varimp_permutation(estimator, X_test, y_true,
+ n_permutations, scorer,
+ random_state):
+
+ """
+ Method to perform permutation-based feature importance during
+ cross-validation (cross-validation is applied externally to this
+ method)
+
+ Procedure is:
+ 1. Pass a fitted estimator and a test partition X, y
+ 2. Assess the score on the unpermuted test partition (best_score)
+ 3. Permute each variable in turn and record the difference between
+ best_score and the score obtained with the permuted variable
+ 4. Repeat (3) for n_permutations random permutations
+ 5. Average the repeats
+
+ Args
+ ----
+ estimator: estimator that has been fitted to a training partition
+ X_test, y_true: data and labels from a test partition
+ n_permutations: number of random permutations to apply
+ scorer: scikit-learn metric function to use
+ random_state: seed to pass to the numpy random.seed
+
+ Returns
+ -------
+ scores: scores for each predictor following permutation
+ """
+
+ # calculate the baseline score on the original, unpermuted variables
+ y_pred = estimator.predict(X_test)
+ best_score = scorer(y_true, y_pred)
+
+ np.random.seed(seed=random_state)
+ rstate = RandomState(random_state)
+ scores = np.zeros((n_permutations, X_test.shape[1]))
+
+ # outer loop to repeat the permutation n_permutations times
+ for rep in range(n_permutations):
+
+ # inner loop to permute each predictor variable and assess
+ # the resulting drop in score
+ for i in range(X_test.shape[1]):
+ Xscram = np.copy(X_test)
+ Xscram[:, i] = rstate.choice(X_test[:, i], X_test.shape[0])
+
+ # predict the test partition using the permuted variable
+ y_pred = estimator.predict(Xscram)
+ scores[rep, i] = best_score-scorer(y_true, y_pred)
+ if scores[rep, i] < 0:
+ scores[rep, i] = 0
+
+ # average the repetitions
+ scores = scores.mean(axis=0)
+
+ return scores
+
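+# Illustrative usage (sketch; assumes fitted_clf was already fitted on a
+# training partition and X_test/y_test are the held-out arrays):
+#
+#     >>> from sklearn.metrics import accuracy_score
+#     >>> imp = varimp_permutation(fitted_clf, X_test, y_test,
+#     ...                          n_permutations=25, scorer=accuracy_score,
+#     ...                          random_state=1)
+#     >>> # imp[i] is the mean drop in score when predictor i is permuted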
+
+def cross_val_scores(estimator, X, y, groups=None, sample_weight=None, cv=3,
+ scoring='accuracy', feature_importances=False,
+ n_permutations=25, models=False, random_state=None):
+ """
+ Stratified Kfold and GroupFold cross-validation using multiple
+ scoring metrics and permutation feature importances
+
+ Args
+ ----
+ estimator: Scikit learn estimator
+ X: 2D numpy array of training data
+ y: 1D numpy array representing response variable
+ groups: 1D numpy array containing group labels
+ sample_weight: 1D numpy array[n_samples,] of sample weights
+ cv: Integer of cross-validation folds or sklearn.model_selection object
+ scoring: List of performance metrics to use
+ feature_importances: Boolean to perform permutation-based importances
+ n_permutations: Number of permutations during feature importance
+ models: Boolean, return a list of the fitted models
+ random_state: Seed to pass to the random number generator
+
+ Returns
+ -------
+ scores: Dict of global performance metrics per fold
+ byclass_scores: Dict of per-class performance metrics per fold
+ fimp: Array of permutation feature importances, or None
+ fitted_models: List of the fitted models, or an empty list
+ """
+
+ from sklearn import metrics
+ from sklearn.model_selection import (
+ RandomizedSearchCV, GridSearchCV, StratifiedKFold)
+
+ # deepcopy estimator
+ estimator = deepcopy(estimator)
+ fitted_models = []
+
+ # create model_selection method
+ if isinstance(cv, int): cv = StratifiedKFold(n_splits=cv)
+
+ # create dictionary of lists to store metrics
+ if isinstance(scoring, basestring): scoring = [scoring]
+ scores = dict.fromkeys(scoring)
+ scores = {key: [] for key, value in scores.iteritems()}
+ scoring_methods = {'accuracy': metrics.accuracy_score,
+ 'balanced_accuracy': metrics.recall_score,
+ 'average_precision': metrics.average_precision_score,
+ 'brier_loss': metrics.brier_score_loss,
+ 'kappa': metrics.cohen_kappa_score,
+ 'f1': metrics.f1_score,
+ 'fbeta': metrics.fbeta_score,
+ 'hamming_loss': metrics.hamming_loss,
+ 'jaccard_similarity': metrics.jaccard_similarity_score,
+ 'log_loss': metrics.log_loss,
+ 'matthews_corrcoef': metrics.matthews_corrcoef,
+ 'precision': metrics.precision_score,
+ 'recall': metrics.recall_score,
+ 'specificity': specificity_score,
+ 'roc_auc': metrics.roc_auc_score,
+ 'zero_one_loss': metrics.zero_one_loss,
+ 'r2': metrics.r2_score,
+ 'neg_mean_squared_error': metrics.mean_squared_error}
+
+ byclass_methods = {'f1': metrics.f1_score,
+ 'fbeta': metrics.fbeta_score,
+ 'precision': metrics.precision_score,
+ 'recall': metrics.recall_score}
+
+ # create dictionary to store byclass metric results
+ n_classes = len(np.unique(y))
+ labels = np.unique(y)
+ byclass_scores = dict.fromkeys(byclass_methods)
+ byclass_scores = {key: np.zeros((0, n_classes)) for key, value in byclass_scores.iteritems()}
+
+ # remove any byclass_scorers that are not in the scoring list
+ byclass_scores = {key: value for key, value in byclass_scores.iteritems() if key in scores}
+
+ # check if remaining scorers are valid sklearn metrics
+ for i in scores.keys():
+ if i not in scoring_methods:
+ print('Scoring ' + i + ' is not a valid scoring method')
+ print('Valid methods are:')
+ print(scoring_methods.keys())
+
+ # set averaging type for global binary or multiclass scores
+ if len(np.unique(y)) == 2 and all([0, 1] == np.unique(y)):
+ average = 'binary'
+ else:
+ average = 'macro'
+
+ # create np array to store feature importance scores
+ if feature_importances is True:
+ fimp = np.zeros((cv.get_n_splits(), X.shape[1]))
+ fimp[:] = np.nan
+ else:
+ fimp = None
+
+ # generate Kfold indices
+ if groups is None:
+ k_fold = cv.split(X, y)
+ else:
+ k_fold = cv.split(X, y, groups=groups)
+
+ # train on k-1 folds and test on the remaining fold
+ for train_indices, test_indices in k_fold:
+
+ # create training and test folds
+ X_train, X_test = X[train_indices], X[test_indices]
+ y_train, y_test = y[train_indices], y[test_indices]
+ if groups is not None: groups_train = groups[train_indices]
+ else: groups_train = None
+
+ # subset training and test fold sample_weight
+ if sample_weight is not None: weights = sample_weight[train_indices]
+
+ # train estimator
+ if groups is not None and isinstance(estimator, (RandomizedSearchCV, GridSearchCV)) is True:
+ if sample_weight is None: estimator.fit(X_train, y_train, groups=groups_train)
+ else: estimator.fit(X_train, y_train, groups=groups_train, sample_weight=weights)
+ else:
+ if sample_weight is None: estimator.fit(X_train, y_train)
+ else: estimator.fit(X_train, y_train, sample_weight=weights)
+
+ # optionally store the fitted models on each resample
+ if models is True: fitted_models.append(deepcopy(estimator))
+
+ # prediction of test fold
+ y_pred = estimator.predict(X_test)
+
+ # calculate global performance metrics
+ for m in scores.keys():
+ # metrics that require probabilities
+ if m == 'brier_loss' or m == 'roc_auc':
+ y_prob = estimator.predict_proba(X_test)[:, 1]
+ scores[m] = np.append(
+ scores[m], scoring_methods[m](y_test, y_prob))
+
+ # metrics that have no averaging for multiclass
+ elif m in ('kappa', 'specificity', 'accuracy', 'hamming_loss',
+ 'jaccard_similarity', 'log_loss', 'zero_one_loss',
+ 'matthews_corrcoef'):
+ scores[m] = np.append(
+ scores[m], scoring_methods[m](y_test, y_pred))
+
+ # balanced accuracy
+ elif m == 'balanced_accuracy':
+ scores[m] = np.append(
+ scores[m], scoring_methods[m](
+ y_test, y_pred, average='macro'))
+
+ # metrics that have averaging for multiclass
+ else:
+ scores[m] = np.append(
+ scores[m], scoring_methods[m](
+ y_test, y_pred, average=average))
+
+ # calculate per-class performance metrics
+ for key in byclass_scores.keys():
+ byclass_scores[key] = np.vstack((
+ byclass_scores[key], byclass_methods[key](
+ y_test, y_pred, labels=labels, average=None)))
+
+ # feature importances using permutation
+ if feature_importances is True:
+ if bool((np.isnan(fimp)).all()) is True:
+ fimp = varimp_permutation(
+ estimator, X_test, y_test, n_permutations,
+ scoring_methods[scoring[0]],
+ random_state)
+ else:
+ fimp = np.row_stack(
+ (fimp, varimp_permutation(
+ estimator, X_test, y_test,
+ n_permutations, scoring_methods[scoring[0]],
+ random_state)))
+
+ return(scores, byclass_scores, fimp, fitted_models)
+
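+# Illustrative usage (sketch; X and y are hypothetical training arrays):
+#
+#     >>> from sklearn.ensemble import RandomForestClassifier
+#     >>> clf = RandomForestClassifier(n_estimators=100, random_state=1)
+#     >>> scores, byclass, fimp, models = cross_val_scores(
+#     ...     clf, X, y, cv=3, scoring=['accuracy', 'kappa'],
+#     ...     feature_importances=True, random_state=1)
+#     >>> scores['accuracy'].mean()  # mean accuracy over the 3 folds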
+
+def predict(estimator, predictors, output, predict_type='raw',
+ index=None, rowincr=25):
+ """
+ Prediction on list of GRASS rasters using a fitted scikit learn model
+
+ Args
+ ----
+ estimator: scikit-learn estimator object
+ predictors: list of GRASS rasters
+ output: Name of GRASS raster to output classification results
+ predict_type: character, 'raw' for classification/regression;
+ 'prob' for class probabilities
+ index: Optional, list of class indices to export
+ rowincr: Integer of raster rows to process at one time
+ """
+
+ # convert potential single index to list
+ if isinstance(index, int): index = [index]
+
+ # open predictors as list of rasterrow objects
+ current = Region()
+ n_features = len(predictors)
+ rasstack = [0] * n_features
+
+ for i in range(n_features):
+ rasstack[i] = RasterRow(predictors[i])
+ if rasstack[i].exist() is True:
+ rasstack[i].open('r')
+ else:
+ gscript.fatal("GRASS raster " + predictors[i] +
+ " does not exist.... exiting")
+
+ # Prediction using blocks of rows per iteration
+ for rowblock in range(0, current.rows, rowincr):
+ gscript.percent(rowblock, current.rows, rowincr)
+
+ # check that the row increment does not exceed the number of rows
+ if rowblock+rowincr > current.rows:
+ rowincr = current.rows - rowblock
+ img_np_row = np.zeros((rowincr, current.cols, n_features))
+
+ # loop through each row, and each band and add to 2D img_np_row
+ for row in range(rowblock, rowblock+rowincr, 1):
+ for band in range(n_features):
+ img_np_row[row-rowblock, :, band] = \
+ np.array(rasstack[band][row])
+
+ # create mask
+ img_np_row[img_np_row == -2147483648] = np.nan
+ mask = np.zeros((img_np_row.shape[0], img_np_row.shape[1]))
+ for feature in range(n_features):
+ invalid_indexes = np.nonzero(np.isnan(img_np_row[:, :, feature]))
+ mask[invalid_indexes] = np.nan
+
+ # reshape each row-band matrix into a n*m array
+ nsamples = rowincr * current.cols
+ flat_pixels = img_np_row.reshape((nsamples, n_features))
+
+ # remove NaNs prior to passing to scikit-learn predict
+ flat_pixels = np.nan_to_num(flat_pixels)
+
+ # perform prediction for classification/regression
+ if predict_type == 'raw':
+ result = estimator.predict(flat_pixels)
+ result = result.reshape((rowincr, current.cols))
+
+ # determine nodata value and grass raster type
+ if result.dtype == 'float':
+ nodata = np.nan
+ ftype = 'FCELL'
+ else:
+ nodata = -2147483648
+ ftype = 'CELL'
+
+ # replace NaN values so that the prediction does not have a border
+ result[np.nonzero(np.isnan(mask))] = nodata
+
+ # on first iteration create the RasterRow object
+ if rowblock == 0:
+ if predict_type == 'raw':
+ classification = RasterRow(output)
+ classification.open('w', ftype, overwrite=True)
+
+ # write the classification result
+ for row in range(rowincr):
+ newrow = Buffer((result.shape[1],), mtype=ftype)
+ newrow[:] = result[row, :]
+ classification.put_row(newrow)
+
+ # perform prediction for class probabilities
+ if predict_type == 'prob':
+ result_proba = estimator.predict_proba(flat_pixels)
+
+ # on first loop determine number of probability classes
+ # and open rasterrow objects for writing
+ if rowblock == 0:
+ if index is None:
+ index = range(result_proba.shape[1])
+ n_classes = len(index)
+ else:
+ n_classes = len(np.unique(index))
+
+ # create and open RasterRow objects for probabilities
+ prob_out_raster = [0] * n_classes
+ prob = [0] * n_classes
+ for iclass, label in enumerate(index):
+ prob_out_raster[iclass] = output + '_classPr' + str(label)
+ prob[iclass] = RasterRow(prob_out_raster[iclass])
+ prob[iclass].open('w', 'FCELL', overwrite=True)
+
+ for iclass, label in enumerate(index):
+ result_proba_class = result_proba[:, label]
+ result_proba_class = result_proba_class.reshape((rowincr, current.cols))
+ result_proba_class[np.nonzero(np.isnan(mask))] = np.nan
+
+ for row in range(rowincr):
+ newrow = Buffer((result_proba_class.shape[1],), mtype='FCELL')
+ newrow[:] = result_proba_class[row, :]
+ prob[iclass].put_row(newrow)
+
+ # close all maps
+ for i in range(n_features): rasstack[i].close()
+ if predict_type == 'raw': classification.close()
+ if predict_type == 'prob':
+ try:
+ for iclass in range(n_classes):
+ prob[iclass].close()
+ except NameError:
+ # no probability maps were created (no rows were processed)
+ pass
+
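+# Illustrative usage (sketch; the raster map names are hypothetical):
+#
+#     >>> predict(clf, predictors=['band1', 'band2', 'band3'],
+#     ...         output='classification', predict_type='raw')
+#     >>> # class probabilities are written to <output>_classPr<label> maps:
+#     >>> predict(clf, predictors=['band1', 'band2', 'band3'],
+#     ...         output='classification', predict_type='prob')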
+
+def model_classifiers(estimator, random_state, n_jobs, p, weights=None):
+ """
+ Provides the classifiers and parameters used by the module
+
+ Args
+ ----
+ estimator: Name of estimator
+ random_state: Seed to use in randomized components
+ n_jobs: Integer, number of processing cores to use
+ p: Dict containing classifier settings
+ weights: Boolean; if True, apply 'balanced' class weights
+
+ Returns
+ -------
+ clf: Scikit-learn classifier object
+ mode: Flag to indicate whether classifier performs classification or
+ regression
+ """
+
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+ from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
+ from sklearn.naive_bayes import GaussianNB
+ from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
+ from sklearn.ensemble import (
+ RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier,
+ ExtraTreesRegressor)
+ from sklearn.ensemble import GradientBoostingClassifier
+ from sklearn.ensemble import GradientBoostingRegressor
+ from sklearn.svm import SVC
+ from sklearn.neighbors import KNeighborsClassifier
+
+ # convert balanced boolean to scikit learn method
+ if weights is True:
+ weights = 'balanced'
+ else: weights = None
+
+ # optional packages that add additional classifiers here
+ if estimator == 'EarthClassifier' or estimator == 'EarthRegressor':
+ try:
+ from sklearn.pipeline import Pipeline
+ from pyearth import Earth
+
+ earth_classifier = Pipeline([('Earth',
+ Earth(max_degree=p['max_degree'])),
+ ('Logistic', LogisticRegression(n_jobs=n_jobs))])
+
+ classifiers = {'EarthClassifier': earth_classifier,
+ 'EarthRegressor': Earth(max_degree=p['max_degree'])}
+ except ImportError:
+ gscript.fatal('Py-earth package not installed')
+
+ elif estimator == 'XGBClassifier' or estimator == 'XGBRegressor':
+ try:
+ from xgboost import XGBClassifier, XGBRegressor
+
+ if p['max_depth'] is None:
+ p['max_depth'] = 3
+
+ classifiers = {
+ 'XGBClassifier':
+ XGBClassifier(learning_rate=p['learning_rate'],
+ n_estimators=p['n_estimators'],
+ max_depth=p['max_depth'],
+ subsample=p['subsample'],
+ nthread=n_jobs),
+ 'XGBRegressor':
+ XGBRegressor(learning_rate=p['learning_rate'],
+ n_estimators=p['n_estimators'],
+ max_depth=p['max_depth'],
+ subsample=p['subsample'],
+ nthread=n_jobs)}
+ except ImportError:
+ gscript.fatal('XGBoost package not installed')
+ else:
+ # core sklearn classifiers go here
+ classifiers = {
+ 'SVC': SVC(C=p['C'],
+ class_weight=weights,
+ probability=True,
+ random_state=random_state),
+ 'LogisticRegression':
+ LogisticRegression(C=p['C'],
+ class_weight=weights,
+ random_state=random_state,
+ n_jobs=n_jobs,
+ fit_intercept=True),
+ 'DecisionTreeClassifier':
+ DecisionTreeClassifier(max_depth=p['max_depth'],
+ max_features=p['max_features'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ class_weight=weights,
+ random_state=random_state),
+ 'DecisionTreeRegressor':
+ DecisionTreeRegressor(max_features=p['max_features'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ random_state=random_state),
+ 'RandomForestClassifier':
+ RandomForestClassifier(n_estimators=p['n_estimators'],
+ max_features=p['max_features'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ class_weight=weights,
+ random_state=random_state,
+ n_jobs=n_jobs,
+ oob_score=False),
+ 'RandomForestRegressor':
+ RandomForestRegressor(n_estimators=p['n_estimators'],
+ max_features=p['max_features'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ random_state=random_state,
+ n_jobs=n_jobs,
+ oob_score=False),
+ 'ExtraTreesClassifier':
+ ExtraTreesClassifier(n_estimators=p['n_estimators'],
+ max_features=p['max_features'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ class_weight=weights,
+ random_state=random_state,
+ n_jobs=n_jobs,
+ oob_score=False),
+ 'ExtraTreesRegressor':
+ ExtraTreesRegressor(n_estimators=p['n_estimators'],
+ max_features=p['max_features'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ random_state=random_state,
+ n_jobs=n_jobs,
+ oob_score=False),
+ 'GradientBoostingClassifier':
+ GradientBoostingClassifier(learning_rate=p['learning_rate'],
+ n_estimators=p['n_estimators'],
+ max_depth=p['max_depth'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ subsample=p['subsample'],
+ max_features=p['max_features'],
+ random_state=random_state),
+ 'GradientBoostingRegressor':
+ GradientBoostingRegressor(learning_rate=p['learning_rate'],
+ n_estimators=p['n_estimators'],
+ max_depth=p['max_depth'],
+ min_samples_split=p['min_samples_split'],
+ min_samples_leaf=p['min_samples_leaf'],
+ subsample=p['subsample'],
+ max_features=p['max_features'],
+ random_state=random_state),
+ 'GaussianNB': GaussianNB(),
+ 'LinearDiscriminantAnalysis': LinearDiscriminantAnalysis(),
+ 'QuadraticDiscriminantAnalysis': QuadraticDiscriminantAnalysis(),
+ 'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=p['n_neighbors'],
+ weights=p['weights'],
+ n_jobs=n_jobs)
+ }
+
+ # define classifier
+ clf = classifiers[estimator]
+
+ # classification or regression
+ if estimator in (
+ 'LogisticRegression', 'DecisionTreeClassifier',
+ 'RandomForestClassifier', 'ExtraTreesClassifier',
+ 'GradientBoostingClassifier', 'GaussianNB',
+ 'LinearDiscriminantAnalysis', 'QuadraticDiscriminantAnalysis',
+ 'EarthClassifier', 'XGBClassifier', 'SVC',
+ 'KNeighborsClassifier'):
+ mode = 'classification'
+ else:
+ mode = 'regression'
+
+ return (clf, mode)
+
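+# Illustrative usage (sketch; note that the core branch instantiates every
+# scikit-learn classifier eagerly, so p must contain all of the keys below
+# even when only one estimator is requested):
+#
+#     >>> params = {'C': 1.0, 'max_depth': None, 'max_features': 'sqrt',
+#     ...           'min_samples_split': 2, 'min_samples_leaf': 1,
+#     ...           'n_estimators': 100, 'learning_rate': 0.1,
+#     ...           'subsample': 1.0, 'n_neighbors': 5, 'weights': 'uniform'}
+#     >>> clf, mode = model_classifiers('RandomForestClassifier',
+#     ...                               random_state=1, n_jobs=1, p=params)
+#     >>> mode
+#     'classification'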
+
+def save_training_data(X, y, groups, file):
+ """
+ Saves any extracted training data to a csv file
+
+ Args
+ ----
+ X: Numpy array containing predictor values
+ y: Numpy array containing labels
+ groups: Numpy array of group labels
+ file: Path to a csv file to save data to
+ """
+
+ # if there are no group labels, create a nan filled array
+ if groups is None:
+ groups = np.empty((y.shape[0]))
+ groups[:] = np.nan
+
+ training_data = np.column_stack([X, y, groups])
+ np.savetxt(file, training_data, delimiter=',')
+
+
+def load_training_data(file):
+ """
+ Loads training data and labels from a csv file
+
+ Args
+ ----
+ file: Path to a csv file to load data from
+
+ Returns
+ -------
+ X: Numpy array containing predictor values
+ y: Numpy array containing labels
+ groups: Numpy array of group labels, or None
+ """
+
+ training_data = np.loadtxt(file, delimiter=',')
+ n_cols = training_data.shape[1]
+ last_Xcol = n_cols-2
+
+ # check to see if last column contains group labels or nans
+ groups = training_data[:, -1]
+
+ # if all nans then set groups to None
+ if bool(np.isnan(groups).all()) is True:
+ groups = None
+
+ # fetch X and y
+ X = training_data[:, 0:last_Xcol]
+ y = training_data[:, -2]
+
+ return(X, y, groups)
+
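+# Illustrative round trip (sketch; the csv path is hypothetical):
+#
+#     >>> save_training_data(X, y, groups=None, file='/tmp/training.csv')
+#     >>> X2, y2, groups2 = load_training_data('/tmp/training.csv')
+#     >>> groups2 is None  # an all-NaN group column is read back as None
+#     True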
+
+def extract(response, predictors, lowmem=False):
+ """
+ Samples a list of GRASS rasters using a labelled raster.
+ Sampling is performed one raster at a time.
+
+ Args
+ ----
+ response: String; GRASS raster with labelled pixels
+ predictors: List of GRASS rasters containing explanatory variables
+ lowmem: Boolean, use numpy memmap to query predictors
+
+ Returns
+ -------
+ training_data: Numpy array of extracted raster values
+ training_labels: Numpy array of labels
+ is_train: Row and Columns of label positions
+ """
+
+ current = Region()
+
+ # open response raster as rasterrow and read as np array
+ if RasterRow(response).exist() is True:
+ roi_gr = RasterRow(response)
+ roi_gr.open('r')
+
+ if lowmem is False:
+ response_np = np.array(roi_gr)
+ else:
+ response_np = np.memmap(
+ tempfile.NamedTemporaryFile(),
+ dtype='float32', mode='w+',
+ shape=(current.rows, current.cols))
+ response_np[:] = np.array(roi_gr)[:]
+ else:
+ gscript.fatal("GRASS response raster does not exist.... exiting")
+
+ # determine number of predictor rasters
+ n_features = len(predictors)
+
+ # check to see if all predictors exist
+ for i in range(n_features):
+ if RasterRow(predictors[i]).exist() is not True:
+ gscript.fatal("GRASS raster " + predictors[i] +
+ " does not exist.... exiting")
+
+ # check if any of those pixels are labelled (not equal to nodata)
+ # can use even if roi is FCELL because nodata will be nan
+ is_train = np.nonzero(response_np > -2147483648)
+ training_labels = response_np[is_train]
+ n_labels = np.array(is_train).shape[1]
+
+ # Create a zero numpy array of len training labels
+ if lowmem is False:
+ training_data = np.zeros((n_labels, n_features))
+ else:
+ training_data = np.memmap(tempfile.NamedTemporaryFile(),
+ dtype='float32', mode='w+',
+ shape=(n_labels, n_features))
+
+ # Loop through each raster and sample pixel values at training indexes
+ if lowmem is True:
+ feature_np = np.memmap(tempfile.NamedTemporaryFile(),
+ dtype='float32', mode='w+',
+ shape=(current.rows, current.cols))
+
+ for f in range(n_features):
+ predictor_gr = RasterRow(predictors[f])
+ predictor_gr.open('r')
+
+ if lowmem is False:
+ feature_np = np.array(predictor_gr)
+ else:
+ feature_np[:] = np.array(predictor_gr)[:]
+
+ training_data[0:n_labels, f] = feature_np[is_train]
+
+ # close each predictor map
+ predictor_gr.close()
+
+ # convert any CELL map nodata values to NaN in the training data
+ for i in range(n_features):
+ training_data[training_data[:, i] == -2147483648, i] = np.nan
+
+ # convert indexes of training pixels from tuple to n*2 np array
+ is_train = np.array(is_train).T
+
+ # close the response map
+ roi_gr.close()
+
+ return(training_data, training_labels, is_train)
+
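+# Illustrative usage (sketch; the raster map names are hypothetical):
+#
+#     >>> X, y, rc = extract(response='landclass_roi',
+#     ...                    predictors=['band1', 'band2'], lowmem=False)
+#     >>> # X: one row per labelled pixel, one column per predictor
+#     >>> # rc: row/column indices of the labelled pixels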
+
+def maps_from_group(group):
+ """
+ Parse individual rasters into a list from an imagery group
+
+ Args
+ ----
+ group: String; GRASS imagery group
+
+ Returns
+ -------
+ maplist: List containing individual GRASS raster maps
+ map_names: List with print friendly map names
+ """
+ groupmaps = im.group(group=group, flags="g",
+ quiet=True, stdout_=PIPE).outputs.stdout
+
+ maplist = groupmaps.split(os.linesep)
+ maplist = maplist[:-1]
+ map_names = []
+
+ for rastername in maplist:
+ map_names.append(rastername.split('@')[0])
+
+ return(maplist, map_names)
+
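+# Illustrative usage (sketch; the imagery group name is hypothetical):
+#
+#     >>> maplist, names = maps_from_group('landsat_group')
+#     >>> # maplist keeps the @mapset suffix; names strips it for display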
+
+def extract_points(gvector, grasters, field):
+ """
+ Extract values from grass rasters using vector points input
+
+ Args
+ ----
+ gvector: string, name of GRASS points vector
+ grasters: list of names of GRASS rasters to query
+ field: string, name of field in table to use as response variable
+
+ Returns
+ -------
+ X: 2D numpy array of training data
+ y: 1D numpy array with the response variable
+ coordinates: 2D numpy array of sample coordinates
+ """
+ # open grass vector
+ points = VectorTopo(gvector.split('@')[0])
+ points.open('r')
+
+ # create link to attribute table
+ points.dblinks.by_name(name=gvector)
+
+ # extract table field to numpy array
+ table = points.table
+ cur = table.execute("SELECT {field} FROM {name}".format(field=field, name=table.name))
+ y = np.array([np.nan if c[0] is None else c[0] for c in cur])
+
+ # extract raster data
+ X = np.zeros((points.num_primitives()['point'], len(grasters)), dtype=float)
+ for i, raster in enumerate(grasters):
+ rio = RasterRow(raster)
+ values = np.asarray(get_raster_for_points(points, rio))
+ coordinates = values[:, 1:3]
+ X[:, i] = values[:, 3]
+
+ # set any grass integer nodata values to NaN
+ X[X == -2147483648] = np.nan
+
+ # remove missing response data
+ X = X[~np.isnan(y)]
+ coordinates = coordinates[~np.isnan(y)]
+ y = y[~np.isnan(y)]
+
+ # close
+ points.close()
+
+ return(X, y, coordinates)
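+
+# Illustrative usage (sketch; the vector, raster and column names are
+# hypothetical):
+#
+#     >>> X, y, coords = extract_points(gvector='sample_points',
+#     ...                               grasters=['band1', 'band2'],
+#     ...                               field='landclass')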
Property changes on: grass-addons/grass7/raster/r.learn.ml/r_learn_utils.py
___________________________________________________________________
Added: svn:mime-type
+ text/x-python
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.learn.ml/rfclassification.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/Makefile
___________________________________________________________________
Added: svn:mime-type
+ text/x-makefile
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/erod.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/hydro.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/input.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/main.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/observation_points.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/output.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Modified: grass-addons/grass7/raster/r.sim.water.mp/r.sim.water.mp.html
===================================================================
--- grass-addons/grass7/raster/r.sim.water.mp/r.sim.water.mp.html 2017-04-29 20:09:01 UTC (rev 70988)
+++ grass-addons/grass7/raster/r.sim.water.mp/r.sim.water.mp.html 2017-04-29 21:42:18 UTC (rev 70989)
@@ -237,4 +237,4 @@
The International Series in Engineering and Computer Science: Volume 773. Springer New York Inc, p. 406.
</ul>
-<p><i>Last changed: $Date: 2016-03-08 09:06:33 +0100 (Út, 08 bře 2016) $</i>
+<p><i>Last changed: $Date$</i>
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/r.sim.water.mp.html
___________________________________________________________________
Added: svn:mime-type
+ text/html
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/r_sim_water.png
___________________________________________________________________
Modified: svn:mime-type
- application/octet-stream
+ image/png
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/random.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/simlib.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/spearfish.sh
___________________________________________________________________
Added: svn:mime-type
+ text/x-sh
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/utils.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sim.water.mp/waterglobs.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/Makefile
___________________________________________________________________
Added: svn:mime-type
+ text/x-makefile
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/README
___________________________________________________________________
Added: svn:mime-type
+ text/plain
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/TODO
___________________________________________________________________
Added: svn:mime-type
+ text/plain
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/dalloc.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/local_proto.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/main.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Modified: grass-addons/grass7/raster/r.sun.mp/r.sun.mp.html
===================================================================
--- grass-addons/grass7/raster/r.sun.mp/r.sun.mp.html 2017-04-29 20:09:01 UTC (rev 70988)
+++ grass-addons/grass7/raster/r.sun.mp/r.sun.mp.html 2017-04-29 21:42:18 UTC (rev 70989)
@@ -381,4 +381,4 @@
</address>
<p>
-<i>Last changed: $Date: 2016-02-19 02:42:11 +0100 (Pá, 19 úno 2016) $</i>
+<i>Last changed: $Date$</i>
Property changes on: grass-addons/grass7/raster/r.sun.mp/r.sun.mp.html
___________________________________________________________________
Added: svn:mime-type
+ text/html
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/rsunglobals.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/rsunlib.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster/r.sun.mp/sunradstruct.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster3d/r3.to.group/Makefile
___________________________________________________________________
Added: svn:mime-type
+ text/x-makefile
Added: svn:eol-style
+ native
Modified: grass-addons/grass7/raster3d/r3.to.group/r3.to.group.html
===================================================================
--- grass-addons/grass7/raster3d/r3.to.group/r3.to.group.html 2017-04-29 20:09:01 UTC (rev 70988)
+++ grass-addons/grass7/raster3d/r3.to.group/r3.to.group.html 2017-04-29 21:42:18 UTC (rev 70989)
@@ -29,4 +29,4 @@
Vaclav Petras, <a href="http://geospatial.ncsu.edu/osgeorel/">NCSU GeoForAll Lab</a><br>
<p>
-<i>Last changed: $Date: 2015-09-02 07:01:34 -0400 (Wed, 02 Sep 2015) $</i>
+<i>Last changed: $Date$</i>
Property changes on: grass-addons/grass7/raster3d/r3.to.group/r3.to.group.html
___________________________________________________________________
Added: svn:mime-type
+ text/html
Added: svn:keywords
+ Author Date Id
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/raster3d/r3.to.group/r3.to.group.py
___________________________________________________________________
Added: svn:mime-type
+ text/x-python
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/vector/v.profile/local_proto.h
___________________________________________________________________
Added: svn:mime-type
+ text/x-chdr
Added: svn:eol-style
+ native
Property changes on: grass-addons/grass7/vector/v.profile/processors.c
___________________________________________________________________
Added: svn:mime-type
+ text/x-csrc
Added: svn:eol-style
+ native