[GRASS-SVN] r70814 - grass-addons/grass7/raster/r.vif

svn_grass at osgeo.org svn_grass at osgeo.org
Fri Mar 31 05:56:38 PDT 2017


Author: pvanbosgeo
Date: 2017-03-31 05:56:38 -0700 (Fri, 31 Mar 2017)
New Revision: 70814

Modified:
   grass-addons/grass7/raster/r.vif/r.vif.html
   grass-addons/grass7/raster/r.vif/r.vif.py
Log:
r.vif addon: computation of vif is now in Python, resulting in strongly reduced computing times. There is an option to use a random subsample of raster cells as input to further speed up the computations. There is furthermore a 'low-memory' option (this is the old implementation which uses r.regression.multi).

Modified: grass-addons/grass7/raster/r.vif/r.vif.html
===================================================================
--- grass-addons/grass7/raster/r.vif/r.vif.html	2017-03-29 00:13:25 UTC (rev 70813)
+++ grass-addons/grass7/raster/r.vif/r.vif.html	2017-03-31 12:56:38 UTC (rev 70814)
@@ -21,7 +21,7 @@
 e.g., multiple regression analysis.
 
 <p>The user can optionally select one or more variables to be
-retained in the stepwise selection. For example, the user selects
+retained in the stepwise selection. For example, let's assume the user selects
 the variable <i>bio_5</i> to be retained. If in any step this
 variable has the highest VIF, the variable with the next highest VIF
 will be removed instead (see the examples).
@@ -29,8 +29,31 @@
 <p>The user can set the 's' flag to only print the finally selected
 variables to the standard output (console). Note that this only
 works when the stepwise selection procedure is invoked, i.e., when
-the maxvif is set.
+the maxvif is set. This option makes it easier to use the output of r.vif
+in another function directly (see example ).
 
+<h2>NOTES</h2>
+
+To compute the vif all data layers are read in as a numpy array (non-data cells
+are ignored). When input layers are large or many memory usage may become
+problematic. In such cases the user may opt to sample raster values for random
+locations and use that to compute the vif. The quantity of random locations to
+be generated either can be defined as a positive integer, or as a percentage of
+the raster map layer's cells (see <a href="r.random.html">r.random</a> for
+details).
+
+<p>Using random sub-set of raster cells as input means that the vif values may
+vary between runs. If the sub-set is too small it may even lead to differences
+in variables selected when running the step-wise procedure. Special care should
+be taken when many of the equations are underdetermined (when the vif is
+<i>Inf</i>).
+
+<p>As an alternative, the user can set the <em>f</em> flag to evoke the
+'low-memory option'. This will use <a
+href="r.regression.multi">r.regression.multi</a> in the background (this is an
+addon that needs to be installed first). It uses all raster values and can
+handle very large raster layers. The disadvantage is that it runs much slower.
+
 <h2>EXAMPLES</h2>
 
 The following examples are based on the nc_climate_spm_2000_2012 sample data
@@ -123,7 +146,8 @@
 
 selected variables are:
 --------------------------------------
-2011_02_precip, 2011_03_precip, 2011_04_precip, 2011_05_precip, 2011_06_precip, 2011_07_precip, 2011_08_precip
+2011_02_precip, 2011_03_precip, 2011_04_precip, 2011_05_precip, 2011_06_precip,
+2011_07_precip, 2011_08_precip
 
 Statistics are written to results2.csv
 
@@ -144,7 +168,8 @@
 The output is:
 
 <div class="code"><pre>
-2011_01_precip,2011_04_precip,2011_05_precip,2011_06_precip,2011_07_precip,2011_08_precip
+2011_01_precip,2011_04_precip,2011_05_precip,2011_06_precip,2011_07_precip,
+2011_08_precip
 </pre></div>
 
 This output can be captured in a variable 'SELECTION', which is used
@@ -191,6 +216,6 @@
 
 <h2>AUTHOR</h2>
 
-Paulo van Breugel, paulo at ecodiv.org
+Paulo van Breugel, paulo at ecodiv.earth
 
 <p><i>Last changed: $Date$</i>

Modified: grass-addons/grass7/raster/r.vif/r.vif.py
===================================================================
--- grass-addons/grass7/raster/r.vif/r.vif.py	2017-03-29 00:13:25 UTC (rev 70813)
+++ grass-addons/grass7/raster/r.vif/r.vif.py	2017-03-31 12:56:38 UTC (rev 70814)
@@ -6,18 +6,15 @@
 # MODULE:       r.vif
 # AUTHOR(S):    Paulo van Breugel <p.vanbreugel AT gmail.com>
 # PURPOSE:      Calculate the variance inflation factor of set of
-#               variables. Alternatively, to speed the calculation up,
-#               this can be done for an user defined number of random
-#               cells. The user can set a maximum VIF, in wich case the VIF
-#               will calculated again after removing the variables with the
-#               highest VIF. This will be repeated till the VIF falls below
-#               the user defined VIF threshold value.
-# Dependency:    r.regression.multi
+#               variables. The computation is done using an user defined number
+#               (or percentage) of random cells (default 10.000) as input. 
+#               The user can set a maximum VIF, in wich case the VIF will
+#               calculated again after removing the variables with the highest
+#               VIF. This will be repeated till the VIF falls below the user
+#               defined VIF threshold value.
+ #
+# COPYRIGHT: (C) 2015 - 2017 Paulo van Breugel and the GRASS Development Team
 #
-# COPYRIGHT: (C) 2015 Paulo van Breugel
-#            http://ecodiv.org
-#            http://pvanb.wordpress.com/
-#
 #            This program is free software under the GNU General Public
 #            License (>=v2). Read the file COPYING that comes with GRASS
 #            for details.
@@ -45,201 +42,283 @@
 #% key: retain
 #% type: string
 #% gisprompt: old,cell,raster
-#% description: variables to retain
+#% description: variables
 #% required: no
 #% multiple: yes
 #% guisection: Input
 #%end
 
-#%option G_OPT_F_OUTPUT
-#% key:file
-#% description: Name of output text file
-#% required: no
+#%option
+#% key: n
+#% type: string
+#% description: number of sample points (number or percentage)
+#% key_desc: number
 #% guisection: Input
+#% answer: 100%
 #%end
 
 #%option
 #% key: maxvif
-#% type: string
+#% type: double
 #% description: Maximum vif
 #% key_desc: number
 #% guisection: Input
 #%end
 
+#%option G_OPT_F_OUTPUT
+#% key:file
+#% description: Name of output text file
+#% required: no
+#% guisection: Input
+#%end
+
 #%flag
 #% key: s
-#% description: Only print selected variables
+#% description: Only print selected variables to screen
 #%end
 
+#%flag
+#% key: f
+#% description: low-memory option (will use full raster layers)
+#%end
+
 #%rules
 #%requires_all: -s,maxvif
 #%end
 
-# =============================================================================
-# General
-# =============================================================================
-
 # import libraries
 import os
 import sys
 import math
 import numpy as np
-import grass.script as grass
+from scipy import stats
+import uuid
+import tempfile
+import atexit
+import string
+import grass.script as gs
 
-# Check if running in GRASS
-if not os.environ.has_key("GISBASE"):
-    grass.message("You must be in GRASS GIS to run this program.")
-    sys.exit(1)
 
+# Functions
+CLEAN_RAST = []
 
-# Check if layers exist
+
+def cleanup():
+    """Remove temporary maps specified in the global list. In addition,
+    remove temporary files"""
+    cleanrast = list(reversed(CLEAN_RAST))
+    for rast in cleanrast:
+        gs.run_command("g.remove", flags="f", type="all",
+                       name=rast, quiet=True)
+
+
+def tmpname(prefix):
+    """Generate a tmp name which contains prefix. Store the name in the
+    global list. Use only for raster maps."""
+    tmpf = prefix + str(uuid.uuid4())
+    tmpf = string.replace(tmpf, '-', '_')
+    CLEAN_RAST.append(tmpf)
+    return tmpf
+
+
 def CheckLayer(envlay):
+    """Check if the input layers exist. If not, exit with warning"""
     for chl in xrange(len(envlay)):
-        ffile = grass.find_file(envlay[chl], element='cell')
+        ffile = gs.find_file(envlay[chl], element='cell')
         if ffile['fullname'] == '':
-            grass.fatal("The layer " + envlay[chl] + " does not exist.")
+            gs.fatal("The layer " + envlay[chl] + " does not exist.")
 
 
+def ReadData(raster, n):
+    """Read in the raster layers as a numpy array. """
+    gs.message("Reading in the data ...")
+    if not n == "100%":
+        # Create mask random locations
+        new_mask = tmpname("rvif")
+        gs.run_command("r.random", input=raster[0], npoints=n, raster=new_mask,
+                       quiet=True)
+        exist_mask = gs.find_file(name='MASK', element='cell',
+                                  mapset=gs.gisenv()['MAPSET'])
+        if exist_mask['fullname']:
+            mask_backup = tmpname('rvifoldmask')
+            gs.run_command("g.rename", raster=["MASK", mask_backup],
+                           quiet=True)
+        gs.run_command("r.mask", raster=new_mask, quiet=True)
+
+    # Get the raster values at sample points
+    tmpcov = tempfile.mkstemp()[1]
+    with open(tmpcov, "w") as text_file:
+        text_file.write(
+                gs.read_command("r.stats", flags="1n", input=raster,
+                                quiet=True, separator="comma"))
+    p = np.loadtxt(tmpcov, skiprows=0, delimiter=",")
+
+    # Clean up
+    os.remove(tmpcov)
+    if not n == "100%":
+        gs.run_command("r.mask", flags="r", quiet=True)
+        if exist_mask['fullname']:
+            gs.run_command("g.rename", raster=[mask_backup, "MASK"],
+                           quiet=True)
+    return(p)
+
+
+def ComputeVif(mapx, mapy):
+    """Compute rsqr of linear regression between layers mapx and mapy."""
+    Xi = np.hstack((mapx, np.ones((mapx.shape[0], 1))))
+    mod, resid = np.linalg.lstsq(Xi, mapy)[:2]
+    if resid.size == 0:
+        resid = 0
+    r2 = float(1 - resid / (mapy.size * mapy.var()))
+    if float(r2) > 0.9999999999:
+        vif = float("inf")
+        sqrtvif = float("inf")
+    else:
+        vif = 1 / (1 - r2)
+        sqrtvif = math.sqrt(vif)
+    return [vif, sqrtvif]
+
+
+def ComputeVif2(mapx, mapy):
+    vifstat = gs.read_command("r.regression.multi", flags="g", quiet=True,
+                              mapx=mapx, mapy=mapy)
+    vifstat = vifstat.split('\n')
+    vifstat = [i.split('=') for i in vifstat]
+    if float(vifstat[1][1]) > 0.9999999999:
+        vif = float("inf")
+        sqrtvif = float("inf")
+    else:
+        rsqr = float(vifstat[1][1])
+        vif = 1 / (1 - rsqr)
+        sqrtvif = math.sqrt(vif)
+    return [vif, sqrtvif]
+
+
 # main function
-def main():
+def main(options, flags):
+    """Main function, called at execution time."""
 
     # Variables
-    IPF = options['maps']
-    IPF = IPF.split(',')
-    CheckLayer(IPF)
-    IPR = options['retain']
-    if IPR != '':
-        IPR = IPR.split(',')
+    IPF = options['maps'].split(',')
+    IPR = options['retain'].split(',')
+    if IPR != ['']:
         CheckLayer(IPR)
-
         for k in xrange(len(IPR)):
             if IPR[k] not in IPF:
                 IPF.extend([IPR[k]])
     IPFn = [i.split('@')[0] for i in IPF]
     IPRn = [i.split('@')[0] for i in IPR]
-
     MXVIF = options['maxvif']
     if MXVIF != '':
         MXVIF = float(MXVIF)
     OPF = options['file']
+    n = options['n']
     flag_s = flags['s']
+    flag_f = flags['f']
 
-    # =========================================================================
-    # Calculate VIF and write to standard output (& optionally to file)
-    # =========================================================================
-
     # Determine maximum width of the columns to be printed to std output
     name_lengths = []
     for i in IPF:
         name_lengths.append(len(i))
     nlength = max(name_lengths)
 
+    # Read in data
+    if not flag_f:
+        p = ReadData(IPF, n)
+
     # Create arrays to hold results (which will be written to file at end)
     out_vif = []
     out_sqrt = []
     out_variable = []
-    out_round = []
 
-    # VIF is computed for each variable
-    # -------------------------------------------------------------------------
+    # VIF is computed once only
     if MXVIF == '':
         # Print header of table to std output
         print('{0[0]:{1}s} {0[1]:8s} {0[2]:8s}'.format(
-            ['variable', 'vif', 'sqrtvif'], nlength))
+                ['variable', 'vif', 'sqrtvif'], nlength))
 
         # Compute the VIF
-        for k in xrange(len(IPF)):
-            MAPy = IPF[k]
-            MAPx = IPF[:]
-            del MAPx[k]
-            vifstat = grass.read_command("r.regression.multi",
-                                         flags="g", quiet=True,
-                                         mapx=MAPx, mapy=MAPy)
-            vifstat = vifstat.split('\n')
-            vifstat = [i.split('=') for i in vifstat]
-            if float(vifstat[1][1]) > 0.9999999999:
-                vif = float("inf")
-                sqrtvif = float("inf")
+        for i, e in enumerate(IPFn):
+            # Compute vif using full rasters
+            if flag_f:
+                y = IPF[i]
+                x = IPF[:]
+                del x[i]
+                vifstat = ComputeVif2(x, y)
+            # Compute vif using sample
             else:
-                rsqr = float(vifstat[1][1])
-                vif = 1 / (1 - rsqr)
-                sqrtvif = math.sqrt(vif)
-
-            # Write results to array (this will be printed to file at end
-            # of script)
-            out_vif.append(vif)
-            out_sqrt.append(sqrtvif)
-            out_variable.append(MAPy)
-            print('{0[0]:{1}s} {0[1]:8.2f} {0[2]:8.2f}'.format([MAPy, vif,
-                  sqrtvif], nlength))
+                y = p[:, i]
+                x = np.delete(p, i, axis=1)
+                vifstat = ComputeVif(x, y)
+            # Write stats to arrays
+            out_vif.append(vifstat[0])
+            out_sqrt.append(vifstat[1])
+            out_variable.append(e)
+            print('{0[0]:{1}s} {0[1]:8.2f} {0[2]:8.2f}'.format([IPFn[i],
+                  vifstat[0], vifstat[1]], nlength))
         print
         if len(OPF) > 0:
-            print("Statistics are written to " + OPF + "\n")
+            print("Statistics are written to {}\n".format(OPF))
 
-    # The stepwise variable selection procedure is run
-    # -------------------------------------------------------------------------
+    # The VIF stepwise variable selection procedure
     else:
         rvifmx = MXVIF + 1
         m = 0
         remove_variable = 'none'
         out_removed = []
-        txt_message = "Working"
+        out_round = []
 
         # The VIF will be computed across all variables. Next, the variable
         # with highest vif is removed and the procedure is repeated. This
         # continuous till the maximum vif across the variables > maxvif
+
+        if flag_s:
+            gs.message("Computing statistics ...")
+
         while MXVIF < rvifmx:
             m += 1
             rvif = np.zeros(len(IPF))
 
             # print the header of the output table to the console
             if not flag_s:
-                print
+                print("\n")
                 print("VIF round " + str(m))
                 print("--------------------------------------")
                 print('{0[0]:{1}s} {0[1]:>8s} {0[2]:>8s}'.format(
                     ['variable', 'vif', 'sqrtvif'], nlength))
-            else:
-                txt_message = txt_message + "."
-                grass.message(txt_message)
 
             # Compute the VIF and sqrt(vif) for all variables in this round
-            for k in xrange(len(IPF)):
-                MAPy = IPF[k]
-                MAPx = IPF[:]
-                del MAPx[k]
-                vifstat = grass.read_command("r.regression.multi",
-                                             flags="g", quiet=True,
-                                             mapx=MAPx, mapy=MAPy)
-                vifstat = vifstat.split('\n')
-                vifstat = [i.split('=') for i in vifstat]
-                if float(vifstat[1][1]) > 0.9999999999:
-                    vif = float("inf")
-                    sqrtvif = float("inf")
+            for k, e in enumerate(IPFn):
+                # Compute vif using full rasters
+                if flag_f:
+                    y = IPF[k]
+                    x = IPF[:]
+                    del x[k]
+                    vifstat = ComputeVif2(x, y)
                 else:
-                    rsqr = float(vifstat[1][1])
-                    vif = 1 / (1 - rsqr)
-                    sqrtvif = math.sqrt(vif)
+                    # Compute vif using sample
+                    y = p[:, k]
+                    x = np.delete(p, k, axis=1)
+                    vifstat = ComputeVif(x, y)
 
-                # Write results to arrays (which will be written to csv file
-                # at end of script)
-                out_vif.append(vif)
-                out_sqrt.append(sqrtvif)
-                out_variable.append(MAPy)
+                # Write results to arrays
+                out_vif.append(vifstat[0])
+                out_sqrt.append(vifstat[1])
+                out_variable.append(e)
                 out_round.append(m)
                 out_removed.append(remove_variable)
 
                 # print result to console
                 if not flag_s:
-                    print('{0[0]:{1}s} {0[1]:8.2f} {0[2]:8.2f}'.format([MAPy,
-                          vif, sqrtvif], nlength))
+                    print('{0[0]:{1}s} {0[1]:8.2f} {0[2]:8.2f}'.
+                          format([IPFn[k], vifstat[0], vifstat[1]], nlength))
 
                 # If variable is set to be retained by the user, the VIF
                 # is set to -9999 to ensure it will not have highest VIF
                 if IPFn[k] in IPRn:
                     rvif[k] = -9999
                 else:
-                    rvif[k] = vif
+                    rvif[k] = vifstat[0]
 
             # Compute the maximum vif across the variables for this round and
             # remove the variable with the highest VIF
@@ -249,17 +328,15 @@
                 remove_variable = IPFn[rvifindex]
                 del IPF[rvifindex]
                 del IPFn[rvifindex]
+                if not flag_f:
+                    p = np.delete(p, rvifindex, axis=1)
 
         # Write final selected variables to std output
         if not flag_s:
-            print
+            print("/n")
             print("selected variables are: ")
             print("--------------------------------------")
             print(', '.join(IPFn))
-            print
-            if len(OPF) > 0:
-                print("Statistics are written to " + OPF + "\n")
-            print
         else:
             print(','.join(IPFn))
 
@@ -274,12 +351,16 @@
             else:
                 text_file.write("round,removed,variable,vif,sqrtvif\n")
                 for i in xrange(len(out_vif)):
-                    text_file.write('{0:d},{1:s},{2:s},{3:.6f},{4:.6f}\n'.format(
-                    out_round[i], out_removed[i], out_variable[i],
-                    out_vif[i], out_sqrt[i]))
+                    text_file.write('{0:d},{1:s},{2:s},{3:.6f},{4:.6f}\n'.
+                                    format(out_round[i], out_removed[i],
+                                           out_variable[i], out_vif[i],
+                                           out_sqrt[i]))
         finally:
             text_file.close()
+            gs.message("\n")
+            gs.message("Statistics are written to " + OPF + "\n")
 
+
 if __name__ == "__main__":
-    options, flags = grass.parser()
-    sys.exit(main())
+    atexit.register(cleanup)
+    sys.exit(main(*gs.parser()))



More information about the grass-commit mailing list