[GRASS-user] Cross-validation scripts for v.surf.rst?

Jonathan Greenberg greenberg at ucdavis.edu
Tue May 1 13:42:19 EDT 2007


Helena:

I am getting the following warning when running a modification of the script
Markus sent out:

WARNING: taking too long to find points for interpolation--please change
         the region to area where your points are. Continuing
         calculations...

I should note the program IS running, and does complete; it just takes a
while...
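
(One guess, untested: the warning may just mean the current region is much
larger than the extent of the points, so setting the region to the input
vector first (INMAP in the script below), e.g.

    g.region vect=$INMAP -p

with GRASS 6 syntax, might quiet it and speed things up a bit.)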

Here's my mod of the script Markus sent me:

***

#!/bin/sh
#
# Written by Jaro Hofierka
#
# Modified by Jonathan Greenberg (greenberg at ucdavis.edu) 5/1/07
# to work with v.surf.rst
#
# this is a script for a cross-validation analysis of RST parameters
# OUTPUT: CSV table
#

# j - smoothing: the digits after the decimal point, e.g. smooth=0.10 is
#     defined as j=10 (j is set in the loop below)
# i - tension

INMAP=delta_soundings_070406_frankstract
ZCOL=v4

OUTFILESTATS=/tmp/data_cv.csv

######### nothing to change below

rm -f $OUTFILESTATS
echo "tension;smoothing;mean;population_stddev" > $OUTFILESTATS

j=10
while  [ $j -le 90 ]
do

  i=10
  while  [ $i -le 150 ]
  do
  
  TNS=$i
  SMTH=$j
  TNSSMTH="t${i}s0${j}"
  echo "Computing tension/smoothing $TNSSMTH..."

   #interpolate sites CV differences:
   v.surf.rst -c input=$INMAP cvdev=data_cv_$TNSSMTH tension="$i" \
      smooth=0."$j" --o zcolumn=$ZCOL

   #calculate univariate statistics for sites: 
   eval `v.univar -g data_cv_$TNSSMTH col=flt1 type=point | grep 'mean\|population_stddev'`
   echo "$TNS;$SMTH;$mean;$population_stddev" >> $OUTFILESTATS
   i=`expr $i + 10`
 done

 j=`expr $j + 10`
done
echo "Finished. Results written to $OUTFILESTATS"


***
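
To pull out the best tension/smoothing combination afterwards, a numeric
sort on the stddev column should do (a sketch, using the column order of
the header written above):

   # lowest population_stddev first; skip the CSV header line
   tail -n +2 /tmp/data_cv.csv | sort -t';' -k4 -n | head -1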



-----Original Message-----
From: Helena Mitasova [mailto:hmitaso at unity.ncsu.edu] 
Sent: Friday, April 27, 2007 5:49 PM
To: Hamish
Cc: Jonathan Greenberg; grassuser at grass.itc.it
Subject: Re: [GRASS-user] Cross-validation scripts for v.surf.rst?


On Apr 27, 2007, at 1:52 AM, Hamish wrote:

> Jonathan Greenberg wrote:
>>
>> I was wondering if anyone had written/acquired any cross-validation
>> scripts for v.surf.rst to optimize the tension/smoothing parameters
>> (they are alluded to in the documentation)?
>
>
> I was thinking the same. A shell script loop to test many values isn't
> hard, although it might take a long time to try all possibilities, log
> v.univar results for each attempt, then search the result matrix for the
> smallest error combo.

Markus pointed out the script. A few months ago I added links to PDF
versions of the relevant papers to the man page
http://grass.itc.it/grass63/manuals/html63_user/v.surf.rst.html
(the 2002 and 2005 papers have some answers to the questions below).

>
>
> Questions:
>
> * how does changing the region resolution affect the cross-validation
> result? could you drop down to a half or quarter of the target raster
> resolution to do the cross-validation tests and find the optimum value,
> then when back at full res will the best values for those still be the
> same?

it should not affect it at all - cross-validation should completely skip
any raster computation, and the values are computed only at the skipped
points. Let me know if changing the resolution influences the computation
or result in any way.
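For example (a sketch - the resolutions and map names are placeholders),
something like

   g.region vect=$INMAP res=10 -a
   v.surf.rst -c input=$INMAP zcolumn=$ZCOL cvdev=cv_r10 tension=40 smooth=0.1 --o
   v.univar -g cv_r10 col=flt1 type=point

repeated with res=40 and cvdev=cv_r40 should report identical statistics,
as long as dmin is kept fixed (see below).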
>
> also how does changing the region res affect computational time? is most
> of the time spent computing the splines, or by making the res coarser
> are you effectively changing npmin & segmax settings?

you are not changing npmin & segmax, but if you use the default dmin -
it is set to half the cell size - you would be changing the density of
points if you have several points per cell, and that in turn changes dnorm
(the normalization factor) that scales the tension. I have just put a hint
into the book on how to keep the tension constant if your dnorm changes -
I will add it to the manual as time allows.

> should they be
> adjusted in tandem with the resolution?

no, just keep dmin constant
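e.g. (a sketch - dmin=5 is just a placeholder, pick a value suited to your
point spacing and keep it the same for every run):

   v.surf.rst -c input=$INMAP zcolumn=$ZCOL cvdev=data_cv tension=40 \
      smooth=0.1 dmin=5 --o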

>
> is choosing a small (representative) subregion at the original
> resolution preferred? how much of an art is there to picking a
> representative subregion? could the script first scan the map for a
> subregion with similar morphometric indices/fractal depth/stdev/
> point density/whatever/ as the overall map, to do the trials on?

with cross-validation, a bigger issue than choosing a representative
subregion is having representative input data in the first place -
otherwise the parameters found by cross-validation are not optimal.
There is a lot of literature on when cross-validation works and when it
does not.
>
>
> * are the smoothing and tension variables independent? (roughly):
> min(f(smooth)) + min(f(tension)) == min( f(smooth,tension) )  ?

no - see the 2005 paper, they are linked (as you lower tension,
smoothing effectively increases, preventing potential overshoots).

> can you hold one of those terms steady, find the best fit using the
> other, then hold the other steady while you vary the first? will the
> variables found in that way be the final answer, or if they are somewhat
> dependent should you use the result of the first set of tests as hinting
> to help repeat the experiment and thus spiral towards the center?

both approaches should work (see what Jaro used in the 2002 paper)
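A rough sketch of the one-parameter-at-a-time variant, reusing the loop
from Jonathan's script (values are placeholders):

   # pass 1: hold smoothing fixed, scan tension
   SMTH=10
   for TNS in 10 30 50 70 90 110 130 150; do
      v.surf.rst -c input=$INMAP zcolumn=$ZCOL cvdev=cv_t$TNS \
         tension=$TNS smooth=0.$SMTH --o
   done
   # pass 2: hold tension at the best value from pass 1, scan smoothing,
   # and repeat the two passes if the optimum keeps moving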
>
> Are the effects simple/smooth enough that the script could be "smart"
> and dynamically adjust step size by rate of change of the
> cross-validation variance to quickly home in on the best parameters?

see the 2005 and 2002 papers for how the CV error changes with the
parameters - it is pretty smooth.
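So a coarse pass with a large step, refined around the best value with a
smaller step, should be safe, e.g. (a sketch, with BEST holding the winning
tension from the coarse pass and smoothing fixed at a placeholder 0.1):

   STEP=2
   TNS=`expr $BEST - 10`
   while [ $TNS -le `expr $BEST + 10` ]; do
      v.surf.rst -c input=$INMAP zcolumn=$ZCOL cvdev=cv_t$TNS \
         tension=$TNS smooth=0.1 --o
      TNS=`expr $TNS + $STEP`
   done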

Regarding your question on what takes most of the computational time:
I am somewhat puzzled by how much time the linear equation solver takes -
it used to be the computation of the grid that took up most of the time
(so cross-validation was very fast, because in each run you compute the
value at just a single point). Now it is very slow, and segmax & npmin,
which control the size of the system of equations, make a huge difference
in speed (so if you have dense enough points, use e.g. segmax=30 and
npmin=150 rather than the defaults and it will run much faster).
At some point for GRASS 5.x we replaced the function we had been using
(a C rewrite of an old Fortran program) with G_ludcmp, which I assumed
would be faster, but I am not sure that is the reason for the slowdown.
It may also be my imagination, because the data sets are now so much
larger and it might have been the same.
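e.g. (a sketch - the tension/smoothing values are just placeholders):

   v.surf.rst -c input=$INMAP zcolumn=$ZCOL cvdev=data_cv tension=40 \
      smooth=0.1 segmax=30 npmin=150 --o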

Helena


>
>
> Hamish




