[GRASS-user] Cross-validation scripts for v.surf.rst?

Tue May 8 18:44:58 EDT 2007

Helena:

    Just to verify: should I be looking for the lowest mean cross-validation
difference when choosing my tension/smoothing parameters?
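
A small sketch of why the choice of summary statistic matters when ranking runs. The deviation values below are made up for illustration, and `cv_summary` is a hypothetical helper, not part of GRASS. The signed mean can sit near zero when over- and under-predictions cancel, so the mean absolute difference or RMSE may be a safer thing to minimize:

```python
import math

def cv_summary(differences):
    """Summarize cross-validation differences (predicted minus observed)."""
    n = len(differences)
    mean = sum(differences) / n                            # signed: errors can cancel
    mean_abs = sum(abs(d) for d in differences) / n        # magnitude only
    rmse = math.sqrt(sum(d * d for d in differences) / n)  # penalizes large misses
    return {"mean": mean, "mean_abs": mean_abs, "rmse": rmse}

# Made-up deviations for two hypothetical parameter settings.
run_a = [-2.0, 2.0, -2.0, 2.0]   # large errors that cancel out
run_b = [0.5, 0.5, 0.5, 0.5]     # small but consistently biased errors

print(cv_summary(run_a))  # mean 0.0, mean_abs 2.0, rmse 2.0
print(cv_summary(run_b))  # mean 0.5, mean_abs 0.5, rmse 0.5
```

Ranked by signed mean, run_a looks better; ranked by mean absolute difference or RMSE, run_b does.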

--j

On 4/27/07 5:48 PM, "Helena Mitasova" <hmitaso at unity.ncsu.edu> wrote:

> 
> On Apr 27, 2007, at 1:52 AM, Hamish wrote:
> 
>> Jonathan Greenberg wrote:
>>> 
>>> I was wondering if anyone had written/acquired any cross-validation
>>> scripts for v.surf.rst to optimize the tension/smoothing parameters
>>> (they are alluded to in the documentation)?
>> 
>> 
>> I was thinking the same. A shell script loop to test many values isn't
>> hard, although it might take a long time to try all possibilities, log
>> v.univar results for each attempt, then search the result matrix for the
>> smallest error combo.
> 
> Markus pointed out the script; a few months ago I added links to
> PDF versions of the relevant papers to the man page
> http://grass.itc.it/grass63/manuals/html63_user/v.surf.rst.html
> (the 2002 and 2005 papers have some answers to the questions below)
> 
>> 
>> 
>> Questions:
>> 
>> * how does changing the region resolution affect the cross-validation
>> result? could you drop down to a half or quarter of the target raster
>> resolution to do the cross-validation tests and find the optimum value,
>> then when back at full res will the best values for those still be the
>> same?
> 
> it should not affect it at all - crossvalidation should completely skip
> any raster computation and the values are computed only at the skipped
> points. Let me know if changing the resolution influences the computation
> or result in any way.
>> 
>> also how does changing the region res affect computational time? is most
>> of the time spent computing the splines, or by making the res coarser
>> are you effectively changing npmin & segmax settings?
> 
> you are not changing npmin & segmax, but if you use the default dmin -
> it is set to half the cell size - you would be changing the density of
> points if you have several points per cell, and that in turn changes dnorm
> (normalization factor), which scales the tension - I have just put a hint
> into the book on how to keep tension constant if your dnorm changes - I
> will add it to the manual as time allows.
> 
>> should they be
>> adjusted in tandem with the resolution?
> 
> no, just keep dmin constant
> 
>> 
>> is choosing a small (representative) subregion at the original
>> resolution preferred? how much of an art is there to picking a
>> representative subregion? could the script first scan the map for a
>> subregion with similar morphometric indices/fractal depth/stdev/
>> point density/whatever/ as the overall map, to do the trials on?
> 
> with crossvalidation, a bigger issue than choosing a representative
> subregion is to have representative input data in the first place -
> otherwise the parameters found by crossvalidation are not optimal - there
> is a lot of literature on when crossvalidation works and when it does not.
>> 
>> 
>> * are the smoothing and tension variables independent? (roughly):
>> min(f(smooth)) + min(f(tension)) == min( f(smooth,tension) )  ?
> 
> no - see the 2005 paper, they are linked (as you lower tension,
> smoothing effectively increases, preventing potential overshoots).
> 
>> can you hold one of those terms steady, find the best fit using the
>> other, then hold the other steady while you vary the first? will the
>> variables found in that way be the final answer, or if they are somewhat
>> dependent should you use the result of the first set of tests as hinting
>> to help repeat the experiment and thus spiral towards the center?
> 
> both approaches should work (see what Jaro used in the 2002 paper)
>> 
>> Are the effects simple/smooth enough that the script could be "smart"
>> and dynamically adjust step size by rate of change of the
>> cross-validation variance to quickly home in on the best parameters?
> 
> see the 2005 and 2002 papers to see how CV error changes with
> parameters - it is pretty smooth.
> 
> Regarding your question on what takes most computational time:
> I am somewhat puzzled by how much time the linear equation solver takes -
> it used to be the computation of the grid that took up most time
> (so crossvalidation was very fast because in each run you would compute
> the value at just a single point). Now it is very slow, and segmax & npmin,
> which control the size of the system of equations, make a huge difference
> in speed (so if you have dense enough points use e.g. segmax=30 and
> npmin=150 rather than the defaults and it will run much faster).
>   At some point for GRASS5* we replaced the function that we had used
> (a C rewrite of some old Fortran program) with G_ludcmp, which I assumed
> would be faster, but I am not sure that is the reason for the slowdown.
> It may also be my imagination: the data sets are now so much larger, and
> it might have been just as slow before.
> 
> Helena
> 
> 
>> 
>> 
>> Hamish
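
To make the two search strategies discussed above concrete (try every tension/smoothing combination, or hold one parameter steady while varying the other and repeat), here is a minimal Python sketch. It does not call GRASS: `synthetic_cv_error` is a made-up stand-in with a deliberately coupled minimum, and the candidate values are arbitrary. In a real run that function would presumably wrap a v.surf.rst cross-validation run plus a v.univar summary of the deviations; that wrapper is an assumption and is not shown.

```python
import math
from itertools import product

def synthetic_cv_error(tension, smooth):
    """Made-up, smooth error surface with coupled parameters (NOT real data)."""
    u = math.log2(tension / 40.0)   # zero at tension=40
    v = (smooth - 0.3) * 10.0       # zero at smooth=0.3
    return 1.0 + u * u + v * v + 0.6 * u * v   # 0.6*u*v couples the two

def grid_search(cv_error, tensions, smooths):
    """Evaluate every combination and keep the one with the smallest error."""
    return min(product(tensions, smooths), key=lambda p: cv_error(*p))

def alternating_search(cv_error, tensions, smooths, max_rounds=20):
    """Hold one parameter fixed, optimize the other, and repeat until stable."""
    t, s = tensions[0], smooths[0]
    for _ in range(max_rounds):
        new_t = min(tensions, key=lambda x: cv_error(x, s))      # vary tension
        new_s = min(smooths, key=lambda y: cv_error(new_t, y))   # vary smoothing
        if (new_t, new_s) == (t, s):
            break                                                # no change: done
        t, s = new_t, new_s
    return t, s

# Arbitrary candidate values, chosen only for illustration.
TENSIONS = [10, 20, 40, 80, 160]
SMOOTHS = [0.1, 0.2, 0.3, 0.4, 0.5]

best_grid = grid_search(synthetic_cv_error, TENSIONS, SMOOTHS)
best_alt = alternating_search(synthetic_cv_error, TENSIONS, SMOOTHS)
print("grid search:       ", best_grid)
print("alternating search:", best_alt)
```

On this synthetic surface both strategies land on the same pair. With real data the alternating search is not guaranteed to match the full grid, so it may be worth spot-checking neighbouring combinations around whatever it returns.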

-- 
Jonathan A. Greenberg, PhD
Postdoctoral Scholar
Center for Spatial Technologies and Remote Sensing (CSTARS)
University of California, Davis
One Shields Avenue
The Barn, Room 250N
Davis, CA 95616
Cell: 415-794-5043
AIM: jgrn307
MSN: jgrn307 at hotmail.com