[GRASS-user] Cross-validation scripts for v.surf.rst?
hmitaso at unity.ncsu.edu
Tue May 8 21:33:30 EDT 2007
On May 8, 2007, at 6:44 PM, Jonathan Greenberg wrote:
> Just to verify, I should be looking for the lowest mean cross-
> validation difference to choose my tension/smoothing parameter?
not the mean - that should be close to zero (positive and negative
differences should cancel out). If it is not, interpolate the
crossvalidation deviations to a raster map and overlay it with your
input points to see whether you have an isolated point (or points)
with a value well below or above the rest of your data that is causing
the bias (in that area you need more
samples to support that extreme value).
Use the mean of absolute values of the differences as a measure of
predictive interpolation accuracy - so you choose the parameters that
give you the smallest mean of absolute values (you need to use
v.univar.sh) as the most accurate.
There are some papers that explain why, for this particular case, the
mean of absolute values is better than RMSE.
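To illustrate the point above with a toy sketch (plain shell/awk, not GRASS-specific, with made-up deviation values): positive and negative crossvalidation deviations cancel in the plain mean, while the mean of absolute values reflects the actual size of the errors.

```shell
# Toy deviations: the plain mean cancels to zero, hiding the error,
# while the mean of absolute values does not.
printf '%s\n' 2.0 -2.0 1.5 -1.5 | awk '
    { sum += $1; abs += ($1 < 0 ? -$1 : $1); n++ }
    END { printf "mean=%.2f mean_abs=%.2f\n", sum/n, abs/n }'
# prints: mean=0.00 mean_abs=1.75
```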
> On 4/27/07 5:48 PM, "Helena Mitasova" <hmitaso at unity.ncsu.edu> wrote:
>> On Apr 27, 2007, at 1:52 AM, Hamish wrote:
>>> Jonathan Greenberg wrote:
>>>> I was wondering if anyone had written/acquired any cross-validation
>>>> scripts for v.surf.rst to optimize the tension/smoothing parameters
>>>> (they are alluded to in the documentation)?
>>> I was thinking the same. A shell script loop to test many values
>>> shouldn't be hard, although it might take a long time: try all
>>> possibilities, log the v.univar results for each attempt, then
>>> search the result matrix for the
>>> smallest error combo.
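A minimal sketch of such a loop, assuming a running GRASS session; the input map name `elev_points` and the deviation column name `flt1` are placeholders to adapt, and the exact `v.surf.rst -c` / `v.univar -g` behavior should be checked against your GRASS version:

```shell
#!/bin/sh
# Grid search over tension/smoothing: log the mean absolute
# crossvalidation error for each combination, then pick the smallest.
# "elev_points" and column "flt1" are assumptions to adapt.
INPUT=elev_points
LOG=cv_results.txt
: > "$LOG"

for TENSION in 10 20 40 80 160; do
    for SMOOTH in 0.1 0.5 1.0 5.0; do
        CV="cv_t${TENSION}_s$(echo "$SMOOTH" | tr . _)"
        v.surf.rst -c input="$INPUT" cvdev="$CV" \
            tension="$TENSION" smooth="$SMOOTH"
        # v.univar -g prints key=value pairs, including mean_abs
        MAE=$(v.univar -g map="$CV" column=flt1 |
              awk -F= '$1 == "mean_abs" {print $2}')
        echo "$TENSION $SMOOTH $MAE" >> "$LOG"
    done
done

# the tension/smoothing combo with the smallest mean absolute error
sort -g -k3 "$LOG" | head -1
```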
>> Markus pointed out the script a few months ago; I have added links
>> to PDF versions of
>> the relevant papers to the man page
>> (the 2002 and 2005 papers have some answers to the questions below)
>>> * how does changing the region resolution affect the cross-
>>> validation result? could you drop down to a half or quarter of the
>>> target resolution to do the cross-validation tests and find the
>>> optimum, then when back at full res will the best values for those
>>> still be the best?
>> it should not affect it at all - crossvalidation should completely
>> skip any raster computation, and the values are computed only at the
>> input points. Let me know if changing the resolution influences the
>> computation or result in any way.
>>> also how does changing the region res affect computational time? is
>>> most of the time spent computing the splines, or by making the res
>>> coarser are you effectively changing npmin & segmax settings?
>> you are not changing npmin & segmax, but if you use the default dmin -
>> it is set to half the cell size - you would be changing the density
>> of points if you
>> have several points per cell, and that in turn changes dnorm (a
>> normalization factor) that scales the tension. I have just put a hint
>> into the man page on how to keep tension constant if your dnorm
>> changes - I will add more as time allows.
>>> should they be
>>> adjusted in tandem with the resolution?
>> no, just keep dmin constant
>>> is choosing a small (representative) subregion at the original
>>> resolution preferred? how much of an art is there to picking a
>>> representative subregion? could the script first scan the map for a
>>> subregion with similar morphometric indices/fractal depth/stdev/
>>> point density/whatever/ as the overall map, to do the trials on?
>> with crossvalidation, a bigger issue than choosing a representative
>> subregion is having representative input data in the first place -
>> otherwise the parameters found by crossvalidation are not optimal.
>> There is a lot of literature on when crossvalidation works and when
>> it does not.
>>> * are the smoothing and tension variables independent? (roughly):
>>> min(f(smooth)) + min(f(tension)) == min( f(smooth,tension) ) ?
>> no - see the 2005 paper, they are linked (as you lower tension,
>> smoothing effectively increases, preventing potential overshoots).
>>> can you hold one of those terms steady, find the best fit using the
>>> other, then hold the other steady while you vary the first? will the
>>> variables found in that way be the final answer, or if they are
>>> dependent should you use the result of the first set of tests as
>>> to help repeat the experiment and thus spiral towards the center?
>> both approaches should work (see what Jaro used in the 2002 paper)
>>> Are the effects simple/smooth enough that the script could
>>> dynamically adjust step size by the rate of change of the cross-
>>> validation variance to quickly home in on the best parameters?
>> see the 2005 and 2002 papers to see how CV error changes with
>> parameters - it is pretty smooth.
>> Regarding your question on what takes the most computational time:
>> I am somewhat puzzled by how much time the linear equation solver
>> takes -
>> it used to be the computation of the grid that took up most of the
>> time (so crossvalidation was very fast, because in each run you would
>> compute the value at just a single point). Now it is very slow, and
>> segmax & npmin,
>> which control the size of the system of equations, make a huge
>> difference in speed (so if you have dense enough points, use e.g.
>> segmax=30 and
>> npmin=150 rather than the defaults and it will run much faster).
>> At some point for GRASS5* we
>> replaced the function that we had used (a C rewrite of an
>> old Fortran program) with G_ludcmp, which I assumed would be faster,
>> but I am not sure that is the reason for the slowdown. It can
>> also be
>> my imagination, because the data sets are now so much larger and it
>> might have been the same.
> Jonathan A. Greenberg, PhD
> Postdoctoral Scholar
> Center for Spatial Technologies and Remote Sensing (CSTARS)
> University of California, Davis
> One Shields Avenue
> The Barn, Room 250N
> Davis, CA 95616
> Cell: 415-794-5043
> AIM: jgrn307
> MSN: jgrn307 at hotmail.com