[GRASS-dev] Some comments on the v.surf.icw add-on

Hamish hamish_b at yahoo.com
Wed Jan 6 15:09:50 EST 2010


Ben wrote:
> thanks for your in-depth comments. They are proving very
> helpful. I am interpolating artifact types diversity for a
> study on prehistoric cultural patterns in Eurasia.

hmmm, interesting. I've actually used a simpler r.cost approach
to look at a species diversity index on two sides of an island.
For that we were not interpolating, just adding the true distance
between each pair of points (a full distance matrix) as another
variable in the multivariate generalized linear model. I hadn't
thought to interpolate it, but maybe there is something in that.


how many starting points?
what region size (rows x cols)?
how long does that take for you? (+cpu/mem specs)
I'd love to see a screenshot too just to see what it looks like.


> Obviously, an interpolation method that includes the cost of
> traveling through a topographically varied landscape will
> yield much better results in this scenario (and my tests have
> shown that it really does). So v.surf.icw has done a really
> nice job for me

fyi there was a paper in Ecology (??) in 2008 which used a
similar (but different) method to look at animal migration
with topographic constraints / past bottlenecks. software was
online IIRC.


> (btw. the standard IDW formula with power=4
> gave best results).

i.e. it biases towards the influence of the more local sites.
On what basis do you judge that to be the best result? i.e. how
do you justify it in a journal article beyond personal preference?
(As I said yesterday, I'm not sure there is always a good answer;
most folks use kriging or IDW because everyone else does and
those are the software tools available, but that doesn't
quantifiably prove anything beyond inertia & mob rule.)
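
just to spell out what the power does, the standard IDW estimate
at a cell x is

    z(x) = sum_i( z_i / d_i^n ) / sum_i( 1 / d_i^n )

so with n=4 the nearest sites' 1/d^4 terms dominate both sums and
distant sites contribute next to nothing.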

> So nice in fact, that I am contemplating about whether/how
> to integrate cost distance support into the v.surf.idw C
> module.

I'm not sure it would help much.

> The things I miss mostly in the current, scripted version
> are:
> 1. Floating point IDW exponents.

that's easy, and IIRC I've already done it locally in the past.
The d^n is calculated with pow(d,n) instead of d*d*d for exactly
that reason, so setting n to a floating point value is no problem.
The only thing needed, I think, would be to turn the $DIVISOR
sanity adjustment into a formula instead of hardcoded values.

It could be that it already works now; just beware of skipping
the $DIVISOR anti-saturation step if n gets very big. Hmm, maybe
I should work in log() space to solve that instead...
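
something like this is the sort of thing I mean -- a rough sketch
only; the map names (cost.$NUM, weight.$NUM) and the DIVISOR
formula are made up for illustration, not the script's actual code:

    # hypothetical sketch: FP exponent plus a DIVISOR derived from it
    EXP=3.5
    # example formula only: scale costs before pow() so that
    # 1/(cost/DIVISOR)^EXP doesn't saturate for very large EXP
    DIVISOR=`echo "$EXP" | awk '{printf("%f\n", 10^(4/$1))}'`
    r.mapcalc "weight.$NUM = 1.0 / pow(cost.$NUM / $DIVISOR, $EXP)"

the log() idea would instead be to work with -EXP * log(cost) per
cell and subtract the per-cell maximum before exponentiating (a
log-sum-exp style trick), which would avoid the divisor fudge
altogether.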


> 2. Specification of number of points to use in interpolation.

hmm, I thought I had that at some point, but for my work I just
use all points so I guess I dropped it. Doable I guess, but you
still have to calculate all the cost maps to know which n points
are the closest (if your cost map is constant you could do a
first-pass euclidean min-distance step to cut them down, but
that doesn't help if the cost surface is variable).
I guess you could add a max_cost= to the r.cost call, then drop
that point, like the out-of-region points are dropped, if its
cost comes back as NULL. Hmm... that would be doable I think,
and it would help short-circuit some of the long r.cost runs
(the savings from fewer r.mapcalc calls are probably less
important than aborting r.cost early).
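
roughly what I'm picturing, as an untested sketch with invented
variable names (and assuming r.cost honours a max_cost= option):

    # hypothetical sketch: cap the cost spread and zero-weight NULL cells
    MAXCOST=5000
    r.cost input=$COSTMAP output=cost.$NUM coordinate=$EASTING,$NORTHING max_cost=$MAXCOST
    # cells beyond MAXCOST come back NULL; give the point zero weight there
    r.mapcalc "weight.$NUM = if(isnull(cost.$NUM), 0.0, 1.0 / pow(cost.$NUM / $DIVISOR, $EXP))"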


which leads to...

> 3. Speed.
>
> I think these could be most easily addressed in a C program.

It could help a bit, but in the end the expensive part is r.cost,
and having a C wrapper around that won't speed it up any. Also,
the partial maps are so big that they won't fit in memory and so
need to be written to disk, which also really slows things down.
And a C module doing what r.mapcalc does will not be appreciably
faster than calling r.mapcalc as a separate process. Glynn's new
multi-threaded r.mapcalc in GRASS 7 might help some. Maybe I
could work in the poor-man's multithreading trick I use with
r.sun to speed things up, but perhaps that is best left to the
user or to libraries instead of the middleware.
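
by poor-man's multithreading I just mean backgrounding jobs from
the shell, something along these lines (sketch only; MAX_JOBS and
the map/variable names are made up):

    # hypothetical sketch: run a few r.cost jobs at once, in batches
    MAX_JOBS=4
    NUM=0
    for COORD in $COORD_LIST ; do
       NUM=`expr $NUM + 1`
       r.cost input=$COSTMAP output=cost.$NUM coordinate=$COORD &
       # wait for the current batch to finish before launching more
       if [ `expr $NUM % $MAX_JOBS` -eq 0 ] ; then
          wait
       fi
    done
    wait   # catch any jobs left over from the final batch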


So the bulk of the time is not spent doing bash things (opening
and closing processes); it is spent in C code already. FWIW I do
plan on porting it to Python at some point, which will make it
cleaner, but I don't expect it to be any faster.

Things with lots of loops containing program calls are really
where C has the big advantage over bash. ISTR that bash (i.e.
Linux on my old PC) will let me open and close about 25 programs
a second at most, whereas in C, calls to library functions are
much, much quicker.
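
that one is easy enough to measure on your own box, e.g.:

    # rough timing of how fast the shell can launch external programs;
    # 100 divided by the elapsed seconds gives launches per second
    time ( for i in `seq 1 100` ; do /bin/true ; done )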


The greatest speed enhancement has been the new r.cost, which is
maybe >50 times faster than when I wrote the original script, so
that opens a lot of new doors. I was spending about 24 hours per
run before... (as long as it's done by the time I come into work
the next day, or by Monday morning...)


> However, 1 and 2 should also be possible in the scripted
> version?

#1 anyway, maybe #2.

> For floating point support, an external tool like "bc"
> would be required:
>
> http://www.linuxjournal.com/content/floating-point-math-bash

I prefer to use awk, as it is typically already installed,
whereas bc or dc requires an additional dependency. But r.mapcalc
does the bulk of the math already, so floating point isn't needed
much in the script.
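
e.g. the kind of one-liner awk is handy for in a shell script
(illustrative only, not from the script itself):

    # floating point arithmetic via awk rather than bc
    EXP=3.5
    INV_EXP=`echo "$EXP" | awk '{print 1.0 / $1}'`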


> One more thing: the program warns the user if the number of
> input points is high and waits for a key press to confirm.

(with a 30 sec timeout before going on anyway)
I put that delay there because the module is so noisy that it
was easy to miss the warning message at the beginning.
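
it's just the shell's read timeout, roughly like this (a sketch;
the actual wording in the script may differ):

    # warn, then pause for up to 30 seconds before carrying on regardless
    echo "WARNING: many input points; this may take a very long time." 1>&2
    read -t 30 -p "Press <enter> to continue or Ctrl-C to cancel ... "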

> That's not a nice thing if you have it running in a shell
> loop to create some output over night...

At the time I wrote it, 50 points on a 1500x1500 region would
take about 24 hours due to the old, slow r.cost, so 30 seconds
was a small percentage of that. Now that r.cost is faster, that
threshold should be increased from 50 to something bigger, but
to what? I'd say give the warning if the run will take longer
than 4-12 hours.


regards,
Hamish

