[GRASS-user] r.neighbors velocity

Fri Jun 28 20:28:12 PDT 2013

Ivan wrote:
>> the region is 4312*5576
>> the moving window 501
>> GRASS is the stable version on a machine with 8 core and 32 gb RAM.
>> Ubuntu 12.04
>>
>> it seems that the proprietary software is able to perform the analysis
>> in 2/3 seconds

I expect he's probably correct in that statement, but the difference
comes from the *compiler* used, not from the code behind it, and GRASS
compiled in the same way would be just as fast.

Sören wrote:
> this sounds very interesting.
>
> Your map has a size of 4312*5576 pixel? That's about 100MB in case of
> a type integer or type float map or about 200MB in case of a type
> double map. You must have a very fast HD or SSD to read and write such
> a map in under 2/3 seconds?

500 MB/s of I/O for an SSD is not unusual, and 300 MB/s for a
spinning-platter RAID is pretty common. It's good to run a few replicates
of the benchmark so that from the 2nd run onward the data is already
cached in RAM (as long as the region is not too big to fit there).

> In case your moving window has a size of 501 pixel (not 501x501 pixel!),
> the amount of operations that must be performed is at least 4312*5576*501.
> That's about 12 billion ops. Amazing to do this in 2/3 seconds.
> I have written a little program to see how my Intel core i5 performs
> processing this amount of operations. Well it needs about 100 seconds.

I was able to get the same down to just over 1 second wall-time on a
plain consumer desktop chip. (!)

> Here the code, compiled with optimization:
>
>#include <stdio.h>
>
>
>int main()
>{
>        unsigned int i, j, k;
>        register double v = 0.0;
>
>
>        for(i = 0; i < 4321; i++) {
>                for(j = 0; j < 5576; j++) {
>                        for(k = 0; k < 501; k++) {
>                                v = v + (double)(i + j + k)/3.0;
>                        }
>                }
>        }
>        printf("v is %g\n", v);
>}
>
> soeren@vostro:~/src$ gcc -O3 numtest.c -o numtest
> soeren@vostro:~/src$ time ./numtest
> v is 2.09131e+13
>
> real    1m49.292s
> user    1m49.223s
> sys     0m0.000s
>
> Your proprietary software must run highly parallel using a fast
> GPU or an ASIC to keep the processing time under 2/3 seconds?
>
> Unfortunately r.neighbors is not able to compete with such a
> powerful software,

sure it is! :)

> since it is not reading the entire map into RAM and does not run
> on GPU's or ASIC's. But r.neighbors is able to process maps that
> are to large to fit into the RAM. :)
>
> Can you please tell us what software is so incredible fast?

I ran some quick trials with your sample program with both gcc 4.6
(ubuntu 12.04) and Intel's icc 12.1 on the same computer.

Diplomatically speaking, I see gcc 4.8 has just arrived in Debian/sid,
and I look forward to exploring how its new auto-vectorization features
are coming along.

The results, however, are not so diplomatic and speak for themselves:
for this simple test case* it isn't pretty.
(* one that is atypically easy for the compiler to optimize)

test system: i7 3770, lots of RAM
replicates are presented side by side, one per column.

standard gcc, with & without -O3 and -march=native: (all ~same)
real  1m14.507s | 1m14.559s | 1m14.513s | 1m14.514s
user  1m14.289s | 1m14.305s | 1m14.297s | 1m14.297s
sys   0m0.000s  | 0m0.028s  | 0m0.000s  | 0m0.000s
--

standard Intel icc with & without -O3:
v is 2.09131e+13

real  0m21.979s | 0m21.967s | 0m21.958s | 0m21.994s
user  0m21.909s | 0m21.901s | 0m21.897s | 0m21.929s
sys   0m0.000s  | 0m0.000s  | 0m0.000s  | 0m0.000s
--

icc with the "-fast" compiler switch:
$ icc -fast soeren_speed_test.c -o soeren_speed_test_icc_fast
   # note: 900kb for the executable vs. gcc's 8kb.
$ time ./soeren_speed_test_icc_fast
v is 2.09131e+13

real  0m3.273s | 0m3.274s | 0m3.275s
user  0m3.260s | 0m3.260s | 0m3.264s
sys   0m0.000s | 0m0.000s | 0m0.000s

(there's your 3 seconds)
--

icc -funroll-loops:
real  0m22.008s | 0m21.998s
user  0m21.941s | 0m21.929s
sys   0m0.000s  | 0m0.000s

(no extra gain in this case)
--

icc -parallel:  (running on 8 hyperthreaded (i.e. 4 real) cores)
   # binary size: 30kb
real  0m6.034s  | 0m6.005s  | 0m6.005s
user  0m46.531s | 0m46.603s | 0m46.519s
sys   0m0.024s  | 0m0.028s  | 0m0.044s
--

icc -parallel -fast:
   # binary size 2.2 megabytes
$ time ./soeren_speed_test_icc_parallel+fast 
v is 2.09131e+13

real  0m1.002s | 0m1.002s | 0m1.002s
user  0m6.768s | 0m6.796s | 0m6.780s
sys   0m0.004s | 0m0.004s | 0m0.008s

I tried a number of times but couldn't break the 1 second barrier. :)

-----
I also ran it on an AMD Phenom II X6 1090T  (icc -xHost --> -xSSSE3 ?)
All times "real"; all output was "v is 2.09131e+13".

gcc 4.4.5 with standard-opts: 7kb binary
 == near parity in single-threaded performance between the new i7 chip and
    the 2-year-old AMD Phenom with an older copy of gcc! (stock debian/squeeze)
  1m16.175s | 1m15.634s | 1m16.029s

icc 12.1 with standard-opts:
  0m32.975s | 0m33.079s | 0m33.249s

icc with "-fast" opt: (700kb binary)
  0m9.577s | 0m9.572s | 0m9.583s

icc with -parallel auto-MP: (31kb binary)
 == again near parity with the new i7 chip, even with the Intel-biased
    compiler! "user" CPU time was actually lower: the advantage of 6 real
    cores vs. 4 real + 4 virtual ones.*
  real  0m6.406s  | 0m6.404s  | 0m6.404s
  user  0m37.106s | 0m37.170s | 0m37.106s
  sys   0m0.044s  | 0m0.040s  | 0m0.028s

icc with -fast and -parallel: (2mb binary)
  real  0m2.002s  | 0m2.002s  | 0m2.002s
  user  0m10.765s | 0m10.769s | 0m10.769s
  sys   0m0.016s  | 0m0.012s  | 0m0.008s

(* I know from some earlier tests that hyperthreading carries a real
computational overhead. On a 12 real + 12 virtual core Xeon, using about
11 real cores took the same wall-clock time as 12 real + 5 virtual cores;
it only beat the 12 real cores once I got up to about 19 total cores, and
the full 24 cores gained just a few percentage points more, with very
much diminishing returns.)

regards,
Hamish

ps- we have to pay for icc/ifort academic (research) licenses now, but
the student (homework/classroom) license for linux is still gratis
if you dig around their dev website. Also AMD has their Open64
compiler to play with http://developer.amd.com/tools/open64/Pages/