[GRASS-user] r.neighbors velocity

Fri Jun 28 23:35:37 PDT 2013

Hi,

here are the same results for Soeren's test program, with the Open64
compiler from AMD:

 - Same AMD X6 CPU as below.
 - Open64 compiler 4.5.2.1 from AMD  (GPLv2, LGPL)

I just downloaded the pre-built RHEL5 binary tarball and it worked
on Debian/squeeze; all I had to do was make an alias to the executable
in the untarred bin/ dir.
 see also http://wiki.open64.net/index.php/Installation_on_Ubuntu
Source is available of course, but according to the Debian ITP ticket
it's a bit of a pain to build there.

straight opencc:

real  0m59.015s | 0m58.972s | 0m58.963s
user  0m58.760s | 0m58.812s | 0m58.624s
sys   0m0.248s  | 0m0.136s  | 0m0.300s
--

opencc -O3:

real    0m35.203s | 0m35.173s | 0m35.204s
user    0m35.206s | 0m35.174s | 0m35.206s
sys     0m0.000s  | 0m0.000s  | 0m0.000s
--

opencc -Ofast (same with or without -march=auto for native machine code):

real  0m13.389s | 0m13.402s | 0m13.435s
user  0m13.389s | 0m13.405s | 0m13.437s
sys   0m0.000s  | 0m0.000s  | 0m0.000s
--

opencc -Ofast -march=auto -apo on a 6-(real)-core CPU
v is 2.09131e+13

real  0m2.552s  | 0m2.595s  | 0m2.591s
user  0m14.857s | 0m14.725s | 0m14.725s
sys   0m0.008s  | 0m0.024s  | 0m0.016s

'-apo' is autoparallelization; it's poorly documented, but it works!
It adds OpenMP pragmas where it thinks it can and where doing so
will give a gain. I'm glad to see it's not just for the Fortran
compiler anymore.

So the Open64 compiler is not quite as fast as Intel's for this
test case, but it's pretty close, with the more versatile gcc trailing
far behind. Executable file size for all of the above was less than
12kb, since it can link to local OS shared libs.

I haven't tried it with llvm/clang.

Now I wonder which flags to use to recreate -Ofast in gcc, to make it
a fairer comparison.

Hamish

> I also ran it on an AMD Phenom II X6 1090T  (icc -xHost --> -xSSSE3 ?)
> All times "real"; all output was "v is 2.09131e+13".
> 
> gcc 4.4.5 with standard-opts: 7kb binary
>  == near parity single-threaded performance with the new i7 chip from
>     the 2 year old AMD Phenom and older copy of gcc! (stock debian/squeeze)
>   1m16.175s | 1m15.634s | 1m16.029s
> 
> icc 12.1 with standard-opts:
>   0m32.975s | 0m33.079s | 0m33.249s
> 
> icc with "-fast" opt: (700kb binary)
>   0m9.577s | 0m9.572s | 0m9.583s
> 
> icc with -parallel auto-MP: (31kb binary)
>  == again near parity with the new i7 chip! even with the Intel-biased
>     compiler.  "user" cpu-time was actually less. the advantage of 6 real
>     cores vs 4 real+4virtual ones.*
>   0m6.406s  | 0m6.404s  | 0m6.404s
>   0m37.106s | 0m37.170s | 0m37.106s
>   0m0.044s  | 0m0.040s  | 0m0.028s
> 
> icc with -fast and -parallel: (2mb binary)
>   0m2.002s  | 0m2.002s  | 0m2.002s
>   0m10.765s | 0m10.769s | 0m10.769s
>   0m0.016s  | 0m0.012s  | 0m0.008s