<div dir="ltr">Hi,<div>I have implemented a "real" average neighborhood algorithm that runs in parallel using OpenMP. The source code and the benchmark shell script are attached.</div><div><br></div><div>The neighbor program computes a moving-window average with a window of arbitrary size. The size of the map (rows x cols) and the size of the moving window (an odd number, cols == rows) can be specified on the command line.</div>
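<div><br></div><div>For readers without the attachment, here is a minimal sketch of a moving-window average parallelized with OpenMP. It is not the attached main.c: the function name window_average, the row-major double array layout and the clipping of the window at the map border are all assumptions made for illustration.</div><div><br></div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the attached main.c.
 * Writes to out[] the mean of all cells of in[] that fall inside a
 * size x size window centred on each cell. Both arrays are row-major
 * with rows * cols elements. The window is clipped at the map border,
 * so border cells average over fewer neighbours. */
void window_average(const double *in, double *out,
                    int rows, int cols, int size)
{
    int half = size / 2;

    assert(size > 0 && size % 2 == 1); /* window size must be odd */

    /* Each output row depends only on the read-only input, so rows can
     * be split across threads. Without -fopenmp the pragma is ignored
     * and the loop runs serially. */
#pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            double sum = 0.0;
            int count = 0;

            for (int dr = -half; dr <= half; dr++) {
                for (int dc = -half; dc <= half; dc++) {
                    int rr = r + dr;
                    int cc = c + dc;

                    /* skip window cells outside the map */
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols)
                        continue;
                    sum += in[(size_t)rr * cols + cc];
                    count++;
                }
            }
            /* count >= 1 because the centre cell is always inside */
            out[(size_t)r * cols + c] = sum / count;
        }
    }
}
```

<div>When compiled with gcc -fopenmp, the outer loop over rows is distributed across the threads requested via OMP_NUM_THREADS. Each output cell is written by exactly one thread, so no locking is needed.</div>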

<div><br></div><div>./neighbor rows cols mw_size</div><div><br></div><div>IMHO the new program is better suited for comparing compilers and for measuring the performance of neighborhood operations.</div><div><br></div><div>This is the benchmark on my 5-year-old 4-core AMD Phenom computer using 1, 2 and 4 threads:</div>

<div><br></div><div><div><div>gcc -Wall -fopenmp -lgomp -Ofast main.c -o neighbor</div><div>export OMP_NUM_THREADS=1</div><div>time ./neighbor 5000 5000 23</div></div><div>real<span class="" style="white-space:pre"> </span>0m37.211s</div>

<div>user<span class="" style="white-space:pre">        </span>0m36.998s</div><div>sys<span class="" style="white-space:pre">       </span>0m0.196s</div><div><br></div><div><div>export OMP_NUM_THREADS=2</div><div>time ./neighbor 5000 5000 23</div>

</div><div>real<span class="" style="white-space:pre">    </span>0m19.907s</div><div>user<span class="" style="white-space:pre">      </span>0m38.890s</div><div>sys<span class="" style="white-space:pre">       </span>0m0.248s</div><div>

<br></div><div><div>export OMP_NUM_THREADS=4</div><div>time ./neighbor 5000 5000 23</div></div><div>real<span class="" style="white-space:pre"> </span>0m10.170s</div><div>user<span class="" style="white-space:pre">      </span>0m38.466s</div>

<div>sys<span class="" style="white-space:pre"> </span>0m0.192s</div></div><div><br></div><div>Happy hacking, compiling and testing. :)</div><div><br></div><div>Best regards</div><div>Soeren</div><div><br></div><div><br></div>

</div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/6/29 Markus Metz <span dir="ltr"><<a href="mailto:markus.metz.giswork@gmail.com" target="_blank">markus.metz.giswork@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Sat, Jun 29, 2013 at 1:26 PM, Hamish <<a href="mailto:hamish_b@yahoo.com">hamish_b@yahoo.com</a>> wrote:<br>

> Markus Metz wrote:<br>

><br>

>> Some more results with Sören's test program on an Intel(R) Core(TM) i5<br>

>> CPU M450 @ 2.40GHz (2 real cores, 4 fake cores) with gcc 4.7.2 and<br>

>> clang 3.3<br>

>><br>

>> gcc -O3<br>

>> v is 2.09131e+13<br>

>><br>

>> real    2m0.393s<br>

>> user    1m57.610s<br>

>> sys    0m0.003s<br>

>><br>

>> gcc -Ofast<br>

>> v is 2.09131e+13<br>

>><br>

>> real    0m7.218s<br>

>> user    0m7.018s<br>

>> sys    0m0.017s<br>

><br>

><br>

> nice. one thing we need to remember though is that it's not entirely<br>

> free, one thing -Ofast turns on is -ffast-math,<br>

> """<br>

>  This option is not turned on by any -O option besides -Ofast since it can<br>

>  result in incorrect output for programs that depend on an exact<br>

>  implementation of IEEE or ISO rules/specifications for math functions. It<br>

>  may, however, yield faster code for programs that do not require the<br>

>  guarantees of these specifications.<br>

> """<br>

><br>

> which may not be fit for our purposes.<br>

><br>

><br>

> With the ifort compiler there is '-fp-model precise' which allows only<br>

> optimizations which don't harm the results. Maybe gcc has something<br>

> similar.<br>

<br>

</div></div>In gcc, you can turn off -ffoo with -fno-foo, so maybe you can<br>

use -Ofast -fno-fast-math to preserve IEEE semantics.<br>
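<br>
A small sketch of why this matters (the function names are made up for illustration): floating-point addition is not associative, and according to the gcc documentation -ffast-math includes -fassociative-math, which permits the compiler to regroup such sums.<br>

```c
/* Hypothetical helpers for illustration only. */

/* Evaluates (a + b) + c. */
double sum_left(double a, double b, double c)
{
    double t = a + b;
    return t + c;
}

/* Evaluates a + (b + c). */
double sum_right(double a, double b, double c)
{
    double t = b + c;
    return a + t;
}
```

With a = 1e16, b = -1e16 and c = 1.0, evaluated in IEEE double precision, sum_left returns 1.0 while sum_right returns 0.0, because the 1.0 is rounded away when it is added to -1e16 first. A compiler that is allowed to reassociate may pick either grouping.<br>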

<div class="HOEnZb"><div class="h5">><br>

> Glad to see -floop-parallelize-all in gcc 4.7, it will help us identify<br>

> places to focus OpenMP work on.<br>

><br>

><br>

> Hamish<br>

><br>

</div></div></blockquote></div><br></div>