Ok folks,<div>I am a bit confused now. After setting OMP_NUM_THREADS=1 and exporting, I get </div><div><br></div><div><div> 100%</div><div>v.surf.rst complete.</div><div><br></div><div>real<span class="Apple-tab-span" style="white-space:pre"> </span>352m46.451s</div>
<div>user<span class="Apple-tab-span" style="white-space:pre"> </span>341m14.196s</div><div>sys<span class="Apple-tab-span" style="white-space:pre"> </span>2m16.477s</div><div> </div><div>Over 100 minutes faster. So the multiple cores get in each other's way...</div>
<div><br></div><div>Recompiling without OpenMP.....</div><div><br></div><div><br></div><div>Thanks!</div><div><br></div><div>Doug</div><div><br></div><div><br></div><div><br></div><div class="gmail_quote">On Mon, Feb 25, 2013 at 12:14 AM, Hamish <span dir="ltr"><<a href="mailto:hamish_b@yahoo.com" target="_blank">hamish_b@yahoo.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
to test the efficiency (does 650% of the CPU go 6.5x as fast as<br>
running 100% on a single core?) you can use the OMP_* environment<br>
variables. from the bash command line:<br>
<br>
<br>
# try running it serially:<br>
OMP_NUM_THREADS=1<br>
export OMP_NUM_THREADS<br>
time g.module ...<br>
<br>
<br>
# let OpenMP set number of concurrent threads to number of local CPU cores<br>
unset OMP_NUM_THREADS<br>
time g.module ...<br>
<br>
<br>
then compare the overall & system time to complete.<br>
see <a href="http://grasswiki.osgeo.org/wiki/OpenMP#Run_time" target="_blank">http://grasswiki.osgeo.org/wiki/OpenMP#Run_time</a><br>
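<br>
To put a number on that comparison, here is a minimal sketch. The helper<br>
name omp_efficiency and the sample timings are made up for illustration;<br>
they are not measurements from any GRASS module.<br>

```shell
# Hypothetical helper: speedup and parallel efficiency from two wall-clock times.
# Arguments: serial seconds, parallel seconds, number of threads used.
omp_efficiency() {
    awk -v s="$1" -v p="$2" -v n="$3" 'BEGIN {
        speedup = s / p
        printf "speedup=%.2f efficiency=%.0f%%\n", speedup, 100 * speedup / n
    }'
}

# Illustrative numbers only: 600 s serial, 150 s on 8 threads.
omp_efficiency 600 150 8   # prints: speedup=4.00 efficiency=50%
```

An efficiency well below 100% means the extra threads are mostly adding overhead.<br>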
<br>
if that is horribly inefficient, it will probably be more<br>
efficient to run multiple (different) serial jobs at the same<br>
time. The bash "wait" command is quite nice for that: it waits<br>
for all backgrounded jobs to complete before going on.<br>
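<br>
A minimal sketch of that pattern, assuming three independent jobs. Here<br>
run_job and run_all are hypothetical stand-ins; in practice run_job would<br>
be replaced by the actual module call.<br>

```shell
# Hypothetical stand-in for a real job; swap in the actual module call.
run_job() {
    sleep 1
    echo "job $1 done"
}

run_all() {
    # Keep each individual job single-threaded.
    OMP_NUM_THREADS=1
    export OMP_NUM_THREADS

    # Launch three independent jobs in the background...
    for name in a b c; do
        run_job "$name" &
    done

    # ...and block here until every one of them has finished.
    wait
    echo "all jobs finished"
}

run_all
```

The three "done" lines may appear in any order; the final line is only printed after all of them.<br>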
<br>
for r.in.{xyz|lidar|mb} this works quite well for generating<br>
multiple statistics at the same time, as the jobs will all want<br>
to read the same part of the input file at about the same<br>
time, so it will still be fresh in the disk cache keeping I/O<br>
levels low. (see the r3.in.xyz scripts)<br>
<br>
<br>
for v.surf.bspline my plan was to put each of the data subregions<br>
in their own thread; for v.surf.rst my plan was to put each of<br>
the quadtree squares into their own thread. Since each thread<br>
introduces a finite amount of time to create and destroy, the<br>
goal is to make fewer, longer-running ones. Anything more than about<br>
an order of magnitude more than the number of cores you have is<br>
unneeded overhead.<br>
<br>
e.g., processing all satellite bands at the same time is a nice<br>
efficient win. If you process all 2000 rows of a raster map in<br>
2000 just-an-instant-to-complete threads, the ratio of create/destroy<br>
overhead to thread lifetime really takes its toll.<br>
Even as thread creation/destruction overheads become more<br>
efficiently handled by the OSs and compilers, the situation will<br>
still be the same. The interesting case is OpenCL, where your<br>
video card can run 500 GPU units...<br>
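<br>
The "about ten times the core count" rule of thumb for CPU threads can be<br>
sketched as a chunk-size calculation. The helper plan_chunks is<br>
hypothetical, and the factor of 10 is only a reading of "an order of<br>
magnitude", not a tuned value.<br>

```shell
# Hypothetical helper: cap the number of work chunks at ~10x the core count.
# Arguments: number of rows, number of cores. Prints "chunks rows_per_chunk".
plan_chunks() {
    rows=$1
    cores=$2
    max_chunks=$((cores * 10))
    if [ "$rows" -lt "$max_chunks" ]; then
        chunks=$rows
    else
        chunks=$max_chunks
    fi
    # Round up so that every row lands in some chunk.
    per_chunk=$(( (rows + chunks - 1) / chunks ))
    echo "$chunks $per_chunk"
}

# 2000 rows on an assumed 8-core machine.
plan_chunks 2000 8   # prints: 80 25
```

So instead of 2000 one-row threads, the 2000-row map would be split into 80 chunks of 25 rows each.<br>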
<span class="HOEnZb"><font color="#888888"><br>
<br>
Hamish<br>
</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br><div>Doug Newcomb</div><div>USFWS</div><div>Raleigh, NC</div><div>919-856-4520 ext. 14 <a href="mailto:doug_newcomb@fws.gov" target="_blank">doug_newcomb@fws.gov</a></div>
<div>---------------------------------------------------------------------------------------------------------</div><div>The opinions I express are my own and are not representative of the official policy of the U.S.Fish and Wildlife Service or Dept. of the Interior. Life is too short for undocumented, proprietary data formats.</div>
</div>