<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Dear Charles, Daniel, All,<br>
</p>
<p>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">It is possible to do 2x unroll of the Clenshaw loop to avoid the
shuffling of variables (t = xx(u0, u1), u1 = u0, u0 = t). See the
function SinCosSeries in geodesic.c where this is done.</pre>
</blockquote>
    I applied the trick of avoiding the variable swap to both the
    rolled and unrolled versions -- thanks, Charles, for pointing me
    to that trick in <i>SinCosSeries()</i>; I had been wondering how
    we could avoid that shuffle.</p>
<p>
<blockquote type="cite"><span>I recommend against unrolling the
loops. </span></blockquote>
<blockquote type="cite"><font face="Georgia">Any modern compiler
will make these optimizations, tuned to the target
architecture</font></blockquote>
</p>
  <p>I am not an optimization expert by any means.<br>
    However, based on initial tests running the authalic ==>
    geodetic conversion using the Clenshaw algorithm 1 billion times,
    my unrolled version of <i>clenshaw()</i> appears to be
    roughly 3% faster than the generic static inline one (27.7 seconds
    vs. 28.5 seconds, including generating a random input latitude),
    with -O2, MMX, SSE and GCC fast-math optimizations turned on
    (using GCC, not G++).<br>
    <br>
    I suspect the extra loop and indexing overhead of a generic <i>clenshaw()</i>
    explains this difference.<br>
</p>
<p>
<blockquote type="cite"><span> This makes the code longer and </span><span>harder
to read. </span></blockquote>
</p>
  <p>Personally, I find it much easier to follow what's going on in
    the expanded version, but that's just me :)</p>
<p>
<blockquote type="cite"><span> You also lose the flexibility of
adjusting the number </span><span>of terms in the expansion
at runtime.</span></blockquote>
</p>
  <p>That would be a very good argument, but that functionality is
    not exposed anywhere at the moment, not even as a compile-time
    option.<br>
    Would it be desirable to allow selecting how many orders / terms
    to use, either at compile time or at runtime in PROJ? If so, how
    would we go about making this option available?<br>
</p>
<p>3% might be a relatively small performance improvement, but I
would not call it negligible.<br>
However, I'm fine with using the <i>clenshaw()</i> function
inline if that's what we want to do.<br>
</p>
<p>
<blockquote type="cite">Undoubtedly, we could do a better job
centralizing some of these core capabilities, Clenshaw (and its
complex counterpart) + general auxiliary latitude conversions,
so that we don't have essentially duplicate code scattered all
over the place.</blockquote>
</p>
  <p>Agreed. There is also <i>clens()</i> in <i>tmerc.cpp</i> (and
    <i>clenS()</i> there for the complex version) implementing this
    Clenshaw summation.</p>
<p>Thank you!<br>
</p>
<p>Kind regards,</p>
<p>-Jerome<br>
</p>
<div class="moz-cite-prefix">On 9/12/24 12:57 PM, DANIEL STREBE
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:3310DAA0-E68B-4DA0-9F26-CD7C8F389A2F@aol.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr"><br>
</div>
<div dir="ltr"><br>
<blockquote type="cite">On Sep 12, 2024, at 05:09, Charles
Karney via PROJ <a class="moz-txt-link-rfc2396E" href="mailto:proj@lists.osgeo.org"><proj@lists.osgeo.org></a> wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr"><span>I recommend against unrolling the loops.
This makes the code longer and</span><br>
<span>harder to read. You also lose the flexibility of
adjusting the number</span><br>
<span>of terms in the expansion at runtime.</span><br>
<span></span><br>
<span>…But</span><br>
<span>remember that compilers can do the loop unrolling for
you. Also,</span><br>
<span>doesn't the smaller code size with the loops result in
fewer cache</span><br>
<span>misses?</span><br>
</div>
</blockquote>
<br>
<div><font face="Georgia">I think Charles is spot-on here. Any
modern compiler will make these optimizations, tuned to the
target architecture. Different architectures will prefer
different amount of unrolling, so it’s best not to
second-guess by hard-coding. Loop overhead of a simple counter
is zero, normally, because of unrolling in the short cases and
because the branch prediction will favor continuation in the
longer cases. Meanwhile the loop counting happens in parallel
in one of the ALUs while the FPUs do their thing.</font></div>
<div><font face="Georgia"><br>
</font></div>
<div><font face="Georgia">— daan Strebe</font></div>
</blockquote>
</body>
</html>