<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Dear Charles, Daniel, All,<br>
</p>
<p>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">It is possible to do 2x unroll of the Clenshaw loop to avoid the
shuffling of variables (t = xx(u0, u1), u1 = u0, u0 = t). See the
function SinCosSeries in geodesic.c where this is done.</pre>
</blockquote>
    I applied the trick of avoiding the variable swap to both the
    rolled and unrolled versions -- thanks, Charles, for pointing me
    to that trick in <i>SinCosSeries()</i>; I had been wondering how
    we could avoid that shuffle.</p>
<p>
<blockquote type="cite"><span>I recommend against unrolling the
loops. </span></blockquote>
<blockquote type="cite"><font face="Georgia">Any modern compiler
will make these optimizations, tuned to the target
architecture</font></blockquote>
</p>
  <p>I am not an optimization expert by any means.<br>
    However, based on initial tests running the authalic ==>
    geodetic conversion using the Clenshaw algorithm 1 billion times,
    my unrolled version of <i>clenshaw()</i> appears to be
    roughly 3% faster than the generic static inline one (27.7 seconds
    vs. 28.5 seconds, including generating a random input latitude),
    with -O2, MMX, SSE and GCC fast-math optimizations turned on
    (using GCC, not G++).<br>
    <br>
    I suspect the extra loop and indexing overhead of a generic <i>clenshaw()</i>
    explains this difference.<br>
</p>
<p>
<blockquote type="cite"><span> This makes the code longer and </span><span>harder
to read. </span></blockquote>
</p>
  <p>Personally, I find it much easier to follow what's going on in
    the expanded version, but that's just me :)</p>
<p>
<blockquote type="cite"><span> You also lose the flexibility of
adjusting the number </span><span>of terms in the expansion
at runtime.</span></blockquote>
</p>
  <p>That would be a very good argument, but that functionality is
    not exposed anywhere at the moment, not even as a compile-time
    option.<br>
    Would it be desirable to allow selecting how many orders / terms
    to use, either at compile time or at runtime in PROJ? If so, how
    would we go about making this option available?<br>
</p>
<p>3% might be a relatively small performance improvement, but I
would not call it negligible.<br>
However, I'm fine with using the <i>clenshaw()</i> function
inline if that's what we want to do.<br>
</p>
<p>
<blockquote type="cite">Undoubtedly, we could do a better job
centralizing some of these core capabilities, Clenshaw (and its
complex counterpart) + general auxiliary latitude conversions,
so that we don't have essentially duplicate code scattered all
over the place.</blockquote>
</p>
  <p>Agreed. There is also <i>clens()</i> in <i>tmerc.cpp</i> (and
    <i>clenS()</i> there for the complex version) implementing this
    Clenshaw summation.</p>
<p>Thank you!<br>
</p>
<p>Kind regards,</p>
<p>-Jerome<br>
</p>
<div class="moz-cite-prefix">On 9/12/24 12:57 PM, DANIEL STREBE
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:3310DAA0-E68B-4DA0-9F26-CD7C8F389A2F@aol.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr"><br>
</div>
<div dir="ltr"><br>
<blockquote type="cite">On Sep 12, 2024, at 05:09, Charles
Karney via PROJ <a class="moz-txt-link-rfc2396E" href="mailto:proj@lists.osgeo.org"><proj@lists.osgeo.org></a> wrote:<br>
<br>
</blockquote>
</div>
<blockquote type="cite">
<div dir="ltr"><span>I recommend against unrolling the loops.
This makes the code longer and</span><br>
<span>harder to read. You also lose the flexibility of
adjusting the number</span><br>
<span>of terms in the expansion at runtime.</span><br>
<span></span><br>
<span>…But</span><br>
<span>remember that compilers can do the loop unrolling for
you. Also,</span><br>
<span>doesn't the smaller code size with the loops result in
fewer cache</span><br>
<span>misses?</span><br>
</div>
</blockquote>
<br>
<div><font face="Georgia">I think Charles is spot-on here. Any
modern compiler will make these optimizations, tuned to the
target architecture. Different architectures will prefer
different amount of unrolling, so it’s best not to
second-guess by hard-coding. Loop overhead of a simple counter
is zero, normally, because of unrolling in the short cases and
because the branch prediction will favor continuation in the
longer cases. Meanwhile the loop counting happens in parallel
in one of the ALUs while the FPUs do their thing.</font></div>
<div><font face="Georgia"><br>
</font></div>
<div><font face="Georgia">— daan Strebe</font></div>
</blockquote>
</body>
</html>