[PROJ] Geodetic to Authalic latitude conversions
Jérôme St-Louis
jerome at ecere.com
Thu Sep 12 12:56:53 PDT 2024
Dear Charles, Daniel, All,
> It is possible to do 2x unroll of the Clenshaw loop to avoid the
> shuffling of variables (t = xx(u0, u1), u1 = u0, u0 = t). See the
> function SinCosSeries in geodesic.c where this is done.
I applied the trick to avoid swapping the variables in both the rolled
and unrolled versions -- thanks Charles for pointing me to that trick in
/SinCosSeries()/; I had been wondering how to avoid that shuffling.
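For reference, here is a minimal sketch of the trick as I understand it
from /SinCosSeries()/ (names and the series form below are illustrative,
not the actual PROJ or GeographicLib code): unrolling the recurrence
twice lets the two accumulators return to their original roles on every
pass, so the temporary is never needed.

    /* Sketch: Clenshaw evaluation of sum(c[k]*sin(2*k*theta), k = 1..n).
       Assumes n is even so the 2x-unrolled loop needs no clean-up step. */
    static double clenshaw_sketch(double sintheta, double costheta,
                                  const double c[], int n)
    {
        double X = 2 * (costheta - sintheta) * (costheta + sintheta); /* 2*cos(2*theta) */
        double u0 = 0, u1 = 0;
        const double *p = c + n;          /* one past the last coefficient */
        for (int k = n / 2; k--; ) {
            /* Unrolled x2: u0 and u1 swap back to their original roles,
               so no "t = ...; u1 = u0; u0 = t;" shuffle is needed. */
            u1 = X * u0 - u1 + *--p;
            u0 = X * u1 - u0 + *--p;
        }
        return 2 * sintheta * costheta * u0;   /* sin(2*theta) * u0 */
    }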
> I recommend against unrolling the loops.
> Any modern compiler will make these optimizations, tuned to the target
> architecture
I am not an optimization expert by any means.
However, based on initial tests running the authalic ==> geodetic
conversion using the Clenshaw algorithm 1 billion times, my unrolled
version of /clenshaw()/ appears to be roughly 3% faster than the
static inline one (27.7 seconds vs. 28.5 seconds, including generating a
random input latitude), with -O2, MMX, SSE and GCC fast-math
optimizations turned on (using GCC, not G++).
I think the extra logic overhead from a generic /clenshaw()/ could
explain this difference.
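For context, a loop along these lines would be one way to reproduce the
kind of measurement described above (a sketch only; /auth2geod_clenshaw()/
stands in for whichever conversion variant is under test and is not an
actual PROJ function):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical conversion under test: authalic -> geodetic latitude,
       in radians. */
    double auth2geod_clenshaw(double xi);

    int main(void)
    {
        const long N = 1000000000L;             /* 1 billion conversions */
        double sum = 0;                         /* keep results live */
        clock_t t0 = clock();
        for (long i = 0; i < N; i++) {
            /* random authalic latitude in roughly (-pi/2, pi/2) */
            double xi = (rand() / (double)RAND_MAX - 0.5) * 3.14159265358979;
            sum += auth2geod_clenshaw(xi);
        }
        printf("%.1f s (checksum %g)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
        return 0;
    }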
> This makes the code longer and harder to read.
Personally, I actually find it much easier to follow what's going on in
the expanded version, but that's just me :)
> You also lose the flexibility of adjusting the number of terms in the
> expansion at runtime.
That would be a very good argument, but that is not functionality that
is being exposed anywhere at the moment, not even as a compile-time option.
Would it be desirable to allow selecting how many orders / terms to use,
either at compile-time or at runtime in PROJ? If so, how would we go
about making this option available?
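To make the question concrete, here is one possible shape such an option
could take, with a compile-time default and the term count still
available as a runtime parameter (all names here are hypothetical,
purely to illustrate the idea):

    #include <math.h>

    /* Hypothetical: default series order, overridable at build time
       with e.g. -DAUTH_LAT_ORDER=4. */
    #ifndef AUTH_LAT_ORDER
    #define AUTH_LAT_ORDER 6
    #endif

    /* Generic (rolled) Clenshaw summation of sum(c[k]*sin(2*k*theta),
       k = 1..n), keeping the term count as a runtime parameter. */
    static double clenshaw(double sintheta, double costheta,
                           const double c[], int n)
    {
        double X = 2 * (costheta - sintheta) * (costheta + sintheta);
        double u0 = 0, u1 = 0;
        for (int k = n; k > 0; k--) {
            double t = X * u0 - u1 + c[k - 1];
            u1 = u0;
            u0 = t;
        }
        return 2 * sintheta * costheta * u0;
    }

    /* Authalic -> geodetic latitude, using at most AUTH_LAT_ORDER terms. */
    static double auth2geod(double xi, const double c[], int order)
    {
        int n = order < AUTH_LAT_ORDER ? order : AUTH_LAT_ORDER;
        return xi + clenshaw(sin(xi), cos(xi), c, n);
    }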
3% might be a relatively small performance improvement, but I would not
call it negligible.
However, I'm fine with using the /clenshaw()/ function inline if that's
what we want to do.
> Undoubtedly, we could do a better job centralizing some of these core
> capabilities, Clenshaw (and its complex counterpart) + general
> auxiliary latitude conversions, so that we don't have essentially
> duplicate code scattered all over the place.
Agreed. There is also /clens()/ in /tmerc.cpp/ (and /clenS()/ there for
the complex version) implementing this Clenshaw summation.
Thank you!
Kind regards,
-Jerome
On 9/12/24 12:57 PM, DANIEL STREBE wrote:
>
>
>> On Sep 12, 2024, at 05:09, Charles Karney via PROJ
>> <proj at lists.osgeo.org> wrote:
>>
>> I recommend against unrolling the loops. This makes the code longer and
>> harder to read. You also lose the flexibility of adjusting the number
>> of terms in the expansion at runtime.
>>
>> …But
>> remember that compilers can do the loop unrolling for you. Also,
>> doesn't the smaller code size with the loops result in fewer cache
>> misses?
>
> I think Charles is spot-on here. Any modern compiler will make these
> optimizations, tuned to the target architecture. Different
> architectures will prefer different amount of unrolling, so it’s best
> not to second-guess by hard-coding. Loop overhead of a simple counter
> is zero, normally, because of unrolling in the short cases and because
> the branch prediction will favor continuation in the longer cases.
> Meanwhile the loop counting happens in parallel in one of the ALUs
> while the FPUs do their thing.
>
> — daan Strebe