[PROJ] Vector/SIMD acceleration

Fri Apr 17 10:29:17 PDT 2020

I wonder whether it's possible to get the benefits of vectorization
without massive changes to the code.  Perhaps the basic projection
functions could be templated to allow Eigen arrays of PJ_LP's to be
passed.  Eigen already has overloads for componentwise arithmetic
operations and handles SIMD vectorization automatically.
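
To make the idea concrete, here is a minimal sketch; it is not PROJ code.
The function name spherical_merc_fwd and the struct XY are made up for
illustration, and std::valarray stands in for an Eigen array only so that
the example needs nothing beyond the standard library.  The point is that
one templated body serves both a single double and a whole array:

```cpp
#include <cmath>
#include <valarray>

// Hypothetical result type, standing in for PJ_XY.
template <typename T>
struct XY { T x, y; };

// Hypothetical forward spherical Mercator on a unit sphere:
//   x = lam,  y = log(tan(pi/4 + phi/2)).
// T may be double, or any array type that overloads arithmetic and the
// math functions componentwise (std::valarray here, as a stand-in for
// an Eigen array).
template <typename T>
XY<T> spherical_merc_fwd(const T& lam, const T& phi) {
    using std::log;  // scalar overloads; array overloads are also found
    using std::tan;  // through argument-dependent lookup
    const double quarter_pi = std::atan(1.0);
    const T arg = phi * 0.5 + quarter_pi;
    const T y = log(tan(arg));
    return {lam, y};
}
```

Calling spherical_merc_fwd(lam, phi) with two doubles gives one point;
calling it with two std::valarray<double> of equal length gives every
point in one call.  Whether the array version is actually vectorized
depends on the array type and the compiler, which is the part I would
hope to delegate to Eigen.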

A major advantage of this approach is that PROJ doesn't need to get into
the weeds with SIMD instructions.  So when a new instruction set comes
along we can (I hope) rely on the maintainers of Eigen to do the hard
work.  (I see that there's already some interoperability with Eigen and
CUDA.)

If this approach is followed, I would also recommend that the basic
floating point type, double, be either templated or typedef'ed.  This
would allow PROJ to be compiled to use long double or quad precision
which is often useful for tracking down round-off errors.
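
A minimal sketch of how that might be used, with hypothetical names
(prime_vertical_radius and roundoff_estimate are not PROJ functions): the
formula is written once against a template parameter, then evaluated in
both double and long double to estimate the round-off in the double
result.

```cpp
#include <cmath>

// Hypothetical example: radius of curvature in the prime vertical for a
// unit semi-major axis, N = 1 / sqrt(1 - e2 * sin(phi)^2), written
// against a template parameter instead of a hard-coded double.
template <typename Real>
Real prime_vertical_radius(Real phi, Real e2) {
    using std::sin;
    using std::sqrt;
    const Real s = sin(phi);
    return Real(1) / sqrt(Real(1) - e2 * s * s);
}

// Estimate the round-off in the double evaluation by repeating the same
// computation in long double and taking the absolute difference.
inline double roundoff_estimate(double phi, double e2) {
    const double lo = prime_vertical_radius<double>(phi, e2);
    const long double hi = prime_vertical_radius<long double>(phi, e2);
    return static_cast<double>(std::fabs(static_cast<long double>(lo) - hi));
}
```

On platforms where long double has the same representation as double,
the estimate is simply zero, so a quad precision type would be needed
there to learn anything.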

   --Charles

On 4/16/20 11:18 AM, Even Rouault wrote:
> Hi,
> 
> I've lately worked (again (*)) on a proof of concept of the Transverse 
> Mercator forward transformation to use Intel SIMD instructions to 
> transform several coordinate pairs simultaneously, potentially for use 
> by the proj_trans_array() / proj_trans_generic() functions. Transverse 
> Mercator is a very good candidate for that as it is quite expensive, and 
> has few branches.
> 
> The impact on the projection code is minimal, and the conversion of the 
> original code was mostly straightforward, by using C++ templates and 
> operator overloading: you mostly replace occurrences of "double" by a 
> templated type, and depending on how it is instantiated, it can expand 
> to a single, 2, 4, 8, etc. doubles, either in a single or several SIMD 
> registers. Optimizers do a good job at generating good assembly from that.
> 
> SIMD intrinsics are available for basic arithmetic operations and 
> comparisons, but not for trigonometric (sin, cos, etc.) and other 
> transcendental (exp, log, ...) functions that are often needed to 
> implement projections, and are usually the computation bottlenecks.
> 
> The SLEEF Vectorized Math Library (https://sleef.org/), using Boost License 
> (~ MIT), provides such operations, and with very good accuracy (accuracy 
> of 1 ULP for double precision). It is portable across OSes and supports 
> different architectures.
> 
> On my standalone prototype (outside of PROJ infrastructure, with just 
> the forward TMerc code extracted), I get a 3.8x speedup with the AVX2 + 
> FMA instruction sets, compared to a build with AVX2 + FMA enabled with 
> the original non-vector implementation, and using SLEEF. This is when 
> transforming 8 coordinate pairs at the same time. This 3.8x speed-up is 
> close to the optimal factor of 4 (AVX/AVX2 256-bit vectors can store 4 
> doubles). Without SLEEF, the speedup is 1.35x.
> 
> I guess that with AVX-512 available, gains in the [4x, 8x[ range could 
> be expected, but I haven't tested.
> 
> With pure SSE2 that comes automatically with x86_64, I can get a 1.55x 
> speed-up with SLEEF (optimal would be x2 due to the 128 bit SSE 
> vectors). Without SLEEF, the speedup is 1.35x as well.
> 
> I would expect similar gains on the reverse path of etmerc which has 
> equivalent complexity. Snyder's tmerc, geographic <--> cartesian 
> conversions, etc. would likely be other good candidates.
> 
> SLEEF could be made an optional dependency of PROJ. When it is not 
> available, the execution of trigonometric & transcendental functions is of 
> course serialized, hence the reduced efficiency.
> 
> I would expect the actual gains, once the needed changes to be able to 
> integrate that in PROJ itself are done, to be less than what I got on 
> the prototype, due to other overheads in code between the user call and 
> the actual projection code. But there are probably improvements that could 
> be made to reduce current overheads.
> 
> Is there any interest in seeing that integrated in PROJ? I guess this is 
> mostly of interest for people transforming at least billions of points. 
> A few millions is probably not enough to really appreciate the 
> difference: I can already get 4 million points/sec transformed by 
> proj_trans() with tmerc.
> 
> The question of funding such work would also remain to be solved.
> 
> Even
> 
> (*) I had a feeling of deja-vu when writing this email, and actually I 
> realized I wrote a similar one almost 5 years ago 
> ( http://lists.maptools.org/pipermail/proj/2015-June/007169.html ). C++ 
> at that time seemed to be a hurdle for a number of people, but luckily 
> we have gone through it now.
> 
> -- 
> 
> Spatialys - Geospatial professional services
> 
> http://www.spatialys.com
> 
> 
> _______________________________________________
> PROJ mailing list
> PROJ at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/proj
>