[PROJ] Vector/SIMD acceleration

Even Rouault even.rouault at spatialys.com
Thu Apr 16 08:18:24 PDT 2020


Hi,

I have recently been working (again (*)) on a proof of concept that uses Intel SIMD
instructions in the Transverse Mercator forward transformation to transform several
coordinate pairs simultaneously, potentially for use by the proj_trans_array() /
proj_trans_generic() functions. Transverse Mercator is a very good candidate for that,
as it is quite expensive and has few branches.

The impact on the projection code is minimal, and the conversion of the original code was
mostly straightforward, using C++ templates and operator overloading: you mostly
replace occurrences of "double" with a templated type, and depending on how it is
instantiated, it can expand to 1, 2, 4, 8, etc. doubles, held in one or several SIMD registers.
Optimizers do a good job of generating efficient assembly from that.
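
A minimal sketch of that pattern, assuming a hypothetical VecD<N> pack type backed
by a plain array rather than by real SIMD registers (this is not the prototype's
code; the names VecD and poly2 are made up for illustration). The point is only that
a formula written once against a template parameter T works unchanged for a
single double and for a pack of N doubles:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical pack of N doubles. A real implementation would wrap one or
// several SIMD registers; a plain array keeps this sketch portable.
template <std::size_t N>
struct VecD {
    std::array<double, N> v{};

    VecD() = default;
    // Broadcast a scalar to every lane so that constants still work.
    VecD(double s) { v.fill(s); }

    double &operator[](std::size_t i) { assert(i < N); return v[i]; }
    double operator[](std::size_t i) const { assert(i < N); return v[i]; }

    // Lane-wise arithmetic through operator overloading.
    friend VecD operator+(VecD a, const VecD &b) {
        for (std::size_t i = 0; i < N; ++i) a.v[i] += b.v[i];
        return a;
    }
    friend VecD operator*(VecD a, const VecD &b) {
        for (std::size_t i = 0; i < N; ++i) a.v[i] *= b.v[i];
        return a;
    }
};

// Formula written once: c0 + c1*x + c2*x^2 in Horner form.
// T can be double (one value) or VecD<N> (N values at once).
template <typename T>
T poly2(const T &x, double c0, double c1, double c2) {
    return T(c0) + x * (T(c1) + x * T(c2));
}
```

Calling poly2(2.0, ...) evaluates one value, while poly2(VecD<4>{...}, ...) evaluates
four lanes with the same source code; the loops over lanes are what an optimizer,
or explicit intrinsics in a real pack type, would turn into vector instructions.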

SIMD intrinsics are available for basic arithmetic operations and comparisons, but not for
trigonometric (sin, cos, etc.) and other transcendental (exp, log, ...) functions, which are
often needed to implement projections and are usually the computation bottlenecks.

The SLEEF Vectorized Math Library (https://sleef.org/), under the Boost Software License
(similar to MIT), provides such operations with very good accuracy (1 ULP for double
precision). It is portable across operating systems and supports different architectures.

On my standalone prototype (outside of the PROJ infrastructure, with just the forward TMerc
code extracted), I get a 3.8x speedup with the AVX2 + FMA instruction sets and SLEEF,
compared to a build of the original non-vector implementation with AVX2 + FMA enabled.
This is when transforming 8 coordinate pairs at the same time. This 3.8x speedup is
close to the optimal factor of 4 (AVX/AVX2 256-bit vectors can store 4 doubles). Without
SLEEF, the speedup is 1.35x.
I guess that with AVX-512 available, gains between 4x and 8x could be expected, but I
haven't tested.

With pure SSE2, which comes automatically with x86_64, I get a 1.55x speedup with SLEEF
(optimal would be 2x, since 128-bit SSE vectors hold 2 doubles). Without SLEEF, the speedup
is 1.35x as well.
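
As a small worked check of those "optimal" factors, assuming 64-bit doubles, the
lane count is the register width divided by 64, and the measured speedup can be
expressed as a fraction of it (the function names are made up for this sketch):

```cpp
#include <cassert>

// Number of 64-bit doubles that fit in one SIMD register of the given width.
int lanes_of_doubles(int register_bits) {
    assert(register_bits > 0 && register_bits % 64 == 0);
    return register_bits / 64;
}

// Fraction of the ideal speedup (one result per lane) that was measured.
double simd_efficiency(double measured_speedup, int register_bits) {
    return measured_speedup / lanes_of_doubles(register_bits);
}
```

With the figures above, 3.8x on 256-bit registers is about 95% of the ideal 4x, and
1.55x on 128-bit registers is about 78% of the ideal 2x.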

I would expect similar gains on the reverse path of etmerc which has equivalent complexity. 
Snyder's tmerc, geographic <--> cartesian conversions, etc. would likely be other good 
candidates.

SLEEF could be made an optional dependency of PROJ. When it is not available, the
execution of trigonometric and transcendental functions is of course serialized, hence the
reduced efficiency.
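
A minimal sketch of what "serialized" means here, assuming the lanes are exposed
as a std::array (the helper names map_lanes and vsin are hypothetical, not taken
from the prototype): without a vector math library, each lane is passed through the
scalar libm function in turn, so the arithmetic around the call is vectorized but the
sin itself is not.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: apply a scalar function to each lane, one after another.
template <std::size_t N, typename F>
std::array<double, N> map_lanes(const std::array<double, N> &x, F f) {
    std::array<double, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r[i] = f(x[i]);  // one scalar call per lane: this is the serialized part
    }
    return r;
}

// Fallback "vector" sine used when no vectorized math library is available.
template <std::size_t N>
std::array<double, N> vsin(const std::array<double, N> &x) {
    return map_lanes(x, [](double a) { return std::sin(a); });
}
```

With a vectorized math library, the body of vsin would instead be a single call
operating on all lanes at once, which is where most of the extra speedup comes from.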

Once the changes needed to integrate this in PROJ itself are done, I would expect the
actual gains to be smaller than what I got on the prototype, due to other overheads in the
code between the user call and the actual projection code. But there are probably
improvements that could be made to reduce the current overheads.

Is there interest in seeing this integrated in PROJ? I guess it is mostly of interest for
people transforming at least billions of points. A few million is probably not enough to
really appreciate the difference: I can already get 4 million points/sec transformed by
proj_trans() with tmerc.

The question of funding such work would also remain to be solved.

Even

(*) I had a feeling of déjà vu when writing this email, and I realized that I actually wrote
a similar one almost 5 years ago
( http://lists.maptools.org/pipermail/proj/2015-June/007169.html ). C++ at that time seemed
to be a hurdle for a number of people, but luckily we have gotten past that now.

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com