[Proj] Experiment to speed up proj.4 by 2 or more
support at mnspoint.com
Thu Jul 2 07:06:58 PDT 2015
Hello Even,
we have already tested every optimization (SSE2 etc.) that we could get
our compiler to produce. The bad news is that even though performance
got somewhat better in some respects, the overall system performance
got worse: screen updating and interrupt servicing, for example, got
worse! So basically there is not much to be gained by taking that road.
We switched back to not optimizing some important sections, since
overloading the CPU made the OS work less efficiently and fluently
(at least on some processors).
The problem with very aggressive optimizations is that the results can
be very strange across different processors and operating systems. And
since most of them have only been tested with rather "lame" programs by
the manufacturers of processors and operating systems, it is usually
best not to try too much! Or at least all combinations should be
tested, which might be a very large task!
If you would like to do some special C++ work on the proj.4 package
anyway, please add a syntax scanner in front of it, so that it makes
sure the user entered projection definitions that the library really
understood. That would reduce the number of errors users make with the
rather complex definitions the library needs. (Right now Proj.4 accepts
almost any definition and says nothing about the fact that part of it
may have been totally discarded and did not make any sense.)
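A minimal sketch of what such a scanner could check, assuming the
definition is a whitespace-separated list of "+key=value" tokens. The
function name unknown_params and the caller-supplied list of known keys
are hypothetical and not part of proj.4; a real scanner would also have
to validate the values.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of proj.4: returns every token of a
// "+key=value" definition string whose key is not in the caller-supplied
// set of known parameter names.
std::vector<std::string> unknown_params(const std::string& definition,
                                        const std::set<std::string>& known) {
    std::vector<std::string> rejected;
    std::istringstream in(definition);
    std::string token;
    while (in >> token) {                 // tokens are never empty here
        if (token[0] != '+') {            // every parameter must start with '+'
            rejected.push_back(token);
            continue;
        }
        // Key is the text between '+' and '=' (or the end of the token
        // for flag-style parameters that carry no value).
        std::string key = token.substr(1, token.find('=') - 1);
        if (known.count(key) == 0) {
            rejected.push_back(key);
        }
    }
    return rejected;
}
```

With a known set of {"proj", "zone", "ellps", "datum"}, the definition
"+proj=utm +zone=35 +elps=WGS84" would be reported as containing the
unrecognized key "elps" instead of being silently accepted.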
And name it something other than "Proj.4" (there is already "libProj4"),
maybe "Cpp.Proj.4" for example, so you are free to do whatever! :D
I have not had time to check what the GitHub people have done with the
package. Most likely nothing but taken all the glory and destroyed some
important sections? :) -- which is the usual approach.. haha :)
http://libproj4.maptools.org/
regards: Janne.
----------------------------------------------
Even Rouault wrote on 23.06.2015 18:29:
> Hi,
>
> I've done an experiment to use Intel SIMD intrinsics
> (https://en.wikipedia.org/wiki/SIMD), and I think they could be
> beneficial for proj, when called to transform several coordinates at a
> time.
>
> I've used the SSE2 instruction set (128 bit registers, so 2 doubles at a
> time), and I managed to speed up the inverse Transverse Mercator
> ellipsoidal transformation (i.e. from projected to geodetic) by a factor
> of ~ 2 (excluding potential datum transformations).
>
> One key for performance was to find an efficient way of computing the
> usual transcendental functions (i.e. sin, cos, tan and their inverses,
> exp, ln, etc.) with SIMD registers, since they are not included in the
> instruction set. Otherwise you have to collect each component of the SIMD
> register, evaluate it with the x87 coprocessor, and reassemble the SIMD
> register from the computed components, which kills all the other
> performance gains. The SLEEF library (http://freecode.com/projects/sleef)
> has such routines, is in the public domain and works rather well (with
> gcc/clang, although it has some rough edges when trying with MSVC, but
> nothing that cannot be overcome).
>
> I've encapsulated the use of SSE2 intrinsics in a C++ class with
> overloading of arithmetic operators, so the resulting code looks pretty
> much similar to the original C code, which is great for readability
> (although the original C code isn't always very readable ;-)), and
> confidence that it doesn't introduce errors. Conditional branches are not
> so great for SIMD performance, but there are tricks to rewrite some of
> them with a ternary-like operator.
>
> SLEEF also supports the AVX & AVX2+FMA instruction sets (256 bit
> registers),
> which could also lead to a further ~ x2 gain over SSE2.
>
> So I was wondering if there was :
>
> 1) interest of the project in pursuing that approach (which involves
> introducing C++ in the code base, as an implementation detail, the
> interface being unchanged). We could imagine having the same source
> files compiled several times with different register sizes, with runtime
> selection of the appropriate variant (note: SSE2 is guaranteed to be
> available on all x86_64 compatible processors. AVX/AVX2 is for more
> recent CPUs).
>
> 2) ... and sponsors interested in making that happen.
>
> Finally, the proof of concept:
> * regular code (runs in ~30s on Core i5 750 @ 2.67GHz ):
> https://gist.github.com/rouault/946104d0b98e8e8cc564
> * SSE2 code (~14s):
> https://gist.github.com/rouault/3bbc31c9f12391d79920
>
> Best regards,
>
> Even
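For readers who have not seen the technique described in the quoted
message, here is a rough sketch of an SSE2 wrapper class with overloaded
arithmetic operators and a ternary-like select. It is not the code from
the linked gists: the names Vec2d, select and abs_pairs are made up for
illustration, and an x86/x86_64 compiler that provides <emmintrin.h> is
assumed.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; assumes an x86/x86_64 target

// Hypothetical wrapper around one 128-bit register holding 2 doubles.
class Vec2d {
public:
    Vec2d(__m128d v) : v_(v) {}
    explicit Vec2d(double x) : v_(_mm_set1_pd(x)) {}  // same value in both lanes
    static Vec2d load(const double* p) { return Vec2d(_mm_loadu_pd(p)); }
    void store(double* p) const { _mm_storeu_pd(p, v_); }
    __m128d raw() const { return v_; }
private:
    __m128d v_;
};

// Arithmetic operators, so that formulas read like the scalar C code.
inline Vec2d operator+(Vec2d a, Vec2d b) { return _mm_add_pd(a.raw(), b.raw()); }
inline Vec2d operator-(Vec2d a, Vec2d b) { return _mm_sub_pd(a.raw(), b.raw()); }
inline Vec2d operator*(Vec2d a, Vec2d b) { return _mm_mul_pd(a.raw(), b.raw()); }
inline Vec2d operator/(Vec2d a, Vec2d b) { return _mm_div_pd(a.raw(), b.raw()); }

// Lane-wise comparison: each lane becomes all ones (true) or all zeros (false).
inline Vec2d operator<(Vec2d a, Vec2d b) { return _mm_cmplt_pd(a.raw(), b.raw()); }

// Ternary-like operator: per lane, mask ? a : b, computed without a branch.
inline Vec2d select(Vec2d mask, Vec2d a, Vec2d b) {
    return _mm_or_pd(_mm_and_pd(mask.raw(), a.raw()),
                     _mm_andnot_pd(mask.raw(), b.raw()));
}

// Scalar original would be: out[i] = in[i] < 0 ? -in[i] : in[i];
// n is assumed to be even, so no scalar tail loop is needed.
inline void abs_pairs(const double* in, double* out, int n) {
    const Vec2d zero(0.0);
    for (int i = 0; i < n; i += 2) {
        Vec2d x = Vec2d::load(in + i);
        select(x < zero, zero - x, x).store(out + i);
    }
}
```

Both sides of the select are evaluated for every lane, which is why this
rewrite only pays off for cheap branches. The transcendental functions
mentioned in the message would still need a vector math library such as
SLEEF and are left out of this sketch.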