[PROJ] Is PROJ unreasonably slow?
Thomas Knudsen
knudsen.thomas at gmail.com
Wed Mar 11 01:45:32 PDT 2026
TL;DR: Is PROJ unreasonably slow? The answer is "Mostly no", but with a few
caveats...
DISCLAIMER: The evidence presented below is weak - timing experiments repeated
only 2-4 times, on just a single computer. But at first sight the speed tests
all hint at the same conclusion: that PROJ could be made significantly faster.
Closer inspection, however, shows that while PROJ really can be made faster in
some corner cases, in the general case it already is quite fast.
But despite the weak evidence, and the limited speed gains expected, I allow
myself to present the material here, to frame some potential items for
discussion and/or action - because changes that may improve speed may also
improve maintainability. And while the latter is hard to quantify, the former is
readily quantifiable with a stopwatch.
Setting
-------
I recently did a validation experiment, reimplementing the PROJ-based
canonical transformation from ITRF2014 to SWEREF99, the Swedish ETRS89
realization.
The reimplementation is based on Rust Geodesy (RG) [1], and the validation is
carried out by transforming a test set of 10 million randomly generated
coordinates: first using the RG coordinate processing program "kp" [2], then the
PROJ workhorse "cs2cs" [3].
But I never got around to actually validating, beyond a handful of
sufficiently convincing random checks. I got sidetracked by something more
concerning - namely that PROJ appeared to be unreasonably slow:
While kp transformed the 10 million coordinates in a little less than 17
seconds, cs2cs needed more than 20 minutes - i.e. approximately 75 times
slower.
Now, the ITRF2014->SWEREF99 transformation is non-trivial [5] and includes grid
lookups in the 5th and 7th steps of the pipeline (the grid was, by the way,
already fully downloaded with projsync). So I had a hunch that the random
organization of the input data might be poison for PROJ's grid caching. And
rightly so: after sorting the 10 million input points from north to south, the
run time went from about 1200 seconds to about 100 seconds.
A respectable speed-up, although still 6 times slower than RG. But as the
points are still randomly (dis)organized in the east-west direction, there may
be even more speed-up possible for a more piecewise localized data set, such as
the coordinates of e.g. a typical OGC "simple feature" geometry collection.
But before constructing a test data set with such characteristics, I figured I
would take a closer look at some PROJ operators that do not depend on grid
access, to get a feeling for how much time goes into accessing the test
coordinates, and converting the external text format to the internal IEEE 754
floating point format.
For this, I had to change tools from cs2cs to cct [4]. But the results were
still disappointing: running the same 10 million coordinates through first a
no-operation (proj=noop), then a pipeline of 8 concatenated noops, gave these
results:
NOOP
Running kp first.
One noop.    kp: 8.95 s, cct: 83 s (kp 9.27 times faster)
Eight noops. kp: 9.38 s, cct: 97 s (kp 10.34 times faster)
Running cct first.
One noop.    kp: 9.56 s, cct: 83 s (kp 8.68 times faster)
Eight noops. kp: 9.65 s, cct: 92 s (kp 9.53 times faster)
To see what it costs to do some real computation, I also tried projecting the
test coordinates to UTM zone 32. Here kp consistently ran at close to 13
seconds, while cct had 3 cases of close to 100 seconds, and an oddly speedy
outlier of just 70 seconds. I suspect I may have misread the clock there.
UTM
Running kp first.
utm zone=32: kp: 13.54 s, cct: 70 s  (kp 5.17 times faster)
utm zone=32: kp: 12.37 s, cct: 96 s  (kp 7.76 times faster)
Running cct first.
utm zone=32: kp: 13.36 s, cct: 96 s  (kp 7.18 times faster)
utm zone=32: kp: 13.57 s, cct: 101 s (kp 7.43 times faster)
But still, RG seems to be between 5 and 7 times faster than PROJ: Even when
comparing the worst kp time with the outlying cct time, kp is still more than 5
times faster than cct.
Why is it so? Some potential reasons
------------------------------------
FIRST, RG does grid access by lumping all grids together in an indexed file, a
"unigrid", and accessing that file through memory mapping (the rationale, with
the punchline "Don't implement poorly what the OS already provides
excellently!", is described in [6]).
PROJ accesses files in chunks, and since PROJ grid files are typically band
interleaved, PROJ needs 3 accesses, far away from each other, to get all
relevant information for a single point, whereas RG uses point interleaving,
and hence gets all relevant information from a single read operation.
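The difference can be sketched in a few lines of Rust. The grid dimensions are those of the file discussed below; the indexing functions are purely illustrative, not PROJ's or RG's actual layout code:

```rust
// Illustrative grid layout arithmetic: 303x313 nodes, 3 channels per node.
const ROWS: usize = 313;
const COLS: usize = 303;
const CHANNELS: usize = 3;

// Band interleaved: all values of channel 0, then channel 1, then channel 2.
// The three values for one node live a whole band apart in the file.
fn band_interleaved_offsets(row: usize, col: usize) -> [usize; CHANNELS] {
    let band = ROWS * COLS;
    [0, 1, 2].map(|c| c * band + row * COLS + col)
}

// Point (node/pixel) interleaved: the three values for one node are adjacent,
// so a single read fetches them all.
fn point_interleaved_offset(row: usize, col: usize) -> usize {
    (row * COLS + col) * CHANNELS
}

fn main() {
    let [a, b, c] = band_interleaved_offsets(100, 100);
    // Band interleave: three reads, each ROWS*COLS = 94_839 values apart.
    assert_eq!(b - a, ROWS * COLS);
    assert_eq!(c - b, ROWS * COLS);

    // Point interleave: one contiguous run of CHANNELS values.
    let p = point_interleaved_offset(100, 100);
    println!("band offsets: {a}, {b}, {c}; point offsets: {p}..{}", p + CHANNELS);
}
```

With point interleave, one memory-mapped read covers all three velocity components; with band interleave, a naive reader touches three widely separated file regions per node.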
Also, PROJ uses compressed files. And in this specific case
(eur_nkg_nkgrf17vel.tif), the file is just 303x313 grid nodes, each node
consisting of three 32-bit values, hence 303x313x3x4 bytes = 1_138_068 bytes.
But the compression is rather modest: the compressed file weighs 715_692 bytes,
i.e. a reduction of just 37% - yet it prohibits direct access into the file.
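The size arithmetic can be checked directly; a throwaway sketch:

```rust
// Sanity check of the grid-size arithmetic: node counts and byte sizes
// are the ones quoted in the text.
fn uncompressed_bytes(cols: usize, rows: usize, channels: usize) -> usize {
    cols * rows * channels * 4 // 4 bytes per 32-bit value
}

fn reduction_percent(compressed: usize, uncompressed: usize) -> f64 {
    100.0 * (1.0 - compressed as f64 / uncompressed as f64)
}

fn main() {
    let full = uncompressed_bytes(303, 313, 3);
    assert_eq!(full, 1_138_068);
    let pct = reduction_percent(715_692, full);
    println!("uncompressed: {full} bytes, reduction: {pct:.0}%");
}
```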
RG skips all this, accesses the file as if it were one long array, and leaves
all caching/swapping to the OS, which has a much better general view of the
available system resources than any single running process.
SECOND, PROJ handles single coordinates, while RG handles collections. Among
other things, this leads to a reduction in the number of function calls: PROJ
loops over the coordinates and calls an operator on each coordinate, while RG
calls an operator and lets the operator loop over the coordinates. For the same
reason, PROJ needs to interpret a pipeline for each coordinate, while RG just
interprets the pipeline once for each collection of e.g. 100_000 coordinates.
Now, interpreting a pipeline is not a heavy task: essentially, it is just an
iterator over the steps of the pipeline. But it is a little piece of extra
ceremony that needs to be set up for every single coordinate.
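A minimal sketch of the two calling conventions, with invented trait and type names (not PROJ's or RG's actual APIs):

```rust
// Illustrative only: "Operator", "fwd" and "fwd_all" are made-up names.
type Coord = [f64; 4];

trait Operator {
    // PROJ-style entry point: one call (and one dispatch) per coordinate.
    fn fwd(&self, coord: &mut Coord);

    // RG-style entry point: one call per collection; the operator loops
    // internally, so pipeline interpretation happens once per batch.
    fn fwd_all(&self, coords: &mut [Coord]) {
        for c in coords.iter_mut() {
            self.fwd(c);
        }
    }
}

struct Noop;
impl Operator for Noop {
    fn fwd(&self, _coord: &mut Coord) {}
}

fn main() {
    let pipeline: Vec<Box<dyn Operator>> = vec![Box::new(Noop), Box::new(Noop)];
    let mut coords = vec![[1.0, 2.0, 3.0, 2020.0]; 100_000];

    // RG-style: interpret the pipeline once per collection, not per point.
    for step in &pipeline {
        step.fwd_all(&mut coords);
    }
    println!("processed {} coordinates", coords.len());
}
```

The per-coordinate convention pays the dispatch and setup overhead 10 million times for the test set; the per-collection convention pays it once per batch.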
This leads me on to the THIRD potential reason, namely that PROJ's internal data
flow is rather complex, carrying leftovers that made good sense back when PROJ
was simply a projection library, but which are mostly annoying today.
When all operators were projections, it made good sense to centralize the
handling of e.g. the central meridian in the pj_fwd and pj_inv functions.
Today, this is to a large degree something that needs to be worked around when
the operator is not a projection, but another kind of geodetic operator [7].
Also, originally PROJ was strictly 2D, so pj_fwd and pj_inv handle 2D data
only. When we had to extend them with both 3D and 4D variations, we also got
functional duplication and undesired messiness. This is likely one of the
reasons that PROJ's combined implementation of pipeline and stack functionality
weighs in at 725 lines, while RG, which has a unified data flow architecture,
provides (mostly) the same functionality in just 188 lines of code (in both
cases including blank lines and comments).
RG started its life as an experiment with simpler data flows in geodetic
software. I believe it has succeeded in this respect. But I cannot yet provide
conclusive evidence that this difference between RG and PROJ also results in
faster execution. It is worth checking, though, and worth considering whether
it would be worth the effort to retrofit a similar data flow architecture into
PROJ. It would clearly be a herculean task.
How to interpret the numbers above?
-----------------------------------
First and foremost: as I stated up front, the evidence is weak, but it is also
unambiguous. And while being a far cry from conclusively answering the question
of whether PROJ is "unreasonably slow", at least it indicates that there are
ways to make PROJ faster. Whether this will be worth the effort is another
discussion.
That said, on to the interpretation.
The input file is 406 MB, and I ran the tests twice: Once with PROJ running
first, once with RG running first. This should reveal whether disk caching made
a difference. It doesn't seem to, however.
The full SWEREF transformation pipeline is evidently unreasonably slow, and
there is good evidence (the dramatic difference between sorted and random
input) that this is due to a grid access corner case. So PROJ is unreasonably
slow when presented with unreasonable input data sets.
Once the input is sorted, however, the PROJ timing clocks in at around 100 s,
no matter whether we do the full transformation, the 8 noops, or the single
UTM projection.
So PROJ is very sensitive to the spatial ordering of input coordinate tuples;
RG is not at all. Given the description above (band interleave vs. node/pixel
interleave, hand-held caching vs. leaving it to the OS), this is probably not
at all surprising.
But PROJ has the additional feature of being able to automagically download
missing grid files tile-wise, where RG is stuck with what the user has a priori
entered into the unigrid, or manually post-registered at run time.
In the present test case, the download-on-demand feature is (hopefully) not
used, since the file is already fully downloaded with projsync. But might it
influence the overall grid access speed? I have not looked into that part of
the code recently, but I'm sure Even will spot it if there are cheap gains to
reap here.
The I/O effect
--------------
Now, let's assume that the single-NOOP case mostly reflects the effort of
converting text based coordinates to the internal IEEE 754 binary format. I/O
is clearly a large part of the difference between kp and the (cct, cs2cs)
tuple: "anything" takes around 10 seconds for RG/kp, while "anything" takes
around 100 seconds for PROJ/(cct, cs2cs) - so there is at least some evidence
that string-to-double (and vice versa) are surprisingly heavyweight operations.
But cs2cs uses the platform native `strtod()` function for string-to-double
conversion, while cct uses `proj_strtod()` [8], which among other things allows
underscores as thousands separators (42_000). Both routines appear equally slow
compared to the Rust version used in kp.
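To illustrate the underscore feature, here is a minimal, hypothetical Rust equivalent. It simply strips the separators before delegating to the standard parser - not how proj_strtod actually works, but it shows the accepted syntax:

```rust
// Hypothetical sketch of underscore-tolerant float parsing, a la the
// 42_000 syntax accepted by proj_strtod(). Not the actual algorithm.
fn parse_with_underscores(s: &str) -> Result<f64, std::num::ParseFloatError> {
    // Drop the separators, then hand off to the standard library parser.
    let cleaned: String = s.chars().filter(|&c| c != '_').collect();
    cleaned.parse::<f64>()
}

fn main() {
    assert_eq!(parse_with_underscores("42_000").unwrap(), 42000.0);
    assert_eq!(parse_with_underscores("3.141_59").unwrap(), 3.14159);
    println!("ok");
}
```

Note that this naive version allocates a new String per call - the real question, as discussed below, is how fast the underlying string-to-double conversion itself can be made.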
Apparently it just so happens that the built-in Rust converter is much faster
than typical C/C++ implementations. This may very well be the case: Rust's
float parsing was dramatically improved by Alexander Huszagh some years ago
[9][10]. But it is unlikely that this alone could account for a 10 times
speed-up compared to C.
I do not trust myself to build a reliable C++ platform for timing the "real
functionality only" (i.e. ignoring the I/O overhead). I would, however, be
willing to provide a Rust version for intercomparison, if anyone would take up
the C++ task.
But fortunately the PROJ chairman, Kristian Evers, upon reading an early
version of this text, reminded me that the proj app supports binary I/O (and
actually that exact part of the PROJ source code was the target of my first
contribution to PROJ, way back in 1999 - so shame on me for not thinking of
this possibility).
Running the utm-projection case through proj (the app), with binary input,
speeds things up significantly, making PROJ almost as fast as RG, although
with only half the size of input and output, since proj is strictly 2D.
But switching to binary output as well makes it even faster: with binary input
and binary output, proj projects 10 million input points in just 3 seconds,
i.e. 300 ns/point. This is roughly 4 times as fast as kp, although also with
just half the amount of input and output, and no numeric conversion.
This indicates that the floating point-to-string output is an even heavier
load than the string-to-floating point input. This is perhaps not surprising,
although the widespread interest in optimizing the former is much more recent
than for the latter.
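The binary-vs-text difference can be sketched as follows. The little-endian f64 pair layout is an assumption for illustration, not a statement of proj's documented binary format:

```rust
// Sketch of binary vs. text round trips for a 2D coordinate. The
// little-endian f64 pair layout is an illustrative assumption.
use std::convert::TryInto;

// Binary: 2 x 8 bytes, no conversion beyond a memory copy.
fn encode_binary(coord: (f64, f64)) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&coord.0.to_le_bytes());
    buf[8..].copy_from_slice(&coord.1.to_le_bytes());
    buf
}

fn decode_binary(buf: &[u8; 16]) -> (f64, f64) {
    (
        f64::from_le_bytes(buf[..8].try_into().unwrap()),
        f64::from_le_bytes(buf[8..].try_into().unwrap()),
    )
}

// Text: every coordinate pays a format plus a parse - the heavyweight
// string converters discussed above.
fn text_round_trip(coord: (f64, f64)) -> (f64, f64) {
    let text = format!("{} {}", coord.0, coord.1);
    let mut parts = text.split_whitespace().map(|t| t.parse().unwrap());
    (parts.next().unwrap(), parts.next().unwrap())
}

fn main() {
    let coord = (12.0, 55.0);
    let buf = encode_binary(coord);
    assert_eq!(decode_binary(&buf), coord);
    assert_eq!(text_round_trip(coord), coord);
    println!("binary round trip: {} bytes", buf.len());
}
```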
But taking a look at some published benchmarks is encouraging: David Tolnay's
Rust based shootout [11] indicates that the very recent (November 2025) zmij
algorithm performs almost 8 times better than Rust's default floating
point-to-string implementation. Even wilder, when comparing with
system-supplied implementations: Victor Zverovich, the creator of the zmij
algorithm, in his own benchmarks [12] measures a 100 times (not 100%, 100
times!) speed-up compared to the system provided ostringstream implementation,
running on an Apple M1.
Hence, we may expect the PROJ command line filters (proj, cct, cs2cs) to speed
up significantly as system libraries mature and include faster floating
point-to-string and string-to-floating point operations... if that ever
happens.
Obviously, we could also decide to introduce dependencies on stand-alone
implementations, such as zmij. It is, however, questionable whether that would
be worth the effort: back in the 1980s, when Gerald Evenden created PROJ (the
system), it was to a very large degree in order to use proj (the app) to
handle projections for his map plotting system, MAPGEN, where much of the work
was implemented as Unix shell pipelines, hence constantly doing floating point
I/O. I conjecture that this is also the reason for proj's binary I/O
functionality: it may have sped things up significantly.
At that time in history, switching to some (then not yet available) fast
floating point I/O algorithms would have made much sense, since so much work
was done using shell pipelines. Today, we can safely assume that in most cases
PROJ is used as a linked library in a larger (GIS) system, where all
inter-library communication is binary.
When PROJ is used from the command line, it is (probably) mostly by specialists,
testing hypotheses, or checking a few reference-system-defining benchmarks. And
handling even tens of thousands of input points will take insignificant amounts
of time on a reasonably modern computer.
But I/O still takes some time: the recently launched "rewrite GDAL in Rust"
initiative, OxiGDAL [13], uses proj4rs [14] for its coordinate handling
(proj4rs is a Rust implementation of proj4js, which in turn is a JavaScript
reimplementation of PROJ.4). And OxiGDAL claims a handling time of 100
ns/coordinate tuple. Comparing this to the 300 ns from proj (the app) above
leads to the not-terribly-unreasonable conjecture that proj (the app) spends
one third of its time reading, one third computing, and the last third writing
the result.
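The conjecture is just back-of-the-envelope arithmetic:

```rust
// 10 million points in 3 s is 300 ns/point; subtracting the ~100 ns/point
// pure-computation figure claimed for proj4rs leaves ~200 ns for I/O,
// plausibly split between reading and writing.
fn ns_per_point(total_s: f64, points: f64) -> f64 {
    total_s * 1e9 / points
}

fn main() {
    let total = ns_per_point(3.0, 10.0e6);
    assert_eq!(total, 300.0);
    let io = total - 100.0; // the proj4rs computation estimate
    println!("computation: 100 ns, i/o: {io} ns per point");
}
```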
Hence, I would expect us to find that the general functionality is comparable
in speed between RG and PROJ (and proj4rs), while there are probably some
modest gains to realize in PROJ's handling of grids. So to answer my initial
question: no - PROJ is not unreasonably slow at the library level, although it
sure can be sped up.
But at the application level, there should be quite a bit of gain possible in
the floating point parsing. Whether or not we should take on this task is an
open question: although I wrote proj_strtod, I would not trust myself to do a
reliable C++ port of Alexander Huszagh's work from Rust. But at the other end
of the I/O pipeline, the original version of the super fast zmij output
algorithm is already written in C++, under an MIT licence, and hence
unproblematic to use in the PROJ code base.
I would, however, highly prefer this kind of code to reside in system
libraries, not in an application library like PROJ.
Nevertheless: I hope y'all will consider this (much too) long writeup, and
give some deep thought to whether, and to what extent, rearchitecting PROJ may
be worth the effort.
/Thomas Knudsen
[1] Rust Geodesy: https://lib.rs/geodesy
https://github.com/busstoptaktik/geodesy
[2] kp: https://github.com/busstoptaktik/geodesy/blob/main/ruminations/003-rumination.md
[3] cs2cs: https://proj.org/en/stable/apps/cs2cs.html
[4] cct: https://proj.org/en/stable/apps/cct.html
[5] The ITRF2014->SWEREF99 transformation:
$ projinfo -o proj --hide-ballpark -s itrf2014 -t sweref99
+proj=pipeline
  +step +proj=axisswap +order=2,1
  +step +proj=unitconvert +xy_in=deg +xy_out=rad
  +step +proj=cart +ellps=GRS80
  +step +proj=helmert +x=0 +y=0 +z=0 +rx=0.001785 +ry=0.011151 +rz=-0.01617
        +s=0 +dx=0 +dy=0 +dz=0 +drx=8.5e-05 +dry=0.000531 +drz=-0.00077 +ds=0
        +t_epoch=2010 +convention=position_vector
  +step +inv +proj=deformation +t_epoch=2000 +grids=eur_nkg_nkgrf17vel.tif
        +ellps=GRS80
  +step +proj=helmert +x=0.03054 +y=0.04606 +z=-0.07944 +rx=0.00141958
        +ry=0.00015132 +rz=0.00150337 +s=0.003002 +convention=position_vector
  +step +proj=deformation +dt=-0.5 +grids=eur_nkg_nkgrf17vel.tif +ellps=GRS80
  +step +inv +proj=cart +ellps=GRS80
  +step +proj=unitconvert +xy_in=rad +xy_out=deg
  +step +proj=axisswap +order=2,1
[6] Rumination 012: Unigrids and the UG grid maintenance utility
https://github.com/busstoptaktik/geodesy/blob/main/ruminations/012-rumination.md
[7] Even Rouault on lam0:
https://github.com/OSGeo/PROJ/pull/4667/changes#diff-bfb0c333155a0c8bf863b0a3e76df46cfddf646cd5f13d6313eb8a3cb123f5f1R58
[8] proj_strtod():
https://github.com/OSGeo/PROJ/blob/master/src/apps/proj_strtod.cpp
[9] Update Rust Float-Parsing Algorithms to use the Eisel-Lemire algorithm
https://github.com/rust-lang/rust/pull/86761
[10] Implementing a Fast, Correct Float Parser
https://internals.rust-lang.org/t/implementing-a-fast-correct-float-parser/14670
[11] David Tolnay's dtoa-benchmark: https://github.com/dtolnay/dtoa-benchmark
[12] Victor Zverovich's zmij algorithm: https://github.com/vitaut/zmij/
[13] OxiGDAL - Pure Rust Geospatial Data Abstraction Library:
https://github.com/cool-japan/oxigdal
[14] proj4rs - Rust adaptation of PROJ.4: https://crates.io/crates/proj4rs