[gdal-dev] Performance regression testing/benchmarking for CI

Daniel Evans daniel.fred.evans at gmail.com
Tue Oct 10 12:08:31 PDT 2023


Hi Even,

> With virtualization, it is hard to guarantee that other things happening
> on the host running the VM might not interfere. Even locally on my own
> machine, I initially saw strong variations in timings

The advice I've come across for benchmarking is to use the minimum time
from the set of runs as the comparison statistic, rather than the mean,
maximum, etc. The minimum is the most robust estimate of the "real"
runtime: every run is slowed by some amount due to external load on the
system, so the minimum time comes from the run with the least external
load (assuming you're not hitting warm-up/burn-in effects on the first
few runs).
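
As a toy illustration (entirely made-up numbers, not GDAL measurements),
here is a small Python sketch of why the minimum behaves better than the
mean when the only noise source is extra load, which can never make a run
faster:

import random

random.seed(0)
TRUE_RUNTIME = 1.00  # the cost we actually want to measure, in seconds


def simulate_runs(n_runs, max_load_overhead):
    # Each run pays the true runtime plus a non-negative, load-dependent overhead.
    return [TRUE_RUNTIME + random.uniform(0.0, max_load_overhead)
            for _ in range(n_runs)]


for label, runs in [("quiet host", simulate_runs(30, 0.05)),
                    ("busy host ", simulate_runs(30, 0.50))]:
    print(f"{label}: min={min(runs):.3f}s  mean={sum(runs) / len(runs):.3f}s")

# The mean drifts upward with external load; the min stays close to 1.00s.

If I remember the docs correctly, --benchmark-compare-fail also accepts min
as the statistic (e.g. --benchmark-compare-fail="min:5%"), so switching the
CI criterion over should be a one-flag change.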

It's been a while since I used pytest-benchmark, but I think I remember
needing to make sure that benchmark times from one machine/hardware
type/OS were never compared against those from another. Similarly, this
means a developer can't make a change and then compare their locally
measured runtime to a previously recorded CI runtime - the two simply
aren't comparable. Perhaps not a surprise to you, but I highlight it in
case PRs with incorrect claims of speedups start appearing.
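
If you want a belt-and-braces guard in the CI, something along these lines
could refuse a comparison when the two saved result files weren't produced
on the same machine. This is going from memory of the saved-JSON layout
(a top-level "machine_info" block), so treat the script name and the exact
field names as placeholders to verify against a real --benchmark-save file:

import json
import sys


def machine_fingerprint(path):
    # Assumed layout: the JSON written by --benchmark-save carries a
    # top-level "machine_info" block describing the host that produced it.
    with open(path) as f:
        info = json.load(f).get("machine_info", {})
    # Field names are guesses; adjust to whatever the file actually contains.
    return tuple(info.get(key)
                 for key in ("node", "machine", "processor", "python_version"))


ref_file, new_file = sys.argv[1], sys.argv[2]
if machine_fingerprint(ref_file) != machine_fingerprint(new_file):
    sys.exit("Refusing to compare benchmark runs from different machines.")
print("machine_info matches; the comparison should be meaningful.")

(Invoked as, say, "python check_same_machine.py ref.json pr.json" - both the
script name and the chosen fields are just illustrative.)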

Cheers,
Daniel

On Tue, 10 Oct 2023 at 19:09, Even Rouault via gdal-dev <
gdal-dev at lists.osgeo.org> wrote:

> Hi,
>
> I'm experimenting with adding performance regression testing in our CI.
> Currently our CI has quite extensive functional coverage, but totally
> lacks performance testing. Given that we use pytest, I've spotted
> pytest-benchmark (https://pytest-benchmark.readthedocs.io/en/latest/) as
> a likely good candidate framework.
>
> I've prototyped things in https://github.com/OSGeo/gdal/pull/8538
>
> Basically, we now have an autotest/benchmark directory where performance
> tests can be written.
>
> Then in the CI, we check out a reference commit, build it, and run the
> performance test suite in --benchmark-save mode.
>
> And then we run the performance test suite on the PR in
> --benchmark-compare mode with a --benchmark-compare-fail="mean:5%"
> criterion (which means that a test fails if its mean runtime is more
> than 5% slower than the reference one).
>
> From what I can see, pytest-benchmark behaves correctly if tests are
> removed or added (that is, it does not fail, it just skips them during
> the comparison). The only thing one should not do is modify an existing
> test w.r.t. the reference branch.
>
> Does someone have practical experience with pytest-benchmark, in
> particular in CI setups? With virtualization, it is hard to guarantee
> that other things happening on the host running the VM might not
> interfere. Even locally on my own machine, I initially saw strong
> variations in timings, which can be reduced to an acceptable deviation
> by disabling the Intel Turbo Boost feature (echo 1 | sudo tee
> /sys/devices/system/cpu/intel_pstate/no_turbo).
>
> Even
>
> --
> http://www.spatialys.com
> My software is free, but my time generally not.
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>