[gdal-dev] Performance regression testing/benchmarking for CI

Laurențiu Nicola lnicola at dend.ro
Tue Oct 10 22:53:10 PDT 2023


Hi,

No experience with pytest-benchmark, but I maintain an unrelated project that runs some benchmarks on CI; here are a few things worth mentioning:

 - we store the results as a newline-delimited JSON file in a different GitHub repository (https://raw.githubusercontent.com/rust-analyzer/metrics/master/metrics.json, warning: it's a 5.5 MB unformatted JSON file); a rough sketch of the record format follows after this list
 - we have an in-browser dashboard that retrieves the whole file and displays the results: https://rust-analyzer.github.io/metrics/
 - we do track build time and overall run time, but we're more interested in correctness
 - the display is a bit of a mess (partly because we try to keep the setup as simple as possible), but you can look at the "total time", "total memory" and "build" entries to get an idea
 - we store the runner CPU type and memory in that JSON; the runners are almost all Intel, but they do get upgraded from time to time
 - we even have two AMD EPYC runs; note that boost is disabled in a different way on those CPUs (we don't try to disable it, though)
 - we also try to measure the CPU instruction count (via the perf counter), but that doesn't work on GitHub-hosted runners and probably not in most VMs
 - the runners have been very reliable, but not really consistent in performance
 - a bigger problem for us has been that somebody actually needs to look at the dashboard to spot regressions and investigate them (some are caused by external changes)
 - in 3-5 years we'll probably have to trim down the JSON or switch to a different storage backend
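For context, each line of that file is a self-contained JSON record. A minimal sketch of what appending one run could look like (the field names and values below are illustrative, not the exact schema we use):

    import json
    import platform

    # Illustrative record layout; the real metrics.json schema differs.
    record = {
        "revision": "deadbeef",            # commit that was measured
        "cpu": platform.processor(),       # runner CPU type
        "memory_gb": 16,                   # runner memory
        "metrics": {
            "build_time_s": 312.4,
            "total_time_s": 97.1,
            "total_memory_mb": 1450,
        },
    }

    # Newline-delimited JSON: one record per line, appended after each CI run,
    # so the dashboard can fetch the whole file and parse it line by line.
    with open("metrics.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")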

Laurentiu

On Tue, Oct 10, 2023, at 21:08, Even Rouault via gdal-dev wrote:
> Hi,
>
> I'm experimenting with adding performance regression testing to our CI. 
> Currently our CI has quite extensive functional coverage, but it totally 
> lacks performance testing. Given that we use pytest, I've spotted 
> pytest-benchmark (https://pytest-benchmark.readthedocs.io/en/latest/) as 
> a likely good candidate framework.
>
> I've prototyped things in https://github.com/OSGeo/gdal/pull/8538
>
> Basically, we now have an autotest/benchmark directory where performance 
> tests can be written.
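>
> For illustration, a test there could look roughly like this (the driver, 
> raster size and file names are placeholders, not an actual test from the 
> PR; benchmark is the fixture provided by pytest-benchmark):
>
>     # autotest/benchmark/test_example.py -- illustrative only
>     from osgeo import gdal
>
>     def test_gtiff_read(benchmark, tmp_path):
>         # Create a small synthetic raster so the test is self-contained.
>         filename = str(tmp_path / "in.tif")
>         ds = gdal.GetDriverByName("GTiff").Create(filename, 2048, 2048)
>         ds = None  # close and flush to disk
>
>         def read_raster():
>             gdal.Open(filename).GetRasterBand(1).ReadRaster()
>
>         # pytest-benchmark calls the function repeatedly and records
>         # timing statistics (min/mean/stddev/...).
>         benchmark(read_raster)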
>
> Then, in the CI, we check out a reference commit, build it and run the 
> performance test suite in --benchmark-save mode.
>
> And then we run the performance test suite on the PR in 
> --benchmark-compare mode with a --benchmark-compare-fail="mean:5%" 
> criterion (which means that a test fails if its mean runtime is more 
> than 5% slower than the reference one).
>
> From what I can see, pytest-benchmark behaves correctly if tests are 
> removed or added (that is, it does not fail, it just skips them during 
> the comparison). The only thing one should not do is modify an existing 
> test with respect to the reference branch.
>
> Does anyone have practical experience with pytest-benchmark, in 
> particular in CI setups? With virtualization, it is hard to guarantee 
> that other things happening on the host running the VM won't interfere. 
> Even locally, on my own machine, I initially saw strong variations in 
> timings, which could be reduced to an acceptable deviation by disabling 
> the Intel Turbo Boost feature (echo 1 | sudo tee 
> /sys/devices/system/cpu/intel_pstate/no_turbo).
>
> Even
>
> -- 
> http://www.spatialys.com
> My software is free, but my time generally not.
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

