[gdal-dev] Performance regression testing/benchmarking for CI
Even Rouault
even.rouault at spatialys.com
Sun Oct 15 05:00:26 PDT 2023
On 15/10/2023 at 13:34, Javier Jimenez Shaw via gdal-dev wrote:
> Hi Even. Thanks, it sounds good.
> However, I see a potential problem: you use "SetCacheMax" once. We
> should not forget about that in the future for tests that are
> sensitive to it. GDAL's cache is usually a percentage of the total
> memory, which may vary across environments and over time.
Javier,
What is certain is that the timings obtained in one session of the perf
tests in CI are comparable to nothing other than timings obtained in the
same session (and even that is already challenging!). So the amount of
RAM available on the CI worker might affect the speed of the tests, but
it will affect the reference run and the tested run in the same way (as
long as the GDAL_CACHEMAX=5% setting remains the same and the general
behaviour of the block cache remains similar). I anticipate that at some
point changes in GDAL might make the perf test suite no longer
comparable to the current reference version, and that we will have to
upgrade the commit of the reference version when that happens. Actually,
if the perf test suite is extended, it might be useful to upgrade the
commit of the reference version whenever a feature release is published.
For example, when GDAL 3.8.0 is released, it would become the reference
point for 3.9.0 development, and so on (otherwise we wouldn't get perf
regression testing for newly added tests). The downside is that this
wouldn't catch progressive slowdowns spread over several release cycles.
But given that I had to raise the failure threshold to a > 30%
regression to avoid false positives, the perf test suite (at least when
run in CI with all its unpredictability) can only catch major "instant"
regressions.
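
To address the SetCacheMax concern in a systematic way, a fixture can
pin the block cache to an absolute size, so that timings do not depend
on the total RAM of the worker. A minimal sketch (illustrative only,
not the actual code of the PR):

    import pytest
    from osgeo import gdal

    @pytest.fixture()
    def fixed_gdal_cache():
        # Pin the block cache to an absolute size instead of a
        # percentage of total RAM, so that timings stay comparable
        # across workers with different amounts of memory.
        old_max = gdal.GetCacheMax()
        gdal.SetCacheMax(256 * 1024 * 1024)  # 256 MB
        yield
        gdal.SetCacheMax(old_max)

Any benchmark test whose timing is sensitive to the cache size can then
simply request this fixture.
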
Even
>
> On Wed, 11 Oct 2023, 07:53 Laurențiu Nicola via gdal-dev,
> <gdal-dev at lists.osgeo.org> wrote:
>
> Hi,
>
> No experience with pytest-benchmark, but I maintain an unrelated
> project that runs some benchmarks on CI, and here are some things
> worth mentioning:
>
> - we store the results as a newline-delimited JSON file in a
> separate GitHub repository
> (https://raw.githubusercontent.com/rust-analyzer/metrics/master/metrics.json,
> warning: it's a 5.5 MB unformatted JSON file); see the sketch after
> this list
> - we have an in-browser dashboard that retrieves the whole file
> and displays it: https://rust-analyzer.github.io/metrics/
> - we do track build time and overall run time, but we're more
> interested in correctness
> - the display is a bit of a mess (partly due to trying to keep
> the setup as simple as possible), but you can look for the "total
> time", "total memory" and "build" to get an idea
> - we store the runner CPU type and memory in that JSON; they're
> almost all Intel, but they do upgrade from time to time
> - we even have two AMD EPYC runs; note that boost is disabled in
> a different way there (we don't try to disable it, though)
> - we also try to measure the CPU instruction count (the perf
> counter), but it doesn't work on GitHub and probably not in most VMs
> - the runners have been very reliable, but not really consistent
> in performance
> - a bigger problem for us is that somebody actually needs to
> look at the dashboard to spot any regressions and investigate them
> (some are caused by external changes)
> - in 3-5 years we'll probably have to trim down the JSON or
> switch to a different storage scheme
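>
> As an illustration of that storage scheme (the field names are
> guessed, not the real metrics.json schema), each CI run appends one
> self-contained JSON object per line, e.g. in Python:
>
>     import json
>     import platform
>
>     record = {
>         "revision": "abcdef0",        # commit being measured (hypothetical)
>         "cpu": platform.processor(),  # runner CPU type
>         "memory_gb": 16,              # runner memory
>         "metrics": {"total time": 123.4, "build": 56.7},
>     }
>     # Newline-delimited JSON: appending never rewrites old records,
>     # and readers can stream the file line by line.
>     with open("metrics.json", "a") as f:
>         f.write(json.dumps(record) + "\n")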
>
> Laurentiu
>
> On Tue, Oct 10, 2023, at 21:08, Even Rouault via gdal-dev wrote:
> > Hi,
> >
> > I'm experimenting with adding performance regression testing in our
> > CI. Currently our CI has quite extensive functional coverage, but
> > totally lacks performance testing. Given that we use pytest, I've
> > spotted pytest-benchmark
> > (https://pytest-benchmark.readthedocs.io/en/latest/) as a likely
> > good candidate framework.
> >
> > I've prototyped things in https://github.com/OSGeo/gdal/pull/8538
> >
> > Basically, we now have an autotest/benchmark directory where
> > performance tests can be written.
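> >
> > A test there is an ordinary pytest function using the "benchmark"
> > fixture provided by pytest-benchmark; a minimal sketch (the file
> > path and test name are illustrative, not taken from the PR):
> >
> >     from osgeo import gdal
> >
> >     def test_gtiff_checksum(benchmark):
> >         def read():
> >             ds = gdal.Open("data/byte.tif")
> >             ds.GetRasterBand(1).Checksum()
> >         benchmark(read)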
> >
> > Then in the CI, we check out a reference commit, build it, and run
> > the performance test suite in --benchmark-save mode.
> >
> > And then we run the performance test suite on the PR in
> > --benchmark-compare mode with a --benchmark-compare-fail="mean:5%"
> > criterion (which means that a test fails if its mean runtime is
> > more than 5% slower than the reference one).
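> >
> > In shell terms, the two steps look roughly like this (the saved-run
> > name and test directory are illustrative):
> >
> >     # on the reference commit
> >     pytest autotest/benchmark --benchmark-save=ref
> >     # on the PR branch, comparing against the saved run
> >     pytest autotest/benchmark --benchmark-compare \
> >         --benchmark-compare-fail="mean:5%"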
> >
> > From what I can see, pytest-benchmark behaves correctly if tests
> > are removed or added (that is, it does not fail, it just skips them
> > during the comparison). The only thing one should not do is modify
> > an existing test w.r.t. the reference branch.
> >
> > Does someone have practical experience with pytest-benchmark, in
> > particular in CI setups? With virtualization, it is hard to
> > guarantee that other things happening on the host running the VM
> > do not interfere. Even locally on my own machine, I initially saw
> > strong variations in timings, which could be reduced to an
> > acceptable deviation by disabling the Intel Turbo Boost feature
> > (echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo).
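> >
> > Another knob that can help with noisy timings is pytest-benchmark's
> > pedantic mode, which gives explicit control over rounds and warmup
> > (a sketch with arbitrary numbers, not something the PR currently
> > does):
> >
> >     import time
> >
> >     def some_operation():
> >         time.sleep(0.001)  # stand-in for the code under test
> >
> >     def test_noisy_operation(benchmark):
> >         # more rounds stabilize the mean; warmup rounds discard
> >         # cold-cache outliers
> >         benchmark.pedantic(some_operation, rounds=20, warmup_rounds=5)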
> >
> > Even
> >
> > --
> > http://www.spatialys.com
> > My software is free, but my time generally not.
> >
--
http://www.spatialys.com
My software is free, but my time generally not.