[Benchmarking] General rules for handling benchmarking

Andrea Aime aaime at opengeo.org
Sat Sep 26 13:29:22 EDT 2009


Hi,
today on IRC we had a talk (me, Frank and Jeff) about how to handle
a few details of the benchmarking process. We would like to present
the results of that discussion to the rest of the community so that
we have an agreed set of rules on how to run the benchmarks.

Running the benchmarks
---------------------------

Benchmarks are usually classified as either cold or hot.
Cold benchmarks represent a completely un-cached file system and
applications that have just been started; they show the lowest
possible performance of the system.
Hot benchmarks represent a situation in which the file system and the
application have managed to fill their caches, and thus show the
highest possible performance of the system.

Neither of them is realistic: a long-running system will have
its file system and application level caches full of something,
yet not necessarily something useful for serving the next incoming
request.
In the specific case of a Java based app a fully cold benchmark
does not make much sense either, since the JVM takes some time to
figure out which parts of the bytecode are hot spots and
compile them to native code on the fly. The average state
of a Java app server already has the hot spots fully compiled
into native code; the fully cold case is typically 2-3 times
slower than that.

Given that repeating cold benchmarks is hard and, in the case of
Java based apps, not representative, we suggest running
hot ones, also because it's easy: one runs the same benchmarking
suite repeatedly until the results level off.
Typically this requires 2-4 runs.
Since this gives an advantage to systems that do perform
data or resource caching,
either the benchmarks should be set up so that no caching
is possible (the main reason why we don't repeat the
same request 1500 times in a row), or the presentation should list
what kind of caching each system is doing.
If a configuration option is available to turn off data/resource
caching, it would be interesting to show the results of the system
in both setups.
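
To make the warm-up procedure concrete, here is a minimal Python
sketch of the "repeat until the results level off" loop. The server
URL, layer name and bounding-box ranges are made-up placeholders
(the real suites use their own request mix), but the idea is the
same: randomize requests so that no two hit the same cached data,
and stop once run-to-run throughput stabilizes.

  # Sketch only: hypothetical endpoint and layer, not the agreed tooling.
  import random
  import time
  import urllib.request

  BASE = "http://localhost:8080/wms"   # hypothetical server under test
  LAYER = "topp:states"                # hypothetical layer name

  def random_bbox():
      # pick a random 2x2 degree window so consecutive requests differ
      minx = random.uniform(-120, -80)
      miny = random.uniform(30, 45)
      return f"{minx},{miny},{minx + 2},{miny + 2}"

  def run_once(requests=100):
      start = time.time()
      for _ in range(requests):
          url = (f"{BASE}?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap"
                 f"&LAYERS={LAYER}&STYLES=&SRS=EPSG:4326"
                 f"&BBOX={random_bbox()}"
                 f"&WIDTH=256&HEIGHT=256&FORMAT=image/png")
          urllib.request.urlopen(url).read()
      return requests / (time.time() - start)   # throughput in req/s

  previous = 0.0
  for run in range(1, 10):
      throughput = run_once()
      print(f"run {run}: {throughput:.1f} req/s")
      # stop when the run-to-run change drops below 5%: the system is "hot"
      if previous and abs(throughput - previous) / previous < 0.05:
          break
      previous = throughput

The 5% threshold and the number of requests per run are arbitrary
knobs; what matters is that the same suite is repeated unchanged
between runs.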

Synchronizing activities
---------------------------------------

When someone is running a benchmark, others should refrain from
performing any other activity on goliath: whoever is about to run
the benchmarks will say so on the #foss4g channel, so that the
others know it's time to stop the presses.
Keeping an eye on "top" can also be beneficial (so
that one can check that only the system under test is actually
generating load).
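
For those who prefer a scriptable check over staring at "top", a
rough Python sketch using the third-party psutil library could look
like the following; the 5% CPU threshold is an arbitrary choice, and
plain "top" or "ps" does the same job.

  # Sketch only: list processes using noticeable CPU on the test box.
  import time
  import psutil

  # prime the per-process CPU counters, then sample after a short wait
  for p in psutil.process_iter():
      try:
          p.cpu_percent(None)
      except (psutil.NoSuchProcess, psutil.AccessDenied):
          pass
  time.sleep(2)

  busy = []
  for p in psutil.process_iter():
      try:
          pct = p.cpu_percent(None)
          if pct > 5:
              busy.append((pct, p.pid, p.name()))
      except (psutil.NoSuchProcess, psutil.AccessDenied):
          pass

  for pct, pid, name in sorted(busy, reverse=True):
      print(f"{pct:6.1f}%  {pid:6d}  {name}")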

Given that we're running hot benchmarks, we may want to shut down
the servers not under test to free memory for the file system cache.
It would be nice, in any case, to have documented ways to start
and shut down each system, so that we can also check each
other's servers on occasion (running a suite, or just comparing
the output of the same call on the different servers).
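
As a sketch of the "compare the output of the same call" spot check,
something like the following could fetch one identical GetMap request
from each server and report size and checksum. The endpoint URLs and
the request parameters are hypothetical, and since different engines
rarely produce byte-identical images this is only a rough sanity
check, not a correctness test.

  # Sketch only: hypothetical endpoints, fixed request, side-by-side stats.
  import hashlib
  import urllib.request

  REQUEST = ("?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&LAYERS=topp:states"
             "&STYLES=&SRS=EPSG:4326&BBOX=-100,35,-98,37"
             "&WIDTH=512&HEIGHT=512&FORMAT=image/png")

  SERVERS = {
      "geoserver": "http://localhost:8080/geoserver/wms",   # hypothetical
      "mapserver": "http://localhost/cgi-bin/mapserv",      # hypothetical
  }

  for name, base in SERVERS.items():
      data = urllib.request.urlopen(base + REQUEST).read()
      digest = hashlib.md5(data).hexdigest()
      print(f"{name:10s}  {len(data):8d} bytes  md5 {digest}")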

Publishing results
----------------------------------------

The test results will be published on a wiki page.
Everybody will be able to see them, but we kindly ask
people not to advertise them on blogs, tweets and the like
until the presentation has been given.

Evolution (and when that ends)
-----------------------------------------

This is a personal addition not discussed on IRC.

The results will change over time as we get closer
to the presentation due to various details, such as
upgrading libraries, tuning the runtime, and, for the
unstable version of the products, fine tuning of
the code (remember, we're testing a "stable" version
and a "beta" version of each server).

The wiki page will be updated accordingly. I was wondering
whether we want to keep a history of how we got from
the first results to the final ones.

Also, what deadline do we set for the results to
become "final" and no longer modifiable?
5 minutes before the presentation? The day before?
The day before FOSS4G starts?

Cheers
Andrea




-- 
Andrea Aime
OpenGeo - http://opengeo.org
Expert service straight from the developers.

