[Benchmarking] General rules for handling benchmarking

Jeff McKenna jmckenna at gatewaygeomatics.com
Sun Sep 27 09:30:12 EDT 2009


Hi Andrea,

Thank you for summarizing our IRC discussions.  I have a few comments 
inline below, mostly in the "Evolution" section, plus a couple of rough 
scripting sketches:

Andrea Aime wrote:
> Hi,
> today on IRC we had a talk about how to handle a few details
> of the benchmarking process (me, Frank and Jeff). We would
> like to present the results of the discussion to the rest of
> the community so that we have a known set of rules about
> how to do the benchmarks.
> 
> Running the benchmarks
> ---------------------------
> 
> The benchmarks are usually classified as cold or hot.
> Cold benchmarks represent a completely uncached filesystem and
> applications that have just started; they represent the lowest
> possible performance of the system.
> Hot benchmarks represent a situation in which the filesystem and the
> application have managed to fill their caches, and thus represent the
> highest possible performance of the system.
> 
> Neither of them is realistic: a long-running system will have
> its filesystem and application-level caches full of something,
> yet not necessarily something useful for serving the next incoming
> request.
> In the specific case of a Java-based app, a fully cold benchmark
> does not make sense either, since the JVM takes some time to
> figure out which parts of the bytecode are hot spots and
> compile them to native code on the fly. The average state
> of a Java app server already has the hot spots fully compiled
> into native code; the fully cold case is typically 2-3 times
> slower than that.
> 
> Given that repeating cold benchmarks is hard and, in the case of
> Java-based apps, not representative, we suggest running
> hot ones, also because it's easy: one runs the same benchmarking
> suite repeatedly until the results level off.
> Typically this requires 2-4 runs.
> Since this will give an advantage to systems that perform
> data or resource caching,
> either the benchmarks should be set up so that no caching
> is possible (the main reason why we don't repeat the
> same request 1500 times in a row) or the presentation will list
> what kind of caching each system is doing.
> If a configuration option is available to turn off data/resource
> caching, it would be interesting to show results for the system
> in both setups.
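
Just to illustrate "run the suite repeatedly until the results level 
off": something like the rough Python sketch below is what I had in 
mind for automating it.  The run_suite() callable and the 2% tolerance 
are made-up placeholders (run_suite would be whatever launches the 
actual benchmark suite and returns the measured throughput), so treat 
this as a sketch, not part of the agreed rules:

# Rough sketch only: rerun the suite until throughput stabilizes.
# run_suite is a hypothetical callable that launches the actual
# benchmark against the server under test and returns requests/second.
def run_until_stable(run_suite, max_runs=6, tolerance=0.02):
    previous = None
    for run in range(1, max_runs + 1):
        throughput = run_suite()
        print("run %d: %.1f req/s" % (run, throughput))
        if previous and abs(throughput - previous) / previous < tolerance:
            return throughput  # leveled off: report this run
        previous = throughput
    return previous  # never leveled off; report the last run anyway
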
> 
> Synchronizing activities
> ---------------------------------------
> 
> When someone is running a benchmark, others should refrain from
> performing any other activity on goliath: that person will state
> on the #foss4g channel that they are about to run the benchmarks,
> so that others know it's time to stop the presses.
> Keeping an eye on "top" could also prove beneficial (so
> that one can see that only the system under test is actually
> generating load).
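
On the "keeping an eye on top" point: a tiny guard script could do the 
same check automatically before a run.  The sketch below only looks at 
the 1-minute load average (Unix-only, via Python's os.getloadavg()), 
and the 0.5 threshold is an arbitrary number I made up:

# Rough sketch only: refuse to start if goliath already has load on it.
# The 0.5 threshold is an arbitrary assumption, not an agreed rule.
import os
import sys

def assert_goliath_idle(max_load=0.5):
    load_1min = os.getloadavg()[0]  # Unix-only
    if load_1min > max_load:
        sys.exit("load average is %.2f - another run in progress?" % load_1min)

if __name__ == "__main__":
    assert_goliath_idle()
    print("goliath looks idle, ok to start the run")
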
> 
> Given that we're running hot benchmarks, we may want to shut down
> the servers not under test to free memory for the filesystem cache.
> It would be nice, anyway, to have documented ways to start
> and shut down each system, so that we can also check each
> other's servers on occasion (running a suite, or just comparing
> the output of the same call on the different servers).
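
About comparing the output of the same call on the different servers: 
a quick cross-check could be scripted roughly as below.  The endpoints, 
layer name and GetMap parameters are placeholders, not our real setup; 
the idea is just to save each server's response so the images can be 
compared visually or by size:

import urllib.request

# Placeholder endpoints - substitute the real hosts/ports under test.
SERVERS = {
    "server_a": "http://localhost:8080/wms",
    "server_b": "http://localhost:8081/wms",
}

# One WMS 1.1.1 GetMap request; layer, bbox and size are placeholders.
PARAMS = ("SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&LAYERS=test_layer"
          "&STYLES=&SRS=EPSG:4326&BBOX=-180,-90,180,90"
          "&WIDTH=800&HEIGHT=400&FORMAT=image/png")

for name, base_url in SERVERS.items():
    data = urllib.request.urlopen(base_url + "?" + PARAMS).read()
    with open(name + ".png", "wb") as out:
        out.write(data)
    print("%s returned %d bytes" % (name, len(data)))
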
> 
> Publishing results
> ----------------------------------------
> 
> The test results will be published on a wiki page.
> Everybody will be able to see them, but we kindly ask
> people not to advertise them on blogs, tweets and the like
> until the presentation is given.
> 
> Evolution (and when that ends)
> -----------------------------------------
> 
> This is a personal addition not discussed on IRC.
> 
> The results will change over time as we get closer
> to the presentation due to various details, such as
> upgrading libraries, tuning the runtime, and, for the
> unstable version of the products, fine tuning of
> the code (remember, we're testing a "stable" version
> and a "beta" version of each server).

I think we need to clear up which versions we are testing - yesterday on 
IRC there were differing opinions about which versions of the software 
are considered "stable" and which are considered "beta".  In the 
MapServer case, version 5.6.0-beta1 was released last week, so do we 
consider this the "stable" version to test, or does it belong in the 
opposite "beta" test class?  In my mind, the latest "stable" MapServer 
release is 5.4.2 (and 5.6.0 would be the next "stable" version).

Maybe we can clear this issue up right now and update the "Rules of 
Engagement" section on the wiki with the correct versions to test.  Thanks.

> 
> The wiki page will be updated as we go. I was wondering
> if we want to keep a history of how we got from
> the first results to the final ones?

Great idea - a history would show how we all improved our software, and 
show how important this benchmarking exercise is.

> 
> Also, what deadline do we set for the results to
> be "final" and unmodifiable?
> 5 minutes before the presentation? The day before?
> The day before FOSS4G starts?
> 

I erased my response that said "one week before the event starts" 
because I counted the weeks left before the event...there's not much 
time left.  So in that case maybe we go with the day before FOSS4G 
begins, which would be Oct 19th.  Thoughts?

BTW we can use the SVN folder /benchmarking/docs/ to share the 
presentation slides.

-jeff