[Benchmarking] meeting summary

Adrian Custer adrian.custer at geomatys.fr
Thu Jun 3 13:16:13 EDT 2010


Hello all,

Yesterday's meeting was problematic at many levels. 


From my perspective, the pacing was way too aggressive, trying to reach
'rules of engagement' without first agreeing on what the benchmarking
effort is trying to do. So while Jeff writes in his summary:

        All agreed to this year's "Rules of Engagement"

    --http://wiki.osgeo.org/wiki/Benchmarking_2010#Previous_IRC_meeting 

my feeling is that we, at our end at least, could not have agreed to
anything since we do not understand the background presuppositions in
play.

For example, there seems to be a notion that it makes sense to have a
        "'baseline' test with the data in its raw format".
This apparently means that servers should be constrained in some way to
use shapefiles directly, possibly forcing servers to read from the file
for every request or something similar. I don't understand the exact
constraint nor why we are mucking around at this level of detail.
Working on the WMS and other standards at the OGC has trained me
actively to avoid such dictates so it is hard for me to think in this
way. Since most WMS servers allow users to use data in shapefile format,
I am puzzled why the agreement would not instead be for all servers to
use the shapefile data in the same way that they expect their users to
use shapefile data by default. A server which forced all its users to
put their shapefile data into a database would be excluded by the rule
above, so it seems that such a rule, because it is not generally
applicable, probably does not make sense.

Rather than working by constraining how a server acts, beyond requiring
that it essentially follow its default behaviour (i.e. "Don't game the
test"), I was expecting to start with a discussion of what the server
would be expected to do, i.e. with the testing regime. Then we could
work backwards to figure out what kind of data would be necessary to
expose the strengths and weaknesses of different approaches.


There are many questions related to establishing what will actually be
tested by the benchmarks.

      * How will correctness be handled?
                If a server returns bad images, do we simply drop it for
                that test?
                
      * Will benchmarks be run against WMS 1.0 or 1.3?

      * To what extent will the benchmark test CRSs?
                Since reprojection is potentially one of the more costly
                operations, to what extent would it be tested?
                
      * Will the benchmarks test SLD support?
                This is another potentially costly operation (a sketch of
                the kind of request variants involved follows this list).
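
To make the CRS and SLD questions concrete, here is a rough Python sketch
of the kind of GetMap request matrix the benchmark would have to pin down
before 'rules of engagement' make much sense. The endpoint, layer name,
and SLD URL below are invented, and WMS 1.3.0 is assumed purely for
illustration:

# Hypothetical sketch: the endpoint, layer name, and SLD URL are invented.
from itertools import product
from urllib.parse import urlencode

BASE_URL = "http://example.org/wms"               # hypothetical endpoint
LAYER = "benchmark:polygons"                      # hypothetical layer

# Axes of the matrix: reprojection and styling, the two costly operations.
CRSS = ["EPSG:4326", "EPSG:3857"]
SLDS = [None, "http://example.org/thematic.sld"]  # hypothetical SLD document

# Full-extent bounding boxes per CRS (WMS 1.3.0 uses lat/lon order for 4326).
BBOXES = {
    "EPSG:4326": "-90,-180,90,180",
    "EPSG:3857": "-20037508,-20037508,20037508,20037508",
}

def getmap_url(crs, sld):
    """Build one GetMap URL, assuming WMS 1.3.0 (1.1.1 would use SRS, not CRS)."""
    params = {
        "SERVICE": "WMS", "VERSION": "1.3.0", "REQUEST": "GetMap",
        "LAYERS": LAYER, "STYLES": "", "CRS": crs, "BBOX": BBOXES[crs],
        "WIDTH": "800", "HEIGHT": "600", "FORMAT": "image/png",
    }
    if sld is not None:
        params["SLD"] = sld
    return BASE_URL + "?" + urlencode(params)

for crs, sld in product(CRSS, SLDS):
    print(getmap_url(crs, sld))

Each cell of that matrix exercises a server quite differently, which is
why I would rather agree on the matrix before agreeing on data constraints.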

Then there are also many questions related to the test metrics. As best
I can make out from the results published from last year's test,
        http://www.slideshare.net/gatewaygeomatics.com/wms-performance-shootout
the principal metric was the average response time calculated
over a series of requests. As I understand it, this is due to the use of
JMeter as a testing system. Unfortunately, as all introductory courses
in statistics spend time exploring, the mean is a particularly poor
measure of central tendency for certain distributions, of which the
Erlang is a textbook example. The lack of any measure of variance
further reduces the conclusions that can be drawn from the published
tests. I would presume the benchmarking effort would want to produce
usable results based on robust statistics, so there ought to be some
discussion of how this could be achieved. When I asked Frank
if every team would have time to test the other servers, I had in mind
generating a set of metrics in which I would have confidence, even if
such metrics do not interest anyone else.
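
To illustrate what I mean by robust statistics, a small Python sketch is
below. The latency samples are simulated from a gamma (Erlang-like)
distribution purely to show the skew; none of these numbers come from any
actual benchmark run:

import random
import statistics

random.seed(1)
# Simulated response times in milliseconds with a long right tail.
latencies = [random.gammavariate(2.0, 150.0) for _ in range(1000)]

mean = statistics.fmean(latencies)
median = statistics.median(latencies)
p95 = statistics.quantiles(latencies, n=20)[-1]              # 95th percentile
mad = statistics.median(abs(x - median) for x in latencies)  # robust spread

print(f"mean   = {mean:7.1f} ms  (pulled upward by the tail)")
print(f"median = {median:7.1f} ms")
print(f"p95    = {p95:7.1f} ms")
print(f"MAD    = {mad:7.1f} ms")

Reporting a median, a high percentile, and a spread measure along those
lines would let readers draw far stronger conclusions than a bare mean.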


So, before plunging pell-mell into 'rules of engagement', it seems a
certain amount of foundation work needs to be done to establish what we
hope to benchmark and how we might go about showing the performance of
various WMS servers. 

--adrian


P.S. 
As a starting point for collaboration, which might be fun and would make
a good story for FOSS4G, both as a testimony to interoperability and as a
way to test the collective accuracy of the various servers, I wonder if
we could chain together all the servers which can use other WMSs as data
sources and shuffle the same image across as many servers as possible. We
should get out the same image we started with. Anyone interested?
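
Purely as a sketch of how the round-trip check could be scripted: the
server URLs and layer name below are invented, it assumes each server in
the chain has already been configured to cascade the previous one as a
WMS source, and it uses Pillow for the pixel comparison:

from io import BytesIO
from urllib.parse import urlencode
from urllib.request import urlopen

from PIL import Image, ImageChops  # assumes Pillow is installed

GETMAP = {
    "SERVICE": "WMS", "REQUEST": "GetMap", "VERSION": "1.1.1",
    "LAYERS": "benchmark:roundtrip", "STYLES": "", "SRS": "EPSG:4326",
    "BBOX": "-180,-90,180,90", "WIDTH": "512", "HEIGHT": "256",
    "FORMAT": "image/png",
}

# Hypothetical chain: each server cascades the one before it.
CHAIN = [
    "http://serverA.example.org/wms",   # reads the source data directly
    "http://serverB.example.org/wms",   # cascades serverA
    "http://serverC.example.org/wms",   # cascades serverB
]

def fetch(base_url):
    """Request the same GetMap from one server and decode the image."""
    with urlopen(base_url + "?" + urlencode(GETMAP)) as resp:
        return Image.open(BytesIO(resp.read())).convert("RGB")

first, last = fetch(CHAIN[0]), fetch(CHAIN[-1])
diff = ImageChops.difference(first, last)
print("identical" if diff.getbbox() is None else f"differs within {diff.getbbox()}")

Any resampling or reprojection differences introduced along the chain
would show up in that diff, which is rather the point of the exercise.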




