Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Richard B. Johnson (root@chaos.analogic.com)
Mon, 23 Sep 2002 10:43:02 -0400 (EDT)


On Tue, 24 Sep 2002, Con Kolivas wrote:

> Quoting "Richard B. Johnson" <root@chaos.analogic.com>:
>
> > On Mon, 23 Sep 2002, Ryan Anderson wrote:
> >
> > > On Mon, Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > > > Quoting Ingo Molnar <mingo@elte.hu>:
> > > > > On Mon, 23 Sep 2002, Con Kolivas wrote:
> > > > >
> > > > > how many times are you running each test? You should run them
> > > > > at least twice (ideally 3 times at least), to establish some
> > > > > sort of statistical noise measure. Especially IO benchmarks
> > > > > tend to fluctuate very heavily depending on various things -
> > > > > they are also very dependent on the initial state - ie. how the
> > > > > pagecache happens to lay out, etc. Ie. a meaningful measurement
> > > > > result would be something like:
> > > >
> > > > Yes you make a very valid point and something I've been stewing
> > > > over privately for some time. contest runs benchmarks in a fixed
> > > > order with a "priming" compile to try and get pagecaches etc back
> > > > to some sort of baseline (I've been trying hard to make the
> > > > results accurate and repeatable).
> > >
> > > Well, run contest once, discard the results. Run it 3 more times,
> > > and you should have started the second, third and fourth runs with
> > > similar initial conditions.
> > >
> > > Or you could run the contest 3 times, rebooting between each run....
> > > (automating that is a little harder, of course.)
> > >
> > > IANAS, however.
> > >
> >
> > (1) Obtain statistics from a number of runs.
> > (2) Throw away the smallest and largest.
> > (3) Average whatever remains.
> >
> > This works for many "real-world" things because it removes noise-spikes
> > that could unfairly poison the average.
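
Something like this, as a minimal C sketch of that scheme - the run
times below are invented, and you need at least 3 runs for the
trimming to make sense:

#include <stdio.h>

/* Drop the single smallest and largest samples, average the rest. */
static double trimmed_mean(const double *t, int n)
{
        double sum = 0.0, min = t[0], max = t[0];
        int i;

        for (i = 0; i < n; i++) {
                sum += t[i];
                if (t[i] < min)
                        min = t[i];
                if (t[i] > max)
                        max = t[i];
        }
        return (sum - min - max) / (n - 2);     /* needs n >= 3 */
}

int main(void)
{
        /* five invented run times, one poisoned by a noise spike */
        double runs[] = { 61.2, 59.8, 92.7, 60.4, 60.1 };

        printf("trimmed mean: %.2f\n", trimmed_mean(runs, 5));
        return 0;
}
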
>
> That is the system I was considering. I just need to run enough benchmarks to
> make this worthwhile though. That means about 5 for each it seems - which may
> take me a while. A basic mean will suffice for a measure of central tendency. I
> also need to quote some measure of variability. Standard deviation?
>
> Con
>
> .... Standard deviation?
^^^^^^^^^^^^^^^^^^^

Yes, I like that, but does it measure "goodness of the test" or
something else? To make myself clear, let's look at a ridiculous
extreme condition. Your test really takes 1 second, but during your
runs a continuous ping-flood makes each run take an hour. The
ping-flood swamps the test's own variation, shrinking it to 1/3600
of the measured time, so the standard deviation looks very good.
But instead of showing that your measurements were "good", it
really shows that they are consistently "bad" - precise, but not
accurate.
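
For reference, the usual sample mean and standard deviation, as a
C sketch (run times invented again) - just remember it quantifies
spread, not accuracy:

#include <math.h>
#include <stdio.h>

/* Sample mean and standard deviation (n-1 in the denominator). */
static void mean_stddev(const double *t, int n, double *mean, double *sd)
{
        double sum = 0.0, sq = 0.0;
        int i;

        for (i = 0; i < n; i++)
                sum += t[i];
        *mean = sum / n;

        for (i = 0; i < n; i++)
                sq += (t[i] - *mean) * (t[i] - *mean);
        *sd = sqrt(sq / (n - 1));
}

int main(void)
{
        double runs[] = { 61.2, 59.8, 60.4, 60.1, 60.6 };
        double mean, sd;

        mean_stddev(runs, 5, &mean, &sd);
        printf("mean %.2f stddev %.2f\n", mean, sd);
        return 0;
}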

I think a goodness-of-the-test indicator relates to the ratio of
faster to slower runs. I don't know what you would call this, but
if your average was produced by 3 fast runs plus 1 slow one, that
would indicate better "goodness" than 1 fast run and 3 slow ones.
It shows that external effects are not influencing the test results
as much when the goodness is "more good".
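
One way to put a number on that - purely my own arbitrary
formulation, with a threshold pulled out of thin air - is the
fraction of runs that land within, say, 10% of the fastest run:

#include <stdio.h>

/*
 * Fraction of runs within 10% of the fastest.  3 fast + 1 slow
 * gives 0.75; 1 fast + 3 slow gives 0.25, i.e. less "goodness".
 */
static double goodness(const double *t, int n)
{
        double min = t[0];
        int i, fast = 0;

        for (i = 0; i < n; i++)
                if (t[i] < min)
                        min = t[i];
        for (i = 0; i < n; i++)
                if (t[i] <= 1.10 * min)
                        fast++;
        return (double)fast / n;
}

int main(void)
{
        double good[] = { 60.0, 61.0, 60.5, 3600.0 };     /* 3 fast, 1 slow */
        double bad[]  = { 60.0, 3590.0, 3600.0, 3610.0 }; /* 1 fast, 3 slow */

        printf("good %.2f bad %.2f\n", goodness(good, 4), goodness(bad, 4));
        return 0;
}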

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.
