Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Con Kolivas (conman@kolivas.net)
Tue, 24 Sep 2002 12:45:51 +1000


Quoting Mark Hahn <hahn@physics.mcmaster.ca>:

> > Yes. Definitely the outliers appear to make the difference to the
> > results. The
>
> best score is clearly the most important, along with some measure of
> spread.
>
> best-worst is a lousy measure of spread; stdev is not bad for that
> (or closely related measures like absdev, or stdev-from-mean, etc.)
>
> for contests, best is definitely the first score you want.
>

Normally yes, but this case is quite different. We want to know whether there
can be periods where the machine is busy doing file IO to the exclusion of
everything else. If anything, the worst score is the measure we want. Even the
worst-performing kernels I've tried can have the occasional very good score.
Look at how I've presented these results in the follow-up and you'll see what
I mean:

from the new thread I've started entitled
[BENCHMARK] Statistical representation of IO load results with contest

[...SNIP]
n=5 for number of samples

Kernel          Mean    CI(95%)
2.5.38          411     344-477
2.5.39-gcc32    371     224-519
2.5.38-mm2       95      84-105

The mean is a simple average of the results, and CI(95%) is the 95% confidence
interval for that mean, i.e. the mean lies between those two numbers with 95%
confidence. These numbers seem to be the most useful for comparison.
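
As a rough illustration of how a mean and CI(95%) of this kind can be derived
from n=5 samples, here is a minimal sketch. It assumes a Student's t interval,
which may not be the method used for the table above, and the sample values
and the name mean_ci95 are made up for the example.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}

def mean_ci95(samples):
    """Return (mean, lower, upper) for a t-based 95% confidence interval."""
    n = len(samples)
    mean = statistics.mean(samples)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width

# Hypothetical results, NOT the measurements behind the table above.
runs = [380, 450, 400, 470, 350]
mean, lo, hi = mean_ci95(runs)
print(f"mean={mean:.0f}  CI(95%)={lo:.0f}-{hi:.0f}")
```

With so few samples the interval is wide, which is why a noisy kernel shows a
much larger range than a consistent one.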

Comparing 2.5.38(gcc2.95.3) with 2.5.38(gcc3.2) there is NO significant
difference (p = 0.56)

Comparing 2.5.38 with 2.5.38-mm2 there is a significant difference (p<0.001)
[SNIP...]
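
The excerpt does not say which test produced those p values. As one possible
approach, the sketch below runs Welch's unequal-variance t-test on two sets of
samples using only the Python standard library. The choice of test, the name
welch_t_test and the sample values are all assumptions for illustration.

```python
import math
import statistics

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def welch_t_test(a, b, steps=20000):
    """Return (t, df, two-sided p) for Welch's unequal-variance t-test."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Simpson's rule for the area under the density between 0 and |t|
    # (steps must be even).
    h = abs(t) / steps
    area = t_pdf(0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # The density is symmetric, so both tails together are 1 - 2 * area.
    p = max(0.0, 1 - 2 * area)
    return t, df, p

# Hypothetical results for two kernels, NOT the real measurements.
kernel_a = [380, 450, 400, 470, 350]
kernel_b = [90, 100, 85, 105, 95]
t, df, p = welch_t_test(kernel_a, kernel_b)
print(f"t={t:.2f}  df={df:.2f}  p={p:.5f}")
```

Welch's variant is used here because the spread of the two kernels differs so
much that assuming equal variances would be hard to justify.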

When I've previously run dozens of tests on the same kernel, I've found that
even with a mean of 400, a value of 80 will occasionally come up. Clearly this
lowest score does not give us the information we need.

Con.