Re: IO degradation in 2.4.17-pre2 vs. 2.4.16

Andrew Morton (akpm@zip.com.au)
Sat, 01 Dec 2001 13:34:23 -0800


Jason Holmes wrote:
>
> I saw in a previous thread that the interactivity improvements in
> 2.4.17-pre2 had some adverse effect on IO throughput and since I was
> already evaluating 2.4.16 for a somewhat large fileserving project, I
> threw 2.4.17-pre2 on to see what has changed. Throughput while serving
> a large number of clients is important to me, so my tests have included
> using dbench to try to see how things scale as clients increase.
>
> 2.4.16:
>
> ...
> Throughput 210.989 MB/sec (NB=263.736 MB/sec 2109.89 MBit/sec) 16 procs
> ...
>
> 2.4.17-pre2:
>
> ...
> Throughput 153.672 MB/sec (NB=192.09 MB/sec 1536.72 MBit/sec) 16 procs
> ...

This is expected, and tunable.

The thing about dbench is this: it creates files and then it
quickly deletes them. It is really, really important to understand
this!

If the kernel allows processes to fill all of memory with dirty
data and to *not* start IO on that data, then this really helps
dbench, because when the delete comes along, that data gets tossed
away and is never written.

If you have enough memory, an entire dbench run can be performed
and it will do no disk IO at all.

The 2.4.17-pre2 change meant that the kernel starts writeout of
dirty data earlier, and will cause the writer to block, to
prevent it from filling all memory with write() data. This is
how the kernel is actually supposed to work, but it wasn't working
right, and the mistake benefitted dbench. The net effect is that
a dbench run does a lot more IO.

If your normal operating workload creates files and does *not*
delete them within a few seconds, then the -pre2 change won't
make much difference at all, as your bonnie++ figures show.

If your normal operating workloads _does_ involve very short-lived
files then you can optimise for that load by increasing the
kernel's dirty buffer thresholds:

mnm:/home/akpm> cat /proc/sys/vm/bdflush
40 0 0 0 500 3000 60 0 0
^^ ^^
nfract nfract_sync

These two numbers are percentages.

nfract: percentage of physical memory at which a write()r will
start writeout.

nfract_sync: percentage of physical memory at which a write()r
will block on some writeout (writer throttling).

You'll find that running

echo 80 0 0 0 500 3000 90 0 0 > /proc/sys/vm/bdflush

will boost your dbench throughput muchly.

dbench is a good stability and stress tester. It is not a good
benchmark, and it is not representative of most real-world
workloads.

-
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/