Re: ext3 throughput woes on certain (possibly heavily fragmented) files

Simon Kirby (sim@netnation.com)
Mon, 16 Sep 2002 15:39:11 -0700


On Fri, Sep 06, 2002 at 06:24:57PM +0100, Stephen C. Tweedie wrote:

> Ext2 has a preallocation mechanism so that if you have multiple
> writes, they get dealt with to some extent as a single allocation.
> However, that doesn't work over close(): the preallocated blocks are
> discarded whenever we close the file.
>
> The problem with mail files, though, is that they tend to grow quite
> slowly, so the writes span very many transactions and we don't have
> that opportunity for coalescing the writes. Actively defragmenting on
> writes is an alternative in that case.
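
That growth pattern is exactly what our delivery agents produce: each
message is appended in its own open/write/close cycle, so no two
allocations ever share a transaction. Roughly like this (a simplified
sketch; the path, flags, and lack of error handling are illustrative
only, not our actual delivery code):

/* One open/append/close per delivered message: each allocation
 * lands in its own transaction, and any blocks ext2 preallocated
 * are discarded again at close(). */
#include <fcntl.h>
#include <unistd.h>

void deliver(const char *mbox, const char *msg, size_t len)
{
	int fd = open(mbox, O_WRONLY | O_APPEND | O_CREAT, 0600);
	if (fd < 0)
		return;
	write(fd, msg, len);	/* typically only a few kB */
	close(fd);		/* preallocation dropped here */
}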

We recently switched a large mail spool from ext2 to ext3 with the
default journalling mode (data=ordered), and we are now having huge
problems with disk I/O load.

We have fsync and friends disabled for performance reasons. With ext2,
the machine would happily hum along with an average load of 0.2 and a
usual 400 kB - 800 kB of writes every 5 seconds and about 10 kB/sec of
reads.

Now with ext3, the machine has a load average of about 15 and writing
happens almost all of the time. "vmstat 1" output:

   procs                      memory     swap      io     system      cpu
 r   b  w   swpd   free   buff   cache si so  bi   bo   in   cs us sy id
 0  42  2  79368  47196 100456 1080348  0  0   0 3036 2514 2077 18 21 60
 0  76  2  79368  44264 100456 1080348  0  0   0 1776 1266  823  4  3 92
 0 111  3  79368  41248 100456 1080348  0  0   0 1952 1176  722  4  5 91
 0 132  2  79368  39432 100460 1080348  0  0   0 1368 1007  612  1  3 96
 0  67  3  79368  34412 100460 1080628  0  0   0 2884 1968 1246 18 13 69
 0  41  2  79368  36572 100468 1080828  0  0  24 4020 2661 1530 16 21 64
 0  32  3  79368  31736 100500 1081456  0  0   0 3688 2696 2061 26 22 52
 0  39  3  79368  24588 100528 1082164  0  0   4 3800 2636 2643 30 21 50
 0  32  4  79368  21500 100536 1082832  0  0  24 3216 2404 2419 32 15 54
 5  28  2  79368  18160 100536 1083360  0  0   0 3416 2372 2164 24 19 57
 0  25  4  79368  19748 100552 1082896  0  0   4 4120 2544 2421 17 21 62
 4  16  4  79368  18216 100560 1083284  0  0   0 3532 2115 2361 20 17 63
 0  37  2  79368  17240 100568 1083456  0  0  16 2376 1817 1691  8 12 80
 1  67  3  79368  15112 100568 1083456  4  0   4 1644 1051  723  6  4 90
 1  88  3  79368  12028 100572 1083464  0  0   8 1884 1102  684  6  3 91
 0 108  3  79368  10132 100572 1083468  0  0   0 1716  924  503  3  3 94
15   0  2  79368  14460 100548 1081996  0  0  12 3852 2609 2000 17 25 59
 0  39  3  79368  13252 100576 1082220  0  0  52 4288 2740 2095 19 19 62

This box is primarily running a POP3 server (written in-house to cache
mbox offsets, so that it can handle a huge volume of mail), and also
exports the mail spool via NFS to other servers running exim (with
fsync disabled). The spool is exported from nfsd async, and everything
is mounted noatime, nodiratime. No applications should be calling
sync/fsync/fdatasync or using O_SYNC.
It's a mail server, so everything is fragmented.
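
For reference, the relevant mount and export options look roughly like
this (device name, paths, and hostnames are made up; the real config
differs):

# /etc/fstab (illustrative device and mount point)
/dev/sda1   /var/spool/mail   ext3   defaults,noatime,nodiratime   1 2

# /etc/exports (illustrative client hosts)
/var/spool/mail   mail*.example.com(rw,async)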

We're using dotlocking. Would this cause metadata journalling? We had
to hash the mail spool a long time ago because system time was eating
all of the CPU (ext2's linear directory scan to find an available slot
in the spool directory when adding the dotlock file). I estimate about
200 - 300 dotlock files are created per second, but these should all be
asynchronous. Would switching to fcntl() locking (if this works over
NFS) solve the problem?
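
To be concrete about why I suspect the dotlocks: every lock/unlock
cycle creates and then removes a directory entry, both of which ext3
has to journal, whereas fcntl() lock state lives purely in the kernel
(lockd, over NFS) and writes nothing to the filesystem. A minimal
sketch of the two (path and error handling illustrative only):

#include <fcntl.h>
#include <unistd.h>

/* Dotlock: two journalled metadata operations per lock cycle. */
int dotlock(const char *lockpath)
{
	/* O_EXCL makes creation atomic; fails if the lock exists. */
	int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (fd < 0)
		return -1;	/* lock is held by someone else */
	close(fd);
	return 0;
}

void dotunlock(const char *lockpath)
{
	unlink(lockpath);	/* second journalled metadata update */
}

/* fcntl(): in-kernel lock state, no on-disk metadata change. */
int fcntl_lock(int fd)
{
	struct flock fl;

	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* 0 = lock the whole file */
	return fcntl(fd, F_SETLKW, &fl);	/* blocks until granted */
}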

A "ps -eo pid,stat,args,wchan | grep simpopd | grep ' D '" shows POP3
processes stuck in either "down" or in "do_get_write_access", which
appears to be a journal function.

We noticed that the newest Red Hat kernel includes some ext3 updates as
a patch against vanilla 2.4.18, including changes to the
do_get_write_access function. Have improvements been made in this area?

Thanks!

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]