RE: ext3-2.4-0.9.5

Peter J. Braam (braam@clusterfilesystem.com)
Mon, 30 Jul 2001 09:43:14 -0600


Hi Andrew,

Boy, you've had quite a weekend again.

Do you think this includes the fix for Shirish's bug?

- Peter -

> -----Original Message-----
> From: ext3-users-admin@redhat.com [mailto:ext3-users-admin@redhat.com]On
> Behalf Of Andrew Morton
> Sent: Monday, July 30, 2001 9:19 AM
> To: lkml; ext3-users@redhat.com
> Subject: ext3-2.4-0.9.5
>
>
>
> The latest ext3 patches against linux-2.4.7 and linux-2.4.7-ac3 are at
>
> http://www.uow.edu.au/~andrewm/linux/ext3/
>
> Changes since 0.9.4 include:
>
> - Fixed a bug which could trip an assertion failure when using small
> journals under heavy load in full data journalling mode.
>
> - A patch from Ted, plus the latest version of e2fsprogs, plus the
> stomping of various ext3 bugs, gives us preliminary support for
> external journals.
>
> - Redesigned the handling of synchronous operations. Much simplified and
> several bugs fixed.
>
> - Drastically improved throughput with synchronous mounts - they're now as
> efficient as `chattr +S'.
>
> - Fixed an O(n^2) bottleneck in the commit code.
>
> - Implemented transaction handle batching for a big throughput increase
> with synchronous operations.
>
>
> The external journal code seems to work OK - brief usage details are
> at the web site. The intent here is that the external journal be an
> NVRAM device (or a disk) which can be used to accelerate full-data
> journalling. Simulation using a normal RAM drive indicates that we can
> double throughput with some loads (dbench) but not others (synctest).
> More work is needed to fully characterise this.
>
>
> For the synchronous operations I've put together an application which
> attempts to simulate an MTA's behaviour. The simulator is called
> `synctest' and it is in ext3 CVS. There's a copy at
> http://www.uow.edu.au/~andrewm/synctest.c - I'd really appreciate it
> if the MTA guys could poke some useful holes in the modelling.
>
> The simulator launches a (large) number of sub-processes. Each subprocess
> does the following:
>
> for 100 different filenames:
>     create a file
>     write some data to the file (5k to 250k, exponential distribution)
>     optionally fsync() the file
>     close the file
>     optionally fsync() the file's parent dir
>     rename the file
>     optionally fsync() the file's parent dir
>     rename the file
>     optionally fsync() the file's parent dir
>     rename the file
>     optionally fsync() the file's parent dir
>     unlink the file from 30 passes ago.
>
> (I'm told that postfix does a lot of renaming).
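>
> For concreteness, here is a minimal C sketch of one such subprocess
> pass. It is not the real synctest.c: the fixed ~64k file size, the
> unconditional fsync() calls and the file naming are illustrative
> assumptions standing in for synctest's options and its exponential
> size distribution.
>
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> /* fsync a directory: open it read-only and fsync the descriptor */
> static void fsync_dir(const char *dir)
> {
>         int fd = open(dir, O_RDONLY);
>
>         if (fd >= 0) {
>                 fsync(fd);
>                 close(fd);
>         }
> }
>
> int main(void)
> {
>         char name[64], newname[64], buf[4096];
>         int pass, i, fd;
>
>         memset(buf, 'x', sizeof(buf));
>         for (pass = 0; pass < 100; pass++) {
>                 /* create the file and write ~64k of data */
>                 snprintf(name, sizeof(name), "f%d.0", pass);
>                 fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
>                 for (i = 0; i < 16; i++)
>                         write(fd, buf, sizeof(buf));
>                 fsync(fd);              /* optional in synctest */
>                 close(fd);
>                 fsync_dir(".");         /* optional in synctest */
>
>                 /* three rename rounds, each followed by a parent fsync */
>                 for (i = 1; i <= 3; i++) {
>                         snprintf(newname, sizeof(newname),
>                                  "f%d.%d", pass, i);
>                         rename(name, newname);
>                         fsync_dir(".");
>                         strcpy(name, newname);
>                 }
>
>                 /* unlink the file from 30 passes ago */
>                 if (pass >= 30) {
>                         snprintf(newname, sizeof(newname),
>                                  "f%d.3", pass - 30);
>                         unlink(newname);
>                 }
>         }
>         return 0;
> }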
>
>
> Now, it makes a very great deal of difference how these files are
> organised in directories. If you have 100 processes each doing
> synchronous operations in separate directories then the new
> transaction batching in ext3 gives it enormous scalability, whereas
> ext2 basically stops.
>
> If you have 100 processes each doing synchronous operations in a
> single big directory then ext2 does OK, and ext3 is only slightly
> quicker than ext2. This is because the VFS serialises operations on
> particular directories via parent->i_sem and defeats ext3 transaction
> batching.
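>
> To illustrate (a rough paraphrase, not the actual 2.4 fs/namei.c
> code): the parent's semaphore is held across the whole directory
> operation, so any synchronous I/O performed inside the method blocks
> every other operation on that directory:
>
> down(&dir->i_sem);
> err = dir->i_op->create(dir, dentry, mode);  /* may sleep on disk I/O */
> up(&dir->i_sem);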
>
> Most testing was performed on a `chattr +S' directory tree because
> that seems to be a convenient way to operate popular MTAs.
>
> ext3 relied upon the `chattr' setting to provide synchronous semantics
> for all directory and write operations. For ext2, the synctest `-f'
> option was used to fsync the data at the end of the write.
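>
> For reference, `chattr +S' just sets the ext2 sync flag on the inode;
> a program can do the same through the flags ioctl. A minimal sketch,
> with the error handling simplified:
>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/ext2_fs.h>
>
> /* equivalent of `chattr +S <path>': make all updates synchronous */
> int make_sync(const char *path)
> {
>         int fd = open(path, O_RDONLY);
>         int flags, err;
>
>         if (fd < 0)
>                 return -1;
>         err = ioctl(fd, EXT2_IOC_GETFLAGS, &flags);
>         if (err == 0) {
>                 flags |= EXT2_SYNC_FL;
>                 err = ioctl(fd, EXT2_IOC_SETFLAGS, &flags);
>         }
>         close(fd);
>         return err;
> }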
>
> The following tests were executed on a modern IDE disk with disk
> write caching enabled. Internal journal. 100 processes were used in
> every test. The number of `synctest' processes per directory was
> altered.
>
> The final column represents ext2 throughput without `chattr +S', but using
> fsync() to sync the parent directory and the data.
>
> processes/dir   number of     ext2 completion   ext3 completion   ext2 (no chattr)
>                 directories   time (minutes)    time (minutes)    time (minutes)
>
>       50              2            7:24              5:10              3:24
>       20              5            9:21              3:31
>       10             10           11:09              3:05              6:01
>        5             20           14:37              3:02
>        1            100           23:10              2:44              9:44
>
>
> Apparently postfix will typically use 256 directories for hashing its
> mailspool files. The reason for this is, presumably, to avoid having
> single directories with hundreds of thousands of files in them.
> Postfix will spawn hundreds of processes to work on those directories.
> So the last row of this table is the interesting one.
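>
> A two-level hash layout of the kind described is easy to sketch in C.
> This illustrates the idea only - it is an assumption, not postfix's
> actual hashing scheme:
>
> #include <stdio.h>
>
> static unsigned name_hash(const char *s)
> {
>         unsigned h = 0;
>
>         while (*s)
>                 h = h * 31 + (unsigned char)*s++;
>         return h;
> }
>
> /* map a queue file into one of 16 x 16 = 256 leaf directories,
>    e.g. "spool/3/A/msg01234" */
> void queue_path(char *out, int len, const char *name)
> {
>         unsigned h = name_hash(name);
>
>         snprintf(out, len, "spool/%X/%X/%s",
>                  h & 0xF, (h >> 4) & 0xF, name);
> }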
>
> ext2 bogs down because it has so much metadata to write - it is spread all
> over the disk and cannot benefit from write clustering.
>
> ext3 stopped scaling at 20 processes per directory because the
> limiting factor was checkpointing all the data and metadata into the
> main filesystem. Seeking. The time taken to write the data to the
> journal is negligible when compared with this. In fact, the same
> testing was performed with an external journal on RAM disk and the
> throughput was basically unaltered. More main memory will really help
> improve things here.
>
> A 400 megabyte journal was used. What happens is that ext3 happily
> writes all outgoing data into the journal in linear 100 megabyte
> chunks until you run out of either a) journal space or b) memory.
> Then the whole world stops for 15-20 seconds while hundreds of
> megabytes of stuff is written all over the main filesystem. This is
> optimal for throughput, but perhaps not desirable. Using a smaller
> journal size will tame this behaviour nicely. Or use ordered-data
> mode, which runs smoothly, performs well and has full synchronous
> behaviour and recoverability.
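>
> For completeness, ordered-data mode is selected at mount time with
> `-o data=ordered'; from C that is a single mount(2) call. The device
> and mount point here are made-up examples:
>
> #include <sys/mount.h>
>
> /* mount the spool filesystem in ext3's ordered-data mode */
> int mount_ordered(void)
> {
>         return mount("/dev/hda1", "/var/spool", "ext3",
>                      0, "data=ordered");
> }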
>
>
> Conclusions. Assuming that `synctest' is somewhat like a real MTA, and
> that the MTA is using two-level hashing, we can say that:
>
>
> - chattr +S on ext2 costs you 2:1 or 3:1 throughput when compared with
> fsync()-on-data and fsync()-on-dir.
>
> - full-journalling ext3 can offer a 3x to 10x improvement over ext2,
> depending upon how ext2 is used and the directory layout/task count.
>
> - ext2 likes to have few directories, many processes per directory.
>
> - ext3 likes many directories, few processes per directory.
>
> - We can write data to the journal much faster than we can checkpoint
> that data into the main filesystem, so the benefit of an external
> journal device (spinning or NVRAM) has not been demonstrated.
>
> - The holding of i_sem over the parent is a severe scalability limitation
> with synchronous metadata operations. Better to have:
>
> void *opaque;
> down(&parent->i_sem);
> file->f_op->op(&opaque, args...);
> up(&parent->i_sem);
> if (IS_SYNC(inode))
>         inode->i_op->wait_on_stuff(opaque);
>
> -
>
>
>
