Re: ext3-2.4-0.9.4

Andrew Morton (akpm@zip.com.au)
Thu, 26 Jul 2001 21:42:37 +1000


Matthias Andree wrote:
>
> On Thu, 26 Jul 2001, Andrew Morton wrote:
>
> > data=journal
> >
> > All data (as well as metadata) is written to the journal
> > before it is released to the main fs for writeback.
> >
> > This is a specialised mode - for normal fs usage you're better
> > off using ordered data, which has the same benefits of not corrupting
> > data after crash+recovery. However for applications which require
> > synchronous operation such as mail spools and synchronously exported
> > NFS servers, this can be a performance win. I have seen dbench
>
> In ordered and journal mode, are metadata operations, namely creating a
> file, rename(), link(), unlink(), "synchronous" in the sense that after
> the call has returned, the effect of this call is never lost, i.e., if
> link(2) has returned and the machine crashes immediately, will the next
> recovery ALWAYS recover the link?

No, they're not synchronous by default. After recovery they
will either be wholly intact, or wholly absent.

> Or will ext3 still need chattr +S?

Yes, if the app doesn't support O_SYNC or fsync(). I believe
that MTAs *do* support those things.

> Does it still support chattr +S at all?

Yes.

> Synchronous meta data operations are crucial for mail transfer agents
> such as Postfix or qmail. Postfix has up until now been setting
> chattr +S /var/spool/postfix, making original (esp. soft-updating) BSD
> file systems significantly faster for data (payload) writes in this
> directory than ext2.

If Postfix is capable of opening the files O_SYNC or of doing
fsync() on them, then the `chattr +S' is no longer necessary. Unlike
on ext2, when the O_SYNC write() or the fsync() returns, the directory
contents (as well as the inode, bitmaps, data, etc.) will all be tight on
disk and will be restored after a crash.
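
For illustration only, here is a minimal sketch of the two application-side
approaches mentioned above: opening the file O_SYNC, or calling fsync()
before reporting success. It is written in Python for brevity and is not
taken from Postfix or any other MTA; the function names and the demo
file names are hypothetical. Whether the directory entry is also on disk
when these calls return depends on the filesystem, as described above.

```python
import os
import tempfile


def _write_all(fd, payload):
    """Loop until every byte is handed to the kernel (write() may be partial)."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_with_fsync(path, payload):
    """Hypothetical helper: ordinary write(), then fsync() before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        _write_all(fd, payload)
        os.fsync(fd)  # blocks until the kernel reports the file as flushed
    finally:
        os.close(fd)


def write_with_osync(path, payload):
    """Hypothetical helper: open O_SYNC so each write() is itself synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        _write_all(fd, payload)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demo in a scratch directory; a real spool path is deliberately not assumed.
    with tempfile.TemporaryDirectory() as spool:
        write_with_fsync(os.path.join(spool, "msg.fsync"), b"queued mail\n")
        write_with_osync(os.path.join(spool, "msg.osync"), b"queued mail\n")
        print(sorted(os.listdir(spool)))
```

The fsync() variant pays the synchronous cost once per file, while the
O_SYNC variant pays it on every write(), so an application that issues
many small writes would probably prefer the former.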

This should speed things up considerably, especially with journalled-data
mode. I need to test and characterise this some more to come up with some
quantitative results and configuration recommendations.

BTW, if you have more-than-modest throughput requirements, don't
even *think* of mounting the fs with `mount -o sync'. Our performance
in this mode is terrible :(

I have a hack somewhere which fixes this as much as it can be fixed, but
I didn't even bother committing it. It's feasible, but tiresome.

A better solution is to fix some lock inversion problems in the core
kernel which prevent optimal implementation of data-journalling
filesystems. I don't really expect this to occur medium-term or ever.

A middle-ground solution may be to add an fs-private `osync' mount
option, so all files are treated similarly to O_SYNC, which would
work well.
