Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Patrick J. LoPresti (patl@curl.com)
15 Jul 2002 14:56:36 -0400


Matthias Andree <matthias.andree@stud.uni-dortmund.de> writes:

> > For example, a typical MTA might follow this paradigm:
> >
> > write temp file
> > fsync()
> > rename temp file to destination
> > fsync()
>
> So does fsync() guarantee rename() persistence across crash on all file
> systems and kernel versions? IIRC, no.

It depends on what you fsync() :-).

On BSD, fsync() of a file's descriptor will commit the rename of that
file to disk.

On Linux, fsync() of the *directory's* descriptor is required. And
yes, this will work across file systems and Linux versions, according
to Linus/Alan/etc.

> That'd be inefficient for the double fsync().

But it is necessary. See below.

> Postfix is ahead of that: it omits the first fsync() you suggest,
> because the +x flag, while necessary, is not sufficient to mark the
> mail as "complete, further processing allowed". The "message file"
> is a structured file format that has an "end" record at the end of
> the file.

This is not sufficient! Data writes are NOT guaranteed to be ordered.
It is permissible for the file system to flush the first and last
block of a file to disk BEFORE flushing the middle. You either need
the double fsync() or you need a checksum in the file; simple markers
are not enough to make a real guarantee. And MTAs should be making
real guarantees!

> But let's keep this unspecific to the MTA. Unless fsync() is used to
> enforce ordering, without data=ordered, journalled file systems can
> "recreate" files that are not there. Undead you may call them if you
> so like...

No, data=ordered has nothing to do with recreating dead files. What
data=ordered does is make sure bogus blocks do not appear in new files
(or in new extents of old files).

Failing to call fsync() at all (i.e., failing to commit metadata
updates) is what can recreate dead files.

> Postfix' local(8) daemon additionally relies on rename(2) being
> synchronous (in Maildir delivery), it does not fsync() after rename.
> OTOH, the file is completely in Maildir/tmp/somename, so it's not
> really lost, just invisible.

No, it is lost, because the file's creation is not guaranteed to have
happened at all! (Well, depending on the file system and the
semantics. I think I need to write this up more clearly.)

> > Summary: *All* MTAs should call fsync() twice. The only issue is what
> > descriptors they should call it on, exactly :-).
>
> See above. Before that, we must know that fsync() syncs all directory
> and file data and metadata (that makes four) all the way up to the mount
> point. For Linux 2.0, 2.2, 2.4. For any file system and any mount
> option. See the table project above ;-)

As I said, the issue is what descriptors they should call fsync() on.
On Linux, fsync() on a file's descriptor will commit the file's
contents; a second fsync() on the containing directory's descriptor
will commit the rename()/link().

- Pat
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/