Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Patrick J. LoPresti (patl@curl.com)
15 Jul 2002 16:17:01 -0400


--=-=-=

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> Documentation/fs/fsync.txt or similar sounds a good idea

OK, attached is my first attempt at such a document.

What do you think?

- Pat

--=-=-=
Content-Disposition: inline; filename=fsync.txt
Content-Description: fsync.txt

Linux fsync() semantics
(or, "How to create a file reliably")

Introduction
============

Consider the following C program:

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>

int
main (int argc, char *argv[]) {
int fd;
char *s = "Hello, world!\n";

fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL);
if (fd < 0) return 1;

if (write (fd, s, strlen(s)) < 0) return 3;
if (fsync (fd) < 0) return 4;
if (close (fd) < 0) return 5;

return 0;
}

Question: If you compile and run this program, and it exits zero
(success), and your machine then crashes, is it guaranteed that the
file /tmp/foo will exist upon reboot?

Answer: On many Unices, including *BSD, yes.
On Linux, NO.

How could this be? And what can you do about it?

History
=======

In the beginning was BSD with its Fast File System (FFS). Under FFS,
changes to directories were "synchronous", meaning they were committed
to disk before the system call (open/link/rename/etc.) returned.
Changes to files (write()) were asynchronous. The fsync() system call
allowed an application to force a file's pending writes to be
committed to persistent media.

In general, disks have reasonble throughput but horrible latency, so
it is much faster to write many things all at once rather than one at
a time. In other words, synchronous operations are slow.

Enter Linux. By default, Linux makes all operations, including
directory updates, asynchronous. Early file system benchmarks showed
Linux beating the pants off of BSD, especially when lots of directory
operations were involved. This annoyed the BSD folks, who claimed
that synchronous directory updates are required for reliable
operation. (As with most points of contention between Linux and BSD,
this is both true and false... See below.)

The problem with making directory operations asynchronous is that you
then need to provide a way for the application to commit those changes
to disk. Otherwise, it is impossible to write reliable applications.

BSD softupdates
===============

Sometime during the 90s, the BSD developers introduced "soft updates"
to improve performance. These do two things. First, they make all
file system operations asynchronous (like Linux). Second, they extend
the fsync() system call so that it commits to disk BOTH the file's
data AND any directories via which the file might be accessed.

In other words, BSD with soft updates requires that you call fsync()
on a file to commit any changes to its containing directory. This is
why the program above "works" on BSD.

Many programs are written these days to expect soft update semantics,
because such algorithms will also work correctly under traditional
FFS.

The problem with the softupdates approach is that finding all paths to
a file is complex, and the Linux developers hate complexity. Linux
does NOT support this behavior for fsync() and probably never will.

Standards
=========

Quick aside: What do the relevant standards (POSIX, SuS) say? Is
Linux violating some standard here?

Well, different people, having read the standards, disagree on this
point. This itself means the standards are not clear (which is a bad
thing for a standard). This is probably because the standards were
written when synchronous directory updates were the norm, and the
authors did not even consider asynchronous directory updates.

The Linux Solution
==================

The Linux answer is simple: If you want to flush a modified directory
to disk, call fsync() on the directory.

In other words, to reliably create a file on Linux, you need to do
something like this:

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>

int
main (int argc, char *argv[]) {
int fd, dirfd;
char *s = "Hello, world!\n";

fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL);
if (fd < 0) return 1;

dirfd = open ("/tmp", O_RDONLY);
if (dirfd < 0) return 2;

if (write (fd, s, strlen(s)) < 0) return 3;
if (fsync (fd) < 0) return 4;
if (close (fd) < 0) return 5;
if (fsync (dirfd) < 0) return 6;
if (close (dirfd) < 0) return 7;

return 0;
}

If this program exits zero, the file /tmp/foo is guaranteed to be on
disk and to have the correct contents. This is true for ALL versions
of the Linux kernel and ALL file systems.

Other choices
=============

So you have written to the authors of your favorite MTA asking them to
support Linux properly by using fsync() on directories. They have
responded saying that "Linux is broken". (Be sure to ask them to
justify this claim with chapter and verse from a standard. It is sure
to be interesting.) What can you do?

If the application does all its work in one directory, or a few
directories, you can do "chattr +S" on the directory. This will cause
all operations on that directory to be synchronous.

You can use the "-o sync" mount option. This will cause ALL
operations on that partition to be synchronous. This solves the
problem, but is likely to be slow.

In the current version of Linux, you can use the ext3 or ReiserFS file
systems. These happen to commit their journals to disk whenever
fsync() is called, which has the side-effect of providing semantics
like BSD's soft updates. But note: This behavior is not guaranteed,
and may change in future releases!

But really, the best idea is to convince application authors to
support the "Linux way" for committing directory updates. The
semantics are simple, clear, and extremely efficient. So go bug those
MTA authors until they listen :-).

- Patrick LoPresti <patl@curl.com>
July 2002

--=-=-=--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/