Re: 2.4: NFS client race causes data loss when appending

Trond Myklebust (trond.myklebust@fys.uio.no)
06 Dec 2001 15:29:47 +0100


>>>>> " " == xeno <xeno@overture.com> writes:

> 1. getattr request goes out to get file size. Value will be
> stale compared to inode->i_size, since writes are happening.
> 2. All writebacks for the inode complete.
> 3. getattr response returns with stale file size value.
> 4. __nfs_refresh_inode checks writebacks, finds none,
> overwrites inode->i_size.
> 5. generic_file_write resets file position (O_APPEND) with
> stale file size, overwriting previously written data.

<snip>
> --- linux/fs/nfs/inode.c Fri Nov 9 14:28:15 2001
> +++ linux-nfsappendrace/fs/nfs/inode.c Wed Dec 5 17:12:28 2001
> @@ -868,8 +868,9 @@
> __nfs_revalidate_inode(struct nfs_server *server, struct inode
> *inode) {
> int status = -ESTALE; struct nfs_fattr fattr;
> + int writebacks;

> dfprintk(PAGECACHE, "NFS: revalidating (%x/%Ld)\n",
inode-> i_dev, (long long)NFS_FILEID(inode));

> @@ -889,8 +890,9 @@
> }
> } NFS_FLAGS(inode) |= NFS_INO_REVALIDATING;

> + writebacks = nfs_have_writebacks(inode);
> status = NFS_PROTO(inode)->getattr(inode, &fattr); if
> (status) {
> dfprintk(PAGECACHE, "nfs_revalidate_inode:
> (%x/%Ld) getattr failed, error=%d\n",
inode-> i_dev, (long long)NFS_FILEID(inode), status);
> @@ -900,8 +902,11 @@
> remove_inode_hash(inode);
> } goto out;
> }
> +
> + if ( writebacks && nfs_size_to_loff_t(fattr.size) <
> inode->i_size )
> + fattr.size = (__u64) inode->i_size;

> status = nfs_refresh_inode(inode, &fattr); if (status)
> {
> dfprintk(PAGECACHE, "nfs_revalidate_inode:
> (%x/%Ld) refresh failed, error=%d\n",

The above is clearly insufficient to fix the race: you've only
addressed the problem of getattr. NFS is crawling with stuff that
returns fattrs (read/getattr/lookup/...). Each and every one of them
can race in the way you describe.
It will also fail to prevent a race occurring if the writeback is
scheduled and written while we are in the getattr() call (rare but
possible)...

What we really want is to prevent nfs_refresh_inode() from
overwriting newer attribute information with older information. How
therefore about something like the appended patch, that uses the ctime
field to determine which attribute information is obsolete?
I'm afraid it's not going to work too well for Linux servers because
of the shitty 1 second resolution we have on (a|m|c)time, but it will
help against most non-Linux servers.

Cheers,
Trond

--- linux-2.4.17-pre4/fs/nfs/inode.c.orig Thu Dec 6 02:27:46 2001
+++ linux-2.4.17-pre4/fs/nfs/inode.c Thu Dec 6 15:26:07 2001
@@ -1007,6 +1007,10 @@
new_size = fattr->size;
new_isize = nfs_size_to_loff_t(fattr->size);

+ if (time_before(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)) &&
+ (s64)NFS_CACHE_CTIME(inode) - (s64)fattr->ctime < 0)
+ return 0;
+
/*
* Update the read time so we don't revalidate too often.
*/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/