Re: close return value

Hildo.Biersma@morganstanley.com
Fri, 19 Jul 2002 07:31:54 -0400 (EDT)


>>>>> "Pete" == Pete Zaitcev <zaitcev@redhat.com> writes:

>> Date: Thu, 18 Jul 2002 16:09:51 -0400 (EDT)
>> From: Hildo.Biersma@morganstanley.com

Pete> The problem with errors from close() is that NOTHING SMART can be
Pete> done by the application when it receives it. An application can:
>>
Pete> a) print a message "Your data are lost, have a nice day\n".
Pete> b) loop retrying close() until it works.
Pete> c) do (a) then (b).
>>
>> I must disagree with you. We run the Andrew File System (AFS), which
>> has client-side caching with write-on-close semantics. If an error
>> occurs at close() time, a well-written application can
>> actually do something useful - such as sending an alert, or letting
>> the user know the action failed.

Pete> The above is an example of an application covering up for
Pete> a filesystem that breaks the general expectations for the
Pete> operating environment. Remember your predecessor with a broken
Pete> driver who received his beating a couple of months ago.
Pete> He also had an application which processed his errors from
Pete> close just fine. A workaround can be done in every specific
Pete> instance, but it does not make this practice any smarter.

I agree in general, but you should realize that there are valid
reasons why Unix filesystem semantics are sometimes violated.

We have slightly over 8,000 Unix hosts using the same networked
filesystem against the same set of file-servers. This is only
feasible if you minimize the number of client<->server interactions.

This is done in two ways:
- persistent (disk-based) client-side caching, where the server will
let a client know if a file is updated and needs to be evicted from
the client's cache
- write-on-close semantics for files (sketched at the system-call
  level below)
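
At the system-call level, the write-on-close model looks roughly
like this. This is a minimal sketch of my own, not AFS client code;
the path and sizes are made up:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char chunk[8192];
    memset(chunk, 'x', sizeof(chunk));

    /* made-up AFS path */
    int fd = open("/afs/somecell/usr/data", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Each write() lands in the local disk cache and returns without
       a server round-trip - this is where the scalability comes from. */
    for (int i = 0; i < 1000; i++)
        if (write(fd, chunk, sizeof(chunk)) != (ssize_t)sizeof(chunk))
            perror("write");

    /* The actual store to the fileserver happens at close(), so this
       is the first call that can see a network, quota, or ACL error. */
    if (close(fd) < 0) {
        perror("close");
        return 1;
    }
    return 0;
}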

Pete> What AFS designers should have done if they had a brain larger
Pete> than a pea was:

Pete> 1. Make close block indefinitely, retrying writes.
Pete> Allow overlapping writes, let clients sort it out.

None of these things works: access may be denied, a volume may be
taken off-line, and having overlapping writes from clients increases
the amount of client<->server interaction.
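
Given those failure modes, a well-written application treats the
close() return value as the real verdict on the data. A minimal
sketch of what I mean - alert_user() is a made-up hook, not a real
API, and a full version would also loop on short writes:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Stand-in for whatever alerting mechanism the site uses. */
static void alert_user(const char *path, int err)
{
    fprintf(stderr, "save of %s failed: %s\n", path, strerror(err));
}

/* Returns 0 on success; on failure the caller still holds buf and
   can retry or store the data somewhere safer. */
int save_file(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        alert_user(path, errno);
        return -1;
    }
    if (write(fd, buf, len) != (ssize_t)len) {
        alert_user(path, errno);
        close(fd);      /* best effort; the write already failed */
        return -1;
    }
    /* With write-on-close caching, an access or quota error can
       first appear here, not at write() time. */
    if (close(fd) < 0) {
        alert_user(path, errno);
        return -1;
    }
    return 0;
}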

Pete> 2. Provide an ioctl to flush writes before close() or
Pete> make fsync() work right. Your "smart" applications have had
Pete> to use that, so that no ambiguity existed between tearing down
Pete> the descriptor and writing out the data.

This is provided - sync(), fsync(), and msync() all work.
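
So an application that wants the write-out error separated from
descriptor teardown can flush explicitly first. A minimal sketch:

#include <stdio.h>
#include <unistd.h>

/* Flush first, then tear down.  After a successful fsync() the data
   has already been pushed to the server, so a subsequent close()
   failure is a teardown problem, not data loss. */
int flush_then_close(int fd)
{
    if (fsync(fd) < 0) {
        perror("fsync");        /* the data is still at risk; react here */
        close(fd);
        return -1;
    }
    if (close(fd) < 0) {
        perror("close");        /* data was already flushed */
        return -1;
    }
    return 0;
}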

Pete> This way, naive applications such as cat and cc would
Pete> continue to work. There is no reason to penalize them just
Pete> because some application _could_ possibly post idiotic alerts
Pete> (Abort, Retry, Fail).

That's where the trade-offs come in. The AFS designers found that
relaxing the Unix filesystem semantics vastly improves scalability.

Many of the high-performance filesystems (not XFS, the _really_
high-performance filesystems) that you run on supercomputers also
violate Unix semantics in various ways. Yes, that breaks naive
apps, but that trade-off is generally accepted.