Re: hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }

Andre Hedrick (andre@linux-ide.org)
Sun, 23 Dec 2001 14:08:26 -0800 (PST)


On Sun, 23 Dec 2001, Alan Cox wrote:

> > I am just waiting to rip somebody's lunch who disagrees with me on the
> > importance of data recovery and relocation upon media failures. Anyone
> > claiming it is not important because the predictive nature of device
> > failure is unknown should maybe ask an expert in the industry to get
> > the answer. So let's have some fun and light off a really good flame war!
>
> Why do you want to pass it up to the file system, when the fs probably
> doesn't know what to do about it ? I can see why you want to pass it back
> as far up as you can until someone claims it.

Well, that is a design flaw in the general FS models from the beginning,
and it should be fixed. I recall everyone banging the drum for SCSI because
it could handle this problem, and surprise, it fails too. So it comes back
to the FS being responsible for forcing the completion of its requests,
regardless of the overhead.

> Or are you assuming in most cases file systems would have an "io_failure"
> function that reissues the request a few times then marks the fs read only
> and screams ?

Now that is a sensible solution, but I do not know if it would work in all
cases or how best it should be handled. Regardless, the responsibility for
the content is primarily the FS's. Should an app close a file while its
data is still in the buffer cache, there is no way to notify the app, the
user, or anything else associated with the creation/closing of that file
if a write error occurs.
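
About the only defense an application has today is to fsync() before
close() and check every return value, because a deferred write error has a
chance to surface at fsync() time. A quick user-space sketch, illustrative
only and not from any particular app:

/* Illustrative user-space sketch, not from any particular app: the
 * only way an application can catch a deferred write error today is
 * to fsync() before close() and check every return value. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int save_file(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len)
		goto fail;
	if (fsync(fd) < 0)	/* forces the data out of the buffer cache; */
		goto fail;	/* a media error can surface here ...       */
	if (close(fd) < 0)	/* ... or, more rarely, here                */
		return -1;
	return 0;
fail:
	close(fd);
	return -1;
}

int main(void)
{
	const char *msg = "important data\n";

	if (save_file("/tmp/example", msg, strlen(msg)) < 0)
		perror("save_file");	/* at least the app finds out */
	return 0;
}

Most apps do not bother with any of that, which is exactly why the kernel
ends up holding the bag.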

So we have user-space believing the write was a success.
The FS doing an initial ACK of success.
BLOCK generating the request to the low_level.
LOW_LEVEL goes, OH CRAP, I am having a problem and can not complete.

LOW_LEVEL goes, HEY BLOCK, we have a problem.
BLOCK: that is nice, whatever ....

This is a bad model, and worse is:

LOW_LEVEL goes, HEY BLOCK, we have a problem.
BLOCK goes, HEY FS, we have an annoying LOW_LEVEL asking for a reissue.
FS: duh, which way did the rabbit go ...
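
What I am after is roughly the shape below: LOW_LEVEL reports the failure,
BLOCK hands it up instead of shrugging, and the FS either reissues/remaps
or goes read-only and screams. This is a toy model in plain C; every name
in it (fs_ops, io_failure, blk_complete) is invented for illustration, and
none of it is kernel code:

/* Toy model of the propagation I want.  This is NOT kernel code and
 * every name in it (fs_ops, io_failure, blk_complete) is made up for
 * illustration; it only shows the shape of the callback chain. */
#include <errno.h>
#include <stdio.h>

#define MAX_RETRIES 3

struct fs_ops {
	/* return 0 if the FS dealt with the failure (reissued, remapped,
	 * remounted read-only), nonzero if it punts */
	int (*io_failure)(unsigned long block, int error);
};

static int fs_read_only;

/* FS level: reissue a few times, then scream and go read-only */
static int example_io_failure(unsigned long block, int error)
{
	int try;

	for (try = 1; try <= MAX_RETRIES; try++)
		printf("FS: reissuing block %lu (attempt %d)\n", block, try);
	printf("FS: giving up on block %lu, remounting read-only\n", block);
	fs_read_only = 1;
	return 0;
}

static struct fs_ops example_fs = { example_io_failure };

/* BLOCK level: instead of "that is nice, whatever", hand it up */
static void blk_complete(struct fs_ops *fs, unsigned long block, int error)
{
	if (!error)
		return;
	if (fs && fs->io_failure && fs->io_failure(block, error) == 0)
		return;		/* someone above us claimed the failure */
	printf("BLOCK: unclaimed I/O error %d on block %lu\n", error, block);
}

int main(void)
{
	/* LOW_LEVEL goes, HEY BLOCK, we have a problem */
	blk_complete(&example_fs, 12345UL, -EIO);
	printf("FS is now %s\n", fs_read_only ? "read-only" : "read-write");
	return 0;
}

The whole point is that return value: somebody above the block layer has
to claim the failure, or it dies on the floor like it does today.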

So what is the right answer, 'cause blackholing the file is wrong, and I
do not see a way for it to be reissued. We should note that the kernel now
owns the responsibility for completion in this case, regardless of what
others think.

> Incidentally the EVMS IBM volume manager code does support bad block
> remapping in some situations.

Well, managing bad blocks can be a major pain, but it is the right thing
to do. Now what is the cost, given the surge in journaling FSes that keep
logs? The cost is coming up with a sane way to manage the mess. Even
before we get to managing the mess, we have to be able to reissue the
request to a reallocated location (a sketch of that bookkeeping follows
the list below), and make all kinds of noise that we are making heroic
attempts to save the data. These may include --

periodically bombing the open terminals with messages.
clearing all outstanding requests and remounting read-only (caution with /).
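
The reallocation bookkeeping itself is not rocket science. Something on
the order of this little table would do for the reissue path; it is purely
a sketch I made up for this mail, not EVMS code and not anything in the
tree:

/* Sketch of trivial bad-block remap bookkeeping: a failed sector is
 * assigned a spare, the write is reissued against the spare, and all
 * later requests are translated through the table.  Purely made up
 * for illustration -- this is not EVMS code and not kernel code. */
#include <stdio.h>

#define MAX_REMAPS	64
#define SPARE_START	1000000UL	/* first spare sector (invented) */

struct remap_entry {
	unsigned long bad;	/* sector that returned a media error */
	unsigned long spare;	/* replacement sector */
};

static struct remap_entry remap_table[MAX_REMAPS];
static int nr_remaps;

/* record the failed sector and return the spare to reissue against */
static unsigned long remap_sector(unsigned long bad)
{
	if (nr_remaps >= MAX_REMAPS)
		return bad;	/* out of spares: nothing left to do */
	remap_table[nr_remaps].bad = bad;
	remap_table[nr_remaps].spare = SPARE_START + nr_remaps;
	printf("remap: sector %lu -> %lu, reissuing and making noise\n",
	       bad, remap_table[nr_remaps].spare);
	return remap_table[nr_remaps++].spare;
}

/* translate every subsequent request through the table */
static unsigned long resolve_sector(unsigned long sector)
{
	int i;

	for (i = 0; i < nr_remaps; i++)
		if (remap_table[i].bad == sector)
			return remap_table[i].spare;
	return sector;
}

int main(void)
{
	unsigned long spare = remap_sector(4711UL);	/* write failed here */

	printf("future I/O to 4711 now resolves to %lu (spare %lu)\n",
	       resolve_sector(4711UL), spare);
	return 0;
}

The hard part is not the table, it is deciding who owns it and how the
journaling FSes log it, which is the mess I am talking about above.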

The issue is that we are doing nothing to address the point, and it is
arrogant of the maintainers of the various storage classes and the
supported upper layers to be unwilling to address it.

This reply may seem a little fragmented because I had to answer it in
stages between distractions.

Regards,

Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project
