Re: raid5 multi-drive-failure and recovery?

Neil Brown (neilb@cse.unsw.edu.au)
Fri, 20 Dec 2002 11:59:17 +1100


On Thursday December 19, harik@chaos.ao.net wrote:
>
> I've got a problem. Two drives in my array developed bad sectors at
> the same time. They're in completely different places, but I can't
> read the disk because md simply fails out both of them, and marks the
> array unusable.
>
> Now, I can either buy new drives and dd the raw partition over, or I
> can hack the kernel to make it a bit smarter about unrecoverable
> reads.
>
> Obviously, if I have raid5_error not mark the drive bad, it will hammer
> away on it, failing over and over. My thought was to let the read
> fail, catch it in raid5_end_read_request then tag the stripe_head
> with the device that's failed. If one has already failed, return
> EIO. This way further reads on the stripe_head will go to the parity
> disk (until it's eventually freed; one IO error per stripe isn't too
> harsh a price to pay for disaster recovery).
>
> In 2.4.20, I'm at raid5.c:421 where we're about to call md_error.
> What happens to the bh from that point? Obviously, it's not up-to-date,
> so when 1 drive fails how does it get re-issued to be pulled from a parity
> drive to reconstruct it?

Nothing much happens to the bh. It just has Uptodate cleared.

A little later (line 433) we call release_stripe. Once all the active
IO requests have finished the stripe will be completely released and
handle_stripe gets called (from raid5d) to handle the stripe.
It notices there is a block that is being read (bh_read) but that
isn't uptodate, and so tries to schedule a read. Line 949 notices
that the block it wants to read is on a failed drive, so it causes all
blocks to be read in. Once they are read in, handle_stripe gets
called again, and this time it gets to line 955 where it computes the
block that you want.
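
To illustrate what "computes the block" means here: in RAID5 the missing
block is the XOR of all the other blocks in the stripe, data and parity
alike. The following is only a userspace sketch of that idea, not the
kernel's code. NDISKS, BLOCK_SIZE and compute_missing_block are names made
up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes chosen for illustration; not the kernel's values. */
#define NDISKS     4
#define BLOCK_SIZE 8

/*
 * Rebuild blocks[missing] by XOR-ing every other block in the stripe.
 * This works for a data block or for the parity block, because the XOR
 * of all blocks in a consistent RAID5 stripe is zero.
 */
static void compute_missing_block(unsigned char blocks[NDISKS][BLOCK_SIZE],
                                  int missing)
{
    assert(missing >= 0 && missing < NDISKS);

    /* Start from zero so the XOR accumulates only the other blocks. */
    memset(blocks[missing], 0, BLOCK_SIZE);

    for (int d = 0; d < NDISKS; d++) {
        if (d == missing)
            continue;
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            blocks[missing][i] ^= blocks[d][i];
    }
}
```

Note that this only works once every other block in the stripe has been
read in successfully, which is why handle_stripe has to be called a second
time before the compute step can run.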

What you want to do is add a 'failed_drive' field to the stripe_head
which gets initialised to -1 in init_stripe, and set at line 421
instead of calling md_error. Then in handle_stripe, line 875 changes
from
    if (!conf->disks[i].operational) {
to
    if (!conf->disks[i].operational || sh->failed_drive == i) {
and it might just work for you.
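
A rough userspace model of the suggestion, in case it helps to see the
pieces together. The structs below are stripped-down stand-ins, not the
real raid5 structures, and mock_init_stripe, note_read_error and
disk_unusable are hypothetical helpers invented for this sketch. The
"second failure returns an error" branch follows the idea in the quoted
message (the caller would turn it into EIO); it is an assumption about how
one might wire it up, not tested kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define NDISKS 4  /* hypothetical array width for this sketch */

/* Simplified stand-ins for the kernel structures. */
struct disk_info   { bool operational; };
struct raid5_conf  { struct disk_info disks[NDISKS]; };
struct stripe_head { int failed_drive; /* -1 means no per-stripe failure */ };

/* Mirrors the proposed initialisation of the new field in init_stripe. */
static void mock_init_stripe(struct stripe_head *sh)
{
    sh->failed_drive = -1;
}

/*
 * Would be called from the read-completion path instead of md_error.
 * Returns 0 if the failure was recorded on the stripe, or -1 if a
 * different device has already failed on this stripe, in which case the
 * block cannot be reconstructed and the caller would report EIO.
 */
static int note_read_error(struct stripe_head *sh, int disk)
{
    assert(disk >= 0 && disk < NDISKS);

    if (sh->failed_drive != -1 && sh->failed_drive != disk)
        return -1;
    sh->failed_drive = disk;
    return 0;
}

/* The modified test: failed array-wide, or failed for this stripe only. */
static bool disk_unusable(const struct raid5_conf *conf,
                          const struct stripe_head *sh, int i)
{
    return !conf->disks[i].operational || sh->failed_drive == i;
}
```

The point of keeping the failure on the stripe_head rather than on the
device is that other stripes keep reading from the same drive as normal,
so two drives with bad sectors in different places can still serve the
whole array between them.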

Good luck
NeilBrown