Re: [PATCH] RAID5 NULL Checking Bug Fix

Neil Brown (neilb@cse.unsw.edu.au)
Wed, 16 May 2001 13:33:16 +1000 (EST)


On Wednesday May 9, alan@lxorguk.ukuu.org.uk wrote:
> > Hi,
> > In drivers/md/raid5.c, the author does not check to see if alloc_page() returns
> > NULL. This patch also adds checks that return 1 (following the
> > error-path convention in the respective function).
>
> This is fixed in 2.4.4-ac and has been for a while (and a little more
> cleanly in some respects). However it needs someone who knows the raid code
> well to push the raid fixes on to Linus
>
> Alan

Can I put my hand up for that?

First patch (against 2.4.5-pre1):

The "ac" patches have code to swap the rdev->desc_nr of the failed
and spare drives when swapping in a spare after it has been
reconstructed.
This is needed to keep the "rdev" structures pointing to the right
raid-superblock slot.
However it isn't quite right.
It "ac" patches avoid the swap if either spare_rdev or failed_rdev
cannot be found, which is wrong (I put it in the avoid an oops. I
didn't know the full story then).

The situation that would cause one of these to not be found is if the
"failed" drive is actually a "missing" drive. i.e. the drive failed
and then was raid_hot_removed, or it just wan't present and working
at boot time.

When this happens, we still need to set spare_rdev->desc_nr to the
correct slot number.

The following patch gets this right for raid5.c and raid1.c (The only
places where it is relevant).

NeilBrown

(more patches to come. They will go to Linus, Alan, and linux-raid only).

--- ./drivers/md/raid5.c 2001/05/15 04:02:07 1.1
+++ ./drivers/md/raid5.c 2001/05/15 04:06:19 1.2
@@ -1704,6 +1704,7 @@
struct disk_info *tmp, *sdisk, *fdisk, *rdisk, *adisk;
mdp_super_t *sb = mddev->sb;
mdp_disk_t *failed_desc, *spare_desc, *added_desc;
+ mdk_rdev_t *spare_rdev, *failed_rdev;

print_raid5_conf(conf);
md_spin_lock_irq(&conf->device_lock);
@@ -1875,6 +1876,16 @@
/*
* do the switch finally
*/
+ spare_rdev = find_rdev_nr(mddev, spare_desc->number);
+ failed_rdev = find_rdev_nr(mddev, failed_desc->number);
+
+ /* There must be a spare_rdev, but there may not be a
+ * failed_rdev. That slot might be empty...
+ */
+ spare_rdev->desc_nr = failed_desc->number;
+ if (failed_rdev)
+ failed_rdev->desc_nr = spare_desc->number;
+
xchg_values(*spare_desc, *failed_desc);
xchg_values(*fdisk, *sdisk);

--- ./drivers/md/raid1.c 2001/05/15 04:02:07 1.1
+++ ./drivers/md/raid1.c 2001/05/15 04:06:19 1.2
@@ -832,6 +832,7 @@
struct mirror_info *tmp, *sdisk, *fdisk, *rdisk, *adisk;
mdp_super_t *sb = mddev->sb;
mdp_disk_t *failed_desc, *spare_desc, *added_desc;
+ mdk_rdev_t *spare_rdev, *failed_rdev;

print_raid1_conf(conf);
md_spin_lock_irq(&conf->device_lock);
@@ -989,6 +990,16 @@
/*
* do the switch finally
*/
+ spare_rdev = find_rdev_nr(mddev, spare_desc->number);
+ failed_rdev = find_rdev_nr(mddev, failed_desc->number);
+
+ /* There must be a spare_rdev, but there may not be a
+ * failed_rdev. That slot might be empty...
+ */
+ spare_rdev->desc_nr = failed_desc->number;
+ if (failed_rdev)
+ failed_rdev->desc_nr = spare_desc->number;
+
xchg_values(*spare_desc, *failed_desc);
xchg_values(*fdisk, *sdisk);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/