Re: RFC - new raid superblock layout for md driver

Kenneth D. Merry (ken@kdm.org)
Thu, 21 Nov 2002 17:08:23 -0700


On Wed, Nov 20, 2002 at 10:03:26 +0000, Anton Altaparmakov wrote:
> Hi,
>
> On Wed, 20 Nov 2002, Neil Brown wrote:
> > I (and others) would like to define a new (version 1) format that
> > resolves the problems in the current (0.90.0) format.
> >
> > The code in 2.5.lastest has all the superblock handling factored out so
> > that defining a new format is very straight forward.
> >
> > I would like to propose a new layout, and to receive comment on it..
>
> If you are making a new layout anyway, I would suggest to actually add the
> complete information about each disk which is in the array into the raid
> superblock of each disk in the array. In that way if a disk blows up, you
> can just replace the disk use some to be written (?) utility to write the
> correct superblock to the new disk and add it to the array which then
> reconstructs the disk. Preferably all of this happens without ever
> rebooting given a hotplug ide/scsi controller. (-;
>
> From a quick read of the layout it doesn't seem to be possible to do the
> above trivially (or certainly not without help of /etc/raidtab), but
> perhaps I missed something...
>
> Also, autoassembly would be greatly helped if the superblock contained the
> uuid for each of the devices contained in the array. It is then trivial to
> unplug all raid devices and move them around on the controller and it
> would still just work. Again I may be missing something and that is
> already possible to do trivially.

This is a good idea. Having all of the devices listed in the metadata on
each disk is very helpful. (See below for why.)

Here are some of my ideas about the features you'll want out of a new type
of metadata:

[ these you've already got ]

- each array has a unique identifier (you've got this already)
- each disk/partition/component has a unique identifier (you've got this
already)
- a monotonically increasing serial number that gets incremented every
time you write out the metadata (you've got this, the 'events' field)

[ these are features I think would be good to have ]

- Per-array state that lets you know whether you're doing a resync,
reconstruction, verify, verify and fix, and so on. This is part of the
state you'll need to do checkpointing -- picking up where you left off
after a reboot during the middle of an operation.

- Per-array block number that tells you how far along you are in a verify,
resync, reconstruction, etc. If you reboot, you can, for example, pick
a verify back up where you left off.

- Enough per-disk state so you can determine, if you're doing a resync or
reconstruction, which disk is the target of the operation. When I was
doing a lot of work on md a while back, one of the things I ran into is
that when you do a resync of a RAID-1, it always resyncs from the first
to the second disk, even if the first disk is the one out of sync. (I
changed this, with Adaptec metadata at least, so it would resync onto
the correct disk.)
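
As a rough illustration of the three state items above (which operation was
running, how far it got, and which disk is being written), here is a minimal
Python sketch. All of the names in it (ArrayCheckpoint, resume_point, the
operation strings) are made up for illustration; none of them come from md
or from the proposed version 1 layout.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical operation names; not taken from md.
IDLE, RESYNC, RECONSTRUCT, VERIFY, VERIFY_FIX = (
    "idle", "resync", "reconstruct", "verify", "verify+fix")


@dataclass
class ArrayCheckpoint:
    operation: str = IDLE              # per-array: what was running
    next_block: int = 0                # per-array: first block not yet processed
    target_uuid: Optional[str] = None  # per-disk: disk being written, if any


def resume_point(cp: ArrayCheckpoint, array_blocks: int):
    """Return (operation, start_block, target_uuid) to continue with after a
    reboot, or None if nothing was in progress."""
    if cp.operation == IDLE or cp.next_block >= array_blocks:
        return None
    return cp.operation, cp.next_block, cp.target_uuid
```

With a record like this written out alongside the rest of the metadata, a
resync interrupted at block 512 would restart at block 512, onto the disk
named in target_uuid, rather than from block 0 onto whichever disk happens
to be second.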

- Each component knows about every other component in the array. (It
knows by UUID, not just that there are N other devices in the array.)
This is an important piece of information:
- You can compose the array now, using each disk's set_uuid and the
position of the device in the array, and by using the events
field to filter out the older of two disks that claim the same
position.

The problem comes in more complicated scenarios. For example:
- user pulls one disk out of a RAID-1 with a spare
- md reconstructs onto the spare
- user shuts down machine, pulls the (former) spare that is
now part of the array, and puts back the disk that he
originally pulled.

So now you've got a scenario where you have a disk that claims to
be part of the array (same set_uuid), but its events field is a
little behind. You could just resync the disk, since it is out of
date but still claims to be part of the array. But you'd be
back in the same position if the user pulls the disk again and
puts the former spare back in -- you'd have to resync again.

If each disk had a list of the uuids of every disk in the array,
you could tell from the disk table on the "freshest" disk that
the disk the user stuck back in isn't part of the array, despite
the fact that it claims to be. (It was at one point, and then
was removed.) You can then make the user add it back explicitly,
instead of just resyncing onto it.
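
The check described above can be sketched roughly as follows. This is only
an illustration under assumptions: the Superblock class and its "members"
field are hypothetical stand-ins for a per-disk table of component UUIDs,
and classify() assumes it is handed at least one superblock, all with the
same set_uuid.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Superblock:
    set_uuid: str        # array identifier
    dev_uuid: str        # this component's identifier
    events: int          # incremented on every metadata write
    members: List[str]   # hypothetical: UUIDs of every current component


def classify(superblocks: List[Superblock]):
    """Split disks claiming one set_uuid into (accepted, rejected) UUID lists,
    trusting the member table on the disk with the highest event count."""
    freshest = max(superblocks, key=lambda sb: sb.events)
    accepted = [sb.dev_uuid for sb in superblocks
                if sb.dev_uuid in freshest.members]
    rejected = [sb.dev_uuid for sb in superblocks
                if sb.dev_uuid not in freshest.members]
    return accepted, rejected
```

In the scenario above, the surviving original disk carries the newest event
count and a member table naming itself and the former spare. The disk the
user put back is not in that table, so it ends up in the rejected list and
can be held back until the user adds it explicitly.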

- Possibly the ability to set up multilevel arrays within a given piece of
metadata. As far as multilevel arrays go, there are two basic
approaches to the metadata:
- Integrated metadata defines all levels of the array in a single
chunk of metadata. So, for example, by reading metadata off of
sdb, you can figure out that it is a component of a RAID-1 array,
and that that RAID-1 array is a component of a RAID-10.

There are a couple of advantages to integrated metadata:
- You can keep state that applies to the whole array
(clean/dirty, for example) in one place.
- It helps in autoconfiguring an array, since you don't
have to go through multiple steps to find out all the
levels of an array. You just read the metadata from one
place on one disk, and you've got all of them.

There are a couple of disadvantages to integrated metadata:
- Possibly reduced/limited space for defining multiple
array levels or arrays with lots of disks. This is not a
problem, though, given sufficient metadata space.

- Marginally more difficulty handling metadata updates,
depending on how you handle your multilevel arrays. If
you handle them like md currently does (separate block
devices for each level and component of the array), it'll
be pretty difficult to use integrated metadata.

- Recursive metadata defines each level of the array separately.
So, for example, you'd read the metadata from the end of a disk
and determine it is part of a RAID-1 array. Then, you configure
the RAID-1 array, and read the metadata from the end of that
array, and determine it is part of a RAID-0 array. So then you
configure the RAID-0 array, look at the end, fail to find
metadata, and figure out that you've reached the top level of the
array.

This is almost how md currently does things, except that it
really has no mechanism for autoconfiguring multilevel arrays.

There are a couple of advantages to recursive metadata:
- It is easier to handle metadata updates for multilevel
arrays, especially if the various levels of the array are
handled by different block devices, as md does.

- You've potentially got more space for defining disks as
part of the array, since you're only defining one level
at a time.

There are a couple of disadvantages to recursive metadata:
- You have to have multiple copies of any state that
applies to the whole array (e.g. clean/dirty).

- More windows of opportunity for incomplete metadata
writes. Since metadata is in multiple places, there are
more opportunities for scenarios where you'll have
metadata for one part of the array written out, but not
another part before you crash or a disk crashes...etc.
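
To make the recursive approach concrete, here is a small Python sketch of
the "configure a level, then look for metadata on the result" loop. It is a
toy model, not md code: read_metadata is a hypothetical callback standing in
for reading the superblock from the end of a device, and the sketch does not
check that all members of an array are actually present.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical metadata record: (array name, RAID level, member device names).
Metadata = Tuple[str, str, List[str]]


def assemble_recursive(devices: List[str],
                       read_metadata: Callable[[str], Optional[Metadata]]):
    """Repeatedly assemble arrays from devices until a pass finds no metadata.
    Returns the list of (array, level, members) in the order configured."""
    configured = []
    pending = list(devices)
    seen = set()
    while pending:
        found: Dict[str, Metadata] = {}
        for dev in pending:
            md = read_metadata(dev)
            if md is not None and md[0] not in seen:
                found[md[0]] = md
        pending = []
        for name, md in found.items():
            configured.append(md)
            seen.add(name)
            pending.append(name)   # the new array may itself be a component
    return configured
```

Given four disks whose metadata describes two RAID-1 pairs, and RAID-1
arrays whose metadata describes a RAID-0 on top, the loop configures the two
RAID-1 arrays on the first pass, the RAID-0 on the second, and stops on the
third when the RAID-0 has no metadata of its own. With integrated metadata
the same result would come from a single read.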

I know Neil has philosophical issues with autoconfiguration (or perhaps
in-kernel autoconfiguration), but it really is helpful, especially in
certain situations.

As for recursive versus integrated metadata, it would be nice if md could
handle autoconfiguration with either type of multilevel array. The reason
I say this is that Adaptec HostRAID adapters use integrated metadata.
So if you want to support multilevel arrays with md on HostRAID adapters,
you have to have support for multilevel arrays with integrated metadata.

When I did the first port of md to work on HostRAID, I pretty much had to
skip doing RAID-10 support because it wasn't structurally feasible to
autodetect and configure a multilevel array. (I ended up doing a full
rewrite of md that I was partially done with when I got laid off from
Adaptec.)

Anyway, if you want to see the Adaptec HostRAID support, which includes
metadata definitions:

http://people.freebsd.org/~ken/linux/md.html

The patches are against 2.4.18, but you should be able to get an idea of
what I'm talking about as far as integrated metadata goes.

This is all IMO; maybe it'll be helpful, maybe not, but hopefully these
ideas are worth considering.

Ken

-- 
Kenneth Merry
ken@kdm.org