Re: RFC - new raid superblock layout for md driver

Neil Brown (neilb@cse.unsw.edu.au)
Tue, 3 Dec 2002 08:38:25 +1100


On Friday November 22, joe@fib011235813.fsnet.co.uk wrote:
> On Wed, Nov 20, 2002 at 08:03:00AM -0800, Joel Becker wrote:
> > Hmm, what is the intended future interaction of DM and MD? Two
> > ways at the same problem? Just curious.
>
>
> There are a couple of good arguments for moving the _mapping_ code
> from md into dm targets:
>
> 1) Building a mirror is essentially just copying large amounts of data
> around, exactly what is needed to implement move functionality for
> arbitrarily remapping volumes. (see
> http://people.sistina.com/~thornber/pvmove_outline.txt).

Building a mirror may be just moving data around. But the interesting
issues in raid1 are more about maintaining a mirror: read balancing,
retry on error, hot spares, etc.

>
> So I've always had every intention of implementing a mirror target
> for dm.
>
> 2) Extending raid 5 volumes becomes very simple if they are dm targets
> since you just add another segment, this new segment could even
> have different numbers of stripes. eg,
>
>
> old volume new volume
> +--------------------+ +--------------------+--------------------+
> | raid5 across 3 LVs | => | raid5 across 3 LVs | raid5 across 5 LVs |
> +--------------------+ +--------------------+--------------------+
>
> Of course this could be done ATM by stacking 'bottom LVs' -> mds ->
> 'top LV', but that does create more intermediate devices and
> sacrifices space to the md metadata (eg, LVM has its own metadata
> and doesn't need md to duplicate it).

But is this something that you would *want* to do???

To my mind, the raid1/raid5 almost always lives below any LVM or
partitioning scheme. You use raid1/raid5 to combine drives (real,
physical drives) into virtual drives that are more reliable, and then
you partition them or whatever you want to do. raid1 and raid5 on top
of LVM bits just doesn't make sense to me.

I say 'almost' above because there is one situation where something
else makes sense. That is when you have a small number of drives in a
machine (3 to 5) and you really want RAID5 for all of these, but
booting only really works for RAID1. So you partition the drives, use
RAID1 for the first partitions, and RAID5 for the rest.
put boot (or maybe root) on the RAID1 bit and all your interesting data
on the RAID5 bit.

[[ I just had this really sick idea of creating a raid level that did
data duplication (aka mirroring) for the first N stripes, and
stripe/parity (aka raid5) for the remaining stripes. Then you just
combine your set of drives together with this level, and depending on
your choice of N, you get all raid1, all raid5, or a mixture which
allows booting off the early sectors....]]

>
> MD would continue to exist as a seperate driver, it needs to read and
> write the md metadata at the beginning of the physical volumes, and
> implement all the nice recovery/hot spare features. ie. dm does the
> mapping, md implements the policies by driving dm appropriately. If
> other volume managers such as EVMS or LVM want to implement features
> not provided by md, they are free to drive dm directly.
>
> I don't think it's a huge amount of work to refactor the md code, and
> now might be the right time if Neil is already changing things. I
> would be more than happy to work on this if Neil agrees.

I would probably need a more concrete proposal before I had something
to agree with :-)

I really think the raid1/raid5 parts of MD are distinctly different
from DM, and they should remain separate. However I am quite happy to
improve the interfaces so that seamless connections can be presented
by user-space tools.

For example, md currently gets its 'super-block' information by
reading the device directly. Soon it will have two separate routines
that get the super-block info, one for each super-block format. I
would be quite happy for there to be a way for DM to give a device to
MD along with some routine that provided super-block info by getting
it out of some near-by LVM superblock rather than out of the device
itself.

Similarly, if an API could be designed for MD to provide higher levels
with access to the spare parts of it's superblock, e.g. for partition
table information, then that might make sense.

To summarise: If you want tigher integration between MD and DM, do it
by defining useful interfaces, not by trying to tie them both together
into one big lump.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/