yyyeeeeessssssss..... bbbbuuuuuttttttt...
There was a detail (one of many probably) that I skipped in my brief
description of raid5.
Every block on the raid5 array has a two dimensional address
(discnumber, blocknumber)
This needs to be mapped into the linear address space expected by a
filesystem (unless you have a clever filesystem that understand two
dimensional addressing and copes with holes where the parity blocks
are).
Two extremes of ways to do this are:
abc- afk-
de-f bg-o
g-hi c-lp
-jkl -hmq
mno- din-
qp-r ej-r
where letters are logical block addresses, hyphens are parity blocks,
columns are drives, and rows a physical block numbers.
What is typically done it to define a cluster size, and then address
down the drive for a cluster, and then step across to the next driver
for the next cluster, so with a cluster size of 3, the above array
would be
adg-
beh-
cfi-
jm-p
hn-q
lo-r
(notice that the parity blocks come in clusters like the data blocks).
There is a trade off when choosing cluster size.
A cluster size of 1 (as it in the very first picture above) means that
any sequential access will probably use all drives, and so you should
see appropriate speed ups for read, and you might be able to avoid
reading old data for writes (as when you write a whole stripe you
don't need to read old data to calculate parity).
This is good if you have just a single thread accessing the array.
A large cluster size (e.g. 64k) means that most accesses will use only
one drive (for read) or two drives (for write - data + parity). This
means that multiple threads that access the array concurrently and not
always be tripping over each other (sometimes, but not always) (this
is called 'head contention').
There is a formular that I have seen, but cannot remember, which links
typical IO size, and typical number of concurrent threads to ideal
cluster size.
Issues of drive geometry come into this too. If you are going to read
any of a track, you may as well read all of it. So having a cluster
size that was a multiple of the track size would be good (if only
drives have constant sized tracks!!).
Back to your idea. Having each stripe be one filesystem block means
either having large filesystem blocks (meaning lots of wastage) or
having a cluster size of 1.
Unfortuately, with Linux Software Raid, the minimum cluster size is one
page, and the maximum filesystem block size is one page, so we cannot
try this out on linux to see how it actually works.
My understand of the way WOFL works is that it uses RAID4, so there are
no parity holes to worry about (RAID4 has all the parity on the one
drive) and WOFL knows about the 2-D structure.
It tries to lay out whole files (or large clusters of each file) on to
one disk each, but hopes to have enough files that need writing at one time
that it can write them all, one onto each disc, and thus keep all the
discs busy while writing, but still have reduced head contention when
reading.
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/