Re: 2.4.14-pre6

Neil Brown (neilb@cse.unsw.edu.au)
Thu, 1 Nov 2001 21:20:02 +1100 (EST)


On Wednesday October 31, torvalds@transmeta.com wrote:
>
> We have actually talked about some higher-level ordering of the dirty list
> for at least five years, but nobody has ever done it. And I bet you $5
> that you'll get (a) better throughput than by making the queues longer and
> (b) you'll have fine latency while you write and (c) that you want to
> order the write-queue anyway for filesystems that care about ordering.
>

But what is the "right" order? A RAID5 array might well respond
better to a different ordering than a JBOD would.

I've thought a bit about how best to give blocks to RAID5 so that
they can be written efficiently. I suspect the issues are similar
for normal disk I/O:

Currently the device (or block-device layer) doesn't see a block
until the upper levels really want the I/O to happen. There is a
little bit of a grace period between the submit_bh and the
run_task_queue(&tq_disk) when re-ordering can happen, but it isn't
very long. There is a bit more grace time while waiting for a turn
on the device. But that is still far less than the time most
buffers spend sitting in the cache.

What I would like is for a buffer, as soon as it is marked dirty,
to be passed down to the driver (or at least to the
block-device layer) with something like
 submit_bh(WRITEA, bh);
i.e. a write-ahead (or is it write-behind...).
The device handler (the elevator algorithm for normal disks, other
code for other devices) could keep them ordered in whatever way it
chooses, and feed them into the queues at some appropriate time.

A later submit_bh(WRITE, bh) would then push the buffer out if it
hadn't gone already.
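
To make the intended lifecycle concrete, here is a toy userspace
model of the two-stage submission. It is only a sketch: the buf
struct, the on_ahead_list flag and the function names are made up
for illustration, standing in for whatever the real block layer
would use; they are not 2.4 interfaces.

/* Toy model of the proposed WRITEA-then-WRITE lifecycle. */
#include <stdio.h>

struct buf {
    long block;            /* block number */
    int on_ahead_list;     /* handed to the driver as -ahead? */
    int dispatched;        /* already pushed to the device? */
};

/* Buffer just went dirty: give it to the driver early so it can
 * be sorted and merged at leisure.  No completion is promised. */
static void submit_writea(struct buf *bh)
{
    if (!bh->on_ahead_list && !bh->dispatched)
        bh->on_ahead_list = 1;  /* a real driver would also sort
                                 * it into its private list */
}

/* Upper layers now really want the write: push the buffer out,
 * unless the driver already chose to write it from the -ahead
 * list. */
static void submit_write(struct buf *bh)
{
    if (bh->dispatched)
        return;
    bh->on_ahead_list = 0;      /* unlink from the driver's list */
    bh->dispatched = 1;
    printf("block %ld dispatched\n", bh->block);
}

int main(void)
{
    struct buf b = { .block = 42 };
    submit_writea(&b);          /* at mark-dirty time */
    submit_write(&b);           /* at sync/flush time */
    return 0;
}

The point of the split is that the first call is purely advisory,
so the driver may write the block whenever it suits, while the
second is a hard deadline.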

The elevator code could possibly keep two sorted lists: one of
WRITEA (or READA) requests and one of WRITE (or READ) requests.
It would process the second, merging in some of the first as it
goes, perhaps capping it at two -ahead blocks for every immediate
block, and probably allowing larger numbers of -ahead blocks when
they are contiguous with an immediate block.
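
Here is a toy sketch of that merge policy, with both lists
flattened to sorted arrays of block numbers. The cap of two and
the contiguity bonus are from the proposal above; everything else
(array sizes, sample block numbers) is invented.

/* Drain the immediate list in order, slipping in -ahead blocks
 * that sort before the next immediate block: at most two per
 * immediate block, with no charge for blocks contiguous with the
 * one just issued. */
#include <stdio.h>
#include <limits.h>

#define NIMM   3
#define NAHEAD 6

int main(void)
{
    long imm[NIMM]     = { 10, 50, 90 };             /* WRITE/READ */
    long ahead[NAHEAD] = { 11, 12, 13, 40, 60, 95 }; /* WRITEA/READA */
    int ai = 0;

    for (int i = 0; i < NIMM; i++) {
        long issued = imm[i];
        long next   = (i + 1 < NIMM) ? imm[i + 1] : LONG_MAX;
        int budget  = 2;        /* -ahead quota per immediate block */

        printf("%ld (immediate)\n", issued);
        while (ai < NAHEAD && ahead[ai] < next &&
               (budget > 0 || ahead[ai] == issued + 1)) {
            if (ahead[ai] != issued + 1)
                budget--;       /* non-contiguous spends the quota */
            issued = ahead[ai++];
            printf("%ld (-ahead)\n", issued);
        }
    }
    /* leftovers stay queued until a later pass, or until an
     * explicit WRITE promotes them */
    return 0;
}

Running this issues 10 11 12 13 40 50 60 90 95: the three blocks
contiguous with 10 ride along for free, the stragglers are
rationed by the quota, and an immediate block never waits behind
more than a bounded number of non-contiguous -ahead blocks, which
is where the latency control comes from.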

RAID5 would do something a bit different. Possibly, whenever it
wanted to write a stripe, it would hunt through the -ahead list
(sort of like the 2.2 code did) for other blocks that could be
proactively added to the stripe.
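
As a toy illustration of that hunt (the stripe geometry here is
invented; real RAID5 maps a block to a stripe via the chunk size
and the number of data disks):

/* When a stripe must be written, trawl the -ahead list for other
 * dirty blocks in the same stripe, so parity is computed over as
 * full a stripe as possible. */
#include <stdio.h>

#define STRIPE_BLOCKS 4         /* data blocks per stripe (toy) */
#define NAHEAD 5

static long stripe_of(long block)
{
    return block / STRIPE_BLOCKS;
}

int main(void)
{
    long trigger = 17;          /* the block forcing the write */
    long ahead[NAHEAD] = { 5, 16, 18, 19, 33 };
    long stripe = stripe_of(trigger);

    printf("writing stripe %ld for block %ld\n", stripe, trigger);
    for (int i = 0; i < NAHEAD; i++)
        if (stripe_of(ahead[i]) == stripe)
            printf("  pulling in -ahead block %ld\n", ahead[i]);
    /* a fuller stripe means less read-modify-write for parity */
    return 0;
}

Here blocks 16, 18 and 19 get written along with 17 for free, and
the parity for that stripe is computed once instead of four times.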

This would allow a nice ordering of write-behind (and read-ahead)
requests but give the driver control of latency by allowing it to
limit the extent to which write-behind/read-ahead blocks can usurp the
position of other blocks.

Does that make any sense? Is it conceptually simple enough?

NeilBrown