Re: [PATCH] io stalls

Chris Mason (mason@suse.com)
26 Jun 2003 11:55:12 -0400


This is a MIME-formatted message. If you see this text it means that your
E-mail software does not support MIME-formatted messages.

--=_courier-14406-1056644312-0001-2
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

On Thu, 2003-06-26 at 09:04, Nick Piggin wrote:

> >One of the things I tried in this area was basically queue ownership.
> >When each process woke up, he was given strict ownership of the queue
> >and could submit up to N number of requests. One process waited for
> >ownership in a yield loop for a max limit of a certain number of
> >jiffies, all the others waited on the request queue.
> >
>
> Not sure exactly what you mean by one process waiting for ownership
> in a yield loop, but why don't you simply allow the queue "owner" to
> submit up to a maximum of N requests within a time limit. Once either
> limit expires (or, rarely, another might become owner -) the process
> would just be put to sleep by the normal queue_full mechanism.
>

You need some way to wakeup the queue after that time limit has expired,
in case the owner never submits another request. This can either be a
timer or a process in a yield loop. Given that very short expire time I
set (10 jiffies), I went for the yield loop.

> >
> >It generally increased the latency in __get_request wait by a multiple
> >of N. I didn't keep it because the current patch is already full of
> >subtle interactions, I didn't want to make things more confusing than
> >they already were ;-)
> >
>
> Yeah, something like that. I think that in a queue full situation,
> the processes are wanting to submit more than 1 request though. So
> the better thoughput you can achieve by batching translates to
> better effective throughput. Read my recent debate with Andrea
> about this though - I couldn't convince him!
>

Well, it depends ;-) I think we've got 3 basic kinds of procs during a
q->full condition:

1) wants to submit lots of somewhat contiguous io
2) wants to submit a single io
3) wants to submit lots of random io

>From a throughput point of view, we only care about giving batch
ownership to #1. giving batch ownership to #3 will help reduce context
switches, but if it helps throughput than the io wasn't really random
(you've got a good point about locality below, drive write caches make a
huge difference there).

The problem I see in 2.4 is the elevator can't tell any of these cases
apart, so any attempt at batch ownership is certain to be wrong at least
part of the time.

> I have seen much better maximum latencies, 2-3 times the
> throughput, and an order of magnitude less context switching on
> many threaded tiobench write loads when using batching.
>
> In short, measuring get_request latency won't give you the full
> story.
>

Very true. But get_request latency is the minimum amount of time a
single read is going to wait (in 2.4.x anyway), and that is what we need
to focus on when we're trying to fix interactive performance.

> >
> >The real problem with this approach is that we're guessing about the
> >number of requests a given process wants to submit, and we're assuming
> >those requests are going to be highly mergable. If the higher levels
> >pass these hints down to the elevator, we should be able to do a better
> >job of giving both low latency and high throughput.
> >
>
> No, the numbers (batch # requests, batch time) are not highly scientific.
> Simply when a process wakes up, we'll let them submit a small burst of
> requests before they go back to sleep. Now in 2.5 (mm) we can cheat and
> make this more effective, fair, and without possible missed wakes because
> io contexts means that multiple processes can be batching at the same
> time, and dynamically allocated requests means it doesn't matter if we
> go a bit over the queue limit.
>

I agree 2.5 has a lot more room for the contexts to be effective, and I
think they are a really good idea.

> I think a decent solution for 2.4 would be to simply have the one queue
> owner, but he allowed the queue to fall below the batch limit, wake
> someone else and make them the owner. It can be a bit less fair, and
> it doesn't work across queues, but they're less important cases.
>
> >
> >Between bios and the pdflush daemons, I think 2.5 is in pretty good
> >shape to do what we need. I'm not 100% sure we need batching when the
> >requests being submitted are not highly mergable, but I haven't put lots
> >of thought into that part yet.
> >
>
> No, there are a couple of problems here.
> First, good locality != sequential. I saw tiobench 256 random write
> throughput _doubled_ because each process is writing within its own
> file.
>
> Second, mergeable doesn't mean anything if your request size only
> grows to say 128KB (IDE). I saw tiobench 256 sequential writes on IDE
> go from ~ 25% peak throughput to ~70% (4.85->14.11 from 20MB/s disk)

Well, play around with raw io, my box writes at roughly disk speed with
128k synchronous requests (contiguous writes).

> Third, context switch rate. In the latest IBM regression tests,
> tiobench 64 on ext2, 8xSMP (so don't look at throughput!), average
> cs/s was about 2500 with mainline (FIFO request allocation), and
> 140 in mm (batching allocation). So nearly 20x better. This might
> not be due to batching alone, but I didn't see any other obvious
> change in mm.
>

Makes sense.

> >
> >Anyway for 2.4 I'm not sure there's much more we can do. I'd like to
> >add tunables to the current patch, so userland can control the max io in
> >flight and a simple toggle between throughput mode and latency mode on a
> >per device basis. It's not perfect but should tide us over until 2.6.
> >
> >
>
> The changes do seem to be a critical fix due to the starvation issue,
> but I'm worried that they take a big step back in performance under
> high disk load. I found my FIFO mechanism to be unacceptably slow for
> 2.5.

Me too, but I'm not sure how to fix it other than a userspace knob to
turn off the q->full checks for server workloads. Andrea's
elevator-lowlatency alone has pretty good throughput numbers, since it
still allows request stealing. Its get_request_wait latency numbers
aren't horrible either, it only suffers in a few corner cases.

But, if someone wants to play with this more, I've attached a quick
remerge of my batch ownership code. I made a read and write owner, so
that a reader doing a single request doesn't grab ownership and make all
the writes wait.

It does make throughput better overall, and it also makes latencies
worse overall. We'll probably get similar results just by disabling
q->full in io-stalls-7, but the batch ownership does a better job of
limiting get_request latencies at a fixed (although potentially large)
number.

lat-stat-5.diff goes on top of io-stalls-7.diff from yesterday
batch_owner.diff goes on top of lat-stat-5.diff.

-chris

--=_courier-14406-1056644312-0001-2
Content-Type: text/plain; name="batch_owner.diff"; charset=iso-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=batch_owner.diff

===== drivers/block/ll_rw_blk.c 1.47 vs edited =====
--- 1.47/drivers/block/ll_rw_blk.c Thu Jun 26 09:20:08 2003
+++ edited/drivers/block/ll_rw_blk.c Thu Jun 26 10:52:17 2003
@@ -592,6 +592,8 @@
q->full = 0;
q->can_throttle = 0;

+ memset(q->batch, 0, sizeof(struct queue_batch) * 2);
+
reset_stats(q);

/*
@@ -606,6 +608,48 @@
blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
}

+#define MSEC(x) ((x) * 1000 / HZ)
+#define BATCH_MAX_AGE 100
+int grab_batch_ownership(request_queue_t *q, int rw)
+{
+ struct task_struct *tsk = current;
+ unsigned long age;
+ struct queue_batch *batch = q->batch + rw;
+
+ if (batch->batch_waiter)
+ return 0;
+ if (!batch->batch_owner)
+ goto grab;
+ batch->batch_waiter = tsk;
+ while(1) {
+ age = jiffies - batch->batch_jiffies;
+ if (!batch->batch_owner || MSEC(age) > BATCH_MAX_AGE)
+ break;
+ set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&io_request_lock);
+ schedule();
+ spin_lock_irq(&io_request_lock);
+ }
+ batch->batch_waiter = NULL;
+grab:
+ batch->batch_owner = tsk;
+ batch->batch_jiffies = jiffies;
+ batch->batch_remaining = q->batch_requests;
+ return 1;
+}
+
+void decrement_batch_request(request_queue_t *q, int rw)
+{
+ struct queue_batch *batch = q->batch + rw;
+ if (batch->batch_owner == current) {
+ batch->batch_remaining--;
+ if (!batch->batch_remaining ||
+ MSEC(jiffies - batch->batch_jiffies) > BATCH_MAX_AGE) {
+ batch->batch_owner = NULL;
+ }
+ }
+}
+
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
@@ -625,6 +669,7 @@
rq->cmd = rw;
rq->special = NULL;
rq->q = q;
+ decrement_batch_request(q, rw);
} else
q->full = 1;
return rq;
@@ -635,7 +680,7 @@
*/
static inline struct request *get_request(request_queue_t *q, int rw)
{
- if (q->full)
+ if (q->full && q->batch[rw].batch_owner != current)
return NULL;
return __get_request(q, rw);
}
@@ -698,25 +743,28 @@

static struct request *__get_request_wait(request_queue_t *q, int rw)
{
- register struct request *rq;
+ register struct request *rq = NULL;
unsigned long wait_start = jiffies;
unsigned long time_waited;
DECLARE_WAITQUEUE(wait, current);

add_wait_queue_exclusive(&q->wait_for_requests, &wait);

+ spin_lock_irq(&io_request_lock);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- spin_lock_irq(&io_request_lock);
if (q->full || blk_oversized_queue(q)) {
- __generic_unplug_device(q);
+ if (blk_oversized_queue(q))
+ __generic_unplug_device(q);
spin_unlock_irq(&io_request_lock);
schedule();
spin_lock_irq(&io_request_lock);
+ if (!grab_batch_ownership(q, rw))
+ continue;
}
rq = __get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
} while (rq == NULL);
+ spin_unlock_irq(&io_request_lock);
remove_wait_queue(&q->wait_for_requests, &wait);
current->state = TASK_RUNNING;

@@ -978,9 +1026,9 @@
list_add(&req->queue, &q->rq.free);
if (q->rq.count >= q->batch_requests && !oversized_batch) {
smp_mb();
- if (waitqueue_active(&q->wait_for_requests))
+ if (waitqueue_active(&q->wait_for_requests)) {
wake_up(&q->wait_for_requests);
- else
+ } else
clear_full_and_wake(q);
}
}
===== include/linux/blkdev.h 1.25 vs edited =====
--- 1.25/include/linux/blkdev.h Thu Jun 26 09:20:08 2003
+++ edited/include/linux/blkdev.h Thu Jun 26 10:50:17 2003
@@ -69,6 +69,15 @@
struct list_head free;
};

+struct queue_batch
+{
+ struct task_struct *batch_owner;
+ struct task_struct *batch_waiter;
+ unsigned long batch_jiffies;
+ int batch_remaining;
+
+};
+
struct request_queue
{
/*
@@ -141,7 +150,7 @@
* threshold
*/
int full:1;
-
+
/*
* Boolean that indicates you will use blk_started_sectors
* and blk_finished_sectors in addition to blk_started_io
@@ -162,6 +171,9 @@
* Tasks wait here for free read and write requests
*/
wait_queue_head_t wait_for_requests;
+
+ struct queue_batch batch[2];
+
unsigned long max_wait;
unsigned long min_wait;
unsigned long total_wait;
@@ -278,7 +290,7 @@

#define MAX_SEGMENTS 128
#define MAX_SECTORS 255
-#define MAX_QUEUE_SECTORS (2 << (20 - 9)) /* 4 mbytes when full sized */
+#define MAX_QUEUE_SECTORS (4 << (20 - 9)) /* 4 mbytes when full sized */
#define MAX_NR_REQUESTS 1024 /* 1024k when in 512 units, normally min is 1M in 1k units */

#define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

--=_courier-14406-1056644312-0001-2
Content-Type: text/plain; name="lat-stat-5.diff"; charset=iso-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=lat-stat-5.diff

reverted:
--- b/drivers/block/blkpg.c Thu Jun 26 09:12:14 2003
+++ a/drivers/block/blkpg.c Thu Jun 26 09:12:14 2003
@@ -261,6 +261,7 @@
return blkpg_ioctl(dev, (struct blkpg_ioctl_arg *) arg);

case BLKELVGET:
+ blk_print_stats(dev);
return blkelvget_ioctl(&blk_get_queue(dev)->elevator,
(blkelv_ioctl_arg_t *) arg);
case BLKELVSET:
reverted:
--- b/drivers/block/ll_rw_blk.c Thu Jun 26 09:12:14 2003
+++ a/drivers/block/ll_rw_blk.c Thu Jun 26 09:12:14 2003
@@ -490,6 +490,56 @@
spin_lock_init(&q->queue_lock);
}

+void blk_print_stats(kdev_t dev)
+{
+ request_queue_t *q;
+ unsigned long avg_wait;
+ unsigned long min_wait;
+ unsigned long high_wait;
+ unsigned long *d;
+
+ q = blk_get_queue(dev);
+ if (!q)
+ return;
+
+ min_wait = q->min_wait;
+ if