Re: [PATCH] deadline io scheduler

Andrew Morton (akpm@digeo.com)
Wed, 25 Sep 2002 23:15:58 -0700


This is looking good. With a little more tuning and tweaking we should
have this problem solved.

The horror test was:

cd /usr/src/linux
dd if=/dev/zero of=foo bs=1M count=4000 &
sleep 5
time cat kernel/*.c > /dev/null

Testing on IDE (this matters - SCSI is very different):

- On 2.5.38 + souped-up VM it was taking 25 seconds.

- My read-latency patch took 1 second-odd.

- Linus' rework yesterday was taking 0.3 seconds.

- With Linus' current tree (with the deadline scheduler) it now takes
  5 seconds.

Let's see what happens as we vary read_expire:

read_expire (ms)        time cat kernel/*.c (secs)
      500                         5.2
      400                         3.8
      300                         4.5
      200                         3.9
      100                         5.1
       50                         5.0

Well, that was a bit of a placebo ;)

Let's leave read_expire at 500ms and diddle writes_starved:

writes_starved (units)  time cat kernel/*.c (secs)
       1                          4.8
       2                          4.4
       4                          4.0
       8                          4.9
      16                          4.9

Now alter fifo_batch, everything else default:

fifo_batch (units)      time cat kernel/*.c (secs)
      64                          5.0
      32                          2.0
      16                          0.2
       8                          0.17

OK, that's a winner.

Here's something really nice with the deadline scheduler. I was
madly catting five separate kernel trees (five reading processes)
and then started a big `dd', tunables at default:

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 9 0 6008 2460 8304 324716 0 0 2048 0 1102 254 13 88 0
0 7 0 6008 2600 8288 324480 0 0 1800 0 1114 266 0 100 0
0 6 0 6008 2452 8292 324520 0 0 2432 0 1126 287 29 71 0
0 6 0 6008 3160 8292 323952 0 0 3568 0 1132 312 0 100 0
0 6 0 6008 2860 8296 324148 128 0 2984 0 1119 281 17 83 0
1 6 0 5984 2856 8264 323816 352 0 5240 0 1162 479 0 100 0
0 7 1 5984 4152 7876 324068 0 0 1648 28192 1215 1572 1 99 0
0 9 2 6016 3136 7300 328568 0 180 1232 37248 1324 1201 3 97 0
0 9 2 6020 5260 5628 329212 0 4 1112 29488 1296 560 0 100 0
0 9 3 6020 3548 5596 330944 0 0 1064 35240 1302 629 6 94 0
0 9 3 6020 3412 5572 331352 0 0 744 31744 1298 452 6 94 0
0 9 2 6020 1516 5576 333352 0 0 888 31488 1283 467 0 100 0
0 9 2 6020 3528 5580 331396 0 0 1312 20768 1251 385 0 100 0

Note how the read rate only roughly halved, and we sustained a high
volume of writeback. This is excellent.
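
The mix is easy enough to reproduce - five tree-walking readers racing
one big writer. Something like this should do it (the five tree paths
are placeholders; use whatever trees you have lying around):

for t in /usr/src/linux-a /usr/src/linux-b /usr/src/linux-c \
         /usr/src/linux-d /usr/src/linux-e
do
        # one reading process per tree, catting everything in it
        (find $t -type f | xargs cat > /dev/null) &
done
sleep 5
# same big dd as the horror test, started once the readers are going
dd if=/dev/zero of=/usr/src/linux/foo bs=1M count=4000 &
wait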

Let's try it again with fifo_batch at 16:

0 5 0 80 303936 3960 49288 0 0 2520 0 1092 174 0 100 0
0 5 0 80 302400 3996 50776 0 0 3040 0 1094 172 20 80 0
0 5 0 80 301164 4032 51988 0 0 2504 0 1082 150 0 100 0
0 5 0 80 299708 4060 53412 0 0 2904 0 1084 149 0 100 0
1 5 1 80 164640 4060 186784 0 0 1344 26720 1104 891 1 99 0
0 6 2 80 138900 4060 212088 0 0 280 7928 1039 226 0 100 0
0 6 2 80 134992 4064 215928 0 0 1512 7704 1100 226 0 100 0
0 6 2 80 130880 4068 219976 0 0 1928 9688 1124 245 17 83 0
0 6 2 80 123316 4084 227432 0 0 2664 8200 1125 283 11 89 0

That looks acceptable. Writes took quite a bit of punishment, but
the VM should cope with that OK.

It'd be interesting to know why read_expire and writes_starved have
essentially no effect, while fifo_batch has such a huge one.

I'd like to gain a solid understanding of what these three knobs do.
Could you explain that a little more?
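
If I'm reading the deadline_move_requests() hunk below correctly,
fifo_batch is a per-batch budget: a contiguous request costs 1, a seeky
one costs seek_cost, and the batch ends when the budget runs out. So
the defaults buy about four seeks (or 64 contiguous requests) per
batch, while fifo_batch=16 buys just one, and an expired read then only
has to wait behind a much smaller clump of writes. Back-of-envelope,
with seek_cost hardwired to its default of 16 (it isn't exported by the
patch below):

seek_cost=16
for fifo_batch in 64 32 16 8
do
        # integer division: how many worst-case (seeky) requests fit
        echo "fifo_batch=$fifo_batch: $((fifo_batch / seek_cost)) seeks" \
             "or $fifo_batch contiguous requests per batch"
done

(fifo_batch=8 prints 0 seeks - less than one seek's worth, so a single
seeky request ends the batch.)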

During development I'd suggest the patch below, which adds
/proc/sys/vm/read_expire, fifo_batch and writes_starved - it beats
recompiling each time.
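
With that in, each data point above becomes an echo into proc plus a
re-run of the horror test - something like:

cd /usr/src/linux
for fb in 64 32 16 8
do
        echo $fb > /proc/sys/vm/fifo_batch
        dd if=/dev/zero of=foo bs=1M count=4000 &
        sleep 5
        time cat kernel/*.c > /dev/null
        wait            # let the dd finish before the next setting
        rm foo
done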

I'll test SCSI now.

drivers/block/deadline-iosched.c |   18 +++++++++---------
kernel/sysctl.c                  |   12 ++++++++++++
2 files changed, 21 insertions(+), 9 deletions(-)

--- 2.5.38/drivers/block/deadline-iosched.c~akpm-deadline Wed Sep 25 22:16:36 2002
+++ 2.5.38-akpm/drivers/block/deadline-iosched.c Wed Sep 25 23:05:45 2002
@@ -24,14 +24,14 @@
  * fifo_batch is how many steps along the sorted list we will take when the
  * front fifo request expires.
  */
-static int read_expire = HZ / 2; /* 500ms start timeout */
-static int fifo_batch = 64; /* 4 seeks, or 64 contig */
-static int seek_cost = 16; /* seek is 16 times more expensive */
+int read_expire = HZ / 2; /* 500ms start timeout */
+int fifo_batch = 64; /* 4 seeks, or 64 contig */
+int seek_cost = 16; /* seek is 16 times more expensive */

 /*
  * how many times reads are allowed to starve writes
  */
-static int writes_starved = 2;
+int writes_starved = 2;

 static const int deadline_hash_shift = 8;
 #define DL_HASH_BLOCK(sec) ((sec) >> 3)
@@ -253,7 +253,7 @@ static void deadline_move_requests(struc
 {
         struct list_head *sort_head = &dd->sort_list[rq_data_dir(rq)];
         sector_t last_sec = dd->last_sector;
-        int batch_count = dd->fifo_batch;
+        int batch_count = fifo_batch;

         do {
                 struct list_head *nxt = rq->queuelist.next;
@@ -267,7 +267,7 @@ static void deadline_move_requests(struc
                 if (rq->sector == last_sec)
                         batch_count--;
                 else
-                        batch_count -= dd->seek_cost;
+                        batch_count -= seek_cost;

                 if (nxt == sort_head)
                         break;
@@ -319,7 +319,7 @@ dispatch:
          * if we have expired entries on the fifo list, move some to dispatch
          */
         if (deadline_check_fifo(dd)) {
-                if (writes && (dd->starved++ >= dd->writes_starved))
+                if (writes && (dd->starved++ >= writes_starved))
                         goto dispatch_writes;

                 nxt = dd->read_fifo.next;
@@ -329,7 +329,7 @@ dispatch:
         }

         if (!list_empty(&dd->sort_list[READ])) {
-                if (writes && (dd->starved++ >= dd->writes_starved))
+                if (writes && (dd->starved++ >= writes_starved))
                         goto dispatch_writes;

                 nxt = dd->sort_list[READ].next;
@@ -392,7 +392,7 @@ deadline_add_request(request_queue_t *q,
                 /*
                  * set expire time and add to fifo list
                  */
-                drq->expires = jiffies + dd->read_expire;
+                drq->expires = jiffies + read_expire;
                 list_add_tail(&drq->fifo, &dd->read_fifo);
         }
 }
--- 2.5.38/kernel/sysctl.c~akpm-deadline Wed Sep 25 22:59:48 2002
+++ 2.5.38-akpm/kernel/sysctl.c Wed Sep 25 23:05:42 2002
@@ -272,6 +272,9 @@ static int zero = 0;
 static int one = 1;
 static int one_hundred = 100;

+extern int fifo_batch;
+extern int read_expire;
+extern int writes_starved;

 static ctl_table vm_table[] = {
         {VM_OVERCOMMIT_MEMORY, "overcommit_memory", &sysctl_overcommit_memory,
@@ -314,6 +317,15 @@ static ctl_table vm_table[] = {
         {VM_HUGETLB_PAGES, "nr_hugepages", &htlbpage_max, sizeof(int), 0644, NULL,
          &proc_dointvec},
 #endif
+        {90, "read_expire",
+         &read_expire, sizeof(read_expire), 0644,
+         NULL, &proc_dointvec},
+        {91, "fifo_batch",
+         &fifo_batch, sizeof(fifo_batch), 0644,
+         NULL, &proc_dointvec},
+        {92, "writes_starved",
+         &writes_starved, sizeof(writes_starved), 0644,
+         NULL, &proc_dointvec},
         {0}
 };
