Re: Poor read performance when sequential write presents

Jens Axboe (axboe@suse.de)
Tue, 28 May 2002 11:25:03 +0200


On Mon, May 27 2002, Andrew Morton wrote:
> > On Mon, May 27 2002, Andrew Morton wrote:
> > > Jens Axboe wrote:
> > > >
> > > > ...
> > > > > But in 2.5, head-activeness went away and as far as I know, IDE and SCSI are
> > > > > treated the same. Odd.
> > > >
> > > > It didn't really go away, it just gets handled automatically now.
> > > > elv_next_request() marks the request as started, in which case the i/o
> > > > scheduler won't consider it for merging etc. SCSI removes the request
> > > > directly after it has been marked started, while IDE leaves it on the
> > > > queue until it completes. For IDE TCQ, the behaviour is the same as with
> > > > SCSI.
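
To make that concrete, the 2.5 logic boils down to something like this
(simplified sketch from memory, not the literal source; elv_next_request(),
blkdev_dequeue_request() and REQ_STARTED are the real names, the body is
condensed):

	struct request *elv_next_request(request_queue_t *q)
	{
		struct request *rq;

		if (list_empty(&q->queue_head))
			return NULL;

		rq = list_entry_rq(q->queue_head.next);
		/*
		 * once marked started, the i/o scheduler will not
		 * merge into this request or resort it behind the
		 * driver's back
		 */
		rq->flags |= REQ_STARTED;
		return rq;
	}

SCSI follows that up with blkdev_dequeue_request() right away, IDE leaves
the request on the queue until it completes.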
> > >
> > > It won't consider the active request at the head of the queue for
> > > merging (making the request larger). But it _could_ consider the
> > > request when making decisions about insertion (adding a new request
> > > at the head of the queue because it's close-on-disk to the active
> > > one). Does it do that?
> >
> > Only when the front request isn't active is it safe to consider
> > insertion in front of it. 2.5 does exactly that, because it knows
> > whether the request has been started, while 2.4 has to guess by
> > looking at the head-active flag and the plug status.
> >
> > If the request is started, we will only consider placing behind the
> > 2nd request, never directly after the 1st. We could consider in
> > between the 1st and 2nd; that should be safe. In fact, that should be
> > perfectly safe: just move the barrier and started test down after the
> > insert test. *req is the insert-after point.
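
In elevator_linus_merge() terms, something like this (untested sketch of
the scan loop, which walks the queue tail to head; bio_rq_in_between() and
the REQ_ flags as I remember them in current 2.5, so treat it as a sketch
rather than a patch):

	while ((entry = entry->prev) != &q->queue_head) {
		__rq = list_entry_rq(entry);

		/*
		 * note the insert-after point first...
		 */
		if (!*req && bio_rq_in_between(bio, __rq, &q->queue_head))
			*req = __rq;

		/*
		 * ...and only then stop at barrier/started requests, so
		 * the slot right behind a started request is still usable
		 */
		if (__rq->flags & (REQ_BARRIER | REQ_STARTED))
			break;
		...
	}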
>
> Makes sense. I suspect it may even worsen the problem I observed
> with the mpage code. Set the readahead to 256k with `blockdev --setra 512'
> and then run tiobench. The read latencies are massive - one thread
> gets hold of the disk head and hogs it for 30-60 seconds.
>
> The readahead code has a sort of double-window design. The idea is that
> if the disk does 50 megs/sec and your application processes data at
> 49 megs/sec, the application will never block on I/O. At 256k readahead,
> the readahead code will be laying out four BIOs at a time. It's probable
> that the application is actually submitting BIOs for a new readahead
> window before all of the BIOs for the old one are complete. So it's
> performing merging against its own reads.
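
To put numbers on that: --setra 512 is 512 sectors of 512 bytes, ie a
256KiB window, and at four BIOs per window that's 64KiB per BIO. With two
windows in flight that's up to eight of the thread's own BIOs queued back
to back, plenty of material for it to merge against itself.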
>
> Given all this, what I would expect to see is for thread "A" to capture
> the disk head for some period of time, until eventually one of thread "B"'s
> requests expires its latency. Then thread "B" gets to hog the disk head.
> That's reasonable behaviour, but the latencies are *enormous*. Almost
> like the latency stuff isn't working. But it sure looks OK.

I'm still waiting for some time to implement some nicer i/o scheduling
algorithms; I'd be sad to see elevator_linus be the default for 2.6. For
now it's just receiving the odd fix here and there, which does make
small improvements.

> Not super-high priority at this time. I'll play with it some more.
> (Some userspace tunables for the elevator would be nice. Hint. ;))

Agreed :-)
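
For 2.4 there's at least elvtune to poke the elevator read/write latency
settings, something like:

	elvtune -r 2048 -w 8192 /dev/hda

(values purely for illustration). Wiring up a 2.5 equivalent is mostly a
matter of finding the time.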

> hmm. Actually the code looks a bit odd:
>
> 	if (elv_linus_sequence(__rq)-- <= 0)
> 		break;
> 	if (!(__rq->flags & REQ_CMD))
> 		continue;
> 	if (elv_linus_sequence(__rq) < bio_sectors(bio))
> 		break;
>
> The first decrement is saying that elv_linus_sequence is in units of
> requests, but the comparison (and the later `-= bio_sectors()') seems
> to be saying it's in units of sectors.

Well, it really is in units of sectors in 2.5; the first decrement is a
scan aging measure.
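
IOW, annotated (same code as above, comments giving my reading of it):

	/*
	 * aging: every scan pass over this request decrements its
	 * sequence, once it hits zero we stop so the request can't
	 * be starved forever
	 */
	if (elv_linus_sequence(__rq)-- <= 0)
		break;
	/*
	 * non-fs requests are never merge candidates, skip them
	 * without ending the scan
	 */
	if (!(__rq->flags & REQ_CMD))
		continue;
	/*
	 * not enough sector credit left for this bio to pass the
	 * request, stop here
	 */
	if (elv_linus_sequence(__rq) < bio_sectors(bio))
		break;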

> I think calculating the latency in terms of requests makes more sense - just
> ignore the actual size of those requests (or weight it down in some manner).
> But I don't immediately see what the above code is up to?

That might make more sense, but again it's not likely to make
elevator_linus too tolerable anyway. You can easily change the
read/write initial sequences to be >> 2x what they are now, and just
account seeks. The end result would be very similar, though :-)
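
Roughly like so (completely untested, ELV_SEEK_COST is a made-up constant,
and the initial read/write sequences would be bumped to match):

	/*
	 * charge a flat per-request cost instead of bio_sectors(bio),
	 * so latency is accounted in requests (ie seeks), not sectors
	 */
	if (elv_linus_sequence(__rq) < ELV_SEEK_COST)
		break;
	elv_linus_sequence(__rq) -= ELV_SEEK_COST;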

-- 
Jens Axboe
