Re: 2.4.14-pre6

Andrew Morton (akpm@zip.com.au)
Wed, 31 Oct 2001 10:36:24 -0800


Linus Torvalds wrote:
>
> In article <3BDFBFF5.9F54B938@zip.com.au>,
> Andrew Morton <akpm@zip.com.au> wrote:
> >
> >Appended here is a program which creates 100,000 small files.
> >Using ext2 on -pre5. We see how long it takes to run
> >
> > (make-many-files ; sync)
> >
> >For several values of queue_nr_requests:
> >
> >queue_nr_requests:   128    8192   32768
> >execution time:      4:43   3:25   3:20
> >
> >Almost all of the execution time is in the `sync'.
>
> Hmm.. I don't consider "sync" to be a benchmark, and one of the things
> that made me limit the queue size was in fact that Linux in the
> timeframe before roughly 2.4.7 or so was _completely_ unresponsive when
> you did a big "untar" followed by a "sync".

Sure. I chose `sync' because it's measurable. That sync took
four minutes, so the machine will be locked up seeking for four
minutes whether the writeback was initiated by /bin/sync or by
kupdate/bdflush.

> I'd rather have a machine where I don't even much notice the sync than
> one where a made-up load and a "sync" that serves no purpose shows
> lower throughput.
>
> Do you actually have any real load that cares?

All I do is compile kernels :)

Actually, ext3 journal replay can sometimes take thirty seconds
or longer - it reads maybe ten megs from the journal and then
it has to splatter it all over the platter and wait on it.

> ...
> We have actually talked about some higher-level ordering of the dirty list
> for at least five years, but nobody has ever done it. And I bet you $5
> that you'll get (a) better throughput than by making the queues longer and
> (b) you'll have fine latency while you write and (c) that you want to
> order the write-queue anyway for filesystems that care about ordering.

I'll buy that. It's not just the dirty list, either. I've seen
various incarnations of page_launder() and its successor which
were pretty suboptimal from a write clustering pov.

But it's actually quite seductive to take a huge amount of data and
just chuck it at the request layer and let Jens sort it out. This
usually works well and keeps the complexity in one place.

One does wonder whether everything is working as it should, though.
Creating those 100,000 4k files is going to require writeout of
how many blocks? 120,000? And four minutes is enough time for
34,000 seven-millisecond seeks. And ext2 is pretty good at laying
things out contiguously. These numbers don't gel.
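
Just to make the mismatch concrete, here's the back-of-envelope
arithmetic as a throwaway C snippet (the block count and seek time are
the rough figures from above, nothing measured):

/* Back-of-envelope check of the numbers above: ~120,000 blocks to
 * write, versus the number of 7ms seeks that fit into a four-minute
 * sync.  Purely illustrative arithmetic. */
#include <stdio.h>

int main(void)
{
	double blocks = 120000;		/* rough writeout for 100,000 4k files */
	double seek_ms = 7.0;		/* nominal average seek time */
	double sync_secs = 4 * 60;	/* the measured four-minute sync */
	double seeks = sync_secs * 1000.0 / seek_ms;

	printf("blocks to write:        %.0f\n", blocks);
	printf("seeks possible in sync: %.0f\n", seeks);
	/* ~34,000 seeks, but with ext2 laying things out mostly
	 * contiguously the writeback shouldn't be anywhere near
	 * seek-bound, so four minutes is far too long. */
	return 0;
}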

Ah-ha. Look at the sync_inodes stuff:

	for (zillions of files) {
		filemap_fdatasync(file)
		filemap_fdatawait(file)
	}

If we turn this into

	for (zillions of files)
		filemap_fdatasync(file)
	for (zillions of files)
		filemap_fdatawait(file)

I suspect that interesting effects will be observed, yes? Especially
if we have a nice long request queue, and the requests from the
preceding sync_buffers() are still sitting in it, available to be
merged with.
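
For what it's worth, here is a userspace sketch of the same
submit-everything-then-wait idea (made-up file names and counts -
this is not the sync_inodes() code itself): issue all the writes
first so the request layer gets the whole lot to merge and sort,
and only then go back and wait on each file.

/*
 * Userspace analogy of the two-pass idea above.  NOT the kernel
 * sync_inodes() path - just an illustration of submitting all the
 * writes before waiting on any of them.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFILES 1000

int main(void)
{
	static char buf[4096];
	static int fds[NFILES];
	char name[64];
	int i;

	memset(buf, 'x', sizeof(buf));

	/* pass 1: create the files and queue up all the dirty data */
	for (i = 0; i < NFILES; i++) {
		snprintf(name, sizeof(name), "smallfile-%05d", i);
		fds[i] = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fds[i] < 0 || write(fds[i], buf, sizeof(buf)) < 0) {
			perror(name);
			exit(1);
		}
	}

	/* pass 2: now wait for it all; by this point the elevator has
	 * had every request available for merging and sorting */
	for (i = 0; i < NFILES; i++) {
		if (fsync(fds[i]) < 0)
			perror("fsync");
		close(fds[i]);
	}
	return 0;
}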

kupdate runs this code path as well. Why is there any need for
kupdate to wait on the writes?

Anyway. I'll take a look....
