Re: [Bench] New benchmark showing fileserver problem in 2.4.12

Linus Torvalds (torvalds@transmeta.com)
Wed, 17 Oct 2001 16:44:49 +0000 (UTC)


In article <3BCD8269.B4E003E5@anu.edu.au>,
Robert Cohen <robert.cohen@anu.edu.au> wrote:
>
>Factor 3: the performance problem only happens for I/O that is due to
>network traffic, not I/O that was generated locally. I realise this is
>extremely strange and I have no idea how it knows that I/O is due to
>network traffic, let alone why it cares. But I can assure you that it
>does make a difference.

I'll bet you $5 AUD that this happens because you don't block your
output into nicely aligned chunks.

When you have an existing file, and you write 1500 bytes to the middle
of it, performance will degrade _horribly_ compared to the case of
writing a full block, or writing to a position that hasn't been written
yet.

Your benchmark probably just does the equivalent of

for (;;) {
	int bytes = read(in, buf, BUFSIZE);
	if (bytes <= 0)
		break;
	write(out, buf, bytes);
}

am I right? The above is obvious code, but it happens to be bad code.

Now, when you read from the network, you will NOT get reads that are a
nice multiple of BUFSIZE, you'll get reads that are a multiple of the
packet payload (~1460 bytes for TCP over ethernet), and you'll end up
doing unaligned writes that require a read-modify-write cycle and thus
end up doing twice as much IO.

And not only does it do twice as much IO (and potentially more with
read-ahead), the read will obviously be _synchronous_, so the slowdown
is more than twice as much.
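
To make the arithmetic concrete, here is a small illustrative program
(not from this mail; it assumes a 1460-byte TCP payload and 4kB
pages) that prints where successive packet-sized chunks land relative
to page boundaries - after the first write, almost none of them start
or end on a page boundary:

#include <stdio.h>

int main(void)
{
	const unsigned long mss = 1460;       /* typical TCP payload: 1500 MTU - 20 IP - 20 TCP */
	const unsigned long pagesize = 4096;  /* page/block size assumed here */
	unsigned long offset = 0;
	int i;

	for (i = 0; i < 8; i++) {
		unsigned long end = offset + mss;
		/* a write that only partially covers an existing page forces
		 * the kernel to read that page in before merging the data */
		printf("write %d: bytes %5lu-%5lu, starts %4lu bytes into page %lu\n",
		       i, offset, end - 1, offset % pagesize, offset / pagesize);
		offset = end;
	}
	return 0;
}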

In contrast, when the source is a local file (or a pipe that ends up
chunking stuff up in 4kB chunks instead of 1500-byte packets), you'll
have nice write patterns that fill the whole buffer and make the read
unnecessary. Which gets you nice streaming writes to disk.

With most fast disks, this is not unlikely to be a performance
difference of an order of magnitude.

And there is _nothing_ the kernel can do about it. Your benchmark is
bad, and has different behaviour depending on the source.

In short, fix your program. Change the loop to be something like

unsigned int so_far = 0;
for (;;) {
	int bytes = read(in, buf+so_far, BUFSIZE-so_far);
	if (bytes <= 0)
		break;
	so_far += bytes;
	if (so_far < BUFSIZE)
		continue;
	write(out, buf, BUFSIZE);
	so_far = 0;
}
if (so_far)
	write(out, buf, so_far);

which will act the same for partial and full reads, and I bet you'll see
the same difference for local and networking I/O (modulo the speed
difference in the _source_, of course).
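
For completeness, write() can itself return a short count in some
situations (signals, non-blocking descriptors); a defensive variant,
sketched here with a hypothetical write_all() helper that is not part
of the original loop, wraps the write in its own retry loop:

#include <unistd.h>

/* Hypothetical helper, shown only as a sketch: keep writing until the
 * whole buffer has gone out, coping with short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n <= 0)
			return -1;	/* caller can check errno */
		buf += n;
		len -= (size_t) n;
	}
	return 0;
}

Calling write_all(out, buf, BUFSIZE) in place of the plain write()
leaves the rest of the loop unchanged.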

Oh, and I bet you that once you do something like the above, you won't
see much difference between an 8kB buffer and a 256kB buffer. The
smaller buffer will generate more system calls, but it won't matter
much (and sometimes the smaller buffer performs better due to better
data cache locality and better overlapping IO - system calls under
Linux aren't slow, other factors can easily dominate).
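
One rough way to check that for yourself is to time the copy loop with
different buffer sizes. The sketch below (the program name, the default
size and the error handling are just examples, not anything from this
mail) copies stdin to stdout with the blocking loop above and prints
the elapsed wall-clock time on stderr:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
	size_t bufsize = argc > 1 ? (size_t) atol(argv[1]) : 8192;
	char *buf = malloc(bufsize);
	struct timeval start, end;
	size_t filled = 0;
	double secs;

	if (!buf)
		return 1;
	gettimeofday(&start, NULL);
	for (;;) {
		/* accumulate a full buffer before writing, as above */
		ssize_t n = read(0, buf + filled, bufsize - filled);
		if (n <= 0)
			break;
		filled += (size_t) n;
		if (filled < bufsize)
			continue;
		write(1, buf, bufsize);
		filled = 0;
	}
	if (filled)
		write(1, buf, filled);
	gettimeofday(&end, NULL);
	secs = (end.tv_sec - start.tv_sec) +
	       (end.tv_usec - start.tv_usec) / 1e6;
	fprintf(stderr, "%zu byte buffer: %.3f seconds\n", bufsize, secs);
	free(buf);
	return 0;
}

Running it as "copytest 8192 < input > output" and again with 262144
gives a first-order comparison of the two buffer sizes.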

Linus