high-end i/o performance of 2.4.19pre8aa2 (was: Re: O_DIRECT

Lincoln Dale (ltd@cisco.com)
Mon, 03 Jun 2002 14:53:26 +1000


g'day Andrea,

At 04:51 AM 22/05/2002 +0200, Andrea Arcangeli wrote:
>In fact a nice addition to your proggy would be to also benchmark the I/O
>bandwidth using mmap driven I/O, instead of buffered I/O, raw, O_DIRECT
>and the nocopy hack. Like raw and O_DIRECT also the mmap driven I/O is
>zerocopy I/O, so only the buffered-IO would show the total waste of cpu
..

i've had some spare time to enhance my benchmark program. it is now
multithreaded and can do mmap() i/o also.

interestingly enough, it shows that mmap() actually sucks. i didn't take a
"readprofile" to see why, but it used far more cpu time than anything else.

with the same hardware as before (Dual P3 Xeon 733MHz, 133MHz FSB, 2G PC133
SDRAM, QLogic 2300 HBA in 64/66 PCI slot) and same kernel (linux 2.4.19pre8
with vary-io, lockmeter & profile=2), the aggregate throughput numbers across
8 x 15K RPM FC disks (2gbit/s FC switched network) are:

now that i've multithreaded the test app, the results change somewhat
given we now have multiple threads issuing i/o operations:
raw i/o is slightly better than O_DIRECT again, and mmap()'ed i/o is the worst.
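
for reference, the multithreaded structure is basically one reader thread per
device, each timing its own reads, with the parent joining them and reporting
the aggregate. a stripped-down sketch of the mode=basic worker (not the actual
test_disk_performance2 source -- names and structure here are illustrative only):

/* sketch of the per-device worker for mode=basic: plain read()s through
 * the buffered block layer.  one of these runs per device via
 * pthread_create(); the parent pthread_join()s them all and reports the
 * aggregate.  illustrative only -- not the real test_disk_performance2. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

struct dev_arg {
        const char *path;       /* e.g. "/dev/sde" */
        size_t      bs;         /* block size, e.g. 8192 for bs=8k */
        long        blocks;     /* reads per device, e.g. 1310720 for blocks=1280k */
        double      secs;       /* filled in: wall time for this device */
};

static void *reader_thread(void *p)
{
        struct dev_arg *a = p;
        char *buf = malloc(a->bs);
        int fd = open(a->path, O_RDONLY);
        struct timeval t0, t1;
        long i;

        if (fd < 0 || buf == NULL) {
                perror(a->path);
                return NULL;
        }
        gettimeofday(&t0, NULL);
        for (i = 0; i < a->blocks; i++)
                if (read(fd, buf, a->bs) != (ssize_t)a->bs)
                        break;          /* short read or error: stop timing */
        gettimeofday(&t1, NULL);
        a->secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        close(fd);
        free(buf);
        return NULL;
}

(the 'mean/block' figures below look like per-device wall time divided by reads
per device, which is why the 4m-block modes report ~160000usec means where the
8k-block modes report a few hundred usec.)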

[summary]
'base'     Completed reading 81920 mbytes in 524.289107 seconds (156.25 Mbytes/sec), 386usec mean/block
'nocopy'   Completed reading 81920 mbytes in 435.653909 seconds (188.04 Mbytes/sec), 302usec mean/block
'o_direct' Completed reading 81920 mbytes in 429.757513 seconds (190.62 Mbytes/sec), 157655usec mean/block
'raw'      Completed reading 81920 mbytes in 420.940382 seconds (194.61 Mbytes/sec), 164368usec mean/block
'mmap'     Completed reading 81920 mbytes in 1049.446053 seconds (78.06 Mbytes/sec), 401124usec mean/block

given i'm not hitting the peak performance of a single 2gbit/s fibre
channel connection, i'll now migrate to dual FC HBAs and see what happens
... :-)

[detail]

(1)
# performance of a single disk-spindle using the normal linux block i/o
# layer. use 10G worth of 8kbyte blocks.
# this will show the peak performance of a single 15K RPM disk without any
# bus contention.
[root@mel-stglab-host1 root]# ./test_disk_performance2 bs=8k blocks=1280k mode=basic /dev/sde
Completed reading 10240 mbytes across 1 devices using 1310720 blocks of
8192 in 190.028450 seconds (53.89 Mbytes/sec), 151usec mean

(2)
# performance of multiple disk-spindles (8 disks, sde-sdl) using the normal
# linux block i/o layer.
# read 10G sequentially from each disk using 8kbyte blocks, in parallel
# across all disks.
[root@mel-stglab-host1 root]# ./test_disk_performance2 bs=8k blocks=1280k mode=basic /dev/sd[e-l]
Completed reading 81920 mbytes across 8 devices using 10485760 blocks of
8192 in 524.289107 seconds (156.25 Mbytes/sec), 386usec mean
device #0 (/dev/sde) 10240 megabytes in 520.588206 seconds using 1310720
reads of 8192 bytes (19.67 Mbytes/sec), 320usec
device #1 (/dev/sdf) 10240 megabytes in 509.305642 seconds using 1310720
reads of 8192 bytes (20.11 Mbytes/sec), 352usec
device #2 (/dev/sdg) 10240 megabytes in 516.744421 seconds using 1310720
reads of 8192 bytes (19.82 Mbytes/sec), 351usec
device #3 (/dev/sdh) 10240 megabytes in 519.215781 seconds using 1310720
reads of 8192 bytes (19.72 Mbytes/sec), 382usec
device #4 (/dev/sdi) 10240 megabytes in 517.312922 seconds using 1310720
reads of 8192 bytes (19.79 Mbytes/sec), 339usec
device #5 (/dev/sdj) 10240 megabytes in 524.289183 seconds using 1310720
reads of 8192 bytes (19.53 Mbytes/sec), 315usec
device #6 (/dev/sdk) 10240 megabytes in 521.006994 seconds using 1310720
reads of 8192 bytes (19.65 Mbytes/sec), 322usec
device #7 (/dev/sdl) 10240 megabytes in 518.434472 seconds using 1310720
reads of 8192 bytes (19.75 Mbytes/sec), 345usec

(3)
# same test as above, but using the 'nocopy hack' so the PC architecture
# isn't the bottleneck.
# look at 8 disks (sde-sdl) using 8kbyte blocks, 10G on each disk (80G
# total) in parallel.
# using 'nocopy' means we can measure peak performance without the PC
# front-side-bus becoming the bottleneck, but we're still using the normal
# read-file case.
[root@mel-stglab-host1 root]# ./test_disk_performance2 bs=8k blocks=1280k mode=nocopy /dev/sd[e-l]
Completed reading 81920 mbytes across 8 devices using 10485760 blocks of
8192 in 435.653909 seconds (188.04 Mbytes/sec), 302usec mean
device #0 (/dev/sde) 10240 megabytes in 408.566056 seconds using 1310720
reads of 8192 bytes (25.06 Mbytes/sec), 305usec
device #1 (/dev/sdf) 10240 megabytes in 411.353237 seconds using 1310720
reads of 8192 bytes (24.89 Mbytes/sec), 309usec
device #2 (/dev/sdg) 10240 megabytes in 414.244129 seconds using 1310720
reads of 8192 bytes (24.72 Mbytes/sec), 295usec
device #3 (/dev/sdh) 10240 megabytes in 416.225989 seconds using 1310720
reads of 8192 bytes (24.60 Mbytes/sec), 293usec
device #4 (/dev/sdi) 10240 megabytes in 417.593914 seconds using 1310720
reads of 8192 bytes (24.52 Mbytes/sec), 268usec
device #5 (/dev/sdj) 10240 megabytes in 422.646633 seconds using 1310720
reads of 8192 bytes (24.23 Mbytes/sec), 259usec
device #6 (/dev/sdk) 10240 megabytes in 431.343133 seconds using 1310720
reads of 8192 bytes (23.74 Mbytes/sec), 209usec
device #7 (/dev/sdl) 10240 megabytes in 435.654122 seconds using 1310720
reads of 8192 bytes (23.50 Mbytes/sec), 216usec

(4)
# use O_DIRECT to bypass readahead and most of the block-layer memory-copies.
# since there isn't any readahead anymore, use a large buffer to achieve
# peak performance.
[root@mel-stglab-host1 root]# ./test_disk_performance2 bs=4m blocks=2560 mode=direct /dev/sd[e-l]
Completed reading 81920 mbytes across 8 devices using 20480 blocks of
4194304 in 429.757513 seconds (190.62 Mbytes/sec), 157655usec mean
device #0 (/dev/sde) 10240 megabytes in 412.931460 seconds using 2560
reads of 4194304 bytes (24.80 Mbytes/sec), 160898usec
device #1 (/dev/sdf) 10240 megabytes in 414.531502 seconds using 2560
reads of 4194304 bytes (24.70 Mbytes/sec), 159234usec
device #2 (/dev/sdg) 10240 megabytes in 414.585683 seconds using 2560
reads of 4194304 bytes (24.70 Mbytes/sec), 158187usec
device #3 (/dev/sdh) 10240 megabytes in 415.257838 seconds using 2560
reads of 4194304 bytes (24.66 Mbytes/sec), 154983usec
device #4 (/dev/sdi) 10240 megabytes in 416.209758 seconds using 2560
reads of 4194304 bytes (24.60 Mbytes/sec), 138563usec
device #5 (/dev/sdj) 10240 megabytes in 419.144473 seconds using 2560
reads of 4194304 bytes (24.43 Mbytes/sec), 133593usec
device #6 (/dev/sdk) 10240 megabytes in 424.031534 seconds using 2560
reads of 4194304 bytes (24.15 Mbytes/sec), 102836usec
device #7 (/dev/sdl) 10240 megabytes in 429.773878 seconds using 2560
reads of 4194304 bytes (23.83 Mbytes/sec), 98300usec
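
(the mode=direct path only really differs from mode=basic in the open flag and
the buffer alignment -- O_DIRECT rejects misaligned user buffers/sizes -- so
the setup is roughly the sketch below; again illustrative, not the real code:)

/* sketch of the mode=direct setup: same read loop as before, but the fd is
 * opened with O_DIRECT and the buffer is allocated aligned rather than with
 * plain malloc(), since O_DIRECT returns EINVAL on misaligned buffers. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>

static char *open_direct(const char *path, size_t bs, int *fdp)
{
        void *buf;

        if (posix_memalign(&buf, 4096, bs))     /* aligned 4 MB buffer */
                return NULL;
        *fdp = open(path, O_RDONLY | O_DIRECT);
        if (*fdp < 0) {
                free(buf);
                return NULL;
        }
        return buf;             /* then read(*fdp, buf, bs) in a loop as before */
}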

(5)
# same as (4) but using /dev/raw/rawX instead of O_DIRECT
[root@mel-stglab-host1 root]# ./test_disk_performance2 bs=4m blocks=2560 mode=basic /dev/raw/raw[1-8]
Completed reading 81920 mbytes across 8 devices using 20480 blocks of
4194304 in 420.940382 seconds (194.61 Mbytes/sec), 164368usec mean
device #0 (raw1) 10240 megabytes in 420.869676 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164385usec
device #1 (raw2) 10240 megabytes in 420.859291 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164386usec
device #2 (raw3) 10240 megabytes in 420.834023 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164385usec
device #3 (raw4) 10240 megabytes in 420.896964 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 163879usec
device #4 (raw5) 10240 megabytes in 420.838551 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164385usec
device #5 (raw6) 10240 megabytes in 420.867115 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164060usec
device #6 (raw7) 10240 megabytes in 420.906691 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164386usec
device #7 (raw8) 10240 megabytes in 420.932318 seconds using 2560 reads
of 4194304 bytes (24.33 Mbytes/sec), 164357usec
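
(for completeness: the /dev/raw/raw[1-8] devices were bound to sde-sdl
beforehand. that's normally done with the raw(8) utility, but underneath it's
just an ioctl on the raw control device -- roughly the sketch below, assuming
the stock 2.4 <linux/raw.h> interface:)

/* sketch: bind /dev/raw/rawN to a block device -- roughly what
 * "raw /dev/raw/raw1 /dev/sde" does via the raw control device. */
#include <fcntl.h>
#include <linux/raw.h>          /* RAW_SETBIND, struct raw_config_request */
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>      /* major(), minor() */
#include <unistd.h>

static int bind_raw(int raw_minor, const char *blockdev)
{
        struct raw_config_request rq;
        struct stat st;
        int ctl = open("/dev/rawctl", O_RDWR);  /* /dev/raw/rawctl on some setups */

        if (ctl < 0)
                return -1;
        if (stat(blockdev, &st) < 0) {
                close(ctl);
                return -1;
        }
        rq.raw_minor   = raw_minor;
        rq.block_major = major(st.st_rdev);
        rq.block_minor = minor(st.st_rdev);
        if (ioctl(ctl, RAW_SETBIND, &rq) < 0) {
                close(ctl);
                return -1;
        }
        close(ctl);
        return 0;
}

the benchmark itself then just runs mode=basic against /dev/raw/raw[1-8] as
shown above.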

(6)
# use memory-mapped i/o, more for interest's sake, looking at the efficiency
# of the virtual-mapping in the linux kernel.
# use a large blocksize, as an 8kbyte blocksize gave only about 40% of the
# throughput of a 4m blocksize.
[root@mel-stglab-host1 root]# ./test_disk_performance2 bs=4m blocks=2560 mode=mmap /dev/sd[e-l]
Completed reading 81920 mbytes across 8 devices using 20480 blocks of
4194304 in 1049.446053 seconds (78.06 Mbytes/sec), 401124usec mean
device #0 (/dev/sde) 10240 megabytes in 1045.814369 seconds using 2560
reads of 4194304 bytes (9.79 Mbytes/sec), 342912usec
device #1 (/dev/sdf) 10240 megabytes in 1034.783187 seconds using 2560
reads of 4194304 bytes (9.90 Mbytes/sec), 430320usec
device #2 (/dev/sdg) 10240 megabytes in 1046.222514 seconds using 2560
reads of 4194304 bytes (9.79 Mbytes/sec), 340732usec
device #3 (/dev/sdh) 10240 megabytes in 1049.445858 seconds using 2560
reads of 4194304 bytes (9.76 Mbytes/sec), 353895usec
device #4 (/dev/sdi) 10240 megabytes in 1047.926118 seconds using 2560
reads of 4194304 bytes (9.77 Mbytes/sec), 362176usec
device #5 (/dev/sdj) 10240 megabytes in 1047.454030 seconds using 2560
reads of 4194304 bytes (9.78 Mbytes/sec), 377308usec
device #6 (/dev/sdk) 10240 megabytes in 1043.274980 seconds using 2560
reads of 4194304 bytes (9.82 Mbytes/sec), 387720usec
device #7 (/dev/sdl) 10240 megabytes in 1048.326277 seconds using 2560
reads of 4194304 bytes (9.77 Mbytes/sec), 347532usec
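
(the mode=mmap read path is roughly: map a bs-sized window of the device, touch
every page to fault it in, unmap, advance -- which is presumably where the
extra cpu time goes. sketch only, not the real code:)

/* sketch of the mode=mmap read path: map a window, touch each page to
 * force the read, unmap, move on.  build with -D_FILE_OFFSET_BITS=64 so
 * off_t covers the 10G offsets. */
#include <sys/mman.h>
#include <unistd.h>

static int mmap_read_pass(int fd, size_t bs, long blocks)
{
        long pagesz = sysconf(_SC_PAGESIZE);
        volatile char sink = 0;
        long i;
        size_t j;

        for (i = 0; i < blocks; i++) {
                char *p = mmap(NULL, bs, PROT_READ, MAP_SHARED, fd,
                               (off_t)i * bs);

                if (p == MAP_FAILED)
                        return -1;
                for (j = 0; j < bs; j += pagesz)
                        sink += p[j];   /* fault the page in */
                munmap(p, bs);
        }
        (void)sink;
        return 0;
}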

>175MB/sec means you only read to kernel space and you don't make such
>information visible to userspace, so it is unbeatable no matter what :).
>Only in-kernel services like sendfile and/or tux can take advantage of it by
>taking the radix tree lock and looking up the radix tree to find the
>physical page.
>
> > if we _could_ produce superior performance with something like:
> > (a) additional improvements to O_DIRECT (say, a mmap() hack for O_DIRECT)
> > whereby i/o can go direct-to-userspace or
> > userspace-can-mmap(readonly)-into-
> > kernel-space
>
>even if you get a direct physical view of some of the address space (as
>said in an earlier email, on 32bit the lack of address space would reduce
>the place you can map to a little zone), you still need the information
>on the physical memory where the I/O went, so that's still an API from
>kernel to user, so again an overhead compared to the nocopy hack

[ ignoring the fact that O_DIRECT and /dev/rawN are faster than 'nocopy'
for a moment ... :-) ]

around 18 months ago, i had it working.

basically, there was a character device-driver which allowed user-space to
mmap() kernel space.
a separate 'zone' existed that was reserved exclusively for data to/from
disk and network.
this zone was sized to 75% of RAM -- which, with the 2G RAM on the hardware
being used, meant that there was 1.5GB available in this space.

the kernel would indicate the physical address of an "skb" on network
sockets, and each packet put on the skbuff receive_q for a socket marked for
'zerocopy' would result in a zerocopy_buf_t being queued to user-space.
user-space could poll to get a list of the current zerocopy_buf_t's. (this
list wasn't copied to userspace using copy_to_user; user-space picked this
list up via the mmap() into kernel space too)

userspace could control what to do with buffers by various ioctl() calls on
the character-device. things like marking "i'm not interested in that
buffer anymore" would result in atomic_dec_and_test() on the refcount and a
subsequent skb_free() if there wasn't any more interest in the
buffer. there were also ioctl()s for queueing the buffer on a different
socket.
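
to make that concrete, the user-space side looked very roughly like this. the
device node name, the ZC_RELEASE ioctl and the struct layout below are all made
up for illustration -- only the overall shape (mmap the reserved zone, poll for
queued zerocopy_buf_t's, ioctl to drop a reference) matches what the hack
actually did:

/* illustrative only: shape of the userspace side of the zerocopy hack.
 * /dev/zerocopy, ZC_RELEASE and struct zerocopy_buf are hypothetical. */
#include <fcntl.h>
#include <poll.h>
#include <sys/mman.h>
#include <unistd.h>

struct zerocopy_buf {           /* hypothetical layout */
        unsigned long offset;   /* where the data sits within the zone */
        unsigned long len;
};

int main(void)
{
        size_t zone_len = 1536UL << 20;         /* the 1.5G reserved zone */
        int fd = open("/dev/zerocopy", O_RDWR); /* hypothetical char device */
        char *zone = mmap(NULL, zone_len, PROT_READ, MAP_SHARED, fd, 0);
        struct pollfd pfd = { fd, POLLIN, 0 };

        if (fd < 0 || zone == MAP_FAILED)
                return 1;
        while (poll(&pfd, 1, -1) > 0) {
                /* the list of queued zerocopy_buf_t's is read straight out
                 * of the mapped zone (no copy_to_user), handed to whatever
                 * socket/disk path wants the data, then released with
                 * something like ioctl(fd, ZC_RELEASE, &buf) so the kernel
                 * can skb_free() it once the refcount drops to zero. */
                break;                          /* sketch: one batch only */
        }
        munmap(zone, zone_len);
        close(fd);
        return 0;
}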

yes -- rather radical -- but it did stop memory bandwidth from becoming
the bottleneck in an application which predominantly did:
- read-from-disk, send on network socket
- read from network socket, write to disk
- read from network socket, write to alternate network socket, write to disk

given user-space could crash at any point, there was some housekeeping that
ensured that buffers tied to a process that had failed didn't stay around -
they were cleaned up.

there were also some rudimentary checks to ensure that user-space passing
buffer pointers to kernel-space was passing buffers that kernel-space
already knew about, i.e. you couldn't crash the kernel by passing a bad
pointer.

i haven't resurrected this work since the VM underwent some radical changes
-- but it wouldn't be too hard to do. it'd still be a bit ugly in the
tie-ins to the tcp stack and buffer-cache but ....

i guess, given that O_DIRECT and /dev/rawN performance to/from disk is
currently very good, there's less need for it -- provided we can find a
'fast' way to move data between disk and network.
perhaps Ben's async i/o is the way to go for those apps.
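
(for the first of those cases -- read from disk, send on a network socket --
plain sendfile() already avoids the userspace copy when the source is a regular
file; something like the sketch below. it's the other directions that are hard.)

/* sketch: file -> socket without the intermediate userspace copy,
 * using plain sendfile(2).  'sock' is a connected TCP socket. */
#include <sys/sendfile.h>
#include <sys/types.h>

static ssize_t push_to_net(int sock, int file_fd, off_t *off, size_t count)
{
        ssize_t total = 0;

        while (count > 0) {
                ssize_t n = sendfile(sock, file_fd, off, count);

                if (n <= 0)
                        return total ? total : n;
                total += n;
                count -= n;
        }
        return total;
}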

>Long email sorry.

i was travelling and it gave me a chance to test out the 8 hour battery
life in my new laptop on a 14 hour flight. :-)

cheers,

lincoln.
