RAID1 under 2.0.32

David Mansfield (david@cobite.com)
Tue, 2 Dec 1997 12:01:55 -0500 (EST)


Hello, I've been trying to decide whether to set up a production web
server/dial-in server using RAID 1 mirroring. I've set it up and it
seems to work OK, but I've gotten a couple of oopses and a lot of
interesting kernel messages in syslog. I am close to suspecting bad
memory, although memtest (run about 10 times) doesn't show anything. The
system looks like this:

Pentium 150.
64 MB RAM.
No IDE drives.
Adaptec 2940UW with 2x Quantum HD
(Note: I was running with twin adapters for a while and trimmed to
one, which didn't eliminate the problems)
Kernel 2.0.32 with raid145-0.36.3-2.0.30.gz patch.
(Note: although the patch is for 2.0.30 it applied cleanly...)
raidtools-0.41

The rest is a stock RedHat 4.2 distribution.

My tests are the following:
--- test 1 ---
cd /usr/src/linux
while true; do make dep; make clean; make zImage; make modules; done
--- test 2 ---
while true; do cp -a /usr/src/linux /tmp/test; rm -r /tmp/test; done
--- test 3 ---
A short C program that mallocs 10 MB and writes a random value to a
random spot in this buffer (keeps all 10 MB swapping)
----

I ran test1 + test2 + (7 x test3) to stress test the system. Note, since
the system has only 64 MB this puts me about 20MB into swap.
Here are the results. I get lots of these (for some reason my syslog has
disappeared, but there were a number of these errors, at least 40 over a
period of 12 hours):

kernel: Internal error: bad swap device
kernel: rw_swap_page: weirdness
kernel: swap_free: weirdness
kernel: Trying to free non-existant swap page
kernel: Trying to swap to non swap device

One of:
Dec 2 10:40:08 tempiws kernel: Unable to handle kernel paging request at virtual address 081c8000
Dec 2 10:40:08 tempiws kernel: current->tss.cr3 = 039a9000, 8r3 = 039a9000
Dec 2 10:40:08 tempiws kernel: *pde = 00bbd067
Dec 2 10:40:08 tempiws kernel: *pte = 68747561

And three oopses (the first two copied by hand):
CPU: 0
EIP: 0010 [<00123d7a>]
EFLAGS: 00010246
eax: 00001800 ebx: 52565253 ecx: 0381944c edx: 00000c00
esi: 00000bc1 edi: 00000000 ebp: bffffe60 esp: 03993f84
ds:0018 es:0018 fs:002b gs:0026 ss:0018
Process update (pid: 305, process nr:27, stackpage:03993000)
Stack 0031b810 00000000 00000000 00126c94 00000000 00000000 0031b810 00000000
00000000 0031b810 00126df1 0031b810 00000001 0010a86d 00000001 00000000
00000000 00000001 00000000 bffffe60 ffffffda 0000002b 0000002b 0000002b
Call Trace: [<00126c94>] [<00126df1>] [<0010a86d>]
general protection: 0000

and ksymoops says:
Using `/usr/src/linux/System.map' to map addresses to symbols.

>>EIP: 123d7a <sync_inodes+1e/58>
Trace: 126c94 <sync_old_buffers+14/13c>
Trace: 126df1 <sys_bdflush+35/98>
Trace: 10a86d <system_call+55/7c>

Second oops:
CPU 0
EIP: 0010: [<0011ac2b>]
EFLAGS: 00010246
eax: 00000000 ebx: 00fd2bfc ecx: 00000400 edx: 02001000
esi: 00fae660 edi: ds001000 ebp: 0009ad98 esp: 00006f58
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process init (pid 1; process nr: 1; stackpage=00006000)
Stack: 0011aa5c bfd98f68 00099414 00099414 fffffff3 00105025 0010a4f3 000998b4
00105025 00105025 00111624 00099414 0009ad98 bfd98000 00000001 00111508
00000002 0804bccc bfd9906c 00099414 0377f618 bfd99720 0010a9d0 00006fbc
Call Trace: [<0011aa5c>] [<0010a4f3>] [<00111624>] [<00111508>] [<0010a9d0>]
Code: f3 ab 0b 55 0c 89 54 24 18 89 54 24 1c 8b 44 24 18 0c 40 89

and ksymoops says:
Using `/usr/src/linux/System.map' to map addresses to symbols.

>>EIP: 11ac2b <do_no_page+1cf/328>
Trace: 11ac2b <do_no_page+1cf/328>
Trace: 10a4f3 <handle_signal+5b/90>
Trace: 111624 <do_page_fault+11c/310>
Trace: 111624 <do_page_fault+11c/310>
Trace: 10a9d0 <error_code+40/48>

Code: 11ac2b <do_no_page+1cf/328> repz stosl %eax,%es:(%edi)
Code: 11ac2d <do_no_page+1d1/328> orl 0xc(%ebp),%edx
Code: 11ac30 <do_no_page+1d4/328> movl %edx,0x18(%esp,1)
Code: 11ac34 <do_no_page+1d8/328> movl %edx,0x1c(%esp,1)
Code: 11ac38 <do_no_page+1dc/328> movl 0x18(%esp,1),%eax
Code: 11ac3c <do_no_page+1e0/328> orb $0x40,%al
Code: 11ac3e <do_no_page+1e2/328> movl %eax,(%eax)
Code: 11ac40 <do_no_page+1e4/328> nop
Code: 11ac41 <do_no_page+1e5/328> nop
Code: 11ac42 <do_no_page+1e6/328> nop

Third oops (this one got logged so the symbols are already here):
Oops: 0009
CPU: 0
EIP: 0010:[ext2_file_write+585/1116]
EFLAGS: 00010216
eax: 028c8598 ebx: 00000400 ecx: 00000100 edx: 034f2400
esi: 081c8000 edi: 034f2400 ebp: 00000400 esp: 038efc04
ds: 0018 es: 0018 fs: 002b gs: 002b ss: 0018
Process cc1 (pid: 4190, process nr: 33, stackpage=038ef000)
Stack: 000c0000 0814b000 0814b000 0010b000 00000000 00000000 001d8dfc 0007d000
00000000 00000210 00084000 00000000 028c8598 00000000 03bf8a00 038efc90
00eab500 00008180 03bf8a00 00bd9798 038efc90 00eab500 00125e1a 00bd9798
Call Trace: [do_coprocessor_segment_overrun+4/60] [__brelse+34/68]
[ext2_create+341/360] [dump_write+28/44] [writenote+167/200]
[dump_write+28/44] [elf_core_dump+2488/2640] [do_no_page+620/808]
[timer_bh+193/820] [do_signal+495/632] [signal_return+18/56]
Code: 64 f3 a5 83 e3 03 89 d9 64 f3 a4 55 8b 54 24 34 8b 52 24 03

Does anyone have an opinion on whether RAID 1 is ready to play with the
big boyz? Should I tuck this one away and try again in 6 months? Does
it look like processor/memory weirdness? Other experiences and/or
comments welcome.

David Mansfield
david@cobite.com