Re: Kernel Bug at spinlock.h ?!

ChristopherHuhn (c.huhn@gsi.de)
Mon, 10 Mar 2003 09:52:04 +0100


Zwane Mwaikambo wrote:

>On Thu, 6 Mar 2003, ChristopherHuhn wrote:
>
>
>
>>Hi again,
>>
>>
>>
>>>It looks like a possible race with rpc_execute and possibly the timer,
>>>although i can't be certain where the other cpus are. Do the other oopses
>>>look somewhat similar? Could you supply them?
>>>
>>>
>>>
>>>
>>below are some oopses I gathered yesterday and today, all on different
>>machines.
>>I'd like to remark that we experience massive NFS problems at the moment
>>that seem to be caused by our mixed potato 2.2/ woody 2.4 environment,
>>i. e. linking apps on a woody system with the sources mounted via nfs
>>from a potato box leads to obscure IO failures like "no space left on
>>device" (This never happens with woddy only). So this might be a clue
>>here as well.
>>
>>The oopses are all written down from the screen, I hopefully made little
>>"transmission" errors.
>>
>>
>
>Some of these are a bit worrying seeing as they are bit flips, also they
>all appear to come from a UP machine(?) this would change things with
>respect to my previous comment about races. Regarding weird io failures
>are you mounting with the 'soft' option?
>
> Zwane
>
>
The machines all all DP Xeons, our SP machines run the same kernel, but
these oopses only occur on DP machines under heavy load.
The machines are recognized as SMP:
# uname -a
Linux lxb000 2.4.20 #2 SMP Tue Dec 17 10:43:29 CET 2002 i686 unknown

but the e7500 chipset seems not to be supported 100%:

Jan 27 15:26:34 lxb000 kernel: found SMP MP-table at 000f6710
Jan 27 15:26:34 lxb000 kernel: hm, page 000f6000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: hm, page 000f7000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: hm, page 0009f000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: hm, page 000a0000 reserved twice.
Jan 27 15:26:34 lxb000 kernel: On node 0 totalpages: 262016
Jan 27 15:26:34 lxb000 kernel: zone(0): 4096 pages.
Jan 27 15:26:34 lxb000 kernel: zone(1): 225280 pages.
Jan 27 15:26:34 lxb000 kernel: zone(2): 32640 pages.
Jan 27 15:26:34 lxb000 kernel: ACPI: Searched entire block, no RSDP was
found.
Jan 27 15:26:34 lxb000 kernel: ACPI: Searched entire block, no RSDP was
found.
Jan 27 15:26:34 lxb000 kernel: ACPI: System description tables not found
Jan 27 15:26:34 lxb000 kernel: Intel MultiProcessor Specification v1.4
Jan 27 15:26:34 lxb000 kernel: Virtual Wire compatibility mode.
Jan 27 15:26:34 lxb000 kernel: OEM ID: Product ID: Kings Canyon APIC
at: 0xFEE00000
Jan 27 15:26:34 lxb000 kernel: Processor #0 Pentium 4(tm) XEON(tm) APIC
version 20
Jan 27 15:26:34 lxb000 kernel: Processor #6 Pentium 4(tm) XEON(tm) APIC
version 20
Jan 27 15:26:34 lxb000 kernel: Processor #1 Pentium 4(tm) XEON(tm) APIC
version 20
Jan 27 15:26:34 lxb000 kernel: Processor #7 Pentium 4(tm) XEON(tm) APIC
version 20
Jan 27 15:26:34 lxb000 kernel: I/O APIC #2 Version 32 at 0xFEC00000.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #3 Version 32 at 0xFEC80000.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #4 Version 32 at 0xFEC80400.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #5 Version 32 at 0xFEC81000.
Jan 27 15:26:34 lxb000 kernel: I/O APIC #8 Version 32 at 0xFEC81400.
Jan 27 15:26:34 lxb000 kernel: Processors: 4
...

There might be (are) severe flaws in our NFS configuration and network
performance, but that should not crash the box, should it?

BTW: I just received a link to a bux incl. fix that sounds similar to
our problem: http://marc.theaimsgroup.com/?l=linux-nfs&m=104716581307294&w=2

With kind regards,

Christopher

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/