You want _that_ story, eh? 8^)
* * * * *
Yeah we had ESR problems on the original NUMA-Q boxes with P6 CPUs. On system
shutdown, CPU 0 on one or more secondary nodes would occasionally spasm with
an infinite stream of APIC error interrupts claiming invalid message. A
couple hardware guys and I spent a lot of time looking at the APIC bus with
special APIC bus analyzers, etc. We _never_ caught a malformed message on
the APIC bus.
Once a CPU started weirding out like this, it was impossible to make it shut
up. We could clear the error status, and it would show cleared in the ESR,
but the local APIC would reissue the same error interrupt as soon as we
returned from the error handler.
In fact, with kernel printf turned off we would get about a million of them
per second, faster than most APIC messages could be sent over the APIC bus.
(This was a 16.6667 MHz two bit wide bus. Messages were about 10 to 40
frames long.)
Thus, I concluded that it was some weird error state in the local APIC. We
never got any answer back from Intel on how to clear this state, let alone
admission that it existed, so we just turned off the APIC error IRQ. Since
we were shutting down the system anyway, this seemed an adequate kludge.
Writing 0 to the ESR four times was done out of paranoia, and a desire to
grind the clear deeper into the local APIC's state machine. I have no
evidence that it ever really fixed this bug. Nothing did.
Maybe this weirdness was fixed in P2s or later CPUs. Maybe. Intel never did
say anything about it to us. Regardless, the four writes to ESR is still
enshrined in Dynix/PTX's APIC error handler, and will remain a hidden
testimony to this bug for as long as IBM maintains PTX support.
-- James Cleverdon IBM xSeries Linux Solutions {jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot com- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/