tlb flush timeout on SMP alphas

Mike (mibarnes@tni.net)
Tue, 10 Dec 2002 10:06:42 -0500


I am trying to figure out whether the timeouts I am getting are software
or hardware related. We have 2-way Alphas with 4 GB of RAM, and the only
unusual hardware in them is a Myrinet card with the GM driver. I can
frequently (though not 100% reliably) trigger TLB flush timeouts while
the system has a load of around 2 and one of the running processes is
using the Myrinet card.

When this message occurs, one of the 2 CPUs is "dead", meaning that any
processes running on it are completely stalled and the rtc count for the
"dead" processor in /proc/interrupts does not increase.
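
For anyone wanting to check for the same symptom, here is a minimal
sketch of how the stalled CPU could be spotted automatically. It assumes
the usual /proc/interrupts layout (a header row of CPU names, then one
row per interrupt with one count per CPU followed by a description) and
that the row of interest has "rtc" in its description; the function
names and the 2 second interval are my own choices for illustration, not
anything taken from the kernel or the GM driver.

```python
import time


def parse_irq_counts(text, label):
    """Return per-CPU counts for the first row whose description
    contains `label` (case-insensitive), or None if no row matches.

    Assumes the first line of `text` is the CPU header row."""
    lines = text.splitlines()
    if not lines:
        return None
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        if ":" not in line:
            continue
        _, rest = line.split(":", 1)
        fields = rest.split()
        counts = fields[:ncpus]
        desc = " ".join(fields[ncpus:])
        if (len(counts) == ncpus
                and all(c.isdigit() for c in counts)
                and label.lower() in desc.lower()):
            return [int(c) for c in counts]
    return None


def stalled_cpus(before, after):
    """Return indices of CPUs whose count did not change."""
    return [cpu for cpu, (b, a) in enumerate(zip(before, after)) if a == b]


def sample(label="rtc", interval=2.0, path="/proc/interrupts"):
    """Read the counters twice, `interval` seconds apart, and report
    which CPUs saw no new interrupts for the matching row."""
    with open(path) as f:
        before = parse_irq_counts(f.read(), label)
    time.sleep(interval)
    with open(path) as f:
        after = parse_irq_counts(f.read(), label)
    if before is None or after is None:
        raise ValueError("no row matching %r in %s" % (label, path))
    return stalled_cpus(before, after)
```

Calling sample() on an affected machine should return the index of the
"dead" CPU, and an empty list on a healthy one. The label may need
adjusting if the row is named differently on a given kernel.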

An interesting data point is that the air conditioner failed one
weekend; some machines died because of the heat, and these machines
failed with the same TLB flush timeouts.

Currently we are running a 2.4.9 (Red Hat modified) kernel, and have
tried others up to and including 2.4.18 with the same results.

Any insight would be greatly appreciated.

I am not a subscriber on this list, so CCing me would be greatly
appreciated.

Mike