Re: Dazed and Confused

Greg Boyce (gboyce@rakis.net)
Fri, 6 Dec 2002 11:47:35 -0500 (EST)


On 6 Dec 2002, Alan Cox wrote:

> On Fri, 2002-12-06 at 14:55, Greg Boyce wrote:
> > I work in a company with a large number of Linux machines deployed all
> > around the country, and in some of the machines we've been seeing the
> > following error:
> >
> > Uhhuh. NMI received. Dazed and confused, but trying to continue
> > You probably have a hardware problem with your RAM chips
>
> There are several causes of an NMI depending on the system - hardware
> failure is one, some systems do it for things like PCI errors, and on a
> few boxes you see them on power management events (notably old 486s)
>
> > Due to the number of machines and their locations, running memtest86 on
> > them isn't exactly feasible.
>
> Then buy better ram ;)

We have a large number of machines, but only a very small number of
machine types. The OS images installed are identical, and the BIOSes
should be identical within each machine type.

Since the number of machines reporting this error is pretty small, I
think it's unlikely to be power management or anything else common to a
whole machine type.

> > Is there anything besides failing hardware that could be the cause of this
> > error? Also, how serious is this error? Some of the machines reporting
> > this error have had problems with programs crashing, while others seem to
> > run fine.
>
> Take a sample set of machines which have been crashing and run memtest86
> on a couple. That should tell you if it is RAM. From a sample you can
> then figure out how to handle the rest (things that come to mind if
> memtest86 fails on the test machines include replacing the ram in a few
> more then taking the old ram back to test)

I'll mention it to the people who handle the replacement of hardware, but
from the sounds of this and Dick's e-mail, it's most likely hardware of
some sort or possibly overheating. They can decide if they want to try to
figure out which component is causing the problem, or if they'd prefer to
just replace the faulty machines completely and worry about tracking the
component later. We have plenty of spares in the warehouse.
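
Before anyone picks a sample set for memtest86, it may help to tally which
machines have logged the NMI message and how often. What follows is only a
rough sketch: it assumes the kernel logs have already been copied into one
directory with one file per host named <host>.log. That layout and the
function name count_nmi_by_host are hypothetical, not something described
in this thread.

```python
from collections import Counter
from pathlib import Path

# Substring of the kernel message quoted earlier in this thread.
NMI_MARKER = "NMI received"


def count_nmi_by_host(log_dir):
    """Return a Counter mapping host name -> number of NMI log lines.

    Assumption (hypothetical layout): log_dir holds one file per host,
    named <host>.log, containing that host's kernel log output.
    """
    counts = Counter()
    for path in sorted(Path(log_dir).glob("*.log")):
        # Tolerate stray bytes in old logs instead of failing on decode.
        with path.open(errors="replace") as handle:
            hits = sum(1 for line in handle if NMI_MARKER in line)
        if hits:
            counts[path.stem] = hits
    return counts
```

Calling count_nmi_by_host("collected-logs").most_common() would list the
noisiest hosts first, which could be a reasonable starting point for the
sample. The directory name here is again just a placeholder.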

Thanks for the help,

Greg
