I've been taking a class in BSD kernel internals this semester, and I
am going to be taking a vendor class in Sun crash dump analysis this
coming week. One of the things that I learned in my BSD class is
that most of the debugging that has gone on in the BSD world is due to
crash dump analysis. I sort of assumed that the reason Linux was
so solid was extensive crash dump analysis. However, when I went to
the source, I was shocked to find there was no provision for it. So
basically I have two questions:
1) Is the lack of crash dumps just due to the fact that no one has
implemented them and you all have found ways to get along without
them, or was there some conscious decision made to avoid them?
2) If you don't have crash dumps, how do you debug the Linux
kernel? The only debugging facility that I have found for when a
system gets in a really bad way is the SysRq function, and that will
only report back a small amount of information. It also doesn't seem
as though you run a dev kernel on one machine and then debug it
remotely, because it seems that the gdbstub was last updated for
2.0.30 and all of you have been working hard on the kernel since that
point.
The reason that I ask is that Cisco has used HP Vectra XMs as print
servers for more than a year now. Now we are deploying a whole bunch
of servers to our field offices, and so we opted for a more bare-bones
machine, the HP Vectra VL. The problem is that we can't keep the little
buggers running. After just a few hours of load they drop off of the
network. The screen is completely blank and there are no messages in
syslog, nothing anywhere we can find. We really don't know how to
troubleshoot the problem without something like a crash dump.
-ben