Re: crash dumps

Linus Torvalds (torvalds@transmeta.com)
8 Dec 1997 01:40:17 GMT


In article <199712071648.IAA11904@trill.cisco.com>,
<bwoodard@cisco.com> wrote:
>
>I've been taking a class in BSD kernel internals this semester, and I
>am going to be taking a vendor class in Sun crash dump analysis this
>coming week. One of the things that I learned in my BSD class is
>that most of the debugging that has gone on in the BSD world is due to
>crash dump analysis. I sort of assumed that the reason that Linux was
>so solid was extensive crash dump analysis. However, when I went to
>the source, I was shocked to find there was no provision for it. So
>basically I have two questions:
>
>1) Is the lack of crash dumps just due to the fact that no one
>implemented it and you all have found ways to get along without it or
>was there some conscious decision made to avoid it?

I guess you could call it conscious. I prefer to consider it a
fundamental approach to debugging: I think that it's a _lot_ better to
look at the sources to figure out where the bugs are than to try to find
them after the fact.

For example, when something happens that would give a crash dump on
other systems, Linux will spew out the offending register state and a
backtrace. Essentially, it tells the developer _where_ the error
happened - it leaves the _why_ to the developer to find out.
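
As a rough illustration of the "where" half, here is a minimal sketch of
how a raw address from a backtrace could be turned into "symbol+offset".
The symbol table, its addresses, and the names ksym, symtab and
resolve_addr are all made up for the example; this is not the kernel's
actual code, only the general idea of looking an address up in a sorted
list of symbol start addresses.

```c
#include <stddef.h>

/* Hypothetical symbol table entry: start address and symbol name. */
struct ksym {
    unsigned long addr;
    const char *name;
};

/* Invented addresses and names, sorted by address, for illustration only. */
static const struct ksym symtab[] = {
    { 0x1000UL, "sketch_init"  },
    { 0x1400UL, "sketch_read"  },
    { 0x1a00UL, "sketch_write" },
};

#define NSYMS (sizeof(symtab) / sizeof(symtab[0]))

/*
 * Find the symbol with the largest start address that is <= addr.
 * On success, store (addr - symbol start) in *offset and return the
 * entry. Return NULL if addr lies below the first symbol.
 */
const struct ksym *resolve_addr(unsigned long addr, unsigned long *offset)
{
    size_t lo = 0, hi = NSYMS;          /* binary search over [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (symtab[mid].addr <= addr)
            lo = mid + 1;               /* candidate, look further right */
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;                    /* address precedes every symbol */

    *offset = addr - symtab[lo - 1].addr;
    return &symtab[lo - 1];
}
```

With this table, resolve_addr(0x1404, &off) lands in sketch_read with an
offset of 4. That names the function to go and read; it says nothing
about why the code ended up there.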

>2) If you don't have crash dumps, how do you debug the Linux
>kernel? The only debugging facility that I have found for when a
>system gets in a really bad way is the SysRq function and that will
>only report back a small amount of information. It also doesn't seem
>as though you run up a dev kernel on one machine and then debug
>remotely, because it seems that the gdbstub was last updated for
>2.0.30 and all of you have been working hard on the kernel since that
>point.

I personally dislike debuggers. Debuggers are good for one thing:
finding out where in the source a particular address exists.

I've seen too much code that has stupid tests for error conditions only
because the code was developed by people that used debuggers and noticed
that "oops, the pointer here is NULL, let's add a test against NULL to
make the crash go away". That's treating the symptoms rather than the
bug itself (sometimes the symptoms _are_ the bug, but that's by no means
always true).
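
To make the distinction concrete, here is a small sketch with a made-up
struct conn and hypothetical functions (none of this is real kernel
code). The bug is in the setup routine, which forgets to store the peer
name. The symptom fix adds a NULL test where the crash shows up; the
other fix repairs the setup so the pointer is never NULL in the first
place.

```c
#include <stddef.h>

/* Hypothetical connection record; peer must be set on every live entry. */
struct conn {
    const char *peer;
    int refs;
};

/* The actual bug: setup forgets to store the peer name. */
void conn_setup_buggy(struct conn *c, const char *peer)
{
    (void)peer;             /* argument is (wrongly) ignored */
    c->peer = NULL;
    c->refs = 1;
}

/*
 * Symptom fix: the crash goes away, but every connection set up by the
 * buggy routine now silently reports a zero-length peer name.
 */
size_t peer_len_symptom_fix(const struct conn *c)
{
    size_t n = 0;

    if (c->peer == NULL)    /* test added only to stop the crash */
        return 0;
    while (c->peer[n] != '\0')
        n++;
    return n;
}

/* Root-cause fix: setup stores the peer name as it was meant to. */
void conn_setup_fixed(struct conn *c, const char *peer)
{
    c->peer = peer;
    c->refs = 1;
}

/* No NULL test here: a correctly set up conn always has a peer. */
size_t peer_len(const struct conn *c)
{
    size_t n = 0;

    while (c->peer[n] != '\0')
        n++;
    return n;
}
```

After conn_setup_buggy(&c, "host"), peer_len_symptom_fix(&c) returns 0
without crashing, which is the wrong answer delivered quietly. After
conn_setup_fixed(&c, "host"), peer_len(&c) returns 4.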

Not having a debugger means that the person who fixes the bug usually
has to _understand_ the bug. And quite frankly, I'd much rather have
any bugs fixed by people who understand them than by somebody who just
tries to fix the symptoms. It does require more of the fixer, but on
the whole it tends to result in fewer ad-hoc fixes.

Linus