Re: [RFC] hangcheck-timer module
Brian Gerst (bgerst@didntduck.org)
Thu, 21 Nov 2002 15:31:04 -0500
Joel Becker wrote:
> Folks,
> 	Attached is a module, hangcheck-timer.  It is used to detect
> when the system goes out to lunch for a period of time, such as when a
> driver like qla2x00 udelays a bunch.
> 	The module sets a timer.  When the timer goes off, it then uses
> the TSC (warning: portability needed) to determine how much real time
> has passed.
> 	On a normal system, the real elapsed time will be almost
> identical to the expected timer duration.  However, if a device decided
> to udelay for 60 seconds (or some other circumstance), the module takes
> notice.  If the margin of error passes a threshold, the machine is
> rebooted.
> 	The module is currently used in a cluster environment.  After
> some time out to lunch, the rest of the cluster will have given up on a
> machine.  If the machine suddenly comes back and assumes it is still
> "live", bad things can happen.
> 	We can also see use for this in a debugging sense, for kernel
> hangs as well as driver code.  That's why I'm proposing it for general
> inclusion.
> 	Comments?  Thoughts?
> 
> Joel
> 
> Building:
> 	The module should happily build against most 2.4 kernels.  The
> usual module building compile line:
> 	gcc  -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
> 		-DMODULE -D__KERNEL__ -DLINUX  -c -o hangcheck-timer.o \
> 		hangcheck-timer.c
> 
> Running:
> 	Load the module with insmod.  There are two options.
> "hangcheck_tick=<seconds>" specifies the timer timeout, and
> "hangcheck_margin=<seconds" specifies the margin of error.
> 
> Joel
> 
There is already an NMI watchdog that is better than what you propose, 
because it will also catch cases where something gets stuck with 
interrupts disabled.
--
				Brian Gerst
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/