Re: Kexec, DMA, and SMP

Kenneth Sumrall (ken@mvista.com)
Mon, 10 Feb 2003 17:35:39 -0800


"Eric W. Biederman" wrote:
>
> Suparna Bhattacharya <suparna@in.ibm.com> writes:
>
> > On Sun, Feb 09, 2003 at 11:39:27AM -0700, Eric W. Biederman wrote:
> > > Corey Minyard <cminyard@mvista.com> writes:
> > >
> > > With respect to DMA and SMP handling for kexec on panic that case is
> > > much trickier. A lot of the normal methods simply don't apply because
> > > by definition in a panic something is broken, and that something may
> > > be the code we need to cleanly shutdown the hardware. But I an not
> > > ready to sacrifice a method that works well in a properly working
> > > kernel just because the panic case can't use it.
> > >
> > > In getting it working I suggest we start with the easy cases, where
> > > DMA and SMP are not big issues. And then we can have a working
> > > framework.
> >
> > I'd agree. That was also the idea behind the patch we'd just posted
> > for LKCD. With a basic working framework in hand that works for
> > simpler cases, we can now keep working on addressing more and harder
> > situations bit by bit.
>
> Agreed. I guess the primary question is can we trust the current
> device shutdown + reboot notifier path or do we need to make some
> large changes to avoid it.
>
So are the functions registered on the reboot notifier path guaranteed
to be non-blocking? In the kexec on panic case, calls that can block
would obviously be a bad thing. If they can block, perhaps we could add
a new flag SYS_PANIC or something like that to tell the driver to only
do a non-blocking shutdown of the chip.

> > Are you trying to address the possibility that DMA is overwriting
> > memory we are using in the recovery code, due to a runaway driver
> > or other code passing a wrong memory address to a device (e.g. in
> > a corrupted command area) ?
>
> Not primarily. Instead I am trying to address the possibility that
> DMA is overwriting the recovery code due to a device not being shutdown
> properly. Though it would happen to cover many cases of the wrong
> memory address being passed to a device.
>
The problem we were seeing was that rogue DMA from a network interface
chip was corrupting dentry's in the dirent cache when the rebooted
kernel was coming back up. This caused a whole new set of panics. :-(

Ken Sumrall
ken@mvista.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/