Re: the oom killer

Andrea Arcangeli (andrea@suse.de)
Fri, 5 Apr 2002 16:43:48 +0200


On Fri, Apr 05, 2002 at 01:18:26AM -0800, Andrew Morton wrote:
>
> Andrea,
>
> Marcelo would prefer that the VM retain the oom killer. The thinking
> is that if try_to_free_pages fails, then we're better off making a
> deliberate selection of the process to kill rather than the random(ish)
> selection which we make by failing the allocation.
>
> One example is at
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=101405688319160&w=2
>
> That failure was with vm-24, which I think had the less aggressive

vm-24 had a problem yes, that is fixed in the latest releases.

> i/dcache shrink code. We do need to robustly handle the no-swap-left
> situation.
>
> So I have resurrected the oom killer. The patch is below.
>
> During testing of this, a problem cropped up. The machine has 64 megs
> of memory, no swap. The workload consisted of running `make -j0
> bzImage' in parallel with `usemem 40'. usemem will malloc a 40
> megabyte chunk, memset it and exit.
>
> The kernel livelocked. What appeared to be happening was that ZONE_DMA
> was short on free pages, but ZONE_NORMAL was not. So this check:
>
> 	if (!check_classzone_need_balance(classzone))
> 		break;
>
> in try_to_free_pages() was seeing that ZONE_NORMAL had some headroom
> and was causing a return to __alloc_pages().
>
> __alloc_pages has this logic:
>
> 	min = 1UL << order;
> 	for (;;) {
> 		zone_t *z = *(zone++);
> 		if (!z)
> 			break;
>
> 		min += z->pages_min;
> 		if (z->free_pages > min) {
> 			page = rmqueue(z, order);
> 			if (page)
> 				return page;
> 		}
> 	}
>
>
> On the first pass through this loop, `min' gets the value
> zone_dma.pages_min + 1. On the second pass through the loop it gets
> the value zone_dma.pages_min + 1 + zone_normal.pages_min. And this is
> greater than zone_normal.free_pages! So alloc_pages() gets stuck in an
> infinite loop.

This is a bug I fixed in the -rest patch; it is also broken on NUMA.
The deadlock cannot happen if you apply all my patches.
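
The accumulation described in the quoted text can be reproduced in user
space. What follows is a minimal sketch, not kernel code: struct
zone_model, the function names, and the page counts are all made up for
illustration, and the per-zone check is only a simplified stand-in for
"this zone has headroom", not the real check_classzone_need_balance().
The zones are walked DMA first, following the order given in the quoted
description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for zone_t: only the two fields the loop reads. */
struct zone_model {
	const char *name;
	unsigned long free_pages;
	unsigned long pages_min;
};

/* Made-up page counts: DMA is short, Normal is above its own pages_min. */
const struct zone_model example_zones[] = {
	{ "DMA",    100, 128 },
	{ "Normal", 300, 255 },
};

/*
 * Walk the zones with a watermark that accumulates across zones, as in
 * the quoted __alloc_pages loop. Returns the index of the first zone
 * whose free_pages exceeds the running threshold, or -1 if none does.
 */
int first_zone_cumulative(const struct zone_model *zones, int nr_zones,
			  unsigned int order)
{
	unsigned long min = 1UL << order;
	int i;

	assert(zones != NULL);
	for (i = 0; i < nr_zones; i++) {
		min += zones[i].pages_min;	/* carried over to the next zone */
		if (zones[i].free_pages > min)
			return i;
	}
	return -1;
}

/*
 * Comparison only (an assumption, not the kernel's check): test each
 * zone against its own pages_min, with nothing carried over.
 */
int first_zone_per_zone(const struct zone_model *zones, int nr_zones,
			unsigned int order)
{
	int i;

	assert(zones != NULL);
	for (i = 0; i < nr_zones; i++) {
		unsigned long min = (1UL << order) + zones[i].pages_min;

		if (zones[i].free_pages > min)
			return i;
	}
	return -1;
}
```

With the sample values and order 0, first_zone_cumulative(example_zones,
2, 0) returns -1, because the threshold on the second pass is
1 + 128 + 255 = 384 and the second zone has only 300 pages free.
first_zone_per_zone(example_zones, 2, 0) returns 1, because 300 > 256.
The two walks disagree about whether the second zone is usable, which is
the mismatch behind the loop described above.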

As for your patch, it reintroduces a deadlock by looping in GFP and
relying on the oom killer (which will also go and kill the bigger task
most of the time). The oom killer can select a task in D state, or it
can send a SIGTERM. Secondly, you broke google DB (the right fix for
that min thing is the point-of-view watermarks in the -rest patch in my
collection). The worst thing is that with the oom killer we have to
keep looping, so if the task is for whatever reason hung in R state in
the kernel, the machine will deadlock, while the current way it will
make progress either in do_exit or in the -ENOMEM fail path (modulo
getblk, which is not too bad anyway). The current memory balancing has
been good enough to kill as a function of probability, so I didn't feel
the need to risk (at the very least theoretical) deadlocks there; this
is why I left it disabled.

Andrea