Re: 2.4.7p6 hang

Mike Kravetz (mkravetz@sequent.com)
Wed, 11 Jul 2001 10:19:01 -0700


On Wed, Jul 11, 2001 at 05:58:09PM +0200, Andrea Arcangeli wrote:
>
> this one I forgot to submit but here it is now for easy merging:
>
> --- 2.4.4aa3/kernel/sched.c.~1~ Sun Apr 29 17:37:05 2001
> +++ 2.4.4aa3/kernel/sched.c Tue May 1 16:39:42 2001
> @@ -674,8 +674,10 @@
> #endif
> spin_unlock_irq(&runqueue_lock);
>
> - if (prev == next)
> + if (prev == next) {
> + current->policy &= ~SCHED_YIELD;
> goto same_process;
> + }
>
> #ifdef CONFIG_SMP
> /*

I would like to second the need for this patch in the 'mainline' kernel.
Not too long ago, I came up with the following scenario, which is caused
by this bug. The scenario is based on the unmodified 2.4.4 scheduler.

- Task A calls sched_yield(). The code in sys_sched_yield() determines
that a yield is in order, sets SCHED_YIELD in the task's policy field,
and sets need_resched for the task.
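
For reference, the tail end of sys_sched_yield() in the 2.4-era
scheduler looks roughly like the sketch below (paraphrased from memory,
not a verbatim quote; nr_pending here is the count of runnable tasks
other than the caller):

        if (nr_pending) {
                /*
                 * This process can only be rescheduled by us,
                 * so it is safe without any locking.
                 */
                if (current->policy == SCHED_OTHER)
                        current->policy |= SCHED_YIELD;
                current->need_resched = 1;
        }
        return 0;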

- When Task A attempts to return to user land, schedule() will
be called (since need_resched was set). However, in this case
schedule() does not find a better task than A to run. Since
task A will continue to run, the 'same_process' goto is taken
in schedule(). Note that __schedule_tail() is not called, so
the SCHED_YIELD flag remains set in A when it continues to
execute.
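
For context, the bottom of the 2.4 schedule() looks roughly like this
(heavily condensed and paraphrased; everything not relevant to this
point is elided):

        spin_unlock_irq(&runqueue_lock);

        if (prev == next)
                goto same_process;              /* fast path */

        /* ... slow path: switch to 'next' ... */
        switch_to(prev, next, prev);
        __schedule_tail(prev);                  /* only on the slow path */

same_process:
        reacquire_kernel_lock(current);
        if (current->need_resched)
                goto need_resched_back;
        return;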

- Task A then performs some operation which causes it to go into
a non-runnable state (such as calling nanosleep()). After setting
the state of Task A to something other than TASK_RUNNING, a call
to schedule() will be made. At this time Task A will be removed
from the runqueue (again note that SCHED_YIELD remains set in A).
Also, assume that there are no other runnable tasks so the idle
task is chosen to run next on this CPU.
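
The removal happens in the prev->state switch near the top of
schedule(), which is approximately (again paraphrased, not verbatim):

        switch (prev->state) {
                case TASK_INTERRUPTIBLE:
                        if (signal_pending(prev)) {
                                prev->state = TASK_RUNNING;
                                break;
                        }
                default:
                        del_from_runqueue(prev);   /* A leaves the runqueue */
                case TASK_RUNNING:;
        }
        prev->need_resched = 0;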

- Now, after schedule() releases the runqueue lock the timer for
Task A fires and we call the wake_up code. This code path will
eventually call try_to_wake_up() which will set the state of A
to TASK_RUNNING, add A to the runqueue and call reschedule_idle()
for A.
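
In outline, the 2.4 try_to_wake_up() does the following (simplified
sketch; the synchronous-wakeup special case is omitted):

        spin_lock_irqsave(&runqueue_lock, flags);
        p->state = TASK_RUNNING;
        if (task_on_runqueue(p))
                goto out;
        add_to_runqueue(p);
        reschedule_idle(p);
        success = 1;
out:
        spin_unlock_irqrestore(&runqueue_lock, flags);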

- Note that we have not yet cleared the has_cpu field in A. Hence,
can_schedule() will never be true for task A. As a result, we will
not send an IPI to any other CPU. In effect, reschedule_idle() is a
no-op.
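
To see why, recall that on SMP can_schedule() is defined roughly as:

        #define can_schedule(p,cpu) \
                ((!(p)->has_cpu) && ((p)->cpus_allowed & (1 << cpu)))

and every candidate CPU that reschedule_idle() considers is filtered
through it, along the lines of (paraphrased, not verbatim):

        best_cpu = p->processor;
        if (can_schedule(p, best_cpu)) {
                /* ... kick the idle task on best_cpu ... */
        }
        for (i = 0; i < smp_num_cpus; i++) {
                cpu = cpu_logical_map(i);
                if (!can_schedule(p, cpu))
                        continue;
                /* ... consider preempting the task on 'cpu' ... */
        }

With has_cpu still set in A, every can_schedule() check fails and no
CPU is kicked.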

- Now, we finally call __schedule_tail() for task A. After clearing
the SCHED_YIELD and has_cpu flags, we notice that the state of A
is TASK_RUNNING (it was set by try_to_wake_up()) and take the
needs_resched goto.
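
The relevant shape of __schedule_tail() is roughly the following
(paraphrased from the 2.4 source, not verbatim):

        int policy;

        /* snapshot the old policy, then clear SCHED_YIELD in the task */
        policy = prev->policy;
        prev->policy = policy & ~SCHED_YIELD;
        wmb();

        task_lock(prev);
        prev->has_cpu = 0;
        mb();
        if (prev->state == TASK_RUNNING)
                goto needs_resched;
out_unlock:
        task_unlock(prev);
        return;

Note that the local variable 'policy' still has SCHED_YIELD set even
though the flag has just been cleared in prev->policy; that local copy
is exactly what the check quoted in the next step tests.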

- The needs_resched block of code usually results in a call to
reschedule_idle() for the task. However, the block begins with the
following check:

        /*
         * Avoid taking the runqueue lock in cases where
         * no preemption-check is necessery:
         */
        if ((prev == idle_task(smp_processor_id())) ||
                        (policy & SCHED_YIELD))
                goto out_unlock;

Since the SCHED_YIELD flag was set in A when we entered this routine,
we will not call reschedule_idle().

In this case, the CPU associated with task A is still idle, yet we will
not schedule the task on that CPU. In addition, it is possible that at
this point ALL CPUs in the system are idle. Hence, we would end up with
every CPU idle while task A sits on the runqueue. Not good!

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center