Re: Scheduler Bug (set_cpus_allowed)

Ingo Molnar (mingo@elte.hu)
Mon, 10 Jun 2002 22:02:28 +0200 (CEST)


Mike,

On Mon, 10 Jun 2002, Mike Kravetz wrote:

> However, even with this code removed there are still some races in the
> scheduler. Specifically, load_balance() can race with schedule() in the
> code around the call to context switch(). [...]

the fundamental problem is the dropping of the runqueue lock before doing
the context_switch(), which leaves the scheduler data structures in an
inconsistent state. Reintroducing rq->prev works, we had something like
that in 2.4, but i'd like to avoid such complexity as much as possible.

the fundamental problem is that the current rq->frozen code is broken,
unfortunately. It simply does not work, and in fact it never worked since
it was introduced in 2.5.7. Nothing synchronizes on rq->frozen, only the
rq-local CPU locks/unlocks it, so it has *zero* functional effect, it's an
expensive no-op.

I've removed it from the scheduler in the attached patch (against 2.5.21),
which in turn should also fix the race(s) Mike noticed.

the rq->frozen code was added for the sake of the Sparc port, which has a
legitim reason: it wants to do some more complex things (like calling into
the scheduler ...) in context_switch(). But i really dislike the effects
of this: it makes the context switch non-atomic, which triggers a set of
races and complexities.

David, would it be possible to somehow not recurse back into the scheduler
code (like wakeup) from within the port-specific switch_to() code? If
something like that absolutely has to be done then a better solution would
be to eg. introduce an arch-optional post-context-switch callback of
sorts, which would happen with the runqueue lock dropped. The overhead of
this solution can be controlled and it has no impact on the design of the
scheduler. No coding is needed on your side, i'll code it all up if it
solves your problems.

Ingo

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.493 -> 1.494
# kernel/sched.c 1.83 -> 1.84
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/06/10 mingo@elte.hu 1.494
# - get rid of rq->frozen, fix context switch races.
# --------------------------------------------
#
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Mon Jun 10 22:01:41 2002
+++ b/kernel/sched.c Mon Jun 10 22:01:41 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -403,7 +402,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +827,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +836,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -1599,7 +1593,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/