Rick broke the non HT parts of D7 out into patches 2,3,4,5 for us.
This has been useful in all sorts of ways ...
http://marc.theaimsgroup.com/?l=lse-tech&m=104436251927067&w=2
(I'll attatch his patches here for convenience, they're small, and I don't
trust the archives not to munge them).
Below I've done two runs on each interesting combo (2 is a no-op).
Overall results are a slight improvement in kernbench and a degredation
in SDET. My impressions from the data below (16x NUMA-Q)
3 - gives a good boost to kernbench, marginal degreadation on SDET.
4 - doesn't do much.
5 - gives some degredation to kernebench, most of the degredation of SDET.
Conclusion: 3 = good (mostly). 5 = bad.
key:
                 2.5.59 - virgin
          2.5.59-sched3 - patch 2 & 3
        2.5.59-sched3-2 - patch 2 & 3
          2.5.59-sched4 - patch 2 & 3 & 4
        2.5.59-sched4-2 - patch 2 & 3 & 4
       2.5.59-ricksched - patch 2 & 3 & 4 & 5
      2.5.59-ricksched2 - patch 2 & 3 & 4 & 5
Oh, and the runs for 4-2 don't have any schedbench numbers because the
machine crashed, and I'm too lazy to rerun, as it wasn't showing much
anyway.
Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                            Elapsed        User      System         CPU
                 2.5.59       46.08      563.88      118.38     1480.00
          2.5.59-sched3       46.12      567.52      115.82     1480.83
        2.5.59-sched3-2       46.37      567.81      116.48     1475.00
          2.5.59-sched4       46.00      566.28      116.30     1483.17
        2.5.59-sched4-2       46.30      567.29      117.43     1478.33
       2.5.59-ricksched       46.11      568.54      117.21     1486.67
      2.5.59-ricksched2       46.54      568.72      117.43     1473.50
Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                            Elapsed        User      System         CPU
                 2.5.59       47.45      568.02      143.17     1498.17
          2.5.59-sched3       47.59      573.48      139.78     1498.33
        2.5.59-sched3-2       47.61      573.63      139.97     1498.17
          2.5.59-sched4       47.65      573.47      140.47     1497.67
        2.5.59-sched4-2       47.73      573.62      141.07     1496.83
       2.5.59-ricksched       47.16      571.72      140.02     1508.50
      2.5.59-ricksched2       47.19      571.60      138.00     1503.00
DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This 
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.
Results are shown as percentages of the first set displayed
SDET 1  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         0.8%
          2.5.59-sched3        98.2%         0.8%
        2.5.59-sched3-2        97.2%         2.1%
          2.5.59-sched4        95.3%         3.5%
        2.5.59-sched4-2        97.1%         3.6%
       2.5.59-ricksched        89.2%         4.3%
      2.5.59-ricksched2        89.4%         2.9%
SDET 2  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         0.6%
          2.5.59-sched3        93.3%         7.7%
        2.5.59-sched3-2        92.8%         8.5%
          2.5.59-sched4        91.0%        10.1%
        2.5.59-sched4-2        96.4%         7.4%
       2.5.59-ricksched        89.7%         5.6%
      2.5.59-ricksched2        84.5%         2.1%
SDET 4  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         5.7%
          2.5.59-sched3        96.9%         4.4%
        2.5.59-sched3-2        98.6%         5.2%
          2.5.59-sched4       102.5%         7.2%
        2.5.59-sched4-2        97.0%         1.6%
       2.5.59-ricksched        91.4%         3.5%
      2.5.59-ricksched2        92.6%         1.5%
SDET 8  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         1.4%
          2.5.59-sched3       103.7%         4.8%
        2.5.59-sched3-2       102.8%         4.2%
          2.5.59-sched4       104.5%         4.3%
        2.5.59-sched4-2       102.0%         2.9%
       2.5.59-ricksched       102.2%         4.6%
      2.5.59-ricksched2       102.5%         1.1%
SDET 16  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         1.7%
          2.5.59-sched3        98.9%         1.9%
        2.5.59-sched3-2       101.9%         1.8%
          2.5.59-sched4        99.1%         1.7%
        2.5.59-sched4-2        99.5%         0.9%
       2.5.59-ricksched        98.9%         1.1%
      2.5.59-ricksched2        78.7%        50.1%
SDET 32  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         1.5%
          2.5.59-sched3        98.9%         2.0%
        2.5.59-sched3-2        99.7%         0.8%
          2.5.59-sched4        97.7%         1.1%
        2.5.59-sched4-2        98.7%         1.4%
       2.5.59-ricksched        96.7%         1.6%
      2.5.59-ricksched2        89.7%        14.9%
SDET 64  (see disclaimer)
                         Throughput    Std. Dev
                 2.5.59       100.0%         0.7%
          2.5.59-sched3        98.8%         0.6%
        2.5.59-sched3-2        98.9%         0.4%
          2.5.59-sched4        97.8%         0.7%
        2.5.59-sched4-2        96.7%         0.9%
       2.5.59-ricksched        98.8%         0.7%
      2.5.59-ricksched2        97.7%         1.1%
SDET 128  (see disclaimer)
                         Throughput    Std. Dev
NUMA schedbench 4:
                            AvgUser     Elapsed   TotalUser    TotalSys
                 2.5.59        0.00       38.88       82.78        0.65
          2.5.59-sched3        0.00       30.13       93.10        1.02
        2.5.59-sched3-2        0.00       37.47      111.83        0.69
          2.5.59-sched4        0.00       30.92       90.07        0.90
        2.5.59-sched4-2        0.00       34.20       84.41        0.70
       2.5.59-ricksched        0.00       26.40       69.10        0.68
      2.5.59-ricksched2        0.00       60.15      170.80        0.98
NUMA schedbench 8:
                            AvgUser     Elapsed   TotalUser    TotalSys
                 2.5.59        0.00       49.30      247.80        1.93
          2.5.59-sched3        0.00       54.05      306.14        1.50
        2.5.59-sched3-2        0.00       39.71      241.97        2.19
          2.5.59-sched4        0.00       54.05      305.21        1.35
        2.5.59-sched4-2           0           0           0           0
       2.5.59-ricksched        0.00       47.32      261.69        1.59
      2.5.59-ricksched2        0.00       45.88      260.44        1.70
NUMA schedbench 16:
                            AvgUser     Elapsed   TotalUser    TotalSys
                 2.5.59        0.00       57.37      843.12        3.77
          2.5.59-sched3        0.00       57.96      849.88        2.81
        2.5.59-sched3-2        0.00       57.85      853.58        3.28
          2.5.59-sched4        0.00       57.63      847.41        2.78
        2.5.59-sched4-2           0           0           0           0
       2.5.59-ricksched        0.00       57.48      850.36        3.19
      2.5.59-ricksched2        0.00       57.29      848.65        4.21
NUMA schedbench 32:
                            AvgUser     Elapsed   TotalUser    TotalSys
                 2.5.59        0.00      116.99     1805.79        6.05
          2.5.59-sched3        0.00      120.15     1839.49        6.70
        2.5.59-sched3-2        0.00      126.20     1843.77        6.42
          2.5.59-sched4        0.00      119.90     1825.33        7.05
        2.5.59-sched4-2           0           0           0           0
       2.5.59-ricksched        0.00      122.43     1821.62        6.80
      2.5.59-ricksched2        0.00      125.44     1821.75        6.96
NUMA schedbench 64:
                            AvgUser     Elapsed   TotalUser    TotalSys
                 2.5.59        0.00      235.18     3632.73       15.45
          2.5.59-sched3        0.00      238.90     3714.23       17.74
        2.5.59-sched3-2        0.00      235.20     3699.45       18.75
          2.5.59-sched4        0.00      241.24     3689.19       15.65
        2.5.59-sched4-2           0           0           0           0
       2.5.59-ricksched        0.00      242.80     3708.15       16.97
      2.5.59-ricksched2        0.00      235.71     3693.12       16.50
--==========1812079384==========
Content-Type: text/plain; charset=iso-8859-1; name="02-struct_change.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="02-struct_change.txt"; size=2052
diff -ru linux-2.5.59/include/linux/sched.h =
sched-2.5.59-02/include/linux/sched.h
--- linux-2.5.59/include/linux/sched.h	Thu Jan 16 18:21:38 2003
+++ sched-2.5.59-02/include/linux/sched.h	Tue Feb  4 00:30:05 2003
@@ -293,7 +293,7 @@
 	prio_array_t *array;
=20
 	unsigned long sleep_avg;
-	unsigned long sleep_timestamp;
+	unsigned long last_run;
=20
 	unsigned long policy;
 	unsigned long cpus_allowed;
diff -ru linux-2.5.59/kernel/fork.c sched-2.5.59-02/kernel/fork.c
--- linux-2.5.59/kernel/fork.c	Thu Jan 16 18:21:38 2003
+++ sched-2.5.59-02/kernel/fork.c	Tue Feb  4 00:30:06 2003
@@ -876,7 +876,7 @@
 	 */
 	p->first_time_slice =3D 1;
 	current->time_slice >>=3D 1;
-	p->sleep_timestamp =3D jiffies;
+	p->last_run =3D jiffies;
 	if (!current->time_slice) {
 		/*
 	 	 * This case is rare, it happens when the parent has only
diff -ru linux-2.5.59/kernel/sched.c sched-2.5.59-02/kernel/sched.c
--- linux-2.5.59/kernel/sched.c	Thu Jan 16 18:22:29 2003
+++ sched-2.5.59-02/kernel/sched.c	Tue Feb  4 00:30:05 2003
@@ -324,14 +324,14 @@
  */
 static inline void activate_task(task_t *p, runqueue_t *rq)
 {
-	unsigned long sleep_time =3D jiffies - p->sleep_timestamp;
+	unsigned long sleep_time =3D jiffies - p->last_run;
 	prio_array_t *array =3D rq->active;
=20
 	if (!rt_task(p) && sleep_time) {
 		/*
 		 * This code gives a bonus to interactive tasks. We update
 		 * an 'average sleep time' value here, based on
-		 * sleep_timestamp. The more time a task spends sleeping,
+		 * ->last_run. The more time a task spends sleeping,
 		 * the higher the average gets - and the higher the priority
 		 * boost gets as well.
 		 */
@@ -975,7 +975,7 @@
 	 */
=20
 #define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
-	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) &&	\
+	((jiffies - (p)->last_run > cache_decay_ticks) &&	\
 		!task_running(rq, p) &&					\
 			((p)->cpus_allowed & (1UL << (this_cpu))))
=20
@@ -1160,7 +1160,7 @@
 	rq =3D this_rq();
=20
 	release_kernel_lock(prev);
-	prev->sleep_timestamp =3D jiffies;
+	prev->last_run =3D jiffies;
 	spin_lock_irq(&rq->lock);
=20
 	/*
--==========1812079384==========
Content-Type: text/plain; charset=us-ascii; name="03-tunables.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="03-tunables.txt"; size=1487
diff -ru sched-2.5.59-02/kernel/sched.c sched-2.5.59-03/kernel/sched.c
--- sched-2.5.59-02/kernel/sched.c	Tue Feb  4 00:30:05 2003
+++ sched-2.5.59-03/kernel/sched.c	Tue Feb  4 00:31:43 2003
@@ -54,20 +54,19 @@
 /*
  * These are the 'tuning knobs' of the scheduler:
  *
- * Minimum timeslice is 10 msecs, default timeslice is 150 msecs,
- * maximum timeslice is 300 msecs. Timeslices get refilled after
+ * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
+ * maximum timeslice is 200 msecs. Timeslices get refilled after
  * they expire.
  */
 #define MIN_TIMESLICE		( 10 * HZ / 1000)
-#define MAX_TIMESLICE		(300 * HZ / 1000)
-#define CHILD_PENALTY		95
+#define MAX_TIMESLICE		(200 * HZ / 1000)
+#define CHILD_PENALTY		50
 #define PARENT_PENALTY		100
 #define EXIT_WEIGHT		3
 #define PRIO_BONUS_RATIO	25
 #define INTERACTIVE_DELTA	2
-#define MAX_SLEEP_AVG		(2*HZ)
-#define STARVATION_LIMIT	(2*HZ)
-#define NODE_THRESHOLD          125
+#define MAX_SLEEP_AVG		(10*HZ)
+#define STARVATION_LIMIT	(30*HZ)
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -1035,9 +1034,9 @@
  * increasing number of running tasks:
  */
 #define EXPIRED_STARVING(rq) \
-		((rq)->expired_timestamp && \
+		(STARVATION_LIMIT && ((rq)->expired_timestamp && \
 		(jiffies - (rq)->expired_timestamp >= \
-			STARVATION_LIMIT * ((rq)->nr_running) + 1))
+			STARVATION_LIMIT * ((rq)->nr_running) + 1)))
 
 /*
  * This function gets called by the timer code, with HZ frequency.
--==========1812079384==========
Content-Type: text/plain; charset=iso-8859-1; name="04-oldmm.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="04-oldmm.txt"; size=1451
diff -ru sched-2.5.59-03/kernel/sched.c sched-2.5.59-04/kernel/sched.c
--- sched-2.5.59-03/kernel/sched.c	Tue Feb  4 00:31:43 2003
+++ sched-2.5.59-04/kernel/sched.c	Tue Feb  4 00:40:20 2003
@@ -151,6 +151,7 @@
 	unsigned long nr_running, nr_switches, expired_timestamp,
 			nr_uninterruptible;
 	task_t *curr, *idle;
+	struct mm_struct *prev_mm;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 #ifdef CONFIG_NUMA
@@ -560,7 +561,7 @@
  * context_switch - switch to the new MM and the new
  * thread's register state.
  */
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t =
*next)
 {
 	struct mm_struct *mm =3D next->mm;
 	struct mm_struct *oldmm =3D prev->active_mm;
@@ -574,7 +575,7 @@
=20
 	if (unlikely(!prev->mm)) {
 		prev->active_mm =3D NULL;
-		mmdrop(oldmm);
+		rq->prev_mm =3D oldmm;
 	}
=20
 	/* Here we just switch the register state and the stack. */
@@ -1213,14 +1214,19 @@
 	RCU_qsctr(prev->thread_info->cpu)++;
=20
 	if (likely(prev !=3D next)) {
+		struct mm_struct *prev_mm;
 		rq->nr_switches++;
 		rq->curr =3D next;
 	
 		prepare_arch_switch(rq, next);
-		prev =3D context_switch(prev, next);
+		prev =3D context_switch(rq, prev, next);
 		barrier();
 		rq =3D this_rq();
+		prev_mm =3D rq->prev_mm;
+		rq->prev_mm =3D NULL;
 		finish_arch_switch(rq, prev);
+		if (prev_mm)
+			mmdrop(prev_mm);
 	} else
 		spin_unlock_irq(&rq->lock);
=20
--==========1812079384==========
Content-Type: text/plain; charset=iso-8859-1; name="05-activate_task.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="05-activate_task.txt"; size=3374
diff -ru sched-2.5.59-04/kernel/sched.c sched-2.5.59-05/kernel/sched.c
--- sched-2.5.59-04/kernel/sched.c	Tue Feb  4 00:40:20 2003
+++ sched-2.5.59-05/kernel/sched.c	Tue Feb  4 02:24:21 2003
@@ -67,6 +67,8 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(10*HZ)
 #define STARVATION_LIMIT	(30*HZ)
+#define SYNC_WAKEUPS		1
+#define SMART_WAKE_CHILD	1
=20
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -322,6 +324,13 @@
  * Also update all the scheduling statistics stuff. (sleep average
  * calculation, priority modifiers, etc.)
  */
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+	enqueue_task(p, rq->active);
+	nr_running_inc(rq);
+}
+ =20
+
 static inline void activate_task(task_t *p, runqueue_t *rq)
 {
 	unsigned long sleep_time =3D jiffies - p->last_run;
@@ -340,8 +349,7 @@
 			p->sleep_avg =3D MAX_SLEEP_AVG;
 		p->prio =3D effective_prio(p);
 	}
-	enqueue_task(p, array);
-	nr_running_inc(rq);
+	__activate_task(p, rq);
 }
=20
 /*
@@ -455,6 +463,7 @@
 	long old_state;
 	runqueue_t *rq;
=20
+	sync &=3D SYNC_WAKEUPS;
 repeat_lock_task:
 	rq =3D task_rq_lock(p, &flags);
 	old_state =3D p->state;
@@ -473,10 +482,13 @@
 		}
 		if (old_state =3D=3D TASK_UNINTERRUPTIBLE)
 			rq->nr_uninterruptible--;
-		activate_task(p, rq);
-
-		if (p->prio < rq->curr->prio)
-			resched_task(rq->curr);
+		if (sync)
+			__activate_task(p, rq);
+		else {
+			activate_task(p, rq);
+			if (p->prio < rq->curr->prio)
+				resched_task(rq->curr);
+		}
 		success =3D 1;
 	}
 	p->state =3D TASK_RUNNING;
@@ -512,8 +524,19 @@
 		p->prio =3D effective_prio(p);
 	}
 	set_task_cpu(p, smp_processor_id());
-	activate_task(p, rq);
=20
+	if (SMART_WAKE_CHILD) {
+		if (unlikely(!current->array))
+			__activate_task(p, rq);
+		else {
+			p->prio =3D current->prio;
+			list_add_tail(&p->run_list, ¤t->run_list);
+			p->array =3D current->array;
+			p->array->nr_active++;
+			nr_running_inc(rq);
+		}
+	} else
+		activate_task(p, rq);
 	rq_unlock(rq);
 }
=20
@@ -790,7 +813,23 @@
=20
 #endif /* CONFIG_NUMA */
=20
-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=3D100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { =
}
+
+#else
=20
 /*
  * double_lock_balance - lock the busiest runqueue
@@ -972,6 +1011,9 @@
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
+	 *
+	 * (except if we are in idle mode which is a more agressive
+	 * form of rebalancing.)
 	 */
=20
 #define CAN_MIGRATE_TASK(p,rq,this_cpu)					\
@@ -1665,7 +1707,7 @@
 	else
 		p->prio =3D p->static_prio;
 	if (array)
-		activate_task(p, task_rq(p));
+		__activate_task(p, task_rq(p));
=20
 out_unlock:
 	task_rq_unlock(rq, &flags);
@@ -2297,7 +2339,7 @@
 			set_task_cpu(p, cpu_dest);
 			if (p->array) {
 				deactivate_task(p, rq_src);
-				activate_task(p, rq_dest);
+				__activate_task(p, rq_dest);
 				if (p->prio < rq_dest->curr->prio)
 					resched_task(rq_dest->curr);
 			}
--==========1812079384==========--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/