[PATCH] get_pid() performance fix

Hubertus Franke (frankeh@watson.ibm.com)
Fri, 22 Mar 2002 17:14:03 -0500


I implemented an alternative version of get_pid() that, for large thread counts
(> ~22,000), provides significantly better performance, as shown in the attached
performance analysis. This is particularly relevant for PID_MAX=32768.
Note that under the current algorithm, allocating the last pid takes 32
seconds (!!!) on a 500MHz P-III.
I addressed the correctness issue that Hirofumi brought up with the last
submission (now test case "-c 4").
The attached patch is against 2.5.7.

-- Hubertus Franke <frankeh@watson.ibm.com>

------------------------------------------------------------------------------

Currently the get_pid() algorithm works as follows:
At any given time an interval [ last_pid .. next_safe ) is known to hold
unused pids. Initially the interval is set to [ 0 .. 32767 ].

Hence to allocate a new pid the following is sufficient:

if (++last_pid < next_safe) return last_pid;

However, if we move out of the safe interval, the next safe interval needs
to be established first.
This is currently done by a repetitive search:

repeat:
        for_each_task(p) {
                if (p uses last_pid) { last_pid++; goto repeat; }
                /* narrow [ last_pid .. next_safe ) */
                if (p's pid/pgrp/session lies in [ last_pid .. next_safe ))
                        next_safe = that value;
        }

Particularly for a large number of tasks, this leads to frequent restarts at
the repeat label, resulting in an O(N^2) algorithm. We call this <algo-0>.

Instead, I have provided an alternative mechanism that, at the time the next
interval must be determined, builds a bitmap of the pids in use by walking the
tasklist once [ O(N) ] and then finds the proper bit offsets to delimit the
next free interval starting from last_pid. The bitmap requires 4096 bytes.
This is <algo-1>.
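
For illustration, a minimal user-level sketch of that idea follows. The helper
names, the bit-by-bit scan, and the omitted tasklist walk are mine for clarity;
the actual kernel code in the patch below scans the map a word at a time with
ffz()/ffs().

#include <string.h>
#include <limits.h>

#define PID_MAX       32768
#define WORD_BITS     (CHAR_BIT * sizeof(unsigned long))
#define PID_MAP_WORDS (PID_MAX / WORD_BITS)

static unsigned long pid_map[PID_MAP_WORDS];    /* 32768 bits = 4096 bytes */

static void mark_used(int pid)
{
        pid_map[pid / WORD_BITS] |= 1UL << (pid % WORD_BITS);
}

static int pid_in_use(int pid)
{
        return (pid_map[pid / WORD_BITS] >> (pid % WORD_BITS)) & 1;
}

/*
 * algo-1: rebuild the bitmap with one O(N) walk over all tasks, then scan
 * it from 'start' for the next free pid and for the end of the free run
 * (the new next_safe).  Returns 0 if nothing is free at or above 'start'.
 */
static int find_free_interval(int start, int *next_safe)
{
        int pid, end;

        memset(pid_map, 0, sizeof(pid_map));
        /* for_each_task(p) { mark_used(p->pid); ... }  -- kernel walk omitted */

        for (pid = start; pid < PID_MAX && pid_in_use(pid); pid++)
                ;
        if (pid == PID_MAX)
                return 0;

        for (end = pid + 1; end < PID_MAX && !pid_in_use(end); end++)
                ;
        *next_safe = end;       /* [ pid .. *next_safe ) is known to be free */
        return pid;
}

The point of the precomputed map is that the single O(N) walk amortizes over
the whole free interval it discovers, instead of restarting the tasklist walk
for every collision as <algo-0> does.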

An optimization of this is to keep the bitmap from the previous search
instead of clearing it before every search. Only if we fail to obtain a free
range do we have to go back, clear the bitmap, and redo the search one more
time.
This is <algo-2>. Here I omit the results, as they only showed improvements
for the very last few pids.
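
A sketch of this lazy-clearing variant, built on the helpers above (again only
an illustration; the retry structure mirrors get_pid_by_map() in the patch,
but the names are mine):

/*
 * algo-2: leave pid_map uncleared between searches.  Stale bits from tasks
 * that have since exited can only make a free pid look busy, never the
 * reverse, so a successful lookup is always correct.  Only when the stale
 * map yields nothing do we sweep it and search once more.
 */
static int find_free_interval_lazy(int start, int *next_safe)
{
        int round, pid;

        for (round = 0; round < 2; round++) {
                /* for_each_task(p) { mark_used(p->pid); ... }  -- re-mark live tasks */
                for (pid = start; pid < PID_MAX && pid_in_use(pid); pid++)
                        ;
                if (pid < PID_MAX) {
                        int end;
                        for (end = pid + 1; end < PID_MAX && !pid_in_use(end); end++)
                                ;
                        *next_safe = end;
                        return pid;
                }
                /* stale bits may be hiding free pids: sweep and retry once */
                memset(pid_map, 0, sizeof(pid_map));
                start = 300;    /* RESERVED_PIDS: skip daemons etc. */
        }
        return 0;               /* genuinely out of pids */
}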

I pulled the various algorithms into a user-level test program to figure out
where the cut-off points are with PID_MAX=32768. In this test program I
maintain <TH> tasks and, for 10 rounds, delete <DEL> random tasks and
reallocate <DEL> tasks again, resulting in T = 10*<DEL> total measured
allocations.
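
The test program itself is not included in this mail. Purely to illustrate the
measurement loop, a user-level driver along these lines could be used (the
stand-in allocator and all names here are assumptions, not the actual test
code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define PID_MAX 32768
#define ROUNDS  10

static unsigned char used[PID_MAX];     /* pid -> currently allocated? */
static int pids[PID_MAX];               /* the <TH> live "tasks" */

/* stand-in allocator: plug in algo-0/1/2 here */
static int get_pid_stub(void)
{
        int pid;
        for (pid = 300; pid < PID_MAX; pid++)
                if (!used[pid]) { used[pid] = 1; return pid; }
        return 0;
}

static double now_usecs(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(int argc, char **argv)
{
        int TH, DEL, i, r;
        double total = 0.0;

        if (argc < 3) return 1;
        TH = atoi(argv[1]);
        DEL = atoi(argv[2]);

        for (i = 0; i < TH; i++)                /* build up <TH> tasks */
                pids[i] = get_pid_stub();

        for (r = 0; r < ROUNDS; r++) {
                for (i = 0; i < DEL; i++) {     /* kill <DEL> random tasks */
                        int victim;
                        do {
                                victim = rand() % TH;
                        } while (pids[victim] < 0);
                        used[pids[victim]] = 0;
                        pids[victim] = -1;
                }
                {
                        double t0 = now_usecs();
                        for (i = 0; i < TH; i++)        /* ...recreate them, timed */
                                if (pids[i] < 0)
                                        pids[i] = get_pid_stub();
                        total += now_usecs() - t0;
                }
        }
        printf("TH=%d DEL=%d T=%d  %.2f usecs per get_pid\n",
               TH, DEL, ROUNDS * DEL, total / (ROUNDS * DEL));
        return 0;
}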

Si states how many interval searches were needed for algo-i; Gi states the
average overhead per get_pid() for algo-i in usecs. (In the table below these
appear as the C-* and T-* columns, respectively.)

Based on that, one should use the current algorithm up to ~22-25K tasks and
beyond that use algo-2. Only for the very last ~15 pids is algo-1 a bit
faster; we can safely ignore that case.

Based on that, providing an adaptive implementation seems the right choice.
The patch for 2.5.7 is attached.

Executed program example: getpid -c 2 -s 10 -e 100 -D 10 -t <0,1>
(-t 0 selects the old algorithm <algo-0>; -t 1 selects the new <algo-2>.)

TH: average thread count in the system
DEL: number of threads randomly deleted and then reallocated per round
TAL: total number of get_pid() invocations measured
C-NEW: number of times the interval search was invoked (new algorithm)
T-NEW: overhead per get_pid() in usecs (new algorithm)
C-OLD: number of times the interval search was invoked (old algorithm)
T-OLD: overhead per get_pid() in usecs (old algorithm)

TH DEL TAL | C-NEW T-NEW | C-OLD T-OLD
10 10 80 | 1 0.79 | 1 0.41
20 10 100 | 1 0.56 | 1 0.36
30 10 100 | 1 0.56 | 1 0.38
40 10 100 | 1 0.58 | 1 0.42
50 10 100 | 1 0.59 | 1 0.43
60 10 100 | 2 0.84 | 2 0.52
70 10 100 | 1 0.72 | 1 0.43
80 10 100 | 1 0.72 | 1 0.48
100 50 500 | 2 0.38 | 2 0.27
150 50 500 | 4 0.56 | 3 0.32
200 50 500 | 5 0.79 | 4 0.40
250 50 500 | 6 1.03 | 2 0.35
300 50 500 | 9 1.68 | 2 0.38
350 50 500 | 6 1.47 | 4 0.58
400 50 500 | 10 2.54 | 8 1.05
450 50 500 | 7 2.10 | 7 1.02
500 50 500 | 7 2.32 | 4 0.75
550 50 500 | 9 3.13 | 5 0.95
600 50 500 | 14 5.14 | 6 1.18
650 50 500 | 13 5.23 | 9 1.79
700 50 500 | 10 4.35 | 9 1.87
750 50 500 | 15 6.91 | 8 2.02
800 50 500 | 12 5.98 | 8 1.95
850 50 500 | 13 6.85 | 9 2.29
900 50 500 | 27 14.55 | 18 4.46
1000 1000 9980 | 1 0.21 | 1 0.21
2000 1000 10000 | 322 14.36 | 76 2.03
3000 1000 10000 | 606 46.10 | 161 6.63
4000 1000 10000 | 901 97.58 | 359 18.87
5000 1000 10000 | 1165 164.19 | 539 38.75
6000 1000 10000 | 1498 266.58 | 724 66.96
7000 1000 10000 | 1835 396.91 | 1026 122.35
8000 1000 10000 | 2084 531.83 | 1228 179.70
9000 1000 10000 | 2489 746.99 | 1516 264.16
10000 1000 10000 | 2763 946.99 | 1818 375.22
11000 1000 10000 | 3095 1199.73 | 2093 502.47
12000 1000 10000 | 3277 1422.02 | 2305 648.17
13000 1000 10000 | 3711 1776.28 | 2680 854.89
14000 1000 10000 | 3878 2033.30 | 2959 1061.61
15000 1000 10000 | 4301 2452.35 | 3167 1266.78
16000 1000 10000 | 4485 2757.00 | 3466 1554.22
17000 1000 10000 | 4844 3210.19 | 3743 1857.74
18000 1000 10000 | 5218 3681.90 | 4069 2247.13
19000 1000 10000 | 5501 4118.69 | 4293 2605.90
20000 1000 10000 | 5770 4594.99 | 4616 3095.39
21000 1000 10000 | 6161 5172.44 | 4974 3686.87
22000 1000 10000 | 6389 5637.80 | 5177 4258.81
23000 1000 10000 | 6665 6191.73 | 5483 4949.61
24000 1000 10000 | 6982 6777.02 | 5838 5831.25
25000 1000 10000 | 7209 7318.15 | 6183 6905.34
----------------> Break even range <------------------------
26000 1000 10000 | 7533 7954.33 | 6413 7955.66
27000 1000 10000 | 7959 8749.97 | 6748 9444.79
28000 1000 10000 | 8140 9302.36 | 7139 11617.75
29000 1000 10000 | 8501 10085.77 | 7638 14960.53
30000 1000 10000 | 8911 10946.80 | 8178 19584.24
32000 50 500 | 487 12265.89 | 486 94498.36
32050 50 500 | 486 12314.17 | 486 95832.03
32100 50 500 | 486 12389.38 | 488 108895.28
32150 50 500 | 492 12599.58 | 484 111742.39
32200 50 500 | 497 12792.36 | 491 124440.45
32250 50 500 | 490 12659.33 | 490 134499.09
32300 50 500 | 495 12843.78 | 489 145873.72
32350 50 500 | 495 12915.18 | 493 158940.66
32400 50 500 | 496 12988.64 | 494 183092.45
32450 50 500 | 492 12924.86 | 490 196135.28
32500 50 500 | 495 13043.85 | 488 223226.98
32550 50 500 | 498 13164.09 | 495 265431.10
32600 50 500 | 498 13222.56 | 495 326363.36
32650 50 500 | 498 13279.90 | 497 441002.09
32700 50 500 | 499 13368.66 | 499 656269.52
32751 1 10 | 10 12836.40 | 10 4031660.40
32752 1 10 | 10 12831.70 | 10 4620214.70
32753 1 10 | 10 12832.60 | 10 4188605.70
32754 1 10 | 10 12837.60 | 10 2972975.40
32755 1 10 | 10 12835.90 | 10 4434635.70
32756 1 10 | 10 12833.80 | 10 3892058.30
32757 1 10 | 10 12839.10 | 10 5002365.30
32758 1 10 | 10 12840.20 | 10 6332786.20
32759 1 10 | 10 12840.20 | 10 5269462.90
32760 1 10 | 10 14127.80 | 10 8234780.40
32761 1 10 | 10 14134.90 | 10 7558043.20
32762 1 10 | 10 14129.70 | 10 9117119.40
32763 1 10 | 10 14134.70 | 10 13895498.10
32764 1 10 | 10 14140.10 | 10 13608972.90
32765 1 10 | 10 15427.30 | 10 12930469.20
32766 1 10 | 10 16708.30 | 10 23576610.90
32767 1 10 | 10 17997.90 | 10 32603396.10

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

diff -urbN linux-2.5.7/kernel/fork.c linux-2.5.7-pid/kernel/fork.c
--- linux-2.5.7/kernel/fork.c   Mon Mar 18 15:37:05 2002
+++ linux-2.5.7-pid/kernel/fork.c       Fri Mar 22 16:38:29 2002
@@ -125,24 +125,111 @@
 /* Protects next_safe and last_pid. */
 spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
 
+/* this should be provided in every architecture */
+#ifndef SHIFT_PER_LONG
+#if BITS_PER_LONG == 64
+# define SHIFT_PER_LONG 6
+#elif BITS_PER_LONG == 32
+# define SHIFT_PER_LONG 5
+#else
+#error "SHIFT_PER_LONG"
+#endif
+#endif
+
+#define RESERVED_PIDS (300)
+#define GETPID_THRESHOLD (26000)        /* when to switch to a mapped algo */
+#define PID_MAP_SIZE (PID_MAX >> SHIFT_PER_LONG)
+static unsigned long pid_map[PID_MAP_SIZE];
+static int next_safe = PID_MAX;
+
+static inline void mark_pid(int pid)
+{
+        __set_bit(pid,pid_map);
+}
+
+static int get_pid_by_map(int lastpid)
+{
+        static int mark_and_sweep = 0;
+
+        int round = 0;
+        struct task_struct *p;
+        int i;
+        unsigned long mask;
+        int fpos;
+
+        if (mark_and_sweep) {
+repeat:
+                mark_and_sweep = 0;
+                memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
+                lastpid = RESERVED_PIDS;
+        }
+        for_each_task(p) {
+                mark_pid(p->pid);
+                mark_pid(p->pgrp);
+                mark_pid(p->tgid);
+                mark_pid(p->session);
+        }
+
+        /* find next free pid */
+        i = (lastpid >> SHIFT_PER_LONG);
+        mask = pid_map[i] | (((lastpid & (BITS_PER_LONG-1)) << 1) - 1);
+
+        while ((mask == ~0) && (++i < PID_MAP_SIZE))
+                mask = pid_map[i];
+
+        if (i == PID_MAP_SIZE) {
+                if (round == 0) {
+                        round = 1;
+                        goto repeat;
+                }
+                next_safe = RESERVED_PIDS;
+                mark_and_sweep = 1;     /* mark next time */
+                return 0;
+        }
+
+        fpos = ffz(mask);
+        i &= (PID_MAX-1);
+        lastpid = (i << SHIFT_PER_LONG) + fpos;
+
+        /* find next save pid */
+        mask &= ~((1 << fpos) - 1);
+
+        while ((mask == 0) && (++i < PID_MAP_SIZE))
+                mask = pid_map[i];
+
+        if (i==PID_MAP_SIZE)
+                next_safe = PID_MAX;
+        else
+                next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
+        return lastpid;
+}
+
 static int get_pid(unsigned long flags)
 {
-        static int next_safe = PID_MAX;
         struct task_struct *p;
-        int pid;
+        int pid,beginpid;
 
         if (flags & CLONE_PID)
                 return current->pid;
 
         spin_lock(&lastpid_lock);
+        beginpid = last_pid;
         if((++last_pid) & 0xffff8000) {
-                last_pid = 300;         /* Skip daemons etc. */
+                last_pid = RESERVED_PIDS;       /* Skip daemons etc. */
                 goto inside;
         }
         if(last_pid >= next_safe) {
 inside:
                 next_safe = PID_MAX;
                 read_lock(&tasklist_lock);
+                if (nr_threads > GETPID_THRESHOLD) {
+                        last_pid = get_pid_by_map(last_pid);
+                        if (unlikely(last_pid == 0)) {
+                                last_pid = PID_MAX;
+                                goto nomorepids;
+                        }
+                } else {
         repeat:
                 for_each_task(p) {
                         if(p->pid == last_pid   ||
@@ -151,9 +238,11 @@
                            p->session == last_pid) {
                                 if(++last_pid >= next_safe) {
                                         if(last_pid & 0xffff8000)
-                                                last_pid = 300;
+                                                last_pid = RESERVED_PIDS;
                                         next_safe = PID_MAX;
                                 }
+                                if(unlikely(last_pid == beginpid))
+                                        goto nomorepids;
                                 goto repeat;
                         }
                         if(p->pid > last_pid && next_safe > p->pid)
@@ -162,6 +251,9 @@
                                 next_safe = p->pgrp;
                         if(p->session > last_pid && next_safe > p->session)
                                 next_safe = p->session;
+                        if(p->tgid > last_pid && next_safe > p->tgid)
+                                next_safe = p->tgid;
+                }
                 }
                 read_unlock(&tasklist_lock);
         }
@@ -169,6 +261,10 @@
         spin_unlock(&lastpid_lock);
 
         return pid;
+nomorepids:
+        read_unlock(&tasklist_lock);
+        spin_unlock(&lastpid_lock);
+        return 0;
 }
 
 static inline int dup_mmap(struct mm_struct * mm)