[PATCH] get_pid() performance fix

Hubertus Franke (frankeh@watson.ibm.com)
Fri, 22 Mar 2002 17:14:03 -0500


I implemented an alternative version of get_pid() that, for large thread counts
(> ~22,000), provides significantly better performance, as shown in the attached
performance analysis. This is particularly relevant for PID_MAX=32768.
Note that under the current algorithm, allocating the last pid takes 32
seconds (!!!) on a 500MHz P-III.
I addressed the correctness issue that Hirofumi brought up with the last
submission (now test case "-c 4").
The attached patch is against 2.5.7.

-- Hubertus Franke <frankeh@watson.ibm.com>

------------------------------------------------------------------------------

Currently the get_pid() algorithm works as follows:
At any given time an interval [ last_pid .. next_safe ) is known to hold
unused pids. Initially the interval is set to [ 0 .. 32767 ].

Hence to allocate a new pid the following is sufficient:

if (++last_pid < next_safe) return last_pid;

However, if we move out of the safe interval, the next safe interval needs
to be established first.
This is currently done by a repetitive search:

repeat:
        for_each_task(p) {
                if (p uses last_pid) { last_pid++; goto repeat; }
                /* narrow [ last_pid .. next_safe ) */
                if (p's pid/pgrp/session lies in [ last_pid .. next_safe ))
                        next_safe = that value;
        }

Particularly for a large number of tasks, this leads to frequent restarts at
the repeat label, resulting in an O(N^2) algorithm. We call this <algo-0>.

Instead, I have provided an alternative mechanism that, at the time the next
interval must be determined, builds a bitmap of the pids in use by walking the
tasklist once [ O(N) ] and then finds the proper bit offsets to delimit the
next free interval starting from last_pid. The bitmap requires 4096 bytes.
This is <algo-1>.
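
For illustration, a minimal user-level sketch of that idea follows. The helper
names, the bit-by-bit scan, and the omitted tasklist walk are mine for clarity;
the actual kernel code in the patch below scans the map a word at a time with
ffz()/ffs().

#include <string.h>
#include <limits.h>

#define PID_MAX       32768
#define WORD_BITS     (CHAR_BIT * sizeof(unsigned long))
#define PID_MAP_WORDS (PID_MAX / WORD_BITS)

static unsigned long pid_map[PID_MAP_WORDS];    /* 32768 bits = 4096 bytes */

static void mark_used(int pid)
{
        pid_map[pid / WORD_BITS] |= 1UL << (pid % WORD_BITS);
}

static int pid_in_use(int pid)
{
        return (pid_map[pid / WORD_BITS] >> (pid % WORD_BITS)) & 1;
}

/*
 * algo-1: rebuild the bitmap with one O(N) walk over all tasks, then scan
 * it from 'start' for the next free pid and for the end of the free run
 * (the new next_safe).  Returns 0 if nothing is free at or above 'start'.
 */
static int find_free_interval(int start, int *next_safe)
{
        int pid, end;

        memset(pid_map, 0, sizeof(pid_map));
        /* for_each_task(p) { mark_used(p->pid); ... }  -- kernel walk omitted */

        for (pid = start; pid < PID_MAX && pid_in_use(pid); pid++)
                ;
        if (pid == PID_MAX)
                return 0;

        for (end = pid + 1; end < PID_MAX && !pid_in_use(end); end++)
                ;
        *next_safe = end;       /* [ pid .. *next_safe ) is known to be free */
        return pid;
}

The point of the precomputed map is that the single O(N) walk amortizes over
the whole free interval it discovers, instead of restarting the tasklist walk
for every collision as <algo-0> does.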

An optimization of this is to keep the bitmap from the previous search
instead of clearing it before every search. Only if we fail to obtain a free
range do we have to go back, clear the bitmap, and redo the search one more
time.
This is <algo-2>. Here I omit the results, as they only showed improvements
for the very last few pids.
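
A sketch of this lazy-clearing variant, built on the helpers above (again only
an illustration; the retry structure mirrors get_pid_by_map() in the patch,
but the names are mine):

/*
 * algo-2: leave pid_map uncleared between searches.  Stale bits from tasks
 * that have since exited can only make a free pid look busy, never the
 * reverse, so a successful lookup is always correct.  Only when the stale
 * map yields nothing do we sweep it and search once more.
 */
static int find_free_interval_lazy(int start, int *next_safe)
{
        int round, pid;

        for (round = 0; round < 2; round++) {
                /* for_each_task(p) { mark_used(p->pid); ... }  -- re-mark live tasks */
                for (pid = start; pid < PID_MAX && pid_in_use(pid); pid++)
                        ;
                if (pid < PID_MAX) {
                        int end;
                        for (end = pid + 1; end < PID_MAX && !pid_in_use(end); end++)
                                ;
                        *next_safe = end;
                        return pid;
                }
                /* stale bits may be hiding free pids: sweep and retry once */
                memset(pid_map, 0, sizeof(pid_map));
                start = 300;    /* RESERVED_PIDS: skip daemons etc. */
        }
        return 0;               /* genuinely out of pids */
}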

I pulled the various algorithms into a user-level test program to figure out
where the cut-off points are with PID_MAX=32768. In this test program I
maintain <TH> tasks and, for 10 rounds, delete <DEL> random tasks and
reallocate <DEL> tasks again, resulting in T = 10*<DEL> total measured
allocations.
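
The test program itself is not included in this mail. Purely to illustrate the
measurement loop, a user-level driver along these lines could be used (the
stand-in allocator and all names here are assumptions, not the actual test
code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define PID_MAX 32768
#define ROUNDS  10

static unsigned char used[PID_MAX];     /* pid -> currently allocated? */
static int pids[PID_MAX];               /* the <TH> live "tasks" */

/* stand-in allocator: plug in algo-0/1/2 here */
static int get_pid_stub(void)
{
        int pid;
        for (pid = 300; pid < PID_MAX; pid++)
                if (!used[pid]) { used[pid] = 1; return pid; }
        return 0;
}

static double now_usecs(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(int argc, char **argv)
{
        int TH, DEL, i, r;
        double total = 0.0;

        if (argc < 3) return 1;
        TH = atoi(argv[1]);
        DEL = atoi(argv[2]);

        for (i = 0; i < TH; i++)                /* build up <TH> tasks */
                pids[i] = get_pid_stub();

        for (r = 0; r < ROUNDS; r++) {
                for (i = 0; i < DEL; i++) {     /* kill <DEL> random tasks */
                        int victim;
                        do {
                                victim = rand() % TH;
                        } while (pids[victim] < 0);
                        used[pids[victim]] = 0;
                        pids[victim] = -1;
                }
                {
                        double t0 = now_usecs();
                        for (i = 0; i < TH; i++)        /* ...recreate them, timed */
                                if (pids[i] < 0)
                                        pids[i] = get_pid_stub();
                        total += now_usecs() - t0;
                }
        }
        printf("TH=%d DEL=%d T=%d  %.2f usecs per get_pid\n",
               TH, DEL, ROUNDS * DEL, total / (ROUNDS * DEL));
        return 0;
}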

Si states how many interval searches were needed for algo-i; Gi states the
average overhead per get_pid() for algo-i in usecs. (In the table below these
appear as the C-* and T-* columns, respectively.)

Based on that, one should use the current algorithm up to ~22-25K tasks and
beyond that use algo-2. Only for the very last ~15 pids is algo-1 a bit
faster; we can safely ignore that case.

Based on that, providing an adaptive implementation seems the right choice.
The patch for 2.5.7 is attached.

Executed program example: getpid -c 2 -s 10 -e 100 -D 10 -t <0,1>
(-t 0 selects the old algorithm <algo-0>; -t 1 selects the new <algo-2>.)

TH: average thread count in the system
DEL: number of threads randomly deleted and then reallocated per round
TAL: total number of get_pid() invocations measured
C-NEW: number of times the interval search was invoked (new algorithm)
T-NEW: overhead per get_pid() in usecs (new algorithm)
C-OLD: number of times the interval search was invoked (old algorithm)
T-OLD: overhead per get_pid() in usecs (old algorithm)

TH DEL TAL | C-NEW T-NEW | C-OLD T-OLD
10 10 80 | 1 0.79 | 1 0.41
20 10 100 | 1 0.56 | 1 0.36
30 10 100 | 1 0.56 | 1 0.38
40 10 100 | 1 0.58 | 1 0.42
50 10 100 | 1 0.59 | 1 0.43
60 10 100 | 2 0.84 | 2 0.52
70 10 100 | 1 0.72 | 1 0.43
80 10 100 | 1 0.72 | 1 0.48
100 50 500 | 2 0.38 | 2 0.27
150 50 500 | 4 0.56 | 3 0.32
200 50 500 | 5 0.79 | 4 0.40
250 50 500 | 6 1.03 | 2 0.35
300 50 500 | 9 1.68 | 2 0.38
350 50 500 | 6 1.47 | 4 0.58
400 50 500 | 10 2.54 | 8 1.05
450 50 500 | 7 2.10 | 7 1.02
500 50 500 | 7 2.32 | 4 0.75
550 50 500 | 9 3.13 | 5 0.95
600 50 500 | 14 5.14 | 6 1.18
650 50 500 | 13 5.23 | 9 1.79
700 50 500 | 10 4.35 | 9 1.87
750 50 500 | 15 6.91 | 8 2.02
800 50 500 | 12 5.98 | 8 1.95
850 50 500 | 13 6.85 | 9 2.29
900 50 500 | 27 14.55 | 18 4.46
1000 1000 9980 | 1 0.21 | 1 0.21
2000 1000 10000 | 322 14.36 | 76 2.03
3000 1000 10000 | 606 46.10 | 161 6.63
4000 1000 10000 | 901 97.58 | 359 18.87
5000 1000 10000 | 1165 164.19 | 539 38.75
6000 1000 10000 | 1498 266.58 | 724 66.96
7000 1000 10000 | 1835 396.91 | 1026 122.35
8000 1000 10000 | 2084 531.83 | 1228 179.70
9000 1000 10000 | 2489 746.99 | 1516 264.16
10000 1000 10000 | 2763 946.99 | 1818 375.22
11000 1000 10000 | 3095 1199.73 | 2093 502.47
12000 1000 10000 | 3277 1422.02 | 2305 648.17
13000 1000 10000 | 3711 1776.28 | 2680 854.89
14000 1000 10000 | 3878 2033.30 | 2959 1061.61
15000 1000 10000 | 4301 2452.35 | 3167 1266.78
16000 1000 10000 | 4485 2757.00 | 3466 1554.22
17000 1000 10000 | 4844 3210.19 | 3743 1857.74
18000 1000 10000 | 5218 3681.90 | 4069 2247.13
19000 1000 10000 | 5501 4118.69 | 4293 2605.90
20000 1000 10000 | 5770 4594.99 | 4616 3095.39
21000 1000 10000 | 6161 5172.44 | 4974 3686.87
22000 1000 10000 | 6389 5637.80 | 5177 4258.81
23000 1000 10000 | 6665 6191.73 | 5483 4949.61
24000 1000 10000 | 6982 6777.02 | 5838 5831.25
25000 1000 10000 | 7209 7318.15 | 6183 6905.34
----------------> Break even range <------------------------
26000 1000 10000 | 7533 7954.33 | 6413 7955.66
27000 1000 10000 | 7959 8749.97 | 6748 9444.79
28000 1000 10000 | 8140 9302.36 | 7139 11617.75
29000 1000 10000 | 8501 10085.77 | 7638 14960.53
30000 1000 10000 | 8911 10946.80 | 8178 19584.24
32000 50 500 | 487 12265.89 | 486 94498.36
32050 50 500 | 486 12314.17 | 486 95832.03
32100 50 500 | 486 12389.38 | 488 108895.28
32150 50 500 | 492 12599.58 | 484 111742.39
32200 50 500 | 497 12792.36 | 491 124440.45
32250 50 500 | 490 12659.33 | 490 134499.09
32300 50 500 | 495 12843.78 | 489 145873.72
32350 50 500 | 495 12915.18 | 493 158940.66
32400 50 500 | 496 12988.64 | 494 183092.45
32450 50 500 | 492 12924.86 | 490 196135.28
32500 50 500 | 495 13043.85 | 488 223226.98
32550 50 500 | 498 13164.09 | 495 265431.10
32600 50 500 | 498 13222.56 | 495 326363.36
32650 50 500 | 498 13279.90 | 497 441002.09
32700 50 500 | 499 13368.66 | 499 656269.52
32751 1 10 | 10 12836.40 | 10 4031660.40
32752 1 10 | 10 12831.70 | 10 4620214.70
32753 1 10 | 10 12832.60 | 10 4188605.70
32754 1 10 | 10 12837.60 | 10 2972975.40
32755 1 10 | 10 12835.90 | 10 4434635.70
32756 1 10 | 10 12833.80 | 10 3892058.30
32757 1 10 | 10 12839.10 | 10 5002365.30
32758 1 10 | 10 12840.20 | 10 6332786.20
32759 1 10 | 10 12840.20 | 10 5269462.90
32760 1 10 | 10 14127.80 | 10 8234780.40
32761 1 10 | 10 14134.90 | 10 7558043.20
32762 1 10 | 10 14129.70 | 10 9117119.40
32763 1 10 | 10 14134.70 | 10 13895498.10
32764 1 10 | 10 14140.10 | 10 13608972.90
32765 1 10 | 10 15427.30 | 10 12930469.20
32766 1 10 | 10 16708.30 | 10 23576610.90
32767 1 10 | 10 17997.90 | 10 32603396.10

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

diff -urbN linux-2.5.7/kernel/fork.c linux-2.5.7-pid/kernel/fork.c
--- linux-2.5.7/kernel/fork.c   Mon Mar 18 15:37:05 2002
+++ linux-2.5.7-pid/kernel/fork.c       Fri Mar 22 16:38:29 2002
@@ -125,24 +125,111 @@
 /* Protects next_safe and last_pid. */
 spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
 
+/* this should be provided in every architecture */
+#ifndef SHIFT_PER_LONG
+#if BITS_PER_LONG == 64
+# define SHIFT_PER_LONG 6
+#elif BITS_PER_LONG == 32
+# define SHIFT_PER_LONG 5
+#else
+#error "SHIFT_PER_LONG"
+#endif
+#endif
+
+#define RESERVED_PIDS (300)
+#define GETPID_THRESHOLD (26000)        /* when to switch to a mapped algo */
+#define PID_MAP_SIZE (PID_MAX >> SHIFT_PER_LONG)
+static unsigned long pid_map[PID_MAP_SIZE];
+static int next_safe = PID_MAX;
+
+static inline void mark_pid(int pid)
+{
+        __set_bit(pid,pid_map);
+}
+
+static int get_pid_by_map(int lastpid)
+{
+        static int mark_and_sweep = 0;
+
+        int round = 0;
+        struct task_struct *p;
+        int i;
+        unsigned long mask;
+        int fpos;
+
+        if (mark_and_sweep) {
+repeat:
+                mark_and_sweep = 0;
+                memset(pid_map, 0, PID_MAP_SIZE * sizeof(unsigned long));
+                lastpid = RESERVED_PIDS;
+        }
+        for_each_task(p) {
+                mark_pid(p->pid);
+                mark_pid(p->pgrp);
+                mark_pid(p->tgid);
+                mark_pid(p->session);
+        }
+
+        /* find next free pid */
+        i = (lastpid >> SHIFT_PER_LONG);
+        mask = pid_map[i] | (((lastpid & (BITS_PER_LONG-1)) << 1) - 1);
+
+        while ((mask == ~0) && (++i < PID_MAP_SIZE))
+                mask = pid_map[i];
+
+        if (i == PID_MAP_SIZE) {
+                if (round == 0) {
+                        round = 1;
+                        goto repeat;
+                }
+                next_safe = RESERVED_PIDS;
+                mark_and_sweep = 1;     /* mark next time */
+                return 0;
+        }
+
+        fpos = ffz(mask);
+        i &= (PID_MAX-1);
+        lastpid = (i << SHIFT_PER_LONG) + fpos;
+
+        /* find next save pid */
+        mask &= ~((1 << fpos) - 1);
+
+        while ((mask == 0) && (++i < PID_MAP_SIZE))
+                mask = pid_map[i];
+
+        if (i==PID_MAP_SIZE)
+                next_safe = PID_MAX;
+        else
+                next_safe = (i << SHIFT_PER_LONG) + ffs(mask) - 1;
+        return lastpid;
+}
+
 static int get_pid(unsigned long flags)
 {
-        static int next_safe = PID_MAX;
         struct task_struct *p;
-        int pid;
+        int pid,beginpid;
 
         if (flags & CLONE_PID)
                 return current->pid;
 
         spin_lock(&lastpid_lock);
+        beginpid = last_pid;
         if((++last_pid) & 0xffff8000) {
-                last_pid = 300;         /* Skip daemons etc. */
+                last_pid = RESERVED_PIDS;       /* Skip daemons etc. */
                 goto inside;
         }
         if(last_pid >= next_safe) {
 inside:
                 next_safe = PID_MAX;
                 read_lock(&tasklist_lock);
+                if (nr_threads > GETPID_THRESHOLD) {
+                        last_pid = get_pid_by_map(last_pid);
+                        if (unlikely(last_pid == 0)) {
+                                last_pid = PID_MAX;
+                                goto nomorepids;
+                        }
+                } else {
         repeat:
                 for_each_task(p) {
                         if(p->pid == last_pid   ||
@@ -151,9 +238,11 @@
                            p->session == last_pid) {
                                 if(++last_pid >= next_safe) {
                                         if(last_pid & 0xffff8000)
-                                                last_pid = 300;
+                                                last_pid = RESERVED_PIDS;
                                         next_safe = PID_MAX;
                                 }
+                                if(unlikely(last_pid == beginpid))
+                                        goto nomorepids;
                                 goto repeat;
                         }
                         if(p->pid > last_pid && next_safe > p->pid)
@@ -162,6 +251,9 @@
                                 next_safe = p->pgrp;
                         if(p->session > last_pid && next_safe > p->session)
                                 next_safe = p->session;
+                        if(p->tgid > last_pid && next_safe > p->tgid)
+                                next_safe = p->tgid;
+                }
                 }
                 read_unlock(&tasklist_lock);
         }
@@ -169,6 +261,10 @@
         spin_unlock(&lastpid_lock);
 
         return pid;
+nomorepids:
+        read_unlock(&tasklist_lock);
+        spin_unlock(&lastpid_lock);
+        return 0;
 }
 
 static inline int dup_mmap(struct mm_struct * mm)