dead mem walking ;-)

Dirk Wetter (dirkw@rentec.com)
Thu, 12 Jul 2001 17:43:21 -0400 (EDT)


Hi guys,

On Thu, 12 Jul 2001, Mike Galbraith wrote:

> > > Have you had a chance to try 2.4.7-pre-latest yet? I'd be interested
> > > in a small sample of vmstat 1 leading into heavy swap with >=pre5 if
> > > it is still a problem.
> >
> > i will definitely check it out and give a report, since in the test i
> > did yesterday, the *command* "vmstat 1" that i typed in behaved :)) more
> > like "vmstat 180", no kidding.
>
> Ok, you have some 'io bound' issues that need to be looked at. Present
> the data in that light please.

So here is the result of my testing. The scenario: vanilla kernel
2.4.6, configured with CONFIG_HIGHMEM4G=y, on a dual Intel box with 4GB
of memory. The machine was freshly booted before the test with profile=2.
(More detailed data is available at www.desy.de/~dirkw/linux-kernel/ )
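
For reference, the profiling workflow looks roughly like this (just a
sketch; the lilo.conf append syntax is an assumption, adjust for your
boot loader):

    # enable the kernel profiler at boot, e.g. in /etc/lilo.conf:
    #   append="profile=2"

    # zero the profile counters right before starting the test
    readprofile -r

    # afterwards, dump the hottest kernel functions
    readprofile | sort -nr | head -10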

A while before the jobs were submitted, I did "readprofile | sort -nr | head -10":
296497 total                   0.3442
295348 default_idle         5679.7692
   300 __rdtsc_delay          10.7143
   215 si_swapinfo             1.2500
   138 do_softirq              1.0147
   107 printk                  0.2816
    28 do_wp_page              0.0272
    17 schedule                0.0117
    10 tcp_get_info            0.0077
    10 filemap_nopage          0.0073

The same after I was able to kill the jobs (see below):

836552 total                   0.9710
458757 default_idle         8822.2500
361961 __get_swap_page       665.3695
  6629 si_swapinfo            38.5407
  1655 do_anonymous_page       5.3734
   760 file_read_actor         3.0645
   652 statm_pgd_range         1.6633
   592 do_softirq              4.3529
   498 skb_copy_bits           0.5845
   302 __rdtsc_delay          10.7857
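
Note __get_swap_page alone collecting ~360k hits, over 40% of the
total. To catch when it starts to dominate during the run rather than
only before/after, something like this could log snapshots once a
minute (untested sketch; assumes the box stays responsive enough to
run it):

    # append the top profile entries to a log once a minute
    while true; do
        date
        readprofile | sort -nr | head -5
        sleep 60
    done >> profile-log.txt 2>&1 &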

I also did a "vmstat 1" (sorry about the long lines):

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
[..]
0 0 0 0 4004896 3184 13600 0 0 0 0 183 60 0 6 94
0 0 0 0 4004896 3184 13600 0 0 0 0 153 88 0 3 97
0 0 0 0 4004892 3184 13604 0 0 0 0 109 63 0 3 97
Then, as the jobs were submitted:
0 1 0 0 4001772 3252 15084 0 0 224 56 2804 2806 10 7 83
0 0 0 0 3997784 3252 15852 0 0 4 8 2133 2117 14 8 78
0 1 0 0 3997744 3252 15896 0 0 0 0 3526 3487 7 9 84
0 1 0 0 3995872 3256 16700 0 0 324 24 1624 1732 9 5 86
1 0 0 0 3995900 3256 16876 0 0 0 24 2652 3648 13 8 79
1 0 1 0 3993068 3256 16908 0 0 4 28 1894 2703 33 14 54
[..]
2 0 0 0 2648168 4716 303564 0 0 0 24 5675 1718 70 29 1
[..]
2 0 0 0 1851568 4856 500804 0 0 456 0 265 148 87 10 3
[..] it became "idle?":
0 2 0 0 1023728 4944 661724 0 0 0 0 3944 3023 4 12 84
[..]
1 1 0 0 510344 4972 764132 0 0 0 0 3647 2601 6 21 74
[..]
0 2 0 0 251700 4972 817964 0 0 0 0 6440 4888 11 27 63
[..]
0 2 0 0 63300 4972 856792 0 0 0 0 6876 5836 10 20 70
[..]
2 0 0 0 5092 2912 377904 0 0 0 4 5514 4359 7 30 63
And now continuously:
0 2 0 0 5092 2912 328024 0 0 0 0 6660 5991 8 20 72
0 2 0 0 5092 2836 297044 0 0 0 0 7777 7083 9 21 70
0 2 0 292 6644 2808 274684 0 0 588 0 1026 992 71 15 14
0 2 1 708 5084 2784 269260 0 0 676 88 261 269 57 31 12
0 2 0 455800 5072 2756 871396 0 0 2920 0 423 579 40 44 17
2 0 0 1232696 5092 2696 1469092 0 0 0 0 3534 1076 32 63 5
2 0 1 1264360 5100 2672 1490184 0 0 0 0 4400 1286 72 28 0
3 0 0 1313796 6720 2632 1505580 0 0 4 0 7147 2554 54 41 5
4 0 2 1532260 5092 2652 1783788 0 0 0 76 7655 2998 51 43 5
Starting from here, the machine was comatose:
11 1 1 2126752 5216 2644 2312428 0 0 39 1 202 140 4 20 76
9 1 1 2153304 61684 2644 2282860 0 0 0 56 17401 188 0 100 0
The output frequency ;-) was now ~1 line every 2-5 minutes:
18 0 3 2179888 61084 2644 2310692 0 0 0 64 18828 610 0 100 0
6 0 2 2180032 58184 2644 2313340 0 0 0 88 3293 1194 14 85 0
20 0 2 2205848 55024 2644 2340136 0 0 0 8 18377 1085 0 100 0
15 0 2 2231488 54952 2644 2365796 0 0 16 12 16970 179 0 100 0
27 1 3 2256488 52756 2644 2391040 0 0 76 20 17124 328 0 100 0
25 1 2 2281296 53504 2644 2416300 0 0 0 95 17093 146 0 100 0
[..]
22 1 2 2422088 40496 2644 2568668 0 0 0 52 30513 236 0 100 0
19 1 1 2444300 39228 2644 2593716 0 0 0 152 17785 1051 0 100 0
13 1 1 2466092 38800 2644 2615504 0 0 0 44 14733 308 0 100 0
10 3 1 2508800 36820 2644 2659232 0 0 0 44 29436 563 0 100 0
After I managed to kill the jobs:
15 0 4 2609772 58396 2656 2778556 0 0 19 1 128 70 2 61 37
0 1 0 2602520 1248956 2672 2771456 0 0 160 0 2500 3795 1 24 75
0 1 0 2602520 1249340 2672 2771456 0 0 0 0 1930 3108 3 6 92
0 1 0 2602520 1249572 2676 2771488 0 0 36 20 660 1022 3 14 83
0 1 0 2602520 1249572 2676 2771488 0 0 0 0 109 56 0 4 96
0 1 0 2602520 1249560 2688 2771488 0 0 12 0 141 69 0 7 93
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id

The strange thing is that si/so stayed 0 the whole time, also in the
lines I omitted, but I guess that only appeared to be the case ;-).
The system was 100% in kernel mode, and the load was through the ceiling:

2:05pm up 42 min, 3 users, load average: 21.54, 9.50, 3.69
68 processes: 63 sleeping, 4 running, 1 zombie, 0 stopped
CPU0 states: 54.2% user, 36.1% system, 53.4% nice, 9.2% idle
CPU1 states: 45.2% user, 47.0% system, 45.3% nice, 6.4% idle
Mem: 4057236K av, 4049964K used, 7272K free, 0K shrd, 2632K buff
Swap: 14337736K av, 1313796K used, 13023940K free 1505452K cached
PID PPID USER PRI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMA
1377 1213 usersid 18 1854M 636M 1.2G 568 0M R N 70.9 30.7 2:18 ceqsim
1387 1284 usersid 19 1854M 636M 1.2G 568 0M R N 70.7 30.7 2:17 ceqsim
1213 1188 usersid 9 8124 2512 5612 732 1K S 0.0 0.1 0:02 sfmbB2
1284 1259 usersid 9 8124 2512 5612 732 1K S 0.0 0.1 0:02 sfmbC2

2:18pm up 36 min, 2 users, load average: 24.09, 20.61, 12.52
81 processes: 70 sleeping, 9 running, 2 zombie, 0 stopped
CPU0 states: 0.1% user, 100.0% system, 0.0% nice, 0.-1% idle
CPU1 states: 1.9% user, 99.0% system, 0.5% nice, 0.-1% idle
Mem: 4057240K av, 4052876K used, 4364K free, 0K shrd, 332K buff
Swap: 14337736K av, 3334732K used, 11003004K free 3370160K cached
PID PPID USER PRI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMA
1186 1109 usersid 15 1936M 1.4G 526M 33736 0M R N 23.3 13.2 8:27 ceqsim
1054 975 usersid 20 1936M 1.7G 157M 20292 36K R N 18.9 3.9 5:18 ceqsim
229 1 root 9 1568 0 1568 1352 54 S 0.0 0.0 0:00 xntpd
543 1 root 20 780 48 732 708 42 R 2.5 0.0 0:01 pbs_mo

2:54pm up 1:32, 2 users, load average: 25.37, 25.03, 24.13
66 processes: 54 sleeping, 12 running, 0 zombie, 0 stopped
CPU0 states: 0.3% user, 99.216% system, 0.47% nice, 0.1% idle
CPU1 states: 0.0% user, 99.265% system, 0.3% nice, 0.-1% idle
Mem: 4057236K av, 4025056K used, 32180K free, 0K shrd, 2644K buff
Swap: 14337736K av, 2550220K used, 11787516K free 2704768K cached
PID PPID USER PRI SIZE SWAP RSS SHARE D STAT %CPU %MEM TIME COMMA
1377 1213 usersid 15 1859M 1.2G 643M 568 0M R N 0.1 16.2 23:25 ceqsim
1387 1284 usersid 15 1859M 1.2G 622M 568 0M R N 49.7 15.7 25:13 ceqsim
1213 1188 usersid 9 8124 5308 2816 732 521 S 0.0 0.0 0:02 sfmbB2
1284 1259 usersid 9 8124 5308 2816 732 521 S 0.0 0.0 0:02 sfmbC2
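
Coming back to the si/so oddity above: a way to cross-check the raw
counters vmstat derives si/so from would be to read them straight out
of /proc (a sketch; if I remember correctly, the "swap" line in 2.4's
/proc/stat is the pages-swapped-in/out pair vmstat uses):

    # raw swap-in / swap-out page counters (2.4 /proc/stat)
    grep '^swap' /proc/stat

    # swap usage as the kernel itself reports it
    grep '^Swap' /proc/meminfo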

OK. If I can do more to resolve this problem, let me know what I
should do and I'll run further tests.

thanks,
~dirkw
