SIGRT_1 not being delivered on pthread termination

Andreas Steinmetz (ast@domdv.de)
Thu, 18 Oct 2001 14:18:19 +0200 (CEST)


Beware: even as this looks like a glibc problem on the first glance it boils
down to clone() and no signal being delivered on process exit. Nevertheless I'm
cross posting this as it still may be a glibc/pthreads problem.

General: The problem has been around since at least 2.4.9 and up to at
least 2.4.13pre2. System data: PIII850/512MB/glibc-2.2.4

pdnsd is a threaded lightweight small network nameserver
(http://home.t-online.de/home/Moestl/). It does start one detached thread per
dns query. As pdnsd manages a cache each query thread may terminate fast.
Unfortunately the terminated threads are not reaped and it looks like a missing
SIGRT_1 signal delivery causing this problem. Details below.

pdnsd always starts threads the same way (detached, sample below).

pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);
pthread_create(&pt,&attr,udp_answer_thread,(void *)buf);

This is how things should look like. pid 15446 is the pthreads manager thread.

root@hades:~ > ps ax | grep pdnsd
15445 ? S 0:00 /usr/sbin/pdnsd
15446 ? S 0:00 /usr/sbin/pdnsd
15448 ? S 0:00 /usr/sbin/pdnsd
15449 ? S 0:00 /usr/sbin/pdnsd
15450 ? S 0:00 /usr/sbin/pdnsd

A few lookups later and the first zombies are hanging around.

root@hades:~ > ps ax | grep pdnsd
15445 ? S 0:00 /usr/sbin/pdnsd
15446 ? S 0:00 /usr/sbin/pdnsd
15448 ? S 0:00 /usr/sbin/pdnsd
15449 ? S 0:00 /usr/sbin/pdnsd
15450 ? S 0:00 /usr/sbin/pdnsd
17224 ? Z 0:00 [pdnsd <defunct>]
17228 ? Z 0:00 [pdnsd <defunct>]
17229 ? Z 0:00 [pdnsd <defunct>]

Tracing the manager thread now shows nothing unusual.

root@hades:~ > strace -p 15446
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll( <unfinished ...>

Process list after ptrace.

root@hades:~ > ps ax | grep pdnsd
15445 ? S 0:00 /usr/sbin/pdnsd
15448 ? S 0:00 /usr/sbin/pdnsd
15449 ? S 0:00 /usr/sbin/pdnsd
15450 ? S 0:00 /usr/sbin/pdnsd
17224 ? Z 0:00 [pdnsd <defunct>]
17228 ? Z 0:00 [pdnsd <defunct>]
17229 ? Z 0:00 [pdnsd <defunct>]
15446 ? S 0:00 /usr/sbin/pdnsd

Same in detail.

root@hades:~ > ps axlw | grep pdnsd
040 65534 15445 1 9 0 14780 1584 rt_sig S ? 0:00
/usr/sbin/pdnsd
040 65534 15448 15446 9 0 14780 1584 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15449 15446 9 0 14780 1584 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15450 15446 9 0 14780 1584 wait_f S ? 0:00
/usr/sbin/pdnsd
044 65534 17224 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
044 65534 17228 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
044 65534 17229 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
040 65534 15446 15445 9 0 14780 1584 do_pol S ? 0:00
/usr/sbin/pdnsd

Trace of the manager thread and issuing of a new dns query.

root@hades:~ > strace -p 15446
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, 2000) = 1
getppid() = 15445
read(7, "\340\373?\277\0\0\0\0\330\372?\277\200\22\5\10\270\215"..., 148) = 148
old_mmap(0xbe800000, 2097152, PROT_READ|PROT_WR
ITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbe800000
mprotect(0xbe800000, 4096, PROT_NONE) = 0
clone(child_stack=0xbe9ffbd8,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|0x21) = 17287
kill(15450, SIGRT_0) = 0
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll( <unfinished ...>

Now the thread handling the dns query aleady finished when strace was aborted.
New process listing showing even more zombies below.

root@hades:~ > ps axlw | grep pdnsd
040 65534 15445 1 9 0 18876 1600 rt_sig S ? 0:00
/usr/sbin/pdnsd
040 65534 15448 15446 9 0 18876 1600 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15449 15446 9 0 18876 1600 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15450 15446 9 0 18876 1600 wait_f S ? 0:00
/usr/sbin/pdnsd
044 65534 17224 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
044 65534 17228 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
044 65534 17229 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
044 65534 17268 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
040 65534 15446 15445 9 0 18876 1600 do_pol S ? 0:00
/usr/sbin/pdnsd
044 65534 17287 15446 9 0 0 0 do_exi Z ? 0:00 [pdnsd
<defunct>]
000 1000 17291 16783 13 0 1624 492 - R pts/6 0:00 grep pdnsd

A few queries later another strace shows a 'spurious' delivery of SIGRT_1 as a
thread terminates and the thread manager happily reaps all dead childs.

root@hades:~ > strace -p 15446
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN}], 1, 2000) = 0
getppid() = 15445
poll([{fd=7, events=POLLIN, revents=POLLIN}], 1, 2000) = 1
getppid() = 15445
read(7, "\340\373?\277\0\0\0\0\330\372?\277\200\22\5\10\220E\17"..., 148) = 148
old_mmap(0xbe600000, 2097152, PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbe600000
mprotect(0xbe600000, 4096, PROT_NONE) = 0
clone(child_stack=0xbe7ffbd8,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|0x21) = 17293
--- SIGRT_1 (Real-time signal 1) ---

A final process list shows that all zombies are gone.

root@hades:~ > ps axlw | grep pdnsd
040 65534 15445 1 9 0 8636 1560 rt_sig S ? 0:00
/usr/sbin/pdnsd
040 65534 15448 15446 9 0 8636 1560 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15449 15446 9 0 8636 1560 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15450 15446 9 0 8636 1560 wait_f S ? 0:00
/usr/sbin/pdnsd
040 65534 15446 15445 8 0 8636 1560 do_pol S ? 0:00
/usr/sbin/pdnsd

The above details clearly indicate a missing signal delivery to the manager
thread when a thread terminates. The problem seems to appear the more often the
longer the system is up. It may easily reach the point of all possible thread
slots being filled by zombies.

Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/