Bug report: tcp staled when send-q != 0, timers == 0.

Eugene B. Berdnikov (berd@elf.ihep.su)
Mon, 9 Apr 2001 18:43:38 +0400


Hi all.

In brief: a stale state of the tcp send queue was observed for 2.2.17
while send-q counter and connection window sizes are not zero:

% netstat -n -eot | grep 1018
tcp 0 13064 194.190.166.31:22 194.190.161.106:1018
ESTABLISHED 0 11964 off (0.00/0/0)

Host 194.190.166.31: 2.2.17 on PPro-180 (HP NetServer E40),
compiled with CONFIG_M686=y and running sshd1 server.

SYMPTOMS

1. When data is sent to 194.190.166.31, it is ack'ed with non-zero
window size, but zero data block is returned (whereas send-q!=0).
Strace shows, that server gets data via read(2) and replies
with write(2) to the same descriptor. The send-q counter is
encremented by the number of bytes sent, but timers remain zero.
No data is returned to the network.

2. When data is is sent to the controlling pty, sshd also steps
through the read(2) and write(2), with the same result
(send-q is encremented, timers do not change, no traffic).

3. Keepalive is enabled in ssh client and server (on both sides).
The curious thing is that every keepalive interval (2h as default)
host 194.190.166.31 sends exactly one packet with ethernet
MTU size:

17:50:30.391347 > 194.190.166.31.ssh > 194.190.161.106.1018:
P 1:1449(1448) ack 1 win 32640
<nop,nop,timestamp 72137447 1733874370> (DF) [tos 0x10]
17:50:31.102567 < 194.190.161.106.1018 > 194.190.166.31.ssh:
. 1:1(0) ack 1449 win 32120
<nop,nop,timestamp 1734601828 72137447> (DF) [tos 0x10]

The value send-q is properly decremented every such transmission.

4. This connection was traced for a long time, and when send-q counter
reaches zero (due to "keepalive" exchange), it gets out from its
stale state. Now it behaves as an normal connection.
I still keep it for investigation (if any:).

INFO

Here is some supplementary information on the network configuration for
this machine. I belive it is not directly concerned to the bug discussed,
but put here for the completeness.

Host has Intel EtherExpress-100 ethernet card (standard driver from 2.2.17)
and SkyMedia-200 DVB sattelite receiver, running driver "sm200_lnx"
from Telemann. Staled connection passes over ethernet only.

# lsmod
Module Size Used by
sm200_lnx 18800 1
eepro100 16180 1 (autoclean)

# ip -s l l
1: lo: <LOOPBACK,UP> mtu 3924 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
RX: bytes packets errors dropped overrun mcast
349139839 1395237 0 0 0 0
TX: bytes packets errors dropped carrier collsns
349139839 1395237 0 0 0 0
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
link/ether 00:a0:c9:9e:c9:7d brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
2338597780 7637867 0 0 0 0
TX: bytes packets errors dropped carrier collsns
3016099313 7378871 0 0 0 293385
58: sm200: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
link/ether 00:90:bc:01:1a:da brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
603326460 896671 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0

# ip -s r l
192.168.30.31 dev eth0 scope link
194.190.166.31 dev eth0 scope link src 194.190.166.31
192.168.30.0/24 dev eth0 proto kernel scope link src 192.168.30.31
127.0.0.0/8 dev lo scope link
default via 192.168.30.34 dev eth0 src 194.190.166.31

Here is the line from /proc/net/tcp for this connection when it was stale:

128: 1FA6BEC2:0016 6AA1BEC2:03FA 01 00003394:00000000 00:00000000 00000000
0 0 11964

That's all that I consider as interesting.

I was told by ANK, that it might be helpfull to find a socket in the memory
and dump it contents, while it was staled. However, the time was lost...

If anybody tell me which additional information can be extracted for the
diagnostic, I'll try to get it. In any case, I plan to run something through
this connection in hope to reproduce this state again.

-- 
 Eugene Berdnikov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/