Re: Linux 2.2.16 through 2.2.18preX TCP hang bug triggered by rsync

kuznet@ms2.inr.ac.ru
Thu, 25 Jan 2001 21:35:48 +0300 (MSK)


Hello!

I take my words back. Manfred is right, this requirement is not a MUST.

The real problem is much worse, and the shame is wholly on Solaris.
Tcpdump shows at least two different bugs there.

2060 16:31:42.879337 eth0 < dynamic.ih.lucent.com.39406 > static.8664: . 67580:67580(0) ack 1582261 win 1460 (DF)
2061 16:31:42.907940 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583721:1583721(0) ack 67580 win 1460 (DF)

All is OK so far. Solaris's state should be:

SND.NXT=SND.UNA=67580
SND.WND=1460
RCV.NXT=1582261

2062 16:31:42.908620 eth0 < dynamic.ih.lucent.com.39406 > static.8664: . 67580:67581(1) ack 1583721 win 0 (DF)

Solaris sends one byte.

SND.NXT++
RCV.NXT=1583721

2063 16:31:43.098761 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583721:1583721(0) ack 67581 win 1460 (DF)

We ACK it.

2064 16:31:43.100993 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 67581:68456(875) ack 1583721 win 0 (DF)
2065 16:31:43.101524 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 68456:69041(585) ack 1583721 win 0 (DF)

Solaris sends two segments, filling all the window.

SND.NXT=69041
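
The arithmetic behind "filling all the window" can be checked with a small sketch. The function name usable_window is made up for illustration, sequence number wraparound is ignored, and the values are taken from packets 2063 to 2065 above.

```python
def usable_window(snd_una: int, snd_nxt: int, snd_wnd: int) -> int:
    """Bytes the sender may still transmit: (SND.UNA + SND.WND) - SND.NXT.

    Illustrative helper only; ignores 32-bit sequence wraparound.
    """
    return snd_una + snd_wnd - snd_nxt


# Sender state after packet 2063 (ack 67581, win 1460).
snd_una, snd_wnd = 67581, 1460
snd_nxt = 67581

# Packets 2064 and 2065 carry 875 and 585 bytes.
for length in (875, 585):
    # Each segment must fit into what is left of the offered window.
    assert length <= usable_window(snd_una, snd_nxt, snd_wnd)
    snd_nxt += length

print(snd_nxt, usable_window(snd_una, snd_nxt, snd_wnd))  # prints "69041 0"
```

The right edge of the window is 67581 + 1460 = 69041, which is exactly where SND.NXT ends up, so nothing more may be sent until the window opens.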

2066 16:31:43.108759 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583720:1583720(0) ack 69041 win 0 (DF)

We send a zero-window probe. SEG.SEQ=1583720.

Solaris accepts the ACK from it!!! (bug #1) But it does not accept the window.

So, now it thinks that SND.UNA=SND.NXT=69041
SND.WND=1460

State is corrupted.
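
Why taking the ACK without the window corrupts the state can be shown with a toy model. This is only a sketch under assumptions: the class SendState and the three apply_* helpers are hypothetical names, not Solaris or Linux code, and sequence wraparound is ignored. The numbers are those of packet 2066.

```python
from dataclasses import dataclass


@dataclass
class SendState:
    """Toy model of a TCP sender's window variables (hypothetical)."""
    snd_una: int
    snd_nxt: int
    snd_wnd: int

    def usable(self) -> int:
        # Bytes the sender believes it may still send.
        return max(0, self.snd_una + self.snd_wnd - self.snd_nxt)


def apply_ack_and_window(s: SendState, ack: int, win: int) -> None:
    # Consistent option 1: take both fields from the segment.
    s.snd_una = max(s.snd_una, ack)
    s.snd_wnd = win


def apply_nothing(s: SendState, ack: int, win: int) -> None:
    # Consistent option 2: reject the segment entirely.
    pass


def apply_ack_only(s: SendState, ack: int, win: int) -> None:
    # Behaviour seen in the trace: ACK accepted, window ignored.
    s.snd_una = max(s.snd_una, ack)


results = {}
for name, handler in [("both", apply_ack_and_window),
                      ("none", apply_nothing),
                      ("ack_only", apply_ack_only)]:
    # State before packet 2066: everything sent, nothing yet acked.
    state = SendState(snd_una=67581, snd_nxt=69041, snd_wnd=1460)
    handler(state, ack=69041, win=0)  # packet 2066: ack 69041 win 0
    results[name] = state.usable()

print(results)  # {'both': 0, 'none': 0, 'ack_only': 1460}
```

Either consistent choice leaves the sender with nothing to send. Only the mixed one opens 1460 bytes that the receiver never offered.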

This is a serious bug, but it is still not fatal. Actually, such corruptions
(though for different reasons) are common in stacks that borrowed code
from BSD. See tcp-impl, Subj: "Send window update algorithm ..."
They are recoverable, provided the stack is sane.

2067 16:31:43.110623 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 69041:69628(587) ack 1583721 win 0 (DF)

Solaris sends some crap outside the window because of the corrupted state.
No problem.

2068 16:31:43.110679 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583721:1583721(0) ack 69041 win 0 (DF)

We tell "No pasaran", of course.
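
For reference, the segment acceptability test from RFC 793 (section 3.3) can be sketched as below and applied to the packets above. The function name segment_acceptable is an assumption for illustration, wraparound is ignored, and the receive window on each side is taken to be the window that side last advertised in the trace.

```python
def segment_acceptable(seg_seq: int, seg_len: int,
                       rcv_nxt: int, rcv_wnd: int) -> bool:
    """RFC 793 acceptability test (four cases), without wraparound."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return rcv_nxt <= seg_seq < rcv_nxt + rcv_wnd
    if rcv_wnd == 0:
        # Data can never be accepted into a zero window.
        return False
    last = seg_seq + seg_len - 1
    in_window = lambda x: rcv_nxt <= x < rcv_nxt + rcv_wnd
    return in_window(seg_seq) or in_window(last)


# Packet 2067 as seen by Linux: RCV.NXT=69041, advertised window 0.
pkt_2067 = segment_acceptable(69041, 587, rcv_nxt=69041, rcv_wnd=0)

# Packet 2066 as seen by Solaris: RCV.NXT=1583721, advertised window 0.
pkt_2066 = segment_acceptable(1583720, 0, rcv_nxt=1583721, rcv_wnd=0)

# Packet 2068 as seen by Solaris: same state, SEG.SEQ equals RCV.NXT.
pkt_2068 = segment_acceptable(1583721, 0, rcv_nxt=1583721, rcv_wnd=0)

print(pkt_2067, pkt_2066, pkt_2068)  # False False True
```

So the 587 bytes in packet 2067 fail the test on our side and are answered with a bare ACK, while packet 2068, unlike the probe in 2066, sits exactly at RCV.NXT.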

According to the rules, Solaris must shrink its window now.
This is the only way to recover from the corrupted state.

2069 16:31:43.111641 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 69628:70501(873) ack 1583721 win 0 (DF)

It does not. And this is the point after which recovery is impossible.
Fatal bug #2.

To sum up: it is impossible to fix this from the Linux side.
We could accept ACK&WIN from out-of-window segments, and this
would help in this case _occasionally_. But Solaris is still
doomed to lock up randomly with such sawdust in its head.

Alexey