2002.05.13 - Well, I guess I never created a readme for this stuff.
It's a little past time now but here goes. ;)

----- Executive summary -----

At some point around Linux kernel version 2.1.xx (I dunno xx), they
changed the kernel's default tcp keepalive timeout from 300 seconds to
7200 seconds. The problem was that this was longer than the
ip_masquerade timeout (and probably there were some other related
problems) so programs that were sensitive to the source port started
crashing. This impacted openssh, for one.

I tried to create a really simple solution by making openssh send its
own keepalives. My patch went through several revisions and worked
pretty well, but it had a tendency to deplete the local entropy pool
quickly (the worst time to draw entropy from /dev/urandom is when
nobody's typing!) so I never really pushed for inclusion into openssh.
Eventually, I started to recommend to people that they just change
their keepalive timeouts back to 300 seconds, if they had root access,
despite how ``impolite'' it is over the internet.

After a goodly while, the openssh folks actually implemented keepalives
in a way that made them happy, so the problem is now moot. :)

----- My initial e-mail to the openssh developers -----

The attached patch adds an option (off by default to preserve current
behavior) to set a timeout on the select() statement that waits for
input in clientloop.c. This fixes a timeout issue for me (explained
below) and probably also fixes the timeouts mentioned in last month's
thread "Idle time out". The patch is also available by http from:

  http://www.chaos2.org/~jacob/code/patch-openssh-1.2.2-trans_inter

I am ssh-ing from a machine on my home network to one on the internet.
This goes out over a Linux ip_masquerade firewall. When I wrote the
attached patch, I thought it was the firewall that was killing the
connection by timing out on the redirected port due to lack of traffic.
But after reading some similar posts on this list, I think there might
be problems even if a firewall isn't involved. Also note that in the
tcpdump below, I did have KeepAlive turned on (both server and client)
and yet I don't see any traffic being generated due to this, which
seems to render KeepAlive pretty useless...

When ssh dies on me (when no max idle time is set) it gives me the
error below:

"
velius:~% Read from remote host velius.chaos2.org: Connection reset by peer
Connection to velius.chaos2.org closed.
jacob:~#
"

From the tcpdump below, we see that the firewall has assigned a new
ip_masq port. This shows all the packets; specifically, none are
generated in the interim.

"
00:59:19.987703 velius.chaos2.org.ssh > c392100-a.crvlls1.or.home.com.64579: P 1:21(20) ack 20 win 32120 (DF)
00:59:19.998389 c392100-a.crvlls1.or.home.com.64579 > velius.chaos2.org.ssh: . ack 21 win 32120 (DF) [tos 0x10]
... time passes here but no traffic to velius ...
01:20:37.477884 c392100-a.crvlls1.or.home.com.64687 > velius.chaos2.org.ssh: P 2954940853:2954940873(20) ack 2970631452 win 32120 (DF) [tos 0x10]
01:20:37.583097 velius.chaos2.org.ssh > c392100-a.crvlls1.or.home.com.64687: R 2970631452:2970631452(0) win 0 [tos 0x10]
"

The attached patch allows the user to put a TransmitInterlude option in
their ssh_config file that gives how many seconds are allowed to pass
without generating traffic. A value of 300 completely solves the
timeouts for me and I haven't observed any stability issues.

Please cc me with comments as I am not subscribed to the list.

Jacob Lundberg

----- My reply to Damien Miller's response -----

> I would first rather get to the bottom of figuring out why keepalives
> aren't working.

Which brings a question to mind. I haven't really programmed with
keepalives before. I presume they're a field in the tcp frames (as
opposed to a periodic empty frame)?

> Is "KeepAlive yes" set for both client and server?

Yes. To no avail.
Both are Linux 2.2.14 boxen, but I have been seeing this problem since
I switched to 2.1.x (client with server still 2.0.x). I still see it
with the not-so-open ssh suite as well, both 1.x and 2.x. I tried
kernel 2.3.42 and was still seeing it there too.

> Is /proc/sys/net/ipv4/tcp_keepalive_time set

It is set to 7200.

> to less than the masquerading timeouts?

I checked after reading the recent list entries and actually I see the
exact same behavior when I run ssh out from the firewall (thus
bypassing the ip_masq). So while the ip_masq is an issue, it is
orthogonal to the problem and could be resolved (as you say) by setting
the keepalives to less than the ip_masq timeout. For some reason,
keepalives aren't sufficient to keep some connections alive right now.

On a side note, Di Zhao asked if I should have implemented a server
version of the patch as well. I rather felt that (from what I've seen)
the problem is a bit too infrequent for that (let the users turn it on
if they discover they need it)... But I suppose it does leave people
using different clients out in the cold. Any preference there?

-Jacob

----- Me again, a bit later -----

> I would first rather get to the bottom of figuring out why keepalives
> aren't working.

Ok. I've played around some and now understand keepalives a bit better.
So that 7200 setting would be two hours, which is rather long. And in
fact it turns out that setting it to 300 solves the problem for me. But
it is of note that 7200 is the _default_ value. And also I'm still not
sure why a setting of 7200 (both server and client) would break things.
(Also finally I see some keepalive packets going by so now I understand
much better what they are.)

> /proc/sys/net/ipv4/tcp_keepalive_time set
> to less than the masquerading timeouts?

The question remains here: what if you can't get your sysadmin to go
tweaking with the kernel default keepalive of 7200 seconds? Do we just
say to such a person, "too bad!"
or do we let them send packets on their own to keep the connection
alive?

Unless I misunderstand, keepalive default is set here:

  /usr/src/linux/include/net/tcp.h line 264 (Linux 2.2.14)
  #define TCP_KEEPALIVE_TIME (120*60*HZ) /* two hours */

I know it could be construed as bloat, which is why my patch didn't
include a commandline option. I think the option itself is useful for
the purpose of empowering the user (in a non-security threatening
way ;).

----- Markus Friedl helped me out next and I replied: -----

> the patch looks reasonable, but SSH_MSG_NONE type packets
> must not travel over the wire. this violates the protocol spec.

Ok. Corrected patch attached. :) As before, it's also here:

  http://www.chaos2.org/~jacob/code/patch-openssh-1.2.2-trans_inter-r1

> SSH_MSG_IGNORE should be used, e.g.:
>   packet_start(SSH_MSG_IGNORE);
>   packet_put_string("bla", 3);
>   packet_send();

I wondered if I needed to stuff them with something. Open sshd didn't
seem to mind if they were empty, but closed sshd terminated the
connection. Thanks for showing me how...

----- Sean Aaron Lisse asked and received -----

> I recommend sending some random characters instead of a constant string
> like "bla".

Done. I used random strings with a maximum length of 256 chars. Sound
good to everybody? And the patch is also at:

  http://www.chaos2.org/~jacob/code/patch-openssh-1.2.2-trans_inter-r2

----- Nikhil asked about my patch much later and I elaborated -----

> For example an idle ssh session that does:
> gagne:/home/nikhil# iRead from remote host gagne.nols.com: Connection reset by
> peer
> Connection to gagne.nols.com closed.
> $

The most common set of circumstances that seems to cause this is when
you are running Linux >= 2.1.x and ssh'ing beyond your localnet. A
decision was made to change the default TCP keepalive timeout for Linux
to 7200 seconds, which is great for people using intermittent ppp
connections but not always great for people with stable network
connections.
In my particular case, this means that my firewall times out on the TCP
connection and closes it. When I next generate network traffic on that
connection, it will have a different source port on the firewall.
Presumably, sshd on the other end considers this a breach of security
and closes the connection. I think that the change to a longer
keepalive interval should have been better documented since the
problems it causes are often mysterious and hard to track down.

> Your code is for a rather old version of openssh, I was just wondering if you
> had any plans or what should I do to fix that problem with the newer versions
> of openssh.

That patch basically used ssh to create the keepalives instead of
leaving it to Linux. Thus you could control the keepalive interval
without needing root on the system you were on. However, I stopped
maintaining it for three reasons. First, the openssh people indicated
that it was unlikely they would ever merge it. Secondly, there were a
lot of changes to the code for generating random strings. Thirdly, I
discovered that with my patch, multiple ssh clients could get in fights
over access to the system entropy pool and die much like before (a
fault of some sort in the openssh entropy gathering code).

I will have to take a look at 2.1.1p1 and see if my patch would port
easily and still work. In the meantime, if you can get root on the
machine you're running the client on, try doing this:

  echo "300\c" > /proc/sys/net/ipv4/tcp_keepalive_time

Linux 2.2.x seems to require that there be no newline; hence the \c.
Unfortunately, different shells and echo implementations interpret
escapes quite differently, so you might need one of these forms
instead:

  /bin/echo "300\c"
  /bin/echo -e "300\c"

Then make sure the kernel accepted it (should get 300):

  cat /proc/sys/net/ipv4/tcp_keepalive_time

> Could you please let me know and thanks again!

If the above doesn't work for you, let me know.

----- the end -----