Tuesday, March 24, 2015

networking - Too many established connections left open




I have a (probably quite old) CentOS 4.5 server with a custom Java application running on it.



The application was crashing after running for a while. When it died, it was handling 1024 connections and trying to open one more socket.



Indeed, ulimit -n reports 1024, so the application is dying because it has run out of free file descriptors.



What bothers me is that there are hundreds of apparently inactive connections in an "ESTABLISHED off" state, all from a relatively small number of IPs (about 200), and that they accumulate over time as clients connect. This is a sample of what netstat -nato shows:



tcp        0      0 ::ffff:10.39.151.20:10000   ::ffff:78.152.97.98:12059   ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:10.39.151.20:10000   ::ffff:78.152.97.98:49179   ESTABLISHED off (0.00/0/0)
tcp        0      0 ::ffff:10.39.151.20:10000   ::ffff:78.152.97.42:45907   ESTABLISHED off (0.00/0/0)


I know it is not a DoS attack: the connections are legitimate, but they do not seem to close after the clients connect and do a short data exchange with the server. Furthermore, the pace is slow, since they are generated by about 200 clients (counting distinct IPs).



Should I look for some weird application bug (maybe in JRE 1.6), or dig into the CentOS network configuration? I have no clue what else to look at.



Thanks in advance, any hint is appreciated!


Answer



Hypothesis 1: your application is behind a firewall that drops idle TCP connections after a given amount of time.




When the client tries to use such a connection again, it finds it unresponsive, drops it and starts a new one.



Because the server's TCP connections have no keepalive timer (that is the "off" in the netstat output), the server has no way of knowing that a connection is dead, so it keeps it open indefinitely.



To verify: run a long tcpdump capture of a single connection and check that it goes quiet after a given amount of time.



Solution:





  • Change the code to enable keepalive on the TCP sockets and (optionally, for best performance) set the keepalive timer lower than the firewall's TCP idle timer.

  • Raise the firewall's TCP idle timer beyond the maximum functional idle time of the client. This will most likely be a global setting on the firewall, so your security administrator might be reluctant to change it.
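
As a rough illustration of the first option, here is a minimal sketch of turning keepalive on for every accepted socket. The class and method names are made up for the example, and the demo uses a loopback connection on an ephemeral port rather than the real port 10000. Note that java.net.Socket on JRE 1.6 only offers an on/off switch; the probe timing comes from the kernel (on Linux, net.ipv4.tcp_keepalive_time, which defaults to 7200 seconds, plus tcp_keepalive_intvl and tcp_keepalive_probes), so that is where the timer would have to be lowered below the firewall's idle timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not taken from the application in the question.
class KeepAliveAcceptor {

    // Accepts the next connection and enables SO_KEEPALIVE on it, so the
    // kernel eventually probes an idle peer and closes the socket if the
    // peer (or a firewall in between) no longer answers.
    static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
        Socket client = server.accept();
        client.setKeepAlive(true);
        return client;
    }

    // Small self-terminating demo over loopback.
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(0); // 0 = any free port
        Socket peer = new Socket(InetAddress.getByName("127.0.0.1"),
                                 server.getLocalPort());
        Socket accepted = acceptWithKeepAlive(server);
        try {
            System.out.println("keepalive enabled: " + accepted.getKeepAlive());
        } finally {
            // Always release the descriptors, even if something above fails.
            accepted.close();
            peer.close();
            server.close();
        }
    }
}
```

In the real server the setKeepAlive(true) call would go wherever the application currently calls accept(). It is also worth checking that every code path closes its socket in a finally block, since a missed close() leaks a file descriptor in exactly the same way.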

