I recently ran into a puzzling issue with a web framework that was failing under a load test. The web framework was front-ended by a pair of Cisco ACE 4710 Application Control Engines (load balancers) using a single IP address in a SNAT pool. The Cisco ACE 4710 was the initial suspect, but because the test started failing at almost the same point every time, a quick analysis suggested TCP port exhaustion, and the problem turned out to be on the web application tier rather than on the ACE.

The load test was hitting the site so hard and so fast that it cycled through all of the roughly 64,000 possible TCP source ports before the web server had freed up the port from the previous request on that same port. Because every connection to a given web server came from the same SNAT address, the source port was the only thing distinguishing one connection from the next. The ports were still in the TIME_WAIT state on the web server even though the Cisco ACE 4710 had sent a FIN asking for the connection to be closed. Thinking the port was available, the Cisco ACE 4710 attempted a second connection from that port, which failed because the web application tier still held the earlier connection in TIME_WAIT and hadn’t released it.

While the Linux system administrators attempted to tune their web application tier, we still had issues with TCP ports overlapping between requests, so the interim solution was to add 4 more IP addresses to the SNAT pool on the Cisco ACE 4710. This way we’d need to go through 5 * 64,000 TCP ports before cycling back through them.
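The arithmetic behind the SNAT pool workaround can be sketched in a few lines of Python. This is only a rough estimate, not a measurement from the load test: the 64,000 ports per address is the approximate figure used above, the 60-second TIME_WAIT duration is an assumption (it is the usual Linux default, but the post doesn’t state what the web tier was using), and the function names are made up for illustration.

```python
# Back-of-the-envelope estimate of how quickly a SNAT pool can open new
# connections to a single server IP:port before it has to reuse a source
# port that may still be sitting in TIME_WAIT on the server.

PORTS_PER_SNAT_IP = 64_000  # approximate figure used in the post
TIME_WAIT_SECONDS = 60      # assumption: typical Linux TIME_WAIT duration


def max_new_connections_per_second(snat_ips,
                                   ports_per_ip=PORTS_PER_SNAT_IP,
                                   time_wait=TIME_WAIT_SECONDS):
    """Highest sustained rate that never reuses a port within time_wait seconds."""
    return snat_ips * ports_per_ip / time_wait


def seconds_until_port_reuse(snat_ips, connections_per_second,
                             ports_per_ip=PORTS_PER_SNAT_IP):
    """How long it takes to cycle through every (SNAT IP, port) pair once."""
    return snat_ips * ports_per_ip / connections_per_second


if __name__ == "__main__":
    for ips in (1, 5):
        rate = max_new_connections_per_second(ips)
        print(f"{ips} SNAT IP(s): ~{rate:,.0f} new connections/s before port reuse")
```

Under these assumptions, going from 1 to 5 SNAT addresses raises the ceiling by a factor of five. A load test that stays under that rate should no longer wrap around to a port the server still holds in TIME_WAIT.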
References:
LogNormal – http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/
Cheers!
Image Credit: Jaylopez
Vincent Bernat says
The article that you link has some inaccuracies about the TIME_WAIT state and how to handle it. Your solution (using more IP addresses) is correct. See this article for more details: http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html
Michael McNamara says
Thanks for the comment Vincent.
That’s a very detailed post of your own… thanks for sharing!