I had an interesting problem this past week where a vendor tried to tell me that “no other customers were having issues”. How many times have you heard that line? The problem started with the application folks coming over to ask if there were any network issues. In a short discussion with them I learned that they had application interfaces that were taking 20-40 seconds to complete a transaction exchange, which was causing their transaction queues to back up and fall behind.
It’s fascinating now that I’m working in retail to follow the actual process flow from order entry to order fulfillment.
In any case, a few quick tests using ICMP pings didn’t show any problems. However, a subsequent packet trace performed from the server revealed a large number of TCP retransmissions and duplicate ACKs. It was pretty clear to me that we had significant packet loss between the two servers. The vendor, however, felt it was indicative of “application packet loss”. I’ve been in the networking field for quite a few years now… I’ve seen a lot and heard a lot, but I’ve never heard the phrase “application packet loss”. The vendor was suggesting that the application was causing the TCP retransmissions and duplicate ACKs and that the network was not to blame.
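For anyone curious what those two flags in a trace boil down to, here's a minimal sketch in Python. It is not how any real analyzer does it (the actual heuristics are more involved, and this ignores sequence number wraparound, SACK and timing). The `Segment` record, the `count_loss_signals` function and the sample trace are all made up for illustration. The idea is simply that a data segment covering bytes that were already sent counts as a retransmission, and a payload-free ACK repeating the previous ACK number counts as a duplicate ACK.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    direction: str  # hypothetical label, e.g. "A->B" or "B->A"
    seq: int        # TCP sequence number of the first payload byte
    ack: int        # TCP acknowledgment number
    length: int     # payload bytes (0 for a pure ACK)


def count_loss_signals(segments):
    """Return (retransmissions, duplicate_acks) for a simplified trace."""
    highest_sent = {}  # direction -> highest (seq + length) seen so far
    last_ack = {}      # direction -> ACK number of the previous pure ACK
    retransmissions = 0
    duplicate_acks = 0
    for s in segments:
        if s.length > 0:
            end = s.seq + s.length
            # Data that ends at or below what was already sent is a resend.
            if s.direction in highest_sent and end <= highest_sent[s.direction]:
                retransmissions += 1
            else:
                highest_sent[s.direction] = end
        else:
            # A pure ACK repeating the previous ACK number is a duplicate.
            if last_ack.get(s.direction) == s.ack:
                duplicate_acks += 1
            last_ack[s.direction] = s.ack
    return retransmissions, duplicate_acks


# Invented trace, captured on the sender: the segment at seq 101 is lost
# in transit, so the receiver keeps ACKing 101 until it is resent.
sample_trace = [
    Segment("A->B", 1, 1, 100),
    Segment("B->A", 1, 101, 0),
    Segment("A->B", 101, 1, 100),  # lost somewhere on the path
    Segment("A->B", 201, 1, 100),
    Segment("B->A", 1, 101, 0),    # duplicate ACK
    Segment("A->B", 301, 1, 100),
    Segment("B->A", 1, 101, 0),    # duplicate ACK
    Segment("A->B", 101, 1, 100),  # retransmission
    Segment("B->A", 1, 401, 0),
]

retransmissions, duplicate_acks = count_loss_signals(sample_trace)
```

Nothing in there depends on what the application is doing. Both signals come straight from the TCP sequence and acknowledgment numbers, which is why I read them as packets going missing on the path.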
In classic fashion I politely called bullshit. OK, maybe I was a little more forceful than that.
It was the TCP retransmissions that were causing the slowdown in the transaction exchanges. The packets were being retransmitted because they were being lost somewhere between the two servers. I could see an 8-second delay here, an 8-second delay there… when you add them up, you get interfaces that generally take 200ms to exchange data taking 20-40 seconds instead. The larger problem was that we had 5000+ transactions backing up, and we were falling further and further behind because transactions were entering the queue far faster than they were being processed.
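The arithmetic is easy to sketch. The 200ms baseline and the 8-second stalls come from what I saw in the trace. Everything else below is an assumption for illustration only: the number of stalls per exchange, the arrival rate of 60 transactions per minute, the single worker and the 90-minute window are invented, and the function names are hypothetical.

```python
def exchange_time(base_seconds, stalls, stall_seconds):
    """Total time for one exchange that hits `stalls` retransmission delays."""
    return base_seconds + stalls * stall_seconds


def backlog_after(minutes, arrivals_per_min, seconds_per_transaction, workers=1):
    """Queue depth after `minutes`, starting from an empty queue.

    Assumes a constant arrival rate and a constant service time, which is
    a simplification; real traffic is burstier than this.
    """
    processed_per_min = workers * 60.0 / seconds_per_transaction
    growth_per_min = max(0.0, arrivals_per_min - processed_per_min)
    return growth_per_min * minutes


# Three to five 8-second stalls turn a 200ms exchange into
# something in the 20-40 second range.
slow_low = exchange_time(0.2, 3, 8.0)
slow_high = exchange_time(0.2, 5, 8.0)

# Hypothetical load: 60 transactions per minute into a single worker.
healthy_backlog = backlog_after(90, 60, 0.2)    # worker keeps up
degraded_backlog = backlog_after(90, 60, 30.0)  # worker falls behind
```

With the healthy 200ms service time, the worker can handle far more than 60 transactions a minute, so the queue stays empty. At 30 seconds per transaction it clears only 2 a minute, and under these made-up numbers the backlog is in the thousands within an hour and a half.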
In the end the vendor changed some of their Internet BGP peering to use a different Internet provider and path, and that magically solved the problem instantly. Some peering point out on the Internet was dropping packets, and that was causing our issues.
If you’ve ever heard of application packet loss, then by all means please educate me!