I had been hoping for a relatively quiet day yesterday, but it was anything but. We had been on a conference bridge since around 11:30PM on Thursday night to monitor the system for a number of marketing campaigns that kicked off at 12AM Friday morning. While there was a significant increase in the number of online shoppers (sessions) by 1AM Friday morning, everything was running smoothly. We reconvened in the office on Friday morning around 7AM, and for the first two hours things were relatively quiet. We could see a steady increase in the number of sessions and in the bandwidth being served from origin, but for the most part everything looked pretty good. It wasn’t until the mobile app push went out that the boat sprang a few leaks.
Then came the news that Best Buy was down, which only served to excite the room even more, since we run the same back-end E-Commerce solution as Best Buy. In a twist of irony, that vendor’s own website is down as I write this post on Saturday morning.
We had started falling behind, to the point that we had a combined 180,000+ TCP connections on our Tier1 and Tier2 firewalls (we would usually have around 21,000). The load-balancers were also getting a workout as their number of open TCP connections climbed from the usual double digits into five digits. While the websites were slowing down, they were still working, but the climbing connection counts suggested we were starting to fall behind; without corrective action we would eventually fall over and crash.
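The kind of rule of thumb we were applying here can be sketched in a few lines: compare open connection counts against a known-good baseline and flag anything climbing far above it. The `connection_alerts` helper, the 3x factor, and the sample readings below are illustrative assumptions, not our actual monitoring tooling (the 21,000 baseline and 180,000+ peak are the figures from that day):

```python
# Hedged sketch: flag when open TCP connection counts climb far above a
# known-good baseline. Names, factor, and samples are illustrative only.

def connection_alerts(samples, baseline, factor=3.0):
    """Return the (timestamp, count) pairs where count exceeds factor * baseline."""
    return [(ts, n) for ts, n in samples if n > factor * baseline]

BASELINE = 21_000  # typical combined firewall connection count

samples = [
    ("09:00", 24_000),
    ("10:30", 65_000),
    ("11:15", 180_000),  # roughly where we found ourselves on Friday
]

for ts, n in connection_alerts(samples, BASELINE):
    print(f"{ts}: {n:,} open connections ({n / BASELINE:.1f}x baseline)")
```

In practice these counts would come from the firewall and load-balancer counters themselves (or something like conntrack on a Linux box), but the thresholding logic is the same.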
The assembled team dug into the data using tools from New Relic, Zabbix and Splunk, to name a few, and quickly found issues with API calls from the mobile app that were in need of tuning. The team worked diligently to steady the ship by spinning up additional back-end instances and by adding threads/workers to our existing caching, web and app tiers. Then we discovered something odd: packet loss on traffic egressing our Tier1 perimeter firewall. That certainly hadn’t been there before, but it was there now. Was it a result of the extreme number of sessions hitting the websites, or had something else broken? The team decided to spin up additional data centers and began shedding new customers to them, which helped keep the existing data centers from falling over. We weren’t sure whether the firewall problem would get worse, so the extra data centers gave us some flexibility if things went south. With the other data centers online and running, we failed over the HA firewall pair to see if that would correct the packet loss, and sure enough the problem disappeared after the failover.
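Measuring packet loss like the loss we saw at the perimeter boils down to comparing probes sent against probes that made it back; in practice you would use `ping` or `mtr` through the suspect path. A minimal sketch of the arithmetic, where `loss_rate` and its inputs are hypothetical:

```python
# Hedged sketch: compute a loss rate from probe sequence numbers.
# In practice the probes would be ICMP echoes sent through the suspect
# firewall; here the inputs are just illustrative numbers.

def loss_rate(sent, received_seqs):
    """Fraction of probes lost, given the count sent and the sequence
    numbers of probes that came back (duplicates counted once)."""
    return 1.0 - len(set(received_seqs)) / sent

# e.g. 100 probes through the perimeter, 7 never came back
print(f"{loss_rate(100, range(93)):.0%} loss")
```

Sustained non-zero loss on an otherwise idle internal path is what pointed us at the firewall itself rather than at the traffic load.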
I never thought we would see such a basic issue as packet loss within the internal network, but we did. We were able to quickly identify the issue and take corrective action to remedy it. As for the root cause, I’m not sure we’ll ever know with 100% certainty, but we’ll probably be implementing yearly preventative maintenance, as that firewall had been up some 680 days.
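That uptime-driven maintenance idea could be expressed as a simple policy check. The device names, the one-year window, and all the uptimes except the 680-day firewall are illustrative assumptions:

```python
# Hedged sketch: flag devices that have been up longer than a maintenance
# window. The 365-day window and device inventory are assumptions.

MAX_UPTIME_DAYS = 365  # assumed yearly preventative-maintenance window

uptime_days = {
    "fw-tier1-a": 680,  # the firewall from Friday
    "fw-tier2-a": 680,
    "lb-edge-1": 120,
}

overdue = {name: days for name, days in uptime_days.items()
           if days > MAX_UPTIME_DAYS}

for name, days in overdue.items():
    print(f"{name}: up {days} days, schedule a maintenance failover")
```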
After a few hours we spun down the extra data centers we had brought online, now that the issues we found had been addressed. While there were a few traffic spikes throughout the remaining afternoon and evening, everything ran smoothly. In the end the executive team was happy with the outcome, and I was relieved to have survived Black Friday with only a few bruises. Thankfully we didn’t have any issues on the retail store side, or with our call centers or distribution centers. It was E-Commerce that had consumed my entire day.
I’m sitting here writing this post contemplating what’s in store for Cyber Monday, which for me will actually kick off around 8PM Sunday night. I’m told that in past years we’ve generally seen a 20-30% increase in traffic between Black Friday and Cyber Monday.
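For rough capacity planning, that 20-30% uplift is simple arithmetic. A quick sketch, using Friday’s peak connection count as a stand-in baseline (the post doesn’t give a session count, so this baseline choice is an assumption):

```python
# Hedged sketch: project Cyber Monday load from the Black Friday peak.
# The 180,000 figure is Friday's peak firewall connection count, used
# here only as a convenient stand-in for "traffic."

def projected(peak, uplift):
    """Apply a fractional uplift to a peak figure."""
    return peak * (1 + uplift)

BLACK_FRIDAY_PEAK = 180_000

for uplift in (0.20, 0.30):
    print(f"+{uplift:.0%}: {projected(BLACK_FRIDAY_PEAK, uplift):,.0f}")
```

Even the low end of that range would put us well above Friday’s peak, which is why the bridge opens Sunday night.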