I’m here to report that I survived my first holiday season working in retail. Thankfully my team and I were able to keep the network infrastructure humming along without any major hiccups or issues, which left everyone extremely happy, including our customers. Here’s a short look behind the scenes of our first big sale of the holiday season leading up to Black Friday 2014.
The sale was scheduled to run from 7:00PM to 11:00PM and had become the kick-off event of the holiday sales season for one of our brands. I had learned that there were significant technical issues in previous years, so there was a considerable amount of stress and pressure to make sure that everything went off without a hitch. If there were any issues, I was there to help quickly identify them and implement any needed workarounds to minimize disruptions. There was no shortage of wisecracks about possible network, firewall or load-balancer issues as the event grew closer.
In the months leading up to that night we had run into a number of performance problems and technical hurdles, so while I was confident in my team’s preparation and in the network infrastructure in general, I knew success wasn’t assured. The application driving the website was overly hungry for system resources, memory and bandwidth, which can be extremely problematic when trying to scale an application. While the website was using a premier Content Distribution Network (CDN), it wasn’t as well optimized as some of our other brands’ sites and would occasionally pull 1MB+ HTML files from origin. While gains had been made in optimizing the website, there were still concerns that the load on the web servers, application servers and database servers might be too much for the hardware to sustain.
Earlier that morning, around 10:30AM, an email campaign had gone out announcing the sale, and that led to a significant jump in the number of sessions on the website along with a sizable increase in bandwidth utilization. A second email campaign went out around 3:00PM, which again increased the number of active sessions. At 6:30PM a push notification was sent to all mobile app users (Apple and Android), which caused yet another significant spike in the number of connections.
At 7:00PM sharp the sale started with an impressive but steady stream of users shopping the site and checking out. There wasn’t much change until I noticed another increase in the number of connections to the load balancer around 8:30PM. After inquiring with the marketing team, we learned that another email campaign had been sent to registered users. There were brief stints here and there where one or two application servers would slow down a little, but nothing really significant until 10:30PM.
Around that time yet another scheduled push notification went out to all the mobile app users, and more than half of the application servers came under significant load, each fielding more than 100 active connections from the load balancer. After a few minutes one of the application servers just couldn’t keep up and became unresponsive, eventually failing out of the server farm, with the customers who had been connected to it being directed to one of the remaining application servers. A few minutes later the server recovered by itself, was automatically placed back into the server farm without any manual intervention, and ran fine the rest of the night.
When 11PM came we learned that the executive team had decided to extend the sale an additional hour to midnight, but the number of sessions and the traffic were already starting to wind down. We had made it through the four-hour sale relatively unscathed.
There was a small interruption for a few customers when one of the application servers went unresponsive, but that issue only lasted about 5 minutes. It was a multimillion-dollar event and I felt pretty good about the performance of the network infrastructure.
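For readers who haven’t worked with load balancers, here is a minimal sketch of the kind of health-check logic that takes a server out of the farm and puts it back without anyone touching it. This is not our actual load balancer configuration; the class name and both thresholds are made up for illustration, and real products expose their own probe settings.

```python
class ServerHealth:
    """Tracks one real server's state from a stream of probe results.

    The thresholds are hypothetical: the server leaves the farm after
    `fail_threshold` consecutive failed probes and rejoins after
    `recover_threshold` consecutive successful ones.
    """

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.in_service = True
        self._fails = 0
        self._passes = 0

    def record_probe(self, ok):
        """Record one probe result and return whether the server is in service."""
        if ok:
            self._passes += 1
            self._fails = 0
            if not self.in_service and self._passes >= self.recover_threshold:
                self.in_service = True   # automatically placed back in the farm
        else:
            self._fails += 1
            self._passes = 0
            if self.in_service and self._fails >= self.fail_threshold:
                self.in_service = False  # failed out; new connections go elsewhere
        return self.in_service


server = ServerHealth()
probes = [True, False, False, False, False, True, True]
states = [server.record_probe(ok) for ok in probes]
print(states)  # [True, True, True, False, False, False, True]
```

Requiring several consecutive results in each direction keeps a single slow response from flapping a server in and out of rotation.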
I had a few observations that night. Email marketing campaigns, if not properly throttled, can inadvertently overload a website. Mobile push notifications can have the same effect, although I would say they are more detrimental. While we saw significant click-through conversions on the email marketing campaigns, which in turn drove system and network utilization, the peaks from the mobile push notifications were much more severe. I’m guessing the conversion rate was much higher on the mobile app than on the email campaigns and as such led to significant upticks in both system and network utilization. In short, people were much more inclined to open a push notification and immediately start shopping the website.
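One way to picture throttling is to send a notification in staggered batches instead of to everyone at once. The sketch below is only an illustration of that idea, not something we ran that night; the function name, batch size, interval, device IDs and start time are all hypothetical.

```python
from datetime import datetime, timedelta


def staggered_batches(recipients, batch_size, interval_seconds, start):
    """Split recipients into batches and give each batch its own send time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    schedule = []
    for i in range(0, len(recipients), batch_size):
        # Each successive batch is delayed by one more interval.
        offset = timedelta(seconds=(i // batch_size) * interval_seconds)
        schedule.append((start + offset, recipients[i:i + batch_size]))
    return schedule


# Hypothetical example: 10 devices, 4 per batch, batches one minute apart.
devices = [f"device-{n}" for n in range(10)]
plan = staggered_batches(
    devices,
    batch_size=4,
    interval_seconds=60,
    start=datetime(2014, 11, 1, 18, 30),  # placeholder date and time
)
for send_at, batch in plan:
    print(send_at.strftime("%H:%M"), len(batch))
```

Spreading the sends over a few minutes trades a little immediacy for a flatter ramp of new connections, which gives the application servers a better chance of keeping up.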
Now the work begins for next year’s holiday season. Here are a few of the bigger network infrastructure projects that my team and I will be working on:
- Upgrade Cisco 6509 pair from standalone to VSS cluster
- Upgrade Cisco ACE 4710s to A10 Thunder 3030 Load Balancers
- Design and deploy Aruba ClearPass for Network Access Control
- Design and deploy Infoblox IPAM for DNS/DHCP Services with DNS Firewall
Image Credit: Sias van Schalkwyk