I’m here to report that I survived my first holiday season working in retail. Thankfully my team and I were able to keep the network infrastructure humming along without any major hiccups, which left everyone extremely happy, including our customers. Here’s a short look behind the scenes at our first big sale of the holiday season, leading up to Black Friday 2014.
The sale was scheduled to run from 7:00PM to 11:00PM and had become the kick-off event of the holiday sales season for one of our brands. I had learned that there had been significant technical issues in previous years, so there was considerable stress and pressure to make sure everything went off without a hitch. If any issues arose, I was there to quickly identify them and implement workarounds to minimize disruption. There was no shortage of wisecracks about possible network, firewall or load-balancer issues as the event drew closer. In the months leading up to that night we had run into a number of performance problems and technical hurdles, so while I was confident in my team’s preparation and in the network infrastructure in general, I knew success wasn’t assured. The application driving the website was CPU-, memory- and bandwidth-hungry, which can be extremely problematic when trying to scale an application. While the website used a premier Content Delivery Network (CDN), it wasn’t as well optimized as some of our other brands and would occasionally pull 1MB+ HTML files from origin. While gains had been made in optimizing the website, there were still concerns that the load on the web servers, application servers and database servers might be too much for the hardware to sustain.
Earlier that morning, around 10:30AM, an email campaign went out announcing the sale, which led to a significant jump in the number of sessions to the website along with a sizable increase in bandwidth utilization. A second email campaign went out around 3:00PM, again increasing the number of active sessions. At 6:30PM a push notification was sent to all mobile app (Apple and Android) users, which caused yet another significant spike in the number of connections.

At 7:00PM sharp the sale started with an impressive but steady stream of users shopping the site and checking out. There wasn’t much change until I noticed another increase in the number of connections to the load balancer around 8:30PM; after inquiring with the marketing team we learned that another email campaign had been sent to registered users. There were brief stints here and there where one or two application servers would slow down a little, but nothing really significant until 10:30PM. Around that time yet another scheduled push notification went out to all the mobile app users, at which point more than half of the application servers came under significant load, each fielding more than 100 active connections from the load balancer. After a few minutes one of the application servers just couldn’t keep up; it became unresponsive and eventually failed out of the server farm, with the customers that had been connected to it directed to one of the remaining application servers. A few minutes later the server recovered by itself, was automatically placed back into the server farm without any manual intervention, and ran fine the rest of the night. When 11PM came we learned that the executive team had decided to extend the sale an additional hour to midnight, but the sessions and traffic were already starting to wind down. We had made it through the four-hour sale relatively unscathed.
There was a small interruption for a few customers when one of the application servers went unresponsive, but that issue only lasted about five minutes. It was a multi-million dollar event, and I felt pretty good about the performance of the network infrastructure.
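The automatic fail-out and recovery described above is standard load-balancer health-check behavior: a server is pulled from the farm after a run of failed probes and re-added once it passes enough probes in a row. Here’s a minimal sketch of that logic in Python; the class name and thresholds are illustrative, not our actual ACE configuration:

```python
class HealthChecker:
    """Toy model of load-balancer health checks (hypothetical, for
    illustration): fail a server out of the farm after fail_threshold
    consecutive missed probes, re-add it after rise_threshold passes."""

    def __init__(self, servers, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        # Track consecutive probe results and in-farm status per server.
        self.state = {s: {"fails": 0, "passes": 0, "in_farm": True}
                      for s in servers}

    def record_probe(self, server, healthy):
        st = self.state[server]
        if healthy:
            st["fails"] = 0
            st["passes"] += 1
            # Re-add a failed server once it passes enough probes in a row.
            if not st["in_farm"] and st["passes"] >= self.rise_threshold:
                st["in_farm"] = True
        else:
            st["passes"] = 0
            st["fails"] += 1
            # Fail the server out after enough consecutive missed probes.
            if st["in_farm"] and st["fails"] >= self.fail_threshold:
                st["in_farm"] = False

    def active_servers(self):
        return [s for s, st in self.state.items() if st["in_farm"]]


lb = HealthChecker(["app1", "app2", "app3"])
for _ in range(3):            # app2 misses three probes in a row...
    lb.record_probe("app2", healthy=False)
print(lb.active_servers())    # app2 has been failed out of the farm

for _ in range(2):            # ...then starts answering probes again
    lb.record_probe("app2", healthy=True)
print(lb.active_servers())    # app2 is automatically placed back
```

Requiring several consecutive failures before acting is what kept a momentarily slow server from flapping in and out of the farm, and it’s why the recovery that night needed no manual intervention.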
I had a few observations that night. Email marketing campaigns, if not properly throttled, can inadvertently overload a website. Mobile push notifications can have the same effect, although I would say they are more detrimental. While we saw significant click-through conversions on the email marketing campaigns, which in turn drove system and network utilization, the spikes from mobile push notifications were much more severe. I’m guessing the conversion rate was much higher on the mobile app than on the email campaigns, and as such it led to significant upticks in both system and network utilization. In short, people were much more inclined to open a push notification and immediately start shopping the website.
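Throttling a campaign just means metering how fast the messages leave, so the resulting click-through traffic ramps up gradually instead of slamming the site all at once. A hypothetical sketch in Python; the function name, batch size and interval are made up for illustration, not what any marketing platform actually used:

```python
import time

def send_campaign(recipients, send_one, batch_size=5000, interval_sec=60):
    """Send a campaign in throttled batches (illustrative sketch).

    recipients   -- list of addresses/device tokens to notify
    send_one     -- callable that delivers one message (hypothetical)
    batch_size   -- messages per wave (illustrative knob)
    interval_sec -- pause between waves (illustrative knob)
    """
    for i in range(0, len(recipients), batch_size):
        # Deliver one wave of messages.
        for recipient in recipients[i:i + batch_size]:
            send_one(recipient)
        # Pause before the next wave so the click-through traffic from
        # this batch has time to peak and subside.
        if i + batch_size < len(recipients):
            time.sleep(interval_sec)
```

With 500,000 recipients, 5,000 per minute would spread the send over roughly 100 minutes, turning one sharp connection spike into a sustained, manageable plateau.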
Now the work begins for next year’s holiday season. Here are a few of the bigger network infrastructure projects that my team and I will be working on:
- Upgrade Cisco 6509 pair from standalone to VSS cluster
- Upgrade Cisco ACE 4710s to A10 Thunder 3030 Load Balancers
- Design and deploy Aruba ClearPass for Network Access Control
- Design and deploy Infoblox IPAM for DNS/DHCP Services with DNS Firewall
Image Credit: Sias van Schalkwyk
Sounds like a heck of a ride, straight out of the last chapters of The Phoenix Project :)
I’d love to hear more about the additional projects you listed at the end, as a couple of them relate to some of our long-term ones. On the load balancing side, we currently have a pair of AX3030 units that we’ve been happy with, so we’re starting to think about the new Thunder platform. I’ve started going through some of the new Harmony materials, but I haven’t found anything concrete enough to give me a good feel for what the big deal is.
On the IPAM front, I just wanted to throw out that we’re going through a selection process to replace our current system, and we have been much more impressed with EfficientIP than either Infoblox or BlueCat. We have a handful of interesting integrations and use cases (edu environment), and they seem a lot more flexible.
Thanks again for the story =)
Michael McNamara says
Thanks for the comment Frank!
I already have two pairs of A10 AX 3200s servicing the internal-facing network, providing GSLB internally between my data centers. We had a few code issues early on, but they have been working well for the past year. Now we’re going to be replacing our Internet-facing Cisco ACE 4710s.
I’ve deployed Infoblox in the past, so I can’t claim I’m impartial. I had great success with them and I’m hoping to bring that success to this new job. The security team is equally excited about the DNS Firewall capabilities.
Great post Michael, I really like it. For me it is always a highlight as a network engineer to troubleshoot and monitor big events. That feeling when everybody is working together, the team understands each other blindly, and you keep the network going under all possible circumstances. That is something that I really like about my job. And at the end of the day, as RFC 1925 (9) says, “For all resources, whatever it is, you need more.”
Best of luck for the upgrade projects.
I really appreciate your post. Could you comment on how many pageviews and hits your web servers were taking at the peak? When one of your application servers went unresponsive, what were the pageviews and hits at the time?
I too have Cisco ACE 4710s, but I’m replacing them with Citrix NetScalers. Did you consider the NetScalers when you were looking at load balancer alternatives? Just curious why you chose the A10s.