Michael McNamara
https://blog.michaelfmcnamara.com
technology, networking, virtualization and IP telephony

Black Friday 2015, behind the scenes
https://blog.michaelfmcnamara.com/2015/12/black-friday-2015-behind-the-scenes/
Fri, 04 Dec 2015 22:45:30 +0000

I was reading a post from Paul Stewart entitled "Black Friday, Technology Glitches and Revenue Lost" last night and thought to myself: there's a blog post in there somewhere.

As someone who works for a large global retailer, I have a newfound appreciation for the challenges and issues the holiday season presents to retailers. It's been a very tiring week for me and a large number of my co-workers as we worked tirelessly from Thursday night (Thanksgiving) through Tuesday morning to keep issues like the one Paul's wife experienced from impacting our brands. In the past two years I've learned that scale brings a whole other dimension to the game. I'm not just talking about bandwidth; that's usually a pretty simple problem to solve. I'm referring to all the inter-dependencies in both the application layer and the hardware (networking, storage, compute). To put that into perspective: over the long weekend, one of the brands we manage attracted over 7.9 million visitors to its website, generating some 234 million page views. By comparison, this blog gets about 1,500 page views a day.

Why is it so hard?

For these few weeks of the year, most sites will see significantly more traffic than they see during the rest of the year. Discount sales drive additional volume to sites that may already be struggling to keep up with the number of online users. Often the issue is scale: can the application scale, can the infrastructure scale to meet the demand, and, just as importantly, can the vendors and third parties you rely on scale too? Look at the issues Target experienced on Cyber Monday when it offered 15% off every item. Around 10AM that morning Target started having load issues and had to turn on Akamai's Shopper Prioritization Application (SPA), which essentially holds users in a queue within Akamai's network, only allowing a fixed number of users through the door at a time, to keep the website from completely collapsing under the load.
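A queueing mechanism like the one described above can be modeled, very roughly, as a fixed-capacity waiting room: a capped number of concurrent visitors reach the origin, and everyone else waits in FIFO order for a slot to free up. The Python sketch below is purely illustrative; the class, method names and capacity are hypothetical and not Akamai's actual implementation.

```python
from collections import deque

class WaitingRoom:
    """Toy model of an edge 'waiting room': at most `capacity` visitors
    are admitted toward the origin; the rest queue in FIFO order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = set()
        self.queue = deque()

    def arrive(self, visitor_id):
        if len(self.active) < self.capacity:
            self.active.add(visitor_id)
            return "admitted"
        self.queue.append(visitor_id)
        return "queued"

    def leave(self, visitor_id):
        self.active.discard(visitor_id)
        # Backfill the freed slot from the head of the queue.
        if self.queue and len(self.active) < self.capacity:
            self.active.add(self.queue.popleft())

room = WaitingRoom(capacity=2)
print(room.arrive("a"))  # admitted
print(room.arrive("b"))  # admitted
print(room.arrive("c"))  # queued
room.leave("a")
print("c" in room.active)  # True
```

The real service obviously runs distributed across an edge network, but the principle is the same: cap concurrency at the door so the origin never sees more load than it can survive.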

In my role I'm generally concerned with the following infrastructure:

  • Internet Service Providers (availability of the websites from multiple peering points)
  • Internet Bandwidth (bandwidth and performance to/from the websites from a global viewpoint)
  • Internet Load Balancers (balancing the external load across multiple external facing web servers)
  • Internet Firewalls (protecting origin servers)
  • Internal Bandwidth (performance within and between data centers)
  • Internal Load Balancers (load balancing internal API services)
  • Internal Firewalls (filtering traffic between web/app/db tiers)
  • Storage Fabric Performance (bandwidth utilization on individual SAN switches)
  • Storage Array Performance (read/write latency per storage pool or LUN)

When we have problems we'll often see connection counts on our external and internal firewalls spike as the web or app servers spin up additional processes and connections to compensate, either for the load or for a previously failed query or connection. That spike is often just a symptom of the problem, not the cause. The cause might be a third-party API that isn't responding quickly enough; because that API is backing up, our application backs up behind it. In another case a poorly written stored procedure severely impacted a database. While some folks believed the database itself was at fault, it wasn't hard to identify the poorly written SQL as the culprit: the stored procedure worked fine for 100 users but quickly fell over with 1,000 users on the website.
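Treating the connection count as a symptom starts with knowing what "normal" looks like. A minimal sketch of a baseline check (the function, sample numbers and multiplier are hypothetical, not our actual monitoring logic):

```python
def connection_spike(history, current, multiplier=3.0):
    """Flag a connection-count spike relative to a simple historical
    baseline. A spike is a symptom -- the cause (a slow third-party API,
    bad SQL) still has to be traced through the application tiers."""
    baseline = sum(history) / len(history)
    return current > multiplier * baseline

# Typical daily firewall connection samples vs. a holiday spike
# (illustrative numbers only).
normal = [21000, 20500, 22000, 21500]
print(connection_spike(normal, 23000))   # False: within normal variance
print(connection_spike(normal, 180000))  # True: investigate upstream
```

Real monitoring would use time-of-day seasonal baselines rather than a flat average, but even a crude threshold like this separates "busy day" from "something is backing up."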

Here’s one that people often forget about…

  • Voice Infrastructure – Contact Center 1-800 PRI/T1/SIP channel availability

We have 10 T1 circuits into one of our call centers, providing some 240 channels (the ability to handle 240 concurrent calls). Throughout the year we rarely get above 60 concurrent calls. On Black Friday and Cyber Monday, though, we ran at 100% utilization (all 240 channels busy, with callers queuing) for quite a few hours each day, keeping our agents extremely busy.
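The 240-channel figure follows from the 24 DS0 voice channels each T1 carries, and the classic Erlang B formula estimates how likely a caller is to find every trunk busy at a given offered load. A quick sketch (illustrative only; real trunk sizing uses measured erlangs, and Erlang B assumes blocked calls are lost rather than queued):

```python
def erlang_b(channels, offered_erlangs):
    """Erlang B blocking probability via the standard iterative
    recurrence (avoids computing large factorials directly)."""
    b = 1.0
    for n in range(1, channels + 1):
        b = (offered_erlangs * b) / (n + offered_erlangs * b)
    return b

# 10 T1 circuits x 24 DS0 voice channels per T1 = 240 trunks.
channels = 10 * 24
print(channels)  # 240

# Normal day: ~60 erlangs offered to 240 trunks -> blocking is negligible.
print(erlang_b(channels, 60) < 1e-6)  # True
# Black Friday: offered load at or beyond trunk capacity -> real blocking.
print(erlang_b(channels, 240) > 0.01)  # True
```

In other words, the trunk group is massively over-provisioned for a normal day and saturated on peak days, which matches what we saw.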

Isn’t the cloud all about scalability?

Yes, it certainly is, but not every retailer is as big as Amazon or has the ability to re-architect its entire application stack around the cloud. We purposely spin up additional web and app server instances just prior to the holiday season, and we spend a lot of time running load tests to validate that everything is working properly. Not every retailer has the resources or the staff to bulk up for the holiday rush.

The majority of large retailers definitely track cart abandonment, some retailers will even remind you of an item left in your cart and occasionally entice you with an additional discount or coupon code. Go add something to your cart at NewEgg.com and see how long it is before they drop you an email message.

Retailers rely on tools like Pingdom, AlertSite, New Relic, AppDynamics and AppBoy to provide visibility into their website or mobile application performance along with the user experiences (user timings).

So while it might seem like a pretty easy problem to quickly solve, it’s actually a very complex problem.

Cheers!

Image Credit: Philippe Ramakers

Black Friday – the aftermath
https://blog.michaelfmcnamara.com/2014/11/black-friday-the-aftermath/
Sat, 29 Nov 2014 15:07:09 +0000

I had been hoping for a relatively quiet day yesterday, but it was anything but. We were on a conference bridge around 11:30PM Thursday night to monitor the systems for a number of marketing campaigns kicking off at 12AM Friday morning. While there was a significant increase in the number of online shoppers (sessions) by 1AM Friday morning, everything was running smoothly. We reconvened in the office around 7AM Friday, and for the first two hours things were relatively quiet. We could see a steady increase in sessions and in bandwidth served from origin, but for the most part everything looked pretty good. It wasn't until the mobile app push went out that the boat sprang a few leaks.

[Image: Oracle website down]

Then came the news that Best Buy was down, which only served to excite the room even more. We run the same back-end e-commerce solution as Best Buy. In a twist of irony, that vendor's website is down as I write this post on Saturday morning.

We had started falling behind, to the point where we had a combined 180,000+ TCP connections across our Tier 1 and Tier 2 firewalls (we would usually have around 21,000). The load balancers were also getting a workout as their open TCP connection counts climbed from the usual double digits into five digits. The websites were slowing down but still working; the growing number of open TCP connections suggested we were falling behind, and that corrective action would be necessary or we would eventually fall over and crash.

The assembled team dug into the data using tools from New Relic, Zabbix and Splunk, to name a few, and quickly found API calls from the mobile app that needed tuning. The team worked diligently to steady the ship by spinning up additional back-end instances and by adding threads/workers to our existing caching, web and app tiers. Then we discovered something odd: packet loss on traffic egressing our Tier 1 perimeter firewall. It certainly hadn't been there before, but it was there now. Was it a result of the extreme number of sessions hitting the websites, or had something else broken? The team decided to spin up additional data centers and started directing new customers to them, which helped keep the existing data centers from falling over. We weren't sure whether the firewall problem would get worse, so the extra data centers gave us some flexibility if things went south. With the other data centers online and running, we failed over the HA firewall pair to see if that would correct the packet loss, and sure enough the problem disappeared after the failover.
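One rough way to quantify egress loss like this is to poll interface counters and compare transmitted packets against drops over an interval. The function below is a hypothetical sketch, e.g. fed from the deltas of the IF-MIB ifOutUcastPkts and ifOutDiscards counters between two SNMP polls:

```python
def egress_loss_pct(tx_pkts_delta, tx_drops_delta):
    """Estimate egress loss from interface counter deltas, e.g. the
    change in ifOutUcastPkts vs. ifOutDiscards between two polls."""
    total = tx_pkts_delta + tx_drops_delta
    return 100.0 * tx_drops_delta / total if total else 0.0

# Hypothetical polling interval on the perimeter firewall:
print(egress_loss_pct(990_000, 0))       # 0.0 -> healthy
print(egress_loss_pct(990_000, 10_000))  # 1.0 -> investigate / fail over
```

Counter deltas won't tell you *why* the firewall is dropping packets, but trending them per interface is usually enough to localize the loss to one device before you fail it over.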

I never thought we would see such a basic issue as packet loss inside the internal network, but we did. We were able to quickly identify the issue and take corrective action to remedy the problem. As for the root cause, I'm not sure we'll ever know with 100% certainty, but we'll probably implement some yearly preventative maintenance: that firewall had been up some 680 days.

After a few hours we spun down the extra data centers we had brought online, now that the issues we'd found were addressed and resolved. While there were a few traffic spikes through the remaining afternoon and evening, everything ran smoothly. In the end the executive team was happy with the outcome, and I was relieved to have survived Black Friday with only a few bruises. Thankfully we didn't have any issues on the retail store side, or with our call centers or distribution centers. It was e-commerce that consumed my entire day.

I'm sitting here writing this post contemplating what's in store for Cyber Monday, which for me will actually kick off around 8PM Sunday night. I'm told that in past years we've generally seen a 20-30% increase in traffic from Black Friday to Cyber Monday.

Cheers!

Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak, this is post number 6 of 30. All the posts can be viewed from the 30in30 tag.

Black Friday – Is it storming yet?
https://blog.michaelfmcnamara.com/2014/11/black-friday-is-is-storming-yet/
Fri, 28 Nov 2014 13:00:58 +0000

What's the Internet weather look like today?

While it was snowing in the Philadelphia, PA suburbs on Wednesday (I do love a white Thanksgiving), today I'm going to be focused entirely on the Internet weather. I'll most likely spend the majority of my day in the war room, carefully watching all the different metrics and assuring board members, executives and directors that everything is working smoothly. While I can control the technical pieces of the puzzle, I have little control over the actual sales numbers, although it's not a stretch for people to blame technology when sales fail to meet projections.

What are some resources that I use to monitor the overall health of the Internet outside of my data centers?

Cheers!

Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak, this is post number 5 of 30. All the posts can be viewed from the 30in30 tag.

Happy Thanksgiving!
https://blog.michaelfmcnamara.com/2014/11/happy-thanksgiving/
Thu, 27 Nov 2014 14:00:46 +0000

Happy Thanksgiving!

This year the Thanksgiving holiday has special significance given that I'm now working in the retail industry. Thanksgiving is quite literally the calm before the storm: the storm that is Black Friday and Cyber Monday, the two most important shopping days of the entire calendar year for almost every retailer.

What's my biggest concern? That Black Friday and Cyber Monday will be too good, so good that various front-end and back-end systems will fail or collapse under the surge of orders and shoppers. I'm not talking only about web-based e-commerce, but also the traditional brick-and-mortar storefronts and all the POS (Point of Sale) transactions and credit charges that are sure to flood our data centers, servers and networks.

Black Friday also marks the start of the holiday shopping season, which runs through Christmas. That's a total of 28 days where my team and I will need to be on our game and make sure we quickly address any issues or problems that arise. Any downtime during the holiday peak is very painful to the bottom line. If, for instance, your website takes in $1,000,000 an hour in sales and you're down for even just 10 minutes, that equates to roughly $166,000 in lost sales; down for 30 minutes and that's $500,000 in potential lost sales. In this case downtime can literally cost you your job, and then some. If that weren't enough, the recent cyber break-ins at some of the largest retailers have everyone on edge.
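The straight-line math behind those downtime numbers is simple enough to sketch (the revenue figures are illustrative only):

```python
def lost_sales(hourly_revenue, downtime_minutes):
    """Straight-line estimate of sales lost while a site is down."""
    return hourly_revenue * downtime_minutes / 60.0

print(round(lost_sales(1_000_000, 10)))  # 166667
print(round(lost_sales(1_000_000, 30)))  # 500000
```

It's a crude model (some shoppers come back later, some buy elsewhere), but it's the kind of back-of-the-envelope figure that gets an outage's cost across to an executive team instantly.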

It’s my belief that we’ve adequately prepared ourselves and that the infrastructure and architecture we’ve assembled including the Internet Service Providers and Content Delivery Networks will meet the surging demand over the next 28 days.

I definitely hope to enjoy the turkey, stuffing, carrots, mashed potatoes, sweet potatoes, corn and gravy this year, not to mention the ice cream and apple pie. Here in the States, football (American football, as it's referred to abroad) is a very popular sport, and I also hope to take in some of the game between my local Philadelphia Eagles and the Dallas Cowboys. The wife cheerfully reminded me, after proof-reading this post, that I'll also be spending some time in the kitchen doing the dishes! ;)

Cheers!

Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak, this is post number 4 of 30. All the posts can be viewed from the 30in30 tag.
