I was reading a post from Paul Stewart entitled “Black Friday, Technology Glitches and Revenue Lost” last night and thought to myself: there’s a blog post in there somewhere.
As someone who works for a large global retailer, I have a newfound appreciation for the challenges the holiday season presents to retailers. It’s been a very tiring week for me and a large number of my co-workers as we worked from Thursday night (Thanksgiving) through Tuesday morning to keep issues like the one Paul’s wife experienced from impacting our brands. In the past two years I’ve learned that scale brings a whole other dimension to the game. I’m not just talking about bandwidth; that’s usually a pretty simple problem to solve. Instead I’m referring to all the inter-dependencies in both the application layer and the hardware (networking, storage, compute). To put this into perspective: over the long weekend, one of the brands we manage attracted over 7.9 million visitors to its website, generating some 234 million page views. By comparison, this blog gets about 1,500 page views a day.
Why is it so hard?
For these few weeks of the year, most sites will see significantly more traffic than they do during the rest of the year. Discount sales also drive additional volume to sites that may struggle to keep up with the number of online users. Often the issue is scale: can the application scale? Can the infrastructure scale to meet the demand? And, more importantly, can the vendors and third parties you rely on also scale to meet the demand? Look at the issues Target experienced on Cyber Monday when they offered 15% off every item. Around 10 AM that morning they started having load issues and had to turn on Akamai’s Shopper Prioritization Application (SPA), which essentially holds users in a queue within Akamai’s network, only allowing a fixed number of users through the door to keep the website from completely collapsing under the load.
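The core idea behind that kind of “waiting room” is simple admission control: cap how many visitors are actively shopping and queue the rest. A minimal sketch of the concept (this is my illustration, not how Akamai’s SPA is actually built):

```python
import threading


class AdmissionGate:
    """Admit at most max_active shoppers; everyone else waits their turn."""

    def __init__(self, max_active):
        self._slots = threading.Semaphore(max_active)

    def try_enter(self):
        # Non-blocking: True means the visitor is let in,
        # False means they go to the waiting room / queue page.
        return self._slots.acquire(blocking=False)

    def leave(self):
        # A shopper finished; free a slot for the next queued visitor.
        self._slots.release()
```

The point of the cap is that a site running at a fixed, tested concurrency degrades predictably, while an uncapped flood takes everyone down at once.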
In my role I’m generally concerned with the following infrastructure:
- Internet Service Providers (availability of the websites from multiple peering points)
- Internet Bandwidth (bandwidth and performance to/from the websites from a global viewpoint)
- Internet Load Balancers (balancing the external load across multiple external facing web servers)
- Internet Firewalls (protecting origin servers)
- Internal Bandwidth (performance within and between data centers)
- Internal Load Balancers (load balancing internal API services)
- Internal Firewalls (filtering traffic between web/app/db tiers)
- Storage Fabric Performance (bandwidth utilization on individual SAN switches)
- Storage Array Performance (read/write latency per storage pool or LUN)
When we have problems we’ll often see connection counts on our external and internal firewalls spike as the web or app servers spin up additional processes and connections to compensate, either for the load or for a previously failed query or connection. This is often just a symptom of the problem, not the cause. The cause might be a third-party API that isn’t responding quickly enough; because that API is backing up, our application backs up behind it. In another example, a poorly written stored procedure severely impacted a database. Some folks believed the database itself was at fault, but it wasn’t hard to quickly identify the poorly written SQL as the culprit: the stored procedure worked for 100 users but quickly failed with 1,000 users on the website.
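One common defense against that kind of back-pressure is to fail fast rather than let connections stack up behind a dying dependency. A minimal circuit-breaker sketch (the class, thresholds, and names are my own illustration, not our production code):

```python
import time


class CircuitBreaker:
    """Stop calling a flaky third-party API after repeated failures,
    instead of letting connections pile up waiting on it."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # failures before we "open" the circuit
        self.reset_after = reset_after    # seconds before we try the API again
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit is open: refuse immediately, don't tie up a connection.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

With something like this in front of the slow API, the web tier gets an instant error it can handle gracefully instead of a thousand threads all waiting on the same timeout.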
Here’s one that people often forget about…
- Voice Infrastructure – Contact Center 1-800 PRI/T1/SIP channel availability
We have 10 T1 circuits into one of our call centers, providing some 240 channels (the ability to handle 240 concurrent calls). Throughout the year we rarely see more than 60 concurrent calls. On Black Friday and Cyber Monday, though, we ran at 100% utilization (all 240 channels in use) for quite a few hours each day, keeping our agents extremely busy.
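Telecom folks size trunks like these with the Erlang B formula, which gives the probability that a new call finds every channel busy. A quick sketch of the math (the traffic figures in the usage note are illustrative, not our actual call volumes):

```python
def erlang_b(channels, offered_erlangs):
    """Probability a new call is blocked (Erlang B), via the standard
    iterative recurrence B(n) = A*B(n-1) / (n + A*B(n-1))."""
    b = 1.0  # B(0): with zero channels, every call is blocked
    for n in range(1, channels + 1):
        b = (offered_erlangs * b) / (n + offered_erlangs * b)
    return b
```

At a typical 60 erlangs offered to 240 channels, blocking is effectively zero; it’s the holiday spike that eats all that headroom, which is why the queue fills up only a few days a year.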
Isn’t the cloud all about scalability?
Yes, it certainly is, but not every retailer is as big as Amazon, with the ability to re-architect its entire application stack around the cloud. We purposely spin up additional web and app server instances just prior to the holiday season, and we spend a lot of time running load tests to validate that everything is working properly. Not every retailer has the resources or the staff to bulk up for the holiday rush.
The majority of large retailers track cart abandonment; some will even remind you of an item left in your cart and occasionally entice you with an additional discount or coupon code. Go add something to your cart at NewEgg.com and see how long it takes before they drop you an email.
Retailers rely on tools like Pingdom, AlertSite, New Relic, AppDynamics and AppBoy to provide visibility into the performance of their websites and mobile applications, along with the user experience (user timings).
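Under the hood, the uptime side of those services boils down to repeated, timed synthetic checks. A stripped-down sketch (the `fetch` callable and threshold are placeholders; a real check would fetch the storefront URL from multiple regions on a schedule):

```python
import time


def timed_check(fetch, threshold_ms=2000.0):
    """Run one synthetic check: invoke fetch() and report
    whether it succeeded and how long it took."""
    start = time.monotonic()
    ok = True
    try:
        fetch()  # e.g. an HTTP GET against the storefront
    except Exception:
        ok = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return {"ok": ok, "latency_ms": elapsed_ms, "slow": elapsed_ms > threshold_ms}
```

The commercial tools add the important parts on top: probes in many geographies, alerting, historical trending, and real-user timings collected from the browser.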
So while it might seem like a pretty easy problem to quickly solve, it’s actually a very complex problem.
Cheers!
Image Credit: Philippe Ramakers