Michael McNamara https://blog.michaelfmcnamara.com technology, networking, virtualization and IP telephony

AT&T Says Failure was Verizon’s Fault https://blog.michaelfmcnamara.com/2019/08/att-says-failure-was-verizons-fault/ Sat, 10 Aug 2019 12:13:50 +0000

This is yet another example of AT&T failing a large enterprise customer. While this post has nothing to do with all the recent hubbub around AT&T’s new “5G E” marketing campaign, it highlights the continued challenges enterprises face when dealing with large telecom carriers that either just don’t care or simply don’t have the ability to operate and manage large-scale networks effectively.

The incident started as most incidents start… a call from the Help Desk reporting that there was a lot of red on the network dashboard and that e-mail alerts were flowing in by the hundreds. In this case I got a call from one of my network engineers informing me that the primary AT&T AVPN/MPLS link into our primary Data Center was down and had been down for almost 60 minutes. That was very unwelcome news, as it would significantly impact a large portion of our user base and a number of business-critical applications.

While AT&T was supposedly “testing” the circuit, my team and I went about re-routing traffic through a secondary Data Center that was still connected to the AT&T AVPN/MPLS network. From our secondary Data Center, traffic could flow over a dedicated WAN link to our primary Data Center. That effort of juggling BGP and EIGRP route maps and policies took about 2 hours before the majority of traffic was re-routed and working again, although the re-route added about 140ms of round-trip time to every IP packet because traffic now had to traverse the US West coast instead of the US East coast. We have firewalls throughout the WAN layer, so asymmetric routing will cause all sorts of issues and problems, and since we also have some DMVPN sprinkled in there, some care and planning was needed to successfully re-route traffic.
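For anyone curious what that kind of re-route can look like, here is a minimal, IOS-style sketch of the general idea rather than our actual change; the ASN, neighbor address, prefixes and route-map names are all invented for illustration, and the real policy touched far more prefixes, firewalls and DMVPN tunnels:

    ! Hypothetical sketch only - ASN, neighbor, prefixes and names are invented.
    ! On the secondary Data Center's MPLS edge router: prefer reaching the
    ! primary Data Center's prefixes over the dedicated DC-to-DC WAN link.
    ip prefix-list PRIMARY-DC-NETS seq 10 permit 10.10.0.0/16 le 24

    route-map PREFER-DC-INTERCONNECT permit 10
     match ip address prefix-list PRIMARY-DC-NETS
     set local-preference 200
    route-map PREFER-DC-INTERCONNECT permit 20

    router bgp 65001
     neighbor 192.0.2.1 route-map PREFER-DC-INTERCONNECT in

    ! Redistribute into EIGRP so the rest of the WAN layer follows the same
    ! (symmetric) path through the firewalls.
    router eigrp 100
     redistribute bgp 65001 metric 100000 100 255 1 1500

The reason the policy work matters is that traffic in both directions has to take the same path; otherwise the stateful firewalls in the WAN layer will drop the return traffic.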

At the 3-hour mark AT&T declared that the circuit was good and that no issues had been found; the technician explained that we should “verify power”. At the 7-hour mark AT&T told us that their last-mile provider, Verizon, had de-provisioned the 1Gbps transport, and that was the cause of our outage.

Thankfully Verizon was able to re-provision the circuit within 20 minutes, although it would take AT&T and Verizon another 9 hours before they could commit that the circuit wouldn’t be “automatically” de-provisioned again the following night.

I truly miss the days of unmanaged dark fiber, where all I needed to worry about were fiber breaks and my own gear… while we had a number of fiber breaks they were fairly infrequent, and in the majority of cases they were quickly remedied within 2 hours. I can’t even get a call back from AT&T in under 2 hours, forget about a resolution in under 2 hours.

What story do you have to share regarding any telecom carriers?

Cheers!

Who tested the test plan before the change? https://blog.michaelfmcnamara.com/2016/11/who-tested-the-test-plan-before-the-change/ Sat, 12 Nov 2016 23:48:03 +0000

I was reminded of this little gem this past weekend while I was doing some consulting work. I was replacing a legacy Cisco router and splitting the Internet and WAN routing out to separate pieces of hardware, so there were more than a few routing changes needed. After about 90 minutes of work and configuration changes I asked the client to run through their test plan.

I learned a long, long time ago that you need to test the test plan. All too often I’ve found items listed in the test plan that never worked even before the change, and many hours were wasted trying to fix something that had never worked in the first place and had nothing to do with the change that was in progress.

I had mentioned this to the client, but I guess my warning fell on deaf ears. The client was unable to reach site Z via their VPN. I checked all the routing and ACLs and everything looked good. I asked the client if this had worked before today, and the client was adamant that it had. It was time for truth or dare. I launched a Windows 7 VM and fired up Cisco AnyConnect so I could observe the problem first hand. I quickly noticed that my Windows VM didn’t have a route for the remote network in question: the IP network for site Z wasn’t in the split-tunnel list. I hadn’t changed anything in the AnyConnect VPN configuration, so in short it had never worked. I added the IP network to the split-tunnel list, asked the client to disconnect and reconnect, and bingo, it was now working.
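For reference, the fix amounted to a single entry in the split-tunnel access list, plus making sure the group policy actually references that list. The snippet below is a generic ASA-style sketch, not the client’s configuration; the ACL and group-policy names and the 10.50.20.0/24 network standing in for site Z are invented:

    ! Hypothetical ASA-style sketch - names and the 10.50.20.0/24 network
    ! (standing in for site Z) are invented for illustration.
    access-list SPLIT-TUNNEL standard permit 10.50.20.0 255.255.255.0

    group-policy ANYCONNECT-GP attributes
     split-tunnel-policy tunnelspecified
     split-tunnel-network-list value SPLIT-TUNNEL

With split tunneling, AnyConnect only installs routes for the networks in that list, which is why my VM had no route to site Z until the network was added and the session was re-established.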

Please do yourself a favor and make life easier on yourself: run through that test plan before you change anything, and make sure everything works as expected before the change begins. It will save you time and money. I’m happy to take on the challenge of unraveling the mystery, but my time isn’t cheap.

Cheers!
Image Credit: Flaviu Lupoian

Internet Utilization at 99.9% Arrgghhh! https://blog.michaelfmcnamara.com/2009/06/internet-utilization-at-99-9-arrgghhh/ Thu, 25 Jun 2009 00:00:31 +0000

I thought I would just share this short story with you all… it’s a classic case of what can happen even with the best of plans and intentions. We recently deployed Adobe Acrobat Reader 9.1.2 via Microsoft Active Directory Group Policy.

We rushed the deployment in order to address some of the recent Acrobat vulnerabilities that were being actively exploited in the wild by Nine-Ball and other trojans/malware. Almost immediately after the package had been deployed we noticed an unusual uptick in Internet utilization. When we examined our Websense logs we found an extreme number of HTTP requests to swupd.adobe.com. We determined that these requests were coming from Adobe software products attempting to check for updates via Adobe’s auto-update feature. The HTTP requests were being denied by our Blue Coat ProxySG appliances because we require user authentication to access the Internet; while the Adobe auto-update component was able to read the PAC file configured within Internet Explorer, it was not able to provide credentials when challenged with a 407 Proxy Authentication Required response. We originally thought the sheer number of clients making requests was putting an undue burden on the system, so we added some CPL code to our Blue Coat ProxySG appliances to allow non-authenticated access to *.adobe.com. Within minutes of that change the wheels came flying off the bus.
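The CPL change itself was tiny; something along these lines, a sketch from memory rather than our exact policy (layer placement and any additional conditions will vary):

    ; Hypothetical CPL sketch - exempt Adobe's update traffic from proxy authentication.
    <Proxy>
        url.domain=adobe.com authenticate(no)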

We happen to have two 50Mbps Ethernet links to the Internet served by two Blue Coat ProxySG appliances, with about 5,500 client PCs behind them. Within minutes both ProxySG appliances went to 96% CPU utilization and both Internet links went to 99.9% utilization. We had let the cat out of the bag and it was off and running… the number of client PCs trying to download updates from Adobe surged and started to choke our two Internet connections.

Thankfully the Blue Coat ProxySG appliances support bandwidth classes. We created a 1Mbps class and added some CPL code to bandwidth-restrict access to *.adobe.com. While that proved to be the quick fix, we’re also deploying an update via Group Policy to disable the auto-update feature per Adobe’s knowledge base article.
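For the curious, the bandwidth class itself is defined outside of policy (we created ours through the Management Console); the CPL then just ties the Adobe traffic to that class. A rough sketch, where “Adobe_1Mbps” is an invented name for the 1Mbps class:

    ; Hypothetical CPL sketch - "Adobe_1Mbps" is an invented name for a 1Mbps
    ; bandwidth class defined in the Management Console.
    <Proxy>
        url.domain=adobe.com limit_bandwidth.server.inbound(Adobe_1Mbps)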

Cheers!
