This is yet another example of AT&T failing a large enterprise customer. While this post has nothing to do with all the recent hubbub around AT&T’s new “5G E” marketing campaign, it highlights the continued challenges enterprises face in dealing with large telecom carriers that either don’t care or simply lack the ability to operate and manage large-scale networks effectively.
The incident started as most incidents start… a call from the Help Desk reporting that there was a lot of red on the network dashboard and that e-mail alerts were flowing in by the hundreds. In this case I got a call from one of my network engineers informing me that the primary AT&T AVPN/MPLS link into our primary Data Center was down and had been down for almost 60 minutes. That was very unwelcome news, as it would significantly impact a large portion of our user base and a number of business-critical applications.
While AT&T was supposedly “testing” the circuit, my team and I went about re-routing traffic through a secondary Data Center that was still connected to the AT&T AVPN/MPLS network. From the secondary Data Center, traffic could flow over a dedicated WAN link to our primary Data Center. Working through the BGP and EIGRP route maps and policies took about 2 hours to get the majority of traffic re-routed and working again, although the detour added about 140 ms of round-trip time to every IP packet, since traffic now had to traverse the US West Coast instead of staying on the US East Coast. We have firewalls throughout the WAN layer, so asymmetric routing causes all sorts of problems, and since we also have some DMVPN sprinkled in there, re-routing traffic successfully required some care and planning.
At the 3-hour mark AT&T declared that the circuit was good and that no issues had been found on it. The technician explained that we should “verify power”. At the 7-hour mark AT&T told us that their last-mile provider, Verizon, had de-provisioned the 1 Gbps transport, and that this was the cause of our outage.
Thankfully, Verizon was able to re-provision the circuit within 20 minutes, although it would take AT&T and Verizon another 9 hours before they could commit that the circuit wouldn’t be “automatically” de-provisioned again the following night.
I truly miss the days of unmanaged dark fiber, when all I needed to worry about were fiber breaks and my own gear… while we had a number of fiber breaks, they were fairly infrequent, and in the majority of cases they were remedied within 2 hours. I can’t even get a call back from AT&T in under 2 hours, forget about a resolution in under 2 hours.
What stories do you have to share about telecom carriers?
Cheers!