Michael McNamara

How to troubleshoot Facebook, Instagram, WhatsApp outages?

October 4, 2021 by Michael McNamara

Things certainly went south for Facebook today in spectacular fashion as Reddit and other forums lit up with posts about Facebook, Instagram and WhatsApp being down and unreachable. Someone asked me a simple question: how do you troubleshoot an outage like that? We’re obviously limited as “outsiders”, but even as regular netizens we can do a bit of investigative troubleshooting to get some idea of what’s going on at Facebook.

If you tried to visit Facebook earlier today you would have likely seen this message in your web browser.

This site can’t be reached
www.facebook.com’s server IP address could not be found.

Let’s start with the basics…. DNS resolution.

[root@woodstock ~]# dig facebook.com +short
[root@woodstock ~]#

That’s not good… we can’t get an IP address for facebook.com, let’s try www.facebook.com as well.

[root@woodstock ~]# dig www.facebook.com +short
[root@woodstock ~]#

Ok, equally bad… let’s try to find the authoritative DNS servers for the domain facebook.com. We know from experience that a.gtld-servers.net. is a top level DNS server for the .com TLD, but let’s confirm it’s still in the list of servers. (I’ll edit the output below to help save space and focus our attention)

[root@woodstock ~]# dig ns com

;; ANSWER SECTION:
com. 170780 IN NS b.gtld-servers.net.
com. 170780 IN NS i.gtld-servers.net.
com. 170780 IN NS m.gtld-servers.net.
com. 170780 IN NS j.gtld-servers.net.
com. 170780 IN NS l.gtld-servers.net.
com. 170780 IN NS e.gtld-servers.net.
com. 170780 IN NS k.gtld-servers.net.
com. 170780 IN NS h.gtld-servers.net.
com. 170780 IN NS g.gtld-servers.net.
com. 170780 IN NS d.gtld-servers.net.
com. 170780 IN NS c.gtld-servers.net.
com. 170780 IN NS a.gtld-servers.net.
com. 170780 IN NS f.gtld-servers.net.

;; ADDITIONAL SECTION:
a.gtld-servers.net. 69518 IN A 192.5.6.30
b.gtld-servers.net. 82780 IN A 192.33.14.30
c.gtld-servers.net. 84678 IN A 192.26.92.30
d.gtld-servers.net. 84679 IN A 192.31.80.30
e.gtld-servers.net. 84678 IN A 192.12.94.30
f.gtld-servers.net. 84138 IN A 192.35.51.30
g.gtld-servers.net. 84679 IN A 192.42.93.30
h.gtld-servers.net. 84678 IN A 192.54.112.30
i.gtld-servers.net. 84679 IN A 192.43.172.30
j.gtld-servers.net. 82780 IN A 192.48.79.30
k.gtld-servers.net. 84679 IN A 192.52.178.30
l.gtld-servers.net. 84138 IN A 192.41.162.30
m.gtld-servers.net. 84679 IN A 192.55.83.30
a.gtld-servers.net. 81113 IN AAAA 2001:503:a83e::2:30

Ok, so a.gtld-servers.net is still in there… so let’s ask that DNS server who are the DNS servers for the domain facebook.com.

[root@woodstock ~]# dig @a.gtld-servers.net. ns facebook.com

;; QUESTION SECTION:
;facebook.com. IN NS

;; AUTHORITY SECTION:
facebook.com. 172800 IN NS a.ns.facebook.com.
facebook.com. 172800 IN NS b.ns.facebook.com.
facebook.com. 172800 IN NS c.ns.facebook.com.
facebook.com. 172800 IN NS d.ns.facebook.com.

;; ADDITIONAL SECTION:
a.ns.facebook.com. 172800 IN A 129.134.30.12
a.ns.facebook.com. 172800 IN AAAA 2a03:2880:f0fc:c:face:b00c:0:35
b.ns.facebook.com. 172800 IN A 129.134.31.12
b.ns.facebook.com. 172800 IN AAAA 2a03:2880:f0fd:c:face:b00c:0:35
c.ns.facebook.com. 172800 IN A 185.89.218.12
c.ns.facebook.com. 172800 IN AAAA 2a03:2880:f1fc:c:face:b00c:0:35
d.ns.facebook.com. 172800 IN A 185.89.219.12
d.ns.facebook.com. 172800 IN AAAA 2a03:2880:f1fd:c:face:b00c:0:35

Those are the DNS servers for the domain facebook.com, so let’s see if we can communicate with any of them.

Let’s start by pinging the servers (for brevity I’m only going to go through the first server above… but they all were having issues today)

[root@woodstock ~]# ping a.ns.facebook.com -c 5 -q
PING a.ns.facebook.com (129.134.30.12) 56(84) bytes of data.

--- a.ns.facebook.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 3999ms

That’s not completely unexpected, as many networks filter or rate-limit ICMP traffic to blunt DoS attacks, so let’s try a simple DNS query to that server.

[root@woodstock ~]# dig @a.ns.facebook.com ns facebook.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.5 <<>> @a.ns.facebook.com ns facebook.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

That’s definitely not good, so we can assume at this point that we’re unable to communicate with the DNS servers for the facebook.com domain name, hence the error message we’re getting in the web browser. But let’s dig a little deeper to see if the IP networks associated with those DNS servers are “online” and reachable. We can do that by checking a BGP looking glass or a full BGP routing table to see if the prefix is being advertised. We can also try a traceroute to the IP address in question to see if we can reach the Facebook network.
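For anyone who wants to script the same check that dig performed above, here is a minimal sketch in Python using only the standard library. The function names (build_query, server_answers), the fixed query ID and the 3 second timeout are my own choices for illustration, not anything dig itself uses. It sends a single NS query over UDP and reports whether any response comes back.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID chosen for this sketch


def build_query(name: str, qtype: int = 2, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query packet (qtype 2 = NS, 1 = A)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE followed by QCLASS (1 = IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def server_answers(server_ip: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if server_ip sends back a DNS response within timeout."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable errors
            return False
    if len(reply) < 4:
        return False
    # A genuine response echoes our query ID and has the QR bit set
    reply_id, flags = struct.unpack("!HH", reply[:4])
    return reply_id == QUERY_ID and bool(flags & 0x8000)
```

Calling server_answers("129.134.30.12", "facebook.com") would exercise the same path as the dig command above. During the outage I would expect it to return False after the timeout, but I have not run this sketch against the live servers.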

Let’s use WHOIS to see what network that IP address is a member of (again I’ve cut out some of the output below).

[root@woodstock ~]# whois 129.134.30.12
[Querying whois.arin.net]
[whois.arin.net]

NetRange: 129.134.0.0 - 129.134.255.255
CIDR: 129.134.0.0/16
NetName: THEFA-3
NetHandle: NET-129-134-0-0-1
Parent: NET129 (NET-129-0-0-0-0)
NetType: Direct Assignment
OriginAS:
Organization: Facebook, Inc. (THEFA-3)
RegDate: 2015-05-13
Updated: 2015-05-13
Ref: https://rdap.arin.net/registry/ip/129.134.0.0

Ok, so the original netblock assigned to Facebook by ARIN was 129.134.0.0/16, but Facebook could have subnetted that, so we need to be mindful that the advertised prefix could be smaller than the /16 we see allocated above.

There was a mention in some of the forums that all BGP peers to Facebook were down, so let’s check there. Let’s look at Hurricane Electric’s Network Looking Glass using the IP address 129.134.30.12. That shows us the following (as of 5:00PM EDT Monday October 4, 2021).

core1.mnz1.he.net> show ip bgp routes detail 129.134.30.12
Number of BGP Routes matching display condition : 2
S:SUPPRESSED F:FILTERED s:STALE x:BEST-EXTERNAL
1 Prefix: 129.134.0.0/17, Rx path-id:0x00000000, Tx path-id:0x00000001, rank:0x00000001, Status: BI, Age: 28d7h21m27s
NEXT_HOP: 65.49.109.182, Metric: 1486, Learned from Peer: 216.218.252.172 (6939)
LOCAL_PREF: 100, MED: 0, ORIGIN: igp, Weight: 0, GROUP_BEST: 1
AS_PATH: 3491 32934
COMMUNITIES: 6939:1111 6939:7039 6939:8392 6939:9003
2 Prefix: 129.134.0.0/17, Rx path-id:0x00000000, Tx path-id:0x00040001, rank:0x00000002, Status: Ex, Age: 86d22h8m40s
NEXT_HOP: 62.115.42.144, Metric: 0, Learned from Peer: 62.115.42.144 (1299)
LOCAL_PREF: 70, MED: 48, ORIGIN: igp, Weight: 0, GROUP_BEST: 1
AS_PATH: 1299 32934
COMMUNITIES: 6939:2000 6939:7297 6939:8840 6939:9001
Last update to IP routing table: 2d3h2m25s

Entry cached for another 60 seconds.

So it would appear that the routes are in the Internet BGP tables for that first server… I’m going to guess that Facebook is in recovery mode and slowly restoring their network – assuming it’s not a DoS attack or something similar.

Let’s try a traceroute using ICMP packets. Again, we need to be mindful that some organizations block all ICMP traffic to protect themselves against miscreants and to better conceal their network topology.

[root@woodstock ~]# traceroute -I 129.134.30.12
traceroute to 129.134.30.12 (129.134.30.12), 30 hops max, 60 byte packets
1 107.170.19.254 (107.170.19.254) 4.061 ms 4.040 ms 4.037 ms
2 138.197.248.154 (138.197.248.154) 1.545 ms 1.558 ms 1.558 ms
3 157.240.71.232 (157.240.71.232) 41.384 ms 41.345 ms 41.380 ms
4 157.240.42.70 (157.240.42.70) 1.893 ms 1.911 ms 1.913 ms
5 157.240.40.230 (157.240.40.230) 3.552 ms 3.529 ms 3.538 ms
6 129.134.47.188 (129.134.47.188) 8.797 ms 7.276 ms 7.229 ms
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *

Ok, so we’re definitely reaching parts of the Facebook network, as 129.134.47.188 is on the same advertised network as a.ns.facebook.com (129.134.30.12).
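The prefix math is easy to double check with Python’s standard ipaddress module. This is just a quick sketch using the /16 from the WHOIS output, the /17 from the looking glass output and the two addresses from the traceroute; the variable names are mine.

```python
import ipaddress

registered = ipaddress.ip_network("129.134.0.0/16")   # from WHOIS
advertised = ipaddress.ip_network("129.134.0.0/17")   # from the looking glass
name_server = ipaddress.ip_address("129.134.30.12")   # a.ns.facebook.com
last_hop = ipaddress.ip_address("129.134.47.188")     # hop 6 in the traceroute

# The advertised /17 is a subnet of the registered /16
print(advertised.subnet_of(registered))   # True

# Both addresses fall inside the advertised /17
print(name_server in advertised)          # True
print(last_hop in advertised)             # True

# First and last address covered by the /17
print(advertised[0], "-", advertised[-1])  # 129.134.0.0 - 129.134.127.255
```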

Unfortunately that’s about as far as we can take it from here. We’ll need to wait for the news from Facebook itself.

Cheers!

CenturyLink/Level 3 Internet meltdown followed by Reddit moderator madness

August 30, 2020 by Michael McNamara

It was another exciting morning around the Internet. It seems that CenturyLink (Level 3) had a meltdown that caused all sorts of issues for roughly 5 hours this morning, starting around 6:04AM EDT and lasting until around 11:12AM EDT.

It started as it always does with reports of DNS issues, then CDN issues (Cloudflare) and eventually CenturyLink was identified as the culprit, or to be more precise any packets traversing the CenturyLink (Level3) network.

Thankfully Reddit was a great community resource and reports quickly started rolling in on these two threads:

Global AS3356 (Level3) Outages
by u/pigtrotsky in r/networking

National CenturyLink outage causing problems everywhere (US)
by u/inphosys in r/sysadmin

For reasons that still aren’t 100% clear the moderators for r/networking decided to delete the first thread. So the refugees from r/networking went to r/sysadmin to escape the persecution only to have the moderators of r/networking admit their mistake sometime later and un-delete the post.

I’ll admit I was floored when I found the original thread was deleted. There were hundreds of us struggling to source what was actually going on and trying to understand how we could mitigate the impact to our employers and some moderator deletes the thread?!? @$%#

The refugees eventually made their feelings known in a thread titled, META: I guess major news-worthy outages are off topic here?

Cheers!

Story – Packet Loss and Failing 10Gbps SFP+ Optic

July 6, 2019 by Michael McNamara

Here’s an old story that I never published.. and seeing that I haven’t been writing much lately I’m going to take the easy route and just publish this.

It’s been another interesting weekend… and by interesting I actually mean another weekend of working through yet another challenging issue.

Summary

It started back on Thursday with more than a few alerts from my own custom built monitoring solution. A few years back I wrote a Bash script to help monitor the Internet facing infrastructure and numerous VIPs that we host in our Data Centers. That script has worked well over the years helping validate application availability against network availability. With everything else going on I purposely ignored the alerts, assuming there was some DoS attack or other malady that the Internet was suffering from and it would soon fix itself.

By late Friday afternoon I could no longer ignore the alerts as they were piling up in my Inbox by the hundreds and it was long past time to roll up the sleeves and figure out what had broken where. I initially assumed that I would find some issue or problem with either the hosting company or an Internet Service Provider. A cursory review of the Internet border routers revealed that a few 10Gbps Internet links had bounced within the past 30 days, but everything was running clean from the Internet Service Provider through our border routers, switches and firewalls up to our Internet facing load balancers. Initially I thought there was an issue with either AT&T or NTT, as a number of the monitoring servers were traversing those ISPs, but after a number of tests I found that packet loss across either of those ISPs was generally less than 0.4%, which isn’t all that bad.

If the plumbing was looking good then why were the alerts firing? I looked at the alert again and noticed that the messages read “socket timeout” and not “socket connection failure”.

In any event I ran a quick packet trace using tcpdump from one of the monitoring servers and found that there was traffic flowing, although there was a significant amount of retransmissions and missing packets. It looked like the health checks were timing out at the default of 10 seconds. I increased the timeout to 20 seconds and bingo, the majority of health checks were now returning successfully. I’m not sure I agree with the verbiage of “socket timeout”, since the socket was exchanging information between the client and server. It was more of an overall application timeout, since the request was not completed within the specified timeout value.

Data Analysis

Now the $1,000,000 question, what had changed that I needed to increase the timeout?

Thankfully I’ve been logging this data for the past 3+ years, so I was able to import a few of the data points since Sept 2017 (207K rows – 1 every 60 seconds) into Excel, and using the quick chart shortcut (Alt-F1) I was able to quickly visualize the data, which provided some interesting results. The amount of time it was taking the health checks to complete had risen significantly in the past few weeks.

With that data it was now clear that the health checks were failing because they were hitting the 10 second default timeout. But what had happened that it was now taking on average longer than 10 seconds for the backend to return the result to the client? Was the backend slower to respond than it had previously been? Was the Internet slower than it had previously been? Was there enough packet loss and retransmissions to impact the timing? Was the size of the data being returned changing?

In short the answer appears to be a little bit of everything above.

Was the backend slower to respond than it had previously been? Yes
Was the Internet slower than it had previously been? Yes (I always assume the Internet is getting more and more congested)
Was there enough packet loss and retransmissions to impact the timing? Yes (especially with 3K+ miles between the endpoints)
Was the size of the data being returned changing? Yes (the size of the HTML was increased causing more data to be transferred)

An interesting but logical side effect: the monitoring servers that were the farthest from the Data Center in question had a greater number of errors. This is logical because they would have the greater latency to reach that specific Data Center, and any packet loss or retransmissions would cause additional delay given the latency. This explains why some monitoring servers were reporting no issues or problems and others were reporting all sorts of issues and problems. The increased physical distance between the Data Center and the monitoring server was exacerbating the timing because of the inherent packet loss and retransmissions on the Internet, which was further exacerbated by the growing size of the HTML that was being transferred across those vast distances and the increased time it was taking the backend to ultimately serve up the response.

This is a great example of why you can’t always just blame the network, even though it’s the easiest thing to do.

Resolution

In the end I found a failing 10Gbps SFP in the Internet facing load-balancers that needed to be replaced. I placed a monitoring probe on the local network and found the same amount of packet loss and re-transmissions which confirmed that the problem was local to my Data Center. I failed over between the primary and secondary Internet facing load-balancers and the problem disappeared so the issue was with the primary Internet facing load-balancer.

Cheers!

Verizon FiOS Gigabit Internet

March 2, 2018 by Michael McNamara

Like many folks before me I decided to cut the cord on the traditional cable box and turned to YouTube TV and Roku to satisfy my family’s limited TV appetite, canceling my Verizon FiOS TV service and upgrading my Verizon FiOS Internet speed from 30Mbps to 1Gbps (alright 940Mbps or whatever it is). This past week Verizon replaced the ONT and router and I’ve officially moved into the Internet fast lane…

I had noticed some occasional buffering using my old 30/20 Verizon FiOS Internet connection – having three teenage daughters in the household doesn’t really help. Thankfully since the upgrade we’ve been issue free and buffering free. Now I might need to spend some time optimizing the two Aruba IAP 205s that I have providing wireless connectivity throughout the household as more and more devices are connecting via wireless.

Cheers!

Verizon FiOS Internet – Juniper Private VLANs

September 19, 2017 by Michael McNamara

I recently stumbled over an interesting problem with Verizon’s FiOS Internet service while doing some consulting. In an effort to protect the innocent and prevent any ass hattery, I’ve changed the IP addressing to use something from RFC 5737.

A client had two physical sites about 1 mile apart which were connected to the Internet by separate Verizon FiOS broadband connections and which were assigned the following static IP addresses;

Site A:

IP Network: 198.51.100.226/28
Subnet Mask: 255.255.255.0
Default Gateway: 198.51.100.1
Usable IP Addresses: 198.51.100.226 – 198.51.100.238

Site B:

IP Network: 198.51.100.50/28
Subnet Mask: 255.255.255.0
Default Gateway: 198.51.100.1
Usable IP Addresses: 198.51.100.50 – 198.51.100.63

Let me be the first to admit that the information above isn’t quite right… there is no IP address block 198.51.100.226/28, it should be 198.51.100.224/28. I believe that’s Verizon trying to avoid having customers accidentally use the network address or the first address in the IP address block which is likely reserved for the actual Verizon Actiontec router.
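A quick way to confirm the real network boundaries is Python’s standard ipaddress module. This is only a sketch using the example addresses above; the variable names are mine. Feeding it the address with the /28 prefix shows which block each site actually lives in.

```python
import ipaddress

# ip_interface accepts a host address plus prefix and derives the network
site_a = ipaddress.ip_interface("198.51.100.226/28").network
site_b = ipaddress.ip_interface("198.51.100.50/28").network

for name, net in (("Site A", site_a), ("Site B", site_b)):
    hosts = list(net.hosts())  # excludes network and broadcast addresses
    print(name, net, "hosts", hosts[0], "-", hosts[-1],
          "broadcast", net.broadcast_address)
```

For Site A this reports the network as 198.51.100.224/28 with usable hosts 198.51.100.225 through 198.51.100.238, which lines up with the first host address being held back for the router.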

The client was trying to establish a VPN tunnel between the two sites and was running into difficulties. The issue was with the IP addressing provided by Verizon and its likely implementation of private VLANs on the Juniper hardware. I’m assuming that Verizon is using PVLANs to isolate traffic between individual customers and to minimize the number of IP subnets they need to create. Instead of creating 16 /28 IP networks they are using a single /24 network and then isolating the traffic between customers using PVLANs. The issue in the example above is pretty obvious: the individual client devices attempt to communicate with each other directly on the local subnet, believing there’s no need to send the traffic to the upstream router because the netmask indicates that the remote site is in the same IP network. While the remote site actually is in the same IP network, the implementation of PVLANs is blocking communication between the client devices.
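The forwarding decision each client device makes can be sketched in a few lines of Python. This is a simplified model of what a host’s IP stack does, not Verizon’s or Juniper’s actual behavior, and the function name next_hop is hypothetical. The 203.0.113.10 destination is just another RFC 5737 documentation address added for contrast.

```python
import ipaddress


def next_hop(local_ip: str, netmask: str, gateway: str, destination: str) -> str:
    """Model a host's choice: deliver directly on-link or send to the gateway."""
    iface = ipaddress.ip_interface(f"{local_ip}/{netmask}")
    dest = ipaddress.ip_address(destination)
    if dest in iface.network:
        # Same subnet per the netmask: the host ARPs for the destination itself
        return f"on-link (ARP for {destination})"
    # Different subnet: the host hands the packet to its default gateway
    return f"via gateway {gateway}"


# Site A trying to reach Site B with the /24 mask that Verizon supplied
print(next_hop("198.51.100.226", "255.255.255.0", "198.51.100.1", "198.51.100.50"))

# Site A trying to reach an address outside the /24
print(next_hop("198.51.100.226", "255.255.255.0", "198.51.100.1", "203.0.113.10"))
```

The first call comes back as on-link, so Site A never involves the router and its ARP request for Site B goes out onto a segment where, if PVLANs are in play, Site B will never hear it.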

Anyone have any experience with Verizon FiOS using PVLANs?

I believe I heard years ago that Verizon chose Juniper for their FiOS implementation.

Cheers!

Reference: Juniper – Understanding Private VLANs on EX Series Switches
