Michael McNamara - https://blog.michaelfmcnamara.com - technology, networking, virtualization and IP telephony

How to troubleshoot Facebook, Instagram, WhatsApp outages? https://blog.michaelfmcnamara.com/2021/10/how-to-troubleshoot-faceook-instagram-whatsapp-outages/ Mon, 04 Oct 2021 20:52:27 +0000

Things certainly went south for Facebook today in a spectacular way as Reddit and other forums lit up with posts about Facebook, Instagram and WhatsApp being down and unreachable. Someone asked me a simple question: how do you troubleshoot an outage like that? We’re obviously limited as “outsiders,” but even as regular netizens we can do a bit of investigative troubleshooting to get some idea of what’s going on at Facebook.

If you tried to visit Facebook earlier today, you would likely have seen this message in your web browser.

This site can’t be reached
www.facebook.com’s server IP address could not be found.

Let’s start with the basics… DNS resolution.

[root@woodstock ~]# dig facebook.com +short
[root@woodstock ~]#

That’s not good… we can’t get an IP address for facebook.com, so let’s try www.facebook.com as well.

[root@woodstock ~]# dig www.facebook.com +short
[root@woodstock ~]#

Ok, equally bad… let’s try to find the authoritative DNS servers for the domain facebook.com. We know from experience that a.gtld-servers.net. is one of the authoritative name servers for the .com TLD, but let’s confirm it’s still in the list. (I’ve trimmed the output below to save space and focus our attention.)

[root@woodstock ~]# dig ns com

;; ANSWER SECTION:
com. 170780 IN NS b.gtld-servers.net.
com. 170780 IN NS i.gtld-servers.net.
com. 170780 IN NS m.gtld-servers.net.
com. 170780 IN NS j.gtld-servers.net.
com. 170780 IN NS l.gtld-servers.net.
com. 170780 IN NS e.gtld-servers.net.
com. 170780 IN NS k.gtld-servers.net.
com. 170780 IN NS h.gtld-servers.net.
com. 170780 IN NS g.gtld-servers.net.
com. 170780 IN NS d.gtld-servers.net.
com. 170780 IN NS c.gtld-servers.net.
com. 170780 IN NS a.gtld-servers.net.
com. 170780 IN NS f.gtld-servers.net.

;; ADDITIONAL SECTION:
a.gtld-servers.net. 69518 IN A 192.5.6.30
b.gtld-servers.net. 82780 IN A 192.33.14.30
c.gtld-servers.net. 84678 IN A 192.26.92.30
d.gtld-servers.net. 84679 IN A 192.31.80.30
e.gtld-servers.net. 84678 IN A 192.12.94.30
f.gtld-servers.net. 84138 IN A 192.35.51.30
g.gtld-servers.net. 84679 IN A 192.42.93.30
h.gtld-servers.net. 84678 IN A 192.54.112.30
i.gtld-servers.net. 84679 IN A 192.43.172.30
j.gtld-servers.net. 82780 IN A 192.48.79.30
k.gtld-servers.net. 84679 IN A 192.52.178.30
l.gtld-servers.net. 84138 IN A 192.41.162.30
m.gtld-servers.net. 84679 IN A 192.55.83.30
a.gtld-servers.net. 81113 IN AAAA 2001:503:a83e::2:30

Ok, so a.gtld-servers.net is still in there… let’s ask that server which DNS servers are authoritative for the domain facebook.com.

[root@woodstock ~]# dig @a.gtld-servers.net. ns facebook.com

;; QUESTION SECTION:
;facebook.com. IN NS

;; AUTHORITY SECTION:
facebook.com. 172800 IN NS a.ns.facebook.com.
facebook.com. 172800 IN NS b.ns.facebook.com.
facebook.com. 172800 IN NS c.ns.facebook.com.
facebook.com. 172800 IN NS d.ns.facebook.com.

;; ADDITIONAL SECTION:
a.ns.facebook.com. 172800 IN A 129.134.30.12
a.ns.facebook.com. 172800 IN AAAA 2a03:2880:f0fc:c:face:b00c:0:35
b.ns.facebook.com. 172800 IN A 129.134.31.12
b.ns.facebook.com. 172800 IN AAAA 2a03:2880:f0fd:c:face:b00c:0:35
c.ns.facebook.com. 172800 IN A 185.89.218.12
c.ns.facebook.com. 172800 IN AAAA 2a03:2880:f1fc:c:face:b00c:0:35
d.ns.facebook.com. 172800 IN A 185.89.219.12
d.ns.facebook.com. 172800 IN AAAA 2a03:2880:f1fd:c:face:b00c:0:35

Those are the DNS servers for the domain facebook.com, so let’s see if we can communicate with any of them.

Let’s start by pinging the servers (for brevity I’m only going to go through the first server above… but they were all having issues today).

[root@woodstock ~]# ping a.ns.facebook.com -c 5 -q
PING a.ns.facebook.com (129.134.30.12) 56(84) bytes of data.

--- a.ns.facebook.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 3999ms

That’s not completely unexpected, as many networks block ICMP traffic by default to blunt DoS attacks, so let’s try a simple DNS query against that server.

[root@woodstock ~]# dig @a.ns.facebook.com ns facebook.com

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.5 <<>> @a.ns.facebook.com ns facebook.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

That’s definitely not good, so we can assume at this point that we’re unable to communicate with the DNS servers for the facebook.com domain, hence the error message we’re getting in the web browser. But let’s dig a little deeper to see if the IP networks associated with those DNS servers are “online” and reachable. We can do that by checking a BGP looking glass or a full BGP routing table to see if the prefix is still being advertised, and we can also traceroute to the IP address in question to see whether we can reach the Facebook network.

Let’s use WHOIS to see what network that IP address is a member of (again I’ve cut out some of the output below).

[root@woodstock ~]# whois 129.134.30.12
[Querying whois.arin.net]
[whois.arin.net]

NetRange: 129.134.0.0 - 129.134.255.255
CIDR: 129.134.0.0/16
NetName: THEFA-3
NetHandle: NET-129-134-0-0-1
Parent: NET129 (NET-129-0-0-0-0)
NetType: Direct Assignment
OriginAS:
Organization: Facebook, Inc. (THEFA-3)
RegDate: 2015-05-13
Updated: 2015-05-13
Ref: https://rdap.arin.net/registry/ip/129.134.0.0

Ok, so the netblock assigned to Facebook by ARIN is 129.134.0.0/16, but Facebook could have subnetted that, so we need to be mindful that the route actually advertised could be more specific than the /16 allocation we see above.
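
As a quick sanity check on what’s actually being announced (as opposed to what’s registered), Team Cymru runs a public IP-to-ASN mapping service that can be queried over DNS or WHOIS. The exact output format may vary, and note that the octets are reversed in the DNS form, but the queries look something like this:

[root@woodstock ~]# dig +short 12.30.134.129.origin.asn.cymru.com TXT

[root@woodstock ~]# whois -h whois.cymru.com " -v 129.134.30.12"

Either should report the covering prefix and the origin AS, which saves squinting at a full BGP table.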

There was a mention in some of the forums that all BGP peers to Facebook were down, so let’s check there. Let’s look at Hurricane Electric’s network looking glass using the IP address 129.134.30.12. That shows us the following (as of 5:00 PM EDT, Monday, October 4, 2021).

core1.mnz1.he.net> show ip bgp routes detail 129.134.30.12
Number of BGP Routes matching display condition : 2
S:SUPPRESSED F:FILTERED s:STALE x:BEST-EXTERNAL
1 Prefix: 129.134.0.0/17, Rx path-id:0x00000000, Tx path-id:0x00000001, rank:0x00000001, Status: BI, Age: 28d7h21m27s
NEXT_HOP: 65.49.109.182, Metric: 1486, Learned from Peer: 216.218.252.172 (6939)
LOCAL_PREF: 100, MED: 0, ORIGIN: igp, Weight: 0, GROUP_BEST: 1
AS_PATH: 3491 32934
COMMUNITIES: 6939:1111 6939:7039 6939:8392 6939:9003
2 Prefix: 129.134.0.0/17, Rx path-id:0x00000000, Tx path-id:0x00040001, rank:0x00000002, Status: Ex, Age: 86d22h8m40s
NEXT_HOP: 62.115.42.144, Metric: 0, Learned from Peer: 62.115.42.144 (1299)
LOCAL_PREF: 70, MED: 48, ORIGIN: igp, Weight: 0, GROUP_BEST: 1
AS_PATH: 1299 32934
COMMUNITIES: 6939:2000 6939:7297 6939:8840 6939:9001
Last update to IP routing table: 2d3h2m25s

Entry cached for another 60 seconds.

So it would appear that the routes are in the Internet BGP tables for that first server… I’m going to guess that Facebook is in recovery mode and slowly restoring their network – assuming it’s not a DoS attack or something similar.

Let’s try a traceroute using ICMP packets; again we need to be mindful that some organizations block all ICMP traffic to protect themselves against miscreants and to better conceal their network topology.

[root@woodstock~]# traceroute -I 129.134.30.12
traceroute to 129.134.30.12 (129.134.30.12), 30 hops max, 60 byte packets
1 107.170.19.254 (107.170.19.254) 4.061 ms 4.040 ms 4.037 ms
2 138.197.248.154 (138.197.248.154) 1.545 ms 1.558 ms 1.558 ms
3 157.240.71.232 (157.240.71.232) 41.384 ms 41.345 ms 41.380 ms
4 157.240.42.70 (157.240.42.70) 1.893 ms 1.911 ms 1.913 ms
5 157.240.40.230 (157.240.40.230) 3.552 ms 3.529 ms 3.538 ms
6 129.134.47.188 (129.134.47.188) 8.797 ms 7.276 ms 7.229 ms
7 * * *
8 * * *
9 * * *
10 * * *
11 * * *
12 * * *

Ok, so we’re definitely reaching parts of the Facebook network, as 129.134.47.188 falls within the same advertised prefix (129.134.0.0/17) as a.ns.facebook.com (129.134.30.12).
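
To double-check that, Python’s ipaddress module (assuming python3 is available on the box) confirms that both addresses fall inside the 129.134.0.0/17 prefix we saw in the looking glass:

[root@woodstock ~]# python3 -c '
import ipaddress
net = ipaddress.ip_network("129.134.0.0/17")
for ip in ("129.134.30.12", "129.134.47.188"):
    print(ip, ipaddress.ip_address(ip) in net)
'
129.134.30.12 True
129.134.47.188 True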

Unfortunately, that’s about as far as we can take it from here; we’ll need to wait for news from Facebook itself.

Cheers!

Troubleshooting Opengear via SSH https://blog.michaelfmcnamara.com/2019/05/troubleshooting-opengear-via-ssh/ Sun, 19 May 2019 13:17:05 +0000

We had an odd problem over the weekend… a recently installed Opengear ACM7004 started intermittently dropping off the internal network. Interestingly enough, this Opengear was also connected to the public Internet and was having no issues on that NIC, so we had to do some basic troubleshooting over a reverse SSH tunnel – no Web user interface.
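
For anyone unfamiliar with the approach, a reverse tunnel like that is just OpenSSH remote port forwarding; a minimal sketch, with hypothetical hostnames and accounts, looks like this:

# On the Opengear: publish its local SSH service on port 2222 of a jump host we control
$ ssh -N -R 2222:localhost:22 support@jumphost.example.com

# From the jump host: reach the Opengear's CLI back through the tunnel
$ ssh -p 2222 root@localhost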

I wanted to do some basic troubleshooting:

  1. Is there LINK?
  2. What speed and duplex are we auto-negotiating?
  3. Any errors on the switch side or host side?

There are a few different tools in Linux to help troubleshoot basic network connectivity issues; ifconfig, netstat, ethtool, and ip are at the top of the pile.

$ ethtool eth0
Settings for eth0:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 10Mb/s
	Duplex: Half
	Port: MII
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
	Link detected: no

In the output above eth0 is down; in the output below eth0 is up. The NIC appeared to be bouncing up and down intermittently for no apparent reason.

$ ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 1500 qdisc mq state UP qlen 532
    link/ether 00:13:c6:aa:bb:cc brd ff:ff:ff:ff:ff:ff
3: eth1:  mtu 1500 qdisc mq state UP qlen 532
    link/ether 00:13:c6:aa:bb:cc brd ff:ff:ff:ff:ff:ff
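
To answer the third question above (errors on the host side) and to catch the flapping as it happens, a couple of other commands are worth running, assuming the Opengear’s Linux build includes full iproute2 and the NIC driver exposes its counters:

# Per-interface statistics, including RX/TX errors and drops
$ ip -s link show eth0

# Driver/NIC-level counters (CRC errors, alignment errors, etc.) where supported
$ ethtool -S eth0

# Watch link state changes scroll by in real time
$ ip monitor link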

Ok, so I was specifically interested in eth0, and at the time it was reachable (i.e. working), so I had a look at the Juniper EX4300 switch and found just a few issues:

 show interfaces ge-1/0/4 extensive | match error 
Link-level type: Ethernet, MTU: 1514, MRU: 0, Speed: Auto, Duplex: Auto, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled,
Input errors:
Errors: 10, Drops: 0, Framing errors: 10, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 1399, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
CRC/Align errors 10 0
FIFO errors 0 0

There were 1399 carrier transitions (the port had bounced up and down 1399 times), which immediately told me there was definitely a problem somewhere. The CRC/Align errors could be a result of the port bouncing so much. I was able to quickly correlate the logs from the Juniper switch with the monitoring system: the monitoring system was losing connectivity to the Opengear when the switch port was going down – which is obviously expected. So this was essentially a physical Layer 1 problem – perhaps a cabling issue?
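
On the Junos side, the flap history can also be pulled straight from the system log, which makes lining up timestamps with the monitoring system easy; something along these lines (filtering on the link trap messages):

show log messages | match ge-1/0/4 | match SNMP_TRAP_LINK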

1000BaseT requires all 8 wires (four pairs) to make a connection, while 100BaseT only requires 4 wires (two pairs), so I changed the Juniper switch port to auto-negotiate at 10Mbps or 100Mbps, not 1Gbps, and the port immediately connected.

set interfaces ge-1/0/4 speed auto-10m-100m

I’m going to guess that we have a bad patch cable between the Juniper EX4300 and the Opengear ACM7004, but for now we can run the Opengear at 100Mbps without an issue.

Cheers!

Response: Is It Really Always The Network? #ITNF https://blog.michaelfmcnamara.com/2017/01/response-is-it-really-always-the-network-itnf/ Wed, 25 Jan 2017 03:37:41 +0000

I recently read Tom Hollingsworth’s post titled “Is It Really Always The Network?” and immediately thought, I need to find time to post a reply. I’ve been in this industry for more than 20 years now, and it has been a constant struggle to educate and train those around me to perform their due diligence before issuing the knee-jerk response that “it must be the network.” If you don’t understand the problem then ask for help; don’t pretend to understand the problem and then assign blame when you have no idea what you are talking about. I suspect people are getting worse, not better, and I almost wonder if people are just getting lazier and want someone else to fix their problems. Last week I had two examples of people throwing sh*t over the wall without even performing the most basic troubleshooting steps, and as you can already guess, I was pissed.

In the first example, a SOAP/XML interface to a third party wasn’t accepting transactions. Well, that’s proof positive that it’s a network problem, right? A simple “telnet” test to the IP address and port from the origin server was successful. Yet the response from the team reporting the problem? “If it’s not the network we don’t know what to do next,” which left me speechless. After a 15-minute conference call, I had the third party restart their back-end service and magically transactions started flowing again.
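
For reference, the kind of check I’m talking about takes all of ten seconds; with a placeholder hostname and port, either of these will tell you whether a TCP connection to the far end actually completes (the -z option depends on which netcat flavor is installed):

$ telnet partner.example.com 8443

$ nc -vz partner.example.com 8443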

In another example, I was notified that a SOAP/XML interface that we host on the public Internet was inaccessible – must be a network issue. I verified that I was unable to connect to that specific host from the internal network on the TCP port specified and advised that they should check the host in question, only to be rebuked by the application analyst telling me, “Nginx is up and running.” A co-worker remotely connected to the server and found Nginx prompting for the passphrase protecting the private key of the SSL certificate it was trying to load. The team in question had never even checked the server.
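
A couple of one-minute checks on the server itself would have told the whole story; on a typical systemd-based box running Nginx (the port and unit name below are the usual defaults, adjust to suit):

# Is anything actually listening on the HTTPS port?
$ ss -tlnp | grep ':443'

# Is the service running, and what do its recent logs say?
$ systemctl status nginx
$ journalctl -u nginx --since "1 hour ago"

# Does a local TLS handshake and HTTP response come back?
$ curl -vk https://127.0.0.1/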

It’s one thing to say “I don’t know what’s going on here, can you help me?” It’s a completely different thing to say “it’s a network problem, you need to fix it,” especially when it becomes abundantly clear that the team or person making this statement has done zero troubleshooting or due diligence and doesn’t even understand the details of the problem.

Occasionally we do run into a genuine network problem; yes, they occur. Unfortunately, I’ve heard so many cry-wolf stories that I almost never trust what I hear until I’ve verified all the details myself.

How about this: I’ll take your credit card (Visa, Mastercard, Discover, no Amex), and if it’s a network problem I won’t bill you. However, if it’s not a network problem, I’ll be sure to make it painful enough that next time you’ll definitely do your homework before you come calling.

I’ll close this post out with the following line,

It’s Not Always the Network’s Fault!

Cheers!

Is troubleshooting a dwindling skillset? https://blog.michaelfmcnamara.com/2010/12/is-troubleshooting-a-dwindling-skillset/ Thu, 09 Dec 2010 11:00:53 +0000

As an Information Technology professional I’ve noticed a growing trend lately around a specific skillset: troubleshooting. I’m not just talking about vendors and Customer Support personnel but all professionals working in the Information Technology field. I’ve seen system administrators just throw up their hands if setup.exe doesn’t finish completely. Likewise, I’ve seen network engineers throw up their hands if the configuration guide they’re following doesn’t match up 100% with the CLI output.

It might be that I’m being too critical… I’ll let you guys tell me if you think so.

I’ve also noticed an increased reliance on support and maintenance contracts. I personally don’t call a vendor until I’ve thoroughly researched the topic and educated myself where necessary. Obviously, in a network-down or similarly critical situation that basic rule goes out the window, but I would hope that I, or the person responsible, would have the basic knowledge and training to support the product or system.

Cheers!

Update: Here’s a great video from YouTube – Thanks, Carl!

Remote Packet Capture with WireShark and WinPCAP https://blog.michaelfmcnamara.com/2010/09/remote-packet-capture-with-wireshark-and-winpcap/ Sun, 05 Sep 2010 14:22:26 +0000

I’m just continually impressed with the quality of so many open source products available today. One such product that should be extremely high on any network engineer’s list is WireShark. WireShark has become the de facto standard for packet capture software and is almost unrivaled in features and functionality.

Last week I had the task of diagnosing some very intermittent desktop/application performance issues at a remote site. I had installed WireShark locally on a few desktops, but I wanted the ability to remotely monitor a few specific desktops without obstructing the users’ workflow, to get a baseline for later comparison. I was excited to learn that WireShark and WinPCAP have (experimental) remote packet capture functionality built into each product.

I followed the instructions on the WireShark website by installing WinPCAP v4.1.2 on the remote machine and then starting the “Remote Packet Capture Protocol v.0 (experimental)” service. With that done, I launched WireShark on my local desktop and configured the remote packet capture settings. From within WireShark I chose Options -> Capture and changed the Interface from Local to Remote, then entered the IP address of the remote machine along with the TCP port (the default is TCP port 2002). I initially tried to use “Null authentication” but was unsuccessful; I eventually ended up choosing “Password authentication” and used the local Administrator account and password of the remote desktop that had WinPCAP installed on it. If the remote desktop had multiple interfaces I could have selected which interface to perform the remote packet capture on; in this case the desktop in question only had an integrated Intel(R) 82567LM-3 network adapter. I clicked ‘Start’ and to my sheer amazement the packet trace was off and running, collecting packets from the remote desktop.

There will still be the occasional need to place the Dolch (portable sniffer) onsite when the situation demands it, but this is a great tool to have available.
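
The same remote capture can also be driven from the command line. This is only a sketch: it assumes the remote packet capture service is running on the Windows desktop, that the local tshark build includes remote capture support, and the IP address and adapter GUID below are placeholders.

On the remote Windows desktop, start the service by its short service name:

C:\> net start rpcapd

Then, from the local machine, point tshark at the remote adapter using an rpcap:// URL:

$ tshark -i "rpcap://192.0.2.25:2002/\Device\NPF_{ADAPTER-GUID}" -w remote-site.pcap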

Cheers!

Updated: Sunday September 5, 2010
The images above appear to be missing because the URL paths are wrong; I’m not sure how WordPress messed that up. I don’t have time to fix it right now, but I will fix it a little later.
