Michael McNamara
https://blog.michaelfmcnamara.com
technology, networking, virtualization and IP telephony

It's never a DNS issue right?
https://blog.michaelfmcnamara.com/2022/01/its-never-a-dns-issue-right/
Sun, 23 Jan 2022 22:02:02 +0000

I stumbled into an interesting issue today that gave me a smile when I determined it was a DNS issue.

I was doing some consulting work around WireGuard for a client, and noticed a number of odd issues and just general wonky behavior with everything being slow. This specific client uses Ubuntu Linux while I'm more of a RedHat/CentOS/Rocky guy, so I thought it was an issue with the DNS caching that Ubuntu utilizes in systemd-resolved. A few quick tests using a Windows client proved that the issues weren't limited to the Ubuntu server; they were impacting every device. DNS queries were taking between 5 and 6 seconds, and some were timing out entirely.
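To put numbers on that kind of slowness you can time the lookups yourself, independent of any one resolver stack. A minimal Python sketch; the two-second threshold and the injectable resolver callable are my own assumptions, not anything from the client's environment:

```python
import time

SLOW_THRESHOLD = 2.0  # seconds; the queries described here took 5-6s


def time_lookup(resolve, name):
    """Time a single resolution attempt.

    `resolve` is any callable that maps a hostname to an address
    (socket.gethostbyname works); returns (elapsed_seconds, result or None).
    """
    start = time.monotonic()
    try:
        result = resolve(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        result = None
    return time.monotonic() - start, result


# Real use on the affected client might look like:
#   import socket
#   elapsed, addr = time_lookup(socket.gethostbyname, "www.google.com")
#   print("SLOW" if elapsed > SLOW_THRESHOLD else "ok", elapsed, addr)
```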

The client had mentioned some oddities and issues, and I thought there might be a duplicate IP on the network – pretty standard fare in some networks. It wasn't a duplicate IP issue, so I went straight to the DNS servers themselves – Microsoft Windows Server 2019. I found that the forwarders on each server were set up to use some very old Verizon DNS servers – and wouldn't you know, some of them were no longer responding. I removed all the Verizon entries and added the two standard Google DNS servers – 8.8.8.8 and 8.8.4.4. After applying that and restarting each DNS server the problem was gone and everything was running smoothly again.
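Before swapping forwarders wholesale, it can be worth probing which of the configured ones still answer and keeping only those. A sketch of that filtering step, with the actual probe left injectable (the 192.0.2.1 entry is a documentation placeholder standing in for the dead Verizon addresses, which I'm not reproducing here):

```python
def healthy_forwarders(candidates, probe):
    """Keep only the forwarders that still answer queries.

    `probe` is a callable: probe(ip) -> bool.  In real use it might shell
    out to `dig @ip example.com +time=2 +tries=1` and check the exit code.
    """
    return [ip for ip in candidates if probe(ip)]


# Hypothetical example: one stale entry plus the Google resolvers
# that ended up replacing the old list.
candidates = ["192.0.2.1", "8.8.8.8", "8.8.4.4"]
```

On Windows Server 2012 R2 and later the surviving list could then be applied with the DnsServer PowerShell module's Set-DnsServerForwarder, if memory serves; on older boxes `dnscmd /resetforwarders` does the same job.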

What do you use for your DNS forwarders? Or do you rely on the root hints file maintained by Internic?

Cheers!

CenturyLink/Level 3 Internet meltdown followed by Reddit moderator madness
https://blog.michaelfmcnamara.com/2020/08/centurylink-level-3-internet-meltdown-followed-by-reddit-moderator-madness/
Sun, 30 Aug 2020 20:05:56 +0000

It was another exciting morning around the Internet. It seems that CenturyLink (Level 3) had a meltdown that caused all sorts of issues for roughly 5 hours this morning, starting around 6:04AM EDT and lasting until around 11:12AM EDT.

It started as it always does, with reports of DNS issues, then CDN issues (Cloudflare), and eventually CenturyLink was identified as the culprit, or, to be more precise, any packets traversing the CenturyLink (Level 3) network.

Thankfully Reddit was a great community resource, and reports quickly started rolling in on two threads, one in r/networking and one in r/sysadmin.

For reasons that still aren't 100% clear, the moderators of r/networking decided to delete the first thread. So the refugees from r/networking went to r/sysadmin to escape the persecution, only for the moderators of r/networking to admit their mistake some time later and un-delete the post.

I'll admit I was floored when I found the original thread had been deleted. There were hundreds of us struggling to figure out what was actually going on and trying to understand how we could mitigate the impact to our employers, and some moderator deletes the thread?!? @$%#

The refugees eventually made their feelings known in a thread titled, META: I guess major news-worthy outages are off topic here?

Cheers!

DNS Loops – how to not configure DNS forwarding
https://blog.michaelfmcnamara.com/2016/04/dns-loops-how-to-not-configure-dns-forwarding/
Sun, 03 Apr 2016 12:51:31 +0000

I'm continually amazed by how much I don't know and by all the little issues and problems that I encounter in my day-to-day tasks. You never know what's going to pop up.

I recently stumbled over a very poorly configured DNS environment and thought I would share how not to configure DNS forwarding.

There was a standalone Microsoft domain with four domain controllers, which were set to forward their requests to another pair of Microsoft DNS servers, which eventually forwarded those requests to a fairly new Infoblox DNS environment. Looking at the Infoblox reports, I noticed a number of non-resolvable hostnames at the top of the list; they outpaced the next domains by several million requests. Assuming that some misconfigured application server was continually pounding the DNS environment, I decided to hunt through the logs to identify the original requestor and get them to clean up their act. I enabled query logging on one of the servers and set out to examine who was making the requests. Oddly enough, I found that the other three DNS servers were making them. I went to the next server and repeated the steps, finding that it too showed the other three servers making the requests. I repeated the process on the remaining two servers and found that all the requests for this bad hostname that I could capture weren't coming from any specific client; they were coming from the servers themselves. That isn't completely odd in itself, since the servers can be clients at the same time, but the volume of requests was huge, and nothing was running on these servers except the Microsoft Domain Controller and Domain Name Services roles for a very small Microsoft environment.
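Once query logging is on, the grunt work is just tallying who asked for the offending name. A tiny sketch of that tally, assuming you've already parsed the log into (client, qname) pairs; the parsing itself varies with the Windows DNS debug-log format, and the IPs and hostname below are made up for illustration:

```python
from collections import Counter


def top_requestors(entries, qname):
    """Count queries for `qname` per client; entries is (client_ip, qname) pairs."""
    return Counter(client for client, name in entries if name == qname).most_common()


# Hypothetical parsed log entries:
entries = [
    ("10.0.0.11", "bad.example.internal"),
    ("10.0.0.12", "bad.example.internal"),
    ("10.0.0.11", "bad.example.internal"),
    ("10.0.0.50", "www.example.com"),
]
```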

It wasn’t until I opened the Microsoft DNS server configuration that I was able to piece together what was happening.

DNS loop

The servers were all configured as (multiple) Primary Masters for the internal domains, but they had all been configured to use each other as forwarders along with OpenDNS. In short, the configuration was causing a DNS loop for any request that failed to resolve. A query from a client to WEST-02 would be forwarded to the other three DNS servers along with OpenDNS. Those three servers would then forward that query back to each of the other servers, over and over and over. There is no TTL on a DNS query with respect to propagation; a query can be propagated through as many servers as needed. The TTL value on the actual DNS record is only used in determining the caching lifecycle.

As I understand it, a single DNS query for a bad DNS name would continually self-propagate in this configuration, because the servers would continually try to obtain an answer from their forwarders, and since the forwarders were configured to forward to each other you'd end up in a loop scenario.
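The loop is easy to see if you model the forwarder configuration as a graph and walk it. A sketch of that check; WEST-02 is the one server name mentioned above, and the other three names are my own invention:

```python
def find_forwarding_loop(forwarders, start):
    """Walk the forwarder graph from `start`; return a cycle path or None.

    `forwarders` maps server -> list of servers it forwards to.
    """
    def dfs(node, path, visiting):
        if node in visiting:
            return path + [node]  # closed the cycle
        for nxt in forwarders.get(node, []):
            found = dfs(nxt, path + [node], visiting | {node})
            if found:
                return found
        return None

    return dfs(start, [], frozenset())


servers = ["WEST-01", "WEST-02", "WEST-03", "WEST-04"]  # WEST-02 from the post; rest assumed
# Each server forwards to the other three plus OpenDNS, as described above.
config = {s: [t for t in servers if t != s] + ["OpenDNS"] for s in servers}
```

Run against that config the walk immediately reports a cycle; against a sane chain (servers forwarding only upstream) it returns None.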

You should never configure a pair of DNS servers to forward to each other.

Cheers!

Image Credit: airfin

Digital Ocean – DNS Issues & Kernel Issues
https://blog.michaelfmcnamara.com/2014/08/digital-ocean-dns-issues/
Sat, 23 Aug 2014 13:00:59 +0000

Here's a short story, partially still in progress. With all the security breaches going on, I thought for a few moments that I might have been caught up in one of them. This past Thursday night I noticed that both my virtual servers, which are hosted by Digital Ocean, were very slow to log in to via SSH, and I mean extremely slow, on the order of 8 to 10 seconds. Accessing the WordPress installation on one of those virtual servers was also extremely slow. I began to fear that one of my servers had fallen victim to some vulnerability or exploit, and because I have shared SSH keys between them, both servers could possibly have been accessed without my knowledge or permission.

I checked that both DNS servers were responding and resolving queries;

/etc/resolv.conf:
 nameserver 4.2.2.2
 nameserver 8.8.8.8

I was able to resolve www.google.com from both 4.2.2.2 and 8.8.8.8, so I initially thought the problem wasn't with DNS. However, when I was unable to figure out where the problem was, I decided to disable DNS resolution within sshd by placing the following statement in /etc/ssh/sshd_config;

UseDNS no

A quick restart of sshd (service sshd restart) and the problem seemed to be resolved... but what had fixed it?

I went back to the /etc/resolv.conf file and decided to change the order of the DNS servers, placing Google's DNS servers ahead of Level 3's and adding a second Google DNS server into the mix. Digital Ocean doesn't maintain their own DNS infrastructure, which is somewhat surprising for such a large provider.

/etc/resolv.conf
 nameserver 8.8.4.4
 nameserver 8.8.8.8
 nameserver 4.2.2.2

And to my surprise, when I went back to test, the problem was gone. So was there some problem with 4.2.2.2?
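That would be consistent with how the stub resolver walks /etc/resolv.conf: the nameservers are tried in order, and glibc waits roughly five seconds (the default, tunable with `options timeout:n`) before falling through to the next entry, so a dead or flaky first server taxes every fresh lookup. A back-of-the-envelope sketch of that effect, with the "alive" set below being my assumption about that night:

```python
def first_answer_delay(servers, responds, timeout=5.0):
    """Estimate latency added by dead resolvers tried in order.

    `responds` is a callable: responds(ip) -> bool.  glibc's default
    per-server timeout is about 5 seconds (resolv.conf `options timeout:n`).
    """
    delay = 0.0
    for server in servers:
        if responds(server):
            return delay
        delay += timeout
    return delay  # nobody answered: worst case


alive = {"8.8.8.8", "8.8.4.4"}  # assumption: 4.2.2.2 was timing out that night
original = ["4.2.2.2", "8.8.8.8"]
reordered = ["8.8.4.4", "8.8.8.8", "4.2.2.2"]
```

Under that assumption the original ordering adds about five seconds to every uncached lookup, and the reordered list adds none, which matches the symptoms.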

With that problem partially explained, I decided to apply the latest and greatest patches and security updates for CentOS 6.5. And there I ran into another problem... upon rebooting I found that the OS was unable to load the driver for the network interface eth0, returning the error;

FATAL: Could not load /lib/modules/2.6.32-358.6.2.el6.x86_64/modules.dep

I was able to quickly change my kernel version via the Digital Ocean control panel to 2.6.32-431.20.3.el6.x86_64 and reboot.

With all that done I turned my attention toward my WordPress administration, which was still extremely slow. I had been guessing that the problem was probably related to some plugin I had recently updated, so I went back and disabled all the plugins to see if I could find the one causing the slowdown. That search proved fruitless, so I turned my attention to the performance of Nginx, PHP-FPM and MySQL. After optimizing the tables and some of the configuration within MySQL I found the response of the WordPress admin portal better (3-5 seconds), but there's still a problem somewhere there that I need to track down.

Here are a few resources if you are struggling with tuning your MySQL instance;

It’s never a boring day blogging on the Internet.

Cheers!

Infoblox API Perl Modules
https://blog.michaelfmcnamara.com/2011/11/infoblox-api-perl-modules/
Wed, 09 Nov 2011 23:58:33 +0000

We recently migrated from Alcatel-Lucent's VitalQIP to Infoblox for our IPAM (IP Address Management) solution. I hope to make a more detailed post reviewing Infoblox in the future; for now I'll stick with the issue of integrating with the API interface. One of our goals for the past few years has been to enable MAC address registration, essentially turning off the dynamic nature of DHCP. This would prevent someone from connecting any device to our internal network and getting a DHCP-issued IP address. It's certainly not a complete solution to the security dilemmas, but it would be a good first step.

I do most of my work with CentOS and RedHat Linux because those are the distributions my organization supports internally (even if I'm one of only two people supporting Linux across the entire organization). In this case I was working with a CentOS 5.7 server, but I was having an issue compiling and installing the Infoblox Perl modules.

LWP::UserAgent version 5.813 required–this is only version 2.033

When I attempted to compile the Infoblox Perl modules I received the following errors;

LWP::UserAgent version 5.813 required--this is only version 2.033 at /usr/lib/perl5/site_perl/5.8.8/Infoblox/Agent.pm line 3.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Infoblox/Agent.pm line 3.
Compilation failed in require at /usr/lib/perl5/site_perl/5.8.8/Infoblox/Session.pm line 19.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Infoblox/Session.pm line 19.
Compilation failed in require at /usr/lib/perl5/site_perl/5.8.8/Infoblox.pm line 8.
BEGIN failed--compilation aborted at /usr/lib/perl5/site_perl/5.8.8/Infoblox.pm line 8.
Compilation failed in require at ./ibcli.pl line 78.
BEGIN failed--compilation aborted at ./ibcli.pl line 78.

This was with Perl 5.8.8 on CentOS 5.7 x64; unfortunately it seems this is a known issue with the version of LWP::UserAgent that is currently being distributed via the CentOS repository.

I was able to spin up a new CentOS 6.0 x86 server which was running Perl 5.10.1 and didn’t experience this problem.
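For what it's worth, the comparison the loader is making there is numeric: Perl module versions like 5.813 are decimal numbers, so 2.033 simply falls short of 5.813. A quick sketch of that check (simplified; real Perl version strings can be more elaborate than a plain decimal):

```python
def meets_minimum(installed, required):
    """Compare simple decimal Perl-style versions, e.g. '2.033' vs '5.813'."""
    return float(installed) >= float(required)
```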

The installation was pretty straight forward (except for the issue above) but the API reference manual does a very thorough job of detailing all the possible installation methods on both Windows and Unix/Linux. I just opened a browser to one of the Infoblox appliances and downloaded the Perl modules.

https://10.1.1.1/api/dist/CPAN/authors/id/INFOBLOX/

Just replace the IP address of 10.1.1.1 with the IP address of your Infoblox appliance. I'm not sure why Infoblox hides their manuals behind their support portal; I just don't understand why companies do that. You can find the manual right here, Infoblox_API_Documentation_6.1.0.pdf.

Cheers!

BlueCoat ProxySG – Flush DNS and Cache
https://blog.michaelfmcnamara.com/2008/10/bluecoat-proxysg-flush-dns-and-cache/
Thu, 02 Oct 2008 22:00:42 +0000

There can be a few occasions where you may need to manually purge the local DNS cache and/or the actual web cache of a Blue Coat ProxySG appliance. While both the DNS cache and the web cache will eventually age out, it can sometimes be helpful to speed up the process by flushing them manually.

While this can all be done from the web interface, I generally prefer the CLI (if available). The Blue Coat ProxySG appliances that I managed are set up for SSH access; you may need to confirm that SSH is enabled on yours (telnet might be enabled instead).

Let’s start by connecting to the BlueCoat ProxySG appliance (proxysg.acme.org);

[root@linuxhost etc]# ssh -l admin proxysg.acme.org
admin@proxysg.acme.org's password:

proxysg.acme.org - Blue Coat SG510 Series>

Once we're connected we need to go into privileged mode to issue the commands;

proxysg.acme.org - Blue Coat SG510 Series>enable
Enable Password:

Now that we're in privileged mode we can clear the web content cache with the following command;

proxysg.acme.org - Blue Coat SG510 Series#clear-cache
ok

And to clear the DNS cache we can use the following command;

proxysg.acme.org - Blue Coat SG510 Series#purge-dns-cache
ok

And don't forget to log out when you're all done.

proxysg.acme.org - Blue Coat SG510 Series#exit
Connection to proxysg.acme.org closed.
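If you find yourself doing this often, the interactive sequence above is easy to script: just replay the same keystrokes over an SSH channel. A sketch under the assumption that you hand it any object with a send() method (a paramiko invoke_shell() channel would fit; the connection setup is left out, and the command names are exactly the ones from the session above):

```python
def flush_proxysg(channel, enable_password):
    """Replay the cache-flush session: enable, clear-cache, purge-dns-cache, exit.

    `channel` is anything with a send(str) method, e.g. a paramiko
    invoke_shell() channel (connection setup not shown here).
    """
    for line in ("enable", enable_password, "clear-cache",
                 "purge-dns-cache", "exit"):
        channel.send(line + "\n")
```

In real use you'd also read back the `ok` responses between commands rather than firing blind, but the order of operations is the whole trick.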

Cheers!

Domain Name Server patch
https://blog.michaelfmcnamara.com/2008/07/domain-name-server-patch/
Sun, 13 Jul 2008 23:00:51 +0000

O'Reilly DNS and BIND

Last week there was a flurry of information revolving around a new security flaw in the Domain Name System, software that acts as the central nervous system for the entire Internet.

On Tuesday, July 8, 2008, a number of vendors including Microsoft, Cisco, Juniper and RedHat released patches and/or acknowledged that the flaw existed. The Internet Systems Consortium, the group responsible for development of the popular Berkeley Internet Name Domain (BIND) server, on which nearly all DNS offshoots are based, also acknowledged the flaw and released a patch.

I personally spent about 90 minutes last Wednesday updating several internal and external systems, including numerous CentOS 5.2 servers and Windows 2003 Service Pack 2 servers. I was unable to find any mention of the DNS flaw on the Alcatel-Lucent website, so I'll probably need to place a call concerning Alcatel-Lucent's VitalQIP product.

I used yum to patch the CentOS Linux servers [“yum update”] and then just restarted the named process [“service named restart”]. On the Windows 2003 Service Pack 2 servers I used Windows Update to download and install KB941672 after which I rebooted the servers.

Here are some references:

http://www.theregister.co.uk/2008/07/09/dns_fix_alliance/
http://www.networkworld.com/news/2008/071008-patch-domain-name-servers-now.html
http://www.networkworld.com/news/2008/070808-dns-flaw-disrupts-internet.html
http://www.networkworld.com/podcasts/newsmaker/2008/071108nmw-dns.html
http://www.us-cert.gov/cas/techalerts/TA08-190B.html
http://www.microsoft.com/technet/security/bulletin/MS07-062.mspx

I would strongly suggest that all network administrators start looking into patching their DNS servers as soon as possible.

Cheers!

UPDATE: July 14, 2008

Here’s an update from RedHat concerning the configuration (named.conf) of BIND;

We have updated the Enterprise Linux 5 packages in this advisory. The default and sample caching-nameserver configuration files have been updated so that they do not specify a fixed query-source port. Administrators wishing to take advantage of randomized UDP source ports should check their configuration file to ensure they have not specified fixed query-source ports.

It seems that a check of the configuration file would be in order. Let me throw in a quick warning though: if your DNS server is sitting behind a firewall, you may need to check with the firewall administrator to understand how the firewall will behave when you randomize your source ports. I believe there are quite a few firewalls out there that only expect to see DNS traffic sourced from a DNS server on UDP/53.
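That configuration check boils down to looking for a pinned port in named.conf. Here's the same check as a small sketch; the sample snippets are made up, and the directive syntax assumed is BIND's `query-source address * port NN`:

```python
import re


def fixed_query_source_port(named_conf):
    """Return the pinned query-source port from named.conf text, else None."""
    match = re.search(r"query-source\b[^;]*\bport\s+(\d+)", named_conf)
    return int(match.group(1)) if match else None


# Hypothetical named.conf fragments:
pinned = 'options { query-source address * port 53; };'
unpinned = 'options { directory "/var/named"; };'
```

A non-None result means the server is still announcing all its queries from one predictable port, exactly the situation the updated packages are meant to avoid.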

Good Luck!
