Digital Ocean – DNS Issues & Kernel Issues


Here’s a short story, still partially in progress. With all the security breaches going on, I thought for a few moments that I might have been caught up in one of them. This past Thursday night I noticed that both of my virtual servers, which are hosted by Digital Ocean, were very slow to log in to via SSH, on the order of 8 to 10 seconds. Accessing the WordPress installation on one of those virtual servers was also extremely slow. I began to fear that one of my servers had fallen victim to some vulnerability or exploit, and because the two servers share SSH keys, that both could have been accessed without my knowledge or permission.

I checked that both DNS servers were responding and resolving queries:

/etc/resolv.conf:
 nameserver 4.2.2.2
 nameserver 8.8.8.8

I was able to resolve www.google.com from both 4.2.2.2 and 8.8.8.8, so I initially thought the problem wasn’t with DNS. However, when I was unable to figure out where the problem was, I decided to disable DNS resolution within SSH by placing the following statement in /etc/ssh/sshd_config:

UseDNS no

A quick restart of sshd (service sshd restart) and the problem seemed to be resolved, but what had fixed it?
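For context, when UseDNS is enabled sshd looks up the hostname of the connecting client's IP address, so a slow or failing reverse lookup can delay the login prompt even when forward lookups such as www.google.com work fine. Here is a minimal Python sketch for timing that reverse lookup through the system resolver. The helper names `time_call` and `time_reverse_lookup` are my own for illustration, and the address in the usage comment is a placeholder from the documentation range, not one of my servers.

```python
import socket
import time


def time_call(fn, *args):
    """Run fn(*args) and return (elapsed_seconds, result_or_exception)."""
    start = time.perf_counter()
    try:
        result = fn(*args)
    except OSError as exc:
        # socket.herror and socket.gaierror are both subclasses of OSError,
        # so a failed lookup is returned rather than raised.
        result = exc
    return time.perf_counter() - start, result


def time_reverse_lookup(ip):
    """Time a reverse (PTR) lookup for ip using the system resolver."""
    return time_call(socket.gethostbyaddr, ip)


# Usage (hypothetical client address, substitute your own):
#   elapsed, result = time_reverse_lookup("203.0.113.10")
#   print(f"{elapsed:.2f}s -> {result!r}")
```

If the elapsed time here is in the same range as the SSH login delay, the reverse lookup is a likely suspect.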

I went back to the /etc/resolv.conf file and decided to change the order of the DNS servers, placing Google’s DNS server ahead of Level3’s and adding a second Google DNS server into the mix. Digital Ocean doesn’t maintain their own DNS infrastructure, which is somewhat surprising for such a large provider.

/etc/resolv.conf:
 nameserver 8.8.4.4
 nameserver 8.8.8.8
 nameserver 4.2.2.2

To my surprise, when I went back to test, the problem was gone. So was there some problem with 4.2.2.2?
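The order matters because the resolver tries the nameserver entries in the order they are listed and only moves on to the next one after a timeout (5 seconds by default, per resolv.conf(5)). That may be consistent with the delay I was seeing, though I haven't proven it. As a rough sketch, the reordering I did by hand could be scripted like this in Python; `reorder_nameservers` is a hypothetical helper of my own, and it works on the file's text rather than editing /etc/resolv.conf in place.

```python
def reorder_nameservers(resolv_conf_text, preferred):
    """Return resolv.conf text with nameserver lines reordered.

    Servers in `preferred` come first (and are added if absent); any other
    existing servers follow in their original order. Non-nameserver lines
    are kept and placed ahead of the nameserver lines.
    """
    other_lines = []
    existing = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            existing.append(parts[1])
        else:
            other_lines.append(line)
    ordered = list(preferred) + [s for s in existing if s not in preferred]
    lines = other_lines + [f"nameserver {s}" for s in ordered]
    return "\n".join(lines) + "\n"


# Usage: put Google's servers ahead of 4.2.2.2.
original = "nameserver 4.2.2.2\nnameserver 8.8.8.8\n"
print(reorder_nameservers(original, ["8.8.4.4", "8.8.8.8"]), end="")
```

Keep in mind that the glibc resolver only uses the first three nameserver entries, so adding a fourth would have no effect.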

With that problem partially explained, I decided to apply the latest and greatest patches and security updates for CentOS 6.5. And there I ran into another problem: upon rebooting I found that the OS was unable to load the driver for the network interface eth0, returning the error:

FATAL: Could not load /lib/modules/2.6.32-358.6.2.el6.x86_64/modules.dep

I was able to quickly change my kernel version via the Digital Ocean control panel to 2.6.32-431.20.3.el6.x86_64 and reboot.
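One plausible reading of that error (an assumption on my part, not something I confirmed) is that the kernel being booted no longer had a matching module tree under /lib/modules after the update. A quick way to check is to see whether modules.dep exists for a given kernel release. The Python sketch below does that; the function names and the `root` parameter are my own, with `root` only there so the check can be pointed at a mounted image or a test directory.

```python
import os
import platform


def modules_dep_path(release, root="/"):
    """Build the path to modules.dep for a given kernel release."""
    return os.path.join(root, "lib", "modules", release, "modules.dep")


def has_modules_dep(release=None, root="/"):
    """Return True if modules.dep exists for the kernel release.

    With no release given, the running kernel's release is used
    (the same string that `uname -r` prints).
    """
    release = release or platform.release()
    return os.path.isfile(modules_dep_path(release, root))


# Usage:
#   has_modules_dep()                                # running kernel
#   has_modules_dep("2.6.32-431.20.3.el6.x86_64")    # a specific release
```

If this returns False for the kernel selected in the control panel, the module tree for that kernel is missing or incomplete, and picking a kernel version that is actually installed is the sensible fix.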

With all that done, I turned my attention toward my WordPress administration, which was still extremely slow. I had been guessing that the problem was related to some plugin that I had recently updated, so I went back and disabled all the plugins to see if I could find the one that was causing the slowdown. That search proved fruitless, so I turned my attention to the performance of Nginx, PHP-FPM and MySQL. After optimizing the tables and some of the configurations within MySQL, I found the response of the WordPress Admin portal better (3-5 seconds), but there’s still a problem somewhere there that I need to track down.

Here are a few resources if you are struggling with tuning your MySQL instance:

It’s never a boring day blogging on the Internet.

Cheers!

{ 2 comments }

Akamai CDN and TCP Connections


In my latest adventure I had to untangle the interaction between a pair of Cisco ACE 4710s and Akamai's Content Distribution Network (CDN) including SiteShield, Mointpoint, and SiteSpect. It's truly amazing how complex and almost convoluted a CDN can make any website. And when it fails you can guess who's going to get the blame. Over the past few weeks I've been looking at a very interesting problem where an Internet-facing VIP was experiencing a very unbalanced distribution across the real servers in the serverfarm. I wrote a few quick and dirty Bash shell scripts to do some repeated load tests […] Read More

{ 0 comments }

Web Application Load Testing – TCP Port Exhaustion


I recently ran into a puzzling issue with a web framework that was failing to perform under a load test. This web framework was being front-ended by a pair of Cisco ACE 4710 Application Control Engines (load balancers) using a single IP address in a SNAT pool. The Cisco ACE 4710 was the initial suspect, but a quick analysis determined that we were potentially experiencing a TCP port exhaustion issue because the test would start failing at almost the same point every time. While the original suspect was the Cisco ACE 4710 it turned out to be a TCP port exhaustion […] Read More

{ 2 comments }

Response: Scripting Does Not Scale For Network Automation


About three weeks ago Greg Ferro from Etherealmind posted an article entitled "Scripting Does Not Scale For Network Automation". It's quite clear from reading the article that Greg really is "bitter and jaded". While I agree that there are challenges in scripting, they also come with some large rewards for those who are able to master the skill. In a subsequent comment Greg really hits on his point: "We need APIs for device consistency, frameworks for validation and common actions. But above that we need platforms that solve big problems - scripting can only solve little problems." I agree […] Read More

{ 1 comment }

Your customer needs help? Tell them to hire me!


This is a little off-topic, but I've probably let this slide for too long, and unfortunately I've been going around with this pent-up anger for quite some time now, so it's time to vent and rant. I provide a blog and forum to the community as a way to help educate people and hopefully learn a little something myself along the way. I'm generally interested in targeting the actual end-user, the network engineer or system administrator who's working for Acme Corp. or Wayne Enterprises or the Umbrella Corp.; hopefully you get the idea. Inevitably there will be a reseller or […] Read More

{ 1 comment }