The folks at Gestalt IT are putting on yet another Networking Field Day event in Silicon Valley. You can follow the hashtag #NFD8 on Twitter and/or watch the live video streams at http://techfieldday.com/event/nfd8
Cheers!
Here’s a short story, partially still in progress. With all the security breaches going on, I thought for a few moments that I might have been caught up in one of them. This past Thursday night I noticed that both my virtual servers, which are hosted by Digital Ocean, were extremely slow to login via SSH, on the order of 8 to 10 seconds. Accessing the WordPress installation on one of those virtual servers was also extremely slow. I began to fear that one of my servers had fallen victim to some vulnerability or exploit, and because the two servers share SSH keys, both could potentially have been accessed without my knowledge or permission.
I checked that both DNS servers were responding and resolving queries;
/etc/resolv.conf:
nameserver 4.2.2.2
nameserver 8.8.8.8
I was able to resolve www.google.com from both 4.2.2.2 and 8.8.8.8, so I initially thought the problem wasn’t with DNS. However, when I was unable to figure out where the problem was, I decided to disable DNS resolution within SSH by placing the following statement in /etc/ssh/sshd_config;
UseDNS no
A quick restart of sshd (service sshd restart) and the problem seemed to be resolved… but what had fixed it?
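For what it’s worth, UseDNS controls whether sshd performs a reverse DNS lookup on the connecting client’s IP address, so a slow or timing-out reverse lookup will stall a login even when forward resolution works fine. A quick way to check, using a documentation address (203.0.113.10) as a stand-in for your own client IP:

time dig -x 203.0.113.10 @4.2.2.2

If that hangs for several seconds against one resolver but not another, you’ve likely found your culprit.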
I went back to the /etc/resolv.conf file and decided to change the order of the DNS servers, placing Google’s DNS server ahead of Level3’s and adding a second Google DNS server into the mix. Digital Ocean doesn’t maintain their own DNS infrastructure, which is somewhat surprising for such a large provider.
/etc/resolv.conf:
nameserver 8.8.4.4
nameserver 8.8.8.8
nameserver 4.2.2.2
And to my surprise, when I went back to test, the problem was gone. So was there some problem with 4.2.2.2?
With that problem partially explained I decided to apply the latest and greatest patches and security updates for CentOS 6.5. And there I also ran into a problem… upon rebooting I found that the OS was unable to load the driver for the network interface eth0, returning the error;
FATAL: Could not load /lib/modules/2.6.32-358.6.2.el6.x86_64/modules.dep
I was able to quickly change my kernel version via the Digital Ocean control panel to 2.6.32-431.20.3.el6.x86_64 and reboot.
With all that done I turned my attention toward my WordPress administration, which was still extremely slow. I had been guessing that the problem was probably related to some plugin I had recently updated, so I went back and disabled all the plugins to see if I could find the one causing the slowdown. That search proved fruitless, so I turned my attention to the performance of Nginx, PHP-FPM and MySQL. After optimizing the tables and some of the configurations within MySQL I found the response of the WordPress Admin portal better (3-5 seconds), but there’s still a problem somewhere that I need to track down.
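If you’re curious what the table optimization step looks like in practice, a minimal example (assuming a stock MySQL install and appropriate credentials) would be:

mysqlcheck --optimize --all-databases -u root -p

The MySQLTuner script (mysqltuner.pl) is also handy for spotting obvious configuration misses.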
Here are a few resources if you are struggling with tuning your MySQL instance;
It’s never a boring day blogging on the Internet.
Cheers!
In my latest adventure I had to untangle the interaction between a pair of Cisco ACE 4710s and Akamai’s Content Distribution Network (CDN), including SiteShield, Mointpoint, and SiteSpect. It’s truly amazing how complex, almost convoluted, a CDN can make any website. And when it fails you can guess who’s going to get the blame. Over the past few weeks I’ve been looking at a very interesting problem where an Internet-facing VIP was experiencing a very unbalanced distribution across the real servers in the serverfarm. I wrote a few quick and dirty Bash shell scripts to do some repeated load tests utilizing curl, and sure enough I was able to confirm that there was something amiss between the CDN and the load balancer. If I tested against the origin VIP I had near perfect round-robin load balancing across the real servers in the VIP; if I tested against the CDN I would get very uneven load-balancing results.
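The scripts themselves were nothing fancy. A minimal sketch of the idea, assuming the real servers identify themselves in a response header (the header name X-Server here is hypothetical; substitute whatever your farm actually returns, and point the URL at either the origin VIP or the CDN hostname):

#!/bin/bash
# Fire 100 requests at the site and tally which real server answered each one.
for i in $(seq 1 100); do
  curl -s -o /dev/null -D - http://www.example.com/ | grep -i '^X-Server:'
done | sort | uniq -c

Running the same loop against the origin VIP and then against the CDN hostname made the uneven distribution obvious.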
When a web browser opens a connection to a web server it will generally send multiple requests across a single TCP connection, similar to the figure below. Occasionally some browsers will even utilize HTTP pipelining if both the server and browser support that feature, sending multiple requests without waiting for the corresponding responses.
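You can see this connection reuse with curl, which will send multiple requests over a single TCP connection when given several URLs on the same host (www.example.com is a placeholder here):

curl -v http://www.example.com/one http://www.example.com/two 2>&1 | grep -iE 'connected|re-using'

The second request should report that curl is re-using the existing connection rather than opening a new one.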
The majority of load balancers, including the Cisco ACE 4710 and the A10 AX ADC/Thunder, will look at the first request in the TCP connection, apply the load-balancing metric, and forward the traffic to a specific real server in the VIP. To speed the processing of future requests, the load balancer will then forward all traffic in that connection to the same real server. This generally isn’t a problem if there’s only a single user associated with a TCP connection.
Akamai will attempt to optimize the number of TCP connections from their edge servers to your origin web servers by sending multiple requests from different users all over the same TCP connection. In the example below there are requests from three different users but it’s been my experience that you could see requests for dozens or even hundreds of users across the same TCP connection.
And herein lies the problem: the load balancer only evaluates the first request in the TCP connection, and all subsequent requests are sent to the same real server, leaving some servers overutilized and others underutilized.
Thankfully there are configuration options in the majority of load balancers to work around this problem and instruct the load balancer to evaluate all requests in the TCP connection independently.
A10 AX ADC/Thunder
strict-transaction-switch
Cisco ACE 4710
parameter-map type http HTTP_PARAMETER_MAP
  persistence-rebalance strict
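Keep in mind the parameter map only takes effect on the ACE once it’s applied under the VIP’s multi-match policy. A sketch, with hypothetical class and policy names:

policy-map multi-match CLIENT_VIPS
  class VIP_HTTP_CLASS
    loadbalance vip inservice
    loadbalance policy WEB_FARM_POLICY
    appl-parameter http advanced-options HTTP_PARAMETER_MAP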
With the configuration change made, every request in the TCP connection is now evaluated and load-balanced independently, resulting in a more even distribution across the real servers in the farm.
In this scenario I’m using HTTP cookies to provide session persistence and ‘stickiness’ for user sessions. If your application is stateless then you don’t really need to worry about whether a user lands on the same real server for each and every request.
Cheers!
Image Credit: topfer
I recently ran into a puzzling issue with a web framework that was failing to perform under a load test. The web framework was being front-ended by a pair of Cisco ACE 4710 Application Control Engines (load balancers) using a single IP address in a SNAT pool. The Cisco ACE 4710 was the initial suspect, but a quick analysis determined that we were potentially experiencing TCP port exhaustion, because the test would start failing at almost the same point every time.

While the original suspect was the Cisco ACE 4710, it turned out to be TCP port exhaustion on the web application tier. The load test was hitting the site so hard and so fast that it was cycling through all ~64,000 possible TCP source ports before the web server had freed up the port from the previous request. The ports were in TIME_WAIT state even though the Cisco ACE 4710 had sent a FIN requesting the connection be closed. Thinking the port was available, the Cisco ACE 4710 attempted to make a connection on the same port a second time, which failed because the web application tier still had the TCP port in TIME_WAIT and hadn’t closed or freed it.

While the Linux system administrators attempted to tune their web application tier we still had issues with TCP ports overlapping between requests, so the interim solution was to add 4 more IP addresses to the SNAT pool on the Cisco ACE 4710. This way we’d need to cycle through 5 * 64,000 (roughly 320,000) TCP ports before reusing any given source port.
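If you suspect you’re in the same boat, counting the sockets stuck in TIME_WAIT on the web tier is a quick sanity check:

ss -tan state time-wait | wc -l

or on older boxes:

netstat -an | grep -c TIME_WAIT

If that number is hovering near the size of your ephemeral port range, you’ve almost certainly found the bottleneck.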
References;
LogNormal – http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/
Cheers!
Image Credit: Jaylopez
About three weeks ago Greg Ferro from Etherealmind posted an article entitled “Scripting Does Not Scale For Network Automation”. It’s quite clear from reading the article that Greg really is “bitter and jaded”. While I agree that there are challenges in scripting, they also come with some large rewards for those who are able to master the skill.
In a subsequent comment Greg really hits on his point: “We need APIs for device consistency, frameworks for validation and common actions. But above that we need platforms that solve big problems – scripting can only solve little problems.”
I agree, but for now we need to work with what we have available, and that’s no reason to stop scripting today. That said, scripting is not a tool that’s going to solve every problem in IT. It might be helpful for initial deployments, provisioning, backups, monitoring, testing, etc., but it’s rare that scripting will solve every problem. I personally employ a combination of commercial management solutions and scripting to achieve my goals. I’ve worked with the following methods and technologies: Expect/Tcl, SNMP, PHP, Perl, XML, NETCONF. These all have their individual challenges, but each can be used in its own fashion to help automate a task or process depending on the task or the vendor in question. If you need to do something once or twice there’s no need for a script or automation, but if you are going to do something daily or weekly across dozens or hundreds of assets then a script can be extremely helpful.
The point of writing a script is really twofold in my opinion: first to automate the task, but more importantly to remove the human error element. I do a lot of my work in the wee morning hours when the eyes are bloodshot and the mind isn’t always as rested as it should be. It’s easy to make simple stupid mistakes repeating monotonous commands on dozens or even hundreds of switches or routers. A script helps to actually do the work and makes sure that I won’t accidentally blow something up; I’m really there just to monitor for problems or issues.
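To make that concrete, even a trivial wrapper goes a long way. A minimal sketch, assuming a hypothetical switches.txt with one hostname per line and SSH key authentication already in place:

#!/bin/bash
# Run the same command against every switch in the list and save the output,
# so the human only has to review the results instead of typing at 3am.
# ssh -n keeps ssh from swallowing the rest of the host list on stdin.
while read -r host; do
  echo "=== $host ==="
  ssh -n -o ConnectTimeout=5 "admin@$host" 'show running-config' > "backup-$host.txt" 2>&1
done < switches.txt

It’s not glamorous, but it types the same command perfectly every single time.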
It should be no surprise that there’s effort required to maintain a script; it’s just like a commercial vendor maintaining a product. Here’s the changelog for a Perl script I maintained between 2003 and 2014 that utilized SNMP and TFTP against Avaya/Nortel, Cisco, Motorola/Symbol and HP gear. You can see some of the challenges that Greg referred to in his article;
# Changes:
#
# May 04, 2011 (M.McNamara) added support for HP C-Class GbE2c and legacy P-Class GbE2
#                           thanks to Karol Perkowski for his code addition
# Dec 28, 2010 (M.McNamara) added additional code to support ERS4500 being slow TFTP transfer
# Dec 27, 2010 (M.McNamara) updated CISCO-PRODUCTS-MIB to cover ciscoCBS3120 blade
# Dec 20, 2010 (M.McNamara) updated ASCII routine with OID s5AgSysAsciiConfigManualUpload
# Aug 31, 2010 (M.McNamara) added routines to handle binary and ASCII data for Avaya ERS switches
#                           also added code to keep 4 archive copies per device
# Dec 02, 2009 (M.McNamara) cleaned up code added additional debug routines
# Oct 23, 2008 (M.McNamara) added support for Motorola RFS7000 Wireless LAN Switch
# Oct 22, 2008 (M.McNamara) added support for ASCII configuration files for Avaya ERS switches
# Oct 10, 2008 (M.McNamara) added support for Cisco switches
# Jan 22, 2008 (M.McNamara) added support for HP GbE2c (C-Class) switch
# Apr 24, 2007 (M.McNamara) added support for WS5100 3.x software
# Oct 24, 2006 (M.McNamara) added support for ERS1600 v2.1 release
# Sep 29, 2006 (M.McNamara) added support for BayStack 470 PwR 48T
# Oct 20, 2005 (M.McNamara) added support for Baystack 5510 24 port also added
#                           Ethernet Routing Switch (formerly Passport) 8600 code
# Mar 01, 2005 (M.McNamara) incorporated a sub to check for the presence of the
#                           proper filename on the TFTP server (/tftpboot) thereby
#                           eliminating the first script "readytftpbackup.pl"
# Feb 25, 2005 (M.McNamara) added the ability to retry a failed backup
# Jan 13, 2004 (M.McNamara) some minor bugs throughout code base
# Jan 06, 2004 (M.McNamara) implemented a workaround for the Passport RAPID-CITY MIB
#                           > 3.2 problem, copied OIDs for Passport 1600 into
#                           existing MIB along with required MIBS and added sub
#                           to handle 1600s
# Jan 05, 2004 (M.McNamara) issues with SNMP MIB for Passport 8600 v3.3.4 is presenting
#                           problems with the Net-SNMP perl modules and the old MIB
#                           cannot identify the newly added Passport 1600 switches.
# Dec 11, 2003 (M.McNamara) resolved issue with Passport 8600 not backing up properly
# Sep 17, 2003 (M.McNamara) added code to incorporate all BayStack switches into backup
# Oct 1, 2003 (M.McNamara) added code to email status report to notify@acme.org
#                          also added Perl script to weekly crontab
Will the scripts I write today be useless in two years? Possibly, but that’s pretty much the case with anything these days, including your phone, your laptop, etc. While we wait for something else to come along, the scripts I write and maintain will be very helpful in making my job easier and making me more efficient.
Cheers!
PS: I’ve finally cleaned up the Scripting section of my blog, fixing all the broken links and updating all the code.