Michael McNamara

It’s never a DNS issue right?

January 23, 2022 by Michael McNamara

I stumbled into an interesting issue today that gave me a smile when I determined it was a DNS issue.

I was doing some consulting work around WireGuard for a client and noticed a number of odd issues and generally wonky behavior, with everything being slow. This specific client uses Ubuntu Linux while I’m more of a RedHat/CentOS/Rocky guy, so I thought it was an issue with the DNS caching that Ubuntu provides through systemd-resolved. A few quick tests using a Windows client proved that the issues weren’t limited to the Ubuntu server; they were impacting every device. DNS queries were taking between 5 and 6 seconds and some were timing out entirely.

The client had mentioned some oddities and issues and I thought there might be a duplicate IP on the network – pretty standard affair in some networks. This wasn’t a duplicate IP issue, so I went straight to the DNS servers themselves – Microsoft Windows Server 2019. I found that the forwarders on each server were set up to use some very old Verizon DNS servers – and wouldn’t you know that some of them were no longer responding. I removed all the Verizon entries and added the two standard Google DNS servers – 8.8.8.8 and 8.8.4.4. After applying that and restarting each DNS server the problem was gone and everything was running smoothly again.
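A quick way to spot a dead or slow forwarder is to time a query against each one directly. The following is a minimal sketch in Python, not the tooling used above: it builds a bare A-record query by hand and measures how long a server takes to answer over UDP. The function names (`build_query`, `time_query`) and the 3 second timeout are assumptions chosen for illustration.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, ending in a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + labels + b"\x00" + struct.pack(">HH", 1, 1)


def time_query(server, name, timeout=3.0):
    """Return the response time in seconds, or None if no answer arrives."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(query, (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and unreachable-host errors alike.
            return None
        # Only count the reply if it carries the same query ID we sent.
        if data[:2] != query[:2]:
            return None
        return time.perf_counter() - start
```

Calling `time_query("8.8.8.8", "example.com")` for each configured forwarder and printing the result should make a non-responding server stand out as `None` rather than a number.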

What do you use for your DNS forwarders? Or do you rely on the root hints file published by InterNIC?

Cheers!

Troubleshooting Application Performance and Monitoring with Selenium

January 28, 2021 by Michael McNamara

It was yet another exciting week…

When Cloud or SaaS application performance starts impacting user productivity how do you go about troubleshooting? Performance can be extremely subjective… what is fast to some people is slow to others and vice versa. How do you even measure performance? Invariably people want to blame the network because that’s the simplest answer. However, it can take a lot of effort and due diligence to dig down and find the actual culprit.

In this specific case we had ~8,000 miles between the users and the server infrastructure. So I’m personally expecting additional challenges due to the extreme round trip times (220ms) and latency that may play some role in any possible issue or issues.
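To put the 220ms figure in context, it helps to work out the floor that physics imposes on that distance. The sketch below assumes light travels through fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum) and that the cable runs in a straight line, which real paths never do, so treat the result as a lower bound only.

```python
MILES_TO_KM = 1.609344
# Assumption: propagation speed in optical fiber, roughly two-thirds of c.
SPEED_IN_FIBER_KM_S = 200_000


def min_rtt_ms(distance_miles):
    """Best-case round trip time in milliseconds over a straight fiber run."""
    one_way_seconds = distance_miles * MILES_TO_KM / SPEED_IN_FIBER_KM_S
    return 2 * one_way_seconds * 1000


print(f"{min_rtt_ms(8000):.0f} ms")
```

Under those assumptions the best case for 8,000 miles is a little under 130ms, so a measured 220ms is not unreasonable once indirect routing, VPN and proxy hops are added on top.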

Let’s try to frame the issue:

Is the issue persistent or intermittent? Intermittent
Is the issue occurring with any regularity? Yes, 11:00AM – 12:30PM local time daily
Is the issue impacting every user or just specific users? Multiple users, not clear if every user is impacted but a majority of users
Is there anything common among the impacted users? They are all using the same VPN and proxy server infrastructure, they are all located in the same country.
When did the problem start? Users have been working for 3+ months without issue, but this problem is fresh within the past 2 weeks.

The last point is likely key… so what’s changed in the past 2 weeks that’s causing this issue? Let’s get to that later but those simple facts are key in driving your investigation.

We start with the simple baseline network tests;

ping – good with minimal packet loss
traceroute (mtr) – shows paths that traverse multiple ISPs
speed tests – generally good
packet capture – in general looks good, some out of order packets, some dupe ACKs, these are likely the result of the ~ 8,000 miles between the endpoints.

In the baseline results there are no smoking guns but there are some suspect data points in there, although we need to remember that this isn’t a LAN based application. This is an Internet based application with 8,000 miles between the endpoints so there is going to be some noise in the packet trace.

Note: I’ve seen all sorts of interesting Internet issues since March 2020 when the pandemic lock-down first kicked off here in the US, and again recently at the beginning of September 2020 when the majority of US school students returned to remote learning. I observed that a large number of my US users had better latency to our UK VPN gateways than to our local US VPN gateways. Ultimately we found a number of Internet peering points between the different Internet Service Providers (I’m being nice here and not naming names) were getting completely blasted and were adding 75-125ms to every packet. Eventually the providers addressed this problem with additional peering but it was a painful couple of weeks.

Now what we need are some additional data points that can be collected during the issue;

HAR (HTTP Archive) from Chrome web browser collected from user experiencing issue – this was a key piece of data that helped move the issue forward
packet capture – wasn’t able to be captured due to locked down computers

What can we do to monitor the performance of the cloud application?

ping – We set up ping monitors from a number of data centers globally to monitor for basic availability
curl – We set up some simple HTTP/HTTPS monitoring using cURL
selenium – At the recommendation of the application provider we set up ThousandEyes and a transaction monitor to generate synthetic transactions by logging into the application and working through a few different functions which themselves have dependencies on external REST and SOAP APIs.
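The simple HTTP/HTTPS check in that list can be approximated in a few lines of Python. This is a sketch of the general idea rather than the monitoring that was actually deployed: `probe` times a single fetch and `summarize` condenses a series of samples. Both names, the 10 second timeout and the choice of min/median/max are assumptions.

```python
import http.client
import statistics
import time
import urllib.request


def probe(url, timeout=10.0):
    """Fetch url once and return the elapsed seconds, or None on failure."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return time.perf_counter() - start


def summarize(samples):
    """Condense a list of probe results (seconds or None) into basic stats."""
    ok = [s for s in samples if s is not None]
    return {
        "failures": len(samples) - len(ok),
        "min": min(ok) if ok else None,
        "median": statistics.median(ok) if ok else None,
        "max": max(ok) if ok else None,
    }
```

Running `probe` on a schedule from several locations and comparing the summaries for the reported timeframe against the rest of the day gives a rough "front door" view, although it says nothing about what happens after login.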

The application itself has a number of dependencies on external microservices, so initially we were concerned that these external services might be having performance issues themselves which might be impacting the application. So we had to set up additional monitoring to try and validate the performance of those REST and SOAP APIs during the reported timeframes.

This was my first foray into working with Selenium and ThousandEyes but I was able to kludge my way through the solution after about 2 days. I did run into a few problems with the application website using dynamic Class IDs but eventually I got some basic tests working properly. The solution itself worked fairly well… we had some decent “front door” statistics within hours and the synthetic transaction data gave us a good idea that the application was performing properly during the reported timeframes the users were experiencing issues.

The application vendor was extremely helpful in examining the HAR data, and quickly determined from the HAR and their own internal logs that HTTP/HTTPS requests from the clients were being queued up and delayed from reaching their back-end infrastructure (Chrome only allows 6 concurrent connections to a single hostname). Within the HAR data the vendor observed some fairly aggressive custom polling within the application that was making unconditional JavaScript calls every 2 seconds that resulted in a 12Kb data set being transferred to the client. The initial theory was that some Internet slowdown was causing the client requests to slow down and eventually fall behind, which then coupled with the unconditional JavaScript calls and the six connection limit in Chrome led to an extremely poor user experience.
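A HAR file is just JSON, so the kind of queuing the vendor spotted can be pulled out with a short script. The sketch below is not the vendor's analysis; it relies only on the `log.entries[].timings` structure from the HAR 1.2 format, where `blocked` is the time a request spent waiting before it could be sent and -1 means the value does not apply. The 500ms threshold and the function names are assumptions.

```python
import json


def load_har(path):
    """Read a HAR file exported from the browser's developer tools."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def stalled_requests(har, blocked_ms=500.0):
    """Return (url, blocked_ms, wait_ms) rows for requests that sat in the queue."""
    rows = []
    for entry in har["log"]["entries"]:
        timings = entry["timings"]
        blocked = timings.get("blocked", -1)
        # Skip entries where the browser did not record a blocked time.
        if blocked is None or blocked < blocked_ms:
            continue
        rows.append((entry["request"]["url"], blocked, timings.get("wait", -1)))
    # Worst offenders first.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A long list of polling URLs with large `blocked` values and modest `wait` values would point at requests stuck behind the per-host connection limit rather than at a slow back end.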

We eventually learned that the infrastructure the users were riding had switched Internet Service Providers two weeks earlier. Hmmm… hadn’t the issues started 2 weeks earlier? Yes they had! Ultimately we determined that there was enough occasional packet loss and packet retransmission over this new Internet link that it was impacting this specific application. The infrastructure was switched back to the original Internet link and the issue hasn’t been observed since.

My Thoughts?

In this specific case the intermittent packet loss and retransmissions were causing the application to fall behind in its communications with the backend infrastructure, which was resulting in an extremely poor user experience. It’s relatively safe to argue that if the application code wasn’t as aggressive in its polling it could potentially “tolerate” a certain amount of packet loss and retransmissions.

I personally believe as a network engineer it’s invaluable to learn why something doesn’t work instead of just accepting that it doesn’t work. Inevitably there will be things that we can’t explain but I’m a huge advocate of spending the effort to make sure you understand the vast majority. It’s really the only way you’ll make the environment around you better and ultimately more resilient.

Cheers!

How would you plan a network migration?

May 7, 2017 by Michael McNamara

You can usually measure the success of any project by the amount of planning, research and testing that’s been invested into the project. I’ve done dozens of network migrations, from complete forklifts to gradual side by side migrations, and all of them required a significant amount of planning, research and testing prior to the actual execution. It was the planning, research and testing that I directly credit for the success of all those projects. Here are the general steps that I go through when migrating a network (replacing or upgrading the physical hardware or equipment):

Cleanup
Documentation
Research
Testing pre-migration
Execution
Testing post-migration
Turnover

Let’s go through each of those steps and I’ll explain what I’m talking about.

Step 1. Cleanup

This step is usually overlooked but can be the step that provides the biggest bang for the buck. Say we have a core switch with 240 ports, 110 of which are completely idle. Why not clean up and remove those 110 ports and save yourself from having to worry about trying to migrate them? The same goes for the actual configuration. Say we have 8 port-channels of which only 5 are active. The other 3 port-channels have been decommissioned but no one ever cleaned up the configuration or cabling. Let’s clean up the configuration prior to any migration so we only need to worry about what’s actually in use.

Step 2. Documentation

I generally like to document all the switch ports, not just the uplinks and downlinks, and also dump the MAC/FDB and ARP tables to document what’s connected to every port. You’d be surprised how often this has proved very helpful either during the migration or in post migration troubleshooting.
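Once the MAC/FDB and ARP tables have been dumped, joining them gives a per-port inventory. The sketch below assumes the dumps have already been parsed into simple tuples, since the raw output differs by vendor: `mac_table` rows are `(vlan, mac, port)` and `arp_table` rows are `(ip, mac)`. Those shapes, the function name and the sample addresses are all hypothetical.

```python
from collections import defaultdict


def document_ports(mac_table, arp_table):
    """Map each switch port to the MAC and IP addresses learned on it."""
    # Normalize case so the two tables join even if one device prints uppercase.
    ip_by_mac = {mac.lower(): ip for ip, mac in arp_table}
    ports = defaultdict(list)
    for vlan, mac, port in mac_table:
        mac = mac.lower()
        ports[port].append({"vlan": vlan, "mac": mac, "ip": ip_by_mac.get(mac)})
    return dict(ports)


inventory = document_ports(
    mac_table=[(10, "00:11:22:33:44:55", "1/1"), (20, "00:AA:BB:CC:DD:EE", "1/2")],
    arp_table=[("192.0.2.10", "00:11:22:33:44:55")],
)
for port, hosts in sorted(inventory.items()):
    print(port, hosts)
```

A device that shows up with a MAC but no IP, like the second entry here, is worth a closer look before the cutover because it may sit on a VLAN routed elsewhere.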

Step 3. Research

It’s really important to do the research to understand what caveats you could run into. In most cases you won’t be the first person building a wheel; there will have been a bunch of other folks that have done this already and have discussed their issues, problems and experiences online somewhere. It’s equally important to understand how you should be configuring the new gear and how you’re going to reach the final goal. Let’s not forget the logistics of any implementation. Is there enough space, power, cooling… is the power 120V or 220V, do I have the proper PDU and UPS sized properly, do I need 5-15P or C14 power cords?

Step 4. Testing Pre-Migration

No one wants to jump off a cliff without knowing with a high degree of certainty that the parachute is going to open and work. This is the phase where you prove that all the planning and research is going to bear real fruit. If you have a test plan, please make sure you execute it pre-migration. You’d be surprised how many times I run into people telling me that X or Y isn’t working after a network change – only to find out that X or Y hadn’t worked for quite some time.
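Recording the test plan results before the change is what lets you separate real regressions from things that were already broken. A minimal sketch of that comparison, assuming each run is captured as a dictionary mapping a test name to `True` (passed) or `False` (failed); the function name and the sample test names are hypothetical.

```python
def compare_results(pre, post):
    """Classify each pre-migration test by what happened after the change."""
    report = {"regressions": [], "already_failing": [], "fixed": []}
    for name in sorted(pre):
        passed_before = pre[name]
        # A test that was not re-run after the change is treated as failing.
        passed_after = post.get(name, False)
        if passed_before and not passed_after:
            report["regressions"].append(name)
        elif not passed_before and not passed_after:
            report["already_failing"].append(name)
        elif not passed_before and passed_after:
            report["fixed"].append(name)
    return report


pre = {"dns lookup": True, "badge printer": False, "vpn login": True}
post = {"dns lookup": True, "badge printer": False, "vpn login": False}
print(compare_results(pre, post))
```

In this made-up run only the VPN login counts as a regression, while the badge printer lands in `already_failing`, which is the evidence you want in hand on Monday morning.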

Step 5. Execution

Here’s where the rubber meets the road… whether it’s an overnight forklift or a side by side migration, this is what you’ve been planning for. It’s time to get the job done.

Step 6. Testing Post-Migration

Let’s make sure that everything is still working properly… before the users start calling on Monday morning.

Step 7. Turnover

The final hurdle, documentation and the implementation of some type of monitoring and management solution.

Let me know what’s been your largest or most challenging upgrade or migration in the past few years.

Cheers!

Image Credit: sanja gjenero

War story from the frontlines of E-Commerce

February 6, 2015 by Michael McNamara

I’m here to report that I survived my first holiday season working in retail. Thankfully my team and I were able to keep the network infrastructure humming along without any major hiccups or issues, which left everyone extremely happy including our customers. Here’s a short look behind the scenes of our first big sale of the holiday season leading up to Black Friday 2014.

It had all the markings of the Y2K war room, if you were around for that exciting event. There were no fewer than 30 people packed inside a large conference room which held a wall mounted 80″ TV, a floor stand mounted 65″ TV and a 50″ projector, all cycling through a dizzying array of system, application and network performance metrics and sales statistics. There wasn’t a free seat to be had and there were a few people just standing in the background. There were developers, QA testers, marketing specialists, systems engineers, network engineers, mid-level managers (myself included) and a number of executives. There were no fewer than 6 power strips laid out across the conference room table with almost every outlet being spoken for. I’m not sure what Apple was thinking when they designed that power brick for the MacBooks. At one end of the room, the 50″ projector was displaying the new JavaScript/AJAX based sales dashboard against the wall. At the other end of the room, the wall mounted 80″ TV and the floor stand mounted 65″ TV were both cycling through multiple system, application and network dashboards. It was curious to watch the business folks all watching the 50″ projector with the sales figures while the technical folks were all staring just as intently at the 80″ and 65″ TVs. I later noticed that there were another 20-25 people outside the conference room all staring just as intently at their laptop displays.

The sale was scheduled to go from 7:00PM to 11:00PM and had become the kick-off event to the holiday sales season for one of our brands. I had previously learned that there were significant technical issues in previous years so there was a considerable amount of stress and pressure to make sure that everything went off without a hitch. If there were any issues I was there to help quickly identify them and implement any needed workarounds to minimize any disruptions. There was no shortage of wise cracks regarding possible network, firewall or load-balancer issues as the event grew closer. In the months leading up to that night we had run into a number of performance problems and technical hurdles, so while I was confident in my team’s preparation and in the network infrastructure in general I knew success wasn’t necessarily assured. The application that was driving the website was overly resource, memory and bandwidth hungry, which can be extremely problematic when trying to scale an application. While the website was utilizing a premier Content Distribution Network (CDN) it wasn’t as optimized as some of our other brands and would occasionally pull 1MB+ HTML files from origin. While gains had been made in optimizing the website there were still concerns that the load on the web servers, application servers and database servers might be too much for the hardware to sustain.

Earlier that morning around 10:30AM an email campaign had gone out announcing the sale and that led to a significant jump in the number of sessions to the website along with a sizable increase in bandwidth utilization. A second email campaign would go out around 3:00PM which would again lead to an additional increase in the number of active sessions. At 6:30PM a push notification was sent to all mobile app (Apple and Android) users which caused yet another significant spike in the number of connections.

At 7:00PM sharp the sale started with an impressive but steady stream of users shopping the site and checking out. There wasn’t much change until I noticed another increase in the number of connections to the load balancer around 8:30PM. After inquiring with the marketing team we learned there had been another email campaign sent to registered users. There were brief stints here and there where one or two application servers would slow down a little but nothing really significant until 10:30PM. Around that time there was yet another scheduled push notification made to all the mobile app users, at which time more than half of the application servers came under a significant load, each fielding more than 100 active connections from the load balancer. After a few minutes one of the application servers just couldn’t keep up and it became unresponsive, eventually failing out of the server farm with those customers that had been connected to that server being directed to one of the remaining application servers. A few minutes later the server recovered by itself and was automatically placed back into the server farm without any manual intervention and would run fine the rest of the night.

When 11PM came we learned that the executive team had decided to extend the sale an additional hour to midnight but the number of sessions and the traffic were already starting to wind down. We had made it through the 4 hour sale relatively unscathed. There was a small interruption to a few customers when one of the application servers went unresponsive but that issue only lasted about 5 minutes. It was a multi-million dollar event and I felt pretty good regarding the performance of the network infrastructure.

I had a few observations that night… email marketing campaigns if not properly throttled can inadvertently overload a website. Mobile push notifications can have the same effect although I would say they are more detrimental. While we saw significant click-thru conversions on the email marketing campaigns which in turn drove system and network utilization, the peaks out of mobile push notifications were much more severe. I’m guessing the conversion rate was much higher on the mobile app than the email campaigns and as such led to significant upticks in both system and network utilization. In short people were much more inclined to open a push notification and immediately start shopping the website.
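Throttling a campaign can be as simple as splitting the audience into batches and staggering the send times so the resulting traffic arrives in waves instead of one spike. The sketch below only illustrates that idea; the function name, the batch size and the interval are hypothetical values, not the settings of any particular email or push platform.

```python
def batch_schedule(recipient_count, batch_size, interval_minutes):
    """Return (minutes_after_start, recipients_in_batch) pairs for a staggered send."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    schedule = []
    offset = 0
    remaining = recipient_count
    while remaining > 0:
        size = min(batch_size, remaining)
        schedule.append((offset, size))
        remaining -= size
        offset += interval_minutes
    return schedule


# Hypothetical campaign: 250,000 recipients, 100,000 per batch, 15 minutes apart.
for minutes, count in batch_schedule(250_000, 100_000, 15):
    print(f"T+{minutes:>3} min: send to {count:,} recipients")
```

The right batch size depends on how much headroom the application servers have, which is something the load balancer connection counts from a night like this one can help estimate.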

Now the work begins for next year’s holiday season. Here are a few of the bigger network infrastructure projects that my team and I will be working on:

Upgrade Cisco 6509 pair from standalone to VSS cluster
Upgrade Cisco ACE 4710s to A10 Thunder 3030 Load Balancers
Design and deploy Aruba ClearPass for Network Access Control
Design and deploy Infoblox IPAM for DNS/DHCP Services with DNS Firewall

Cheers!

Image Credit: Sias van Schalkwyk

Certificate Life Cycle – A problem for everyone to watch out for

December 14, 2014 by Michael McNamara

There was a lot of news recently around the Hypercom pin-pads (payment terminals) that just stopped working because an internal certificate stored within the device had expired on Dec 7, 2014. While there were some rumblings throughout the retail digital underground there was a post by Brian Krebs cheekily entitled, ‘Security by Antiquity’ Bricks Payment Terminals that exposed the issue to a larger audience.

This is a familiar problem for those of us that have been around for a while. I first ran into this problem around 2007 when our internal root certificate was due to expire. I was the person that had created that root certificate back in 1997, when I first joined my previous employer and deployed an internal PKI utilizing Microsoft’s Certificate Services. The problem wasn’t that we couldn’t renew the certificate, or that we couldn’t push that new root certificate to the thousands of Windows desktops and laptops or hundreds of servers. The problem was how we were going to push the certificate to the hundreds of HP Thin Clients that had a locked flash. The HP Thin Client would initially authenticate via its computer account using 802.1x, which relied on the appropriate certificates being in place and functional. When the user logged in, the HP Thin Client would switch over and re-authenticate via 802.1x as that specific user. We needed to authenticate via the computer account so we could get the devices connected to the network without user intervention and manage them; otherwise the HP Thin Client would need to be physically cabled up to the network. Ultimately there was a small scramble to “upgrade” all our Thin Clients – the upgrade included all the latest security patches and updates along with a new root certificate.

It’s usually public SSL certificates that organizations occasionally forget to renew before they expire. Take the example of Literature & Latte, the folks behind Scapple, the simple flow and diagram charting application for Windows and Mac OS X. Their SSL certificate expired on December 12, 2014 and has yet to be replaced and/or updated which causes Internet Explorer and other browsers to error when connecting to their website.

At the time I was using Microsoft Certificate Services I believe you could only issue a 10 year root certificate, which seemed adequate at the time. Imagine my surprise when 2007 came around and I was still managing that same infrastructure. I believe I read that you can now issue 20 year certificates with a key length of 4096 bits?

The moral of the story here… if you’re using internal certificates you’ll want to make sure you plan accordingly so that your root certificate authority doesn’t just expire one day ten years from now and leave you in the lurch.
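Part of planning accordingly is simply watching the expiry dates. The sketch below uses Python's standard `ssl` module to read the `notAfter` field of a server certificate and work out how many days remain. The function names are assumptions, and `fetch_not_after` uses the default trust store, so it will raise an error for a certificate that has already expired or that chains to an internal root the machine doesn't trust.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp."""
    # notAfter strings look like "Jan  5 09:34:43 2018 GMT".
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host and return the notAfter field of its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `days_until_expiry(fetch_not_after("www.example.com"))` run from a scheduled job, with an alert when the number drops below a chosen threshold, covers the public-facing case; root certificates baked into devices still need to be tracked in an inventory.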

Cheers!

Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak, this is post number 20 of 30. All the posts can be viewed from the 30in30 tag.
