Here’s an old story that I never published… and seeing as I haven’t been writing much lately, I’m going to take the easy route and just publish it.
It’s been another interesting weekend… and by interesting I actually mean a weekend spent working through yet another challenging issue.
Summary
It started back on Thursday with more than a few alerts from my own custom-built monitoring solution. A few years back I wrote a Bash script to help monitor the Internet-facing infrastructure and the numerous VIPs that we host in our Data Centers. That script has worked well over the years, helping validate application availability against network availability. With everything else going on I purposely ignored the alerts, assuming there was some DoS attack or other malady that the Internet was suffering from and that it would soon fix itself. By late Friday afternoon I could no longer ignore the alerts, as they were piling up in my Inbox by the hundreds, and it was long past time to roll up my sleeves and figure out what had broken where.

I initially assumed that I would find some issue or problem with either the hosting company or an Internet Service Provider. A cursory review of the Internet border routers revealed that a few 10Gbps Internet links had bounced within the past 30 days, but everything was running clean from the Internet Service Provider through our border routers, switches and firewalls up to our Internet-facing load balancers. Initially I thought there was an issue with either AT&T or NTT, as a number of the monitoring servers were traversing those ISPs, but after a number of tests I found that packet loss across either of those ISPs was generally less than 0.4%, which isn’t all that bad. If the plumbing was looking good, then why were the alerts firing? I looked at the alerts again and noticed that the messages read “socket timeout” and not “socket connection failure”.
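For reference, a minimal sketch of that kind of per-path loss test looks something like the following; the hostnames are placeholders for whatever is reachable via each ISP, not the actual targets I used.

```bash
#!/usr/bin/env bash
# Rough per-path packet loss check (sketch only; hostnames are placeholders).
# Sends 100 ICMP probes toward a target reachable via each ISP and reports
# the loss percentage from the ping summary line.
for target in vip-via-att.example.com vip-via-ntt.example.com; do
    loss=$(ping -c 100 -q "$target" | grep -oE '[0-9.]+% packet loss')
    echo "$target: $loss"
done

# If mtr is installed, a hop-by-hop view of loss along each path:
# mtr --report --report-cycles 100 vip-via-att.example.com
```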
In any event I ran a quick packet trace using tcpdump from one of the monitoring servers and found that there was traffic flowing, although there was a significant amount of retransmissions and missing packets. It looked like the health checks were timing out at the default of 10 seconds. I increased the timeout to 20 seconds and bingo, the majority of the health checks were now returning successfully. I’m not sure I agree with the verbiage of “socket timeout” since the socket was exchanging information between the client and server; it was more of an overall application timeout, since the request was not completed within the specified timeout value.
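I won’t reproduce the monitoring script here, but the heart of the check boils down to something like the sketch below; the URL is a placeholder and the timeout handling is simplified.

```bash
#!/usr/bin/env bash
# Minimal sketch of an HTTP health check with a configurable timeout.
# The URL and timeout handling are assumptions for illustration, not the
# actual monitoring script.
URL="https://vip.example.com/healthcheck"
TIMEOUT="${1:-10}"   # default of 10 seconds; bumping this to 20 made the checks pass

if curl --fail --silent --output /dev/null --max-time "$TIMEOUT" "$URL"; then
    echo "OK: health check completed within ${TIMEOUT}s"
else
    echo "ALERT: health check did not complete within ${TIMEOUT}s"
fi

# Capturing the exchange from the monitoring server (interface and filter
# are examples):
# tcpdump -i eth0 -w healthcheck.pcap host vip.example.com and port 443
```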
Data Analysis
Now the $1,000,000 question: what had changed that I now needed to increase the timeout?
Thankfully I’ve been logging this data for the past 3+ years, so I was able to import a few of the data points since Sept 2017 (207K rows – 1 every 60 seconds) into Excel and, using the quick chart shortcut (Alt-F1), quickly visualize the data, which provided some interesting results. The amount of time it was taking the health checks to complete had risen significantly in the past few weeks.
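Excel did the job here, but the same summary can be pulled straight from the shell. A rough sketch, assuming a simple “timestamp,seconds” CSV log (not necessarily my actual log format):

```bash
#!/usr/bin/env bash
# Rough command-line alternative to the Excel chart: average health check
# duration per month, from a log of "YYYY-MM-DD HH:MM:SS,seconds" lines.
awk -F',' '
    {
        month = substr($1, 1, 7)        # YYYY-MM
        sum[month] += $2
        count[month]++
    }
    END {
        for (m in sum)
            printf "%s  avg %.2f s over %d samples\n", m, sum[m] / count[m], count[m]
    }
' healthcheck.log | sort
```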
With that data it was now clear that the health checks were failing because they were hitting the 10 second default timeout. But what had happened that it was now taking, on average, longer than 10 seconds for the backend to return the result to the client? Was the backend slower to respond than it had previously been? Was the Internet slower than it had previously been? Was there enough packet loss and retransmissions to impact the timing? Was the size of the data being returned changing?
In short, the answer appears to be a little bit of all of the above (a quick way to sanity-check a few of these from a monitoring server is sketched after the list).
- Was the backend slower to respond than it had previously been? Yes
- Was the Internet slower than it had previously been? Yes (I always assume the Internet is getting more and more congested)
- Was there enough packet loss and retransmissions to impact the timing? Yes (especially with 3K+ miles between the endpoints)
- Was the size of the data being returned changing? Yes (the size of the HTML had increased, causing more data to be transferred)
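A few of these are easy to sanity-check from a monitoring server with curl’s built-in timers; here’s a minimal sketch (the URL is a placeholder).

```bash
#!/usr/bin/env bash
# Sketch: per-request timing breakdown and payload size as seen from a
# monitoring server, using curl's write-out variables. The URL is a placeholder.
URL="https://vip.example.com/healthcheck"

curl --silent --output /dev/null "$URL" --write-out '
DNS lookup:        %{time_namelookup}s
TCP connect:       %{time_connect}s
TLS handshake:     %{time_appconnect}s
First byte (TTFB): %{time_starttransfer}s
Total time:        %{time_total}s
Bytes downloaded:  %{size_download}
'
```

The time to first byte gives a rough read on how long the backend takes to start responding, while the bytes downloaded show whether the HTML payload is growing.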
An interesting but logical side effect: the monitoring servers that were the farthest from the Data Center in question had a greater number of errors. This is logical because they have the greatest latency to reach that specific Data Center, so any packet loss or retransmissions would cause additional delay on top of that latency. This explains why some monitoring servers were reporting no issues or problems while others were reporting all sorts of them. The increased physical distance between the Data Center and the monitoring server was exacerbating the timing because of the inherent packet loss and retransmissions on the Internet, which was further compounded by the growing size of the HTML being transferred across those vast distances and by the increased time it was taking the backend to ultimately serve up the response.
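To put some very rough numbers on that interaction, here’s a back-of-envelope sketch; the RTT, loss rate and page size below are illustrative assumptions, not measurements from this incident.

```bash
#!/usr/bin/env bash
# Very rough back-of-envelope: extra delay from retransmissions grows with
# RTT, loss rate and payload size. All numbers are illustrative assumptions.
RTT_MS=70          # cross-country round trip time
LOSS_PCT=0.4       # per-packet loss rate (%)
PAGE_KB=512        # size of the HTML being returned
MSS=1460           # typical TCP segment payload (bytes)

awk -v rtt="$RTT_MS" -v loss="$LOSS_PCT" -v kb="$PAGE_KB" -v mss="$MSS" 'BEGIN {
    segments = (kb * 1024) / mss
    expected_losses = segments * loss / 100
    # Crudely assume each loss costs roughly one extra RTT to recover.
    extra_ms = expected_losses * rtt
    printf "~%.0f segments, ~%.1f expected losses, ~%.0f ms of extra delay\n",
           segments, expected_losses, extra_ms
}'
```

Double the distance (and the RTT) or double the page size and that extra delay scales right along with it, which is exactly the pattern the distant monitoring servers were showing.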
This is a great example of why you can’t always just blame the network, even though it’s the easiest thing to do.
Resolution
In the end I found a failing 10Gbps SFP in one of the Internet-facing load balancers that needed to be replaced. I placed a monitoring probe on the local network and found the same amount of packet loss and retransmissions, which confirmed that the problem was local to my Data Center. I then failed over between the primary and secondary Internet-facing load balancers and the problem disappeared, so the issue was with the primary Internet-facing load balancer.
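For anyone chasing a similar problem, the load balancer’s own CLI will have its equivalents, but the generic Linux-side checks that point at a dying optic look roughly like this (the interface name is a placeholder):

```bash
#!/usr/bin/env bash
# Sketch: spotting a failing optic from a Linux host. The interface name is
# a placeholder; a load balancer's own CLI exposes similar counters.
IFACE="eth1"

# Interface error/drop counters; steadily climbing errors point at a bad
# optic, dirty fiber or cabling.
ethtool -S "$IFACE" | grep -Ei 'err|drop|crc'

# SFP digital diagnostics (DOM): rx/tx power levels and alarm flags, if the
# module supports it.
ethtool -m "$IFACE"
```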
Cheers!