How ICMPv6 Multicast Listener Reports almost spoiled Christmas

December 21, 2014 by Michael McNamara

If you’ve been following me recently you might recall that I’ve been chasing an issue with a Motorola WS5100 running v3.3.5.0-002R experiencing high CPU utilization. The problem came to a head this weekend and here’s my quick account of the experience.

The WS5100 would intermittently come under extreme load for 5-30 minutes, so much load that ultimately the entire wireless network would collapse as the Access Ports started experiencing watchdog resets and would just continually reboot themselves. This problem would come and go throughout the day or night, we could go 12 hours without an issue and then go the next 12 hours with issues every 30 minutes. The problem was affecting both the primary and secondary WS5100 so I eliminated the hardware almost out of the gate. I have first hand experience running v3.3.5.0-002R software on a large number of WS5100s and have never had an issue with that software release so I really didn’t suspect the software. This wireless solution had been in place for more than 18 months without any major issues or problems. The local engineers reported that there had been no changes, no new devices. So what was causing this problem? I immediately suspected an external catalyst but how would I find it?

As with most highly technical problems it wasn’t until I could get my hands on some packet traces and I had time to dissect those packet traces that I could start to fully understand and comprehend what was actually going on.

Topology

A pair of Motorola WS5100 Wireless LAN Switches with 30 AP300 running software release v3.3.5.0-002R in a cluster configuration with one running as primary and the other running as secondary. The network was comprised of a single Cisco Catalyst 4500 with around ten individual Cisco Catalyst 2960S switches at the edge each trunked to the core in a simple hub and spoke design. The entire network was one single flat VLAN. The WS5100s were attached to the Cisco Catalyst 4500 via a single 1Gbs interface, one arm router style. The peak number of wireless devices was around 200, the total number of MAC addresses on the network was around 525 (this includes the wireless devices).

Symptoms

The initial problem report centered around poor wireless performance and sure enough I quickly found 30-40% packet loss while just trying to ping the WS5100. When I finally got logged into the WS5100 I could see that the CPU was running at 100%. The SYSLOG data showed me that the APs were rebooting because of watchdog timeouts. PTRG was showing me that here was a huge traffic surge being received from the WS5100. I quickly realized that the traffic spikes in the graph correspond to events that users were experiencing problems.

Packet Traces

I directed the team to setup a SPAN port to capture the traffic that was flowing between the WS5100 and the Cisco Catalyst 4500 switch. This would provide me a better idea of what was actually on the wire and might provide a clue as to what was transpiring. The team setup Wireshark to continually capture to disk using a 100MB file size and allowing the file to wrap 10 times for a total of 1GB of captured data. The next time the problem occurred I was alerted within 15 minutes by the help desk and users but I found that we missed the start of the event. There was so much traffic Wireshark only had the past 3 minutes available on disk so we had to increase the filesize to 300MB and the number of wrap files to 25 giving us a total capacity of 7.5GB. That configuration would eventually allow me to capture the initial events along with the time needed to get to the laptop and copy the data before it was overwritten. While I waited for the problem to occur I took to setting up SWATCH to alert myself and the team when the problem started so we could quickly gather all the data points during the start of the event.

Using the data from the packet traces we were able to identify and locate two HP desktops that were apparently intermittently flooding the network with ICMPv6 Multicast Listener Reports.

We removed those HP desktops from the network and everything has been stable since.

Analysis

Here’s the current working theory which I believe is fairly accurate. The HP desktops were intermittently flooding the network with ICMPv6 Multicast Listener Reports. Those packets were reaching the WS5100 and because the network at this location is a single flat VLAN the WS5100 needs to bridge those packets over to the wireless network. It does this by encapsulating them in MiNT in a fashion very similar to CAPWAP or LWAPP. The issue here is the number of packets and the number of access points or access ports. In this case we had 30 APs connected to the WS5100 so let’s do some rough math;

41,000 ICMPv6 Multicast packets * 2 HP desktops = 82,000 packets * 30 APs = 2,460,000 packets

This explains the huge amount of traffic the WS5100 is transmitting. For every ICMPv6 Multicast packet (or broadcast packet for that matter) received by the WS5100, it needs to encapsulate and send a copy of that packet to each and every AP. If there are 30 APs then the WS5100 needs to copy each and every packet 30 times. Now multiply that by the number of ICMPv6 packets that were being received by the WS5100 and you have a recipe for disaster.

A quick search of Google will reveal a number of well documented issues with Intel NICs.

The HP desktops turned out to be HP ProDesk 600 G1s running Windows 7 SP1 with Intel I217-LM NICs driver v12.10.30.5890 with sleep and WoL enabled.

Summary

There were a few lessons learned here;

The days of the single flat network are gone. It’s very important to follow best practice when designing and deploying both wired and wireless infrastructures. In this case if the wireless infrastructure had dedicated VLANs both for the wireless client traffic and for the AP traffic this problem would have never impacted the WS5100. It may have impacted the Cisco Catalyst 4500 somewhat but it wouldn’t have caused the complete collapse of the wireless infrastructure. Unfortunately in this case everything was on VLAN 1, wired clients, APs, wireless clients, servers, IP phone systems, routers, everything.
The filtering of IPv6 along with Multicast and broadcast traffic from the wireless infrastructure is especially important. I posted back in September 2013 how to filter IPv6, multicast and broadcast packets from a Motorola RFS7000, the same applies to the WS5100. Unless you are leveraging IPv6 in your infrastructure, or have some special multicast applications you should definitely look into filtering this traffic from your wireless network.
Validate those desktop and laptop images, especially the NIC drivers and WNIC drivers. In the early days of 802.1x I can remember documenting a long list of driver versions and Microsoft hotfixes required for Microsoft Windows XP (pre SP2) in order to get 802.1x authentication (Zero Wireless Configuration) to work properly.

Conclusion

Wireshark saved this network engineer’s holiday – Thanks!

Cheers!

Note: This is a series of posts made under the Network Engineer in Retail 30 Days of Peak, this is post number 27 of 30. All the posts can be viewed from the 30in30 tag.

Motorola RFS 4000 WiNG 5.5 Captive Portal

November 22, 2014 by Michael McNamara

We use both Motorola and Aruba equipment in our locations. We recently deployed a few newer Motorola RFS 4000s in Spain and the United Kingdom which were running WiNG 5.5. We almost immediately noticed an issue with our externally hosted captive portal where the client would get an error after getting redirected, “Query Variable Qv not found”. That error was being generated by a piece of Javascript code that was running on the externally hosted captive portal pages that parses the Qv value so it can be returned to the RFS4000 to properly identify the user/device that is authenticating via the captive portal.

Here’s an example of WiNG 5.4 forwarding a client to an external captive portal;

"GET /?hs_server=172.16.1.10?Qv=it_qpmjdz=BVQ@bbb_qpmjdz=@dmjfou_njou=532778335@dmjfou_nbd=ED.:C.:D.13.EE.C9@ttj e=VSCBO_HVFTU_XJGJ@bq_nbd=C5.D8.::.33.32.:9 HTTP/1.0" 200 6040 "-" "CaptiveNetworkSupport-305 wispr"

Here’s an example of WiNG 5.5 forwarding a client to an external captive portal;

“GET /?hs_server=172.16.1.10&Qv=it_qpmjdz=BVQ@bbb_qpmjdz=@dmjfou_njou=23:45:9571@dmjfou_nbd=F1.C6.3E.53.:F.:9@ttje=Vscbo _Hvftu_XjGj@bq_nbd=95.35.9E.2:.49.6D HTTP/1.0" 200 6040 "-" "CaptiveNetworkSupport-306.3.1 wispr"

If you look at the URL you’ll notice that the WiNG 5.4 software release utilizes “?” as the variable separator while the WiNG 5.5 software release utilizes “&” as the variable separator.

The Javascript on the captive portal was parsing the URL by splitting it using the “?” value. I could have just changed the value to a “&” but that would have broken the WS2000s and older RFS 4000s that were using the same webpage and Javascript. As a work around I created an additional path, aup.acme.com/wing55, copied all the files and images into that directory and then editted the Javascript in that folder only. I then reconfigured all the RFS 4000s running WiNG 5.5 to use the new path, example; “http://aup.acme.org/wing55/” and left the remaining devices using “http://aup.acme.org/”.

Cheers!

Reference;
http://www.michaelfmcnamara.com/files/motorola/WiNG5_Captive_Portal_Design_Guide_June_2011.pdf

Adopting US Access Ports in GB – Ooppss

November 20, 2014 by Michael McNamara

I ran into another interesting problem today…. we use a combination of Aruba and Motorola wireless equipment. When we have an issue with say the captive portal on a Motorola RFS 4000 it’s pretty easy to take an AP 650 and configure it via DHCP to connect via WISP/WISPE/CAPWAP to the remote controller so you can easily observe the problem first hand. In this case the Motorola RFS 4000 happened to be in Barcelona, Spain although it was configured with a country code of GB (Great Britain). I probably spent the better part of two hours trying to get the AP to adopt to the RFS 4000. I originally thought the problem was related to the AP trying to perform a software upgrade over a 200+ms distance between the AP and WLS but even after I disabled the auto-upgrade feature I would still end up with the following SYSLOG messages;

Nov 20 18:21:59 2014: LED state message WIOS_LED_NO_COUNTRY_0_24G from module DOT11 : %DIAG-6-NEW_LED_STATE:
Nov 20 18:21:59 2014: LED state message RADIO_ALL_LED_OFF from module DOT11 : %DIAG-6-NEW_LED_STATE:
Nov 20 18:21:59 2014: Radio 'ap650-981XXX:R1' changing state from 'Initializing' to 'Off(no country-code)' : %RADIO-5-RADIO_STATE_CHANGE: ff(no country-code)'
Nov 20 18:21:49 2014: RFS-4000 : %AP-6-ADOPTED: Access Point('ap650-981XXX'/'AP650'/5C-0E-8B-98-XX-XX) at rf-domain:'default' adopted and configured. Radios: Count=1, Bss: 5C-0E-8B-31-XX-XX|

I finally realized that the AP650s I had were US models and not WW models. I was able to take an AP300 (WSAP-5110-100-WW) and configure it to connect to the RFS 4000 via DHCP option 189 and ultimately solve the puzzle around the captive portal issue (future blog post).

You can't adopt an AP-0650-60010-US Access Port to a RFS 4K in GB, you need the AP-0650-60010-WW Access Port to do that – SO MUCH TIME LOST!
— Michael McNamara (@mfMcNamara) November 20, 2014

Cheers!

802.11 Wireless LANs vs. broadcast traffic

September 15, 2013 by Michael McNamara

Like many engineers and network managers I’m finding more and more clients are connecting via our 802.11a/b/g wireless network than ever before. While some of the wireless clients are corporate devices which connect to the corporate network, a large number of wireless devices are connecting to the public guest network which connects to the public Internet. At our largest facility we have some 1,500 corporate devices connecting via wireless. However, we can have upwards of 2,000 public devices connecting to our public guest network at any one time. All those smartphones, tablets and computers put out an immense amount of broadcast and multicast traffic which can adversely impact a wireless network.

I originally calculated that the broadcast and multicast traffic was accounting for between 40Kbps and 60Kbps of traffic on our wireless network. However, looking at the traffic graphs right after the change I was shocked at the delta. I performed the change just before noon and you can see a delta of Mbps not Kbps. I would estimate that the changes are saving us 5Mbps of traffic to/from our wireless network.

That’s a lot of needless background noise that ultimately leads to airtime issues which eventually results in retransmissions, delayed packets, jitter and packet loss which can severely impact application performance.

Over the past few weeks I’ve been working to deploy some filters on our Motorola RFS 7000 Wireless LAN Switches (v4.4.2) so I thought I would share them as a best practice in any medium to large scale wireless deployment. If you only have 10 APs then you probably don’t need to worry about filtering the broadcast and multicast traffic. If you have 500 APs then you definitely need to be paying attention to all the needless noise being generated on your wireless network. In the example below I also took the opportunity to block IPv6 frames since we’re still utilizing only IPv4 on our wireless networks.

enable
config t

firewall enable

no firewall stateful-packet-inspection l2

mac access-list extended ARP-ALLOW-ACL
deny any any type ipv6 rule-precedence 10
permit any any type arp rule-precedence 20
permit any any type ip rule-precedence 30

ip access-list extended WLAN-FILTER-BCMC-ACL
permit udp any any range 67 68 rule-precedence 10
deny udp any range 137 138 any range 137 138 rule-precedence 20
deny udp any eq 17500 any eq 17500 rule-precedence 40
deny ip any host 255.255.255.255 rule-precedence 50
deny ip any 224.0.0.0/4 rule-precedence 60
permit ip any any rule-precedence 70

wlan-acl <wlan idx> WLAN-FILTER-BCMC-ACL in
wlan-acl <wlan idx> ARP-ALLOW-ACL in
wlan-acl <wlan idx> WLAN-FILTER-BCMC-ACL out
wlan-acl <wlan idx> ARP-ALLOW-ACL out

You’ll notice that the firewall needs to be enabled. And you need to verify that Layer 2 inspection is disabled.

If you are utilizing VRRP you may need to enable ARP trust on the interfaces relieving the VRRP packets, if you don’t you may see errors such as the following;

sw-wireless.store.acme.org*#Sep 12 11:27:00 2013: %DATAPLANE-4-ARPPOISON: ARP CACHE POISONING: Conflicting ethernet header and inner arp header :Ethernet Src Mac: 00-21-62-E3-XX-XX, Ethernet Dst Mac: 00-15-70-82-XX-XX, ARP Src Mac: 00-00-5E-00-01-C8, ARP Dst Mac: 00-15-70-82-XX-XX, ARP Src IP: 10.1.255.1, ARP Target IP: 10.1.255.19

sw-wireless.store.acme.org*#Sep 12 11:27:25 2013: %DATAPLANE-4-ARPPOISON: ARP CACHE POISONING: Conflicting ethernet header and inner arp header :Ethernet Src Mac: 00-21-62-E3-XX-XX, Ethernet Dst Mac: 00-15-70-82-XX-XX, ARP Src Mac: 00-00-5E-00-01-C8, ARP Dst Mac: 00-15-70-82-XX-XX, ARP Src IP: 10.1.255.1, ARP Target IP: 10.1.255.19

sw-wireless.store.acme.org*#Sep 12 11:27:48 2013: %DATAPLANE-4-ARPPOISON: ARP CACHE POISONING: Conflicting ethernet header and inner arp header :Ethernet Src Mac: 00-21-62-E3-XX-XX, Ethernet Dst Mac: 00-15-70-82-XX-XX, ARP Src Mac: 00-00-5E-00-01-C8, ARP Dst Mac: 00-15-70-82-XX-XX, ARP Src IP: 10.1.255.1, ARP Target IP: 10.1.255.19

Just enable ARP trust on the interface connected to the routers/switches running VRRP;

enable
config t

interface ge1
ip arp trust
exit

Cheers!

Motorola Access Point DHCP Vendor Class IDs

September 8, 2013 by Michael McNamara

Here are the DHCP vendor class IDs for the Motorola Wireless LAN Switches, Access Ports and Access Points;

MotorolaRFS.RFS7000 (RFS7000)
MotorolaRFS.RFS6000 (RFS6000)
MotorolaRFS.RFS4000 (RFS4000)
MotorolaAP.AP7131 (AP7161)
MotorolaAP.AP7131 (AP7131)
MotorolaAP.AP650 (AP650)
MotorolaAP.AP621 (AP621)
MotorolaAP.AP6521 (AP6521)
MotorolaAP.AP6532 (AP6532)
MotorolaAP.AP6511 (AP6511)

The APs will try to associate via a Layer 2 broadcast with a controller, if they fail to adopt via Layer 2 they will issue a a DHCP request with the vendor class IDs listed above.

This is really helpful in your IPAM or DHCP server, you can define specific pools based on the vendor class to return very specific DHCP options. In this case you would probably want to return DHCP option 189 (string) with the IP address of the wireless LAN switch. You can include multiple IPs separated by commas.

Cheers!

How ICMPv6 Multicast Listener Reports almost spoiled Christmas

Topology

Symptoms

Packet Traces

Analysis

Summary

Conclusion

Motorola RFS 4000 WiNG 5.5 Captive Portal

Adopting US Access Ports in GB – Ooppss

802.11 Wireless LANs vs. broadcast traffic

Motorola Access Point DHCP Vendor Class IDs

Recent Posts

Categories

SUBSCRIBE BY EMAIL