It would seem that Nortel has discovered some serious flaws in software 4.1.6.0 for the Nortel Ethernet Routing Switch 8600. Nortel published a bulletin today entitled, “Ethernet Routing Switch (ERS) 8600: System Instability maybe seen after upgraded to 4.1.6.0 software”.
CPU instability issues with Maintenance Release (MR) 4.1.6.0 have been discovered during initial deployment. CPU crash dump events leading to system instability have been reported after the operational SW was upgraded to 4.1.6.0 at different customer sites.
There was also an additional note concerning an interaction with Microsoft’s Network Load Balancing (NLB) running in multicast mode that might result in a CPU crash on an ERS 8600 running 4.1.6.0 software.
Nortel is advising that 4.1.6.0 software has been pulled from their website and replaced with 4.1.6.3 software. They are also advising that 4.1.7.0 software will be available in August 2008.
I haven’t personally seen this problem, but I only have one switch running 4.1.6.0 software out of the 37 ERS 8600 switches we have in production.
Cheers!
Tom says
Hmmm. We are running 4.1.6.0 and it has been very stable. But it is rare indeed when Nortel pulls software off their site, so we will install 4.1.6.3 this weekend.
Michael McNamara says
I would advise that you err on the side of caution and upgrade to 4.1.6.3 if you can get the downtime window.
Thanks for the comment!
QAZ says
Well, 4160 is a fully supported version; if it works for you, then let it go. Even with 4163 I’ve seen some issues: not the errors mentioned in the bulletin, but a core dump in /pcmcia/wdtlog.
Gene says
My environment is dual core, and mesh cores. I have had a recurring problem with my dual cores, using the high density fiber R-modules with dual 8692SF fabrics per core. The problem I had with the 4.1.3.0 code was that I would get a rash of devices whose IP addresses had an odd-numbered fourth octet that would not function, no matter the VLAN, static or DHCP. We also had one of the 8692SF cards go south. After that, we went to the 4.1.6.0 code. The problem now, from what I can tell, is that there seems to be some sort of VRRP routing confusion or issue with some of my DHCP VLANs. When a problem came up, I tried to catch the VLANs on each core trying to be master or backup at the same time in the VRRP properties on each core, but it never showed up that way. To solve the problem, I had to disconnect one of the SMLT fiber links to the IDF closet; once that SMLT became an SLT, the problem would go away. I also had an issue where my edge forwarding table would not show a MAC address of a connected device, yet the MAC would show up in the 8600 ARP table and correctly point to the downlink. I also had MAC addresses point to the edge fiber uplink when the device was connected to that edge switch stack. I sure hope the 4.1.6.3 code solves my problems. My cores support a large hospital and these multiple code upgrades and switch fabric failures are getting old. I sure hope Nortel gets their act together. Sounds like they have lost some of their brain trust to the competition, or something.
Michael McNamara says
Hi Gene,
I would agree with you that the last few versions of ERS 8600 software have left a sour taste in my mouth as well. I actually just upgraded a number of dual core switches to v4.1.6.3, which seems to have some SMLT/ARP/FDB issues. While I have four locations running dual ERS 8600 switches in an IST pair, only the largest site (45+ closets) shows the SMLT/ARP/FDB issue. Nortel has responded to my support case that they are aware of multiple SMLT/ARP/FDB issues and are planning to release 4.1.8.0 sometime in November 2008, which should resolve the problems. They’ve also advised that 4.1.7.1 might not be affected by the problem as frequently as 4.1.6.x software.
I believe the problem you are referring to with the MAC address not showing up in your FDB table is very similar to my current SMLT/FDB/ARP problem. Only in this case one of the core switches has no FDB entry for the affected end device and the ARP table entry points to the IST, which it shouldn’t since there is a directly connected SMLT port to that downstream switch.
If you are using SLPP with VLACP over SMLT links there is a known problem right now, although I need to dig up some additional information on that problem.
Thanks for the comment and good luck!
QAZ says
Michael, you wrote in your response to Gene that Nortel is aware of FDB/ARP issues. I do not know your case number, but I cannot find any related Q-number in their known issues tool https://app107.nortelnetworks.com/BT/bug_search.jsp I’m really interested since we are experiencing the same on a customer site.
Michael McNamara says
I haven’t been provided any specific CR numbers as of yet but the SMLT/FDB/ARP issue is generally well known and can be pretty easy to identify by dumping the FDB and ARP tables during an event (a period of time you are unable to reach a device that is physically connected to a switch that is SMLT connected to the core).
The IST provides the two core ERS 8600 switches with the ability to sync up the FDB/MAC table. When the problem is present one core switch doesn’t have an FDB/MAC table entry for the affected end device (or if it does, the entry doesn’t point to the correct downlink port).
Here’s an example for a device (10.2.255.23/00:1a:8f:89:c8:00) which is connected to an ERS 5520 switch which itself is SMLT connected to ports 8/16 on both core ERS 8600 switches. Here are the ARP and FDB table entries from both switches for that specific device during the problem:
ERS8600-A
10.2.255.23 00:1a:8f:89:c8:00 200 8/16 DYNAMIC 2125
200 learned 00:1a:8f:89:c8:00 Port-8/16 false 1 false
ERS8600-B
10.2.255.23 00:1a:8f:89:c8:00 200 MLT 1 DYNAMIC 2157
NO FDB ENTRY
Those of us that have seen this problem before will immediately notice that either something is very wrong or port 8/16 should be down on ERS8600-B. Unfortunately port 8/16 on ERS8600-B is up and functioning so that leaves something very wrong.
Here’s what the FDB and ARP table entries look like when there is no problem;
ERS8600-A
10.2.255.23 00:1a:8f:89:c8:00 200 8/16 DYNAMIC 2121
200 learned 00:1a:8f:89:c8:00 Port-8/16 false 1 false
ERS8600-B
10.2.255.23 00:1a:8f:89:c8:00 200 8/16 DYNAMIC 2159
200 learned 00:1a:8f:89:c8:00 Port-8/16 false 1 true
The trick is usually in capturing the data (show ip arp info/show vlan info fdb-entry) while the problem is actually occurring. That’s where I employ Perl and Expect in a scripting solution.
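Once the two tables have been captured, the comparison itself can be automated. The following is only a minimal Python sketch, not the Perl/Expect solution mentioned above: it assumes the ARP and FDB lines look exactly like the samples quoted in this comment (real switch output has headers and extra columns, which the patterns simply skip), and the function names are hypothetical.

```python
import re

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# Assumed ARP line layout: IP, MAC, VLAN, port (or "MLT n"), type, TTL.
ARP_RE = re.compile(
    rf"^(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<mac>{MAC})\s+(?P<vlan>\d+)\s+"
    r"(?P<port>MLT \d+|\d+/\d+)\s+(?P<type>\S+)\s+(?P<ttl>\d+)\s*$",
    re.IGNORECASE,
)
# Assumed FDB line layout: VLAN, status, MAC, "Port-slot/port", ...
FDB_RE = re.compile(
    rf"^(?P<vlan>\d+)\s+(?P<status>\S+)\s+(?P<mac>{MAC})\s+Port-(?P<port>\d+/\d+)\b",
    re.IGNORECASE,
)


def parse_fdb(text):
    """Map (vlan, mac) to the port the FDB entry points at."""
    table = {}
    for line in text.splitlines():
        m = FDB_RE.match(line.strip())
        if m:
            table[(m.group("vlan"), m.group("mac").lower())] = m.group("port")
    return table


def find_suspects(arp_text, fdb_text):
    """Return ARP entries with no FDB entry, or whose FDB port differs."""
    fdb = parse_fdb(fdb_text)
    suspects = []
    for line in arp_text.splitlines():
        m = ARP_RE.match(line.strip())
        if not m:
            continue  # header or unrelated line
        mac = m.group("mac").lower()
        fdb_port = fdb.get((m.group("vlan"), mac))
        if fdb_port is None:
            suspects.append((m.group("ip"), mac, m.group("port"), "no FDB entry"))
        elif fdb_port != m.group("port"):
            suspects.append((m.group("ip"), mac, m.group("port"),
                             f"FDB points to {fdb_port}"))
    return suspects
```

Fed the ERS8600-B capture from the failure case, `find_suspects` reports the device as having an ARP entry on `MLT 1` with no FDB entry; fed the ERS8600-A capture it reports nothing.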
Thanks for the comment!
Michael McNamara says
Here’s the CR for the issue described above (Q01869054);
On one of the aggregation switches (IST peers) arp is pointing to IST. When arp is pointing to IST, the fdb entry of the device is missing on the switch. The fdb entry, however, is present on the other IST peer. This causes connectivity loss. The problem corrects itself sometimes in seconds and sometimes longer. The problem is seen due to the fdb entry getting aged out, and for some reason the fdb entry is not getting quickly relearned from the arp reply, hence causing the arp to point to IST. Once the fdb is learned, the ERS corrects the arp, making it in sync with the fdb entry.
However, the traffic is lost until the fdb entry is learned and the arp gets corrected.
Cheers!
Michael McNamara says
I received a workaround for the CR described above from a local Nortel sales engineer;
“You may want to try setting every VLAN FDB age-out value to 21601, which is 1 second higher than the default system ARP time-out (6 hours or 21600 seconds).
From what I know, if 21601 is used, CR 1869054 is not seen.” -Al
I’ve personally deployed this workaround at three sites and it does resolve the problem.
Cheers!
Gene says
Michael
What is the “VLAN operation action” setting for the age-out reconfig? Also, are your VLANs all byport or a mix with byipsubnet? My ARP is set to 1200 seconds. So you say to make my VLANs 1201 and the ARP to stay at 1200? My current config is that the byipsubnet VLANs are 600 and the byports are 0 (zero). TIA
Michael McNamara says
Hi Gene,
Sorry about that… I should have thought enough to provide that information. All the VLANS in my network are byport. Here is an example command for VLAN 1 (Default VLAN);
config vlan 1 fdb-entry aging-time 21601
The default ARP timeout value is 360 minutes which is 21600 seconds (360 * 60 = 21600), so you need to make the FDB aging-time 1 second greater than the ARP timeout so that there will always be an FDB entry so long as there is an ARP entry.
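If you have many VLANs to touch, the arithmetic and the commands can be generated rather than typed. This is a minimal Python sketch under stated assumptions: the command syntax is the one shown above, while the function name and the VLAN IDs in the usage note are hypothetical.

```python
def fdb_aging_commands(vlan_ids, arp_timeout_minutes=360):
    """Build one aging-time command per VLAN.

    The FDB aging time is set to the ARP timeout (converted to seconds)
    plus one second, so the FDB entry outlives the ARP entry.
    """
    aging_seconds = arp_timeout_minutes * 60 + 1
    return [
        f"config vlan {vlan_id} fdb-entry aging-time {aging_seconds}"
        for vlan_id in vlan_ids
    ]
```

For example, `fdb_aging_commands([1, 200])` yields the command above for VLAN 1 plus the same command for VLAN 200, and an ARP timeout of 20 minutes (1200 seconds) gives an aging time of 1201.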
Cheers!
Rene says
Hi Michael,
You have a very good webpage!
We had the same problem with the 4.1.6 firmware, but the timer change didn’t solve the problem. Nortel told us to perform a downgrade to 4.1.5.
OK, now we are running the 4.1.5 firmware, but the problem still exists!
The problem is that the switch drops the traffic and doesn’t have an entry in the ARP table. A ping from the switch fixes it.
We have four ERS 8600s: two IST pairs connected in a fully meshed SMLT.
On the ERS 8600 line cards we have HP blade centers with Cisco switches in an SMLT configuration.
The strange thing about the bug is that sometimes we have it and sometimes we don’t.
I’m at my wits’ end, and I think my consultants are too.
Normally I work with Cisco 6500 switches, so this is my first contact with Nortel and it’s not the best start :-(
I hope you have some ideas.
Bye
René
Michael McNamara says
Hi Rene,
I’ve personally experienced the bug you are referring to. The latest stable version of code that I ran was v4.1.1. I didn’t have any IST/SMLT or ARP/FDB issues with v4.1.1 software. A fully meshed four way switch cluster can be very complicated. When you made the FDB aging change did you change every VLAN on all four core switches? If you are still having the problem I would suggest you continue to lean on Nortel and have the case escalated. I did hear through the grapevine that v4.1.8 is supposed to be released by Friday of this week.
Good Luck!
michael gagnon says
i ran into the same issues upgrading from 4.1.1.1 –> 4.1.7.2 over holiday shutdown a month ago. i’m surprised it took nortel (and escalation) so long before someone got a fix for us (approx 3-4 days going back and forth with escalation). the FDB-ageout entry fix worked immediately. still waiting for 4.1.8.0 to arrive; anyone else hear of an ETA?
Michael McNamara says
I believe we should see 4.1.8.2 within the next week or two.
While I can’t say much right now (Non-disclosure agreement), I will be happy to comment on 4.1.8.2 when it is released.
Thanks again for the feedback!
Gene says
We are now at 4.1.7.2 as of two weeks ago. The only error of note in the logs is:
CPU5 [01/05/09 04:18:09] SW WARNING dpmGetActivityBit:ltrSyncSend FAILED for MAC_ADDR/MAC_VLAN. MAC addr: 0x000e7f36ecc8,pimport: 4, destslot: 3, status=9
I got two instances of this on two user devices. Nortel ticket opened and they stated if it does not happen a great deal, don’t worry. It is some sort of “I’m too busy right now to handle your specific need, but I’ll log this message for you” type of error. I don’t know. I have not noticed any problems since going to 4.1.7.2.
These entries appear on both IST-connected ERS cores. Things have been stable, thus far. Glad to dump the 4.0.3.0/4.0.6.0 crap.
Gene
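Since the advice hinges on how often the warning appears, a log capture can be tallied per MAC address. The following is a minimal Python sketch, assuming each warning sits on a single line in the format quoted above (a wrapped line will not match); the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern assumes the single-line layout of the warning quoted above.
LTR_RE = re.compile(
    r"\[(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]\s+SW WARNING\s+"
    r"dpmGetActivityBit:ltrSyncSend FAILED.*?MAC addr:\s*0x(?P<mac>[0-9a-fA-F]{12})"
    r",\s*pimport:\s*(?P<pimport>\d+),\s*destslot:\s*(?P<destslot>\d+)"
    r",\s*status=(?P<status>\d+)"
)


def count_ltr_failures(log_text):
    """Count ltrSyncSend FAILED warnings per MAC address (lowercased)."""
    counts = Counter()
    for match in LTR_RE.finditer(log_text):
        counts[match.group("mac").lower()] += 1
    return counts
```

Running `count_ltr_failures` over a saved log returns a `Counter` keyed by MAC address, which makes it easy to see whether the warning is a rare event or is repeating for the same device.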
Rene says
Do you use ECMP routing? We have an issue with ECMP and SMLT.
After we disabled ECMP the issue was gone. Do you have more info about your network?
Michael McNamara says
Welcome back Rene!
I use Equal Cost Multi-Pathing (ECMP) in my WAN routing to provide redundancy and high availability between my sites. I bridge the majority of my LAN traffic between the edge and the core across a SMLT connection.
Here’s a test for you… re-enable ECMP but disable one of the SMLT links, re-test your problem. Then enable the original SMLT link and disable the other SMLT link, re-test your problem. Does the problem persist if you are only running a single link?
Good Luck!
Michael McNamara says
For those that may have an email subscription to this thread: Nortel has finally released v4.1.8.2 software for the Ethernet Routing Switch 8600.
http://blog.michaelfmcnamara.com/2009/02/ethernet-routing-switch-8600-software-release-v4182/
Mary says
Hi Gene,
I have the same message on an 8600 running version 4.1.4.0:
CPU6 [03/04/09 21:53:37] COP-SW ERROR Slot 1: dpmRxSyncReceive: CommandType 155 UnKnown
CPU6 [03/04/09 21:53:37] SW WARNING dpmGetActivityBit:ltrSyncSend FAILED for MAC_ADDR/MAC_VLAN. MAC addr: 0x00188b1f784a,pimport: 1, destslot: 1, status=9
You said that you have opened a Nortel case; what was their explanation for the error message?
Thanks.
Gene says
Mary
From Nortel:
Thank you for choosing Nortel.
I will explain it again for better clarification.
[*Mar 1 07:18:04 [172.20.133.3.12.24] CPU5 [03/01/07 20:28:12] SW WARNING dpmGetActivityBit:ltrSyncSend FAILED for MAC_ADDR/MAC_VLAN. MAC addr: 0x00130abbb800,pimport: 15, destslot: 1, status=9*]
This is occurring because dpmGetActivityBit is sending an LTR message to the COP; if the COP receives the message, it returns the activity of that particular MAC.
But sending the message from the CP to the COP failed. It failed because ltrSyncSend is returning LTR_INTERNAL_ERR(9), /* internal platform-dependent error */
ltrSyncSend tries to send the packet through the transport layer mechanism (SyncTransportLayer), which in turn returns the above error when it is not able to take the semaphore/resources.
This problem is likely not impacting the work of the device.
The reason for the above message is: the main CPU is not able to send the message to the COP (I/O blade CPU). The message sending mechanism failed to transfer the message due to lack of resources. Probably the message sending mechanism is busy sending other higher priority messages in the queue, or there is already a big queue of messages to be transferred.
(It will potentially happen if large tables exist on the device and there are timing issues aging these tables between CPU and R-Module.)
I have a few questions for you:
1. How often do you see this message?
2. Where do you see this? I mean, does it show up on both Passports at the same time?
3. Does it create any connectivity issues?
4. Is it throwing out this message continuously?
5. What is the state of the ports at the time the message is displayed?
6. Even if the port states show green, use a sniffer to check whether traffic is flowing properly.
If the message is displayed once in a long while and we are not having any connectivity issues, we don’t have to worry about it. The CPU will recover by itself. Please send me the updates as soon as possible.
Thank you,
Legacy says
Hi Gene,
I’m not really sure if it’s exactly the same issue posted on this blog; however, we have been noticing a similar situation where there are no ARP entries on both 8600s (cluster) for some of the layer 2 edge devices. It’s usually one or two edge switches at a time out of 600+.
On the other hand, there are NO service interruptions to the business. EUs are able to make phone calls, access business applications, etc. (CLI and JDM work fine when you are physically connected to the edge switch.)
The temporary workaround is to ping the edge switch from one of the 8600s and suddenly all is back to normal; sometimes it resolves itself without any action taken by us.
Code Levels on 8600’s (Cluster): 5.1.3.1
Code Level Edge Switches: 5.1.1.017
Can you please advise if this is the same issue?
Thank You!
Michael McNamara says
Hi Legacy,
How many VLANs are involved in the problem? Are you running VRRP or RSMLT in those VLANs? Were there any changes to any of the VLANs recently?
When I had that problem in 4.1.6.x and 4.1.8.x the only long term solution was to reboot each of the ERS 8600 switches (one at a time of course). You can utilize the MAC/FDB aging fix to help mitigate the issue. I’m not sure if it will resolve the issue completely, but it’s a good practice and helps save the CPU from wasting needless cycles.
config vlan 1 fdb-entry aging-time 21601
I will say that I have not personally seen this issue on 5.1.x software.
Good Luck!