I’m not talking about the song by 50 Cent… I’m talking about those days when you ask yourself, “When is the insanity going to end?”
I’m going to get on my soapbox here, so I’m warning everyone right now. It’s no secret that I believe Nortel has some great products, but I’m not a fanboy by any stretch of the imagination; just ask my Nortel sales representative. I’m sitting here tonight asking myself what is up with the software on the Nortel Ethernet Routing Switch 8600. In the past four days I’ve experienced three priority 1 (network down) issues and one priority 2 (network severely impacted) issue at three physical locations, all involving Nortel Ethernet Routing Switch 8600s.
I had thought that 4.1.8.x software was the savior of the Nortel Ethernet Routing Switch 8600. Those of us who have been around know the headaches that 4.1.6.x brought. Add to that Nortel removing 5.1 and 5.1.1 from their website and I’m left scratching my head wondering what’s going on. Next week Nortel is supposed to be coming in to discuss the new features in 7.x software, yet I’m still waiting for something in the 4.x and 5.x branches that’s reliable and rock solid.
A few weeks back I discovered the now infamous ARP issue on an ERS 8600 switch cluster running 4.1.8.0 software. Until that time it was believed (by myself anyway) that the ARP issue was resolved in 4.1.8.x software. The resolution was a reboot of the affected ERS 8600 core switch. Thankfully, since the core switches were completely redundant to each other, the reboot wasn’t visible to our customers.
This week has been the equivalent of hell week. On Tuesday we discovered a closet ERS 8600 switch (4.1.6.3) flooding the network with ARP packets at a rate of 500 per second. On Wednesday that same closet ERS 8600 switch introduced a loop in the network at 4:00 AM. Thankfully, rate limiting was able to limit the packet flow and CP-Limit was able to isolate the closet by shutting down the uplinks to the affected closet. Later Wednesday morning we had a core ERS 8600 switch (v4.1.8.3) stop routing packets on a 10Gbps 8683XLR interface. Unfortunately OSPF wasn’t told about the problem and continued to route all WAN packets into a black hole. Later Wednesday afternoon we had another core ERS 8600 switch (v4.1.8.0) where an 8684GTR card decided to up and restart on us.
I’ve always been cautious about building overly complicated networks, for the simple reason that those designs can themselves introduce more downtime and issues than a simpler approach would. In this case stability and reliability are becoming major concerns of mine with respect to the Ethernet Routing Switch 8600.
What experience have you had with the Ethernet Routing Switch 8600?
Cheers!
bylie says
We also have an issue with our central Nortel 8600 cluster. Last month we upgraded our software to v5.0.1.0 after some stability issues. Last Friday the entire cluster locked up on us. After some digging and opening a support case with Nortel, we’re now quite sure that there was an issue between the current software and some leftover settings. The problem was related to the CPU buffer filling up completely, after which the entire cluster stops pushing data around. Not amusing! It actually seems to be a bug, because Nortel said it would be fixed in firmware v5.1.2.
Michael McNamara says
Hi Bylie,
Isn’t it frustrating how the switch can sometimes go half in the can? I would much rather it went completely belly up; that way the redundancy could kick in. I must admit that I’m probably being a little harsh on Nortel. I manage a fairly large network, so the law of averages dictates that I’m statistically inclined to experience more problems just because I have more equipment where problems can develop.
I have heard really good things about the v5.0.5 software release from a number of people now, so you might want to consider v5.0.5 as an option.
Thanks for the comment!
gby says
Hi
I have been using the Nortel Passport 8600 since 2000, and until now it has been a very reliable and stable platform.
But to be honest
– I have been using them in a pure level 2 design (using MLT/SMLT), no level 3 features activated on them
– Level 3 are implemented on external devices (non Nortel ones)
– Design was simple (keep it simple, keep it simple, …)
– Until now we have made one upgrade per year, waiting for others to debug issues
– We then migrated from release 4.1.6.3 to release 4.1.8.3 a few weeks ago, and since this migration we have had 2 major outages: one was due to a Switch Fabric reset and the other to bad save config behavior
So it seems that this release has some issues that others did not
Of course the recommendation from Nortel is to go to 5.x to correct these bugs
If release 5.1.x, as you state, is going to be removed by Nortel from the website, I suspect a serious problem in code quality over the last 6 months … Something related to their ability to provide ongoing support while being bankrupt?
but FYI I have just checked and release 5.1.1 is still available for download
Gby
Michael McNamara says
Hi Gby,
I would generally agree with you concerning the reliability and stability of the platform; however, the recent software releases have left serious questions in the minds of customers like myself. I understand from Nortel that a majority of the code changes behind their IST/SMLT re-design were to accommodate larger networks and provide better scalability in larger configurations. I can understand their motivations, but basic stability and availability are paramount for almost any network these days. I’m still happy with v4.1.8.3 but it seems there’s always some event that calls my faith into question.
Just as a note I’ve heard a few comments about LANE lockups on the 8612XLRS modules when running 5.1.1.1 software
Thanks for the comment
Ben says
We recently forklifted our ERS8600s out of our core in favor of a Cisco VSS 6509 pair. We battled issues for over two years related to instability of our core ERS8600s. It seemed like anytime we tried to make a change to the edge, for example taking down a fiber link for maintenance, we would throw the cores into a tizzy. Generally the “solution” was to reboot our cores.
Michael McNamara says
Hi Ben,
How has the Cisco VSS solution worked out for you? I have a location with an existing Cisco Catalyst 6509 where I need to add a redundant core switch and VSS is an option.
The Nortel 8600s have worked very well; it’s just frustrating to hit the occasional operational snag every once in a while. The hardware has been very reliable, but as I’ve mentioned, the software releases ever since v4.1.5.4 have been terrible, especially if you are running in an IST/SMLT configuration. While we don’t run every protocol possible, we do drive our cores pretty hard: 100+ VLANs, IST, SMLT, RSMLT, PIM-SM, OSPF, BGP, SYSLOG, etc.
I’ll soon be cutting my teeth on Cisco’s Nexus 7000 switch, so you might hear me screaming about them pretty soon.
Thanks for the comment!
Ben says
The Cisco VSS solution has worked excellently in our environment. Unlike the ERS8600, it operates 100% as one logical switch, similar to a stack of Nortel 5510s or Cisco 3750s. We have redundant fiber to all of our closets and have them set up as Multichassis EtherChannels (MECs), which are similar to SMLT but in my experience work a lot better. There is only one active supervisor between the two chassis, and the active/active detection is done both by the VSS (SMLT) links and by a dedicated heartbeat link.
silvermachine says
We have one major customer running 5.1.1.1 code on a central Nortel 8600 cluster. The 5.1.1.1 code seems not to be as rock solid as the 4.1.5.4 code. We faced a lot of upgrade problems, and some configurations had to be changed to solve broadcast storms and high CPU utilization issues. The problems were related to misconfigurations of DHCP relay agents and NIC teaming (MLT).
Michael McNamara says
Hi Silvermachine,
I’m curious if your customers are running configurations with any 8612XLRS modules? I’ve heard some not too pretty feedback from a very large Nortel customer that they won’t allow 5.1.1.1 software back into their organization because it caused so many issues similar to LANE lock-ups. That same customer is now running 5.0.5 and has been very pleased with that software release and the 8612XLRS modules.
That’s an interesting comment concerning DHCP relay agents and NIC teaming configurations.
Thanks for the comment!
SilverMachine says
Hi Michael,
No, they are not running 5.1.1.1 code with 8612XLRS modules. No LANE lock-ups until now. They have been running 5.1.1.1 code for more than 2 weeks without problems.
Concerning DHCP Relay Agents on a cluster running 5.1.1.1 code: the relay agent IP should not be the same as the one configured as the Virtual IP Address (VRRP). If it is, it will start a broadcast storm (DHCP Offer packets) on the IST link!!
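If you want to check your own configuration for this overlap, something like the following Python sketch could help. It is only an illustration under assumptions: the function name and the two address lists are hypothetical, and you would have to collect the addresses from your own configuration yourself, since nothing here talks to the switch.

```python
import ipaddress

def relay_vrrp_conflicts(relay_agent_ips, vrrp_virtual_ips):
    """Return the addresses that appear both as a DHCP relay agent IP
    and as a VRRP virtual IP (the overlap described above)."""
    # Parse with ipaddress so that invalid strings are rejected early.
    relay = {ipaddress.ip_address(ip) for ip in relay_agent_ips}
    vrrp = {ipaddress.ip_address(ip) for ip in vrrp_virtual_ips}
    return sorted(str(ip) for ip in relay & vrrp)

# Hypothetical sample addresses, not taken from any real configuration.
relay_agents = ["10.1.1.1", "10.1.2.2"]
vrrp_virtuals = ["10.1.1.1", "10.1.3.1"]
print(relay_vrrp_conflicts(relay_agents, vrrp_virtuals))
```

Any address it prints would be a candidate for the broadcast storm described above and worth changing before the upgrade.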
Best Regards!
udo says
Hi,
We use 2 ERS86xx in the core without using SMLT but using IST. Every switch on the campus has just one uplink (no SMLT is used yet).
Three months ago we updated from 4.1x to 5.11 because in the future we will have to use some 8612XLRS cards.
With software 5.11 we had big trouble with the ARP table.
Some hosts were unable to connect while others were able to (all in the same VLAN). That was hell!
The only way to work around this bug was to switch off the IST between the ERS8600s (a big thanks to Nortel for the very fast support!).
Last week we upgraded from 5.11 to 5.111.
Now it looks like the bugs are gone (the IST is running).
Udo
Michael McNamara says
Hi Udo,
Thanks for the comment. It would seem that a lot of folks are happy with 5.1.1.1 software, however, I will warn you that I’ve heard about issues with the 8612XLRS modules. You might want to contact Nortel support to inquire about any known issues with the 8612XLRS modules if you haven’t already installed them.
Cheers!
Mark says
We have been running 5.1.1.1 on clustered 8600s for more than a month with no issues (yet). 5.0.2 was interesting: an SLPP issue that would lock up the CPUs and a lane lockup in the lane with the 10-gig ports on the 8634XGRS. 5.1.1.1 seemed to fix both of these.
We went to the 5.x code stream just about a year ago in order to bump our server farm backhauls up to 10GigE with 5600 TOR and multiple 8612XLRS cards. SuperMezz cards caused us many critical failures.
Our advice: if you are on 5.x code and you are seeing CPU lockups or IST issues, disable SLPP and the SuperMezz cards. To this day we have not re-enabled the SuperMezz cards.
Mark
Michael McNamara says
Hi Mark,
I’m happy that another Nortel customer is running 5.1.1.1 software without any issues. Also happy to hear that you haven’t experienced any LANE lockups on 5.1.1.1 software.
That’s an interesting comment regarding the SuperMezz cards and SLPP. Do you use VRRP in your configuration? Do you have unique VRRP IDs across the switch cluster? I ask because there is a known issue that we discovered last May that may be related. In short, you need to have unique VRRP IDs across the switch; you can’t use VRRP ID 1 for VLAN 1, VRRP ID 1 for VLAN 2, VRRP ID 1 for VLAN 3, etc.
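For anyone who wants to audit their own switches for reused VRRP IDs, the check can be sketched in a few lines of Python. This is only a rough illustration: the `vlan_to_vrid` mapping is hypothetical sample data that you would have to build from your own configuration, and the function name is mine, not part of any Nortel tool.

```python
from collections import defaultdict

def find_duplicate_vrids(vlan_to_vrid):
    """Return {vrrp_id: [vlans...]} for every VRRP ID used on more than one VLAN."""
    vlans_by_vrid = defaultdict(list)
    for vlan, vrid in sorted(vlan_to_vrid.items()):
        vlans_by_vrid[vrid].append(vlan)
    # Keep only the VRRP IDs that are shared by two or more VLANs.
    return {vrid: vlans for vrid, vlans in vlans_by_vrid.items() if len(vlans) > 1}

# Hypothetical sample data: VLAN ID -> VRRP ID configured on that VLAN.
vlan_to_vrid = {1: 1, 2: 1, 3: 1, 10: 10}
print(find_duplicate_vrids(vlan_to_vrid))
```

An empty result means every VLAN has its own VRRP ID; anything else lists the VLANs that share one.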
Thanks for the comment!
Mark says
We only use VRRP when we have to. We use RSMLT, which helps when SMLT converges while an 8600 in the cluster is coming back up. The problem is you can only have 32 SMLTs per RSMLT instance. We found this out the hard way. If you go over 32 it causes instability in the IST and things go nuts. A couple of VLANs (like the management one) have way more than 32 SMLT instances.
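A quick way to spot VLANs that cross that line is a small script like the Python sketch below. Treat it as an illustration only: the limit of 32 is the number from our own experience above rather than something I am quoting from Nortel documentation, and the `smlt_ids_by_vlan` data is hypothetical and would need to come from your own configuration.

```python
# Limit as reported in this comment; an assumption, not a documented vendor value.
RSMLT_SMLT_LIMIT = 32

def vlans_over_limit(smlt_ids_by_vlan, limit=RSMLT_SMLT_LIMIT):
    """Return {vlan: smlt_count} for every VLAN carrying more SMLTs than the limit."""
    over = {}
    for vlan, smlt_ids in smlt_ids_by_vlan.items():
        count = len(set(smlt_ids))  # ignore accidental duplicate SMLT IDs
        if count > limit:
            over[vlan] = count
    return over

# Hypothetical sample data: VLAN ID -> SMLT IDs that carry that VLAN.
smlt_ids_by_vlan = {
    100: list(range(1, 41)),  # a management-style VLAN on 40 SMLTs
    200: list(range(1, 11)),  # a normal VLAN on 10 SMLTs
}
print(vlans_over_limit(smlt_ids_by_vlan))
```

With the sample data only VLAN 100 is reported, since it is the only one above 32.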
We do have unique VRRP IDs across the switch.
Chuck says
We’ve been a Nortel customer for over a decade and have used the 8600 since its initial release. We’ve also beta tested just about every single release since then and have been through more than our share of problems. The SMLT rearchitecture of 4.1.8, 5.0.1 and 5.1 has been a rough ride. 5.1.1.1 finally fixes most of our issues with IST and box instability. We are still working on one more issue that seems to be related to HA-CPU & PIM/IP Multicast that causes the IST to go down when disconnecting an SMLT edge link. With PIM disabled, the issue disappears. We also have an 8612XLRS in use on 5.1.1.1 and have not experienced any LANE lockup issue.
Mark says
“ARP issue on a ERS 8600 switch cluster running 4.1.8.0”
Was fixed in 4.1.8.3. How can you manage so many versions of code? We’re always on the same code base with clusters and even standalones now.
Mark says
bylie on December 4, 2009 – 2:26 am
Are you running SLPP? In 5.0.1 with SLPP enabled, the CPU buffer filled up (99%) on one side of the cluster, causing the peer CPU to spike to 100%. This caused instability in the IST. If you globally disable SLPP, the problem goes away. This was due to the SLPP MAC address being placed in the forwarding table twice.
We tested that this is fixed in 5.1.1.1 and 5.0.5.
Mark says
Chuck
We were not experiencing actual LANE lockups on the 8612XLRS with 5.0.1. The group of four 10-gig ports in LANE one would stop transmitting packets, but was still receiving them. Links up. No errors. We were lucky that VLACP logically brought the affected ports down (although they were physically still up) and the SMLT peer took over. Without VLACP the traffic would have been black holed.
Tom says
We use 4.1.8.3 on our ERS 8600’s and run ~60 VLANs, IST, SMLT, PIM-SM, RIP, SYSLOG, etc. Things have been fairly stable, but it’s because of reported problems as discussed in this thread that we have held off on upgrading to 5.x.x.x. Ever since last year when we found the VLACP timeout issue after installing an ERS 45xx software update (yes, we found that problem and it was a pain) – see http://www.michaelfmcnamara.com/files/VLACP%20timeout%20issue.pdf – I’ve been very careful and leery about software upgrades on core switches. We will continue that policy until I feel more confident that the Nortel ERS 8600 software releases are stable.
About the only problem we have had is some sort of bug with DHCP Guard which caused no end of problems. The solution was to disable DHCP Guard, but long-term, we need to figure this problem out.
But not all is lost. The latest ESM software release is very nice and fixed some nagging bugs. CS1000 6.0 looks very solid, and our SRGs are running with no problems at all. We are very committed to Nortel over the next 5-7 years, but we will continue to move with deliberation when we touch any of the core switches.