Comments on: When it rains it pours! https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/ technology, networking, virtualization and IP telephony Wed, 09 Dec 2009 19:48:16 +0000

By: Tom https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1511 Wed, 09 Dec 2009 19:48:16 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1511 We use 4.1.8.3 on our ERS 8600s and run ~60 VLANs, IST, SMLT, PIM-SM, RIP, SYSLOG, etc. Things have been fairly stable, but it’s because of reported problems as discussed in this thread that we have held off on upgrading to 5.x.x.x. Ever since last year when we found the VLACP timeout issue after installing an ERS 45xx software update (yes, we found that problem and it was a pain) – see http://www.michaelfmcnamara.com/files/VLACP%20timeout%20issue.pdf – I’ve been very careful and leery about software upgrades on core switches. We will continue that policy until I feel more confident that the Nortel ERS 8600 software releases are stable.

About the only problem we have had is some sort of bug with DHCP Guard which caused no end of problems. The solution was to disable DHCP Guard, but long-term, we need to figure this problem out.

But not all is lost. The latest ESM software release is very nice and fixed some nagging bugs. CS1000 6.0 looks very solid, and our SRGs are running with no problems at all. We are very committed to Nortel over the next 5-7 years, but we will continue to move with deliberation when we touch any of the core switches.

By: Mark https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1506 Tue, 08 Dec 2009 22:16:01 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1506 Chuck

We were not experiencing actual LANE lockups on the 8612XLRS with 5.0.1. The group of four 10gig ports in LANE one would stop transmitting packets, but was still receiving them. Links up. No errors. We were lucky that VLACP logically brought the affected ports down (although they were physically still up) and the SMLT peer took over. Without VLACP the traffic would have been black holed.

By: Mark https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1505 Tue, 08 Dec 2009 21:58:48 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1505 bylie on December 4, 2009 – 2:26 am

Are you running SLPP? In 5.0.1 with SLPP enabled, the CPU buffer filled up (99%) on one side of the cluster, causing the peer CPU to spike to 100%. This caused instability in the IST. If you globally disable SLPP, the problem goes away. This was due to the SLPP MAC address being placed in the forwarding table twice.

We tested and confirmed that this is fixed in 5.1.1.1 and 5.0.5.

By: Mark https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1504 Tue, 08 Dec 2009 21:53:44 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1504 “ARP issue on a ERS 8600 switch cluster running 4.1.8.0”

Was fixed in 4.1.8.3. How can you manage so many versions of code? We’re always on the same code base with clusters and even standalones now.

By: Mark https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1503 Tue, 08 Dec 2009 21:50:44 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1503 In reply to Michael McNamara.

We only use VRRP when we have to. We use RSMLT, which helps when SMLT converges while an 8600 in the cluster is coming back up. The problem is you can only have 32 SMLTs per RSMLT instance. We found this out the hard way. If you go over 32 it causes instability in the IST and things go nuts. A couple of VLANs (like the management one) have way more than 32 SMLT instances.

We do have unique VRRP IDs across the switch.
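
For anyone who wants to audit their own cluster against the 32-SMLT figure mentioned above, here is a minimal sketch. It assumes you have already exported, by whatever means you prefer, a mapping of RSMLT-enabled VLAN to the SMLT IDs carried on it. The function name, the data layout and the example values are all hypothetical, and the limit of 32 is simply the number reported in this comment, not something taken from vendor documentation.

```python
# Limit as reported in the comment above (an assumption, not a vendor-documented value).
RSMLT_SMLT_LIMIT = 32


def vlans_over_smlt_limit(smlt_ids_by_vlan, limit=RSMLT_SMLT_LIMIT):
    """Return {vlan: smlt_count} for every VLAN carrying more than `limit` distinct SMLT IDs."""
    over = {}
    for vlan, smlt_ids in smlt_ids_by_vlan.items():
        count = len(set(smlt_ids))  # ignore duplicate entries in the export
        if count > limit:
            over[vlan] = count
    return over


if __name__ == "__main__":
    # Hypothetical inventory: VLAN 10 (management) carries 40 SMLTs, VLAN 20 carries 3.
    inventory = {10: list(range(1, 41)), 20: [1, 2, 3]}
    for vlan, count in vlans_over_smlt_limit(inventory).items():
        print(f"VLAN {vlan}: {count} SMLTs exceeds the limit of {RSMLT_SMLT_LIMIT}")
```

A VLAN with exactly 32 SMLTs is not flagged, since the comment only describes trouble when going over 32.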

By: Chuck https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1501 Tue, 08 Dec 2009 17:01:00 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1501 We’ve been a Nortel customer for over a decade and have used the 8600 since its initial release. We’ve also beta tested just about every single release since then and have been through more than our share of problems. The SMLT rearchitecture of 4.1.8, 5.0.1 and 5.1 has been a rough ride. 5.1.1.1 finally fixes most of our issues with IST and box instability. We are still working on one more issue that seems to be related to HA-CPU & PIM/IP Multicast that causes the IST to go down when disconnecting an SMLT edge link. With PIM disabled, the issue disappears. We also have an 8612XLRS in use on 5.1.1.1 and have not experienced any LANE lockup issue.

By: SilverMachine https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1500 Tue, 08 Dec 2009 12:32:47 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1500 In reply to Michael McNamara.

Hi Michael,

No, they are not running 5.1.1.1 code with 8612XLRS modules. No LANE lock-ups until now. They have been running 5.1.1.1 code for more than 2 weeks without problems.

Concerning DHCP Relay Agents on a cluster running 5.1.1.1 code: the relay agent IP should not be the same as the one configured as the Virtual IP Address (VRRP). Otherwise it will start a broadcast storm (DHCP Offer packets) on the IST link!!

Best Regards!
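
The DHCP relay point above can be checked offline with a few lines of Python. This is only a sketch under assumptions: the two input lists are hypothetical and would have to be pulled from your own configuration, the function name is made up for illustration, and the addresses are assumed to be IPv4.

```python
import ipaddress


def relay_vip_conflicts(relay_agent_ips, vrrp_virtual_ips):
    """Return the addresses that are used both as a DHCP relay agent IP and as a VRRP virtual IP."""
    relays = {ipaddress.ip_address(ip) for ip in relay_agent_ips}
    vips = {ipaddress.ip_address(ip) for ip in vrrp_virtual_ips}
    # Sort numerically by address, then convert back to strings for display.
    return [str(ip) for ip in sorted(relays & vips)]


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    relay_agents = ["10.0.1.1", "10.0.2.2"]
    virtual_ips = ["10.0.1.1", "10.0.3.1"]
    for ip in relay_vip_conflicts(relay_agents, virtual_ips):
        print(f"{ip} is configured as both a relay agent and a VRRP virtual IP")
```

Any address the check reports is a candidate for the broadcast storm described above and would need a different relay agent IP.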

By: Michael McNamara https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1498 Tue, 08 Dec 2009 03:07:49 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1498 In reply to Mark.

Hi Mark,

I’m happy that another Nortel customer is running 5.1.1.1 software without any issues. Also happy to hear that you haven’t experienced any LANE lockups on 5.1.1.1 software.

That’s an interesting comment regarding the SuperMezz cards and SLPP. Do you use VRRP in your configuration? Do you have unique VRRP IDs across the switch cluster? I ask because there is a known issue that we discovered last May that may be related. In short you need to have unique VRRP IDs across the switch, you can’t use VRRP ID 1 for VLAN 1 and VRRP ID 1 for VLAN 2, VRRP ID 1 for VLAN 3, etc.

Thanks for the comment!
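
As a rough illustration of the VRRP ID rule described above, the sketch below flags any VRRP ID that appears on more than one VLAN. The VLAN-to-VRRP-ID mapping is a hypothetical structure you would build from your own configuration export; nothing here parses real switch output.

```python
from collections import defaultdict


def find_duplicate_vrids(vlan_to_vrid):
    """Return {vrid: sorted list of VLANs} for every VRRP ID that is used on more than one VLAN."""
    vlans_by_vrid = defaultdict(list)
    for vlan, vrid in vlan_to_vrid.items():
        vlans_by_vrid[vrid].append(vlan)
    return {
        vrid: sorted(vlans)
        for vrid, vlans in vlans_by_vrid.items()
        if len(vlans) > 1
    }


if __name__ == "__main__":
    # Hypothetical inventory: VLAN ID -> VRRP ID. VLANs 1 and 2 both reuse VRRP ID 1.
    inventory = {1: 1, 2: 1, 3: 3}
    for vrid, vlans in find_duplicate_vrids(inventory).items():
        print(f"VRRP ID {vrid} is reused on VLANs {vlans}")
```

An empty result means every VLAN has its own VRRP ID across the cluster.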

By: Ben https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1496 Tue, 08 Dec 2009 02:58:59 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1496 In reply to Michael McNamara.

The Cisco VSS solution has worked excellently in our environment. Unlike the ERS8600 it operates 100% as one logical switch, similar to a stack of Nortel 5510s or Cisco 3750s. We have redundant fiber to all of our closets and have them set up as Multichassis EtherChannels (MECs), which are similar to SMLT but in my experience work a lot better. There is only one active supervisor between the two chassis and the active/active detection is done both by the VSS (SMLT) links as well as by a dedicated heartbeat link.

By: Michael McNamara https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1495 Tue, 08 Dec 2009 02:56:55 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1495 In reply to udo.

Hi Udo,

Thanks for the comment. It would seem that a lot of folks are happy with 5.1.1.1 software, however, I will warn you that I’ve heard about issues with the 8612XLRS modules. You might want to contact Nortel support to inquire about any known issues with the 8612XLRS modules if you haven’t already installed them.

Cheers!

By: Michael McNamara https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1493 Tue, 08 Dec 2009 02:38:11 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1493 In reply to silvermachine.

Hi Silvermachine,

I’m curious if your customers are running configurations with any 8612XLRS modules? I’ve heard some not too pretty feedback from a very large Nortel customer that they won’t allow 5.1.1.1 software back into their organization because it caused so many issues similar to LANE lock-ups. That same customer is now running 5.0.5 and has been very pleased with that software release and the 8612XLRS modules.

That’s an interesting comment concerning DHCP relay agents and NIC teaming configurations.

Thanks for the comment!

By: Michael McNamara https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1492 Tue, 08 Dec 2009 02:31:31 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1492 In reply to Ben.

Hi Ben,

How has the Cisco VSS solution worked out for you? I have a location with an existing Cisco Catalyst 6509 where I need to add a redundant core switch and VSS is an option.

The Nortel 8600s have worked very well; it’s just frustrating to hit the occasional operational snag every once in a while. The hardware has been very reliable but as I’ve mentioned the software releases ever since v4.1.5.4 have been terrible, especially if you are running in an IST/SMLT configuration. While we don’t run every protocol possible we do drive our cores pretty hard: 100+ VLANs, IST, SMLT, RSMLT, PIM-SM, OSPF, BGP, SYSLOG, etc.

I’ll soon be cutting my teeth on the Cisco Nexus 7000 switch so you might hear me screaming about them pretty soon.

Thanks for the comment!

By: Michael McNamara https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1490 Tue, 08 Dec 2009 02:20:53 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1490 In reply to gby.

Hi Gby,

I would generally agree with you concerning the reliability and stability of the platform, however, the recent software releases have left serious questions in the minds of customers like myself. I understand from Nortel that a majority of the code changes behind their IST/SMLT re-design were to accommodate larger networks and provide better scalability in larger configurations. I can understand their motivations but basic stability and availability is paramount for almost any network these days. I’m still happy with v4.1.8.3 but it seems there’s always some event that calls my faith into question.

Just as a note, I’ve heard a few comments about LANE lockups on the 8612XLRS modules when running 5.1.1.1 software.

Thanks for the comment!

By: Michael McNamara https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1489 Tue, 08 Dec 2009 02:11:49 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1489 In reply to bylie.

Hi Bylie,

Isn’t it frustrating how the switch can sometimes go half in the can? I would much prefer it went completely belly up, so that the redundancy could kick in. I must admit that I’m probably being a little harsh on Nortel. I manage a fairly large network so the law of averages dictates that I’m statistically inclined to experience more problems just because I have more equipment where problems can develop.

I have heard really good things about the v5.0.5 software release from a number of people now, so you might want to consider v5.0.5 as an option.

Thanks for the comment!

By: Mark https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1485 Mon, 07 Dec 2009 21:24:34 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1485 We are running 5.1.1.1 on clustered 8600’s for more than a month with no issues (yet). 5.0.2 was interesting. SLPP issue that would lock up the cpu’s and a lane lockup in the lane with the 10gig ports on the 8634XGRS. 5.1.1.1 seemed to fix both of these.

We went to the 5.x code stream just about a year ago in order to bump our server farm backhauls up to 10GigE with 5600 TOR and multiple 8612XLRS cards. SuperMezz cards caused us many critical failures.

Our advice, if you are on 5.x code and you are seeing cpu lockups or IST issues, disable SLPP and the SuperMezz cards. To this day we have not re-enabled the supermezz cards.

Mark

By: udo https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1483 Mon, 07 Dec 2009 12:05:52 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1483 Hi,

we use 2 ERS86xx in the core without SMLT but with IST. Every switch on the campus has just one uplink (no SMLTs are used yet).

Three months ago we updated from 4.1x to 5.11 because in the future we will have to use some 8612XLRS cards.

In software 5.11 we had big trouble with the ARP table.
Some hosts were unable to connect; some were able (all in the same VLAN). That was hell!
The only way to fix this bug was to switch off the IST between the ERS8600s (a big thanks to Nortel for the very fast support!)

Last week we upgraded from 5.11 to 5.111.
Now it looks like the bugs have gone away (the IST is running).

Udo

By: silvermachine https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1481 Mon, 07 Dec 2009 11:37:04 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1481 We have one major customer running 5.1.1.1 code on a central Nortel 8600 cluster. The 5.1.1.1 code does not seem to be as rock solid as the 4.1.5.4 code. We faced a lot of upgrade problems, and some configurations had to be changed to solve broadcast storms and high CPU utilization issues. The problems were related to misconfigured DHCP relay agents and NIC teaming (MLT).

By: Ben https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1480 Sun, 06 Dec 2009 23:46:45 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1480 We recently forklifted our ERS8600s out of our core in favor of a Cisco VSS 6509 pair. We battled issues for over two years related to instability of our core ERS8600s. It seemed like any time we tried to make a change to the edge, for example taking down a fiber link for maintenance, we would throw the cores into a tizzy. Generally the “solution” was to reboot our cores.

By: Internets of Interest:6 Dec 09 | My Etherealmind https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1479 Sun, 06 Dec 2009 20:32:03 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1479 […] When it rains it pours! | Michael McNamara – Michael McNamara is getting weirdness from his Nortel ERS8600 switches and wants to know if anyone else is having the same problems. Please drop by and add your comments if you can. Even if you aren’t, that would be a help. […]

By: gby https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1477 Fri, 04 Dec 2009 12:07:49 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1477 Hi
I have been using the Nortel Passport 8600 since 2000, and until now they were a very reliable and stable platform.
But to be honest
– I have been using them in a pure level 2 design (using MLT/SMLT), no level 3 features activated on them
– Level 3 are implemented on external devices (non Nortel ones)
– Design was simple (keep it simple , keep it simple, …)
– Until now we have done one upgrade per year, waiting for others to debug issues
– We then migrated from release 4.1.6.3 to release 4.1.8.3 a few weeks ago, and since this migration we have had 2 major outages: one was due to a Switch Fabric reset and the other to bad save config behavior

So it seems that this release has some issues that the others did not.
Of course the recommendation from Nortel is to go to 5.x to correct these bugs.
If release 5.1.x, as you state, is going to be pulled from the website by Nortel, I suspect a serious problem with code quality over the last 6 months … Something related to their capability to provide ongoing support while being bankrupt?
But FYI I have just checked and release 5.1.1 is still available for download.

Gby

By: bylie https://blog.michaelfmcnamara.com/2009/12/when-it-rains-it-pours/comment-page-1/#comment-1476 Fri, 04 Dec 2009 07:26:48 +0000 http://blog.michaelfmcnamara.com/?p=1143#comment-1476 We also have an issue with our central Nortel 8600 cluster. Last month we upgraded our software to v5.0.1.0 after some stability issues. Last Friday the entire cluster locked up on us. After some digging and opening a support case with Nortel we’re now quite sure that there was an issue between the current software and some leftover settings. The problem was related to the CPU buffer filling up completely, after which the entire cluster stopped pushing data around. Not amusing! It actually seems to be a bug because Nortel said it would be fixed in firmware v5.1.2.
