We generally perform software upgrades on all our routers and switches twice a year. It helps keep our network infrastructure current and it also helps to reduce unscheduled downtime.
Last fall we decided to skip the twice-yearly maintenance because there were just too many projects on the docket. This spring we came across a very interesting issue that we had never seen in the past. We started to notice that multiple Nortel Ethernet Switch 460/470 switches/stacks were rebooting themselves all over our network. It took us a few hours to realize that every switch that had rebooted had just passed approximately 500 days of uptime. All the affected switches were running FW 3.6.0.6 with SW v3.6.4.08. The switches were literally rebooting themselves in the same order in which they had been upgraded almost 500 days earlier.
I’m currently trying to confirm with Nortel that this “bug” has been removed from the 3.7.x software release.
This was one occasion where the network was just too good for itself.
Cheers!
Update: Tuesday June 10, 2008
I received a formal response from Nortel today that included the following:
Analysis of the issue :-
When the BS-470 switches reaches 497 days the system time rolls over and during this period management communication will be lost. This is caused by the use of a 32 bit counter, which when it rolls back to 0, initiates an internal software synchronization to align all timers. This is only loss of IP management and not switching functionality. This issue still open and can be fixed by rebooting the switches before reaching the 497 day mark.
When I inquired if the problem had been resolved in the v3.7.x software release I was told it had not. It would seem that a lot of folks just don’t expect switches to be running that long these days.
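The 497 day figure falls straight out of the arithmetic if the 32 bit counter ticks in hundredths of a second, which is how SNMP sysUpTime (TimeTicks) is defined. Nortel's response only says "32 bit counter", so the tick rate is my assumption, and the function names below are purely illustrative. A quick sketch:

```python
# Assumption: the uptime counter is an unsigned 32-bit value that ticks in
# hundredths of a second, like SNMP sysUpTime (TimeTicks).
TICKS_PER_SECOND = 100
COUNTER_MODULUS = 2 ** 32
SECONDS_PER_DAY = 86_400


def rollover_days() -> float:
    """Days of uptime before the 32-bit tick counter wraps back to 0."""
    return COUNTER_MODULUS / TICKS_PER_SECOND / SECONDS_PER_DAY


def days_until_rollover(sys_uptime_ticks: int) -> float:
    """Days remaining before the next wrap, given a polled tick value."""
    remaining_ticks = COUNTER_MODULUS - (sys_uptime_ticks % COUNTER_MODULUS)
    return remaining_ticks / TICKS_PER_SECOND / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"Counter wraps after {rollover_days():.2f} days")
```

Feeding polled uptime values into something like days_until_rollover would let you flag any switch that is getting close to the mark and schedule its reboot ahead of time.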
Cheers!
Update: Wednesday November 4, 2008
Last week Nortel released a technical service bulletin entitled, “Ethernet Routing Switches: SysUpTime approaching 497 days can cause the switch or stack to behave in some unexpected way“. They also released a video that documents a workaround to the problem.
Let me save you the time and effort of downloading either. Nortel's solution is truly masterful: reboot the switch.
While I’ve been known to defend Nortel there’s just no defense for this. I’m completely floored at Nortel’s response.
Cheers!
Fury says
This is interesting. We usually power cycle all switches every month during a short 1-2 hour maintenance window. We have had switches go over a year, but not 500 days. We’re running FW 3.6.0.7 and SW 3.7.2.13.
Unfortunately, the Known Issues Tool does not have any entries for the ES 470; it will be interesting to see what Nortel says about the problem you discovered.
Michael McNamara says
I’ve updated the original post documenting Nortel’s response. Another reason to make sure you perform at least a yearly software upgrade or reboot of the switch.
Cheers!
Michael McNamara says
I’ve updated the original post to include Nortel’s recent technical bulletin.
Heath Freel says
I have come across an issue that, although different, seems to be based on the same 32 bit counter. Our company manages a number of devices and provides reports that are generated using our custom SNMP polling engine. We have done extensive work with Cisco and Juniper and recently started monitoring some Nortel gear. In the specific case I am talking about I have three Nortel BCM400’s that always return the same value for the ifInOctets OID that I use for calculating BW stats. This value is 2^32 – 2147483647.
I was able to get one of the devices reporting ifOutOctets correctly as it hadn’t reached that value yet, however after a small amount of time it reached this max value and stopped. Our system has been specifically written to allow for the rollover back to zero and calculates correctly based on this, but for some reason the BCM just stops.
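For anyone following along, the rollover handling described above usually comes down to taking the difference between two polls modulo 2^32. This is only a minimal sketch of that idea, not the actual polling engine; the function names are hypothetical and it assumes the counter wraps at most once between polls:

```python
# Assumption: ifInOctets/ifOutOctets are unsigned 32-bit counters (Counter32)
# and wrap at most once between two consecutive polls.
COUNTER32_MODULUS = 2 ** 32


def counter32_delta(previous: int, current: int) -> int:
    """Octets counted between two polls, allowing for one wrap back to 0."""
    return (current - previous) % COUNTER32_MODULUS


def bits_per_second(previous: int, current: int, interval_seconds: float) -> float:
    """Average throughput over the polling interval, in bits per second."""
    return counter32_delta(previous, current) * 8 / interval_seconds


if __name__ == "__main__":
    # The counter wrapped between these two polls; the delta is still correct.
    print(counter32_delta(4_294_967_000, 704))
```

Note that a counter that has stopped incrementing yields a delta of 0 on every poll, which looks the same as an idle interface, so wrap handling on the poller side cannot compensate for it.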
Has anyone seen a similar issue? Does anyone know of a fix?
Thanks,
Heath
Michael McNamara says
Hi Heath,
I would suspect that this problem has already been identified and resolved. What version of software is the BCM400 running?
While I don’t have any BCMs I do have a large number of SRGs (same hardware/software just a different license key that activates different features) within my network. Hah… looking over a large number of them none are near the 2^32 value for either ifInOctets or ifOutOctets. The BCM/SRG utilizes Linux as the underlying operating system so I would be surprised to hear that this has not been seen and addressed elsewhere.
If you have SNMP read access to the network Ethernet switches, you could poll the switch port that the device you’re interested in connects to, rather than polling the device itself.
Sorry I couldn’t be of more help.
Good luck!
Heath Freel says
Hi Michael,
Thanks for the info. The version of the BCM is 4.0.2.03 – I’m not sure but I think that is relatively recent. I have done a number of searches looking for a similar issue and your post is the closest I could find to a match. I also searched the Nortel site extensively but found no mention of such an issue. I will continue my search but focus more on Linux than just BCM to see if anyone else has encountered this before.
Thanks again,
Heath
Michael McNamara says
Thanks to Q, here’s a new CSB on the issue:
http://support.avaya.com/css/P8/documents/100133199