It’s finally official… Nortel has released v4.1.8.2 software for the Ethernet Routing Switch 8600. This latest code promises to put to rest all the ARP/FDB issues that surfaced in the 4.1.6.x software branch. It also promises increased efficiency for those running switch clusters (IST). I’ve been running 4.1.8.0 software for the past 30+ days and believe it’s a stable release that customers can finally count on. The one word of warning for everyone out there revolves around VRRP IDs: you must make sure you have unique VRRP IDs across your entire switch.
Anyone considering an upgrade should read the release notes carefully since there are a number of significant changes to the code.
You can find a copy of the release notes here but you’ll obviously need to visit the Nortel site to download the software.
Here’s an excerpt regarding the changes around SMLT/RSMLT:
New Features in This Release
SMLT/RSMLT Operational Improvements (CR Q01764193/Q01769324/Q01776485)
In previous SMLT operation, bringing the SMLTs up/down triggered the flushing of the entire MAC/FDB belonging to the SMLTs on both IST Core Peer Switches. Flushing of the MAC addresses then caused the dependent ARP entries (for IP stations) to be re-resolved; for ARP resolution, the ERS 8600 re-ARPs for all the SMLT-learned ARPs. This created a major MAC/ARP re-learning effort. While the records were flushed, the exception (learning) packets were also continuously forwarded to the CPU during the re-learning period, increasing the CPU load. This further slowed SMLT re-convergence as well as the h/w record reprogramming. Since proper traffic flow on an ERS 8600 is completely dependent on the h/w records, this prior behavior could adversely affect convergence times, especially in very large networks (8000+ MACs/ARPs), and in networks also running many multicast streams, as multicast streams often need to be forwarded to the CPU for learning, thereby also increasing CPU load.
The SMLT changes in this release improve this operation significantly, and continue to allow all previous SMLT/RSMLT topologies to be supported. The SMLT Operational Improvements change SMLT/RSMLT behavior in that the actual SMLT/RSMLT connection links on a powered-up IST Core Switch will take longer to become active (link status up and forwarding) than with previous versions of code. During this time period the other Peer IST Core Switch continues to forward, thereby avoiding any loss of traffic in the network for all SMLT/RSMLT-based connections. The SMLT/RSMLT associated links will not become active upon boot-up until the IST is completely up and a ‘new MAC/ARP/IP sync’ has occurred between the two Core IST Peer Switches in a Cluster.
Users may see occasional instances where the Remote SMLT flag is False on both Peer Switches. This is normal, provided the flag clears and is then set properly (False on one side, True on the other) once the FDB age-out for the associated VLAN has occurred. This behavior has no effect on user traffic operation – no user traffic loss or disruption will be seen under this condition.
For proper network behavior, Nortel recommends operating both IST switches with either the “new” or the “old” SMLT architecture. Therefore SMLT operation between IST Peer Core switches, with one switch operating with pre-4.1.8.x code and the other operating with 4.1.8.x or later code, is NOT supported. Additionally, users will see some new informational log messages generated around this behavior. The new message formats are listed below, along with the various situations in which they will be seen.
Case 1: A switch running SMLT is reset. Once the switch comes back up, the messages below are displayed, irrespective of the number of SMLTs:
CPU5 [06/05/08 05:05:45] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
CPU5 [06/05/08 05:05:46] MLT INFO SMLT MAC /ARP Sync is Complete : Peer can now be used normally
Case 2: The system is up and running, but an SMLT UP event (from down) has occurred. One sync message is displayed for every SMLT that went down and came back up. In the following example, 2 x SMLTs went down and came up:
CPU5 [06/05/08 05:05:45] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
CPU5 [06/05/08 05:05:45] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
CPU5 [06/05/08 05:05:46] MLT INFO SMLT MAC /ARP Sync is Complete : Peer can now be used normally
CPU5 [06/05/08 05:05:46] MLT INFO SMLT MAC /ARP Sync is Complete : Peer can now be used normally
NOTE: To determine which specific SMLT IDs are affected, look for the SMLT ID down/up log messages.
Case 3: The sync fails, either due to a difference in IST Peer software versions (pre-4.1.8.x and 4.1.8.x), where one peer supports MAC/ARP sync and the other does not, or due to some other issue, such as a mis-configuration or the IST session not coming up. The system that was reset and is requesting the sync will keep all of its ports locked down (except IST_MLT) until the IST comes up properly and the sync has occurred. After 5 minutes the Log/Error messages below are displayed:
CPU5 [05/15/08 05:28:51] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
< After 5 min>
CPU5 [05/15/08 05:33:55] MLT ERROR SMLT initial table sync is delayed or failed. Please check the peer switch for any partial config errors. All the ports in the switch except IST will remain locked.
NOTE: All known failover times for SMLT/RSMLT operation are now, and always have been, sub-second. With this release, all known fail-back or recovery times have been improved, especially for very large scaled environments, to within 3 seconds, in order to provide the redundancy required for converged networks. These values are for unicast traffic only. Not all IP Multicast failover or fail-back/recovery situations can provide such times today, as many situations depend on IPMC protocol recovery. For best IPMC recovery in SMLT/RSMLT designs, the use of static RPs for PIM-SM is recommended, with the same CLIP IP address assigned to both Core IST Peers within the Cluster, and to all switches in a full-mesh or square configuration. Failover or fail-back/recovery times for situations that involve higher-layer protocols cannot always be guaranteed; in many situations, it is abnormal corner-case events for which times are extended. Reference the Network Design Guide for your specific code release for recommendations on best practices to achieve best results. Additionally, for best results, VLACP MUST also be used: the SMLT/RSMLT improvements noted here have been optimized to always function with VLACP, so a pure Nortel SMLT/RSMLT design gives the best results. SMLT designs with non-Nortel devices that support some level of link aggregation are still supported, but fail-back/recovery times cannot be guaranteed.
NOTE: VLACP configuration should now use a short timer of 500 msec (or higher) and a minimum timeout-scale of 5. Lower values can be used, but should any VLACP ‘flapping’ occur, the user will need to increase one or more of the values. These timers have been proven to work in large scaled environments (12,000 MACs), and still provide the 3-second recovery time required for converged networks (5 x 500 msec = 2.5 seconds). Using these values does not increase re-convergence or fail-back/recovery times; rather, it guarantees these times under all extreme conditions. (CR Q01925738-01 and Q01928607) Users should also note that if VLACP is admin disabled on one side of a link/connection, VLACP will bring the associated remote connection down; but since the remote side keeps link up, the side with VLACP admin disabled now has a black-hole connection to the remote switch, and all packets sent to it will be dropped. If VLACP is disabled on one side of a connection, it MUST also be disabled on the remote side, or else traffic loss will likely occur. The same applies to LACP configurations for 1-port MLTs.
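As a rough sketch, the recommended timers from the note above might be applied per-port like this. The port number is a placeholder, and the exact vlacp sub-command names are assumptions from memory here, so verify them against the CLI reference for your code release:

```
ERS8600:5# config ethernet 1/1 vlacp fast-periodic-time 500
ERS8600:5# config ethernet 1/1 vlacp timeout short
ERS8600:5# config ethernet 1/1 vlacp timeout-scale 5
ERS8600:5# config ethernet 1/1 vlacp enable
```

With the short timer at 500 msec and a timeout-scale of 5, a dead peer is detected after roughly 5 x 500 msec = 2.5 seconds, which is what keeps recovery inside the 3-second target mentioned above. Apply the same values on the remote side of each link.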
NOTE: If using VRRP with SMLT, users are now HIGHLY recommended (effectively required) to use unique VRIDs, especially when scaling VRRP (more than 40 instances). Use of a single VRID for all instances is permitted by the standard, but when such a configuration is used in scaled SMLT designs, instability can be seen. A better alternative, which allows scaling to the maximum number of IP VLANs, is to use an RSMLT design instead. See Section 10 in this Readme (page 10) for additional information on how to easily move from a VRRP design to an RSMLT design.
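One simple way to guarantee unique VRIDs is to reuse each VLAN’s ID as its VRID. A sketch of what that might look like follows; the VLAN numbers and addresses are placeholders, and the exact vrrp sub-command syntax is an assumption, so check it against your CLI reference:

```
ERS8600:5# config vlan 10 ip vrrp 10 address 10.10.10.1
ERS8600:5# config vlan 10 ip vrrp 10 enable
ERS8600:5# config vlan 20 ip vrrp 20 address 10.10.20.1
ERS8600:5# config vlan 20 ip vrrp 20 enable
```

Since VLAN IDs are already unique on the switch, tying the VRID to the VLAN ID avoids the single-shared-VRID pattern that causes instability in scaled SMLT designs.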
NOTE: For any SMLT design, it is now HIGHLY recommended to change the FDB aging timer for L2 SMLT VLANs from its default value of 300 seconds to a value 1 second higher than the system setting for the ARP aging timer. FDB timers are set on a per-VLAN basis. If using the default system ARP aging time (config ip arp aging) of 360 minutes, then the proper value for the FDB aging timer (config vlan x fdb-entry aging-time) is 21601 seconds, which is 360 minutes (6 hours) plus 1 second. This makes the system use only the ARP aging timer for aging, rather than the FDB aging timer, and this value has been shown to work very well in assuring no improper SMLT learning. The use of this timer has one potential side effect: for legacy modules, it limits the system to a maximum of around 12,000 concurrent MACs; for R-mode systems, the limit remains at 64K even with this timer setting. With this timer, should an edge device move, the system will still immediately re-learn and re-populate the FDB table properly, and will not have to wait for the 6-hour (plus 1 second) timer to expire. No negative operational effects are known when using this timer value. For non-SMLT-based VLANs, the default FDB aging timer of 300 may be used, or it can be changed, or even also set to 21601. For this reason the default value of the FDB aging timer remains at 300 seconds in all code releases.
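Putting the two commands from the note together, the recommended pairing looks something like this (VLAN 10 stands in for each of your L2 SMLT VLANs):

```
ERS8600:5# config ip arp aging 360
ERS8600:5# config vlan 10 fdb-entry aging-time 21601
```

The arithmetic behind the second value: 360 minutes x 60 = 21600 seconds of ARP aging, plus 1 second, so an FDB entry can never age out before its dependent ARP entry does. Repeat the fdb-entry line for every SMLT VLAN, since the timer is per-VLAN.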
Cheers!