It’s finally official… Nortel has released v4.1.8.2 software for the Ethernet Routing Switch 8600. This latest code promises to put to rest all the ARP/FDB issues that surfaced in the 4.1.6.x software branch. It also promises increased efficiencies for those running switch clusters (IST). I’ve been running 4.1.8.0 software for the past 30+ days and believe it’s a stable release that customers can finally count on. The one word of warning for everyone out there revolves around VRRP IDs: you must make sure you have unique VRRP IDs across your entire switch.
Anyone considering an upgrade should read the release notes carefully since there are a number of significant changes to the code.
You can find a copy of the release notes here but you’ll obviously need to visit the Nortel site to download the software.
Here’s an excerpt regarding the changes around SMLT/RSMLT:
New Features in This Release
SMLT/RSMLT Operational Improvements (CR Q01764193/Q01769324/Q01776485)

In previous SMLT operation, bringing the SMLTs up/down triggered the flushing of the entire MAC/FDB belonging to the SMLTs on both IST Core Peer Switches. Flushing the MAC addresses then caused the dependent ARP entries (for IP stations) to be re-resolved; for ARP resolution, the ERS 8600 re-ARPs for all the SMLT-learned ARPs. This created a major MAC/ARP re-learning effort. Because the records were flushed, the exception (learning) packets were also continuously forwarded to the CPU during the relearning period, increasing the CPU load. This further slowed down SMLT re-convergence as well as the h/w record reprogramming. Since proper traffic flow with an ERS 8600 is completely dependent on the h/w records, this prior behavior could adversely affect convergence times, especially in very large networks (8000+ MACs/ARPs) and in networks also running many multicast streams, as multicast streams often need to be forwarded to the CPU for learning, thereby also increasing CPU load.

The SMLT changes in this release improve this operation significantly and continue to allow all previous SMLT/RSMLT topologies to be supported. The SMLT Operational Improvements change SMLT/RSMLT behavior in that the actual SMLT/RSMLT connection links on a powered-up IST Core Switch will take longer to become active (link status up and forwarding) than with previous versions of code. During this time period the other Peer IST Core Switch continues to forward, thereby avoiding any loss of traffic in the network for all SMLT/RSMLT based connections. The SMLT/RSMLT associated links will not become active upon boot-up until the IST is completely up and a ‘new MAC/ARP/IP sync’ has occurred between the two Core IST Peer Switches in a Cluster.
Users may see occasional instances where the Remote SMLT Flag is False on both Peer Switches. This is normal, provided the flag clears and is then set properly (False on one side, True on the other) once the FDB age-out for the associated VLAN has occurred. This behavior has no effect on user traffic operation – no user traffic loss or disruption will be seen under this condition.

For proper network behavior Nortel recommends operating both IST switches with either the “new” or the “old” SMLT architecture. Therefore SMLT operation between IST Peer Core switches with one switch operating with pre-4.1.8.x code and the other operating with 4.1.8.x or later code is NOT supported. Additionally, users will see some new informational log messages generated around this behavior. The new message formats are listed below, along with the situations in which they will be seen.
Case 1: A switch running SMLT is reset. When the switch comes up, the following messages are displayed, irrespective of the number of SMLTs:
CPU5 [06/05/08 05:05:45] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
CPU5 [06/05/08 05:05:46] MLT INFO SMLT MAC /ARP Sync is Complete : Peer can now be used normally

Case 2: The system is up and running but an SMLT UP event (from down) has occurred. One sync message is displayed for every SMLT that went down and came back up. In the following example, 2 x SMLTs went down and came up:
CPU5 [06/05/08 05:05:45] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
CPU5 [06/05/08 05:05:45] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
CPU5 [06/05/08 05:05:46] MLT INFO SMLT MAC /ARP Sync is Complete : Peer can now be used normally
CPU5 [06/05/08 05:05:46] MLT INFO SMLT MAC /ARP Sync is Complete : Peer can now be used normally
NOTE: To determine which specific SMLT IDs are affected, look for the SMLT ID down/up log messages.
Case 3: The sync fails because of a difference in IST Peer software versions (pre-4.1.8.x and 4.1.8.x), where one peer supports MAC/ARP sync and the other does not, or because of some other issue such as a mis-configuration or the IST session not coming up. The system that was reset and is requesting the sync will keep all its ports locked down (except IST_MLT) until the IST comes up properly and the sync has occurred. After 5 minutes the following log/error messages will be displayed:
CPU5 [05/15/08 05:28:51] MLT INFO SMLT MAC/ARP Sync Requested: CAUTION do not take ANY action with the peer at this time
< After 5 min>
CPU5 [05/15/08 05:33:55] MLT ERROR SMLT initial table sync is delayed or failed. Please check the peer switch for any partial config errors. All the ports in the switch except IST will remain locked.
NOTE: All known failover times for SMLT/RSMLT operation are now, and always have been, sub-second. With this release all known fail-back or recovery times have been improved, especially for very large scaled environments, to be within 3 seconds, in order to provide the redundancy required for converged networks. These values are for unicast traffic only. Not all IP Multicast failover or fail-back/recovery situations can provide such times today, as many situations depend on the IPMC protocol recovery. For best IPMC recovery in SMLT/RSMLT designs, the use of static RPs for PIM-SM is recommended, with the same CLIP IP address assigned to both Core IST Peers within the Cluster, and to all switches in a full-mesh or square configuration. Failover or fail-back/recovery times for situations that involve higher-layer protocols cannot always be guaranteed. Reference the Network Design Guide for your specific code release for recommendations on best practices to achieve the best results. In many situations it is abnormal corner-case events for which times are extended. For best results, VLACP MUST also be used; the SMLT/RSMLT improvements noted here have been optimized to always function with VLACP. Therefore a pure Nortel SMLT/RSMLT design will give the best results. SMLT designs with non-Nortel devices that support some level of link aggregation are still supported, but fail-back/recovery times cannot be guaranteed.
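For illustration, a static RP setup on the ERS 8600 CLI might look roughly like the following sketch; the CLIP interface ID, addresses and group range are placeholder assumptions, and the exact argument format can vary by release:
config ip circuitless-ip-int 1 create 10.1.1.1/255.255.255.255
config ip circuitless-ip-int 1 pim enable
config ip pim enable
config ip pim static-rp enable
config ip pim static-rp create 239.0.0.0 255.0.0.0 10.1.1.1
Per the note above, the same CLIP address (10.1.1.1 here) would be configured as the static RP on both IST peers and on all switches in a full-mesh or square configuration.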
NOTE: VLACP configurations should now use a short timer of 500 msec (or higher) and a minimum timeout-scale of 5. Lower values can be used, but should any VLACP ‘flapping’ occur, the user will need to increase one or more of the values. These timers have been proven to work in large scaled environments (12,000 MACs) and still provide the 3 second recovery time required for converged networks (5 x 500 msec = 2.5 seconds). Using these values may not improve re-convergence or fail-back/recovery times, but rather guarantees these times under all extreme conditions. (CR Q01925738-01 and Q01928607) Users should also note that if VLACP is admin disabled on one side of a link/connection, VLACP will bring the associated remote connection down; but since the remote side keeps the link up, the side with VLACP admin disabled now has a black-hole connection to the remote switch, which drops all packets sent to it. If VLACP is disabled on one side of a connection, it MUST also be disabled on the remote side or traffic loss will likely occur. The same applies to LACP configurations for 1-port MLTs.
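As a rough sketch of those recommended values on the ERS 8600 CLI (port 1/1 is a placeholder, and exact syntax can vary by release):
config ethernet 1/1 vlacp fast-periodic-time 500
config ethernet 1/1 vlacp timeout short
config ethernet 1/1 vlacp timeout-scale 5
config ethernet 1/1 vlacp enable
config vlacp enable
As the note stresses, the same settings would be applied on the remote end of each link; VLACP should never be left enabled on only one side.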
NOTE: If using VRRP with SMLT, it is now HIGHLY recommended (effectively a MUST) that users configure unique VRIDs, especially when scaling VRRP (more than 40 instances). Use of a single VRID for all instances is supported within the standard, but when such a configuration is used in scaled SMLT designs, instability can be seen. A better alternative, which allows scaling to the maximum number of IP VLANs, is to use an RSMLT design instead. See Section 10 in this Readme (page 10) for additional information on how to easily move from a VRRP design to an RSMLT design.
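To audit and correct this, something like the following could be used on each ERS 8600; this is only a sketch, and the VLAN IDs, VRIDs and addresses are placeholder assumptions:
show ip vrrp info
config vlan 10 ip vrrp 10 address 10.10.10.1
config vlan 10 ip vrrp 10 enable
config vlan 20 ip vrrp 20 address 10.20.20.1
config vlan 20 ip vrrp 20 enable
The point is simply that each VLAN gets its own VRID (here 10 and 20) rather than reusing, say, VRID 1 everywhere.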
NOTE: For any SMLT design, for L2 SMLT VLANs, it is now HIGHLY recommended to change the VLAN FDB aging timer from its default value of 300 seconds to 1 second higher than the system setting for the ARP aging timer. FDB timers are set on a per-VLAN basis. If using the default system ARP aging time, config ip arp aging, of 360 (minutes), then the proper value for the FDB aging timer, config vlan x fdb-entry aging-time, is 21601 seconds, which is 360 minutes (6 hours) plus 1 second. This makes the system use only the ARP aging timer for aging, rather than the FDB aging timer. This value has been shown to work very well to assure no improper SMLT learning. The use of this timer has one potential side effect: for legacy modules it limits the system to a maximum of around 12,000 concurrent MACs; for R-mode systems the limit remains at 64K, even with this timer setting. With this timer, should an edge device move, the system will still immediately re-learn and re-populate the FDB table properly, and does not have to wait for the 6 hour (plus 1 second) timer to expire. No negative operational effects are known when using this timer value. For non-SMLT based VLANs the default FDB aging timer of 300 may be used, or it can also be set to 21601. For this reason the default value of the FDB aging timer will remain at 300 (seconds) in all code releases.
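Putting the note’s two commands together (VLAN 2 below stands in for each L2 SMLT VLAN; the fdb-entry command is repeated per VLAN):
config ip arp aging 360
config vlan 2 fdb-entry aging-time 21601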
Cheers!
michael gagnon says
“NOTE: For any SMLT design, for L2 SMLT VLANs, it is now HIGHLY recommended to change the VLAN FDB aging timer from its default value of 300 seconds to 1 second higher than the system setting for the ARP aging timer. FDB timers are set on a per-VLAN basis. If using the default system ARP aging time, config ip arp aging, of 360 (minutes), then the proper value for the FDB aging timer, config vlan x fdb-entry aging-time, is 21601 seconds, which is 360 minutes (6 hours) plus 1 second.”
so, i’m a bit confused. we will have to continue changing the FDB aging timer to 1 second above the ARP timer for all VLANs? i was under the impression this was a temp. fix until 4.1.8.x arrived, and now i learn i still have to make these configuration changes across all VLANs? my biggest issue is somewhere down the road when i or another teammate is making changes, creates a new VLAN, and forgets to change the bridge/FDB timer.
Michael, what have you seen with 4.1.8.0? if you continue to set the FDB aging timer to 300 (default) by mistake, is there immediate interruption across the SMLTs?
if this is such a big issue, and something (the original ‘fix’) must still be implemented, then why can’t they create a flag or syntax to automatically change the FDB timer for any new VLAN that is created?
thanks,
Michael McNamara says
Hi Michael,
I thought that was kinda bizarre as well… when I spoke with a few of the designers/engineers their rationale was quite simple. They felt that the workaround was so successful and prevented such a large number of other issues that it should be a best practice configuration. The verbiage in the release notes is much stronger than I anticipated myself. It was suggested to me that I should change the FDB aging timer to 21601 only if I actually ran into any ARP/FDB issues or problems. I understand your confusion and I’m not sure I would have included such a statement in the release notes, because it gives (as you stated) the perception that the problem wasn’t completely fixed.
I’ve been running v4.1.8.0, which is essentially the same as 4.1.8.2 with the exception of the MSTP/RSTP fixes that are in italics in the release notes, on three different switch clusters over the past 45 days. The ARP/FDB issues that were very visible in 4.1.6.x have not shown themselves whatsoever in 4.1.8.0. I’ve also been running with the default fdb-aging timer of 300 while testing release 4.1.8.0 and haven’t seen a single issue (I was inundated with issues when running 4.1.6.3).
It’s my opinion that 4.1.8.2 looks to be stable and a very positive step forward for the 8600 switch.
Thanks for the feedback!
michael gagnon says
“I’ve also been running with the default fdb-aging timer of 300 while testing release 4.1.8.0 and haven’t seen a single issue (I was inundated with issues when running 4.1.6.3)”
that is the information i was looking for, thanks! …however, i think i will continue to run the (ARP + 1s) FDB timer just to play it safe.
joe guenther says
What have you seen with the version 5.0 software for the 8600s? We are currently running 4.1.6 on dual redundant cores with Nortel 5500 and 4500 edge switches with RSMLT back to the core. We have significant multicast issues with this software release and are hoping big time that 5.0 will have fixed this.
Michael McNamara says
Hi Joe,
I would encourage you to read the release notes for the 4.1.8.2 software release. There were fixes in that release for a large number of issues. I was aware of a few multicast issues, so I would advise you to check the release notes and see whether your issues have been addressed.
The good news is that I believe 5.0.1.0 will be GA within the next two weeks. In essence 5.0.1.0 will incorporate all the fixes and changes previously introduced in the 4.1.8.2 software. Nortel is advising customers to move to 5.0.1.0 if possible since the 4.1.x software will be EOL in September 2009.
Cheers!
Mary says
Hi All,
I’ve read your posts concerning the ARP/FDB issue and also Nortel’s recommendation to change the FDB aging time.
I have the following problem in a customer’s network, but it doesn’t exactly match the ARP/FDB issue as described in the Nortel documentation (MAC address learned on the IST instead of the SMLT links after the MAC ages out):
-for about 5 minutes, it was impossible to ping the VLAN 2 address of one of the ERS 8600 cores (classic IST/SMLT architecture)
-the customer sent me the output of “show ip arp info” and “show vlan info fdb-entry” while the problem was happening, and I saw a strange thing:
* one of the cores learned the VLAN 2 IP address of the other one via the IST link and the SMLT Remote flag was TRUE – this is correct
* the second core learned the VLAN 2 IP address of its neighbour via an SMLT link and the SMLT Remote flag was FALSE, which is not correct
I’m wondering if changing the FDB aging time will help.
Has anyone seen this before?
Thanks.
Mary
Michael McNamara says
Hi Mary,
It could be related although I’m not completely sure I understand the problem that occurred. Was there a client PC that was unable to ping one of the core ERS8600 switches? When you were talking about the FDB table, were you talking about the client PC MAC/FDB entry? When comparing the two ERS8600 switches one should have SMLT Remote = True while the other should have SMLT Remote = False. There was a “display” bug a few releases back where both switches would show SMLT Remote = True.
I would advise any customer to get themselves on 4.1.8.3 as soon as possible, especially if they are using IST/SMLT and/or any R blades.
Cheers!
James Martyniak says
Is the 4.1.8.3 a typo?
We are experiencing ARP inconsistencies between two 8600s (IST/SMLT). It is almost as if the ARP responses are not making it back to the 8600s. Chatty devices coming into the 8600 are not problematic, but VIPs from F5s, UNIX servers, and iLO ports will not get their ARPs learned consistently on the 8600s.
Originating the ARP request from the inconsistent 8600 will enter the ARP in the table. As of now we are placing static ARP entries for these servers. It is a pain right now!
–Jimmy
Michael McNamara says
Hi James,
No it’s not a typo. Nortel released 4.1.8.3 not long after 4.1.8.2 to address a few lingering issues that were discovered.
What version of code are you running today? I would highly recommend that you upgrade to 4.1.8.3; I would be very surprised if it didn’t resolve your ARP/FDB issues. It’s also recommended to tweak the FDB timer values, which can resolve a large number of issues.
If you can resolve the ARP problem by issuing an ICMP ping from either ERS 8600 switch to the affected end device, then you have the ARP/FDB issue. You can further confirm the issue by examining both the ARP and FDB/MAC tables.
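For example, running the following on both IST peers and comparing the output should show whether one peer is missing the ARP or MAC entry (VLAN 2 is a placeholder for the VLAN in question):
show ip arp info
show vlan info fdb-entry 2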
Good Luck!
James martyniak says
Mike,
Thanks for the help. We just upgraded 14 8600s last week to 4.1.8.2.
The FDB table is consistent; the ARPs are not. The timers had also been modified prior to my last post.
The change control folks are going to love us!
Jimmy
Michael McNamara says
That’s painful to say the least. I would contact Nortel and confirm that your issue/problem is resolved in the 4.1.8.3 release before you go through all that effort. I’ve been running 4.1.8.2 and 4.1.8.3 on about 5 different clusters (10 switches) and haven’t noticed any ARP/FDB issues.
The only thing mentioned in the release notes for 4.1.8.3 is the following statement:
SMLT/Layer 2 switching
There can be a discrepancy in the ARP tables between IST peers where one peer has invalid or duplicate ARP entries, ARP entries with a negative TTL value or ARP entries which do not appear to be aging.
Are you sure you don’t have a configuration problem? As a side note, these changes really only apply to switch clusters running an IST/SMLT configuration. If the ERS 8600 is a standalone switch you most likely won’t experience any of these ARP/FDB issues.
I have about 26 standalone ERS 8600 switches running a mix of 4.1.5.4 and 4.1.6.3 without issue.
Good Luck!
Francois Mikus says
We have upgraded a lab with 2 pairs of 8600 from 4.1.4 to 4.1.8.3 and multicast stability has greatly improved for long running streams. Anyone using multicast should be running 4.1.8.3.
As for the default fdb-aging timer, there should be an option to set the default timer to the new recommended value. Any existing VLANs would keep what they have now and would need to be changed manually.
If anyone opens a ticket with them on this issue, get an engineer to submit an enhancement request.
Torben Groenne says
Hi Michael
Thanks for a really useful and helpful blog; I have found some very good information on your site.
Q1:
We are running a cluster of two 8600s using IST, with a lot of VLANs. We have a problem where ARP/VRRP requests are sent first from one VRRP address and then the other, without any response from the other end, and with nothing in the MAC table.
We are running ERS 4.1.8.2 (and after reading this page I will upgrade to 4.1.8.3 or newer).
I remember that when we upgraded to 4.1.8.2 we rebooted twice, so there should not be any problems… or maybe the problem is that it never came up successfully after the upgrade.
We have opened a case with Nortel, but they had never heard of this problem before, and right now they are looking for known issues or a workaround.
Q2:
We use a copper 1 Gb link from this 8600 to a Nortel 5510 that carries the T-LAN and E-LAN for the call server (the call server is patched into this 5510).
Could the problem here be the VRRP/ARP broadcast issue described above?
We have offices around the country that come in over fiber (no bandwidth limit) to the 8600 and then through the 5510 to the call server (Meridian), with VoIP phones.
They cannot hear what the person on the other end is saying, but it works fine the other way. So I am also looking at configuring QoS.
Looking forward to hear from you or another with some help.
Thanks again for a very good blog
Torben
Michael McNamara says
Hi Torben,
I’m happy to hear that you’ve found this site useful, in the future please feel free to make additional posts over in the forums; http://forums.networkinfrastructure.info/nortel-ethernet-switching/.
I’m not quite sure I understand your first problem (question)… have you added/deleted any VLANs recently that may have caused this problem, or did it just appear after the software upgrade? While I’m not exactly sure of your problem (you’ll need to describe it in more detail for me), you may have the now infamous ARP/FDB issue where ARP entries don’t get populated properly in one of the ERS 8600 switches. You can usually test for this problem by checking the ARP tables for the destination IP address that you are trying to communicate with. If you see an ARP entry in one switch but not in the other switch, you’re going to have issues.

You can then verify the problem by disconnecting (admin-down) the uplink to the 5510 stack on the ERS 8600 switch that doesn’t have the ARP table entry. If you can communicate with the destination IP address (ping) after doing this, you have the ARP/FDB issue which was originally addressed in 4.1.8.2 and then finally put to bed in 4.1.8.3. A workaround to the problem (other than shutting down one of the links) is to ping the destination IP address from the ERS 8600 switch itself (the one that doesn’t have the ARP entry), which will populate the ARP table entry for 6 hours. Please be warned though that this workaround will only work for 6 hours, until the switch expires the ARP table entry.
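As a rough sketch of that test sequence from the CLI of the peer that is missing the entry (the IP address and port below are placeholder assumptions):
ping 10.20.2.50
config ethernet 1/1 state disable
config ethernet 1/1 state enable
The ping doubles as the temporary workaround, since a successful ping from the switch repopulates the ARP entry for 6 hours; the disable/enable pair admin-downs and then restores the uplink for the isolation test.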
Your second problem (question)… are the calls IP to IP or IP to TDM? If an IP user calls another IP user, do they hit the OWSP (one-way-speech-path) issue? If the same IP user calls a TDM user (someone with a digital or analog phone or extension), do they hit the OWSP? If your problem is with IP to TDM, you need to look at the IP path between the phones and the VGMC (Voice Gateway Media Cards). You don’t need to worry so much about the actual Succession 1000 (Meridian) Call Server and Succession Signaling Servers.

If the 5510 stack that connects these resources is also connected to a switch cluster (SMLT), then try disconnecting (admin-down) one of the uplinks to one of the ERS 8600 switches and see if that helps… if not, restore the link you just disconnected and try the other link. You can also try pinging the IP phone from the VGMC itself. I would make sure you involve your voice reseller, as they should be able to identify whether the problem is with the network. Additionally, the OWSP is not going to be a QoS fix… you should definitely have QoS configured and running, but that won’t fix your problem (unless you’re running over a T1 line or something slower than broadband).
Please feel free to follow-up in the forums and I’ll try to walk you through the steps.
Cheers!
Jim says
We’ve been running the 4.1.8.2 code on our 8600s. An interesting issue just came up. We plugged in a WebApp Firewall device from Imperva, and our 8600 switch did not like that at all. The reaction was immediate and the switch became very unstable. We did notice a spike in non-unicast packets coming into the switch (we have eHealth from CA), so we were thinking broadcast storm? We switched to one of our 5510s and everything was fine. This is a great site. Thank you Michael!!!
Jim
Michael McNamara says
Hi Jim,
I’m happy to hear you like the site. That’s an interesting issue you had. Any thoughts as to the exact problem?
Perhaps you have rate-limiting enabled on the 5510 and not on the 8600?
Thanks for the comment!
Jim Famiglietti says
Decided to leave the 5520 in place for now with the WebApp device connected. You were correct. The 8610 did NOT have any rate-limiting set (set to FALSE). However, the 5520 was set to TRUE with a percentage of 0. Someone on the team suggested that it could still work, but with a percentage set to 0? That’s a new one for me! We also thought about a code upgrade, but the 4.1.8.2 code has been very good for us. Burned on earlier releases!!!
THANKS AGAIN!!
Jim
Francois Mikus says
To those thinking of using PGM based multicast applications on ERS-8600 platforms (or ERS-1600) please beware of the following issues that affect all known releases:
PGM multicast flows will randomly stop being forwarded. This interrupts the data flow for anywhere from a few seconds up to 6 or more minutes, and it happens every 1-24 hours. It seems to affect some receivers while others continue to receive the same flows (same multicast groups) during the same period. This occurs with or without static multicast source groups, though configuring static multicast source groups reduces the occurrences.
As for the ERS-1600 platform, pgm.spm packets are dropped by the routers, breaking the PGM retransmission mechanisms. The ERS-1600 also suffers from the same random multicast flow interruptions, but at a much higher frequency: minutes instead of hours!!
I expect not many people use PGM on the Nortel platform!
I have two cases open for these problems, hopefully we will get a fix soon.
Thought I would give a heads-up to anyone thinking of deploying PGM. PIM processes are more stable with 4.1.8.3 and regular multicast traffic is well handled, though the PIM processes are still pretty touchy about bad things happening (loops, excessive mcast traffic, or a high number of groups) and will not recover gracefully. They will cause high CPU usage until they are reset or the routers are rebooted to get back to a stable platform.
Michael McNamara says
Hi Francois,
Thanks for the information.
As you mentioned, I doubt many folks will be looking for PGM support. I’m currently running IGMPv3 at the edge along with PIM-SM in the core to feed the Nortel Contact Center Agent Desktop Display (ADD) along with a few other applications, and so far so good.
Thanks again for the comment!
Thomas says
Hi Michael,
Good Day!
We are using Nortel 8600 switches with software version 4.1.6.0.
We would like to upgrade the software version. We have 256MB of flash memory. The 5.x series has now been released, and we would like to know if we can upgrade directly to the latest version with this flash memory capacity.
If that is not possible, we would like to know the latest version to which we can upgrade from 4.1.6.0.
Please also let us know the requirements for installing the latest version.
Thanks and Regards
Thomas K Mathew
Michael McNamara says
Hi Thomas,
You can upgrade to v5.x software if you have an 8691SF/8692SF with at least 256MB of memory. You’ll need at least 16MB of flash memory in order to fit the code on the /flash filesystem.
You can start by looking at this post; http://blog.michaelfmcnamara.com/2009/05/nortel-ers-8600-software-51-available/
I would advise you to review the release notes carefully, along with the release notes of all interim software releases. If you are running in an IST/SMLT configuration you should be especially careful, as a special upgrade procedure is required to upgrade to 4.1.8.x or 5.1.x from any previous 4.x or 5.x software.
Good Luck!
Carolina says
Hi Michael,
We have a similar problem with ARP/FDB on an 8310 (4.1.2.0).
Here is a brief description of our topology and the issues that we are having:
Topology:
SW1 –||— SW3
|| || ||
SW2 –||— SW4
I have 4 switches: two Nortel 8310s (SW1 and SW2) that are logically one switch, and two Cisco Nexus switches (SW3 and SW4) that are logically one switch.
The two 8310s are connected via IST and running MLT with SMLT; they interconnect via MLT5/SMLT5 with the two Cisco N9Ks, which are vPC peers. (The Nortel switches use a protocol similar to vPC.)
I have servers connected to SW3 and SW4 on VLAN 1, and the Nortel switches are the L3 gateway for VLAN 1. Servers connected to SW4 are reachable from any point in the network, but servers connected to SW3 are not: I can only ping them from SW1, not from SW2 or any other point in the network.
I can see that on SW1 and SW2 I have ARP and FDB entries via MLT5 for the servers connected to SW4, but for the servers connected to SW3 I have ARP and FDB entries via MLT5 on SW1 only. On SW2 the ARP entry is learned via MLT1 (the MLT that interconnects the Nortel switches for the IST) and I have no FDB entry for those hosts.
I tried configuring a static MAC entry on Nortel switch SW2 pointing to MLT5, and then the ping works correctly.
It seems that SW1 is learning the ARP and MAC addresses correctly for the servers connected to SW3, but for some reason SW2 is not.
In conclusion, ARP and FDB entries for servers connected to SW4 are learned properly on SW1 and SW2 via MLT5.
ARP and FDB entries for servers connected to SW3 are learned properly on SW1 via MLT5 but not on SW2, which is service affecting (on SW2 I have no FDB entry and the ARP entry is learned via MLT1).
Please Help!!