I recently tried standing up a Cisco 3825 router attached to a Cisco 3750E switch which was in turn connected via vPC to a set of Nexus 7010 switches. I spent the better part of two days trying to get the BGP peers/neighbors to establish between the two Cisco Nexus 7010 switches and the Cisco 3825 router. It was really bizarre in that I was able to ping every interface involved so I had Layer 3 connectivity yet only one of the Nexus 7010 switches could establish a BGP neighbor with the 3825 router. The keepalive timer kept expiring on the second Nexus 7010 switch. After a few days I opened a case with Cisco and a week later I was informed that the configuration I was trying to implement was not supported (didn’t work).
Layer 3 and vPC Recommendations
I was provided a copy of the Nexus 7000 virtual Port-Channel Best Practices & Design Guidelines which clearly indicates on page 25 that routers should not be connected to a vPC link but should instead be connected via a Layer 3 switch port. Here are some bullet points:
- The recommendation to use separate L3 links to hook up routers to a vPC domain still stands.
- Don’t use an L2 port channel to attach routers to a vPC domain unless you can statically route to the HSRP address.
- If both routed and bridged traffic are required, use individual L3 links for routed traffic and an L2 port-channel for bridged traffic.
I was still curious to understand more of the inner workings. Why didn’t it work, or why wasn’t it allowed? I only had to flip through the next few slides, although I can’t really say that I completely understand it just yet.
- Packet arrives at R
- R does lookup in routing table and sees 2 equal paths going north (to 7k1 & 7k2)
- Assume it chooses 7k1 (ECMP decision)
- R now has rewrite information to which router it needs to go (router MAC 7k1 or 7k2)
- L2 lookup happens and outgoing interface is port-channel 1
- Hashing determines which port-channel member is chosen (say to 7k2)
- Packet is sent to 7k2
- 7k2 sees that it needs to send it over the peer-link to 7k1 based on MAC address
- 7k1 performs lookup and sees that it needs to send to S
- 7k1 performs check if the frame came over peer link & is going out on a vPC.
- Frame will only be forwarded if outgoing interface is NOT a vPC or if outgoing vPC doesn’t have active interface on other vPC peer (in our example 7k2)
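The check in the last two steps can be written down as a small decision function. This is only a minimal sketch of the rule as the slides state it, not Cisco's implementation; the names `EgressPort`, `may_forward` and the port names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EgressPort:
    """Hypothetical description of the port a frame is about to leave on."""
    name: str
    is_vpc: bool                  # True if the port is a vPC port-channel
    peer_has_active_member: bool  # True if the other vPC peer has a live link in the same vPC


def may_forward(arrived_over_peer_link: bool, egress: EgressPort) -> bool:
    """Apply the rule from the slide to a single frame."""
    if not arrived_over_peer_link:
        return True   # the check only applies to frames that crossed the peer-link
    if not egress.is_vpc:
        return True   # non-vPC (orphan or routed) ports are always allowed
    # A vPC egress is allowed only if the other peer cannot reach it itself.
    return not egress.peer_has_active_member


# The example from the slides: frame crossed the peer-link, S hangs off a healthy vPC.
vpc_to_s = EgressPort("po2-to-S", is_vpc=True, peer_has_active_member=True)
# Same vPC, but the member link on the other peer (7k2) is down.
vpc_peer_leg_down = EgressPort("po2-to-S", is_vpc=True, peer_has_active_member=False)
# A separate L3 link, as the design guide recommends for routers.
l3_link = EgressPort("eth1/1", is_vpc=False, peer_has_active_member=False)

for port in (vpc_to_s, vpc_peer_leg_down, l3_link):
    print(port.name, port.is_vpc, may_forward(True, port))
```

In the walk-through above, the frame reaches 7k1 over the peer-link and S sits behind a vPC that is also up on 7k2, so the function returns False and the frame is dropped.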
I’m not embarrassed to say that I followed everything up until step 11. Why exactly is it that frames will only be forwarded if the outgoing interface is NOT a vPC or if the outgoing vPC doesn’t have an active interface on the other vPC peer? Is there anyone that can shed any additional light on this topic?
I’ve never experienced such a restriction in all my years of working with the Avaya (formerly Nortel) Ethernet Routing Switch 8600 and their Split Multilink Trunking (SMLT) technology. I actually have a Cisco 3825 router connected via a SMLT attached Ethernet Routing Switch 5520 (Layer 2) with the Cisco 3825 and the Avaya 8600s all running BGP.
I have been studying the same deck as part of a deployment. As I understand it, this statement is in the context of a packet arriving via the peer-link and that it will not be forwarded due to the loop avoidance algorithm. If the packet arrived across the peer-link then it should not need to be forwarded out a vPC interface because the initial switch should have a direct connection to the end device on its own vPC link, thus making the packet delivery from the secondary switch unnecessary. The only exception is if the secondary switch that receives the packet across the peer-link determines that the end device does not have a member port active on the original switch (orphan port), thus making this vPC interface the only way to reach this endpoint. Secondly, I believe this exact scenario only occurs when the MAC address is not known on the switch and the frame has to be flooded to the entire VLAN, including across the peer-link, since it is part of the Layer 2 domain.
Michael McNamara says
Thanks for the detailed explanation!
The limitation is exactly as described in #11: frames which traverse the peer-link will not be forwarded out a vPC.
gjs’ loop-avoidance explanation is the standard explanation for the limitation, but I too have difficulty grokking the problem.
The logic about “other chassis should have had a link to the end station” only holds if we’re talking about *L2* forwarding.
But we’re not. We’re talking about a frame that was *bridged correctly* across the peer-link in order to be *routed* elsewhere. Refusing to forward the frame down a vPC after routing it between VLANs seems excessively paranoid.
You’ll probably run into similar pain dealing with “clever” end stations that don’t bother with ARP. When preparing a SYN/ACK for reply to a client, instead of ARPing for the MAC of their default gateway, these systems simply swap the position of src/dest addresses in the L3 and L2 headers. The SYN/ACK frame gets addressed to the Nexus that originated the prior SYN frame. Maybe you’re lucky, and it gets forwarded on the correct vPC member link, or maybe not.
That symptom will come to you from the storage guys: “Hey, I can only ping the Celerra from odd-numbered client systems!” EMC calls this ‘packet reflect’. Switch it off.
You can work around all of this with the vPC ‘peer-gateway’ lever.
Thank you for all of your blogging about Nortel stuff. I’m working in a Nortel environment these days, and more than one google search has led me to useful information on your blog.
I agree there are definitely some caveats to implementing vPC, but your comments are a bit misleading in telling the whole story. In your case, you’re really referring to traffic that is routing off the Layer 2 vPC domain through an SVI interface. In this case, the *expectation* is that the egress packet will have a destination MAC address of the HSRP virtual address, and both switches actively forward for this same MAC address when vPC is enabled. Cisco makes this happen with its concept of “selective local forwarding”, avoiding the use of the peer-link by having both HSRP members actively routing upstream from the vPC VLAN. ‘Peer-gateway’ definitely resolves the issue in cases where the packet is being sent to the physical router MAC rather than the virtual MAC, just as you have stated. With that being said, I think you’re pointing the finger at the wrong device. Shouldn’t the endpoints always forward to their default gateway MAC address and learn it properly via ARP rather than by other non-standard means? I really don’t have an issue with their implementation in this case when you look at the routing architecture. Secondly, if you think about the amount of traffic that each Nexus box can carry and the link speed of the vPC peer-link, I think you will agree that using it as a primary traffic path should be approached with caution, which is why that command is not on by default and is only an option to be used as necessary.
I assume your comments are directed at me? Can’t really tell.
In any case, I think we’re in agreement here. EMC’s packet reflect is a nasty misfeature, and I said I think it should be switched off. I didn’t mean to direct any blame at the Nexus here. The EMC gets what it deserves for assuming it knows the address of the best gateway on the segment.
“In your case, your really referring to traffic that is routing off the layer2 vPC domain through an SVI interface.”
The topic of Michael’s post (bgp peering over vPC) and the scenario I laid out (EMC packet reflect) both match that description perfectly.
In both cases, the same thing is happening:
– a downstream device (Michael’s BGP router, or the EMC)
– has an IP packet that needs to be routed off-net
– wraps it in an Ethernet frame addressed to a specific Nexus (not the HSRP MAC).
– Etherchannel tosses a coin, forwards frame to the wrong Nexus.
– The wrong Nexus bridges it across the peer link.
– The right Nexus drops it, rather than sending it out a vPC link, after routing it to a new VLAN.
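Here is a rough sketch of that sequence, mostly to show why only some flows break. The hash is a stand-in (XOR of the two IP addresses, modulo the number of links), not the real port-channel algorithm, and the function names, addresses and the `peer_gateway` switch are assumptions for illustration only.

```python
import ipaddress

NEXUS = ("7k1", "7k2")  # one port-channel member link to each Nexus


def member_link(src_ip: str, dst_ip: str) -> str:
    """Hypothetical port-channel hash: XOR the addresses, pick a member link."""
    key = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return NEXUS[key % len(NEXUS)]


def outcome(src_ip: str, dst_ip: str, addressed_to: str,
            egress_is_vpc: bool = True, peer_gateway: bool = False) -> str:
    """Follow one frame that is addressed to a specific Nexus router MAC."""
    landed_on = member_link(src_ip, dst_ip)
    if landed_on == addressed_to:
        return "routed locally"
    if peer_gateway:
        # With peer-gateway the receiving Nexus routes on behalf of its peer.
        return "routed by the peer (peer-gateway)"
    # Otherwise the frame is bridged across the peer-link to the right Nexus.
    if egress_is_vpc:
        return "dropped"  # crossed the peer-link and wants to leave on a vPC
    return "routed after crossing the peer-link"


# A storage box replying to four clients, every frame addressed to 7k1's MAC.
for last_octet in range(1, 5):
    client = f"10.0.1.{last_octet}"
    print(client, outcome("10.0.0.50", client, addressed_to="7k1"))
```

With this made-up hash the replies to odd-numbered clients land on 7k2 and are dropped, while even-numbered clients work, which is the flavour of symptom the storage guys report. Passing `peer_gateway=True` makes every flow succeed in the sketch.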
Sure, packet reflect is nasty. End stations should ARP for the gateway address. No doubt about that. When it came up in my environment, I had the storage guys fix their EMC behavior, rather than turn on peer-gateway.
But I still don’t see the loop risk that is mitigated by Michael’s step #11 when we’re talking about packets that need to be routed off net. It seems paranoid. This frame can’t loop. The frame should die right there, and the Nexus should wrap its payload (an IP packet) in an entirely new frame before sending it on its way in some other VLAN.
Fair point, I don’t really see a loop issue either in your case.
I think it has more to do with limiting traffic across the peer-link and leveraging/being consistent with their already optimized local forwarding on each chassis. Each chassis is fully capable of processing the packet for HSRP and making the routing decision, so there should be no case whereby a packet to the local router MAC address needs to use the vPC peer-link. The local switch gateway is always active. As a side point, the recommendation from Cisco for west/east routing is to deploy a separate port-channel with a dedicated p2p non-vPC VLAN and establish a routing adjacency for cases when the upstream may have failed and traffic needs to be routed to the other switch. Additionally, for routing protocol scenarios over vPC links themselves, which is the original topic of this discussion, the issue gets further complicated by the fact that when packets arrive at a vPC peer the TTL is also decremented, so OSPF/EIGRP/BGP will never come up since packets can be hashed to the wrong switch and will never make it over the peer-link with the TTL at 0. Bottom line is that these new spanning-tree-free topologies definitely challenge our traditional Layer 2/3 knowledge. I’m definitely still digesting vPC and the new architectures with Nexus. With the introduction of FabricPath/TRILL, there are more challenges to come as well. I guess this keeps things interesting :)
Great info gjs, exactly the same thing I ran into when we deployed our N7K’s last year where I assumed I can have OSPF over the PL and was informed that the recommendation was to use a different link altogether to avoid the TTL issues. As with SMLT, engineering vPC is very similar in that you try not to have any user traffic pass over the IST or the PL. The peer-gateway feature fixed that issue for vPC.
As for your comment about vPC being new… well that’s up for debate. vPC is a vastly inferior product to SMLT (Nortel/Avaya), which has been around for about 8 years now. SMLT is much simpler to configure and supports both L2 and L3 without all the gotchas that you have to consider when implementing vPC. I got caught in the same trap as the author since I came from an SMLT background. But if it came down to which vendor to buy… as my Cisco SE likes to say “You can always buy better… but you’ll never pay more” ;)
TRILL/FabricPath can be lumped in with 802.1aq (which has been around for a bunch of years in the carrier Ethernet space); these will blow away our ideas of EtherChannel, vPC, SMLT, etc. Cisco should’ve let TRILL/FP die off since aq was well established… but now that they have accepted it, it will probably become de facto like everything else they embrace. This means for us early adopters, we’re going to have to buy new hardware since the ASICs will have to change to support these features (which is not true for aq).
That’s my 0.5c… I like to hear myself talk.
I am a bit confused by this Nexus vPC Layer 3 issue and really need some help with it. Let me explain my scenario. I have two Nexus 7Ks connecting to a pair of N5Ks with a double-sided vPC. I have a router connected to an access port on an N5K, and it needs an EIGRP neighborship with both N7Ks. It is a single broadcast segment. N7K 1 and 2 will have SVIs in the same subnet as the router. Above the N7Ks there will be only Layer 3 links, with no vPC.
So I need to know whether the router connected to the N5K can form a routing protocol neighborship with the N7Ks through a vPC and reach the other destinations above the N7Ks at Layer 3.
Thanks in advance.
Michael McNamara says
Since you have the Nexus 5010/5020s connected to the Nexus 7010s via a vPC you won’t be able to use dynamic routing; supposedly you can use static routing.
Thanks for the reply.
I just want to know whether a workaround with two different interfaces on the router would work. INT 1 would peer with N7K 1 and INT 2 with N7K 2.
So you’re saying you have a traditional design with 7K’s connecting to 5K access layer with VPC between them. Then you have a router connected to the 5K’s where you’ll create a VLAN up to the 7K’s and do EIGRP peering?
– How is the VPC connected between the 7K’s and 5K’s… 2 links in a square? 4 links in full mesh?
– From the 7K’s how is it connected into the rest of your network?
– Any particular reason why the router is not connecting directly to the 7Ks? This isn’t the best way to do it….
– I’m gonna go ahead and say you will see the issue described in this article… the questions above will allow me to tell you why.
– This issue does not exist with SMLT ;)
– Because you have an intermediate hop between the peering routers, you won’t get the best failure detection and you might want to use a keepalive mechanism to have end-to-end failure detection, such as BFD. Otherwise your failure detection is waiting for your peer to time out…
Between the N7Ks and N5Ks there will be a vPC with 2 links from each N5K to both N7Ks. The N7Ks will be connected to a single 6500 with Layer 3 routed interfaces running EIGRP. The router I am mentioning here is a server which needs to run a routing protocol. The server design is similar to mainframe concepts. The server will have 2 interfaces connected to the N5Ks on two separate VLANs.
Hooking the server directly to the N7Ks to run a routing protocol, as Cisco recommends, is not feasible for me: the N7Ks have only 10G interfaces and the server has 1G.
Steve S. says
The problem with running a dynamic routing protocol in a vPC environment is how the 7Ks handle TTL. If you have the peer-gateway feature configured on the 7Ks, any packets that may traverse the vPC peer-link will have their TTL decremented by one. Since most routing protocol traffic uses a TTL of 1 (some EIGRP implementations use a TTL of 2), obviously this traffic will be dropped and never seen by the peer 7K. So depending on the hashing on the vPC, one 7K may never even be aware there is another router on the VLAN attempting to form an adjacency, which is why this is unsupported by Cisco. Remember, a vPC != a broadcast domain.
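A minimal sketch of the TTL arithmetic described above, assuming the behaviour is exactly as stated in this comment (one extra decrement when the packet is hashed to the wrong 7K). The function names are hypothetical and this is not taken from NX-OS documentation.

```python
def survives_extra_hop(ttl: int) -> bool:
    """A routed hop decrements TTL; the packet is discarded if nothing is left."""
    return ttl - 1 > 0


def protocol_packet_delivered(ttl: int, hashed_to: str, addressed_to: str) -> bool:
    """Does a routing protocol packet reach the 7K it was meant for?"""
    if hashed_to == addressed_to:
        return True                 # landed on the right 7K, TTL untouched
    return survives_extra_hop(ttl)  # landed on the wrong 7K, costs one TTL


# TTL 1 is typical for routing protocol packets; TTL 2 is the EIGRP case mentioned above.
for ttl in (1, 2):
    for hashed_to in ("7k1", "7k2"):
        ok = protocol_packet_delivered(ttl, hashed_to, addressed_to="7k1")
        print(f"ttl={ttl} hashed_to={hashed_to} delivered={ok}")
```

Under these assumptions a TTL 1 packet only arrives when the hash happens to pick the link to the intended 7K, which matches one neighbor coming up while the other keeps timing out.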
Thanks Steve, I understand the issue now. The TTL decrement prevents the routing protocol neighborship between the N7K and the router. Whether it peers depends on the fate of the routing packet after the L2MP hashing: if the packet reaches the intended N7K directly it peers, and if not, the TTL gets decremented while moving across the peer-link. This happens because of the peer-gateway feature. If you don’t enable it, it’s the vPC loop avoidance which drops the routing packets travelling across the peer-link.
tn – Thanks for the information. I’m pretty much a Cisco guy so I didn’t know about the solutions you mentioned.
So thinking more about the routing issue. After recently getting some exposure to SRX clustering on their 5800s and thinking about how Cisco VSS works, I’m wondering why Cisco didn’t go with a single control plane on their 7k vPC multi-chassis solution. As I see it, that is fundamentally the issue with the 7k. Since you have two active control planes for routing, we have two routers to deal with at Layer 3, creating a conflict with the Layer 2 representation. I’m curious as to why they didn’t further pursue making a single control plane cluster solution on the Nexus 7k similar to VSS, with a single RP active and representing both boxes from a Layer 3 perspective. I haven’t worked too much with VSS, but after working with the SRX cluster, I’m very impressed. I’m assuming someone much smarter than me thought it was a better decision to keep the control planes separate, but after working with a cluster, I’m definitely a fan of the single control plane for both boxes.
Anybody have any thoughts on the topic?
Steve S. says
Because all the Nexus gear was meant to support FCoE, and FC does not support the merging of fabrics (i.e. single control plane), it was best left as two distinct control planes. This is just my guess, by the way.
Interestingly, this sounds exactly like the issue I experienced, albeit with Avaya and the IST link and SMLT links. From forum post: http://forums.networkinfrastructure.info/nortel-ethernet-switching/ers-5530-l3-packet-crossing-ist-won't-be-forwarded/
From Nortel guide:
“The ERS 5000 will not permit a Layer 3 packet that has traversed the IST to egress any SMLT or SLT port. The issue arises in the above configuration if the packets are hashed down the link to ERS 5530-1 while the next hop router is actually ERS 5530-2. Layer 3 packets are forced across the IST and will not be permitted to egress the SMLT/SLT link to the Edge. In order to circumvent this issue use VRRP with Backup Master on the ERS 5000 Switch Cluster and point static routes from the external router to the VRRP interface. With this configuration, Layer 3 packets will not traverse the IST and will flow as expected to the SMLT/SLT links.”
Seems like a similar issue to me?
Michael McNamara says
Yes, this blog post covers the same issue but on the Cisco Nexus product line utilizing Virtual Port Channels. The restrictions are similar between the Avaya and Cisco implementations.