Before we get into the details, allow me to give credit where credit is due. First, thanks to Dan Parsons of IT Obsession for an article that jump-started the process with notes on the Cisco IOS configuration. Next, credit goes to the VMTN Forums, especially this thread, in which some extremely useful information was exchanged. I would be remiss if I did not adequately credit these sources for the information that helped make this testing successful.
There are actually two different pieces described in this article. The first is NIC teaming, in which we logically bind together multiple physical NICs for increased throughput and increased fault tolerance. The second is VLAN trunking, in which we configure the physical switch to pass VLAN traffic directly to ESX Server, which will then distribute the traffic according to the port groups and VLAN IDs configured on the server. I wrote about ESX and VLAN trunking a long time ago and ran into some issues then; here I’ll describe how to work around the issues I ran into at that time.
So, let’s have a look at these two pieces. We’ll start with NIC teaming.
Configuring NIC Teaming
There’s a bit of confusion regarding NIC teaming in ESX Server and when switch support is required. You can most certainly create NIC teams (or “bonds”) in ESX Server without any switch support whatsoever. Once those NIC teams have been created, you can configure load balancing and failover policies. However, those policies will affect outbound traffic only. In order to control inbound traffic, we have to get the physical switches involved. This article is written from the perspective of using Cisco Catalyst IOS-based physical switches. (In my testing I used a Catalyst 3560.)
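To make the "outbound only" point concrete, here is a minimal Python sketch of a port-ID-based teaming policy. It assumes a simplified model (virtual port ID modulo the number of uplinks), not VMware's actual selection code, and the NIC names are illustrative. Notice that nothing in it touches inbound traffic: without a port channel, the physical switch sends replies to whichever port it last learned the VM's MAC address on.

```python
# Simplified model of "Route based on the originating virtual port ID".
# Assumption: each virtual port is pinned to one uplink as port_id % len(uplinks).
# The real ESX logic may differ; this only illustrates outbound pinning.

def uplink_for_port(port_id: int, uplinks: list[str]) -> str:
    """Return the physical NIC that a virtual port's outbound traffic uses."""
    if not uplinks:
        raise ValueError("vSwitch has no uplinks")
    return uplinks[port_id % len(uplinks)]


def failover(uplinks: list[str], failed: str) -> list[str]:
    """Remove a failed NIC; ports are then redistributed across the survivors."""
    return [nic for nic in uplinks if nic != failed]


if __name__ == "__main__":
    team = ["vmnic0", "vmnic1"]
    for port in range(4):
        print(port, uplink_for_port(port, team))
    survivors = failover(team, "vmnic1")
    print([uplink_for_port(p, survivors) for p in range(4)])
```

The ESX host alone decides only which NIC transmits; steering traffic coming back in is what the switch-side configuration below is for.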
To create a NIC team that will work for both inbound and outbound traffic, we’ll create a port channel using the following commands:
s3(config)#int port-channel1
s3(config-if)#description NIC team for ESX server
s3(config-if)#int gi0/23
s3(config-if)#channel-group 1 mode on
s3(config-if)#int gi0/24
s3(config-if)#channel-group 1 mode on
This creates port-channel1 (you’d need to change this name if you already have port-channel1 defined, perhaps for switch-to-switch trunk aggregation) and assigns GigabitEthernet0/23 and GigabitEthernet0/24 to the team. Now, however, you need to ensure that the load balancing mechanism used by the switch matches the one used by ESX Server. To find out the switch’s current load balancing mechanism, use this command in enable mode:
show etherchannel load-balance
This will report the current load balancing algorithm in use by the switch. On my Catalyst 3560 running IOS 12.2(25), the default load balancing algorithm was set to “Source MAC Address”. On my ESX Server 3.0.1 server, the default load balancing mechanism was set to “Route based on the originating virtual port ID”. The result? The NIC team didn’t work at all: I couldn’t ping any of the VMs on the host, and the VMs couldn’t reach the rest of the physical network. It wasn’t until I matched up the switch/server load balancing algorithms that things started working.
To set the switch load-balancing algorithm, use one of the following commands in global configuration mode:
port-channel load-balance src-dst-ip   (to enable IP-based load balancing)
port-channel load-balance src-mac      (to enable MAC-based load balancing)
There are other options available, but these are the two that seem to match most closely to the ESX Server options. I was unable to make this work at all without switching the configuration to “src-dst-ip” on the switch side and “Route based on ip hash” on the ESX Server side. From what I’ve been able to gather, the “src-dst-ip” option gives you better utilization across the members of the NIC team than some of the other options. (Anyone care to contribute a URL that provides some definitive information on that statement?)
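Why IP-based hashing spreads load, and why it still can't speed up a single conversation, is easier to see in code. This Python sketch assumes a simplified hash (XOR of the two 32-bit addresses, modulo the team size); neither ESX nor the Catalyst is claimed to use exactly this formula, and the addresses are made up.

```python
import ipaddress

# Simplified model of IP-hash teaming (assumption: XOR of full source and
# destination addresses, modulo the number of team members). Real ESX and
# Catalyst implementations use their own variants of this idea.


def ip_hash_uplink(src: str, dst: str, n_uplinks: int) -> int:
    """Return the index of the team member a src/dst IP pair is pinned to."""
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s ^ d) % n_uplinks


if __name__ == "__main__":
    vm = "10.0.0.10"
    # One VM talking to several hosts: flows land on different links.
    for host in ["10.0.0.20", "10.0.0.21", "10.0.0.22", "10.0.0.23"]:
        print(vm, "->", host, "uses uplink", ip_hash_uplink(vm, host, 2))
    # A single conversation always hashes to the same link, however busy it is.
    print({ip_hash_uplink(vm, "10.0.0.20", 2) for _ in range(1000)})
```

Because the result depends only on the address pair, traffic between one VM and one iSCSI target (or one NFS export) never exceeds the bandwidth of a single link; the gain comes from many pairs being spread across the team.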
Creating the NIC team on the ESX Server side is as simple as adding physical NICs to the vSwitch and setting the load balancing policy appropriately. At this point, the NIC team should be working.
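Since the mismatch cost me so much time, here is a small, hypothetical Python checker that encodes the rule of thumb from this article: an IP-based method on the switch paired with “Route based on ip hash” on the vSwitch. The output format it parses is taken from the line a reader reports seeing on IOS 12.2(25)SEB4 (“EtherChannel Load-Balancing Operational State (src-ip):”); other IOS releases may print something different, and the function names are my own.

```python
import re
from typing import Optional

# Hypothetical helper: not a Cisco or VMware tool. It only encodes the
# pairing that worked in this article's testing.
IP_METHODS = {"src-ip", "dst-ip", "src-dst-ip"}
ESX_IP_HASH = "Route based on ip hash"


def switch_method(show_output: str) -> Optional[str]:
    """Extract the method from a line such as
    'EtherChannel Load-Balancing Operational State (src-ip):'."""
    m = re.search(r"Operational State \(([\w-]+)\)", show_output)
    return m.group(1) if m else None


def check_team(show_output: str, esx_policy: str) -> list[str]:
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    method = switch_method(show_output)
    if method is None:
        problems.append("could not find the switch load-balancing method")
    elif method not in IP_METHODS:
        problems.append(f"switch uses {method}; use an IP-based method such as src-dst-ip")
    if esx_policy != ESX_IP_HASH:
        problems.append(f"ESX uses '{esx_policy}'; set '{ESX_IP_HASH}' for a port channel")
    return problems


if __name__ == "__main__":
    # The failing defaults from my test (switch output line assumed, same format).
    bad = "EtherChannel Load-Balancing Operational State (src-mac):"
    for problem in check_team(bad, "Route based on the originating virtual port ID"):
        print(problem)
```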
Configuring VLAN Trunking
In my testing, I set up the NIC team and the VLAN trunk at the same time. When I ran into connectivity issues as a result of the mismatched load balancing policies, I thought they were VLAN-related issues, so I spent a fair amount of time troubleshooting the VLAN side of things. It turns out, of course, that it wasn’t the VLAN configuration at all. (In addition, one of the VMs that I was testing had some issues as well, and that contributed to my initial difficulties.)
To configure the VLAN trunking, use the following commands on the physical switch:
s3(config)#int port-channel1
s3(config-if)#switchport trunk encapsulation dot1q
s3(config-if)#switchport trunk allowed vlan all
s3(config-if)#switchport mode trunk
s3(config-if)#switchport trunk native vlan 4094
This configures the NIC team (port-channel1, as created earlier) as an 802.1Q VLAN trunk. You then need to repeat this process for the member ports in the NIC team:
s3(config)#int gi0/23
s3(config-if)#switchport trunk encapsulation dot1q
s3(config-if)#switchport trunk allowed vlan all
s3(config-if)#switchport mode trunk
s3(config-if)#switchport trunk native vlan 4094
s3(config-if)#int gi0/24
s3(config-if)#switchport trunk encapsulation dot1q
s3(config-if)#switchport trunk allowed vlan all
s3(config-if)#switchport mode trunk
s3(config-if)#switchport trunk native vlan 4094
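Because the port channel and every member port need identical trunk settings, it can help to generate the commands rather than type them repeatedly. This Python sketch simply emits the same four switchport commands used above for a port channel and a list of member interfaces; the function name is hypothetical, and the output is meant to be reviewed and pasted into configuration mode, not pushed to the switch automatically.

```python
# Hypothetical helper that emits the trunk commands from this article for a
# port channel and its member ports, so the settings stay identical on all of them.


def trunk_config(channel: int, members: list[str], native_vlan: int = 4094) -> list[str]:
    """Return IOS configuration lines for the port channel and each member."""
    if not 1 <= native_vlan <= 4094:
        raise ValueError("native VLAN must be between 1 and 4094")
    trunk = [
        "switchport trunk encapsulation dot1q",
        "switchport trunk allowed vlan all",
        "switchport mode trunk",
        f"switchport trunk native vlan {native_vlan}",
    ]
    lines = []
    for intf in [f"interface port-channel{channel}", *[f"interface {m}" for m in members]]:
        lines.append(intf)
        lines.extend(" " + cmd for cmd in trunk)
    return lines


if __name__ == "__main__":
    print("\n".join(trunk_config(1, ["GigabitEthernet0/23", "GigabitEthernet0/24"])))
```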
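If you haven’t already created VLAN 4094, you’ll need to do that as well. (The VLAN itself is created with the vlan command; interface vlan 4094 would only add a Layer 3 SVI for it. Because 4094 is an extended-range VLAN, the switch may also need to be in VTP transparent mode.)
s3(config)#vlan 4094
s3(config-vlan)#exit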
The “switchport trunk native vlan 4094†command is what fixes the problem I had last time I worked with ESX Server and VLAN trunks; namely, that most switches don’t tag traffic from the native VLAN across a VLAN trunk. By setting the native VLAN for the trunk to something other than VLAN 1 (the default native VLAN), we essentially force the switch to tag all traffic across the trunk. This allows ESX Server to handle VMs that are assigned to the native VLAN as well as other VLANs.
On the ESX Server side, we just need to edit the vSwitch and create a new port group. In the port group, specify the VLAN ID that matches the VLAN ID from the physical switch. After the new port group has been created, you can place your VMs on that new port group (VLAN) and, assuming you have a router somewhere to route between the VLANs, you should have full connectivity to your newly segregated virtual machines.
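Conceptually, the vSwitch uses the VLAN ID on each incoming frame to pick the port group(s) that receive it and strips the tag before handing the frame to the VM. This Python sketch models that under stated assumptions: the port group names and VLAN IDs are hypothetical, untagged frames go to port groups with no VLAN ID (0), and frames with no matching port group are dropped.

```python
from typing import Optional

# Hypothetical port groups on one vSwitch: name -> VLAN ID (0 = no VLAN ID set).
PORT_GROUPS = {"VM Network": 0, "VLAN100": 100, "VLAN200": 200}


def deliver(frame_vlan: Optional[int], port_groups: dict[str, int]) -> list[str]:
    """Return the port groups that receive a frame arriving on the trunk.

    frame_vlan is None for an untagged frame. In this model the vSwitch strips
    the tag before the VM sees the frame, so guests need no VLAN configuration.
    An empty list means no port group matches and the frame is dropped.
    """
    wanted = 0 if frame_vlan is None else frame_vlan
    return [name for name, vid in port_groups.items() if vid == wanted]


if __name__ == "__main__":
    print(deliver(100, PORT_GROUPS))   # tagged VLAN 100 -> the VLAN100 port group
    print(deliver(None, PORT_GROUPS))  # untagged -> the port group without a VLAN ID
    print(deliver(300, PORT_GROUPS))   # no matching port group -> dropped
```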
Final Notes
I did encounter a couple of weird things during the setup of this configuration (I plan to leave the configuration in place for a while to uncover any other problems).
- First, during troubleshooting, I deleted a port group on one vSwitch and then re-created it on another vSwitch. However, the virtual machine didn’t recognize the connection. There was no indication inside the VM that the connection wasn’t live; it just didn’t work. It wasn’t until I edited the VM, set the virtual NIC to a different port group, and then set it back again that it started working as expected. Lesson learned: don’t delete port groups.
- Second, after creating a port group on a vSwitch with no VLAN ID, one of the other port groups on the same vSwitch appeared to “lose” its VLAN ID, at least as far as VirtualCenter was concerned. In other words, the VLAN ID was listed as “*” in VirtualCenter, even though a VLAN ID was indeed configured for that port group. The “esxcfg-vswitch -l” command (that’s a lowercase L) on the host still showed the assigned VLAN ID for that port group, however.
- It was also the “esxcfg-vswitch” command that helped me troubleshoot the problem with the deleted/re-created port group described above. Even after re-creating the port group, esxcfg-vswitch still showed 0 used ports for that port group on that vSwitch, which told me that the virtual machine’s network connection was still somehow askew.
Hopefully this information will prove useful to those of you out there trying to set up NIC teaming and/or VLAN trunking in your environment. I would recommend taking this one step at a time, not all at once like I did; this will make it easier to troubleshoot problems as you progress through the configuration.
Tags: Cisco, ESX, Interoperability, Networking, Virtualization, VLAN, VMware


38 comments
Tuesday, August 5, 2008 at 10:47 am
Pingback from Xsigo I/O Director Tips and Tricks - blog.scottlowe.org - The weblog of an IT pro specializing in virtualization, storage, and servers
Wednesday, August 6, 2008 at 12:04 pm
Pingback from New VLAN Article at SearchVMware.com - blog.scottlowe.org - The weblog of an IT pro specializing in virtualization, storage, and servers
Tuesday, January 2, 2007 at 7:21 pm
Tim Hollingworth
Chauncey wants his SAN back.
Wednesday, January 3, 2007 at 10:01 pm
slowe
Hey, I let Greg decide where the SAN was going, so don’t pin this on me!!
Toby said he thought he had enough equipment to send another unit over to Charlotte…he’ll probably end up with something newer and bigger than I have.
Scott
Friday, January 5, 2007 at 9:39 am
Ecio
Hi Scott,
Just a couple of notes based on my limited experience (I’m doing my first tests at work these days).
My config is ESX 3.0 (4 NICs) + 2 Cisco 3750s (stacked) + NetApp 3020C
After some testing (yesterday and today), I succeeded in configuring NIC teaming + VLAN trunking on two of the NICs and then used this connection to transport network data, iSCSI, and so on. The ESX server is now using a datastore on the NetApp via iSCSI.
Here’s what I found:
1) On my IOS version 12.2(25)SEB4, ‘show etherchannel load-balance’ reports
EtherChannel Load-Balancing Operational State (src-ip):
so I set “Route based on ip hash” on the NIC teaming page.
2) I had to leave out the “switchport trunk native vlan xxx” statement, because when I left it in the config I wasn’t able to use any VLAN other than xxx. When I deleted the statement, I was able to use all of the VLANs without problems.
3) (This is iSCSI-related) If you don’t create a Service Console on the iSCSI network, you won’t be able to scan for/find LUNs on the storage[*]. I think this is because our iSCSI network is not routed, so the ESX server can’t reach it without having a Service Console “foot” on that network (even though the VMkernel of course has an IP on that network).
[*] This means that if you delete the iSCSI Service Console, everything works until you try to find another LUN or reboot the ESX server –> BOOM, you can’t see the storage anymore (fortunately I found this before going into production :D)
Ciao,
Ecio
PS: Sorry for my English
Friday, January 5, 2007 at 5:03 pm
slowe
Ecio,
Yes, I saw it documented somewhere about the need for ESX to be able to make a connection to the iSCSI target, so that means routed traffic or a Service Console connection on the same subnet as the iSCSI target. I can’t recall where I saw it documented, but I *know* I remember seeing it in the documentation somewhere.
Otherwise, glad to hear things are working well for you! Be sure to check out my recent article on recovering data inside VMs using NetApp snapshots:
http://blog.scottlowe.org/2006/12/30/recovering-data-inside-vms-using-netapp-snapshots/
Thanks,
Scott
Tuesday, July 31, 2007 at 11:22 pm
Joey
Hey Scott,
I really appreciate all the work you put into pulling all this together. Thanks. We have just purchased VMware and we are looking to do everything you have talked about. BUT… I was wondering if you have tried configuring the devices that communicate via iSCSI with either a NetApp or EMC array to use jumbo frames. We are hoping to do this with Cisco 3750 48-port Gigabit switches.
This is really where I want to take it. Because we don’t have all of our equipment yet, I am unable to test it to see if it will work.
Please email me and let me know what you think. Thanks again.
Wednesday, August 1, 2007 at 6:01 am
slowe
Joey,
If you are using the software iSCSI initiator within ESX, jumbo frames are currently out of the question, as they aren’t supported. Hardware iSCSI initiators are, of course, a different story.
Good luck with your implementation!
Wednesday, August 1, 2007 at 10:26 pm
Joey
Hey Scott,
Are you saying that software iSCSI, like the Microsoft iSCSI initiator I would install on a Win2K3 server connected to a SAN via iSCSI, doesn’t support jumbo frames? The NICs in the servers where I have it installed do support jumbo frames. Or are you saying there is something specific to ESX?
I’m trying to find out all the pros and cons so if you could shed some more light on the situation for me i’d appreciate it.
Thanks.
Thursday, August 2, 2007 at 5:55 am
slowe
Joey,
The software iSCSI initiator that ships with ESX Server does not support jumbo frames, so even if the rest of your infrastructure (NICs, switches, storage array) supports jumbo frames it still won’t really matter.
The release notes for ESX 3.0.2 (which was just released yesterday) do not indicate any change in jumbo frame support.
If jumbo frame support is a *MUST HAVE* requirement for you, then you’ll need to look into hardware iSCSI initiators.
Hope this helps!
Friday, August 31, 2007 at 4:28 pm
mike
Can I just configure the server for link aggregation without configuring EtherChannel on the switch?
Friday, August 31, 2007 at 5:21 pm
slowe
Yes, you can. However, there is currently some debate as to whether this will distribute traffic across the various links as efficiently as using EtherChannel/LACP.
Wednesday, September 12, 2007 at 8:02 am
Kelly Olivier
I agree, Scott. VMware misleads people into believing that its NIC teaming is LACP. However, EtherChannels work better per our testing. We just use an EtherChannel on the Cisco side and configure the ESX box to use IP hash. This has worked great.
Friday, November 16, 2007 at 5:03 am
Mimmus
It’s true: with LACP, ESX doesn’t balance traffic toward ONE iSCSI target. You need to configure multiple destinations, using virtual iSCSI IPs.
Friday, November 16, 2007 at 7:00 am
slowe
Mimmus,
It’s my understanding that’s true for any solution built using EtherChannel/LACP, not just a VMware limitation; specifically, the data flow between any two single endpoints (IP addresses) cannot exceed the bandwidth of a single link. The advantage comes in the distribution of multiple data flows across multiple links.
Thursday, February 21, 2008 at 11:54 am
Charlie Brown
Hey Scott,
You mentioned that you have run into some configuration issues, and I was wondering if you would have any idea about the issue I’m having. We are using HP 685 blade enclosures with ProCurve switches. We have two virtual switches, each with two NICs, and we are also using VLAN tagging on the switches. The issue is that every once in a while, when a VM migrates to another server, it loses network connectivity. This pops up more frequently when we patch the ESX hosts and migrate multiple VMs around. One person on the team thinks it is due to Network Failover Detection being set to Beacon Probing. Do you have any suggestions? For the life of me, I have been trying to reproduce the problem but cannot.
Any suggestions would be appreciated.
Thursday, February 21, 2008 at 12:14 pm
slowe
Charlie,
There could be multiple issues at play here. I’ve heard of numerous issues with Beacon Probing, but I can’t definitively say that’s the problem. It could also be your vSwitch configuration and how you are sharing NICs for VM traffic, the Service Console, and VMotion.
Start by switching from Beacon Probing to Link Status. Then move to overriding the vSwitch failover policy for specific port groups so that the Service Console prefers one NIC, VMotion prefers the other NIC, and so on.
Hope this helps!
Tuesday, February 26, 2008 at 7:17 am
vijaysys
Hi…
I need one bit of help…
How do I find the MAC address for an ESX server?
Thanks & regards
VJ
Tuesday, February 26, 2008 at 7:40 am
slowe
Vijaysys,
You can find the MAC address for the Service Console of the ESX Server using the “ifconfig” command at the Service Console. MAC addresses for virtual machines hosted on an ESX Server can be determined by running the guest OS-specific commands, such as “ifconfig” or “ipconfig” within the guest.
Hope this helps!
Wednesday, March 26, 2008 at 10:59 pm
Brian
Scott,
Great post! We are looking to do this with ESX 3i on Dell 2950s and a NetApp 2050 SAN. I want to mount my volume using NFS instead of iSCSI to avoid SCSI locking. Will your teaming and trunking config still help out with NFS?
Thursday, March 27, 2008 at 8:16 am
slowe
Brian,
Using NIC teaming will help with overall throughput to multiple NFS exports on separate IP addresses, but not with overall throughput to a single NFS export on a single IP address. You’ll also want to look at using multi-mode VIFs on the NetApp side as well:
http://blog.scottlowe.org/2007/06/13/cisco-link-aggregation-and-netapp-vifs/
http://blog.scottlowe.org/2008/01/08/lacp-with-cisco-switches-and-netapp-vifs/
Hope this helps!
Wednesday, May 14, 2008 at 3:32 am
John Flick
Question: If I am going from ESX to an iSCSI device, and my data flows go from ESX to the iSCSI device, am I really “bonding” for a full X Gb of bandwidth? When I look at the NICs that are sending data (for example, during a file copy from a guest that is copying from internal storage, where the C: .vmdk is, to the iSCSI device, where the D: .vmdk is), I notice only one NIC is getting in on the act. In other words, I’m not filling up all four NICs I’ve “bonded” for iSCSI, so I only get 1 Gb of throughput. Shouldn’t I see all four bonded NICs pounding away at sending that traffic to the iSCSI device?
Wednesday, May 14, 2008 at 7:04 am
slowe
John,
This is a common misconception. In most cases, any “bonded” link will only be able to use the bandwidth of a single member of the link for a data flow between two single endpoints. To increase throughput across the bonded links, you need multiple endpoints, i.e., a one-to-many or many-to-one data flow. The software iSCSI initiator in ESX 3.0.x is a bit limited in this regard, although it looks like ESX 3.5 improves that.
Wednesday, May 14, 2008 at 11:50 am
John Flick
So I guess I need 10Gbe then?
I have a SAN with four GbE ports bonded and I have ESX with four ports bonded… but you’re right, only one is used at a time.
Wednesday, May 14, 2008 at 1:35 pm
slowe
John,
You _might_ be able to work around this by using multiple iSCSI targets (i.e., multiple IP addresses on the iSCSI storage array). Then, if you are using static LACP (EtherChannel) on both ends of the link, it may work better. No guarantees, though it may be worth a try.
Thursday, June 12, 2008 at 11:38 am
Alex
Just for anyone else looking: I had some major issues with VLAN tagging on an IBM BladeCenter with a Server Connectivity Module.
Symptoms: no VLAN tagging was working. The only VLAN that WAS working was the native VLAN.
Under vSwitch0 Properties on the Network Adapters tab, vmnic0’s observed IP range again showed only the native VLAN (same with the networks under the status box on the right).
THE SOLUTION was to start a web session to the switch module, click Non-Default Virtual LANs and add vlans to your port group.
So simple, but I missed it again and again.
Thanks for your blog Scott, it has been helpful time and time again
-Alex
Thursday, July 10, 2008 at 11:50 am
Calvin
Would you care to write a guide for the HP ProCurve switches? I’m trying to make an ESX server with four 1GbE ports use link aggregation (2 ports, 2 links), but I always lose connectivity when both ports are plugged in.
Thursday, July 10, 2008 at 2:12 pm
slowe
Calvin,
I’d love to, but HP seems to be dragging their feet getting me a ProCurve switch to use in my lab for testing and validation. As soon as I can get my hands on a ProCurve switch, I’ll be happy to test the config and post some information here.
Friday, July 11, 2008 at 7:36 am
PeterVG
Hi Scott,
This has been bothering me for some time and I can’t seem to find a clear answer, so here goes: We have ESX servers with four NICs for the VMs’ vSwitch. Two of the NICs connect to switch A and the other two connect to switch B. Both switches are Cisco CAT6500s. I would like to configure incoming load balancing by creating a port channel on each switch with the two related NICs, effectively having two port channels connecting to my load-balanced (IP hash) vSwitch. Is this at all possible? If not, why?
Friday, July 11, 2008 at 9:57 am
slowe
PeterVG,
That’s a great question! My initial guess is that it won’t work because the vSwitch expects to apply the IP hash algorithm across all four uplinks when it really needs to be applied separately for each pair of uplinks. However, I’ve never tested this and I’ve never seen any documentation, so I could very well be wrong (it wouldn’t be the first time!). Unfortunately, I don’t have enough GigE-capable switches to actually run the test myself. Please do let me know if you get a firm answer.
Friday, July 11, 2008 at 11:37 am
PeterVG
Hi Scott,
I believe I might have an answer to this problem: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1001938&sliceId=1&docTypeID=DT_KB_1_1&dialogID=14912551&stateId=1%200%2014916516
The first line of the article states: ESX Server supports link aggregation only on a single physical switch or stacked switches, never on disparate trunked switches.
So I guess I’m out of luck…
Friday, July 11, 2008 at 12:08 pm
slowe
PeterVG,
Good find! I suspected that would be the case, but it’s good to see it documented.
Tuesday, July 15, 2008 at 12:42 am
Dominick
Scott, thanks for the blog. Basic question: I’m using ESX 3.5 with a pair of dual-port Gigabit NICs connected to a 6509 and an iSCSI SAN. I was thinking of a single vSwitch, with the physical connections teamed and trunked on alternate NICs (for redundancy across the chipsets) in active/active mode, and port groups for VMs, iSCSI/VMkernel, and Service Consoles, each using two physical NICs (one channel). Does this make sense for maximizing performance and redundancy? Is there a better method? Thanks!
Tuesday, July 15, 2008 at 7:42 am
slowe
Dominick,
Good question! Your timing is impeccable; I’m just wrapping up some network configuration testing in my lab and will be publishing results within the next few days.
In any case, here we go…you won’t be able to use “Route based on IP hash” on your vSwitch because that requires a single physical switch (refer to the link in comment #29). So, for physical switch redundancy, we either have to a) go with two vSwitches, each with two uplinks; or b) use standard NIC bonding in ESX without any physical switch configuration. Option A has problems because you can’t really replicate the same port group configuration on both vSwitches; Option B has problems because depending upon the type of traffic, ESX doesn’t do a very good job of utilizing all the uplinks on the vSwitch. This leaves you in a difficult position.
However, unless you have multiple iSCSI targets, Option B will provide you the best balance of performance and redundancy, so that’s the route I’d take.
Good luck!
Tuesday, July 15, 2008 at 1:13 pm
Dominick
Thanks, Scott. I’m using a single physical switch (four NICs on the ESX box talking to a single 6509, jumping blades for redundancy). I took a look at the link in #29, and that, along with your comment about ESX not doing a great job of utilizing the uplinks, makes me believe I would be best served using 802.3ad aggregation at the switch (alternating port pairs) and leaving the ESX load balancing setting at ‘route based on the originating virtual port ID’. As for the iSCSI SAN, we are stuck with the single VIP.
I have the luxury of time, so I can actually test a few configurations to see which works best before placing the system into production in Sept.
Looking forward to your test results - and thanks again for the response!
Tuesday, July 15, 2008 at 1:30 pm
slowe
Dominick,
I believe that if you are going to use 802.3ad on the physical switches, you’ll be required to use “Route based on ip hash” on the vSwitches in order for connectivity to work. Keep in mind that this will only help improve the distribution of traffic across the links, not necessarily improve the throughput of any single point-to-point connection.
Good luck!
Tuesday, August 5, 2008 at 12:31 pm
CoolRos
I too am attempting to connect an ESX server to redundant switches.
My thought was to have a vSwitch with two (or more) NICs, each connected to one of a pair of redundant Cisco switches. The Cisco switches have an interconnect as well, creating a physical loop.
My concern is that I might create a bridging loop among the three switches, since the vSwitch does not support Spanning Tree Protocol.
From what I can find, VMware says that STP isn’t necessary only because vSwitches don’t bridge to each other. They don’t seem to address the chaos that can ensue in a Cisco world.
1) Are my concerns substantiated?
2) If so, is there a way to enable STP on a vSwitch?
Tuesday, August 5, 2008 at 6:34 pm
slowe
CoolRos,
1. I wouldn’t be concerned about it. vSwitches can’t be linked to each other except by a VM.
2. No. There is no STP support on a vSwitch. I presume this is because you can’t link vSwitches together, so there’s no need to worry about a bridging loop.
I’m not sure what “chaos” may result in a Cisco world; I have many customers who are doing just exactly what you describe without any issues whatsoever. You won’t be able to do link aggregation/EtherChannel unless your switch supports cross-switch EtherChannel, but otherwise it should work just fine.
Thanks for reading!