Significant Networking Problem with Hyper-V

After VIR358 concluded, I went up to the front to ask the presenters the question I had wondered about during the session: what about NIC bonding or NIC teaming?

Well, it turns out that Hyper-V does not support any form of NIC teaming or NIC bonding. Yes, you read that right: you can’t link more than one NIC to a virtual switch in Hyper-V.

If you follow my del.icio.us linkstream, you will probably have noted that I recently bookmarked a Microsoft KB article that describes how using HP’s Network Utility can cause Hyper-V to stop responding. I guess this is just further evidence of Hyper-V’s lack of support for NIC teaming or bonding.

In my opinion, that is a huge problem. How does one go about providing network link redundancy to guests hosted on Hyper-V? Surely using Failover Clustering and Quick Migration isn’t the answer here, is it? One of the presenters offered to get back to me with more information; I’ve already sent him an e-mail so he has my contact information. As soon as I hear something back, I’ll be sure to update this post.


The support for network card features is the responsibility of the network card driver provider, in this case, the HW vendor.
This has been the case for all Windows releases I can remember. Nothing new here.

Very poor title for the post btw…

Ricardo

Hi

When talking to so many people about Hyper-V, could you verify if it is true that a Quick Migration will disconnect all sessions to the VM?

This would also be a huge problem in our data center, where we have many multi-tier apps that need reboots when they become disconnected from, for example, the SQL server.

In VMware VMotion you might lose 1 or maybe 2 pings, but never lose your session….

Gabrie

Yikes!
When selling my cow-orkers on the idea of virtualization, redundancy was a big selling point. Correctly implemented virtual environments offer every VM protection or recoverability from hardware failure.
Of course M$ knows this, and is no doubt working on it; the question is when we’ll see it.

Ricardo,

I’m sorry, but I have to respectfully disagree with you on this matter. The fact of the situation is that network card vendors, such as Intel or Broadcom, and OEM vendors, such as HP, are shipping tools and products that provide NIC teaming and NIC bonding under Windows Server 2008 today. Yet, those functions are not supported when using Hyper-V. And why doesn’t Windows Server 2008, an enterprise-class OS that intends to compete head-to-head with Linux and Solaris (among others) provide its own NIC teaming or bonding functionality? IIRC, the competitors do.

Gabrie,

In all my discussions with Microsoft folks here, it is clear that Quick Migration most certainly WILL cause disconnected sessions, dropped connections, etc. The Quick Migration operation is the equivalent of suspending the VM on one host and resuming it on the other host. There WILL be downtime.

DavidG,

I did speak with some guys at the HP booth; they’re aware of the problem(s) between Hyper-V and HP’s Network Utility and are working to resolve it. I’m confident the other vendors are in similar situations. What we need, though, is a definitive product support statement from Microsoft on the use of these technologies, or we need Microsoft to provide NIC teaming/bonding functionality in the base OS.

I’d have to agree with Scott on this one. If Hyper-V is ever going to compete with VMware, NIC teaming and bonding is a MUST. Sure, Microsoft and the hardware vendors have done things a certain way until now, but this process will have to change to compete.

I would be hard pressed to recommend Hyper-V in ANY production environment based on this fact. Network redundancy is a must for any production virtualization environment. The number one question a customer will ask is, “Is this solution redundant and does it have any single points of failure?”

Ricardo - you’re right that “network card features is the responsibility of the network card driver provider, in this case, the HW vendor.” However, what we’re talking about here isn’t a “network card feature.” Hyper-V manages multiple virtual machines, and it needs to manage multiple virtual network adapters as well.

Take VMware ESX for example: it manages the virtual networks, and you can put any two vendors’ network cards in the same virtual switch.

I think the point that Ricardo made in his first comment is very valid. Device drivers are written by the device manufacturers - this has generally been the case. NIC teaming is unique to each vendor’s NICs, so it would be impossible for Microsoft to write a single feature that would support all the different variables from each of the different vendors. Most of those teaming solutions play with the MAC address, giving the same address to two or more cards. This gets interesting when trying to map to them from the virtual side, particularly when each vendor has a different implementation of how they do it. I would not be surprised, though, if Microsoft is working on coming up with a standard that all the vendors can agree upon and then write to. They did this with their generic print driver, their Storport driver, their MPIO, and others. It takes time to bring the vendors together to get to a common standard because most of those vendors want to have features that set themselves apart from their competition. So, since Microsoft does not write the code (nor does it have access to the code in order to debug problems) there is no way that it can support NIC teaming. When Microsoft says it supports a product, it means that it can write hot fixes to correct problems. If it doesn’t have access to the code, it can’t write hot fixes.

VMware, on the other hand, has access to the code for those devices due to the nature of their Linux-based operating system. They can build that. Now, with that said, I agree that with Microsoft’s synthetic devices in their virtual machines, they might be able to create a NIC teaming capability within the virtual machine, but that doesn’t necessarily map to the NIC teaming on the physical host. We’ll have to wait to see on that one.

As to the clustering or Quick Migration issue, yes, there is an outage of the virtual machine while the machine goes into a save state, the LUN is moved, and the state is restored. This impacts some applications but not all. This is the same that has been happening in Microsoft clusters since they were introduced in NT4. Client applications should be written to be able to ride through a network outage. This generally means that if it attempts to access the server and it is unavailable, it needs to try again for a period of time. Since the state of the virtual machine is saved during a Quick Migration, the client information is saved. It only needs to reconnect. Some applications do this well. Other applications did not code good retry logic into them to be able to ride through a network outage. Other applications are written with tools that don’t handle network outages very well. Again, everything is still in the virtual machine; the client just needs to have the logic to ride through a network outage.
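To make the "try again for a period of time" idea concrete, here is a minimal sketch of client-side retry logic in Python. It is only an illustration: the function name `call_with_retry` and its parameters are hypothetical, and it assumes the failed server call surfaces as a `ConnectionError` or `TimeoutError`. It is not taken from any Microsoft guidance.

```python
import time


def call_with_retry(operation, retry_window=60.0, delay=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call `operation` until it succeeds or `retry_window` seconds elapse.

    `operation` is any zero-argument callable that talks to the server.
    `clock` and `sleep` are injectable so the logic can be tested quickly.
    """
    deadline = clock() + retry_window
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            # Give up only when another wait would overrun the window;
            # re-raise so the caller sees the original failure.
            if clock() + delay > deadline:
                raise
            sleep(delay)
```

A client that wraps its database or service calls this way would sit through an outage shorter than `retry_window` and then simply reconnect; how long the window needs to be depends on how long the save, move, and restore actually take in a given environment.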

Tim,

I understand your position. I do not necessarily agree with you on all points, but I can see your point of view. In my opinion, while device manufacturers certainly do bear the responsibility to create device drivers, Microsoft bears the responsibility of making sure that its products meet the needs of the customers. If the device drivers can’t provide NIC teaming or bonding, then Microsoft needs to. And if Microsoft is indeed working on a “MPIO style” NIC teaming framework, then by gosh go public with it so that users will know that Microsoft has recognized the problem and is working on a resolution.

The end state is that customers need a supported way to provide network redundancy to virtual machines hosted on Hyper-V. Whether this means that the code for Hyper-V’s virtual switches needs to be changed to allow multiple uplinks, or Microsoft needs to write its own NIC bonding routines, or something else entirely needs to happen is irrelevant. The customers’ needs have to be addressed. Any way you slice it or dice it, any place you put the blame, the end result is that customers can’t provide network redundancy for VMs, and that’s a problem.

With regards to Quick Migration, those of us that have studied Quick Migration know and understand that this is merely an application of host clustering. Having worked with clusters in the past, I understand what is going on here and why. You’re absolutely correct in that some applications will “ride out” the brief outage. Many will not. Microsoft can’t say to a customer, “Your applications weren’t written well enough to ride out a brief network outage, so that will cause problems when you use Quick Migration”–that won’t fly. Instead, Microsoft has to clearly explain what Quick Migration *is* and what it *isn’t*. That part, in my opinion, has been lost in the marketing war against VMware, Citrix, and Virtual Iron’s live migration functionality. Once customers clearly understand the limitations of Quick Migration, they can apply the technology where it fits to meet their business needs.

Thanks for taking the time to comment, Tim. I appreciate your thoughts and your point of view!

Tim’s post is long, but wrong on several counts. NIC teaming can be done above the driver; this is what ESX has been offering for years now. I really don’t understand why people are so excited about a Microsoft product which will still lack features that were available and stable years ago in VMware’s products. Every day that an IT department waits for Microsoft to get its act together is a day where they could have been saving money by virtualizing using a proven solution.

Also, as has been stated millions of times, ESX is NOT Linux based. It used to come with a Red Hat Linux Service Console, but this was never really more than a matter of convenience; the ESX kernel is entirely owned by VMware. You can also now buy ESXi, which is the core ESX hypervisor and management software without any bundled Linux management OS.

Forgot something. Doing NIC teaming ABOVE the driver is the right thing to do, especially since driver software failures, when and if they happen, typically would affect all devices on the machine which are of the same type.

Thus if you have NIC teaming between e1000 and tg3, and your e1000 driver goes wonky for some reason and none of your e1000 devices work, your traffic still goes out through the tg3 and vice versa. If you were teaming 2 identical devices, you are not protected from driver failures such as this.
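The difference is easy to see in a toy model. The Python sketch below is only an illustration of the argument, not how ESX or any vendor implements teaming; the `Uplink` and `Team` classes and the uplink names are hypothetical. It models an active/standby team that sits above the drivers and treats a driver failure as taking out every NIC that uses that driver.

```python
class Uplink:
    """A physical NIC, identified by its name and the driver it depends on."""

    def __init__(self, name, driver):
        self.name = name
        self.driver = driver


class Team:
    """Active/standby team that sits above the drivers (toy model)."""

    def __init__(self, uplinks):
        self.uplinks = list(uplinks)
        self.failed_drivers = set()

    def driver_failed(self, driver):
        # A driver failure affects every NIC using that driver.
        self.failed_drivers.add(driver)

    def active_uplink(self):
        # Pick the first uplink whose driver is still healthy.
        for uplink in self.uplinks:
            if uplink.driver not in self.failed_drivers:
                return uplink
        return None  # no usable uplink left


# Mixed-driver team: losing e1000 still leaves the tg3 uplink.
mixed = Team([Uplink("nic0", "e1000"), Uplink("nic1", "tg3")])
mixed.driver_failed("e1000")

# Same-driver team: losing e1000 takes out both uplinks.
same = Team([Uplink("nic0", "e1000"), Uplink("nic1", "e1000")])
same.driver_failed("e1000")
```

After the simulated failure, `mixed.active_uplink()` returns the tg3 uplink, while `same.active_uplink()` returns `None`.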

Randy,

Thanks for your comments. Personally, I agree with you–hence my “call to action” to Microsoft. They need to add this functionality to Windows so that Windows will be on par with ESX, Solaris, Linux, etc.

Most readers here are probably familiar with the fact that ESX isn’t based on Linux, but I appreciate you taking the time to clarify that nevertheless.

Thanks for reading!

I’ve been using NIC teaming on a ProLiant DL360 with Hyper-V RC0/RC1 for a while now. The ‘workaround’ for the HP utility worked fine and I’ve had no further problems with the team. Sure, it would be nice to have that built in to Hyper-V without the need for vendor software, but this works too.

HP NIC teaming works fine. It is not supported at this time by the Hyper-V team, but it does work. There was a bug with the HP teaming software and Hyper-V during the upgrade from the beta to RC0, but that is corrected in the RC1 installs. To me this is very similar to the situation with a 3rd party vendor like DoubleTake, which you spoke of earlier. Microsoft does not support what DoubleTake does to enhance Hyper-V, but it works well.
I do believe Microsoft has some opportunities, however. To me it seems that they could allow for multiple physical network bindings to a single virtual switch.