Why SR-IOV on vSphere?

Yesterday I posted an article regarding SR-IOV support in the next release of Hyper-V, and I commented in that article that I hoped VMware added SR-IOV support to vSphere. A couple of readers commented about why I felt SR-IOV support was important, what the use cases might be, and what the potential impacts could be to the vSphere networking environment. Those are all excellent questions, and I wanted to take the time to discuss them in a bit more detail than simply a response to a blog comment.

First, it’s important to point out—and this was stated in John Howard’s original series of posts to which I linked; in particular, this post—that SR-IOV is a PCI standard; therefore, it could potentially be used with any PCI device that supports SR-IOV. While we often discuss this in the networking context, it’s equally applicable in other contexts, including the HBA/CNA space. Maybe it’s just because in my job at EMC I see some interesting things that might never see the light of day (sorry, can’t say any more!), but I could definitely see the use for the ability to have multiple virtual HBAs/CNAs in an ESXi host. Think about the ability to pass an HBA/CNA VF (virtual function) up to a guest operating system on a host, and what sorts of potential advantages that might give you:

  • The ability to zone on a per-VM basis
  • Per-VM (more accurately, per-initiator) visibility into storage traffic and storage trends
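
To make the PF/VF split a bit more concrete, here’s a minimal, purely illustrative sketch of how that relationship shows up on a plain Linux host with SR-IOV-capable hardware (ESXi doesn’t expose sysfs, so treat this as an analogy rather than anything vSphere-specific). It walks the standard sysfs entries and lists each physical function along with the virtual functions it has instantiated:

```python
# A minimal, illustrative sketch (not vSphere code): walk Linux sysfs and
# list each SR-IOV physical function (PF) with the virtual functions (VFs)
# it has instantiated. Paths are the standard Linux SR-IOV sysfs entries.
import glob
import os

def list_sriov_functions():
    for total_path in glob.glob("/sys/bus/pci/devices/*/sriov_totalvfs"):
        dev = os.path.dirname(total_path)   # e.g. /sys/bus/pci/devices/0000:03:00.0
        with open(total_path) as f:
            total = int(f.read())
        with open(os.path.join(dev, "sriov_numvfs")) as f:
            enabled = int(f.read())
        print(f"PF {os.path.basename(dev)}: {enabled}/{total} VFs enabled")
        for vf_link in sorted(glob.glob(os.path.join(dev, "virtfn*"))):
            # Each virtfnN symlink resolves to the VF's own PCI address --
            # the kind of device you would hand to a VM as a passthrough NIC/HBA.
            print("  VF ->", os.path.basename(os.readlink(vf_link)))

if __name__ == "__main__":
    list_sriov_functions()
```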

Of course, this sort of model is not without drawbacks: in its current incarnation, assigning PCI devices to VMs breaks vMotion. But is that limitation a byproduct of the current way it’s being done, and would SR-IOV help alleviate that concern? It sounds like Microsoft has found a way to leverage SR-IOV for NIC assignment without sacrificing live migration support (see John’s latest SR-IOV post). I suspect that bringing SR-IOV awareness into the hypervisor—and potentially into the guest OS via each vendor’s paravirtualized device drivers, aka VMware Tools in a vSphere context—might go a long way toward addressing the live migration concerns with direct device assignment. Of course, I’m not a developer or a programmer, so feel free to (courteously!) correct me in the comments.

Are there use cases beyond providing virtual HBAs/CNAs? Here are a couple of questions to get you thinking:

  • Could you potentially leverage a single PCI fax board among multiple VMs (clearly you’d have to manage fax board capacity) to virtualize your fax servers?
  • Would the presentation of virtual GPUs to a guest OS eliminate the need for a paravirtualized video driver, and would the lack of a paravirtualized video driver streamline the virtualization layer even more? The same goes for virtual NICs.

I’m not saying that all these things are possible—again, I’m not a developer so I could be way off base—but it seems to me that SR-IOV at least enables us to consider these sorts of options.

Networking is where I see a lot of potential for SR-IOV. While VMware’s networking code is highly optimized, moving Ethernet switching into hardware on a NIC that supports SR-IOV has got to free up some CPU cycles and reduce virtualization overhead. It also seems to me that putting that Ethernet switching on an SR-IOV NIC and then adding 802.1Qbg (EVB/VEPA) support would be a sweet combination. Mix in a hypervisor-to-NIC control plane for dynamically provisioning SR-IOV VFs and you’ve got a solution where provisioning a VM on a host dynamically creates an SR-IOV VF, attaches it to the VM, and uses EVB to provision a new VLAN on-demand onto that NIC. Is that a “pie in the sky” dream scenario? I’m not so sure that it’s that far off.
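
To sketch what that hypervisor-to-NIC control plane might be driving under the covers, here’s a rough, Linux-side illustration using the standard sysfs and iproute2 interfaces. The PF name (“eth0”), VF index, MAC address, and VLAN ID are all made-up values, and this is a stand-in for the idea rather than anything ESXi exposes today:

```python
# A rough sketch of "dynamically provision a VF and put it on a VLAN" using
# the standard Linux sysfs and iproute2 interfaces. The PF name, VF index,
# MAC address, and VLAN ID below are hypothetical values for illustration;
# this is a stand-in for the hypervisor-to-NIC control plane idea, not an
# ESXi or vendor API. Requires root and an SR-IOV-capable NIC.
import subprocess

PF = "eth0"  # assumed name of the SR-IOV-capable physical function

def create_vfs(pf: str, count: int) -> None:
    # Ask the PF driver to instantiate `count` virtual functions.
    with open(f"/sys/class/net/{pf}/device/sriov_numvfs", "w") as f:
        f.write(str(count))

def provision_vf(pf: str, vf_index: int, mac: str, vlan: int) -> None:
    # Program the NIC's embedded switch: pin a MAC to the VF and have the
    # hardware tag/untag that VF's traffic with the given VLAN.
    subprocess.run(
        ["ip", "link", "set", pf, "vf", str(vf_index),
         "mac", mac, "vlan", str(vlan)],
        check=True,
    )

if __name__ == "__main__":
    create_vfs(PF, 4)
    provision_vf(PF, 0, "52:54:00:12:34:56", 100)
    # The VF's PCI address could then be attached to the newly provisioned VM,
    # while something like EVB/VDP signals the new VLAN to the upstream switch.
```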

What do you think? Please share your thoughts in the comments below. Where applicable, please provide disclosure. For example, I work for EMC, but I speak for myself.


  1. Josh Coen (@joshcoen)

    Hi Scott,

    Good conversation starter. It really makes you think when you start talking about using some emerging standards in tandem to create an almost euphoric reality. Good stuff!

    I could be off base here, and I don’t know if it’s possible to do natively, but I’d be curious whether, when using a PCIe SSD card, SR-IOV could be leveraged to do some sort of RDM from a VM to the SSD for a possible increase in performance.

  2. Kyle Betts

    Scott, I don’t think what you are describing is pie in the sky. Cisco Unified Computing is already doing a lot of what you are describing today. While UCS does not utilize the SR-IOV standard, PCI device emulation is a staple in UCS via the Virtual Interface CNA (aka Palo adapter). In fact, your theory of a hypervisor-to-NIC control plane for dynamic VIFs is exactly how VM-FEX in UCS works.

    http://infrastructureadventures.com/2011/09/29/deploying-cisco-ucs-vm-fex-for-vsphere-part-1concept/

    “VM-FEX works by leveraging UCS vNICs (not to be confused with VMware vNICs), where each VM is assigned to a UCS vNIC or from a more technical aspect the traffic from each VM is tagged with a specific VNTag.”

    All of this is possible because of the Cisco VIC, which now supports up to 116 VIFs per card (some of Cisco’s marketing will say 256, but this is a bit of a misnomer). These VIFs can be either vNICs or vHBAs. VM-FEX (virtual networking offloading) is the only iteration that fully supports vMotioning of VMs; currently, vHBAs passed directly to VMs do restrict their ability to move, but we’re hoping Cisco enables this soon to match the vNIC capabilities.

    To your point about passing HBAs to VMs . . . already did that with one of my UCS customers! They were deploying some EMC app or tool that needed direct access to the VMAX via the SAN fabric. They needed 12 of these servers, and rather than stand up 12 physical servers configured with 24 HBAs, we opted to use VMs, create vHBAs on the UCS blades, and use VMDirectPath to pass those to the VMs. While it was understood that those VMs were now pinned to those hosts, it was a much more cost-efficient solution than running 12 non-virtualized rack mounts.

    Again, UCS does not use the SR-IOV specification; it’s a proprietary Cisco technology. I’ve seen a lot of people compare UCS and the VIC card to Xsigo’s solution, and I would admit they are similar in some ways, but there are some VAST differences that I won’t get into here. :)

    I must disclose that I do work for a Cisco partner; however, I came to work for this partner AFTER becoming a UCS convert and disciple. Other than UCS, I have no allegiances to Cisco. I only profess the greatness of UCS because I truly think it’s the best x86 blade system in the marketplace. But that’s just my humble opinion ;)

  3. Dave Convery

    I’m just wondering how SR-IOV will be handled in an HA event or a DRS move. Obviously, it is tied to the host, so you gain flexibility at the hardware layer but lose flexibility at the virtual layer.

  4. David Pasek

    Scott, I fully agree with you. SR-IOV can be very useful and handy for a lot of things, but mainly for networking and storage virtualization in the modern virtualized datacenter. I’m sure you are somewhat familiar with Cisco VM-FEX (aka VN-Link in hardware) in Cisco UCS + vSphere, which does a similar thing without SR-IOV. You can assign virtual interfaces (vNICs) instantiated on a special NIC (aka VIC) directly to a VM, and even vMotion works. And you can see all your VMware VMs’ vNICs in a single upstream “switch/port extender” (the redundant fabric interconnect). Yes, that’s a proprietary hardware solution based on Cisco UCS/FEX/VIC (virtualized interface card) and a special VEM plugin for ESX. And, as always, a proprietary solution is like a PoC for visionaries; when the industry sees significant benefits, real SR-IOV implementations will come … during the next few years :-)

    We can both see a lot of benefits, and I hope more and more people will see the benefits of such concepts in the future. I also hope VMware will support SR-IOV soon to enable this concept on commodity server hardware.

    David.

  5. Donny Parrott

    Interesting points, but how does this compare with other SR-IOV capabilities currently on the market? Xsigo, Egenera, NextIO, and Virtensys have already accomplished many of these capabilities. Although these solutions move the endpoint for SR-IOV to a resource aggregator, numerous “virtual” devices can be created/removed on the fly, sharing a single physical resource.

    I do agree, however, that the integration with vCenter would be the great accomplishment. But then again, I believe that this should be a target for all datacenter resources (deploy resource, configure vCenter integration, provision). A single pane for all resource management.

    Back on topic, SR-IOV does yield significant performance improvements in virtualization, and has since ESX 4. I believe this is the distributed mainframe model: a compute node, a storage node, a fabric node, an I/O node, each shared amongst the whole and configured for the segmentation and performance required.

  6. Colin Lynch

    Hi Scott

    Interesting topic.

    As I’m sure you are aware, the Cisco Palo adapter, while not using SR-IOV as we now know it, is “SR-IOV” compatible (make of that what you will) and is almost there with this; i.e., it does allow vNICs to be passed through to a guest (VM-FEX). However, it does not yet allow vHBA passthrough, which would be a great feature. Last week I needed to pass through a vHBA to a guest for tape library access and needed to go bare metal in the end :-(

    One of the reasons this made it on to my top 5 UCS feature request list.
    http://ucsguru.com/2012/03/15/ucs-the-perfect-solution/

    Having SR-IOV native within the hypervisor may be the way to go, but manufacturers are now implementing it at the BIOS level, which sounds like a better place for it; the host just sees them as “virtual physical” PCI devices, and you then just use plain old VMDirectPath I/O. That would work well for static devices, but full dynamic allocation from Cisco will no doubt be here soon (VM-FEX for both vNICs and vHBAs), and I’m sure other vendors will have their offerings too. So having both SR-IOV-capable hardware and hypervisors may be a duplication. (But I guess two options are better than none.)

    Regards
    Colin

  7. Mike Sheehy

    Scott,

    My concerns are more along the lines of management, and I may be off base here, so please correct me if I’m wrong. Since SR-IOV is essentially allowing us to split VFs from supported hardware out to VMs, how would all this be managed? How do we set limitations, etc., per hardware device?

    Essentially, if this works similarly to, or in tandem with, DirectPath I/O, then vSphere doesn’t really manage the resource (at least not that we can see), and it would need to be managed at the VM level.

    Wouldn’t this become a nightmare in a resource pool environment?

  8. Arne

    Fax modem?

    Really?

    How would the physical “PSTN” interface work?

  9. Vijay Swami

    Scott, interesting thoughts.

    One comment I wanted to make was regarding your last paragraph.

    “Mix in a hypervisor-to-NIC control plane for dynamically provisioning SR-IOV VFs and you’ve got a solution where provisioning a VM on a host dynamically creates an SR-IOV VF, attaches it to the VM, and uses EVB to provision a new VLAN on-demand onto that NIC. Is that a “pie in the sky” dream scenario? I’m not so sure that it’s that far off.”

    It’s definitely not a “pie in the sky” scenario, and these things are already possible in VMware w/o SR-IOV via the N1KV or Cisco’s VIC card in UCS systems… what Cisco will call Adapter-FEX/VM-FEX.

    I suppose what SR-IOV will allow is for more server platforms to take advantage of these types of features/designs.

  10. slowe

    Josh, thanks for your comment. That’s a potentially interesting point…I have to think on that for a bit.

    Kyle, David, Colin, I am very aware of the Cisco UCS platform, including the VIC (aka “Palo”) and its capabilities, and have written about it quite a bit. The VIC provides the same functionality, but it does so without the use of SR-IOV—true SR-IOV requires OS/hypervisor support, and that support doesn’t exist today in vSphere, yet you can use the VIC with vSphere. SR-IOV will allow us to get that same functionality across other platforms.

    Dave Convery, currently any sort of hypervisor-bypass scenario does create a loss of mobility. However, take a look at the SR-IOV articles for Hyper-V; they’ve figured out a way to avoid that loss of mobility. So the idea of being able to gain flexibility without a loss of mobility is possible; it’s just a matter of whether VMware will truly embrace the idea of SR-IOV or treat it like NPIV.

    Donny, none of those products currently uses SR-IOV. SR-IOV requires support at the hypervisor/OS level, and—aside from a few Xen builds and the upcoming version of Hyper-V—there is no support at the OS/hypervisor level. So, although they accomplish the same task, they do it in very different ways with very different technologies.

    Mike, an interface into vCenter Server—or whatever VMware’s future management solution is—would be critical here, naturally. However, the idea of more closely associating resources with VMs is not necessarily a bad thing; it carries a negative connotation with IT pros because we associate it with the way things were before virtualization. I believe that we *can* embrace a model like SR-IOV—and related initiatives like VMware’s Granular VM Volumes (see VMworld presentation VSP3205)—without sacrificing manageability.

    Arne, yes, really. Do you have a better solution for being able to virtualize workloads that still have a strong dependency on physical hardware?

    Vijay, UCS/VIC can indeed do something like this today. Embracing standards like SR-IOV and EVB would allow us to have this functionality on potentially any hardware platform (assuming the hypervisor/OS support is there, naturally).

    Thanks for all your comments!

  11. Manqing

    In one of our storage platforms, we rely on VMDirectPath, which gives us direct control over the storage HBA and disk drives. This allows us to manage and provide storage services using a storage VM. Without DirectPath, we would have to rewrite our software and lose much of the physical identity due to the virtualization layer. This feature allows us to offer server and storage services in a single physical chassis and enables us to offer a more cost-effective solution. For vMotion-related features, we are transparent, as our storage VM is cluster-aware. Applications that use the storage service can take full advantage of vMotion, etc.

  12. Donny Parrott

    Scott,

    You are correct, they are different types of IOV (MR, etc).

    However, there is one great difference I have to ask the UCS team. How do you dynamically remove an interface? In my experience so far, adding or removing interfaces from a service profile forces a reboot of the associated server.

  13. Sherry Wei

    I think there is some misunderstanding here. While VM-FEX saves CPU cycles by sending packets to an external hardware switch instead of using the vSwitch or Nexus 1000V switch, it may not provide I/O performance comparable to a bare-metal driver, because if a VM uses vmxnet3 or e1000, the hypervisor is still involved in moving the packets in and out of the server. That’s where SR-IOV comes into the picture: it allows NIC cards to DMA packets directly into guest-OS-readable memory, completely bypassing the hypervisor altogether. The advance of SR-IOV over DirectPath is that multiple VMs can share one physical NIC and still be able to do direct DMA. But this capability is not just a hypervisor upgrade; the VMs must be upgraded to include SR-IOV-capable device drivers. That to me seems to be a massive effort. I’d like to hear your thoughts on this part.

  14. Juan Tarrio

    Scott, you mention two interesting use cases:

    • The ability to zone on a per-VM basis
    • Per-VM (more accurately, per-initiator) visibility into storage traffic and storage trends

    These two have been possible for many years through the use of NPIV. However, for reasons nobody has been able to explain to me, VMware has been unwilling to support NPIV with VMFS (the most commonly used deployment method for VMs); they only support it with RDMs. I realize the simplified management involved in assigning one giant LUN to a single initiator in the FC fabric, but there’s no reason that can’t also be done with NPIV: just add all the NPIV initiator WWNs to the same zone and voilà.

    In any case, what I’m trying to say here is that the key to all of this is whether hypervisor vendors will embrace this or not; in this case, VMware. Surely the addition of SR-IOV to Hyper-V in Windows Server 2012 will put some pressure on VMware to add support for SR-IOV in ESXi soon.

    Another thing that is important to clarify is that I/O virtualization is possible without SR-IOV. HP supports FlexNICs in their Flex10 adapters. Brocade supports vNICs in our Brocade 1860 Fabric Adapter. Cisco’s VIC supports multiple vNICs as well. The virtualization capabilities are more limited, obviously, and except for Cisco UCS (which must have worked very closely with VMware to enable it), nobody else can do direct mapping (VMDirectPath) of vNICs to VMs and preserve vMotion.

    VMware seems to be too reluctant to let people bypass their hypervisor vSwitch in an efficient manner. There seem to be some people who think that server CPUs are powerful enough to handle all that virtual switching and I/O, with them getting more powerful every year and all, but I think there would still be a lot of benefits to allowing hypervisor bypass. And the only way for that to scale would be with SR-IOV. You also talk about VEB (an embedded switch in the adapter) and VEPA (802.1Qbg). Part of 802.1Qbg is the Virtual Station Interface Discovery Protocol (VDP), which would be the control plane between the hypervisor, the NIC, and the external network. But unfortunately VEPA seems to be going nowhere in the industry.

    And if you start talking about the new trends in overlay networks for virtualized environments (VXLAN/NVGRE/STT), these all happen in the hypervisor vSwitch, adding more to what the server CPU needs to handle. So bypassing the hypervisor would render them useless, unless you could implement them in the adapters or the top-of-rack switches, which seems very futuristic.

    Disclaimer: I work for Brocade, I speak for myself.

  15. slowe

    Juan, you’re preaching to the choir, my friend! I wholeheartedly agree that the functionality I ascribed to SR-IOV is completely possible with NPIV. However, VMware’s “implementation” of NPIV is, quite honestly, a joke compared with other platforms. Microsoft’s addition of SR-IOV *as well as* NPIV to Hyper-V 3 means that—in my humble opinion—VMware had better get going.

    With regard to hypervisor bypass, VMware isn’t interested in this solution because they perceive that it takes away some of their value. That becomes even more true when you add in hypervisor-implemented features like VXLAN/NVGRE/STT (or whatever the IETF comes up with out of the NVO3 working group).

    Personally, I saw a lot of value in EVB/VEPA and VDP, but it seems that those technologies/standards are going the way of the dodo bird in favor of overlay networks. I’ll leave it to a networking expert to comment on the value (or lack thereof) of that trend.

    Thanks for your comment Juan—great discussion!

  16. Juan Tarrio

    The way I see it, overlay networks and EVB/VEPA serve two different purposes and could be combined if you bring the overlay network encapsulation to the adapters or the top-of-rack switches and leverage SR-IOV to provide VM granularity in the networking policies.

    BTW, if VMware embraces SR-IOV they will have no choice but to embrace NPIV, since the only way multiple SR-IOV devices can log in to a single physical SAN port is by using NPIV… Of course, they could support SR-IOV only for networking and not for storage, as Hyper-V is doing, at least in its first iteration…

  17. Rick M

    Good afternoon all

    It’s good to see some in-depth discussions regarding SR-IOV/MR-IOV. I’d like to take a moment to address a few details to ensure a factual conversation going forward, and possibly offer additional responses if requested. This is off the cuff, so please read loosely; the devil is in the details, and unfortunately that many details would necessitate a whole lot more than just a comment.

    I work for NextIO (www.nextio.com), the people that invented SR-IOV/MR-IOV and helped create what NextIO donated to the PCI-SIG. Below are a few quick notes I feel pertain to the conversation thus far.

    In order to take advantage of SR-IOV, whatever device you are using must be SR-IOV capable; there is no free ride to date. The list of SR-IOV-capable devices is actually quite short, but there are a few companies that continue to release products and help push the SR-IOV envelope. MR-IOV devices/drivers are MUCH fewer, for multiple reasons that are completely off topic at this moment. A few examples of SR-IOV-capable devices come in the form of 10Gb NICs, HBAs, and SAS controllers. Currently no NAND flash PCIe cards are SR-IOV capable.

    Some vendors choose to virtualize the underlying physical transport, NextIO takes an entirely different approach. Rather than virtualizing the physical transport (for example, by running Ethernet or Fibre Channel over InfiniBand), NextIO virtualizes PCI Express (PCIe) at the hardware level.

    All three hypervisors have a focus on, and see benefits to, utilizing an SR-IOV NIC, as an example, which brings performance (in the form of reduced latency and CPU cycles) nearly back to that of a physical card by almost completely bypassing the hypervisor and acting as a pass-through NIC to those VMs assigned to it. With the latest version of VMware, DRS and vMotion are supported, as are HA and FT. Those were actually bridges that were only recently crossed, and they work quite well… now.

    Our vNet appliance actually takes SR-IOV-capable devices and emulates them as MR-IOV devices, so that physical hosts can receive shared I/O resources and then let the hypervisor split them once again. To date I think we are the only ones who can do this, but times change, and all this means is that we are currently in the lead. :) Yeah, a new game is about to be played out in this field.

    Hope this helps and spurs on further discussions.
    Cheers!!

  18. David

    Question: In the Cisco implementation, do they suffer the same lack of mobility that VMware is struggling with in SR-IOV? In other words, if I carve up the VIC and pin a virtual NIC to the guest OS to improve I/O… am I really causing that guest to no longer be vMotioned, HA’d, DRS’d, or SRM’d dynamically?

    I can see where Microsoft may have sidestepped the mobility issue with Live Migration… at one point it tended to operate more like a boot/reboot-type migration… but I’m not 100% sure if that is still the case. Is it?

  19. Jeff Woolsey

    @Juan Tarrio & Scott
    “The ability to zone on a per-VM basis.” NPIV and SR-IOV are orthogonal; you don’t need SR-IOV to do a proper NPIV implementation with virtual Fibre Channel. Windows Server 2012 Hyper-V allows you to do EXACTLY THIS with NPIV. How do you zone and mask your physical servers? That’s also how you zone and mask virtual Fibre Channel with Windows Server 2012 Hyper-V VMs, and it all works with Live Migration.

    @David and others…
    Microsoft did NOT sidestep the mobility issue; quite the opposite. As a core engineering tenet, we made the conscious decision that all Hyper-V features work with Live Migration, including SR-IOV. Windows Server 2012 Hyper-V is the only virtualization solution to date that supports SR-IOV and Live Migration together. Furthermore, there is a reason that other vendors are struggling to make SR-IOV and Live Migration work: it’s no small feat and far from trivial. While SR-IOV hardware has been in the market for years, we found numerous issues in chipsets and firmware and worked with the industry to get them resolved.

    @Vijay Swami
    This isn’t pie in the sky. You can do this with Windows Server 2012 Hyper-V today and it all works with Live Migration.

    Jeff Woolsey
    Windows Server & Cloud

  20. Juan Tarrio

    @Jeff Oh, I agree. NPIV does NOT require SR-IOV. This has existed for years, and you could zone and mask to your VMs just the same as you would to your physical servers. But SR-IOV DOES require NPIV: it’s the only way multiple devices can log in to a single physical FC switch port.

  21. David

    @Jeff,

    I appreciate your points, but does the Live Migration process still function as more of a boot/reboot-type process? It seems like, architecturally, that would be an advantage in engineering support for bypass-type technologies.

    I am a fan of the effort MS is putting into the hypervisor.