VMware HA, VMware FT, and OS Clustering

With the release of VMware vSphere 4 earlier this year, VMware officially introduced VMware Fault Tolerance (VMware FT), a new mechanism for providing extremely high levels of availability to virtual machine workloads. In talking with customers, I’ve noticed that a growing number are unaware of the differences between the types of high availability that VMware provides (in the form of VMware HA and VMware FT) and operating system-level clustering (such as Microsoft Windows Failover Clustering). Although both types of technology are intended to increase availability and reduce downtime, they are very different and offer different types of functionality.

Consider these points:

  • While using VMware HA will protect you against the failure of an ESX/ESXi host, VMware HA won’t—by default—protect you against the failure of the guest operating system. An OS-level cluster, on the other hand, does protect against the failure of the guest operating system. +1 for OS-level clustering.
  • VMware clusters that are using VMware HA can choose to use VM Failure Monitoring and gain some level of protection against the failure of the guest operating system, but you still won’t get protection of the specific application within the guest operating system, unlike an OS-level cluster. +1 for OS-level clustering.
  • These same arguments also apply to VMware FT. VMware FT won’t protect you against guest operating system failure—a crash of the OS in the primary VM generally means a crash of the OS in the secondary VM at the same time—and it won’t protect you against application failure. +1 for OS-level clustering.
  • You can’t fail over between systems using VMware HA or VMware FT in order to perform OS upgrades or apply OS patches. +1 for OS-level clustering.
  • Similarly, you can’t fail over between systems using VMware HA or VMware FT in order to do a rolling upgrade of the application itself. +1 for OS-level clustering.
  • Of course, the VMware technologies do have some advantages. Both VMware HA and VMware FT are far, far simpler to enable and configure than an OS-level cluster. +1 for VMware.
  • Neither VMware HA nor VMware FT requires any application support in order to protect the VM and its workloads. +1 for VMware.
  • Neither VMware HA nor VMware FT requires that you license specific editions of the guest operating system or application in order to gain its benefits. +1 for VMware.
  • VMware HA can produce higher levels of utilization within a host cluster than using OS-level clustering. +1 for VMware.
  • VMware FT can provide higher levels of availability than what is available in most OS-level clustering solutions today. +1 for VMware.
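
The failure-coverage points in the list above can be condensed into a small lookup table. The following Python sketch is only an illustration: it assumes three failure types (host, guest OS, application) and encodes just what the list states, plus the fact that VMware FT, like VMware HA, covers host failure. The names `COVERAGE`, `covers`, and `technologies_for` are made up for this example and are not part of any VMware or Microsoft API. Host failure under OS-level clustering is deliberately left out because the list does not address it.

```python
# Failure coverage as described in the list above (illustrative only).
# True = protected, False = not protected,
# "optional" = only when VM Failure Monitoring is enabled.
COVERAGE = {
    "VMware HA": {"host": True, "guest_os": "optional", "application": False},
    "VMware FT": {"host": True, "guest_os": False, "application": False},
    "OS-level clustering": {"guest_os": True, "application": True},
}


def covers(technology, failure):
    """Return True, False, or "optional"; None if the list does not say."""
    return COVERAGE[technology].get(failure)


def technologies_for(failure):
    """List the technologies that protect against a failure type by default."""
    return [name for name, cov in COVERAGE.items() if cov.get(failure) is True]


if __name__ == "__main__":
    for failure in ("host", "guest_os", "application"):
        print(failure, "->", technologies_for(failure))
```

Laid out this way, no single row covers every failure type on its own, which is the reason for the hint at the end of this post.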

This is not a knock against any of the technologies listed (VMware HA, VMware FT, or OS-level clustering) but rather an exploration of their advantages, disadvantages, similarities, and differences. Hopefully, this will help readers who might not be as familiar with these products make a more informed decision about which technologies to deploy in their data center. (Hint: You’ll probably need all of them.)

23 comments

  1. Gabrie van Zanten

    Hi Scott

    Very good post. Especially the points where you talk about failover to do OS and application updates are strong points.

    Do keep in mind that FT will require 2 Windows licenses !!!

    Gabrie

  2. cgb

    Hi Scott,

    In your first point:
    “While using VMware HA will protect you against the failure of an ESX/ESXi host, VMware HA won’t”

    Did you mean FT in your first reference to HA?

    cgb

  3. cgb

    Gabrie,

    Is that really the case? I found http://communities.vmware.com/thread/206059 which doesn’t have any authoritative answer with references, but I would be inclined to think it is only one license required since the second instance is not readily usable without a failure event so it’s not unlike many other DR scenarios where you wouldn’t need multiple licenses.

    cgb

  4. slowe

    CGB,

    Read the post again, more carefully this time. You’ll see that it says “VMware HA does not—by default—protect you against the failure of the guest operating system.”

  5. cgb

    Ah, OK, I can see how it can be parsed that way. I’d just suggest that HA doesn’t protect from an ESX failure at all. It _recovers_ from a failure by booting a VM elsewhere, which is why I’d expected you meant FT, because that really does protect you from an ESX failure by maintaining availability/uptime.

    cgb

  6. Adrian Bool

    Scott,

    By counting points on either side, you make it sound like it’s an either/or choice between VMware HA/FT and OS-level clustering. I presume that this is not really the case and there is no reason why one can’t use both levels of protection?

    The VMware HA/FT would be there, in an N+1 configuration, to protect against physical tin failure, with OS-level clustering sitting above to provide all the benefits you describe in your post?

  7. slowe

    Adrian,

    You can use VMware HA to protect MSCS clusters; however, you cannot use VMware FT to protect MSCS clusters.

  8. Phil

    Disclosure: I’m the product marketing manager at Stratus.

    Or you could run VMware on a Stratus ftServer and eliminate all the additional hardware and software costs and get better than 5-nines uptime right out of the box on an Intel platform. Better yet, run VMware HA on the Stratus and you’ve got any unplanned hardware and OS downtime covered as well as individual VM/application. And VMotioning from a Stratus to any other x86/x64 is no different than anything else: Intel to Intel. Simpler, better SLA and better TCO. A winner all around.

  9. Dan Libonati

    Scott,

    Great article, and very timely, as I have a customer who asked what the pluses and minuses are for clusters and VMware’s implementation of HA and FT.
    But I have some general questions.
    Given the way FT works, could a user turn off FT on a VM, snapshot or clone the VM, do an update to the apps or OS, and then turn FT back on? This would provide some failback/availability value and still have the original VM running. While I have not done this as a lab exercise, shouldn’t this work in theory?

  10. slowe

    Dan,

    Good question! When you turn off FT, the secondary VM is destroyed. Even if the secondary VM weren’t destroyed, vCenter Server doesn’t expose the secondary VM in a way that users can manipulate it, so you wouldn’t have any way to power it on in the event that the updated primary didn’t work properly. Further, IIRC, you can’t have FT enabled on VMs that have VMware snapshots present.

    All that being said, you could use array-based snapshotting or cloning to get the same sort of effect. There are numerous potential issues there as well, many of which I’ve addressed in previous articles on how to use NetApp cloning technology. Have a look at http://blog.scottlowe.org/tag/NetApp for more NetApp-related articles, many of which discuss the interaction between array-based snapshots and VMware virtual machines.

    Thanks!

  11. James Hess

    Well, guest OSes don’t cluster themselves by default. If HA can easily be configured to protect against guest OS failure, that should be good.

    I think VMware should add an API/option through VMware Tools where a guest can declare itself to have failed at a certain point in time; VMware would then kill the VMX process and ensure that VMware HA/FT reacts accordingly.

    There really should be an admin control too: right-click a VM and choose “Fail this VM” to ungracefully kill it (as if the host died) and allow HA to bring it back up somewhere else.

    But for guest OS patching/maintenance, that’s what vMotion is for.
    Failing intentionally as a routine procedure seems like a dangerous strategy, though granted, in a guest OS-level cluster it’s your only option.

  12. Russell

    You can already enable guest monitoring, which sends a heartbeat probe to VMware Tools periodically and listens for a response. If it doesn’t receive one within a given threshold, it will power cycle the VM.
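
As a rough illustration of the heartbeat check described in this comment, here is a minimal Python sketch. It is a simulation only, not VMware's implementation or API: the class names, the injected clock, and the 30-second threshold used in the demo are all assumptions chosen for the example.

```python
import time


class FakeClock:
    """Manually advanced clock so the example runs without real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class HeartbeatMonitor:
    """Toy watchdog: flags a VM for reset when heartbeats stop arriving."""

    def __init__(self, threshold_seconds, clock=time.monotonic):
        self.threshold = threshold_seconds  # hypothetical value, set by caller
        self.clock = clock
        self.last_heartbeat = clock()
        self.resets = 0

    def heartbeat(self):
        """Record that the guest responded to a probe."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Return True (and count a reset) if the threshold was exceeded."""
        if self.clock() - self.last_heartbeat > self.threshold:
            self.resets += 1
            self.last_heartbeat = self.clock()  # restart the timer after reset
            return True
        return False


if __name__ == "__main__":
    clock = FakeClock()
    monitor = HeartbeatMonitor(threshold_seconds=30, clock=clock)
    clock.advance(10)
    print("reset needed after 10s:", monitor.check())
    clock.advance(31)
    print("reset needed after 41s of silence:", monitor.check())
```

Note that a watchdog like this only sees whether the guest answers; it knows nothing about the state of an application inside the guest, which is the gap the post attributes to VM Failure Monitoring.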

  13. shao

    A very clear article. Yesterday I had the same argument with someone. I disagreed that using FT creates higher uptime for an application. For mail, for instance, my view is that you can create several levels of redundancy: (virtual) hardware, operating system, and application. As far as I can see, FT (and HA) is only valid for the hardware and operating system parts. But these tools are not able to ‘monitor’ individual mail (MS Exchange) services, like the information store.

    For this one could use MS clustering; in clustering you can create dependent cluster resources. If the main goal is high availability of an application, then FT and HA aren’t sufficient to achieve this goal. Or am I missing a point?

  14. Rob

    Scott,

    In reading VMware’s ‘Setup for Failover Clustering and Microsoft Cluster Service’, the vSphere MSCS Setup Limitations section states that none of the VMware high availability features (HA, FT, DRS or vMotion) are supported when using MSCS clustered virtual machines. Similar limitations exist with EMC AutoStart as well.

    As a provider of turnkey systems that rely on having highly available applications, this seems to nullify most of the benefit of virtualization. Are there any options for providing application (Windows service) level fault tolerance that will allow for continued use of the VMware high availability features?

    Thanks,
    Rob.

  15. David

    Scott,

    In your book, Mastering VMware vSphere 4 (pg 458-459), you seem to indicate that virtual machines in a clustered configuration are not valid candidates for VMotion, and they cannot be part of a DRS or HA cluster. True? One of your posts seems to indicate that you can use VMware HA with MSCS.

    Thanks
    David

  16. slowe

    David,

    The book is correct; my earlier comments are not. VMware HA cannot be used to protect MSCS clusters. vSphere Update 1 allows the VMs to reside within an HA/DRS cluster by disabling HA/DRS for those VMs alone.

    Rob,

    I’m not aware of any other Windows service-level fault tolerance options that will co-exist nicely with HA, DRS, FT, and VMotion. They probably exist, but I’m not familiar with them or how they would interact with vSphere’s functionality.

  17. Phil Maynard

    Rob,

    If you’re looking for a product that integrates with HA/FT and vMotion, and that detects and automatically repairs Windows system and application failures, you might want to check out vAppHA:

    http://www.neverfailgroup.com/virtualization/vapphatrial.html

  18. raja

    What is the background of HA? How does it work? What are AAM, vpxa, and vmap? Please tell me.

  19. preux

    Hi, I’m preux, pursuing a B.Tech. My final-year thesis will be on the data stored in the VMDK file.

    First, I will tell you how much I have done.

    I created 2 different VMDKs:

    1 -sparse.vmdk

    2 -flat.vmdk

    I opened them in a hex editor, which gives me the hexadecimal structure of the VMDK files. By using Virtual Disk Format 5.0, which is available on the VMware site, I got some knowledge of the structure of the VMDK file.

    When I read the structure of the -flat.vmdk, it’s similar to the hard disk structure; I can easily see the MBR (master boot record), PBR (partition boot record), and $MFT (master file table) in the -flat.vmdk.

    But when I go to the sparse one, it’s quite different from the -flat. I am using the formula provided in Virtual Disk Format 5.0; through this I can only see the MBR and PBR, but when I apply this formula for the $MFT (master file table) I find nothing.

    So my question is: how can I reach this master file table in -sparse.vmdk my way (byte-by-byte reading), so that I can complete my thesis and degree also?

    Thanks in advance

    preux

  20. Dave

    Doesn’t MSCS clustering require RDMs? And, so doesn’t this mean that VADP cannot be used for backups?

  21. kiran

    It’s a good article; it clears up the picture.

  22. Madhu

    Hi Sir,

    I have an S1200 BTS motherboard; only one uplink (NIC) is working normally.

    But I need an MSCS failover cluster on that box. Can you please send me any step-by-step instructions?

    Thanks
    Madhu
