More Discussion on VMware HA Failover Capacity

A few other bloggers have picked up VMwarewolf’s article about calculating VMware HA failover capacity, which I wrote about a few days ago.

Thomas Bishop over at scalethemind.com (a fellow Planet V12n blogger) has this to say in his blog posting:

As I would expect, HA errs on the side of caution when determining the capacity (uses the host with the least amount of RAM and the guest with the most amount of RAM as the basis of the calculation).

I can certainly see his point; after all, VMware HA is all about planning for unexpected downtime. How is VMware HA going to know which host is going to fail, and which of the remaining hosts, if any, will have capacity to run the failed VM(s)? From that perspective, VMware HA almost has to take a worst-case approach in order to be prepared for a situation in which the VM with the largest amount of configured or reserved memory must be restarted on the host with the least amount of physical RAM.

Unfortunately, the white paper to which Thomas linked (found here on VMware’s web site) doesn’t do a very good job of providing any additional detail on the calculation of VMware HA failover capacity; in fact, it seems to contradict VMwarewolf’s statements to a certain extent. For example, take this statement from the tech doc:

When computing required failover capacity, HA first considers the host with the largest capacity to run virtual machines with the highest resource requirements.

Unless I’m reading that statement incorrectly, that flies directly in the face of VMwarewolf’s posting, which states just the opposite.  However, the document goes on to say:

HA might therefore be quite conservative in its estimates if the hosts in your cluster have a wide variance in the individual resources they provide.

In addition, the tech doc recommends the use of more uniform systems in HA clusters, so as to avoid issues such as what we’ve been discussing (where a 32GB host might be treated as a 16GB host for the purposes of calculating VMware HA failover “slots”).  Otherwise, organizations may find themselves in this boat and VMware HA won’t be able to accurately protect them against physical host failure.
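
To make the 32GB-versus-16GB point concrete, here is a minimal Python sketch of the worst-case model as it has been described in this discussion: the largest VM sets the slot size, and the smallest host sets how many slots every host is credited with. This is only an illustration of that description, not VMware's confirmed algorithm; the function names and the cluster figures are hypothetical.

```python
# Sketch of the worst-case slot model described in this discussion.
# NOT VMware's confirmed algorithm; all names and figures are hypothetical.

def slot_size_mb(vm_ram_mb):
    """Assumed slot size: the RAM of the largest VM in the cluster."""
    return max(vm_ram_mb)

def slots_per_host(host_ram_mb, vm_ram_mb):
    """Assumed per-host slot count: based on the SMALLEST host in the cluster."""
    return min(host_ram_mb) // slot_size_mb(vm_ram_mb)

# A mixed cluster: one 32GB host and two 16GB hosts.
hosts_mb = [32768, 16384, 16384]
# VM memory sizes; the 2GB VMs set the slot size.
vms_mb = [512, 1024, 2048, 2048]

per_host = slots_per_host(hosts_mb, vms_mb)
credited_total = per_host * len(hosts_mb)

# What the count would be if each host were credited with its own RAM.
ram_based_total = sum(h // slot_size_mb(vms_mb) for h in hosts_mb)

print(f"slots credited per host: {per_host}")
print(f"slots credited cluster-wide: {credited_total}")
print(f"slots if each host counted its own RAM: {ram_based_total}")
```

Under these assumptions the 32GB host is credited with the same number of slots as the 16GB hosts, which is the conservative behavior the tech doc warns about for clusters with a wide variance in host resources.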

I’ll be sure to post more information here as soon as I have anything new to share. Likewise, if anyone can share some definitive information to corroborate VMwarewolf’s statements (just to validate them and assure us that we aren’t creating a storm of discussion over nothing), that would be great.


  1. Mike M’s avatar

    Good article. I just don’t understand why anyone would build an ESX cluster using hosts of different specs. Maybe I am fortunate, but all the clusters I have built have had a minimum of 2 identical hosts. I run balanced VMs on them and each host can handle the full load. If the VMs running on host A consume 8GB, the VMs running on host B will consume the same or less. It would make sense to me that I would need 16GB of RAM if each host was to have the total capacity needed to run all VMs at max RAM utilization. Am I wrong?

    Mike M
    http://www.blatbox.com

  2. slowe’s avatar

    Mike M,

    I have more than a few customers who want to reclaim hardware and repurpose it running ESX Server in order to reduce their overall hardware costs. In that kind of situation, it’s very possible to end up with ESX hosts with differing specs.

    One key thing about this issue (how VMware HA calculates failover capacity) is that it’s not about total RAM utilization. In your situation, suppose you had two hosts, Host A and Host B, each with 16GB of RAM, with 16 VMs at 512MB each on Host A and 4 VMs at 2GB each on Host B. You would be running each host at 8GB RAM utilization. However, VMware HA would calculate failover capacity at 8 slots per host (16GB / 2GB = 8 slots), and Host B’s 4 VMs already occupy 4 of its slots. If Host A failed, not all of your VMs would restart on Host B due to “insufficient failover capacity,” EVEN THOUGH from a maximum RAM utilization perspective you still have enough RAM.
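
The arithmetic in this example can be checked with a short Python sketch. It assumes the slot model discussed in this thread (slot size equals the largest VM's RAM, and each VM occupies one slot regardless of its size); that model is an assumption drawn from the discussion, not a confirmed description of VMware HA, and the variable names are hypothetical.

```python
# Worked version of the two-host example, under the ASSUMED slot model:
# slot size = largest VM's RAM, and every VM occupies one slot.
# Not a confirmed description of VMware HA; names are hypothetical.

HOST_RAM_MB = 16 * 1024            # both hosts have 16GB

host_a_vms = [512] * 16            # 16 VMs at 512MB each
host_b_vms = [2048] * 4            # 4 VMs at 2GB each

slot_mb = max(host_a_vms + host_b_vms)          # largest VM sets the slot size
slots_per_host = HOST_RAM_MB // slot_mb         # 16GB / 2GB

# Host A fails: its VMs need slots on Host B.
free_slots_on_b = slots_per_host - len(host_b_vms)
restartable_by_slots = min(len(host_a_vms), free_slots_on_b)

# The same failure viewed purely as RAM utilization.
ram_free_on_b = HOST_RAM_MB - sum(host_b_vms)
ram_needed = sum(host_a_vms)

print(f"slots per host: {slots_per_host}")
print(f"free slots on Host B: {free_slots_on_b}")
print(f"VMs from Host A that fit by slot count: {restartable_by_slots} of {len(host_a_vms)}")
print(f"RAM needed {ram_needed}MB vs. RAM free {ram_free_on_b}MB")
```

By slot count only a fraction of Host A's VMs fit on Host B, while by raw RAM the whole load fits, which is the mismatch described above.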

    I hope this helps clarify why this issue is critical.

  3. skbl’s avatar

    Hi,

    Interesting article, but I don’t understand what’s happening with my setup: I have a cluster (1 host failover) built with two identical servers, with 16GB each.

    As I used DRS for initial placement, there are running VMs on both machines, with memory configurations ranging from 512MB to 2048MB.

    So, I should also have 16/2=8 available slots, shouldn’t I?
    On the first server, there are 8 running machines (and two powered off) and 10 on the second one (and one powered off).
    If I try to power on one more, failover restrictions don’t allow me to do so.

    Strange figures ???

  4. slowe’s avatar

    Skbl,

    Check out this article:

    http://blog.scottlowe.org/2008/01/07/vmware-ha-clarification/

    Which may help clear things up (or not). Despite all the blogging about it, it still doesn’t seem clear exactly how VMware HA calculates failover capacity.