VMware HA Problem with Update 3

A problem has been identified with VMware ESX 3.5 Update 3 when using VMware HA and VM failure monitoring. This problem results from a delay in the transmission of a heartbeat from a VM to VMware HA; VMware HA then detects this as a VM failure and restarts the VM. It appears that this problem affects both VMware ESX and VMware ESXi.

More information on the problem is available in this KB article.

To work around the problem, users have two options:

  1. Disable virtual machine failure monitoring within the VMware HA cluster.
  2. Reconfigure the host to change the heartbeat delay.

To reconfigure the host to change the heartbeat delay, follow the steps below:

  1. Disconnect the host from VirtualCenter (VC) by right-clicking the host in the VI Client and selecting “Disconnect”.
  2. Log in to the VMware ESX server via SSH and obtain root permissions. Remember that best practices specify not to allow root SSH login, so you’ll need to log in as an ordinary user and then use “su -” to become root.
  3. Using a text editor such as nano or vi, edit the file “/etc/vmware/hostd/config.xml” and set the value of heartbeatDelayInSecs to 0.

  4. Restart the management agents on the VMware ESX server (on ESX 3.5, run “service mgmt-vmware restart” from the service console).
  5. Reconnect the host in VC (right-click on the host in the VI Client and select “Connect”).
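
If you have more than a few hosts to touch, the edit in step 3 can be scripted. Here is a minimal sketch in Python, assuming the heartbeatDelayInSecs element already exists in the file. The function name set_heartbeat_delay and the sample fragment are my own, for illustration only; the real config.xml contains far more than this, and the sketch deliberately does not guess where the element belongs if it is missing. Back up the file before changing it.

```python
import re

# Matches the element named in the post and captures its opening and closing tags.
_HEARTBEAT = re.compile(r"(<heartbeatDelayInSecs>)[^<]*(</heartbeatDelayInSecs>)")


def set_heartbeat_delay(config_text, seconds=0):
    """Return config_text with every heartbeatDelayInSecs value set to `seconds`."""
    new_text, count = _HEARTBEAT.subn(
        lambda m: m.group(1) + str(seconds) + m.group(2), config_text
    )
    if count == 0:
        # Do not guess where the element should go; consult the KB article instead.
        raise ValueError("heartbeatDelayInSecs element not found")
    return new_text


# Hypothetical fragment for illustration; the parent element name and the
# starting value of 30 are made up, not taken from a real host.
sample = "<config><example><heartbeatDelayInSecs>30</heartbeatDelayInSecs></example></config>"
print(set_heartbeat_delay(sample))
```

A regular expression is used rather than an XML parser so that the rest of the file, including any comments, is left byte-for-byte untouched. You would still need to read and write the file yourself and then carry out steps 4 and 5.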

No information is yet available on when this issue will be fixed.

  1. Tom

    Scott, Thanks for the tip. I’m going to check these settings.

  2. Rick Vanover

    These are discomforting. I am having a similar issue with the file sync service (LGTO_Sync), but have found that while the behavior is similar, this particular random reboot is not HA driven. “I am told” ESX 3.5 Update 4 (and some individual updates) will have this all corrected.

  3. Andy Daniel

    Although not fully disclosed, we have been told by support services that this same issue is what has caused several of our Update 2 VMs to randomly reboot recently.

