This article started life as something entirely different. I was reviewing some of the VMworld 2007 slide decks, looking for “nuggets of knowledge,” as I like to call them (these are the small details that are often far more significant than they might seem) when I came across some information on VMware HA isolation response. I was actually looking for something else but as is typically the case when you’re looking for something, you find everything but the one thing for which you’re looking.
In any event, I wanted to take some time to better understand isolation response, so I decided to run some experiments in my lab with VMware HA. For those who aren’t familiar with it, isolation response is the term used to describe what an ESX Server in a DRS/HA cluster will do if it loses connectivity to all the other servers in the cluster, i.e., if it becomes isolated. Isolation response is set on a per-VM basis, and the default (I believe) is to power off. What this means is that when an ESX host becomes isolated, it will power off the VMs that are currently running on that host.
There’s a great deal of debate as to whether this is the right setting or not, which I won’t really delve into right now. In any case, how does a host determine if it is isolated, or if the rest of the cluster is just down? That’s what got me started down this path. The VMware HA agent (which is really the Legato Automated Availability Manager, or AAM, agent—hence the AAMClient stuff in esxcfg-firewall) uses the Service Console’s default gateway as its isolation address. Basically, what this means is that if a host can’t get to any of the other hosts in the cluster and can’t get to the isolation address, then it assumes it is isolated and initiates the isolation response. If it can’t get to other nodes in the cluster but can reach the isolation address, then it is not isolated and should continue operation (perhaps even restarting some VMs locally since this would indicate host failures in the cluster).
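The decision described above can be written down as a small truth table. What follows is only a sketch of the logic as I understand it, not the AAM agent's actual code; the function name isolation_state and its yes/no arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch of the isolation decision described above (NOT VMware's real code).
# $1 = "yes" if at least one other cluster host is reachable, otherwise "no"
# $2 = "yes" if the isolation address (the Service Console default gateway)
#      is reachable, otherwise "no"
isolation_state() {
    peers_reachable=$1
    isolation_address_reachable=$2

    if [ "$peers_reachable" = "yes" ]; then
        # Heartbeats are still flowing, so nothing is wrong.
        echo "connected"
    elif [ "$isolation_address_reachable" = "yes" ]; then
        # Peers are gone but the network is fine: treat it as host failures
        # elsewhere in the cluster and keep running.
        echo "not-isolated"
    else
        # Neither peers nor the isolation address answer: assume isolation
        # and trigger the isolation response.
        echo "isolated"
    fi
}
```

For example, `isolation_state no no` prints `isolated`, which is the only case in which the isolation response should fire.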
The stuff I found in the VMworld 2007 slides talks of using a second isolation address, which provides the VMware HA agent with another means of verifying isolation before initiating the isolation response. Before I proceeded with setting this second address, however, I wanted to be sure I understood the operation of isolation response in the current configuration. Once I’d tested that and then tested the second isolation address, I was going to write it up here.
To make a long story not quite so long, I found that isolation response was not working as expected. What happened is that other hosts in the cluster would detect the “host failure” (the isolation of my test host) and try to restart the VM before the test host detected isolation and tried to shut down the VM. This was evidenced by these lines in /var/log/vmkernel:
Oct 5 13:13:36 esx02 vmkernel: 38:20:29:34.025 cpu3:1305)WARNING: NFSLock: 1883: disk is being locked by other consumer
Oct 5 13:13:36 esx02 vmkernel: 38:20:29:34.025 cpu3:1305)NFSLock: 2479: failed to get lock on file vswim01-flat.vmdk 0x5a1b6a0 on 192.168.31.51 (192.168.31.51)
(Yes, I’m running my VMs on NFS. Yes, I did try iSCSI to see if the behavior was different. No, I did not try Fibre Channel. Yes, I got the same results in both cases.)
To make things even more interesting, I found that the test host failed to successfully shut down a Linux VM when the isolation response was finally triggered, but was able to successfully power down a Windows guest. Both VMs had the latest version of the VMware Tools installed.
Since that time, I’ve been combing the Internet searching for more information on the VMware HA agent, the AAM ftcli utility, behaviors, workarounds, configuration tweaks, etc. Thus far, it has been an abysmal failure. There are lots of VMware Community threads, but almost every one of those is a “double-check your DNS and /etc/hosts” thread.
So, any VMware gurus out there have some useful information to share? Anyone else having VMware HA problems? Anyone know where I can find some actually useful information on VMware HA and the AAM client? I’d love to get some more detailed information and be able to put this thing to rest (and be able to advise others on how to put it to rest as well).
Tags: ESX, NFS, Virtualization, VMware, VMwareHA
-
HA, in its current implementation, is very flawed. I have opened many cases with platinum support based on more than a couple of outages. My boss affectionately calls it the “Low Availability” feature of VMware.
It’s all about the timing. When the test host loses connectivity to the other hosts in the cluster, it counts 12 seconds, then powers down all of its VMs using a stop trysoft hard. This is to allow the other hosts to power on the VMs. When the other hosts lose the heartbeat to the isolated test host, they start counting, and at 15 seconds they will attempt to power on the VMs. If the isolated host resumes heartbeats between 12 and 14 seconds, you are stuck with a bunch of powered-off VMs. I once had a switch problem that caused all of my hosts to become isolated and it powered off hundreds of VMs. Very Cool. I am not sure exactly what it is you needed answered but I hope that helps.
With regards to the Linux guest that did not power down, you should be able to troubleshoot that by running a vmware-cmd stop trysoft hard against it.
Mike
SoCal -
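The timing window described in the comment above can be sketched as a simple comparison. The 12 and 15 second values are the commenter's numbers, not figures I have verified, and the function name outage_outcome is hypothetical.

```shell
#!/bin/sh
# Sketch of the timing race described above (values taken from the comment,
# not from VMware documentation).
# $1 = length of the network outage in whole seconds
# $2 = seconds before the isolated host powers off its VMs (default 12)
# $3 = seconds before the other hosts attempt a restart (default 15)
outage_outcome() {
    outage=$1
    isolation_delay=${2:-12}
    failover_delay=${3:-15}

    if [ "$outage" -lt "$isolation_delay" ]; then
        # Heartbeats came back before anyone reacted.
        echo "no-action"
    elif [ "$outage" -lt "$failover_delay" ]; then
        # The isolated host already powered off its VMs, but the other
        # hosts see heartbeats again and never restart them.
        echo "powered-off-no-restart"
    else
        # The other hosts declare the host failed and try to power on its VMs.
        echo "failover-attempted"
    fi
}
```

Calling `outage_outcome 13` lands in the middle branch, the gap in which VMs end up powered off with nobody restarting them.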
This gives AAM prompt
“FT_DIR=/opt/LGTOaam512 /opt/LGTOaam512/bin/ftcli -d vmware”
and I have used this for configuring root id for HA.
Didn’t get time to explore much with this command. -
Hi Scott,
How did you get on with this in the end? Did you manage to find out anything more about the AAM agent and isolation response? Did you find any nuggets of info to aid in troubleshooting and tuning it in more depth?
Cheers,
Warren
-
Well, I just had an interesting HA experience that I thought I would share.
I set up port security with violation shutdown on the switches that carry my VMware traffic and had kind of forgotten about it. When our physical server running VirtualCenter died, I decided it was time to rebuild it as a VM since I hate our physical servers. I added a VM port group to the console switch, built the VM, and then ‘connected’ the virtual NIC from the VM to the Service Console.
As I learned later, this caused my port security to shut down the port (MAC address violation), which caused the host to think that it was isolated. Sure enough, it powered down the VMs that I had running on that server (luckily only 2 at the time) and then powered them back on on the other ESX server.
That all sounds good until you realize that it powered on my virtual center VM which caused a port-security violation on the second ESX server, which then caused it to become isolated. Mind you that all of my VMs still had full access to the production network.
Very fortunately, and I would like to confirm this with others, despite the isolation response being ‘shutdown’ the second ESX server in the cluster (the only other one) did NOT power down all of the VMs after becoming isolated due to my bad planning.
So it seems that isolation response will power down unless it is the last host in the cluster?
-
Oh, and I should add that I changed all of my port-security to protect.
-
We’ve decided to change the setting for isolation response from its default of power off to leave powered on. Out of all the failures we’ve had in a cluster of hosts, we have never had just one host become isolated; it’s always been the entire network and every host, which leaves all our VMs powered off. Probably a flaw in our network design, but there it is.
-
hi
I know this is a little off topic, but I thought someone might have some input regarding a situation that recently occurred to one of our customers.
they have an ha enabled cluster with four hosts.
they use a separate stack of switches for the production and the management lan. each host has only one service console, which is connected to the management lan. recently they had to reboot the management switch stack, which left them with 80 powered off vms, even though the production lan was never interrupted. according to what i read here it makes perfect sense, since the only heartbeat is through the service console. any thoughts on this? would a second service console on the production lan solve this problem, and would this be a good solution?
many thanks for your help!
sean
-
hey scott!
thanks for your reply.
the idea behind the whole thing was actually not to have the service console on the production lan. I so far considered this to be best practice.. but I guess this is the downside to it. would you put a service console in the production lan?
tell me if I’m wrong, but if I add a second service console wouldn’t it check for other hosts or the default gateway of the second console without having to add a second isolation address? the way I understand it is that the second address would be to check for another address (a switch for example) in addition to checking the gateways. it should check the gateways of each sc with the default setting in my opinion.. -
Hi all,
I have a problem regarding isolation in ESX 3.5. I have set up 3 ESX hosts in one cluster with 1 VC Server. I found that those 3 hosts become isolated at irregular intervals. How can I solve it? Or how can I troubleshoot it? Do I need to add the second isolation address in VC?
Many Thanks,
Cyrus -
Hi Slowe,
I have configured to use multiple NICs bound to the vSwitch (2 x Active NICs). And the default gateway also responds to ping.
I will try to add the second isolation address and see if they become stable or not. Many thanks for your recommendation.
-
Hi Slowe,
I have another issue regarding VMware HA. I don’t know why my 3 hosts in the cluster raise an event “HA agent on host1 in cluster has an error” every hour. They return to normal within 1 minute and the hosts don’t become isolated. There is no other side effect, just an error event. I have set das.failuredetectiontime to 60000 and added das.isolationaddress2. Do you have any idea what causes it?
Many thanks,
Cyrus -
Hi Slowe,
Where can I check the detail log for HA?
Thanks.
Cyrus -
I am running VMware i3 and get the “Cannot reach isolation address” error when I try to start a host in a cluster. The problem seems to be that I have 2 NICs on the user LAN for redundancy (10.10.10.0/24 -> default gateway 10.10.10.254, which is the isolation address). This works fine until I add a VMotion network on another subnet (192.168.1.0/24) and try to start the host. So I guess it’s trying to reach 10.10.10.254 from 192.168.1.1 and failing. Any ideas?
-
Thanks for your assistance. Hope this is clear…
First Server:
vswitch0
vmnetwork --vmnic2--> LAN switch
          --vmnic0--> LAN switch
VMkernel port
Management Network
10.10.10.1
vswitch1
VMkernel port
vmotion --vmnic4 --|
192.168.1.1        | Hard wired
                   |
Second server:     |
Same as, except .2 |
10.10.10.2         |
vmotion --vmnic4 --|
192.168.1.2 -
I think I have the solution. In the NIC teaming configuration of the vmotion switch there is a failback drop-down that needs to be set to “no”. It defaults to yes.
Thanks for your help.
-
Sorry – spoke too soon. That does seem to get round the isolation issue, but then I get hit with an inconsistent IP address, as per http://communities.vmware.com/thread/118186. I must be doing something very wrong, but I don’t see what. I get an error saying “Configuration of host IP address is inconsistent on host xxx: address resolved to 192.168.1.1.”. There is nothing in the hosts files except the loopback entry, and I am using DNS, which resolves correctly. If I remove the VMotion network everything is OK, except, obviously, that I don’t have a VMotion network. Any thoughts?
-
Not quite getting this part..
Once I create a second SC connection, on a different vswitch on a different subnet, I have to: 1) Add the das.isolationaddress in Advanced options, and make that the SC gateway. 2) Then add das.isolationaddress2 and – here’s what I don’t get – add what address? The gateway address for that subnet, or the actual address of the second console? Or something else entirely?
-
Created a small script on my ESX server to list the HA settings. Might be useful to some.
The file is named ha_list.sh
#!/bin/sh
# Created by Erik Bussink (July 2008)
#
# requires the FT_DIR variable
# add the following line to your /root/.bashrc
# export FT_DIR=/opt/vmware/aam
#
# List current HA members, their Roles and their Status
/opt/vmware/aam/bin/ftcli -domain vmware -connect localhost -port 8042 -timeout 60 -cmd listnodes
echo
# List current HA members settings
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "la -l"
-
And a 2nd one that will continuously monitor the status
ha_watch.sh
#!/bin/sh
# Created by Erik Bussink (July 2008)
#
# requires the FT_DIR variable
# add the following line to your /root/.bashrc
# export FT_DIR=/opt/vmware/aam
#
# Re-run listnodes at watch's default interval until interrupted
watch '/opt/vmware/aam/bin/ftcli -domain vmware -connect localhost -port 8042 -timeout 60 -cmd listnodes'
-
Hiya,
I also created a larger script to help get some additional information from HA.
ha_status.sh
#!/bin/sh
# Erik Bussink (July 2008)
#
# requires the FT_DIR variable
# add the following line to your /root/.bashrc
# export FT_DIR=/opt/vmware/aam
#
# Monitor AAM Health (Command taken from VMworld 2006 – Effective DRA and HA in production by Nitin Suri)
/usr/bin/perl /opt/vmware/aam/ha/aam_config_util.pl -z -cmd=listnodes -domain=vmware
# List current HA members status
echo
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "la -l"
echo
# List status of AAM Agent on each node
echo
echo "-[ esx11 ]---------------------------------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx11"
echo "-[ esx12 ]---------------------------------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx12"
echo "-[ esx13 ]---------------------------------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx13"
echo "-[ esx14 ]---------------------------------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx14"
echo "-[ esx15 ]---------------------------------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx15"
echo "-[ esx16 ]---------------------------------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx16"
Sure, I have not really optimized the bash shell lines near the end, but it’s easy and it can be useful.
Erik
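For anyone who wants to shorten the repeated per-node lines in the script above, a loop is one option. This is only a sketch: the ha_node_status function and the overridable FTCLI variable are my own additions for illustration, while the ftcli path, the -domain vmware argument and the "status <node>" command are the ones used in the script above.

```shell
#!/bin/sh
# Sketch only: collapses the repeated per-node status calls into a loop.
# Like the scripts above, ftcli needs FT_DIR exported (e.g. /opt/vmware/aam).
# FTCLI can be overridden (for example FTCLI=echo) to dry-run the loop on a
# machine that has no AAM agent installed; the default is the path used above.
FTCLI=${FTCLI:-/opt/vmware/aam/bin/ftcli}

# Print a header and the AAM agent status for every node name given.
ha_node_status() {
    for node in "$@"; do
        printf '%s\n' "-[ $node ]---------------------------------------------"
        "$FTCLI" -domain vmware -cmd "status $node"
    done
}
```

It would be called with the node names from the original script, e.g. `ha_node_status esx11 esx12 esx13 esx14 esx15 esx16`.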
-
Great post, Scott – exactly the same behaviour I’m seeing (except the vmkernel log, I will need to test that to see if I’m seeing the same there).
I was wondering if this has been logged as a bug with VMware? Clearly if the working hosts are trying to power on the VM before it has been powered off, the software has almost created a split-brain scenario without any mis-configuration – surely a bug?
I’m going to do some testing with ESX 3.5 Update 2 (assuming the timebomb fix is stable) – will post results.
Cheers,
TK -
I think we have the same situation as TomK: working hosts are trying to power on VMs which haven’t been powered off, even though the network connection has been restored.
It seems that each ESX server “memorizes” all VMs which have to be restarted in response to the isolation event.
Killing the “wrong” VMs on each ESX server would be a solution, but with more than 60 VMs and 6 ESX servers it’s quite a lot of work.
What possibilities do we have to clean up the situation? What will happen if we disable HA? Will it clean up the isolation response so that no VM is restarted unexpectedly?
Daniel
-
Cyrus,
About your machines giving you the “HA error”... do you have the time service properly configured?
-
I have an ongoing issue with one of my ESX servers. The error reads: “AAM ERROR message: node isolated from network” and it goes on to say that in order to fix this problem I have to use the option -noiso to override the fault. I have no clue how to write the syntax to get this command to work. I believe this will solve my problems and get the server to start up. Any help is appreciated.