This article started life as something entirely different. I was reviewing some of the VMworld 2007 slide decks, looking for “nuggets of knowledge,” as I like to call them (these are the small details that are often far more significant than they might seem) when I came across some information on VMware HA isolation response. I was actually looking for something else but as is typically the case when you’re looking for something, you find everything but the one thing for which you’re looking.
In any event, I wanted to take some time to better understand isolation response, so I decided to perform some experiments in my lab with VMware HA and isolation response. For those who aren’t familiar with it, isolation response is the term used to describe what an ESX Server in a DRS/HA cluster will do if it loses connectivity to all the other servers in the cluster, i.e., if it becomes isolated. Isolation response is set on a per-VM basis, and the default (I believe) is to power off. What this means is that when an ESX host becomes isolated, it will power off the VMs that are currently running on that host.
There’s a great deal of debate as to whether this is the right setting or not, which I won’t really delve into right now. In any case, how does a host determine if it is isolated, or if the rest of the cluster is just down? That’s what got me started down this path. The VMware HA agent (which is really the Legato Automated Availability Manager, or AAM, agent—hence the AAMClient stuff in esxcfg-firewall) uses the Service Console’s default gateway as its isolation address. Basically, what this means is that if a host can’t get to any of the other hosts in the cluster and can’t get to the isolation address, then it assumes it is isolated and initiates the isolation response. If it can’t get to other nodes in the cluster but can reach the isolation address, then it is not isolated and should continue operation (perhaps even restarting some VMs locally since this would indicate host failures in the cluster).
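As a rough illustration, the decision just described can be sketched as a small shell function. This is only my reading of the behavior, not the AAM agent's actual code; the names reachable and classify_host and the three state labels are made up for the example.

```shell
#!/bin/sh
# Sketch only: mirrors the decision described above, not the AAM agent's real logic.

# reachable ADDR: hypothetical helper, succeeds if ADDR answers one ICMP echo.
reachable() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# classify_host PEERS_OK ISOLATION_ADDR_OK
# Both arguments are "yes" or "no"; prints the state the host would assume.
classify_host() {
    peers_ok=$1
    isolation_addr_ok=$2
    if [ "$peers_ok" = "yes" ]; then
        echo "normal"            # heartbeats are fine, nothing to decide
    elif [ "$isolation_addr_ok" = "yes" ]; then
        echo "peers-failed"      # network is up, so the other hosts are presumed down
    else
        echo "isolated"          # nothing reachable: trigger the isolation response
    fi
}
```

For example, classify_host no no prints isolated, while classify_host no yes prints peers-failed. The reachable helper shows one way the two yes/no inputs could be gathered with ping.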
The stuff I found in the VMworld 2007 slides talks of using a second isolation address, which provides the VMware HA agent with another means of verifying isolation before initiating the isolation response. Before I proceeded with setting this second address, however, I wanted to be sure I understood the operation of isolation response in the current configuration. Once I’d tested that and then tested the second isolation address, I was going to write it up here.
To make a long story not quite so long, I found that isolation response was not working as expected. What happened is that other hosts in the cluster would detect the “host failure” (the isolation of my test host) and try to restart the VM before the test host detected isolation and tried to shut down the VM. This was evidenced by these lines in /var/log/vmkernel:
Oct 5 13:13:36 esx02 vmkernel: 38:20:29:34.025 cpu3:1305)WARNING: NFSLock: 1883: disk is being locked by other consumer
Oct 5 13:13:36 esx02 vmkernel: 38:20:29:34.025 cpu3:1305)NFSLock: 2479: failed to get lock on file vswim01-flat.vmdk 0x5a1b6a0 on 192.168.31.51 (192.168.31.51)
(Yes, I’m running my VMs on NFS. Yes, I did try iSCSI to see if the behavior was different. No, I did not try Fibre Channel. Yes, I got the same results in both cases.)
To make things even more interesting, I found that the test host failed to successfully shut down a Linux VM when the isolation response was finally triggered, but was able to successfully power down a Windows guest. Both VMs had the latest version of the VMware Tools installed.
Since that time, I’ve been combing the Internet searching for more information on the VMware HA agent, the AAM ftcli utility, behaviors, workarounds, configuration tweaks, etc. Thus far, it has been an abysmal failure. There are lots of VMware Community threads, but almost every one of those is a “double-check your DNS and /etc/hosts” thread.
So, any VMware gurus out there have some useful information to share? Anyone else having VMware HA problems? Anyone know where I can find some actually useful information on VMware HA and the AAM client? I’d love to get some more detailed information and be able to put this thing to rest (and be able to advise others on how to put it to rest as well).
Tags: ESX, NFS, Virtualization, VMware, VMwareHA


32 comments
Monday, October 8, 2007 at 2:23 pm
Mike
HA, in its current implementation, is very flawed. I have opened many cases with platinum support based on more than a couple of outages. My boss affectionately calls it the “Low Availability” feature of VMware.
It’s all about the timing. When the test host loses connectivity to the other hosts in the cluster, it counts 12 seconds, then powers down all of its VMs using a stop trysoft hard. This is to allow the other hosts to power on the VMs. When the other hosts lose the heartbeat to the isolated test host, they start counting, and at 15 seconds they will attempt to power on the VMs. If the isolated host resumes heartbeats between 12 and 14 seconds, you are stuck with a bunch of powered off VMs. I once had a switch problem that caused all of my hosts to become isolated and it powered off hundreds of VMs. Very Cool. I am not sure exactly what it is you needed answered but I hope that helps.
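To make that window concrete, the timing above can be sketched as a small shell function. The 12- and 15-second figures are the ones quoted in this comment; the function name race_outcome and the outcome labels are hypothetical, and the real agent may behave differently.

```shell
#!/bin/sh
# Sketch only: models the timing window described above, using the figures
# quoted in this comment (12 s to isolation response, 15 s to failover).

# race_outcome RESUME_SECONDS
# RESUME_SECONDS is how long after heartbeat loss the heartbeats come back.
race_outcome() {
    resume=$1
    isolation_at=12   # isolated host powers off its VMs
    failover_at=15    # other hosts try to power the VMs on
    if [ "$resume" -lt "$isolation_at" ]; then
        echo "no-action"                  # blip was too short to matter
    elif [ "$resume" -lt "$failover_at" ]; then
        echo "vms-off-not-restarted"      # the bad window: powered off, no failover
    else
        echo "vms-restarted-elsewhere"    # failover had time to kick in
    fi
}
```

Under these assumptions, race_outcome 13 prints vms-off-not-restarted, which is the case where heartbeats resume after the power-off but before the other hosts act.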
With regard to the Linux guest that did not power down, you should be able to troubleshoot that by running a vmware-cmd stop trysoft hard against it.
Mike
SoCal
Monday, October 8, 2007 at 2:44 pm
slowe
Mike,
Keep in mind what HA is designed to do, and that’s reboot workloads when a host fails. What I’m discussing here, and what you’ve experienced, is in regards to isolation response, i.e., a host becomes isolated but is not down per se (losing the Service Console makes a host isolated, but VMs continue to work as expected). I have had good experiences with HA outside of isolation response; refer to an earlier article I wrote:
http://blog.scottlowe.org/2007/01/22/vmware-ha-in-action/
I would suspect that for you, utilizing the das.isolationaddress2 (new in VC 2.0.2) to add a second isolation address (preferably on a separate physical NIC on a separate physical switch) would help things out. In my case, it sounds like the timing is all off, and what I’m really seeking is *WHERE* someone can go to see the configuration of the AAM client and try to troubleshoot it.
I did try running vmware-cmd stop trysoft hard myself as a troubleshooting technique; that didn’t stop the VM, either. Again, what I’d *REALLY* love to see is some way of configuring AAM so as to understand what commands it’s trying to run when it detects that a host is isolated. Thus far, I have found VERY little documentation on AAM and its implementation within VMware HA.
Wednesday, October 10, 2007 at 9:18 am
Kenon Owens
Scott,
I don’t know the solutions to the problems you discussed here, but I would like to work with you and our Engineering department to figure out the answers. If you could email me offline, we could work this out.
Kenon
VMware
Monday, October 22, 2007 at 11:21 pm
Srini
This gives the AAM prompt:
FT_DIR=/opt/LGTOaam512 /opt/LGTOaam512/bin/ftcli -d vmware
I have used this for configuring the root ID for HA.
I didn’t get time to explore much with this command.
Sunday, January 6, 2008 at 2:52 pm
Warren Walker
Hi Scott,
How did you get on with this in the end? Did you manage to find out anything more about the AAM agent and isolation response? Did you find out any nuggets of info to aid in a more in-depth way of troubleshooting, and tuning it?
Cheers,
Warren
Sunday, January 6, 2008 at 3:58 pm
slowe
Warren,
As it turns out, I haven’t been able to make any real progress on this issue. I hope to be able to focus a bit more on this, as well as try to reproduce the issue with ESX 3.5 now that the lab has been upgraded. As soon as I have some additional information, I’ll be sure to post it here. Thanks!
Monday, January 14, 2008 at 5:45 pm
Bill
Well, I just had an interesting HA experience that I thought I would share.
I set up port security with violation shutdown on the switches that carry the VMware traffic and had kind of forgotten about it. When our physical server running VirtualCenter died, I decided it was time to rebuild it as a VM since I hate our physical servers. I added a VM port group to the console switch, built the VM, and then ‘connected’ the virtual NIC from the VM to the Service Console.
As I learned later, this caused my port security to shut down the port (MAC address violation), which caused the host to think that it was isolated. Sure enough, it powered down the VMs that I had running on that server (luckily only 2 at the time) and then powered them back on on the other ESX server.
That all sounds good until you realize that it powered on my virtual center VM which caused a port-security violation on the second ESX server, which then caused it to become isolated. Mind you that all of my VMs still had full access to the production network.
Very fortunately, and I would like to confirm this with others, despite the isolation response being ‘shutdown’, the second ESX server in the cluster (the only other one) did NOT power down all of its VMs after becoming isolated due to my bad planning.
So it seems that isolation response will power down unless it is the last host in the cluster?
Monday, January 14, 2008 at 5:46 pm
Bill
Oh, and I should add that I changed all of my port-security to protect.
Thursday, January 31, 2008 at 12:07 pm
Brett
We’ve decided to change the setting for isolation response from its default of power off to leave powered on. Out of all the failures we’ve had in a cluster of hosts, we have never had just one host become isolated; it’s always been the entire network and every host, which leaves all our VMs powered off. Probably a flaw in our network design, but there it is.
Tuesday, March 18, 2008 at 8:49 am
sean c
hi
I know this is a little off topic, but I thought someone might have some input regarding a situation that recently occurred at one of our customers.
they have an ha enabled cluster with four hosts.
they use a separate stack of switches for the production and the management lan. each host has only one service console, which is connected to the management lan. recently they had to reboot the management switch stack, which left them with 80 powered off vms, even though the production lan was never interrupted. according to what i read here it makes perfect sense, since the only heartbeat is through the service console. any thoughts on this? would a second service console on the production lan solve this problem, and would this be a good solution?
many thanks for your help!
sean
Tuesday, March 18, 2008 at 9:51 am
slowe
Sean,
Your comment is actually rather on-topic, since the behavior you are seeing is isolation response kicking in. The best answer to your situation is indeed a second service console; in addition, you’ll want to add the das.isolationaddress2 parameter in VirtualCenter (on the properties of the HA cluster) and reconfigure HA. This will provide the HA agent with a second IP address it can use to check for isolation.
Good luck! Let me know how it turns out.
Wednesday, March 19, 2008 at 7:11 am
sean c
hey scott!
thanks for your reply.
the idea behind the whole thing was actually not to have the service console on the production lan. I so far considered this to be best practice.. but I guess this is the downside to it. would you put a service console in the production lan?
tell me if I’m wrong, but if I add a second service console wouldn’t it check for other hosts or the default gateway of the second console without having to add a second isolation address? the way I understand it is that the second address would be to check for another address (a switch for example) in addition to checking the gateways. it should check the gateways of each sc with the default setting in my opinion..
Wednesday, March 19, 2008 at 11:29 am
slowe
Sean C,
I understand your concerns about putting another Service Console interface on the production LAN, but there isn’t a whole lot you can do about it. You need redundancy for the VMware HA agent, or you’ll run into the same situation again next time.
My understanding is that the VMware HA agent won’t automatically look for more than one isolation address, although it should look for the default isolation address across all available interfaces. So, you’ll need to add the second isolation address only if you want the HA agent to try more than one device when attempting to determine if it has been isolated.
Wednesday, April 2, 2008 at 5:19 am
Cyrus
Hi all,
I have a problem regarding isolation in ESX 3.5. I have set up 3 ESX hosts in one cluster with 1 VC Server. I found that those 3 hosts become isolated at irregular intervals. How can I solve it, or how can I troubleshoot it? Do I need to add the second isolation address in VC?
Many Thanks,
Cyrus
Wednesday, April 2, 2008 at 9:00 am
slowe
Cyrus,
I would ensure that your Service Console has redundant network connections (multiple NICs bound to the vSwitch and/or a second vswif interface on a different vSwitch with different NICs as uplinks). In addition, be sure that your default isolation address (the default gateway of the Service Console) responds to ping.
As for adding the second isolation address in VC, you can right-click the HA cluster, select Edit Settings, select VMware HA, then click Advanced Options. From there you can add the das.isolationaddress2 value with the IP address of a second, different device that the HA agent should use to determine if it has become isolated. If I am not mistaken, you’ll need to reconfigure the cluster for HA after making that change.
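Before entering an address as das.isolationaddress2, it may be worth confirming from the Service Console that both the default gateway and the candidate second address actually answer ping. Here is a minimal sketch of such a check; the function name check_addresses and the PROBE override are my own inventions for the example, not part of any VMware tooling.

```shell
#!/bin/sh
# Sketch only: confirm candidate isolation addresses answer before configuring them.
# PROBE is a hypothetical override hook; by default each address gets two pings.

check_addresses() {
    failed=0
    for addr in "$@"; do
        # The expansion is left unquoted on purpose so "ping -c 2" splits
        # into a command and its arguments.
        if ${PROBE:-ping -c 2} "$addr" >/dev/null 2>&1; then
            echo "$addr: responds"
        else
            echo "$addr: NO RESPONSE"
            failed=1
        fi
    done
    return "$failed"
}
```

Call it with the Service Console's default gateway and the address you plan to use for das.isolationaddress2 as arguments; a non-zero exit status means at least one of them did not respond.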
Good luck!
Wednesday, April 2, 2008 at 9:18 pm
Cyrus
Hi Slowe,
I have configured to use multiple NICs bound to the vSwitch (2 x Active NICs). And the default gateway also responds to ping.
I will try adding the second isolation address and see whether the hosts become stable. Many thanks for your recommendation.
Friday, May 23, 2008 at 3:24 am
Cyrus
Hi Slowe,
I’ve got another issue regarding VMware HA. I don’t know why my 3 hosts in the cluster raise the event “HA agent on host1 in cluster has an error” every hour. They return to normal within 1 minute and the hosts don’t become isolated. There is no other side effect, just an error event. I have set das.failuredetectiontime to 60000 and added das.isolationaddress2. Do you have any idea what causes it?
Many thanks,
Cyrus
Saturday, May 24, 2008 at 2:33 pm
slowe
Cyrus, anything being reported in the logs anywhere?
Monday, May 26, 2008 at 3:25 am
Cyrus
Hi Slowe,
Where can I check the detailed log for HA?
Thanks.
Cyrus
Monday, June 23, 2008 at 10:19 am
Bernie
I am running VMware i3 and get the “Cannot reach isolation address” error when I try to start a host in a cluster. The problem seems to be that I have 2 NICs on the user LAN for redundancy (10.10.10.0/24 -> default gateway 10.10.10.254, which is the isolation address). This works fine until I add a VMotion network on another subnet (192.168.1.0/24) and try to start the host. So I guess it’s trying to reach 10.10.10.254 from 192.168.1.1 and failing. Any ideas?
Monday, June 23, 2008 at 10:57 am
slowe
Bernie,
Can you share your vSwitch configuration–which NICs uplinked to which vSwitches, and how the port groups are configured? That would help us troubleshoot.
Monday, June 23, 2008 at 11:24 am
Bernie
Thanks for your assistance. Hope this is clear…
First Server:
vswitch0
    vmnetwork --vmnic2--> LAN switch
              --vmnic0--> LAN switch
    VMkernel port
    Management Network
    10.10.10.1
vswitch1
    VMkernel port
    vmotion --vmnic4--|
    192.168.1.1       | Hard wired
                      |
Second server:        |
Same as, except .2    |
    10.10.10.2        |
    vmotion --vmnic4--|
    192.168.1.2
Monday, June 23, 2008 at 11:43 am
Bernie
I think I have the solution. In the NIC teaming configuration of the vmotion switch, there is a Failback drop-down that needs to be set to “No”. It defaults to “Yes”.
Thanks for your help.
Monday, June 23, 2008 at 2:46 pm
Bernie
Sorry - spoke too soon. That does seem to get round the isolation issue, but then I get hit with an inconsistent IP address, as per http://communities.vmware.com/thread/118186. I must be doing something very wrong, but I don’t see what. I get an error saying “Configuration of host IP address is inconsistent on host xxx: address resolved to 192.168.1.1.” There is nothing in the hosts files except the loopback entry, and I am using DNS, which resolves correctly. If I remove the VMotion network everything is OK, except, obviously, that I don’t have a VMotion network. Any thoughts?
Tuesday, June 24, 2008 at 9:17 am
slowe
Bernie,
Do you have two VMkernel ports? Looking back at your comments, I see mention of a VMkernel port on vSwitch0 as well as on vSwitch1. Unless you are using IP-based storage on the 10.10.10.x network, you don’t need a VMkernel port on vSwitch0, just a Service Console port group. That would leave you with a single VMkernel port on vSwitch1. Or am I misunderstanding your configuration?
Feel free to move this to e-mail if you want. You can get my e-mail address from the About page on this site.
Thursday, June 26, 2008 at 12:35 pm
Russ
Not quite getting this part..
Once I create a second SC connection, on a different vswitch on a different subnet, I have to: 1) Add das.isolationaddress in Advanced Options, and make that the SC gateway. 2) Then add das.isolationaddress2 and - here’s what I don’t get - add what address? The gateway address for that subnet, or the actual address of the second console? Or something else entirely?
Thursday, June 26, 2008 at 1:49 pm
slowe
Russ,
You don’t need to set das.isolationaddress (which defaults to the SC’s default gateway) unless that address does not respond to ping, or unless you’d like to use some other device on the network.
For das.isolationaddress2, set it to be any device that responds to ping, is available via that network adapter, and could be used as a way for the host to determine if it is isolated.
Hope this helps!
Friday, July 11, 2008 at 6:18 am
Erik Bussink
I created a small script on my ESX server to list the HA settings. It might be useful to some.
The file is named ha_list.sh
#!/bin/sh
# Created by Erik Bussink (July 2008)
#
# requires the FT_DIR variable
# add the following line to your /root/.bashrc
# export FT_DIR=/opt/vmware/aam
#
# List current HA members, their Roles and their Status
/opt/vmware/aam/bin/ftcli -domain vmware -connect localhost -port 8042 -timeout 60 -cmd listnodes
echo
# List current HA members settings
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "la -l"
Friday, July 11, 2008 at 6:20 am
Erik Bussink
And a 2nd one that will continuously monitor the status
ha_watch.sh
#!/bin/sh
# Created by Erik Bussink (July 2008)
#
# requires the FT_DIR variable
# add the following line to your /root/.bashrc
# export FT_DIR=/opt/vmware/aam
#
# Re-run listnodes at watch's default interval to monitor node status
watch '/opt/vmware/aam/bin/ftcli -domain vmware -connect localhost -port 8042 -timeout 60 -cmd listnodes'
Friday, July 11, 2008 at 9:58 am
slowe
Erik,
Good stuff–thanks for sharing it here!
Friday, July 25, 2008 at 10:11 am
Erik Bussink
Hiya,
I also created a larger script to help get some additional information from HA.
ha_status.sh
#!/bin/sh
# Erik Bussink (July 2008)
#
# requires the FT_DIR variable
# add the following line to your /root/.bashrc
# export FT_DIR=/opt/vmware/aam
#
# Monitor AAM Health (Command taken from VMworld 2006 - Effective DRA and HA in production by Nitin Suri)
/usr/bin/perl /opt/vmware/aam/ha/aam_config_util.pl -z -cmd=listnodes -domain=vmware
# List current HA members status
echo
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "la -l"
echo
# List status of AAM Agent on each node
echo
echo "-[ esx11 ]---------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx11"
echo "-[ esx12 ]---------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx12"
echo "-[ esx13 ]---------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx13"
echo "-[ esx14 ]---------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx14"
echo "-[ esx15 ]---------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx15"
echo "-[ esx16 ]---------------------------------------------"
/opt/vmware/aam/bin/ftcli -domain vmware -cmd "status esx16"
Sure, I have not really optimized the bash shell lines near the end, but it’s easy and it can be useful.
Erik
Wednesday, August 13, 2008 at 6:28 am
TomK
Great post, Scott - exactly the same behaviour I’m seeing (except the vmkernel log, I will need to test that to see if I’m seeing the same there).
I was wondering if this has been logged as a bug with VMware? Clearly if the working hosts are trying to power on the VM before it has been powered off, the software has almost created a split-brain scenario without any mis-configuration - surely a bug?
I’m going to do some testing with ESX 3.5 Update 2 (assuming the timebomb fix is stable) - will post results.
Cheers,
TK