Random Reboots with VMware ESX 3.5 Update 3

I’ve been communicating with a reader who is experiencing random reboots of virtual machines on his HA/DRS-enabled cluster running VMware ESX 3.5 Update 3. At first, I thought his problem was related to the bug with VM failure monitoring that I discussed here, but further discussion revealed that the random reboots continue to occur even when VM failure monitoring is disabled. The only relief the reader has been able to find thus far has been to completely disable VMware HA on his cluster, which, to be honest, is a less than acceptable solution.

After a little bit of digging around, I turned up this VMware Communities thread, in which several other users also indicate they are seeing the same kinds of problems. The thread closes out by referencing this post by Duncan Epping regarding the VM failure monitoring bug. Clearly, though, this bug should not be affecting users who do not have VM failure monitoring enabled. I also found this blog post about another user having the issue, although it sounds like his problem was solved by disabling VM failure monitoring.

Further research turned up this KB article on a post-Update 3 patch that may address some of the random reboot issues. Judging from the KB article, it looks like the random reboots may be caused by an unexpected interaction between VMotion and an option to automatically upgrade VMware Tools. This is just speculation, of course, but the symptoms seem to fit.

Have any other users out there experienced this problem? If so, what was the fix, if any? It sounds like there may be more to this issue than perhaps I first suspected.


  1. Rick Scherer

    The VMotion/VMware Tools upgrade theory is a real possibility. Remember, when doing a VMotion the virtual machine is actually started on a new host. If VMware Tools is configured to automatically upgrade during boot, it is quite possible that a “hole” left open in U3 could turn this into an unwanted feature.

    VM Running -> Automatically Upgrade VMware Tools Enabled -> VMotion to New Host -> VMware Tools Upgrade Initiated -> Guest O/S Reboots

    This also makes me think about how, in U3 of the Free Edition of ESXi, the programmers left the Read/Write RCLI commands in there. What’s going on over at the VMware QC department?

    This is all speculation. I would be curious to know whether the same VMs keep rebooting, or whether each VM reboots once and then never again. The latter would be more icing on the cake for the VMware Tools theory.
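The question above (same VMs rebooting repeatedly, or each VM only once) can be answered by tallying observed reboots per VM. The following is a minimal sketch, assuming you have already collected the unexpected reboots as `(vm_name, timestamp)` pairs from your own logs or event exports. The function name, the record shape, and the sample VM names are all hypothetical and do not come from any VMware tool.

```python
from collections import Counter


def classify_reboots(events):
    """Split VMs into those that rebooted once and those that rebooted repeatedly.

    `events` is an iterable of (vm_name, timestamp) pairs, one pair per
    observed unexpected reboot. This record shape is an assumption made for
    illustration; adapt it to however you collect the data.
    """
    counts = Counter(vm_name for vm_name, _ in events)
    once = sorted(vm for vm, n in counts.items() if n == 1)
    repeated = sorted(vm for vm, n in counts.items() if n > 1)
    return once, repeated


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    sample = [
        ("vm-a", "2008-12-01T10:00"),
        ("vm-b", "2008-12-01T11:30"),
        ("vm-a", "2008-12-02T09:15"),
    ]
    once, repeated = classify_reboots(sample)
    print("rebooted once:", once)
    print("rebooted repeatedly:", repeated)
```

If most affected VMs land in the "once" list, that would be consistent with the one-time VMware Tools upgrade theory. A long "repeatedly" list would point toward something else.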

  2. Roger L

    If I see this, I’ll let you know.

    Thanks for the heads up.

    Roger L

  3. Iben Rodriguez

    Hi Scott,

    We have a client with this issue too. After updating to U3 some of the VMs were rebooting. We disabled the VM failure monitoring feature and so far so good. Many of our VMs (over 80) need to have their VMware Tools updated since the U3 update of the hosts so that may be contributing to the issue. We are monitoring the situation and do not have a root cause determined at this time.

    Good to know about the patch – we will investigate.

    Thanks!

    I b e n

  4. slowe

    NiTRo, as I mentioned in the post, the reader has already disabled VM failure monitoring, so that can’t be the fix for him. There is definitely something else going on in this situation.

  5. Rick Vanover

    NiTRo: That isn’t much of a fix by the way.

  6. Mike Karolow

    We’ve seen something similar on ESXi 3.5U3. For us, the guests stay up, but the host becomes completely unmanageable. This is apparently caused by the AAM service itself.

    VMware has told us it’s a known bug and that it’s been fixed, but they don’t know when the fix will be released. Right now, they are targeting U4. Needless to say, we’re a bit pissed that they don’t want to release it as a standalone patch, since we’re forced to disable HA.

    Here’s the note from tech support:

    …For your information, we have newly discovered an issue regarding HA on ESXi 3.5U3 and VC 2.5 U3, which causes the ESXi host to be unresponsive due to a large amount of error logs generated by AAM…

    …If you want to confirm that this is the issue that you are facing, next time that you experience the issue, log into the Tech Support mode (or call us right away) and run the following command:

    # stat -f /
    File: "/"
    ID: 0 Namelen: 127 Type: visorfs
    Block size: 4096
    Blocks: Total: 46792 Free: 15838 Available: 15838
    Inodes: Total: 2048 Free: 950

    If the line that indicates the number of free Inodes is reporting 0 free inodes, you have been affected by this issue. The issue does not affect all users and has been fixed, but the patch has not been released yet…
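The check described in the support note can be sketched as a small parser. This is only an illustration, assuming you have captured the text printed by `stat -f /` and that it has the same shape as the sample above. The function names are hypothetical, and the script is meant to run on a workstation against the captured text, not on the ESXi host itself.

```python
import re


def free_inodes(stat_output: str) -> int:
    """Return the free inode count from `stat -f` output shaped like the sample.

    Looks for a line of the form "Inodes: Total: N Free: M" and returns M.
    """
    match = re.search(r"Inodes:\s*Total:\s*(\d+)\s+Free:\s*(\d+)", stat_output)
    if match is None:
        raise ValueError("no 'Inodes: Total: N Free: M' line found")
    return int(match.group(2))


def inodes_exhausted(stat_output: str) -> bool:
    """True when the output reports 0 free inodes, the symptom in the note."""
    return free_inodes(stat_output) == 0


# The sample output quoted in the comment above.
SAMPLE = """\
File: "/"
ID: 0 Namelen: 127 Type: visorfs
Block size: 4096
Blocks: Total: 46792 Free: 15838 Available: 15838
Inodes: Total: 2048 Free: 950
"""

if __name__ == "__main__":
    print("free inodes:", free_inodes(SAMPLE))
    print("affected:", inodes_exhausted(SAMPLE))
```

For the sample shown, 950 inodes are still free, so that host would not be showing the symptom at the moment the command was run.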

  7. kyle

    Why is it that VMware has so many issues with HA and DRS? They sure do want to charge for it, and ever since we got on board with v3.0 we have seen many, many HA/DRS issues.

  8. slowe

    Kyle, I’ll admit that VMware has had (in my opinion) too many problems with VMware HA, although I haven’t seen many problems with VMware DRS (they are different features, after all, even though they are often used together). Keep in mind, though, that VMware HA was first introduced in VMware ESX 3.0, and I believe that many of the “problems” that were seen with VMware HA were really a by-product of VMware failing to properly document the technology rather than a defect in the technology itself (e.g., failing to educate users about the need for secondary Service Console connections, that sort of thing).

  9. Dipak

    I am seeing this on ESXi 5.0, builds 1254542 and 1311175, and I have the “Check and Upgrade tools during power cycling” option enabled. I have 3 ESXi hosts on build 125xxx and the other 4 on 131xxx. When a VM moves from the lower build to the higher build, it reboots within 3 to 4 minutes of landing on the new host. Is there any fix for this apart from un-checking the VM option to upgrade the tools?
