We ran into an interesting set of problems at work this week, all of them related to running virtual machines on VMware ESX over NetApp NFS. While we haven’t yet determined the root cause for all of the problems, we did uncover the root cause for one of them, and additional testing seems to indicate that one of the other problems may stem from the same condition.
The problem seems to lie within some (now outdated) recommendations by NetApp for using NFS with VMware ESX. As late as June of this year, NetApp was recommending (via TR-3428, “NetApp and VMware Virtual Infrastructure 3 Storage Best Practices”) that users disable NFS locking by setting NFS.Lock.Disable to 1. Version 4.0 of TR-3428 was the version published in June of this year, and it still contained this recommendation.
As I understand it, this recommendation stemmed from a bug in VMware ESX that caused long delays when removing VMware snapshots. As a workaround, NetApp recommended setting NFS.Lock.Disable. Supposedly, VMware has fixed that bug with patch ESX350-200808401-BG, and the current version of TR-3428 (version 4.2, published in September 2008 and available from NetApp’s web site) no longer contains the recommendation to set NFS.Lock.Disable.
Unfortunately, customers who followed (then current) best practices may have set this value, and in so doing may have exposed themselves to a “split brain” scenario in which a VM is simultaneously registered and running on multiple VMware ESX hosts. Users may be particularly at risk if they also run VMware HA and have not protected themselves against isolation response (these articles are tagged ‘VMwareHA’ and will provide more information on isolation response). One of our large customers has been affected by this problem, and another large customer is having problems that we believe are related to the same issue.
There are a couple of important things to note:
- This is not a NetApp problem, but rather the result of a recommendation from NetApp. At that time, it was believed that this setting was necessary.
- In most cases VMs affected by this split-brain scenario will get corrupted and must be rebuilt or restored from backup.
We’re not the only ones who have seen this behavior, either:
- VMWare and NFS on NetApp Filers
- NFS Datastores and what was their BIG issue…
The key takeaways from this are the following:
- If you haven’t yet applied ESX350-200808401-BG, you should apply it. Don’t wait. Now.
- If you did follow NetApp’s recommendations and set NFS.Lock.Disable, then review the instructions below (credit to Rick Scherer and several others, thank you!) to remedy the situation.
The instructions you should follow to restore NFS locking are here (these instructions include applying the patch):
- Download patch ESX350-200808401-BG
- Identify an ESX server against which you will apply the patch
- Use VMotion and migrate the running VMs to other VMware ESX nodes
- Install this patch on the selected VMware ESX server
- In the Advanced Configuration settings ensure that NFS file locking is enabled with NFS.Lock.Disable=0 (you can also use the esxcfg-advcfg command to set this value)
- Edit the /etc/vmware/config file and add this line:
prefvmx.ConsolidateDeleteNFSLocks = "TRUE"
- Save the changes to the file and reboot the VMware ESX host
- Use VMotion to move the VMs back to the patched host, which now has NFS locking and the prefvmx setting enabled
- Repeat the steps above on each host in the cluster
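The config file edit in the steps above can be scripted so that it is safe to run more than once. What follows is only a minimal sketch: the function name `ensure_setting` and the `CONFIG_FILE` variable are made up for illustration and are not part of any VMware tooling. It only checks whether the key is present, not what value it holds. The `esxcfg-advcfg` form shown in the comment is the commonly cited one for this setting, but treat it as an assumption and verify it on your own build before relying on it.

```shell
#!/bin/sh
# Sketch: idempotently add the prefvmx line to a VMware config file.
# CONFIG_FILE and ensure_setting are illustrative names (assumptions).
#
# NFS locking itself is re-enabled separately. The commonly cited form
# (an assumption, verify on your host) is:
#   esxcfg-advcfg -s 0 /NFS/LockDisable

CONFIG_FILE="${CONFIG_FILE:-/etc/vmware/config}"
SETTING='prefvmx.ConsolidateDeleteNFSLocks = "TRUE"'

ensure_setting() {
    # $1 = path to the config file
    cp -p "$1" "$1.bak" || return 1    # keep a backup before editing
    if grep -q '^prefvmx\.ConsolidateDeleteNFSLocks' "$1"; then
        echo "setting already present in $1"
    else
        printf '%s\n' "$SETTING" >> "$1"
        echo "setting added to $1"
    fi
}

# Only touch the file if it actually exists on this machine.
if [ -f "$CONFIG_FILE" ]; then
    ensure_setting "$CONFIG_FILE"
fi
```

The host still needs the reboot described in the steps above for the change to take effect.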
So, again, if you haven’t applied ESX350-200808401-BG, please do so as quickly as possible, and then ensure that NFS locking is enabled. If you’ve previously disabled NFS locking, then follow the steps above to restore NFS locking and protect yourself against this possible split-brain scenario.
-
Thank you for the mention Scott, definitely did not expect that. One thing I wanted to discuss is how VMware HA functions, from my point of view.
From what I can see, when you configure VMware HA the first (4) nodes configured are marked as primary, and every host after the fourth is considered a backup node. In the event of an HA fail-over, the primary nodes will all attempt to start the VMs that were running on the failed node. It appears they rely on the VM locking to determine whether the VM is actually down or not. So what this means is that, regardless of Isolation Response, the VM can actually be powered on multiple times. In fact, the couple of times this has happened to me I had the same running VM on up to (3) hosts at once.
You can also see some strange behaviors in VirtualCenter, such as the number of Virtual Machines registered in each host will jump up and down within seconds. I would look at the summary of one of my hosts and see the Virtual Machine count go from 20 to 35 to 28 to 40 and so on.
The only true way to clear this up is from the service console. Run `vmware-cmd -l` on each host in the cluster to list its registered VMs, writing the output to one file per server (e.g., vmware-cmd -l > server1). Once you have all the files, sort them together (e.g., sort server1 server2 server3 server4 > newfile). The new file lists all of your registered VMs in sorted order, so you can see which VMX files appear more than once. Then, on each host, run `ps aux | grep VMX-FILE` to find the PID and use kill -9 PID to kill it.
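The duplicate-spotting step can be narrowed so that only the VMX paths registered on more than one host are printed. This is a rough sketch, assuming each input file holds one host's `vmware-cmd -l` output with one .vmx path per line; the function name `find_duplicate_vms` is made up for illustration.

```shell
#!/bin/sh
# Sketch: print .vmx paths that appear in more than one host listing.
# Each argument is a file containing one host's `vmware-cmd -l` output.
# find_duplicate_vms is an illustrative name (assumption).

find_duplicate_vms() {
    # De-duplicate within each host's file first, so a path listed twice
    # on the same host is not mistaken for a cross-host duplicate.
    for listing in "$@"; do
        sort -u "$listing"
    done | sort | uniq -d    # uniq -d keeps only lines seen more than once
}

# Run only when listing files are passed on the command line.
if [ "$#" -gt 0 ]; then
    find_duplicate_vms "$@"
fi
```

For example, `find_duplicate_vms server1 server2 server3 server4` would print just the VMX files that need to be tracked down with `ps aux` on each host.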
I’ll work on writing a how-to page on this sometime this weekend, but I thought you should know.
Thanks again & Good Luck on your customers sites. Hopefully they had NetApp snapshots!
-
Hi Scott,
We have experienced this issue with our customers, and are currently contacting all of those that use NetApp NFS with VMware to make them aware of the update. This is an even bigger problem since ESX 3.5 Update 2, as the default HA response is to leave VMs powered on. We have also had a customer that has experienced an issue due to this combination of factors. This is critical for NetApp/VMware customers.
-
Patch validation: how can you tell if you have a patch or update installed, besides proper change control? I know I can use Update Manager and do a baseline for an update/patch, but is there a faster way?
-
Hello.
Luckily we didn’t remove the locks on our installation, and we installed the patch as soon as it was released. Actually, I requested the patch from VMware BEFORE it was released, and their support was confused because they couldn’t find it. But what’s not clear to me from your text is whether we should apply the “prefvmx.ConsolidateDeleteNFSLocks” setting or not in my scenario.
Would the change further help us when working with snapshots? As it is today, I do notice that the deletion of snapshots sometimes tends to time out, but the snapshot is gone after a while.
-
prefvmx.ConsolidateDeleteNFSLocks makes the deletion of a snapshot a 2 ping loss rather than a ten minute ping loss
-
Just got bit by this and still recovering. We run ESXi 3.5 and I’m having a hard time finding the ESXi equivalent of “ESX350-200808401-BG”. Anyone know what it is or if it exists?
-
I have several 3i hosts as well, but they’re all Update 3. Anybody know if ESX350-200808401-BG is included in Update 3? We are running into exactly this problem: when using SMVI on an NFS datastore, the snapshots time out and the VM becomes unavailable for a longer time.
-
Ok I’ve verified, I believe once you have Update 3 the patch is no longer needed.



