Recovering Data Inside VMs Using NetApp Snapshots

Network Appliance Snapshots—point-in-time copies of a file system that can be created almost instantaneously and which generally require much smaller amounts of storage to keep—are an integral part of NetApp’s value over other storage systems.  These snapshots make it far easier and quicker to recover from data loss or corruption than a tape backup system.

But how do we go about recovering individual files from a snapshot when those files are stored in a virtual disk (VMDK) file used by a VM?  After all, VMware proponents tout the encapsulation property of virtualization as a benefit: “One file to back up and you get a backup of your entire server!”  Fortunately, there’s a way to continue to reap the benefits of encapsulation while still allowing for the ability to recover individual files from a snapshot of the VM’s virtual disk file.  Here’s how.

The trick here is to take advantage of LUN cloning, a feature on the NetApp storage systems that allows you to take a snapshot—which is a read-only point-in-time copy of the file system—and create a clone, which is a read-write point-in-time copy of the file system.  This clone takes only seconds to create, like the snapshot on which it is based, and requires only enough storage to store the changed blocks, i.e., the “deltas” between the clone and the original.  We can then present that clone back to VMware ESX Server to manipulate in whatever way we see fit.
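
To make the “only the deltas” idea concrete, here is a toy Python model of a read-only snapshot and a read-write clone layered on top of it. It is only a sketch of the block-sharing concept. The class and method names are invented for illustration and say nothing about how Data ONTAP implements this internally.

```python
# Toy model of snapshot + clone block sharing (illustrative only; this is
# NOT how Data ONTAP is implemented internally, and all names are made up).

class Snapshot:
    """Read-only, point-in-time view of a set of blocks."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)  # private copy of the block map

    def read(self, block_no):
        return self._blocks[block_no]


class Clone:
    """Read-write view backed by a snapshot; stores only changed blocks."""

    def __init__(self, snapshot):
        self._base = snapshot
        self._deltas = {}  # block number -> new contents

    def read(self, block_no):
        # Prefer the clone's own copy; fall back to the shared snapshot block.
        if block_no in self._deltas:
            return self._deltas[block_no]
        return self._base.read(block_no)

    def write(self, block_no, data):
        # Writes never touch the snapshot; they land in the delta map.
        self._deltas[block_no] = data

    def delta_count(self):
        return len(self._deltas)


snap = Snapshot({0: b"boot", 1: b"data", 2: b"logs"})
clone = Clone(snap)
clone.write(1, b"DATA")
```

After the single write, the clone holds one block of its own and still reads the other two from the snapshot, which is why the clone needs so little extra space.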

There are three parts to this process.  First, we configure ESX Server to recognize snapshot LUNs on the SAN (this is a one-time configuration change).  Then, we take the snapshot on the NetApp storage system, create a LUN clone from the snapshot, and present that LUN clone back to the ESX servers.  Finally, we manipulate the LUN clone within ESX in order to retrieve the specific data we need.

Enable Resignaturing on ESX Server

In the ESX SAN Configuration Guide (available on VMware’s site), there is this blurb about resignaturing:

VMFS volume resignaturing allows you to make a hardware snapshot of a VMFS volume and access that snapshot from an ESX Server system.

This is the functionality that allows us to use LUN clones on the NetApp storage system in ESX Server.  Without this functionality, the LUN clones aren’t properly recognized by ESX Server and can’t be utilized to allow us to perform data recovery.

To enable VMFS volume resignaturing, set the LVM.EnableResignature option to 1 (on).  This option can be set in VirtualCenter using these steps:

  1. Select the ESX Server host for which you want to enable VMFS volume resignaturing.
  2. Go to the Configuration tab for that host, then select Advanced Settings.
  3. Change LVM.EnableResignature to 1 (on).  The default is 0 (off).

After this option is set, you’ll be able to present LUN clones (or other hardware snapshots) to ESX Server and it will recognize them as such.
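
If you would rather script the change than click through VirtualCenter, the sketch below builds (but does not run) the equivalent service console commands. It assumes the host provides the esxcfg-advcfg utility and exposes this setting at the path /LVM/EnableResignature. Neither detail comes from the steps above, so verify both on your own ESX build first. The Python function names are hypothetical.

```python
# Builds (does not execute) service-console commands for the resignaturing
# option. ASSUMPTIONS: the esxcfg-advcfg utility exists on the host and the
# option lives at /LVM/EnableResignature. Function names are hypothetical.

OPTION_PATH = "/LVM/EnableResignature"


def resignature_set_cmd(enabled):
    """Return the argv list that would turn resignaturing on or off."""
    value = "1" if enabled else "0"
    return ["esxcfg-advcfg", "-s", value, OPTION_PATH]


def resignature_get_cmd():
    """Return the argv list that would query the current value."""
    return ["esxcfg-advcfg", "-g", OPTION_PATH]


print(" ".join(resignature_set_cmd(True)))
```

Returning an argv list rather than running anything keeps the sketch safe to try on a workstation. The list can be handed to whatever remote-execution mechanism you already use.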

Now we’re ready to move to the NetApp storage system.

Taking a Snapshot and Making a LUN Clone

By default, snapshots are already enabled and scheduled, so unless you’ve modified the configuration, the NetApp storage system is already taking snapshots of the volumes that hold the LUNs where the VMware VMFS partitions (and thus the VMDK virtual disk files) are stored.

We can view the list of snapshots like this:

filer> snap list vol_name
Volume vol_name
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 30 08:00  hourly.0
  1% ( 0%)    0% ( 0%)  Dec 30 00:00  nightly.0
  1% ( 0%)    0% ( 0%)  Dec 29 20:00  hourly.1
  1% ( 0%)    0% ( 0%)  Dec 29 16:00  hourly.2
  1% ( 0%)    0% ( 0%)  Dec 29 12:00  hourly.3
  2% ( 1%)    0% ( 0%)  Dec 29 08:00  hourly.4
  2% ( 0%)    0% ( 0%)  Dec 29 00:00  nightly.1
  3% ( 0%)    0% ( 0%)  Dec 28 20:00  hourly.5
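
If you want to pick a snapshot from a script, a listing like the one above can be parsed with a few lines of Python. This is a minimal sketch that assumes the column layout shown here and that the listing is ordered newest first, as it is in this sample. The function names are hypothetical, and other Data ONTAP releases may format the output differently.

```python
import re

# A shortened copy of the listing above, used as sample input.
SAMPLE = """\
  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 30 08:00  hourly.0
  1% ( 0%)    0% ( 0%)  Dec 30 00:00  nightly.0
  2% ( 0%)    0% ( 0%)  Dec 29 00:00  nightly.1
"""

# One data row: two "N% ( N%)" columns, a date, then the snapshot name.
ROW = re.compile(
    r"^\s*\d+%\s*\(\s*\d+%\)\s+\d+%\s*\(\s*\d+%\)\s+"
    r"(?P<date>\w{3}\s+\d{1,2}\s+\d{2}:\d{2})\s+(?P<name>\S+)\s*$"
)


def parse_snap_list(text):
    """Return (name, date) tuples in the order they appear in the listing."""
    snapshots = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header, separator, and status lines simply do not match
            snapshots.append((match.group("name"), match.group("date")))
    return snapshots


def first_with_prefix(snapshots, prefix):
    """Return the first snapshot name starting with prefix, or None."""
    for name, _date in snapshots:
        if name.startswith(prefix):
            return name
    return None


snaps = parse_snap_list(SAMPLE)
```

With the newest-first ordering assumed above, first_with_prefix(snaps, "nightly") yields the most recent nightly snapshot.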

Now, we can make a LUN clone from one of these snapshots and map it to an igroup:

filer> lun clone create /vol/vol_name/lun0_clone -b /vol/vol_name/lun0_vmfs nightly.1
filer> lun map /vol/vol_name/lun0_clone igroup_name 0
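
Since the same two commands get typed for every recovery, it can help to generate them from parameters. The Python sketch below only builds the command strings in the form used above and does not connect to the filer. The function name and its parameters are hypothetical.

```python
def lun_clone_commands(volume, parent_lun, clone_lun, snapshot, igroup, lun_id):
    """Build the 'lun clone create' and 'lun map' commands as strings.

    Hypothetical helper: it assumes the parent LUN and the clone live in the
    same volume, as in the example above.
    """
    parent_path = f"/vol/{volume}/{parent_lun}"
    clone_path = f"/vol/{volume}/{clone_lun}"
    return [
        # Clone the parent LUN as it existed in the named snapshot.
        f"lun clone create {clone_path} -b {parent_path} {snapshot}",
        # Present the clone to the igroup with the given LUN ID.
        f"lun map {clone_path} {igroup} {lun_id}",
    ]


commands = lun_clone_commands(
    "vol_name", "lun0_vmfs", "lun0_clone", "nightly.1", "igroup_name", 0
)
for command in commands:
    print(command)
```

Run as written, this prints the same two commands shown above, ready to paste at the filer prompt.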

The LUN clone has now been created and presented back to the igroup named igroup_name as LUN ID 0.  A rescan of the storage adapters in ESX Server (iSCSI was being used in this case) will now show the LUN clone as “snap-00000001-lun0_vmfs” (the number will change depending upon how many snapshot LUNs have been presented to the server farm).  Now that we have access to the VMFS, we can do any number of things:

  • We can create a new VM with the same configuration as the original VM and boot it up to recover data from the VM in that manner (be cautious of networking issues, such as duplicate IP addresses).  You’ll just need to select the existing VMDK (or VMDKs, if there is more than one) on the snapshot VMFS LUN instead of creating a new virtual disk file when creating the VM.
  • We can attach the VMDK(s) to an existing VM running the same operating system (or an operating system that will read the file system used inside the VMDKs in question) and then browse the file system to retrieve data stored inside the VM.
  • We could shut down the original VM and boot up the clone VM instead.  This might be helpful if you needed to recover data but also needed network connectivity, or if the two VMs couldn’t be running at the same time.  (In theory, this might work for Microsoft Exchange, if you aren’t using SnapManager for Exchange.)

As you can see, this allows us to take full advantage of encapsulating the server in the VMDK file(s) but also allows us to retrieve individual files or groups of files from a snapshot of the VMDK file(s).
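
When several clones have been presented over time, it is handy to tell the resignatured datastores apart from the originals by name. The Python sketch below splits a label such as “snap-00000001-lun0_vmfs” into its sequence number and original label. The pattern (the word snap, eight hexadecimal digits, then the original label) is an assumption drawn only from the example name in this article, and the function name is hypothetical.

```python
import re

# ASSUMPTION: resignatured labels look like snap-<8 hex digits>-<old label>,
# inferred from the example name in the article rather than from a spec.
SNAP_LABEL = re.compile(r"^snap-(?P<seq>[0-9a-fA-F]{8})-(?P<label>.+)$")


def split_snapshot_label(name):
    """Return (sequence, original_label), or None if name does not match."""
    match = SNAP_LABEL.match(name)
    if match is None:
        return None
    return match.group("seq"), match.group("label")


result = split_snapshot_label("snap-00000001-lun0_vmfs")
```

A datastore whose name does not match the pattern returns None, so the same helper can be used to filter a list of datastore names down to the snapshot copies.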

In future articles, I’ll touch on restoring entire VMs using NetApp snapshots, as well as talk about getting consistent snapshots of the VMs.

Other Information

This process was performed on a Network Appliance FAS810 running Data ONTAP 7.1.1.1 and servers running VMware ESX Server (both 3.0.0 and 3.0.1) with the software iSCSI initiator.  VMs running Windows Server 2003 R2 were used for testing.


Hi Scott,

I am a little curious about the snapshots. Did you use any scripts to take consistent snapshots for VMFS 3? I guess the snapshots that you used are regular scheduled snapshots, and if that works for VMware 3, it is great.

Regards,

Cem

Cem,

I use a small script to help ensure a consistent snapshot. Until NetApp releases SnapManager for VMware, a script such as this is definitely beneficial. I plan to post another article with more information about the script and making sure your snapshots are consistent.

Scott

I’ve used this quite a few times with FC and iSCSI and haven’t had an issue. The warning I always give is that if it’s a heavily used SQL server, then you might run into problems, i.e., the NetApp snapshot occurs in the middle of a transaction. But in this example, you’re using this for an individual file restore, not an entire machine restore. Snap restore also works well if you want to revert the entire machine.

Chris,

Certainly, ensuring the integrity/consistency/validity of a snapshot is a critical task, and this technique isn’t the best to use in all instances. But as you point out, it comes in mighty handy at times!

When I present the cloned LUN back to the igroup and then rescan, I don’t see the LUN show up as snap-00000001-lun0_vmfs; it just shows up as a vmhba device. Actually, when trying to attach the LUN, it says the disk is blank.

Hello Bob! Good to hear from you.

I would assume that you have already modified the ESX hosts to allow the cloned LUNs to be seen (enabled resignaturing). When you rescan, be sure to rescan for new storage devices as well as new VMFS datastores. If that doesn’t work, post back or e-mail me here (my e-mail is scott dot lowe at scottlowe dot org) or at work (you probably already have that address).

Good luck!

Scott,

Do you have any information on whether NetApp is working on or planning to release a “SnapManager for VMware”? That would be fantastic if they did. It would greatly simplify things and take the scripting out of the picture to ensure consistency.

Thanks!

Shane,

There are conflicting stories about NetApp’s work on “SnapManager for VMware”. Some people say that there is one under development, but others say that there isn’t. I honestly can’t tell you one way or another, but I agree that it would be extremely helpful and would definitely strengthen their value proposition as a storage vendor for VMware deployments.

SnapManager for VMware is due for release in Q1 2008 due to massive customer demand.

Hi
In the above example for making a LUN clone, is a FlexClone license required?

No license is required for LUN cloning. For FlexVol cloning, you need the FlexClone license.

Hope this helps, slowe. I’ve created this note for myself and others to recover snaps should I not be available at my office. Pretty simple.

ESX Snapshot Recovery

Snapshot recovery on the local side, using snapshots on the local filer that are taken every 6 hours by a script integrating NetApp snapshots with quiesced VMware snapshots (without memory).

1. Identify Snapshot to be recovered. In most cases this will be “vmware_recent”
2. Identify a target volume on the NetApp to accommodate the new LUN that will be created based on the snapshot. If necessary, expand the volume to accommodate the new LUN.
3. Command Line on filer
a. lun clone create /vol/ESX7/ESX7/esx7clone.lun -b /vol/ESX7/ESX7/esx7.lun vmware_recent
b. LUNs – Manage –
i. Verify that LUN was successfully created and exists.
ii. Go to the new LUN and click on “No Maps”
iii. Add the ESX hosts to the initiator group by selecting “Add Group to Map”
iv. Add all ESX host(s) in the cluster to the initiator group.

NOTE: If you do not add all hosts to the initiator group, then when you register the VM, you will experience a failure to register because not all hosts can see the same LUN.

4. From VirtualCenter Client locate the ESX Recovery host
a. Under “Configuration” – “Storage Adapters”, find “iSCSI Software Adapter”, right-click “vmhba40”, and select Rescan to recognize the newly presented cloned LUN.
b. Find the problematic (failed) Virtual Machine, right-click it, and select “Remove from Inventory”.
c. Browse the datastore of the new Snapshot Clone LUN, which should be labeled (snap-00000002-Store7), and find the configuration file (vmname.vmx). Right-click it and select “Add to Inventory”.
NOTE: Do NOT delete from Disk as a precaution if additional information is needed from the failed Virtual Machine
d. Select the Virtual Machine and from “Snapshot Manager” select “Delete All” to revert the “Quiesce” Snapshot.
e. Power on the Virtual Machine
f. Answer the UUID question by selecting “Keep”. Note: This step may be eliminated on the local side. It will definitely be eliminated on the DR side by adding a parameter to the configuration file. See Ionic Capital documentation.
g. Verify network and virtual machine functionality.
5. When all functionality is verified (and during a scheduled maintenance window), you can do a “cold migration” of the recovered VM back to the original LUN, which in this case is “Store7”, and power on the virtual machine.

Hi.

I am using an Iomega disk for Virtual PC. There was an error (Windows file corrupted) and I can’t launch applications on Virtual PC. Does anyone know how I can get data from that disk?