Storage Array Snapshots with VMware

A new article of mine has been published on SearchVMware.com! This article discusses the use of storage array snapshots with VMware, specifically focusing on ensuring that storage array snapshots are consistent and usable:

When used in conjunction with a VMware infrastructure, storage array-based snapshots are touted for their ability to create point-in-time pictures of virtual machines (VMs) for business continuity, disaster recovery and backups. While this can be true, it’s important to understand how virtualization affects storage array snapshot use. Incorrect usage can render storage array snapshots unreliable and generally defunct.

The article provides a few guidelines on making sure that storage array snapshots are usable. Keep in mind, too, that some storage array vendors have applications that are specifically designed to help with this particular issue. NetApp, for example, has SnapManager for Virtual Infrastructure; this product is specifically designed to address this problem (among other problems). I would imagine that other vendors also offer a software solution to this problem, but I’m not particularly familiar with those. I’d love to hear from readers as to their experience or knowledge with any such software solutions.


  1. Wade Holmes

    This post is very timely for me, as I have just had discussions on this topic with both VMware and NetApp. What is the take on using VMware’s method to quiesce the filesystem? I have seen some issues with the filesystem flush and database-backed apps such as AD, Exchange, SQL, etc.

  2. KP

    I agree that cold or warm snapshots probably give the best guarantee on consistent snapshots.

    As to the difference between hot snapshots and VMware+array snapshots, in practice we’ve seen little difference in recoverability. If anything, virtual machines that have associated VM snapshots cause extra problems if you FlexClone them and import the cloned VM.

    We perform a hot (NetApp) snapshot on all our datastores every hour. We used to “VM snapshot” the VMs before taking the NetApp snapshot, but this impacted performance, and from time to time even locked up a VM. Now we only take NetApp snapshots.

  3. slowe

    Wade,

    First, see KP’s very informative comment; in his experience, adding VMware snapshots to the mix doesn’t seem to help much. I will agree that not all workloads are good candidates for the use of VMware snapshots; this also precludes the use of most hot-backup technologies including VCB.

    Second, I’d recommend application-aware utilities such as those offered by NetApp (SnapManager for Exchange, SnapManager for SQL, etc.); they can provide the guaranteed consistency and data integrity that you need for these types of applications, and their use is, as I understand it, fully supported within virtual machines.

    Thanks for reading!

    KP,

    Thanks for your very helpful comment. Your experience in using VMware snapshots along with NetApp snapshots is very relevant. Thanks for reading!

  4. Gary Kuyat

    Other than vendor recommendations, does anyone have any direct evidence that VMFS3 filesystems snapshotted are any worse than a power failure?

    VMFS2 had significant issues with crash consistency, but in my experience NTFS on VMFS3 survives fairly well when you pull the plug (or recover the snapshot).

    I am not recommending this for super-critical financial applications, but taking three snapshots a day in this style has very rarely failed.

  5. Customer X Factor

    Does anybody know if there is a “SnapManager for Virtual Infrastructure” from Dell EqualLogic coming out?

  6. slowe

    I haven’t heard anything along those lines.

  7. Rich

    Scott,

    Great post(s)!

    I added my “2 cents” on my blog. It ended up being too much to post in a comment here.

    http://vmetc.com/2008/06/07/avoid-hot-vmware-snapshots-when-using-storage-array-snapshots/#more-414

    -Rich

  8. Nick Triantos

    The combination of VMsnaps+NetApp snaps has been heavily debated within NetApp the past few months.

    Some of us believe we should provide, within SMVI, the option to disable VMsnaps and, when a NetApp snap is initiated with that option set, pop up a warning window cautioning the user about the potential ramifications of this practice. Bottom line: let the user decide the value of their data and make the appropriate decision.

    Others believe that even if there’s a 1% chance of FS corruption, we ought to do whatever’s possible to avoid putting customers in this situation, which means utilizing VMsnaps as part of the process.

    For those who are not familiar with the process SMVI follows, here’s how it works:

    1) Initiate NetApp snapshot request
    2) SMVI calls VC over HTTPS (port 443). VC calls the lgtosync.sys driver inside the guest to quiesce and flush FS buffers
    3) VMsnapshot
    4) NetApp snapshot
    5) After NetApp snapshot completion, remove VMsnapshot
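
    The ordering of the steps above can be sketched in a few lines of Python. This is only an illustration of the sequence (VM snapshot, then array snapshot, then VM snapshot removal), not SMVI's actual implementation. `FakeVM`, `FakeArray` and `consistent_array_snapshot` are hypothetical stand-ins and do not correspond to any real VMware or NetApp API.

```python
from contextlib import contextmanager


class FakeVM:
    """Hypothetical stand-in for a VM handle; not a real VMware API."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def create_snapshot(self, quiesce):
        # A real tool would ask the guest to quiesce and flush buffers here.
        self.log.append(f"vm_snapshot:{self.name}:quiesce={quiesce}")

    def remove_snapshot(self):
        self.log.append(f"vm_snapshot_removed:{self.name}")


class FakeArray:
    """Hypothetical stand-in for a storage array; not a real NetApp API."""

    def __init__(self, log):
        self.log = log

    def snapshot(self, volume):
        self.log.append(f"array_snapshot:{volume}")


@contextmanager
def vm_snapshots(vms, quiesce=True):
    """Hold VM snapshots for the duration of the block, then remove them."""
    taken = []
    try:
        for vm in vms:
            vm.create_snapshot(quiesce=quiesce)
            taken.append(vm)
        yield
    finally:
        # Always clean up, even if the array snapshot fails, so that
        # redo logs do not keep growing.
        for vm in reversed(taken):
            vm.remove_snapshot()


def consistent_array_snapshot(vms, array, volume, quiesce=True):
    """Take the array snapshot while every VM has a snapshot in place."""
    with vm_snapshots(vms, quiesce=quiesce):
        array.snapshot(volume)
```

    The `finally` block is the design point worth noting: the VM snapshots are removed whether or not the array snapshot succeeds, which keeps the window in which redo logs accumulate as short as possible.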

    The whole process takes somewhere between 5 and 6 seconds. Because it’s so quick, barely any I/Os accumulate in the redo log, so deleting the VMsnap is a rather quick process.

    Mind you, there’s also SRM integration with SMVI. The NetApp SRA (Storage Replication Adapter) will leverage SMVI snapshots and create FlexClones from them for site recovery testing while replication remains intact and ongoing. Furthermore, the above can occur with FC, iSCSI or a combination of both (e.g., FC at the production site and iSCSI at the recovery site).

    Cheers

  9. Chad Sakac

    Great post as always Scott. EMC offers a tool that does this (along with application-integrated replication for Exchange 2003/2007, SQL Server 2005/2008, Oracle 10g/11g).

    It’s called Replication Manager. It has supported VMware for a long time (for customers running apps in VMs), but recently added VMFS integration (for filesystem-flushed, datastore-level point-in-time copies and instant restore of datastores or VMs).

    For NetApp folks – it’s equivalent to the SnapManager family (SME, SMS, SM for SharePoint, SMO), but is a single tool.

    Dell/EqualLogic will create a similar tool at some point, as they recently started creating tools that integrate with Exchange (VSS) and SQL Server (VSS/VDI). Only NetApp and EMC have been doing this for a long time.

    I’ll be doing a post on this on my blog in a bit with screenshots and a demo, as it seems to be a hot topic.

  10. slowe

    Chad,

    Thanks for the feedback and the information on EMC’s product offerings in this area. I appreciate it!

  11. Jeffrey Wall

    Chad,
    I am being told by a few outside SEs that Replication Manager really works best on RDMs and not VMFS datastores. Currently, all my storage is in VMFS format and I don’t relish the idea of converting it over to RDM. Moreover, I cannot find much evidence to back that statement up. Since the entire I/O debate between RDMs and VMFS seems to have finally died, can you help me understand why they would assume RM works so much better with RDM volumes? Are we back to application consistency versus crash consistency? Thank you. ~ Jeffrey

  12. Chad Sakac

    Jeffrey – sorry for the long delay, I don’t get a ping when a comment is posted.

    What you’ve heard is a bit of old news (it takes time for information to get out there). Replication Manager works equally well with VMFS, RDMs and (very shortly) NFS datastores.

    The debate on RDMs is over – they should be used very sparingly (MSCS/WSFC use cases, Oracle databases where you want to move from V to P easily to repro things, when you can’t ensure that the VMware admin won’t oversubscribe the VMFS volume on a mission-critical workload, and a few other cases – but these are rare, not the primary design point).

    With “datastore” use cases – it operates very much like SMVI (and the now-shipping Dell/EqualLogic tool) – it uses the VI SDK to figure out what VMs are in what datastores, and the array APIs to correlate the datastore to the storage object. It then uses ESX snapshots to get a guaranteed point-in-time restartable image, then takes an array snapshot, then deletes all the ESX-layer snapshots.

    Not sure if this can be done in the other tools, but in EMC Replication Manager, you can also select NOT to take the ESX-level snapshots.

    This is important – as you get loads of VMs in a datastore, the process of taking the ESX snapshots (which are serialized per datastore) can get long. If you skip them, the result is the equivalent of a “crash consistent” copy – as someone pointed out – very reliable – **but not 100% reliable**.

    Lastly – it can do app-level integration for apps in VMs, on VMFS datastores, but there are some important caveats, which are covered in the docs (if you have multiple SQL Server databases in one datastore – you can only do an app-integrated restore for one of them, for example).
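
    To make the datastore workflow described above concrete, here is a rough Python sketch of how a tool might turn inventory data into an ordered plan. It is a toy model, not how Replication Manager is actually implemented. The function name, the two input dictionaries and the `use_esx_snapshots` flag are all assumptions made for illustration; a real tool would populate the mappings from the VI SDK and the array APIs.

```python
from collections import defaultdict


def build_snapshot_plan(vm_to_datastore, datastore_to_volume,
                        use_esx_snapshots=True):
    """Return an ordered list of steps per datastore.

    vm_to_datastore:     {vm_name: datastore_name}  (hypothetical inventory)
    datastore_to_volume: {datastore_name: array_volume}  (hypothetical mapping)
    """
    # Group VMs by the datastore they live on.
    vms_by_datastore = defaultdict(list)
    for vm, datastore in sorted(vm_to_datastore.items()):
        vms_by_datastore[datastore].append(vm)

    plan = []
    for datastore in sorted(vms_by_datastore):
        vms = vms_by_datastore[datastore]
        volume = datastore_to_volume[datastore]
        steps = []
        if use_esx_snapshots:
            # Serialized per datastore: one ESX snapshot after another.
            steps += [("esx_snapshot", vm) for vm in vms]
        steps.append(("array_snapshot", volume))
        if use_esx_snapshots:
            steps += [("esx_snapshot_delete", vm) for vm in vms]
        plan.append({
            "datastore": datastore,
            "consistency": ("esx-snapshot" if use_esx_snapshots
                            else "crash-consistent"),
            "steps": steps,
        })
    return plan
```

    With `use_esx_snapshots=False` the plan for each datastore collapses to a single array snapshot step, which is the faster, crash-consistent variant.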

    Thanks!