NFS


As sometimes happens, I’ve been collecting a variety of virtualization-related links that, while interesting, don’t necessarily warrant a full blog posting by themselves.  Since I don’t want to just copy someone else’s content and not provide any value of my own, I’ve decided to start irregularly publishing a “Virtualization Short Take”.  Each of these will just be a few links with a brief commentary or thought attached.  Perhaps these will lead to broader and more active discussions around some of these topics.

Without further delay, then, here’s the first-ever Virtualization Short Take:

  • Via Duncan, a new and undocumented feature in VCB’s config.js file that allows for non-quiesced snapshots of virtual machines.  I’d be interested in more discussion around why some workloads might be better served by non-quiesced snapshots, so if anyone has some insight please speak up.
  • Also via Duncan, fixes for crashes caused by installing the Converter plugin to VirtualCenter.  Two solutions are available; try this one, then try this approach.
  • Via Thomas, it looks like VMware has published a storage performance white paper comparing Fibre Channel, hardware iSCSI, software iSCSI, and NFS.  I just finished reviewing the document myself, and plan to go back and look at it again more thoroughly.  It’s useful information for virtualization architects or engineers responsible for designing storage solutions for VMware.
  • In case anyone’s interested in creating their own bootable ESX 3i USB drive, here’s more information.  This is something I need to take a closer look at myself, so when I have the chance to run through these instructions I’ll try to post some feedback on how well they worked.
  • Based on some interaction with a customer yesterday, it looks as if VirtualCenter 2.5 may require the Power User role to be directly assigned to the Hosts & Clusters object in order for users to be able to attach local (client-side) ISO files to a VM.  Has anyone else seen this behavior?  I’m going to try to recreate the problem myself, but if you’ve seen this please speak up in the comments.
  • I’ve also recently received word from a friend that the thin-provisioned disks VMware uses by default on an NFS datastore may not be honored when cloning VMs from a template.  Again, I plan to test this myself, but if anyone else out there has seen this behavior I’d love to hear about it.

As always, I’d love to hear your thoughts, so feel free to add your voice in the comments below.


Proving VMware Over NFS

I’m a fan of using NFS for VMware; I’ve mentioned it before. I’m not the only one, either; there have been a number of recent blog entries from various people regarding the use of NFS for VMware:

Virtual Optics: Why VMware over NetApp NFS

Storage Nuts & Bolts: VMware over NetApp NFS: A Customer’s Testimonial

It’s great that this is receiving more attention in the spotlight, but I still have one question: where are the statistics proving NFS’ value in VMware deployments?

If NFS is as good as Fibre Channel or iSCSI (and personally I agree with Nick that most deployments would be hard-pressed to tell the difference), then where are the stats that show this? Or is it impossible to demonstrate that a VMware deployment on NFS is “just as good” as one on Fibre Channel or iSCSI? Does the value of NFS lie in subjective measures that can’t be quantified, and is that why we haven’t seen any hard proof of its value compared to Fibre Channel or iSCSI?

If you have some insight, please share it in the comments. I’d love to hear everyone’s thoughts on the matter.


Oddity with Windows NFS Server

I was reconfiguring NFS on the Windows Server 2003 R2-based file server providing both CIFS and NFS storage to the lab at the office when I ran into a strange issue.

I had started configuring some NFS exports, testing them to ensure that they worked as I expected.  Along the way, I decided I wanted to restructure the directory hierarchy I was using, so I deleted the folders and rebuilt them from scratch.  When I went to re-export one of the folders via NFS, a very non-descriptive and non-helpful dialog box popped up, indicating only that “the alias specified was already in use”.

I immediately suspected the NFS services, but no amount of restarting the services or the whole server would fix the issue.  The NFS server management console was not useful; it does not provide a list of current exports so that you can see where you may have already exported a path and used an alias.

After a few minutes of messing around, I recreated the original path.  As soon as I recreated the bottom-level folder, it became an NFS export again.  What had happened was that I had created a folder, exported it, and then deleted the folder without removing the export.  The export information stayed in the Registry (location below) even when the file system location was not present.  In addition, the service did not log any information to the event logs indicating that a folder was missing or couldn’t be found.  The only error was the “specified alias is already in use” dialog box that appeared when attempting to export a different folder with the same name.

Upon digging around in the Registry, I found this location for NFS export data:

HKEY_LOCAL_MACHINE\Software\Microsoft\Server for NFS\CurrentVersion\Exports

Under this key is a zero-based list—where the first export starts as 0—that contains the NFS export information.  I deleted the key and now exporting the new folder with the original alias works as expected.
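For anyone who wants to look for this kind of leftover export without clicking through the Registry by hand, here is a minimal sketch in Python. It uses the standard winreg module to dump every numbered subkey under the location above. The helper names are my own, and the value name "Path" is only a guess at how the folder is recorded under each subkey, so check the output of read_exports() before trusting stale_exports().

```python
import os

try:
    import winreg  # standard library, but only available on Windows
except ImportError:
    winreg = None

# Key path taken from the post; it lives under HKEY_LOCAL_MACHINE.
EXPORTS_KEY = r"Software\Microsoft\Server for NFS\CurrentVersion\Exports"


def read_exports():
    """Return {subkey_name: {value_name: data}} for each numbered export subkey."""
    if winreg is None:
        raise RuntimeError("winreg is only available on Windows")
    exports = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, EXPORTS_KEY) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]
        for i in range(subkey_count):
            name = winreg.EnumKey(root, i)
            with winreg.OpenKey(root, name) as sub:
                value_count = winreg.QueryInfoKey(sub)[1]
                exports[name] = {}
                for j in range(value_count):
                    value_name, data, _type = winreg.EnumValue(sub, j)
                    exports[name][value_name] = data
    return exports


def stale_exports(exports, path_value="Path"):
    """Return subkey names whose recorded folder no longer exists on disk.

    "Path" is an assumed value name, not something confirmed in the post.
    """
    stale = []
    for name, values in exports.items():
        folder = values.get(path_value)
        if isinstance(folder, str) and not os.path.isdir(folder):
            stale.append(name)
    return sorted(stale)
```

On the file server itself, stale_exports(read_exports()) would list the subkeys worth inspecting. The sketch only reads the Registry; deleting anything is still a manual decision.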

I suppose this behavior is expected; after all, the NFS server can only export a single path with any given name.  I just expected more information to be available somewhere along the way, be it a warning in the Event Log that states the exported path is no longer available, or a message in the dialog box that indicates what path is currently exported with the desired alias.  If the export was created months or even years ago, or created by a different administrator, it would be impossible to know where to look or where to go without digging into the Registry.  If Microsoft wants to continue to warn users about directly modifying the Registry, then they’d probably better provide some non-Registry tools to manage this stuff.


I guess I’m on a bit of a NetApp kick this week.  After discussing (or perhaps revisiting) the idea of recovering files inside VMs using NetApp Snapshots (first here late last year, then again here), I wanted to take a closer look at full VM recovery using NetApp Snapshots.

First of all, it should go without saying that you should never use any of the procedures I’m describing here without first testing them yourself.  While they worked fine for me, they may not work fine for you.  Don’t just assume they will!  Do the due diligence and test it in your environment first; you’ll be glad you did.

Second, before using NetApp Snapshots to recover VM data (file-level or full VM), be sure you are getting good, consistent Snapshots.  The Network Appliance Technical Reports Library has a number of excellent articles on this subject; I’ll refer you there for more information.

I’ll break this article into two sections, one for block-level storage (I’m using iSCSI, but the process should be almost identical for Fibre Channel) and one for NAS/NFS.  Please note that I’m not focusing so much on the specific steps that are required as I am on general concepts and any gotchas that may arise during the process.

Full VM Recovery using Block Storage

To recover a full VM using block-level storage, a number of steps have to be taken:

  1. Create a LUN clone (or a FlexClone) of the original LUN based on a Snapshot. 
  2. Enable resignaturing on the ESX host(s) that will need to see the cloned LUN.
  3. Mount the cloned LUN(s) on the ESX host(s) and copy the appropriate VM files from the clone to the production LUN.

For the first two steps, I’ll refer you back to one of my first articles on VMware data recovery with Snapshots, which has more information on the necessary commands and settings.

For the third step, you’ll need to log in to the Service Console (typically via SSH) and copy the desired VM(s), along with all their files, from the cloned datastore to the production datastore, overwriting whatever is in the destination (you typically wouldn’t need to recover a full VM unless the production VM was hosed, right?).  Once the file(s) have been copied back over to the production datastore, dismount the cloned datastore and destroy it.

You should now be able to boot up your VM at the state it was in at the time of the Snapshot used to recover it.  Unless the Snapshot was a cold Snapshot (taken while the VM was powered off), the VM will perform a file system check (chkdsk or fsck) when it boots up.

Full VM Recovery using NFS

The procedure for recovering full VMs when using NFS is even easier:

  1. Using an NFS client, mount the NFS export and navigate to the hidden “.snapshot” directory.
  2. In the “.snapshot” directory, find the Snapshot from which you wish to recover the VM.
  3. Copy that VM’s files (the entire folder) out of the “.snapshot” directory into the production filesystem, replacing the current contents (again, this assumes that what’s in the production filesystem is no good, else why would you be recovering a full VM?).
  4. Unmount the NFS export from your NFS client.

The recovered VM should now boot and be back to the point in time at which the Snapshot was taken.  Again, unless the Snapshot was a cold Snapshot, the VM will likely perform a file system check upon boot.  This is normal and not unexpected.
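The copy step in the NFS procedure above can be sketched in a few lines of Python, run on whatever NFS client has the export mounted. This is only an illustration under some assumptions: the function name and arguments are my own, the export is already mounted at mount_point, and the VM is powered off while its files are replaced.

```python
import shutil
from pathlib import Path


def restore_vm_from_snapshot(mount_point, snapshot_name, vm_folder):
    """Replace <mount_point>/<vm_folder> with its copy from .snapshot/<snapshot_name>."""
    mount = Path(mount_point)
    source = mount / ".snapshot" / snapshot_name / vm_folder
    target = mount / vm_folder

    # Check the source first so a typo never deletes the production copy.
    if not source.is_dir():
        raise FileNotFoundError(f"no such VM folder in Snapshot: {source}")

    # Throw away the damaged production folder, then copy the Snapshot's version.
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(source, target)
    return target
```

A call such as restore_vm_from_snapshot("/mnt/nfs_volume1", "nightly.0", "vm01") would do the work of step 3 (the mount point and VM folder name here are made up). Deleting the production folder outright matches the "replace the current contents" wording above; renaming it aside first would be the more cautious choice if you have the space.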

I suppose you could even do this second procedure from a CIFS client, assuming that CIFS and NFS were both configured on the storage system and an appropriate CIFS share existed.  (Please note that I’ve never tried this, so I can’t tell you what the results might be.)  In that case, use the “~snapshot” directory instead of “.snapshot”.

And that’s it—there you have two ways of recovering entire VMs using Network Appliance Snapshots.  As always, feel free to hit me up in the comments with any questions, thoughts, corrections, or rants (just keep the rants on-topic, please!).  Thanks for reading!


Last year, I wrote an article about using NetApp Snapshots and LUN clones to enable the recovery of individual files within a VM.  This time around, I’d like to have a quick look at that same process, but using NFS instead of block-level storage.

As I mentioned a couple of weeks ago, NFS is getting more and more attention as a key storage enabler for Virtual Infrastructure implementations.  I do still plan to conduct some tests of my own between iSCSI and NFS.  (Since they are both IP-based storage protocols, I figure that makes the playing field as level as possible.)  In any case, with regards to file-level recovery within VMs, NFS does possess at least one advantage.

Using any sort of clones (LUN clones or FlexClones) within VI3 currently requires resignaturing to be enabled, or else the ESX Servers don’t even see the clones.  While enabling resignaturing is not difficult (it can be done via the command line or via VirtualCenter), it is not the default configuration and VMware appears not to recommend it (per the SAN Configuration Guide, pages 112 through 115).  With NFS, it’s only necessary to create a FlexClone and set up a new NFS mount; no other configuration is required.

By the same token, using NFS for file-level recovery within VMs also has one key disadvantage:  LUN clones are free, whereas the use of FlexClone requires a license.

With these advantages and disadvantages in mind, let’s have a look at what the process would look like to recover files inside VMs using NFS for VM storage with NetApp Snapshots.

First, we’d review the list of available Snapshots using the snap list command, as shown below:

filer> snap list nfs_volume1
Volume nfs_volume1
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Oct 08 12:00  hourly.0
  0% ( 0%)    0% ( 0%)  Oct 08 08:00  hourly.1
  0% ( 0%)    0% ( 0%)  Oct 08 00:00  nightly.0
  0% ( 0%)    0% ( 0%)  Oct 07 20:00  hourly.2
  0% ( 0%)    0% ( 0%)  Oct 07 16:00  hourly.3
  0% ( 0%)    0% ( 0%)  Oct 07 12:00  hourly.4
  0% ( 0%)    0% ( 0%)  Oct 07 08:00  hourly.5
  0% ( 0%)    0% ( 0%)  Oct 07 00:00  nightly.1

Once we identify the Snapshot that contains the data we need to recover (based on the date/time of the Snapshot), we create a FlexClone using that Snapshot as its backing:

vol clone create nfs_volume1_clone -s file -b nfs_volume1 nightly.0

This creates a FlexClone named “nfs_volume1_clone” based on the nightly.0 Snapshot of the volume nfs_volume1.  If you immediately run the exportfs command, you’ll see that the new clone is already shared via NFS, too.
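If you find yourself doing this often, picking the Snapshot and building the clone command can be scripted. The following is a rough sketch in Python, not anything NetApp provides: it parses pasted snap list output, picks the newest Snapshot taken at or before a given time, and prints the matching vol clone create command. The function names are my own, and because snap list does not print a year, the caller has to supply one.

```python
import re
from datetime import datetime

# Matches the trailing "Mon DD HH:MM name" part of a snap list row.
ROW = re.compile(r"([A-Z][a-z]{2} \d{2} \d{2}:\d{2})\s+(\S+)\s*$")


def parse_snap_list(output, year):
    """Return [(datetime, snapshot_name)] parsed from snap list output."""
    snapshots = []
    for line in output.splitlines():
        match = ROW.search(line)
        if match:
            stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M")
            snapshots.append((stamp, match.group(2)))
    return snapshots


def latest_before(snapshots, cutoff):
    """Return the name of the newest Snapshot taken at or before cutoff, or None."""
    candidates = [(stamp, name) for stamp, name in snapshots if stamp <= cutoff]
    return max(candidates)[1] if candidates else None


def clone_command(parent, snapshot, clone=None):
    """Build the vol clone create command line used in this article."""
    clone = clone or f"{parent}_clone"
    return f"vol clone create {clone} -s file -b {parent} {snapshot}"
```

For example, if the file you need was still intact at 3:00 AM on October 8, latest_before() applied to the listing above would pick nightly.0, and clone_command("nfs_volume1", "nightly.0") produces the same command shown earlier. The sketch only builds the command string; running it on the storage system is left to you.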

From here, the process is pretty straightforward:

  1. Create a new NFS datastore within VirtualCenter, using the new NFS mount as the destination.  This makes the data inside the FlexClone visible to the existing VMs.
  2. Add one of the VMDKs on the cloned NFS datastore to an existing VM as an additional hard drive.  You should be able to do this on the fly without shutting down the VM.
  3. Extract the files you need and place them back where you want them.

When you’re done recovering files, the clean-up process looks like this:

  1. Remove the added VMDK(s) from the VM.
  2. Remove the NFS datastore from VirtualCenter.
  3. Destroy the FlexClone using the vol offline and vol destroy commands.

Overall, this process is rather similar to the technique described using LUN clones, although a bit simpler because resignaturing is not required.


Various Odds and Ends

I was going through my list of flagged headlines in NetNewsWire and realized that I’d built up quite a list of articles that I intended to write something about.  Some of them just don’t merit a full-blown post, though, so I thought I’d just toss a bunch of them in here along with a brief sentence or two about them:

  • VMTN Discussion Forums: vdiskmanager GUI for OSX:  An enterprising Fusion user has written an OS X GUI for vdiskmanager, so that VMDKs on Fusion can be expanded or defragmented, or new virtual disks can be created.  I haven’t tried it yet, but it looks like it could be extremely useful, and it’s nice to see Fusion users creating useful utilities like this.
  • Running ESX 3i Beta in a VM with VMware Fusion:  Still thinking Fusion, this article discusses how a user managed to get ESX Server 3i (the beta version obtained at VMworld 2007) running as a VM under Fusion.  There’s also information on running it under Workstation 6 as well.
  • Tech: How to get the command line in ESX Server 3i beta:  Turns out ESX Server 3i has a command line after all, based on BusyBox.  Richard Garsthagen has more information about ESX 3i available at run-virtual.com.  Also see Eric Sloof’s info on boot options.
  • Storm Worm Botnet Attacks Anti-Spam Firms:  Is this botnet really as massive as everyone says?  I’ve been seeing so many articles about the Storm botnet, but I have yet to see (perhaps I haven’t looked hard enough yet) definitive information that describes the type of traffic these bots generate.  Surely there’s got to be something we can do about this.
  • Microsoft Updates Windows Without User Permission, Apologizes:  Oh, goodness—where do I start with this one?  Let’s just say that I’m glad I’m using Little Snitch, which catches this kind of outbound traffic that so easily slips through the Windows “firewalls” onto the Internet. Otherwise, I might be getting product updates without anyone bothering to tell me so.  (And perhaps it’s just me, but an apology from Microsoft doesn’t make me feel any more trusting of them.)
  • NFS vs iSCSI vs FC:  More information on why we should be interested in running VMware over NFS.

I guess that’s all for now, as it’s getting late and I have to get up in the morning and go to church.  Feel free to share any comments or corrections below.  Thanks for reading!


This article started life as something entirely different. I was reviewing some of the VMworld 2007 slide decks, looking for “nuggets of knowledge,” as I like to call them (these are the small details that are often far more significant than they might seem) when I came across some information on VMware HA isolation response. I was actually looking for something else but as is typically the case when you’re looking for something, you find everything but the one thing for which you’re looking.

In any event, I wanted to take some time to better understand isolation response, so I decided to perform some experiments in my lab with VMware HA and isolation response. For those that aren’t familiar with it, isolation response is the term used to describe what an ESX Server in a DRS/HA cluster will do if it loses connectivity to all the other servers in the cluster, i.e., if it becomes isolated. Isolation response is set on a per-VM basis, and the default (I believe) is to power off. What this means is that when an ESX host becomes isolated, it will power off the VMs that are currently running on that host.

There’s a great deal of debate as to whether this is the right setting or not, which I won’t really delve into right now. In any case, how does a host determine if it is isolated, or if the rest of the cluster is just down? That’s what got me started down this path. The VMware HA agent (which is really the Legato Automated Availability Manager, or AAM, agent—hence the AAMClient stuff in esxcfg-firewall) uses the Service Console’s default gateway as its isolation address. Basically, what this means is that if a host can’t get to any of the other hosts in the cluster and can’t get to the isolation address, then it assumes it is isolated and initiates the isolation response. If it can’t get to other nodes in the cluster but can reach the isolation address, then it is not isolated and should continue operation (perhaps even restarting some VMs locally since this would indicate host failures in the cluster).
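To make sure I had the decision itself straight, here is how I would write it down in Python. This is purely a model of the logic as I understand it from the description above, not how the AAM agent is actually implemented; it says nothing about heartbeat timing, and the function and state names are my own.

```python
def host_state(peers_reachable, isolation_addresses_reachable):
    """Classify a host from what it can reach on the Service Console network.

    peers_reachable: iterable of bools, one per other host in the cluster.
    isolation_addresses_reachable: iterable of bools, one per isolation
    address (by default just the Service Console's default gateway).
    """
    if any(peers_reachable):
        # At least one other host answers, so nothing special happens.
        return "connected"
    if any(isolation_addresses_reachable):
        # Peers are gone but the network is fine: treat them as failed hosts.
        return "peers-failed"
    # Nothing answers at all: this host assumes it is the one cut off.
    return "isolated"
```

Only the "isolated" result would trigger the isolation response. Taking a list of isolation addresses rather than a single one is just a convenience of the sketch.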

The stuff I found in the VMworld 2007 slides talks of using a second isolation address, which provides the VMware HA agent with another means of verifying isolation before initiating the isolation response. Before I proceeded with setting this second address, however, I wanted to be sure I understood the operation of isolation response in the current configuration. Once I’d tested that and then tested the second isolation address, I was going to write it up here.

To make a long story not quite so long, I found that isolation response was not working as expected. What happened is that other hosts in the cluster would detect the “host failure” (the isolation of my test host) and try to restart the VM before the test host detected isolation and tried to shut down the VM. This was evidenced by these lines in /var/log/vmkernel:

Oct 5 13:13:36 esx02 vmkernel: 38:20:29:34.025 cpu3:1305)WARNING: NFSLock: 1883: disk is being locked by other consumer
Oct 5 13:13:36 esx02 vmkernel: 38:20:29:34.025 cpu3:1305)NFSLock: 2479: failed to get lock on file vswim01-flat.vmdk 0x5a1b6a0 on 192.168.31.51 (192.168.31.51)

(Yes, I’m running my VMs on NFS. Yes, I did try iSCSI to see if the behavior was different. No, I did not try Fibre Channel. Yes, I got the same results in both cases.)
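If you want to check your own hosts for the same symptom, a small script can pull these lock failures out of a copy of the vmkernel log. This is a minimal sketch in Python; the pattern is based only on the two lines shown above, so other builds may word the message differently, and the function name is my own.

```python
import re

# Pattern derived from the sample "failed to get lock" line in this post.
LOCK_FAILURE = re.compile(
    r"NFSLock: \d+: failed to get lock on file (?P<file>\S+) "
    r"\S+ on (?P<server>\S+)"
)


def nfs_lock_failures(lines):
    """Return [(file, server)] for each NFSLock 'failed to get lock' line."""
    failures = []
    for line in lines:
        match = LOCK_FAILURE.search(line)
        if match:
            failures.append((match.group("file"), match.group("server")))
    return failures
```

Feeding it the lines of a log file copied off the host (for example, via open("vmkernel").readlines(), where the file name is a placeholder) returns one entry per VMDK that another host already had locked.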

To make things even more interesting, I found that the test host failed to successfully shut down a Linux VM when the isolation response was finally triggered, but was able to successfully power down a Windows guest. Both VMs had the latest version of the VMware Tools installed.

Since that time, I’ve been combing the Internet searching for more information on the VMware HA agent, the AAM ftcli utility, behaviors, workarounds, configuration tweaks, etc. Thus far, it has been an abysmal failure. There are lots of VMware Community threads, but almost every one of those is a “double-check your DNS and /etc/hosts” thread.

So, any VMware gurus out there have some useful information to share? Anyone else having VMware HA problems? Anyone know where I can find some actually useful information on VMware HA and the AAM client? I’d love to get some more detailed information and be able to put this thing to rest (and be able to advise others on how to put it to rest as well).


NFS for VMware Storage

I thought that I had blogged here before about using NFS for VMware storage, but it appears that I have not.  (I guess that’s one of the downfalls of a fairly long-running weblog—you blog about some things too often and not at all about other topics.)  In any case, following some of the VMworld breakout sessions last week, NFS is getting a lot more attention these days as the storage protocol for VMware.

A couple of recent blog entries on this topic caught my attention:

Eisler’s NFS Blog - VMware over NFS?
Storage - VMware over NFS

Network Appliance seems to be talking the most about NFS for VMware, which kind of makes sense given their history in NFS.  I’m using NFS in our lab (which uses NetApp storage systems) and have had nothing but positive experiences thus far.  I have not yet had the opportunity to conduct any performance tests, but I do plan to try to work up some numbers on NFS vs. software-based iSCSI.  I can’t, unfortunately, compare to Fibre Channel as I have no FC infrastructure in the lab (yet).

I’d love to hear feedback from any readers that might be using NFS for VMware storage.  What have your experiences been?


I just wrapped up two different sessions on IP-based storage, one on iSCSI configuration and one on performance characteristics and comparisons between iSCSI and NFS.

I couldn’t liveblog the first session because it was too crowded (no room to type on my laptop) and I couldn’t get a wireless signal from the VMworld 2007 network.  There are, however, a couple key points from the session that stick out in my mind:

  • ESX Server does not currently support MCS (multiple connections per session) or jumbo frames, two key optimizations that can really help with iSCSI performance.  There is no word yet on when those shortcomings will be addressed; personally, I’m hoping that VMware fixes them in ESX 3.5.
  • There is, apparently, some way of performing manual load balancing of iSCSI LUNs to help improve performance.  The speakers did not go into any great details, and I was unable to speak with one of the presenters, Jon Hall, after the session.  He did, however, invite me to contact him via e-mail, so I’ll post more information on that once I’ve had some communication with him.

Most of the rest of the information presented in that session was pretty straightforward and was information I’d already seen.  All in all, it was a decent session, but I didn’t get as much information from the session as I had hoped I would.

After lunch, I returned to the Moscone Center for a session titled “NFS and iSCSI - Performance Characterization and Best Practices.”  I was really hoping to get some additional best practices on using NFS and iSCSI and on maximizing performance with these IP-based storage solutions.

The session started with some performance characteristics with ESX Server today vs. ESX Server 3.0.1; basically, it was an update of some performance data presented last year at VMworld 2006.

These updated performance statistics are intended to show the results of optimizations that have been incorporated into ESX Server.  This includes optimizations like improved and more accurate CPU accounting (this improves load balancing across VMs), improved PAE support, minimized NUMA overhead, improved CPU cost per I/O, increased maximum transfer sizes, and the ability to handle more concurrent I/Os.

As a result, software iSCSI sees a range of improvements since ESX Server 3.0.1, as high as 15% for 8K block writes, with reductions in latency across the board and reductions in CPU utilization as well.  Read operations will show the greatest improvements.

Hardware iSCSI sees dramatic improvements in smaller block sizes, but the larger block sizes are essentially unchanged.  The same goes for latency, and the reductions in CPU utilization for hardware iSCSI share the same characteristics as for software iSCSI (but keep in mind that the absolute change, as opposed to the percentage change, will be greater for software iSCSI).  With hardware iSCSI, mixed read-write operations will benefit more than pure read operations.

Differences between VMFS and RDM (raw device mapping) are inconsequential (less than 2.5%); the only significant difference is CPU utilization, where VMFS requires more CPU time than RDM.

Comparisons of hardware iSCSI, software iSCSI, and NFS with regards to throughput show figures that are not entirely unexpected.  NFS is slightly slower than both flavors of iSCSI, and has greater latency than the iSCSI flavors.  However, all of the measured figures were in milliseconds, so it’s not terribly significant.

Moving into performance best practices, the presenter started with the storage array itself, and provided the typical list of items to consider:  total spindle count, number of spindles allocated for use, RAID level and stripe size, storage processor specifications, read/write cache sizes, and caching policies.  This is all pretty standard information that is applicable in sizing a correct storage solution, independent of a virtualization implementation.  (I would use those same counters to size a Microsoft Exchange storage solution or an Oracle storage solution, for example.)

Since we are talking IP-based storage, networking configuration comes into play here, including such things as the network topology, switches, NICs, and the flavor of iSCSI (hardware/software).  Similarly, things about the ESX Server host like CPU speed and number of CPU cores, overall system architecture, bus speed, I/O subsystems, and memory configuration all play a part in determining performance of IP-based storage solutions.

Finally, it’s important to understand the characteristics of the workload(s), such as I/O sizes, read/write patterns, and dependence upon aggregate throughput or latency.

<aside>OK, can we move past this stuff now?  This is all basic stuff that isn’t necessarily specific in any way to virtualization.  I want to see best practices for using IP-based storage with VMware!</aside>

Using multiple NFS mount points may improve aggregate throughput at the cost of slightly higher CPU utilization.  NFS export options can affect performance as well.  (OK, which options?  Telling us that without telling us which options is kind of like leaving us hanging.)

iSCSI digests may or may not have an impact on performance; iSCSI header digests have little or no impact; turning off iSCSI data digests can improve performance.

The presenter went over some additional troubleshooting tips, and the slide briefly mentioned the vsish command.  I hadn’t heard of that command; anyone know of where I can find some additional information on vsish?

Wrapping up the session, the presenter went through a few scenarios involving performance troubleshooting with both iSCSI and NFS.  Overall, I did not find this session to be nearly as helpful as I had hoped it would be.  The presenter was not engaging, and the presentation did not provide the kind of detailed information that I felt should have been included.  (Examples: mentioning that some NFS export options affect performance, but failing to mention which options, or stating that there is a VMware knowledge base article about a topic but failing to provide the KB article number or URL.)


NFS Help

I like to think that I’m a fairly intelligent guy, able to pick up most things reasonably quickly given the opportunity.  After all, I transitioned from a Windows-only SE into an SE with a good reputation for VMware ESX Server, various Linux flavors, Mac OS X, and some Cisco configuration (hey, if you can do GRE tunnels with IPSec encryption, you’re not too shabby with IOS).  But I’m having a real problem with NFS.

I know, it seems silly, but I just can’t wrap my head around how it works.  In particular, the NetApp implementation of NFS and the /etc/exports file that Data ONTAP uses seem to be very different from the way you would configure NFS on Linux or Solaris.  Even when I go through the FilerView GUI to configure an NFS export, it doesn’t seem to work the way I expect.  To be fair, I’m sure this is just a lack of understanding on my part and not necessarily a flaw or drawback in the NetApp implementation.

Take this example.  I recently added another old F840 storage system to my lab at the office, and will begin setting up a demo SnapMirror environment to show to customers (SnapMirror with VMware on NetApp is going to be slick!).  I thought I’d also start performing some NFS testing as well; I’m particularly interested in thin provisioning the VMDKs on a thin provisioned FlexVol via NFS.  So I create a new FlexVol and then proceed to configure a new NFS export.  After walking through the NFS export wizard in FilerView and specifying my MacBook Pro’s IP address as having both read/write access and root access, I mount the export and proceed to try to copy an ISO image file.  Denied!  Huh?  Checking the properties, I see that I only have read-only permissions.  What’s up with that?

I try several other variations, and all of them provide the same result.  How can that be?  If my host’s IP address is provided read/write access, why do I have read-only access?  Is one option overriding another?  How do the options interact with each other?  I’m sure these are silly/easy questions for those well-versed in NFS, but for whatever reason I’m having a hard time here.

If anyone could share some enlightening information, I’d certainly appreciate it.

