VMwareFT


This is just a quick post inspired by Mike Laverick’s recent “Stupid IT” post, in which he weighed in on the blog discussion between Steve Chambers and me regarding “putting all your eggs in one basket,” the most common argument against high consolidation ratios and, in some cases, against consolidation in general.

Mike’s articles (part 1 and part 2) are excellent. The interesting thing here is that, when you really boil it down, my viewpoint is not that far off from either Steve’s or Mike’s (spoiler warning: Mike agrees with Steve). In my blog post, I tried to focus less on whether high consolidation ratios are good or bad and more on whether high consolidation ratios, and the impact of the design decision to use them, will satisfy the needs of the business.

I agree with a number of points from Steve’s post. For example, I agree the root cause of an outage is more likely to be human error than hardware outage. I also agree that building redundancy into the infrastructure helps further reduce the possibility of an outage. Mike makes the same argument:

The truth is that hardware and software components are so reliable and redundant they hardly ever fail. In fact, so much availability software is geared towards protecting the server from hardware failure that some of my peers are beginning to question why they even buy SKUs that contain VMware HA.

So if everything is so redundant and so stable, why do people buy VMware HA? Why do people use clustering solutions like Windows Failover Clustering? Why do people use VMware FT or Neverfail or any of the rest of it?

The answer is simple: fear. Businesses are afraid of their applications being unavailable. In some cases, this fear is irrational, and from this perspective I agree wholeheartedly with both Mike and Steve: don’t use the “all my eggs in one basket” argument with me just because it scares you, just because you’re afraid of running all your workloads together.

On the other hand, though, this fear might be justified. What if the application or applications in question are the very lifeblood of the business? If you are an online-only organization, the need for your web site to stay up and accessible is crucial. If the web site is down, lots of money gets lost. In this situation, the fear of being unavailable is justified. It’s not irrational—it’s based on a keen understanding of the needs of the business and the impact of the outage upon the business. And in those cases, where suppressing consolidation ratios is used deliberately in order to satisfy the needs of the business, I’ll accept the “all my eggs in one basket” argument.

Come to me with the “all my eggs in one basket” argument backed by irrational fear and a lack of information, and I’ll argue against it every time. Come to me with the “all my eggs in one basket” argument backed by an understanding of how IT aligns with the business and the impact of an outage on the business, and I’ll listen to—and possibly even agree with—your position. As I and so many others have stated on numerous occasions, don’t pursue high consolidation ratios for the sake of high consolidation ratios. Pursue them because it makes the most sense for the business.

In the end, I guess my point is that both Steve and Mike have missed the point. Not that their viewpoints are irrelevant; quite the opposite! Both of them make very good points that are quite relevant and pertinent to the discussion of “Why not higher consolidation ratios?” Unfortunately, that’s not the question that needs to be asked or answered. The question should be, “What is best for the business?” In that context, putting “all your eggs in one basket” isn’t always the best answer.

Courteous comments welcome!


With the release of VMware vSphere 4 earlier this year, VMware officially introduced VMware Fault Tolerance (VMware FT), a new mechanism for providing extremely high levels of availability to virtual machine workloads. As I’ve talked with customers, I’ve noticed a growing number of customers who are unaware of the differences between the types of high availability that VMware provides (in the form of VMware HA and VMware FT) and operating system-level clustering (such as Microsoft Windows Failover Clustering). Although both types of technology are intended to increase availability and reduce downtime, they are very different and offer different types of functionality.

Consider these points:

  • While using VMware HA will protect you against the failure of an ESX/ESXi host, VMware HA won’t—by default—protect you against the failure of the guest operating system. An OS-level cluster, on the other hand, does protect against the failure of the guest operating system. +1 for OS-level clustering.
  • VMware clusters that are using VMware HA can choose to use VM Failure Monitoring and gain some level of protection against the failure of the guest operating system, but you still won’t get protection of the specific application within the guest operating system, unlike an OS-level cluster. +1 for OS-level clustering.
  • These same arguments also apply to VMware FT. VMware FT won’t protect you against guest operating system failure—a crash of the OS in the primary VM generally means a crash of the OS in the secondary VM at the same time—and it won’t protect you against application failure. +1 for OS-level clustering.
  • You can’t fail over between systems using VMware HA or VMware FT in order to perform OS upgrades or apply OS patches. +1 for OS-level clustering.
  • Similarly, you can’t fail over between systems using VMware HA or VMware FT in order to do a rolling upgrade of the application itself. +1 for OS-level clustering.
  • Of course, the VMware technologies do have some advantages. Both VMware HA and VMware FT are far, far simpler to enable and configure than an OS-level cluster. +1 for VMware.
  • Neither VMware HA nor VMware FT requires any application support in order to protect the VM and its workloads. +1 for VMware.
  • Neither VMware HA nor VMware FT requires that you license specific editions of the guest operating system or application in order to be able to use their benefits. +1 for VMware.
  • VMware HA can produce higher levels of utilization within a host cluster than using OS-level clustering. +1 for VMware.
  • VMware FT can provide higher levels of availability than what is available in most OS-level clustering solutions today. +1 for VMware.

This is not a knock against any of the technologies listed (VMware HA, VMware FT, or OS-level clustering) but rather an exploration of their advantages, disadvantages, similarities, and differences. Hopefully, this will help readers who might not be as familiar with these products make a more informed decision about which technologies to deploy in their data center. (Hint: You’ll probably need all of them.)
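The protection points from the list above can be condensed into a small lookup table. This is a rough sketch in Python, not an official support matrix: the dictionary layout and the `covers` helper are names invented for illustration, and marking OS-level clustering as covering host failure is an assumption (it presumes the cluster nodes run on separate hosts), since the list does not address it.

```python
# Failure types each technology protects against, per the list above.
# "guest_os" for VM Failure Monitoring means "some level of protection".
COVERAGE = {
    "VMware HA": {"host": True, "guest_os": False, "application": False},
    "VMware HA + VM Failure Monitoring": {"host": True, "guest_os": True, "application": False},
    "VMware FT": {"host": True, "guest_os": False, "application": False},
    # Host coverage here is an assumption: cluster nodes on separate hosts.
    "OS-level clustering": {"host": True, "guest_os": True, "application": True},
}


def covers(failure):
    """Return the technologies that protect against the given failure type."""
    return sorted(name for name, c in COVERAGE.items() if c[failure])


def uncovered(selected):
    """Return the failure types that none of the selected technologies cover."""
    failures = ["host", "guest_os", "application"]
    return [f for f in failures if not any(COVERAGE[name][f] for name in selected)]


print(covers("application"))
print(uncovered(["VMware HA", "VMware FT"]))
```

Asking `uncovered` about a combination such as VMware HA plus VMware FT shows which failure types remain exposed, which is the reasoning behind the hint that most environments end up needing more than one of these technologies.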


Welcome to Virtualization Short Take #30, my irregularly posted collection of links and thoughts on virtualization. I hope you find something useful here!

  • I believe Jason Boche already mentioned this on his own blog (I couldn’t find a link) and also started this VMware Communities thread discussing the fact that the 8/6 patch breaks FT compatibility between ESX and ESXi hosts in the same cluster. This VMware KB article is now available with more information on the problem. What I’m hearing from VMware is that there is no short-term solution; the workaround is to use only ESX or only ESXi within a single cluster. (I don’t recommend leaving the hosts unpatched until the problem is fixed.)
  • And while we’re talking VMware FT, here’s a good document on VMware FT architecture and performance. (Eric Siebert’s Virtualization Pro blog post about VMware FT is really good, too.)
  • I’m also hearing reports that there are problems mixing ESX and ESXi in the same cluster when using host profiles. Theoretically, you should be able to use an ESX reference host and apply that to ESXi hosts, but in reality it’s not working so well.
  • If you’re using AppSpeed, you’ll need to manually turn off the AppSpeed sensor VMs in order to put ESX/ESXi hosts into Maintenance Mode. The sensor VM won’t VMotion off the host, so this prevents the host from entering Maintenance Mode.
  • Here’s another topic that I think has been mentioned elsewhere (looks like Duncan mentions it here), but SRM 1.0 Update 1 Patch 4 was released a couple of weeks ago and it includes a fix for customizing the IP addresses of Windows Server 2008 guest operating system instances.
  • Toward the end of August, VMware Infrastructure 3 support was added for NetApp MetroCluster (see this VMware KB article). Now, how about some VMware vSphere 4 support?
  • Most of you are aware by now (and if you aren’t aware, go buy a copy of my book so you will be aware) that you can use Storage VMotion to change virtual disks from thin provisioned to thick provisioned. The problem is this: the type of thick provisioned disk created when you do this via Storage VMotion is eagerzeroedthick, not zeroedthick. This means that it is not friendly to storage array thin provisioning!
  • I’m still looking for a valid use case for this little trick, but it’s mentioned by both Duncan and Eric: the ability to present multiple cores per socket to a virtual machine. Duncan’s post is here; Eric’s post is here. As Eric points out, licensing is one potential use. Anyone have any other valid use cases?
  • Eric Sloof has a great post on dvSwitch caveats and best practices that is definitely worth reading.
  • Want to make linked clones work on vSphere? Tom Howarth points out in this post some information made available by William Lam. Both articles are worth a look.
  • Tom also posted some useful information on enabling firewall logging on VMware ESX hosts.
  • This post over on Aaron Sweemer’s blog was actually written by guest author John Blessing (aka @vTrooper on Twitter) and just goes to illustrate how difficult it can be to create a chargeback model.
  • Of course, the “Super iSCSI Friends” recently produced a multi-vendor post on using iSCSI with VMware vSphere, a great follow-up to the original multi-vendor VI3 post. Here’s Chad’s version of the multi-vendor vSphere and iSCSI post.

That wraps it up for this time around. Thanks for reading, and feel free to submit any other useful or interesting links in the comments below.


You might have read the article I wrote here titled vSphere Virtual Machine Upgrade Process, in which I described a process whereby you could upgrade your VMs to VM hardware version 7 (the version used with vSphere) as well as use the latest paravirtualized network and SCSI drivers (VMXNET3 and PVSCSI). Both PVSCSI and VMXNET3 offer greater performance with the same CPU utilization.

Rightfully so, some readers and other bloggers pointed out that PVSCSI isn’t supported for boot disks (Rich Brambley put up a really good post, for example). Rich, among others, suggested moving virtual machines back to a “two disk model,” with a boot disk and a separate data disk; this would allow for the greater performance of the PVSCSI controller on the data disk. This seemed to be a reasonable workaround. I don’t recall hearing about any significant issues with VMXNET3. Using the newer network driver seemed to be a good move all the way around.

Unfortunately, there is another drawback to both of these devices. Rich caught this drawback in his article, but relegated it to a small mention at the very end of the article that even I overlooked at first (emphasis mine):

There are some other factors to consider as well. For example, vSphere Fault Tolerance cannot be enabled on a VM using PVSCSI.

That’s right—you cannot use VMware Fault Tolerance (FT) on a virtual machine that is using the PVSCSI device. However, this restriction doesn’t just apply to the PVSCSI device; it also applies to VMXNET3! VMware FT cannot be enabled on a virtual machine using either the VMXNET3 or PVSCSI devices; vCenter Server will simply report an error that the network interface or disk controller isn’t supported for VMware FT.

In my opinion, this is a significant enough limitation that it warrants its own post. If you are planning on using VMware FT in your environment, be sure not to configure any virtual machines to use VMXNET3 or PVSCSI if they might need to be protected with VMware FT. In this case, you’ll have to choose between maximum performance and maximum protection; you don’t get both.
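If you keep an inventory of your virtual machines and their virtual devices, a quick check like the following can flag VMs that would be blocked from VMware FT. This is a minimal sketch in Python under stated assumptions: the `VirtualMachine` record and the device name strings are hypothetical stand-ins for whatever your inventory export provides, not objects from the vSphere API.

```python
from dataclasses import dataclass, field

# Device types this post identifies as blocking VMware FT.
FT_UNSUPPORTED_DEVICES = {"vmxnet3", "pvscsi"}


@dataclass
class VirtualMachine:
    """Hypothetical inventory record; not a vSphere API object."""
    name: str
    devices: list = field(default_factory=list)


def ft_blockers(vm):
    """Return the devices on this VM that prevent enabling VMware FT."""
    return sorted({d.lower() for d in vm.devices} & FT_UNSUPPORTED_DEVICES)


def needs_review(vms, ft_candidates):
    """Map each FT candidate VM name to its blocking devices, if it has any."""
    report = {}
    for vm in vms:
        blockers = ft_blockers(vm)
        if vm.name in ft_candidates and blockers:
            report[vm.name] = blockers
    return report


inventory = [
    VirtualMachine("sql01", ["VMXNET3", "lsilogic"]),
    VirtualMachine("web01", ["e1000", "lsilogic"]),
    VirtualMachine("file01", ["vmxnet3", "pvscsi"]),
]
print(needs_review(inventory, ft_candidates={"sql01", "web01"}))
```

Only VMs that are both FT candidates and carry one of the two devices show up in the report, so VMs that will never need FT can keep the faster paravirtualized devices.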

UPDATE: Rich Brambley shared links to two resources that describe the incompatibility between VMware FT and PVSCSI and VMXNET3:

VMware Communities: Unable to configure FT with error “Unsupported virtual machine configuration for Fault Tolerance. Device ‘Network adapter 1’ is not supported”
VMware Fault Tolerance Requirements and Limitations


Having previously discussed Marathon Technologies’ everRun VM product in conjunction with XenServer HA as part of XenServer 5, I think that I have some useful information to bring to the recent discussion that has come to light about everRun VM vs. VMware HA vs. VMware FT.

Apparently, this discussion started with a blog entry by Marathon titled VMware FT – The Top Four Reasons it’s Kinda Sorta Fault Tolerance. Mike DiPetrillo of VMware responded from his personal blog, first tackling Marathon’s blog post and then again tackling comparisons posted on Marathon’s web site. Various others also weighed in, such as Duncan at Yellow Bricks and TechTarget’s Server Virtualization Blog.

I’ve spoken with the folks from Marathon a couple of different times about everRun and its functionality, so let me attempt to compare these three products—everRun VM, VMware HA, and VMware FT—with an eye toward understanding the differences between them.

  • Marathon everRun VM provides two levels of protection: Level 1 and Level 2. Level 1 is basic failover, and is included in XenServer 5 as XenServer HA. In this regard, it is essentially the same as VMware HA in that it will restart VMs in the event of host failure. Both products calculate available capacity for failover but do not reserve those resources in advance; hence, neither of them can provide guaranteed failover. VMware HA seems to have an upper hand here because admission control can actually prevent users from powering on VMs if there are not enough resources to provide failover for that VM. From all information I have been able to obtain, everRun VM Level 1/XenServer HA lacks that ability, and it’s possible therefore that users could power on more VMs than the resource pool could sustain in the event of hardware failure. Both products should be considered “best effort” as a result. Users wanting to make comparisons between Marathon everRun VM and VMware HA should constrain their comparison to everRun Level 1. Otherwise, the comparison is not a like-to-like comparison.
  • Marathon everRun VM goes on to add Level 2 protection for component-level failure. It’s true that this level of protection exceeds anything that can be provided via VMware HA today. With component-level protection, I/O to or from a failed storage device or a failed network device is transparently redirected to another host, where an identical VM environment has been established. Please note that the two VM environments are not both executing at the same time, but that resources on the secondary host are reserved and cannot be used by any other VMs. These resources include not only RAM, but also storage and networking. If there is a host failure, the VM is restarted on the secondary host. Because resources were pre-allocated, everRun VM is able to provide guaranteed restart on the secondary host. The functionality provided by everRun VM when configured for Level 2 protection exceeds any functionality that VMware HA has today.
  • On the flip side, however, it’s also fair to note that VMware has not needed to provide component-level fault tolerance because they’ve supported storage multipathing and NIC teaming for quite some time. It’s my understanding that those features have only recently made it into the XenServer product line.
  • VMware Fault Tolerance (FT) and everRun VM Level 3 are comparable. Both establish an identical VM on another host and keep that VM “mirrored” with the original VM. If there is a host failure, the “mirrored” VM will automatically take over right where the primary was when it failed. It appears that everRun VM might have an edge here because it again supports component-level failover, but given that neither product is available yet it’s still a bit too early to be making calls on which product is “better”.
  • As for the “complexity” of one product versus the other, both have their own complexities. Marathon everRun VM requires a dedicated network link, called the “Availability Link”, in order to provide the component-level protection. I would assume the Availability Link will be needed for everRun VM Level 3 as well. That corresponds directly to VMware FT’s logging NIC. VMware HA does not require any special NICs or unique configurations; it’s unclear if the same is true for everRun VM Level 1/XenServer HA protection. I’ll have to call Marathon out on their knocks against setting up NIC teaming and storage multipathing; those tasks may be complicated in XenServer environments but are drop-dead simple in VMware ESX environments. The same goes for enabling VMware HA and VMware DRS.
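The admission control behavior described in the first bullet above can be illustrated with a deliberately simplified model. This sketch is in Python and is an assumption-laden toy, not VMware’s actual admission control algorithm: it looks only at memory, assumes the worst case is losing the largest hosts, and uses a hypothetical `can_power_on` function name.

```python
def can_power_on(host_capacities_mb, reserved_mb, new_vm_mb, host_failures_to_tolerate=1):
    """Allow a power-on only if the cluster could still hold every VM
    after losing its largest hosts (a simplified, memory-only model)."""
    keep = max(len(host_capacities_mb) - host_failures_to_tolerate, 0)
    # Sort ascending and drop the largest hosts to model the worst case.
    surviving = sorted(host_capacities_mb)[:keep]
    return reserved_mb + new_vm_mb <= sum(surviving)


cluster = [32768, 32768, 32768]          # three hosts with 32 GB each
print(can_power_on(cluster, 60000, 4096))  # fits within two surviving hosts
print(can_power_on(cluster, 60000, 8192))  # would exceed two surviving hosts
```

Without a check of this kind, nothing stops users from powering on more VMs than the surviving hosts could restart, which is the gap attributed above to everRun VM Level 1/XenServer HA.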

As you can see, each product has its own set of strengths and weaknesses.

As a final note, as SearchServerVirtualization.com stated, comparisons between these two product sets are a bit irrelevant anyway: VMware’s functionality works only with VMware ESX environments, and Marathon’s functionality works only with XenServer. It’s not like users have to choose between them in the same virtualization environment.

I welcome everyone’s input and thoughts on this matter. Please contribute in the comments to this article.


There is no general session this morning at VMworld 2008; instead, a “keynote” will be delivered about automating disaster recovery (DR) using VMware Site Recovery Manager (SRM). This is similar to the way in which other vendors have delivered various “keynotes” throughout the conference instead of all the announcements being crammed into the morning general sessions.

The speaker this morning is Jay Judkowitz, the product manager for VMware SRM. I’ve met Jay before; he’s a good guy. There’s a small technical glitch as the session begins because the slide deck doesn’t come up, but that gets resolved within only a few minutes and Jay begins his presentation.

The presentation begins with yet another overview of the VDC-OS vision; SRM is considered one of the vCenter management vServices. Jay then goes on to address all the various ways in which VMware provides application availability for applications hosted on VMware Infrastructure. This would be technologies like VMotion, VMware HA, VMware DRS, VMware FT, NIC teaming, storage multipathing, and of course Site Recovery Manager.

The traditional challenges of DR (including complex recovery processes and procedures, hardware dependence, inability to test extensively or repeatedly) are all addressed by VMware SRM. More accurately, they are addressed by the products that form a foundation underneath VMware SRM, through features like hardware independence, encapsulation, partitioning and consolidation, and resource pooling. These features have a direct play in a DR environment. It’s funny to see Jay taking this particular approach; it’s almost like he’s using the same slide deck that I’ve used in DR presentations given over the last couple of months.

That finally brings the discussion around to Site Recovery Manager specifically. Jay goes over some of the features of SRM, and discusses some “dos and don’ts” for SRM. For example, SRM isn’t really intended to provide failover for a single VM, although you can architect it to do that (put that VM on a single LUN by itself and create a Protection Group for that LUN and VM, then craft your Recovery Plan).

It’s important to note that SRM is not a replication product, but instead relies upon replication products from supported partners. This is done via the Storage Replication Adapter (SRA), a piece of software written by the storage vendor.

When setting up SRM, there are a number of steps that it goes through. First, you have to integrate with the storage replication in place already (and yes, the storage replication needs to be in place already). Next, you need to map recovery resources; this creates the link between resources used in the Protected Site and resources that will be used in the Recovery Site. Third, you need to create Recovery Plans, which are the automated equivalent of the DR runbook. That is, the Recovery Plan defines which VMs will fail over, in which order, at the Recovery Site. That’s a bit of a simplistic overview but it does get the point across.

At this point, I’ve decided that I’m going to try to get into a different session. I’m quite familiar with SRM, a lot of readers are probably familiar with it as well, and it doesn’t look like there is anything new that will be revealed here. For those readers that aren’t familiar with SRM, let me know in the comments. If there’s enough interest, I’ll write something separate after my return from VMworld 2008.


Well, it looks like wireless coverage pretty much stinks, so this will have to be published post-session as well.

This session is BC2621, titled “Fault Tolerant VMs in VMware Infrastructure: Operations and Best Practices”, presented by Dan Scales, Principal Engineer with VMware. It’s a pretty full session, but I’m lucky (unlucky?) enough to get a seat toward the front of the session.

This is a technical preview of VMware FT, a new feature that is slated to be included in VDC-OS. This is the evolution of “Continuous Availability,” which was demo’ed by Mendel Rosenblum in San Francisco at VMworld 2007. VMware FT is part of the availability application vServices, which also include such things as VMware HA, VMotion, Storage VMotion, NIC teaming, and storage multipathing.

VMware FT is one of two new application vServices that are being discussed this week; the other is vCenter Data Recovery, a full backup solution.

VMware FT is based on vLockstep technology that keeps a primary and secondary machine in virtual lockstep with no special hardware. Like the rest of VMware Infrastructure, VMware FT will run on standard x86 commodity hardware, and it provides zero downtime and zero data loss. This is just another level of availability that can be provided. When compared with hardware-based availability, which does not run on commodity hardware, or standard clustering solutions, VMware FT allows for more protection with less complexity at a lower cost.

Again, VMware FT is simply a way to provide a higher level of protection than VMware HA for some VMs and workloads.

VMware FT involves two VMs: a primary VM and a secondary VM. The secondary VM is doing exactly the same thing as the primary, but does not communicate across the network. (What about storage?) In the event of a hardware failure, the secondary VM will become the primary VM and will assume network connections. Like VMotion, there should not be any interruption in network connectivity. After a host failure, a new host will be selected to run a new secondary VM, and there is a brief window in which the VM can’t be fully protected with VMware FT.

VMware FT is fully integrated with VMotion and VMware DRS. Multiple FT pairs can run on a single host, and a single host can run both FT-enabled and non-FT-enabled VMs. FT can be enabled or disabled dynamically, and VMware DRS is leveraged for the placement of the secondary VM.

VMware FT is based on VMware’s Record/Replay technology, which was first introduced in VMware Workstation in 2006. When the data stream is recorded, only non-deterministic events are recorded, and the replay will occur deterministically. This creates instruction-for-instruction, memory-for-memory identical results.

Deterministic means that a processor will execute the same instruction stream and will end up in the exact same state. By recording non-deterministic events, VMware ensures that the record and the replay are identical. Non-deterministic stuff involves things like network/disk/keyboard I/O and hardware interrupts.
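The idea is easier to see with a toy model. The following Python sketch is not how VMware implements Record/Replay; it only illustrates the principle that if every non-deterministic input is logged, a second machine fed the same log ends in the same state. The `ToyMachine` class and the random event source are invented for illustration.

```python
import random


class ToyMachine:
    """Deterministic state machine: the same events in the same order
    always produce the same final state."""

    def __init__(self):
        self.state = 0

    def step(self, event):
        # Purely deterministic update: no clocks, no randomness.
        self.state = (self.state * 31 + event) % 1_000_003


def run_and_record(machine, steps, rng):
    """Run the machine on non-deterministic events and log each one."""
    log = []
    for _ in range(steps):
        event = rng.randrange(256)  # stands in for an interrupt or I/O completion
        log.append(event)
        machine.step(event)
    return log


def replay(machine, log):
    """Feed a recorded log to another machine, in order."""
    for event in log:
        machine.step(event)


primary = ToyMachine()
log = run_and_record(primary, 1000, random.Random())

secondary = ToyMachine()
replay(secondary, log)

print(primary.state == secondary.state)
```

Only the events had to be shipped to the secondary, not the machine state itself, which is why recording just the non-deterministic inputs is enough to keep two machines identical.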

Using record/replay, VMware FT keeps two VMs in lockstep. These two VMs will share a common disk, although in the future this may move to a non-shared disk. Only the primary VM responds across the network; the secondary VM is a silent partner. If the primary VM fails, the secondary VM takes over immediately. In the event of a secondary VM failure, then redundancy will be re-established by restarting the secondary and re-syncing the two VMs.

VMware FT must be used within a VMware HA cluster. Once in a VMware HA cluster, a VM may be protected by FT, HA, or both. When both HA and FT are in operation, then both technologies come into play. In the event of hardware failure, HA will restart VMs and FT will make secondary VMs take over and become primary while HA restarts the former primary VM.

When a user enables VMware FT for a VM, a special kind of VMotion is used to create a secondary VM on a second host (basically a copy of the configuration). Then the two VMs are kept in virtual lockstep via VMware FT. If the primary VM is powered off, the secondary VM is powered off as well; the secondary VM will also be powered off if VMware FT is disabled.

Now, looking at the hardware and software requirements, VMware FT requires CPUs that support hardware virtualization (AMD-V, Intel VT). These features sometimes need to be enabled in the BIOS of the server. All hosts must be running the same build of VMware ESX, shared storage is required (NAS or SAN), and all hosts must be in an HA-enabled cluster. In addition, a separate FT logging NIC and a separate VMotion NIC are required. This means a minimum of 4 NICs are necessary (Service Console, VM traffic, VMotion, FT logging). Gigabit Ethernet is required for the FT logging NIC (just like the VMotion NIC). There’s mention of “dedicated” or “separate” NICs; I wonder how firm that recommendation is? I’d be interested to know how this impacts best practices for NIC configurations.
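The host-side requirements from the session lend themselves to a simple pre-flight checklist. This is a sketch in Python under stated assumptions: the `Host` record, its field names, and the NIC role labels are hypothetical and would need to be filled in from your own inventory data, since this is not a VirtualCenter API. It also treats the four NIC roles as firm requirements, which the session left somewhat open.

```python
from dataclasses import dataclass

# NIC roles named in the session: Service Console, VM traffic, VMotion, FT logging.
REQUIRED_NIC_ROLES = {"service_console", "vm_traffic", "vmotion", "ft_logging"}


@dataclass
class Host:
    """Hypothetical inventory record; not a VirtualCenter API object."""
    name: str
    hw_virtualization: bool   # AMD-V or Intel VT enabled in the BIOS
    esx_build: str
    shared_storage: bool      # NAS or SAN
    in_ha_cluster: bool
    nic_roles: set
    ft_logging_mbps: int


def ft_host_problems(hosts):
    """Return human-readable reasons the hosts fail the listed prerequisites."""
    problems = []
    builds = sorted({h.esx_build for h in hosts})
    if len(builds) > 1:
        problems.append(f"mixed ESX builds: {', '.join(builds)}")
    for h in hosts:
        if not h.hw_virtualization:
            problems.append(f"{h.name}: hardware virtualization not enabled")
        if not h.shared_storage:
            problems.append(f"{h.name}: no shared storage")
        if not h.in_ha_cluster:
            problems.append(f"{h.name}: not in an HA-enabled cluster")
        missing = sorted(REQUIRED_NIC_ROLES - h.nic_roles)
        if missing:
            problems.append(f"{h.name}: missing NIC roles: {', '.join(missing)}")
        if h.ft_logging_mbps < 1000:
            problems.append(f"{h.name}: FT logging NIC below Gigabit Ethernet")
    return problems


good = Host("esx01", True, "build-A", True, True, set(REQUIRED_NIC_ROLES), 1000)
bad = Host("esx02", True, "build-B", True, True,
           {"service_console", "vm_traffic", "vmotion"}, 100)

for problem in ft_host_problems([good, bad]):
    print(problem)
```

An empty list means every host passes the checks listed here; it says nothing about VM-level restrictions, which are a separate matter.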

VMware FT can’t protect VMs that are using thin provisioned disks; disks must be “thick.” Disks will be automatically made thick when VMware FT is enabled. (How does this impact vStorage Thin Provisioning?) VMs can’t have any non-replayable devices (USB, sound, physical CD-ROM, physical floppy, physical-mode RDMs) and paravirtualization-based VMs are not supported. Otherwise, all VMs are supported, both 32-bit and 64-bit guest operating systems.

Once all the prerequisites are met, VMware FT can be enabled by simply right-clicking on a VM and selecting “Turn Fault Tolerance On”. After FT is fully ready to protect the VM, the icon color will change and the FT status will read “Protected”. A new “Fault Tolerance” pane within VirtualCenter will show all the VMware FT statistics and information, like the location of the secondary VM, the amount of secondary VM CPU and memory usage, latency, and log bandwidth.

The Fault Tolerance status will have a number of different states:

  • Enabled-Running (fully protected, this is the desired state)
  • Enabled-Starting (FT is getting started)
  • Enabled-Needs Secondary (a failure has occurred, a new secondary needs to be created)
  • Disabled (VMware FT is disabled)

Disabling FT is better than turning off FT. When FT is disabled, all the underlying infrastructure is maintained, which makes it easier to re-enable FT at a later date or time. Turning off FT removes all the underlying infrastructure and setup.

The secondary VM shows up as “VM Name (secondary)” in VirtualCenter. It will not show up in the inventory, but it will show up in the list of VMs for a cluster or in the list of VMs for the secondary host. This may be confusing but is based on customer feedback. There will be only certain places where the secondary VM will appear.

The Maps tab will have a way to show the link between the primary VM and the secondary VM.

There will be a number of FT-related events and alarms; the “Enabled-Needs Secondary” state, for example, is one place where an alarm already exists.

When considering VM migrations, either the primary or the secondary can be moved via VMotion, but both cannot be moved at the same time. There is a built-in rule in DRS to keep them on separate hosts. The DRS mode for fault-tolerant VMs is set to “Manual”; this means the user must explicitly choose to initiate a VMotion. FT must be temporarily disabled to do a Storage VMotion, or you could power off the VMs and do a datastore migration at that time.

Again, no network connections are lost during a failover.

Dan covered again the interplay between VMware HA and the placement of the secondary VM, and the use of a modified VMotion to provision the secondary VM. That is a very quick process, meaning that the FT status will return to “Enabled-Running” very quickly.

In the event of a multiple host failure, VMware HA will restart the primary and VMware FT will recreate the secondary VM to establish redundancy. In the event of a guest OS software failure, VMware FT won’t do anything because the primary and secondary are in sync; both will fail at the same time and in the same place. VMware HA failure monitoring can restart the primary; the secondary will then be recreated via special VMotion to re-establish redundancy.

What applications are suitable for VMware FT?

  • Applications that run well on uniprocessor VMs
  • Applications that can tolerate a small increase in network latency
  • Applications that have medium network bandwidth requirements (less than 600Mbps)

Examples of this would include medium-sized database applications, messaging applications, or important custom applications.

It’s important to keep in mind that the bandwidth of the FT logging NIC may become a bottleneck; watch how many FT pairs are placed on a single host or move to 10Gbps if available. You may also want to reconfigure some guest OSes (Linux with 1000Hz timer interrupts) to use a slower interrupt timer.
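A back-of-the-envelope calculation helps when deciding how many FT pairs to place on one host. The Python sketch below is only arithmetic under stated assumptions: the per-pair logging rates are made-up inputs you would replace with measured values, and the 20% headroom is an arbitrary safety margin, not a figure from the session.

```python
def ft_logging_utilization(pair_log_mbps, nic_mbps=1000):
    """Fraction of the FT logging NIC consumed by the given FT pairs."""
    return sum(pair_log_mbps) / nic_mbps


def fits(pair_log_mbps, nic_mbps=1000, headroom=0.2):
    """True if total logging traffic leaves the requested headroom free."""
    return sum(pair_log_mbps) <= nic_mbps * (1 - headroom)


light = [100, 200, 300]   # hypothetical per-pair logging rates in Mbps
heavy = [300, 300, 300]

print(ft_logging_utilization(light), fits(light))
print(ft_logging_utilization(heavy), fits(heavy))
print(fits(heavy, nic_mbps=10000))  # the same pairs on a 10Gbps logging NIC
```

If the second case fails, the choices are the ones mentioned above: place fewer FT pairs on that host or move the FT logging network to 10Gbps.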

The session wrapped up with a customer portion by Mark Vaughn of The First American Corporation. Due to other schedule requirements, I didn’t stay for that portion of the session. Next up is some time in the Solutions Exchange. Stay tuned for more updates as soon as I can get network coverage.
