Design Question from a Reader

I had a reader contact me and ask if he could ask the rest of the readers a vSphere design question. I thought that it might start an engaging and interesting discussion around vSphere design, so here’s the reader’s scenario and question(s):

I am looking to design an ESXi environment to potentially deploy Microsoft SQL Server systems that require extremely high availability at a scale of 50+ MSCS/WFC clusters. We’d like to do this in an ESXi 4.1 environment using Windows Server 2008 R2, MSCS/WFC, and SQL Server 2008 with Fibre Channel storage. I’ve done this in the past on a smaller scale (3-4 total clusters) and know most of the caveats, such as proper heartbeat requirements, no HA/DRS support, physical compatibility mode RDM requirements for shared disks, eagerzeroedthick OS disks, no round-robin multipathing, etc.

The issues I’ve run into in the past revolved around managing these virtual servers differently than other guests, since they couldn’t readily be moved between hosts. We also found that the reboot time on hosts with MSCS/WFC using RDMs was extremely slow (in excess of 45 minutes to fully reboot; we could speed this up by pulling the fibre cables).

Some of the design considerations I’m curious about would include:

  • Where do people put the VMFS/RDM file links?
  • Do people put the guests in different clusters? Is this even possible?
  • How do people separate active/passive nodes? Do people use host based affinity rules to accomplish this?
  • Do reboot times on hosts with lots of RDMs get linearly slower as more MSCS/WFC RDMs are presented to a host?
  • Do people really push back and try to get database mirroring instead of clustering? If so, what caveats around this have people encountered?

I’m just curious how others are handling situations like this or if anyone is really doing it at scale.

Thoughts? What do you guys think about this reader’s situation? I’d love for this to jump start a conversation here with recommendations, experiences, additional questions, etc. vSphere design is a topic that lots of readers are tackling, either for certification or just because that’s their job, and the discussion around this scenario could end up exposing some useful resources and information.

So jump in with your thoughts in the comments below! I only ask that you provide full disclosure with regards to vendor affiliations, where applicable. Thanks, and I look forward to seeing some of the responses.



  1. Jason

    The slow host reboot issue is still being worked on. Subscribe to this KB for updates on that particular problem (apologies if you are already familiar with it):

    I seem to remember reading that a fix was coming for that later on this year but I unfortunately cannot recall where I read it.

    We decided to move away from MSCS clustering toward mirroring for pretty much all the reasons you mention. Some of the downsides (as I understand them; we aren’t there yet) are that you have to keep the full recovery model for mirrored databases (no simple model), mirroring doesn’t perform host availability or health checks (no automatic failover), and there’s the obvious double storage hit if you don’t have array-based dedupe. I am not a DBA by trade, so there may be other issues, but those are the ones I came up with during my research.

  2. James Arenth

    To address the slow boot of the ESX Hosts, take a look at this KB:

    ESX/ESXi 4.x hosts hosting passive MSCS nodes with RDM LUNs may take a long time to boot

    I personally will be starting to look at SQL mirroring soon due to the additional challenges of hosting MSCS VMs: having to set a different pathing policy on each RDM, LUN documentation, DRS rules, etc. While SQL mirroring does not solve all of the challenges, I believe it may allow for faster overall provisioning while meeting the business needs.

  3. Larry Orloff

    One of the issues with MSCS clusters is the requirement for RDMs. The more clusters you need, the more RDMs you need. If you figure a minimum of four RDMs per MSCS cluster (1 disk quorum, 1 DB, 1 logs, 1 MSDTC), you start running into limits on the number of disks presented to each ESX host. If you’re talking 50 clusters, you’re talking a minimum of 200 RDMs, not even including the VMFS volumes you need to present. This forces you to scale both clusters and hosts to a very large level just to support the number of disks presented to your hosts. You definitely want to push back if possible, if only because of the hardware requirements on the host side to actually implement this.
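    The scaling arithmetic above can be sketched quickly. This is only a rough illustration; the four-RDMs-per-cluster layout comes from the comment, and the 256-LUN-per-host ceiling is an assumption based on the ESX/ESXi 4.x configuration maximums.

```python
# Rough sizing sketch for MSCS-on-vSphere RDM counts.
# Assumptions: 4 RDMs per cluster (quorum, DB, logs, MSDTC) as described
# in the comment above; 256 LUNs per host as the ESX/ESXi 4.x ceiling.

RDMS_PER_CLUSTER = 4       # quorum + DB + logs + MSDTC
MAX_LUNS_PER_HOST = 256    # assumed ESX/ESXi 4.x per-host LUN maximum

def rdm_count(clusters: int) -> int:
    """Total shared-disk RDM LUNs needed for a given number of MSCS clusters."""
    return clusters * RDMS_PER_CLUSTER

def luns_left_for_vmfs(clusters: int) -> int:
    """LUN slots left for VMFS datastores if every RDM is presented to one host."""
    return MAX_LUNS_PER_HOST - rdm_count(clusters)

print(rdm_count(50))           # 200 RDMs for 50 clusters
print(luns_left_for_vmfs(50))  # only 56 LUN slots left for VMFS on that host
```

    At 50 clusters the RDMs alone consume most of a host's LUN budget, which is exactly the scaling pressure the comment describes.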

  4. Skypoint

    James: that KB article won’t help if you are running 4.1. VMware told us they have a fix for it, but it’s not released yet.

    I personally think one of the biggest problems with MSCS running on FC on ESXi is the lack of vMotion for the VM, as it makes patching ESXi servers quite time-consuming because you need to fail over the clusters.

    Where do people put the VMFS/RDM file links?
    Same place as C: in my experience :)

    Do people put the guests in different clusters? Is this even possible?
    I have never done this myself and can’t really see the reason for it.

    How do people separate active/passive nodes? Do people use host based affinity rules to accomplish this?
    That’s what I have done.

    Do reboot times on hosts with lots of RDMs get linearly slower as more MSCS/WFC RDMs are presented to a host?
    It seems so, yes. But a patch is on its way.

  5. bnj04

    Why not use just VMware HA and mirroring at the storage layer (like DataCore, for example)? Is a few minutes to reboot not good enough?

  6. James Arenth

    For ESX and ESXi 4.1 hosts

    To resolve this issue in ESX/ESXi 4.1 hosts, you must modify the Scsi.CRTimeoutDuringBoot parameter from the GUI.

    To modify the Scsi.CRTimeoutDuringBoot parameter:
    1. Go to Host > Configuration > Advanced Settings.
    2. Select SCSI.
    3. Change the Scsi.CRTimeoutDuringBoot value to 1.

  7. Erick Moore

    I guess the first thing we need to understand is what level of downtime is acceptable, and whether your application tolerates an MSCS failover event. I have built many MSCS clusters inside of vSphere, and it always ends up being a management nightmare. If you absolutely cannot convince your application owners not to use MSCS, then especially at this scale you will want dedicated clusters to house your MSCS systems. Imagine you need to put a host in maintenance mode but you can’t, because you have guests pinned to it because of MSCS. This is why it is critically important to make sure the app tolerates an MSCS failover.

    You are also going to have to distribute your load accordingly, since you can’t use DRS. What will you do if one or several of your guests start consuming massive amounts of CPU on a host? How will you make certain that a few rogue VMs won’t kill the performance of the other systems on that host? There are just so many things you lose by building an MSCS cluster in vSphere that I would avoid it at all costs. VMware has given us plenty of alternatives, and it is advisable to use them.

  8. Jeremy Carter

    All in all, I think mirroring is a much simpler and more elegant solution, especially when you start throwing virtualization into the mix.

    I used to be a DBA, and personally I generally preferred mirroring and pushed my clients that way. I like it because it is a shared-nothing architecture, versus clustering, which is still dependent on the same storage. Mirroring is also much simpler to configure and troubleshoot. In fact, the mirroring config can be completely automated.

    The downsides of mirroring are that some operations are more complicated than with clustering. For example, clustering is configured at the instance level, so it does not matter how many databases, logins, jobs, etc., you have; you do it once and you’re done. Mirroring, on the other hand, is per database and does not have any native solution for syncing logins, jobs, etc. There are ways to script these out, though, and there are several solutions posted online to help automate it.

    Mirroring can provide automatic failover with a couple of caveats. You need a third instance to act as a witness (although this can be a SQL Server Express edition, and it uses very little in the way of resources). Also, your app must support it; the IP address does not move, as each server has its own IP. .NET has built-in support, along with a few other languages, but if it is not supported you can write a little code to select which server to connect to.
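    The “little code to select which server to connect to” might look something like this. This is a minimal sketch under stated assumptions: `connect` stands in for whatever driver call the app actually uses, and the principal/mirror host names are made up for illustration.

```python
# Minimal client-side failover sketch for database mirroring.
# Each server has its own IP; the client tries the principal first
# and falls back to the mirror. "connect" is a placeholder for the
# real driver call (e.g. an ODBC/ADO.NET connection attempt).

SERVERS = ["sql-principal.example.com", "sql-mirror.example.com"]  # hypothetical names

def connect_with_failover(connect, servers=SERVERS):
    """Try each candidate server in order; return the first live connection."""
    last_error = None
    for host in servers:
        try:
            return connect(host)
        except ConnectionError as exc:  # a real driver raises its own exception type
            last_error = exc
    raise last_error

# Usage with a fake driver that only answers on the mirror:
def fake_connect(host):
    if host != "sql-mirror.example.com":
        raise ConnectionError(f"{host} is down")
    return f"connected:{host}"

print(connect_with_failover(fake_connect))  # connected:sql-mirror.example.com
```

    A real application would also cache which server last answered, so reconnects after a failover go straight to the new principal.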

    In the end it will come down to what is more comfortable for everyone to support. While from a virtualization perspective it would be easy to say you aren’t supporting clustering and to use mirroring, the DBA team may not have the bandwidth to support mirroring. A good sit-down with the DBA team, Windows admin team, and virtualization team would help hash out all of those operational questions.

  9. Cristi Romano

    I might be off subject, but seeing the caveats of MSCS and also the caveats of mirroring (the biggest one: no automatic failover), I would ask: wouldn’t Fault Tolerance be an option?
    All you need to take care of is a good FT network connection. You don’t need RDMs.
    Has anyone tried FT with SQL Server?


  10. Paul

    Why not look at ApplicationHA from Symantec? I haven’t used it, but it’s supposed to solve a lot of the issues of having MSCS/WFC on vSphere. If anyone else has used it, could they post their experiences? The only downside I can see is that it won’t protect you from a totally corrupt OS or a BSOD that persists after reboot, whereas MSCS/WFC or DB mirroring would (although I’ve seen fewer and fewer of these types of OS crashes since we went mostly virtual). There’s also an extra cost involved.

  11. PiroNet

    I’ll try to give here some answers to your direct questions.

    •Where do people put the VMFS/RDM file links?
    Within the VM home folder unless you set the compatibility mode to virtual.

    •Do people put the guests in different clusters? Is this even possible?
    I don’t think it is even possible, since the resource is provided by a host that is part of the cluster. But you could point to an existing mapping file that was created by another host/cluster, provided the LUN is unmasked to the host(s).

    •How do people separate active/passive nodes? Do people use host based affinity rules to accomplish this?
    Yes, with the DRS VM-Host affinity rules available in vSphere 4.1.

    •Do reboot times on hosts with lots of RDMs get linearly slower as more MSCS/WFC RDMs are presented to a host?
    Unfortunately, yes. This VMware KB can help you:
    Trust me, this is going to be your worst concern over time, and it will require some management overhead/scripting skills until VMware publishes a patch.

    •Do people really push back and try to get database mirroring instead of clustering? If so, what caveats around this have people encountered?
    Jeremy Carter gave a very good answer to this question especially the caveats.

    Personally, I think MSCS is a mature BUT aging technology. SQL mirroring, on the other hand, is a rising technology that I appreciate a lot.

    Perhaps other technologies should be considered but it all depends on your functional requirements…


  12. Brandon

    Boy, this conversation brings up some memories I’d rather forget, not to mention some horrible management level decisions that get made. Why do we need app-level clustering, why not use fault tolerance with tons of sql vms, one database per! WE BOUGHT THE DATACENTER VERSION LIKE YOU SAID, RAWR. Yeah, that doesn’t have the same level of protection, next!

    Mirroring is great, but it is plagued with other issues, the main one being that a lot of out-of-the-box apps don’t support mirroring directly. One solution is to use MS NLB, but that also involves some virtualization design changes which I won’t go over; the other issue there is that failback with NLB can be challenging. Mirroring REALLY shines when the apps support it or leverage the MS native client (like vCenter does :D); you can have some fancy stuff going on. As mentioned, the witness server can even be SQL Express; it’s kind of neat watching it run, monitoring a bunch of DBs.

    Why would you set any DRS affinity rules for VMs which can’t have DRS enabled anyway? HA shouldn’t be restarting them, so I’m curious how this is involved. Is there a plan to leave that on for the passive nodes?

    Finally, while I don’t want to say anything to make me appear “non-virtualization friendly”… physical hosts running SQL in a clustered configuration, especially active/active, are ultimately a form of consolidation themselves. I don’t see how many SQL virtual machines sprawling around to meet performance goals (not to mention the many gotchas) outweigh that benefit. Also, take into account that most SQL VMs I’ve seen still have SQL agents for backups or maintenance plans, and the justification for virtualizing starts to get weaker and weaker. A couple of huge physical hosts running a large cluster is what we ended up with, and in the end we killed a bunch of SQL VMs and other physical hosts. It is even possible to set up a mirror (simply for replication) or log shipping to a remote site for a cold setup later, or for a hot solution you could have a third node in a geo-cluster. SRM would be a good justification for virtualization; beyond that, most of the other benefits are negated.

  13. slowe


    Don’t worry about “non-virtualization friendly” responses! The best solution is the best solution, and if there is a gap in VMware features or support that prevents it from being the best solution, so be it.


    Thanks for your responses! Let’s keep the discussion going.

  14. Mihai

    In our company we avoid MS clustering at all costs because of the above-mentioned issues, by having the following policy:

    1. Discuss with application owners and try to convince them to use just VMware HA and forget about MSCS by telling them that:
    – OS boot time in a VM adds only 10 seconds to a failover in case something goes wrong
    – If they are concerned that VMware cannot monitor SQL, we suggest Symantec/Veritas integration with VMware HA
    – Better availability due to vMotion and Storage vMotion (zero downtime for infrastructure maintenance)
    – Simpler architecture: easier to deploy, explain, administer, etc.
    – Mitigate OS corruption with daily full VM image backups

    2. If we can’t convince them to do no. 1, we deploy on native physical machines and consolidate databases into fewer instances/nodes if possible; there is no reason to pay VMware for vSphere, as it only gets in the way.

  15. Jason

    This is a really good discussion. I’d like to see more people pose these types of questions as they arise.

    Brandon, I agree virtualizing just for the sake of virtualizing isn’t a sound reason by any means.

    I can see one major issue with the FT recommendation: no SMP support. I highly doubt most people run SQL Server with one vCPU, so it’s difficult to get buy-in on that. FT also doesn’t provide the ability to “test” MS patches on a passive node beforehand; if the patches have a problem, in FT you have a major issue.

    In some environments, minutes of downtime on CRITICAL systems just is not an option, so HA isn’t always viable. Also, in the event of an HA failover, isn’t there concern around data integrity at some point? I’m not a SQL DBA, but can’t there be issues when a DB crashes? Please correct me if I’m wrong, but I believe the Symantec product has the same issue in a host failure.

  16. solgae

    “Why would you set any DRS affinity rules for VMs which can’t have DRS on anyway. HA shouldn’t be restarting them, so I’m curious how this is involved. Is there a plan to leave that on for the passive nodes?”

    Not having DRS anti-affinity is a problem when you need to power a VM off and then back on, because there’s a chance that the cluster will start the VM on the host where the other MSCS node is running, assuming that your MSCS setup hosts the nodes on different hosts.

    This is why, as per the documentation, you need to set the DRS advanced option that enables strict affinity control (ForceAffinePoweron = 1) to prevent them from powering on on the same host.

    HA is supported with MSCS since vSphere 4.1, but make sure to set up DRS groups and VM-Host affinity rules as per the documentation, since HA doesn’t obey DRS affinity rules.
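    The intent of those anti-affinity rules can be illustrated with a small placement check. This is just a sketch of what the rule enforces, not anything vSphere itself exposes; the VM and host names are made up.

```python
# Sketch of what a VM-VM anti-affinity rule enforces: no two VMs in the
# rule may land on the same host. Handy for sanity-checking a planned
# placement of MSCS node pairs before powering the VMs on.

def violates_anti_affinity(placement, anti_affinity_groups):
    """Return the (vm_a, vm_b, host) triples that break a rule.

    placement: dict mapping VM name -> host name
    anti_affinity_groups: list of VM-name sets that must not share a host
    """
    violations = []
    for group in anti_affinity_groups:
        vms = sorted(group)
        for i, a in enumerate(vms):
            for b in vms[i + 1:]:
                if placement.get(a) is not None and placement.get(a) == placement.get(b):
                    violations.append((a, b, placement[a]))
    return violations

# Hypothetical MSCS node pair accidentally placed on the same host:
placement = {"sqlnode1": "esx01", "sqlnode2": "esx01", "sqlnode3": "esx02"}
rules = [{"sqlnode1", "sqlnode2"}]
print(violates_anti_affinity(placement, rules))  # [('sqlnode1', 'sqlnode2', 'esx01')]
```

    The strict-affinity option mentioned above is what makes vSphere itself refuse the bad power-on instead of leaving this check to the administrator.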

  17. Brandon

    @solgae: DRS initial placement should not be used at all. If you set the per-VM setting to “partially automated” rather than manual or disabled, then I would agree with you. Manual would prompt you to choose the host, and disabled would just power the VM on on the host where it is currently registered.

    As far as the remainder of your post, HA being supported as of 4.1 — new one for me, very cool. I’ll have to read the documentation. That does indeed justify the affinity rule when you set the advanced option as well. It makes sense to me now.

  18. Brandon

    @myself ;) I just read the MSCS on 4.1 doc regarding HA and DRS. Looks like it does allow you to set the VM to partially automated. Neat. I wish I had read that before I responded; oh well.

  19. Eric Gray

    Does in-guest iSCSI address any of the issues? Is that even viable?

  20. Jason

    Correct me if I’m wrong, but isn’t in-guest iSCSI only for iSCSI arrays? This issue is specifically about W2K8 MSCS, ESX(i) 4.x, and Fibre Channel arrays. I believe the root of the issue was the MSCS change to SCSI-3 persistent reservation requirements.
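    Those SCSI-3 persistent reservation semantics can be sketched with a toy model. This is a deliberately simplified illustration (a Write Exclusive – Registrants Only reservation, which is what Windows failover clustering uses on shared disks), not the full SPC-3 behavior; the class and node names are made up.

```python
# Toy model of a SCSI-3 persistent reservation (Write Exclusive -
# Registrants Only). Registered initiators may write; anyone else gets
# a reservation conflict. This is why a passive MSCS node can stay
# registered on a shared LUN while the active node holds the reservation.

class Lun:
    def __init__(self):
        self.registrants = set()  # initiators that issued a REGISTER
        self.holder = None        # initiator holding the reservation

    def register(self, initiator):
        self.registrants.add(initiator)

    def reserve(self, initiator):
        if initiator not in self.registrants:
            raise PermissionError("must register before reserving")
        self.holder = initiator

    def write(self, initiator):
        # Under Write Exclusive - Registrants Only, any registrant may
        # write once a reservation exists; non-registrants conflict.
        if self.holder is not None and initiator not in self.registrants:
            raise PermissionError("RESERVATION CONFLICT")
        return "ok"

lun = Lun()
lun.register("node1")
lun.register("node2")
lun.reserve("node1")       # node1 is the active cluster node
print(lun.write("node2"))  # ok - node2 is a registrant (the passive node)
# lun.write("other-host")  # would raise PermissionError: RESERVATION CONFLICT
```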

  21. Anthony

    I often encounter MSCS hate but don’t see it.
    I have never had any problems with it and would use it if I had requirements for a highly available, business-critical app, and I have done so in the past.
    Firstly, FT does not supply SMP or OS protection (i.e., driver freezes, blue screens, patching issues); it just instantly replicates the problem.
    In my location dark fibre is plentiful, so quite often we do campus or metro clusters with EMC’s Cluster Enabler and consistency groups, which puts a storage step in the MSCS clustering stack. This means you don’t have to share the same array, but can use two separate arrays, giving you even more failover.
    Also, I don’t understand the issues people bring up with patching, etc. Patching with clusters is easy in my opinion: patch the passive server, start it up, test it, fail the cluster over, patch the other side. It actually allows you to test the patching before you do the failover; how is that a bad thing?
    I personally have not seen problems with administration of MSCS clusters to the level you are quoting. I don’t have a big problem with slow reboot times, as the other side of the cluster is up and available; isn’t that the point?
    As far as rules, I ensure the nodes are separate using anti-affinity, and I create reservations to ensure performance. If possible I will consider keeping them in a separate cluster, but only if that makes economic sense.
