The View from the Other Side

Steve Chambers at ViewYonder recently published an article countering the common “all my eggs in one basket” argument that is so frequently leveled against virtualization. Apparently, a comment in my Virtualization Short Take #37 post led him to believe that I am not, in his words, “fully on board yet”. So, I thought it might be helpful to discuss the view from the other side—from the perspective of those who aren’t “fully on board yet”.

Here’s the quote that apparently caught Steve’s eye:

The ages-old discussion of scale up vs. scale out is revisited again in this blog post. I guess the key takeaway for me is the reminder that while VMware HA does restart workloads automatically, there’s still an outage. If you’re running 50 VMs on a host, you’re still going to have an outage across as many as 50 different applications within your organization. That’s not a trivial event. I think a lot of people gloss over that detail. VMware HA helps, but it’s not the ultimate solution to downtime that people sometimes portray it as.

Steve’s article makes some good points; I particularly agree with his assertion that organizations that were bad at IT before virtualization are going to be really bad after virtualization. However, the underlying sentiment of the article seems to be one of “Don’t worry, just put all your applications in and everything will be fine”. Sorry, I don’t agree there. Feel free to put all your eggs in one basket if you prefer, but make sure you understand the value of those eggs and are willing to accept the risks that result.

Let me make something clear: I’m not advocating against high consolidation ratios. What I’m advocating against is a blind race for higher and higher consolidation ratios simply because you can. Steve’s article seems to push for higher consolidation ratios simply for the sake of higher consolidation ratios. I’ll use a phrase here that I’ve used with my kids many times: “Just because you can doesn’t mean you should.”

Just because you can run 100 workloads on a single server doesn’t mean you should run 100 workloads on a single server. Yes, VMware HA will help you mitigate the overall effect of a hardware outage should one occur, but it doesn’t change the fact that an outage will occur. Don’t rely upon VMware HA as the be-all and end-all solution, because it isn’t. Is it good? Yes, it is. Is it good enough for your business when you have very high consolidation ratios? That depends upon your organization and your applications.

I’ve always favored a measured, deliberate approach to virtualization that takes into consideration all the various aspects of the workloads that you are virtualizing. In my view, it is critical to understand aspects of applications beyond disk I/O requirements, CPU utilization, or memory usage. You need to understand things like the relationships between applications (which applications are dependent upon other applications?) and business dependencies upon applications (what parts of my business are affected if this application is unavailable?). Understanding the cost of an application outage is another aspect that must be considered in any virtualization approach. In my opinion, the cost of an application outage is the single most important measure of impact.
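To make the dependency point concrete, here’s a quick, purely illustrative sketch in Python. The application names and the dependency map are invented; the point is simply that even a trivial model lets you answer the question “what else breaks if this application goes down?”

```python
# Purely illustrative: a toy application dependency map.
# Each key lists the applications it depends on.
DEPENDS_ON = {
    "web-storefront": ["order-db", "auth-service"],
    "order-db": [],
    "auth-service": ["directory"],
    "directory": [],
    "reporting": ["order-db"],
}

def affected_by(failed_app):
    """Return every application that directly or indirectly depends on failed_app."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for app, deps in DEPENDS_ON.items():
            if app in affected:
                continue
            if failed_app in deps or affected.intersection(deps):
                affected.add(app)
                changed = True
    return affected

print(affected_by("order-db"))   # web-storefront and reporting are also impacted
```

In a real environment you’d pull this information from a CMDB, a discovery tool, or plain old conversations with application owners, but the exercise of building the map is valuable in itself.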

How much will it cost your business if an application is down for 2 minutes? How much will it cost your business if 50 applications are down for 2 minutes? If these aren’t mission-critical applications, then fine; throw them all onto a set of servers and rely on VMware HA to restart them automatically. But do so recognizing that in the event of a hardware or hypervisor failure, there will be an outage. Make sure you understand the impact and the scope of that outage.
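To put some numbers behind those questions, here’s a back-of-the-envelope sketch. Every figure in it is an assumption I made up for illustration; plug in your own consolidation ratio, restart time, and per-minute cost.

```python
# Rough, illustrative outage-cost estimate. Every figure below is an assumption;
# substitute real numbers from your own business.
vms_per_host = 50           # consolidation ratio
outage_minutes = 2          # time for VMware HA to restart workloads elsewhere
cost_per_app_minute = 500   # assumed average loss per application per minute, in dollars

single_app_cost = outage_minutes * cost_per_app_minute
host_failure_cost = vms_per_host * single_app_cost

print(f"One application down for {outage_minutes} minutes: ${single_app_cost:,}")
print(f"One host failure affecting {vms_per_host} applications: ${host_failure_cost:,}")
# One application down for 2 minutes: $1,000
# One host failure affecting 50 applications: $50,000
```

The per-application number may look tolerable on its own; it’s the multiplication by the consolidation ratio that determines the real exposure of a single host failure.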

If, on the other hand, the cost of having these applications unavailable is too high to allow VMware HA to be the sole mechanism whereby you protect yourself against an outage, then it’s time to start employing other mechanisms. These might include VMware Fault Tolerance (FT), OS-level clustering, N+1 redundancy (for those applications that support it; many applications don’t), or—here’s the kicker—deliberately suppressing the overall consolidation ratio in order to limit the scope of an outage. That’s right: you might choose to use a lower consolidation ratio for a certain application or group of applications in order to limit the scope of an outage.
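Here’s a toy model, with hypothetical host names, VM names, and placement, of why N+1 redundancy changes a single host failure from “application outage” into “reduced capacity”:

```python
# Toy model (hypothetical hosts, VMs, and placement) of how N+1 redundancy plus
# separation across hosts changes what a single host failure means for an application.
placement = {
    "esx01": ["payroll-node1", "crm"],
    "esx02": ["payroll-node2", "intranet"],
    "esx03": ["payroll-node3", "build-server"],
}
app_instances = {"payroll": ["payroll-node1", "payroll-node2", "payroll-node3"]}

def app_survives(failed_host, app):
    """True if at least one instance of the app keeps running when failed_host dies."""
    lost = set(placement[failed_host])
    return any(instance not in lost for instance in app_instances[app])

for host in placement:
    print(f"{host} fails -> payroll stays up: {app_survives(host, 'payroll')}")
# Every single-host failure leaves two of the three payroll nodes running.
```

That only holds, of course, if the redundant instances really do land on separate hosts, which is exactly where the anti-affinity rules discussed below come in.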

This brings us full circle, back to the “scale up vs. scale out” argument that started this whole discussion. Just because an organization can buy a full-width UCS blade with 384GB of RAM and run 100 workloads on it doesn’t mean they should. A perfectly valid decision would be to deploy a larger number of smaller servers in a “scale out” fashion so as to limit the scope of an outage. That would go hand-in-hand with the use of other mechanisms like spreading VMware vSphere clusters across multiple chassis to protect against the failure of a single chassis, using VMware FT to protect applications whose downtime cost is too great to rely upon VMware HA alone, or using DRS affinity rules to force workloads onto separate hosts so that the scope of a failure is further constrained. All of these decisions are perfectly valid ways of limiting the scope (and thus the impact) of an outage.
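As a rough illustration of the blast-radius angle, here’s a small comparison using made-up numbers (200 workloads, two hypothetical designs); it isn’t a recommendation for either design, just a way of seeing what one host failure costs you in each case.

```python
# Illustrative comparison of failure "blast radius" for the same 200 workloads.
# The numbers are assumptions, not recommendations.
total_vms = 200
designs = {
    "scale up":  {"hosts": 2,  "vms_per_host": 100},   # two big hosts
    "scale out": {"hosts": 10, "vms_per_host": 20},    # ten smaller hosts
}

for name, design in designs.items():
    blast_radius = design["vms_per_host"]   # VMs restarted if one host fails
    share = blast_radius / total_vms
    print(f"{name}: one host failure hits {blast_radius} VMs ({share:.0%} of the estate)")
# scale up: one host failure hits 100 VMs (50% of the estate)
# scale out: one host failure hits 20 VMs (10% of the estate)
```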

Taken in this context and in this view, you can now see that consolidation ratio is simply another design decision that is made based on both technical and business reasons, and it becomes an aspect of the design that you control, instead of it controlling you. Consolidation ratio becomes just another tool to use in protecting your organization against the impact of a potential outage. Don’t shoot for the highest possible consolidation ratio; aim for the right consolidation ratio that meets your organization’s needs.

Courteous comments are always welcome!


25 comments

  1. Simon Gallagher

    Some very good points in there. People always seem to forget that making the applications themselves redundant/tolerant to faults (where possible) is probably more important than the VMware HA part – HA/FT are very useful, but they are not a magic availability bullet for the guest apps/OS.

    In my experience, guest OS outages are more likely to happen than a hypervisor failure or a total host hardware failure – a BSOD, or a reboot for monthly MS patching, etc.

    VMware doesn’t/can’t fix this for you, although HA can reduce the RTO by automating the restart on standby hardware – as Scott points out, there is still an outage – and the more VMs restarting in parallel, the longer the overall delay, and then there are the startup dependencies for multi-tier apps (app/db/etc.).

    In a high consolidation ratio environment you may need to spend more effort designing around this sort of outage (planned or otherwise) at the software level which is going to mean n+1 guest OS instances for important apps, which impacts your overall consolidation ratio as you have more VMs to start with.

  2. Duncan

    I posted a similar article a couple of weeks back, which was actually the article that triggered the one you are referring to. http://www.yellow-bricks.com/2010/03/17/scale-up/

    I agree, one should consider the risks and their impact before making a decision. Consolidation ratio should not be your goal per se. Uptime, flexibility, transportability, manageability, etc., are more important in my opinion.

  3. Mat

    Certainly making a decision on the amount of consolidation is a design decision, but it seems to me it’s a “feel-good” decision, not a technology decision.
    If downtime on an application is acceptable, then I’m not sure I understand why having simultaneous downtime on multiple applications is any worse than sporadic downtime on one application at a time. In fact, if there is a dependency between applications, it is better to have them all fall down together and restart together than to have one application fail with some set of ripple effects on the still-working applications.
    It doesn’t feel good when 50 applications fall over all at once, but I’m not sure why that is actually worse than the alternative.

  4. Stu

    I’m 100% “on board” with you, Scott! It is exactly this consideration that will drive me (and I’m sure many others) to dual- and perhaps even single-socket boxes as CPUs become more and more powerful.

    Until x86 hardware and software has mainframe-like redundancy (ever seen a CPU fail on a SPARC box? It just keeps running as if nothing happened), the risk of too many workloads on a single piece of tin should be at the forefront of consolidation considerations IMHO.

    Stu

  5. Steve Chambers

    I thought “just because you can doesn’t mean you should” was my quote? :-) So what’s the formula, then? Gotta be one. Higher density is what sells vBlocks and all that expensive storage… ;-)

  6. Josh Townsend

    You’ve hit on it exactly, Scott. There is no magic formula, no perfect consolidation ratio. We try so hard to fit everything into a certain framework, a turnkey package, or a nice little blog post (or even a 140-character tweet). It is human nature to try to filter things out to black and white, but the best technologists have some comfort in the gray areas that make each environment unique. Applying the right hardware, software, and availability techniques is a good start, but a quality solution has to consider the bottom line, as well as the people, processes, and business culture your solution is being deployed to.

  7. Dave

    Scott, agree 100%, my issue is that so many companies can’t see the big picture. VMware’s licensing pushes them towards higher consolidation ratios. What’s my cost per application for 50 apps on a host? for 100 apps on a host? The cost per application goes down dramatically. I understand the cost of an outage has to be considered, but most customers are more concerned about the cost of running the app, not what it costs when the app is not running. Especially in this economy.

  8. Dave Convery

    No, “just because you can doesn’t mean you should” is MY line… <{:o)

    I have to say that I liked Stevie’s post, Duncan’s post AND Scott’s post. Different views on the same subject that all come to similar conclusions. If you’re going to do something, make sure you PLAN for every aspect. The consequences of high consolidation ratios, HA, availability, etc. ALL need to be considered in any design.

  9. daVikes

    Very good things to keep in mind. Consolidation ratios are great when you can achieve them, but often other pressures are afoot to try and get the most possible out of the hardware and technology. However, these are great points to bring back to the discussion and to talk about, so that you are aware of the risks and the business is OK with the risks when considering consolidation ratios. Another thing to think about is how long it would take to recover if the same event happened in a physical world, where you have to deal with hardware issues. So a 2-minute outage, compared to what you may have to do in a physical world, may be why a lot of people are adopting virtualization along with DR, along with utilizing the processing power better.

  10. Marc Farley (3PARFarley)

    Steve was right about MTBC (THAT was funny) being much more prevalent than MTBF in high-end gear. Even smart people make idiotic mistakes. Limiting the scope of human error is always a good idea. Fewer servers to manage is a good idea. If higher VM densities do that, it has to be part of the formula. YSMV (your stupidity may vary)

  11. Chris

    Well-laid-out article. It should really be published in the CIO trade journals, though.

  12. Rod

    For years customers have run higher consolidation ratios for non-mission-critical workloads; however, when it comes to mission-critical workloads, they have opted for lower ratios because of the business impact of an outage. This can show up as bigger servers in non-production environments and smaller servers (often blades) in production.

    It seems odd at first glance but if you ever see this you know the customer has actually thought about the business impact of losing a single server (and you know it WILL happen someday).

  13. Robert Miletich

    I have a number of clients with very high consolidation density ratios and have had them for a few years now. UCS can support 384GB RAM in a single image, but the NEC A1160 or A1080 support up to a TB of memory, and multiples of 4 sockets as well. So do some Unisys servers. So do some IBM servers. Etc.

    People don’t know what HA means, so I have to explain it as the very first step. In that process I explain the risks of hosting a large number of VMs very clearly, as well as laying out a regimen that can reduce their hardware and VM failures, and I make sure they understand what’s going on there. You’d be surprised (maybe not) at how many times this causes the light to go on.

    The profile of companies that buy these large (and more expensive) servers is that they have a good understanding of datacenter production and operation procedures. They generally have a good understanding of ITIL, change management, operations performance monitoring, etc and are less prone to act out with departmental practices.

    BTW, these large machines are really more intended for large application requirements (headroom for large memory VM servers) but as someone mentioned above, VMware socket based licensing makes it a very attractive proposition to get as high a density as possible.

  14. John Martin

    Really interesting discussion. One of the things I have said in the past is that if you’re going to put all of your eggs in one basket, you want to make sure it’s a titanium-clad basket with a pneumatic lining, though I was talking mostly of the eggs at the storage layer. If we need to be wary of the implications of high consolidation densities at the server layer, what are the implications of even higher levels of consolidation at the storage layer?

  15. John Martin

    Thought I’d expand on this with a blog post of my own, “The Risks of Storage Consolidation”, at http://www.storagewithoutborders.com

  16. slowe

    Great comments everyone! A few thoughts…

    Duncan – I can’t believe I missed that article! I think I need to double-check my RSS feeds…I seem to be missing a number of your articles.

    Steve Chambers – No, higher VM densities don’t sell Vblocks. Simplified deployment, simplified operations, single support, and best of breed components sell Vblocks. As for the “magic formula,” I don’t think there is one. There are too many variables to consider.

    Marc – Too bad there isn’t a way to protect against user stupidity. Then again, it goes back to that old saying: every time you create something that’s foolproof, along comes a better fool.

    Robert – I’m glad to see someone else talking about large memory VMs! I’ve been saying for a while that Cisco and others were taking the wrong approach by talking about ever-higher consolidation ratios, and should instead be focusing on convincing customers of the value of moving larger workloads onto their platform.

    John – Interesting comment about your concerns. What additional steps would you propose need to be taken?

  17. Paul Sorgiovanni

    There is a way to mitigate user stupidity.

    I firmly believe that as automation develops further, the risks will decrease. We see major issues in our customer base more around long-term management than initial configuration. Designs can be perfect, but long-term execution is generally poor. This is why we tend to lead with products like SanScreen, which diagnose deviations from the initial best-practice installations.

    As part of an initial sell, we also include 3-monthly health checks of the virtualised environment that cover network, storage, server, and the virtualisation OS.

    This is something that I believe integrators do well and vendors execute poorly. The tools are there, but vendors don’t make a living by selling services.

  18. Kelly Culwell

    These are all great bits of info. It justifies the importance of setting RTOs and sev/pri ratings for applications, not just servers.

  19. Maish

    Great article, Scott, and great comments from all of you – but just by the way, the quote is mine ;)
    “But just because you can does it mean that you should.”
    http://technodrone.blogspot.com/2010/01/travelling-down-5th-ave-at-180-kmh.html

  20. Joe Onisick

    Scott, great articles by you and Steve, and sadly it only comes back to the one truth in IT: ‘it depends.’ I’ve scaled customers up well into the double digits of virtual machines per server and others in the safe 8-11 VMs per server range. It all depends on their comfort, aptitude, and, most importantly, their applications.

    I’d confidently put 300 virtual servers on a single blade in the right environment, with the right hardware, and the right staff that understood the environment. I’d also confidently implement 8:1 virtualization scenarios or even less. The fact of the matter is it always depends.

  21. Shahar Frank

    Scott, it is true that “just because you can doesn’t mean that you should,” and it is always necessary to engage the gray matter in our heads before doing things such as hosting 100 workloads on a single physical host, but on the other hand, virtualization can improve workload resiliency (and I am not referring to VMware HA):
    Before the server virtualization era, only about 30% of server malfunctions were hardware-oriented. Most errors were operational errors (“fat fingers”) and software errors. Virtualization helps to reduce operational errors by making many procedures much more controlled and/or automatic. The physical host may also be much more fault tolerant than the server used for a single workload – simply because it is more economical and practical to protect a few strong servers than to protect a lot of servers. This means you can use better hardware with better (hw) redundancy. Furthermore, virtualization has the potential to improve application endurance in ways that are impossible on non-virtualized servers: for example, say you have several physical hosts using FC storage, and you have an FC problem on one of your servers that causes storage errors. A smart virtualization system may understand that and migrate the relevant VMs to different servers that share the same storage, and thereby avoid the storage errors. In this case the virtualization system can *solve* the issue.
    Just to clarify things, I don’t claim people should put 100 workloads on a single server just because they can, and yes, the x86 architecture should be much improved (some “red books” can be a good source for some directions…), and the virtualization systems should be improved much further, but what I do claim is that in *some* aspects virtualization can improve application/workload reliability.

  22. virtuel

    Scott, this is great. I think it definitely brings all of this into perspective. We have to remember to architect solutions that are in line with the business, and so yes, it depends… but the business is the driver, not IT.

    A quick question for you, though: would you recommend products like Neverfail, etc., or others in that same family? The reason I ask is that we’re moving into a new data center with a lofty goal of 100% virtualization, but the reality is that there are SLAs that need to be met, expectations for downtime, etc. Has anyone seen implementations of any of this 3rd-party stuff?
    Duncan, are there improvements slated for FT? ’Cos right now it really is not ready for serious multi-proc, production, business-critical stuff.

    thanks guys

  23. Hari

    Does anyone have any info on average post-virtualization/consolidation utilization numbers? In pre-virtualization days, utilization numbers ranged from 10-20% for physical servers (x86 based). Virtualization was supposed to improve this situation – does anyone know what it is these days? Any Gartner studies on this?

    Thanks!

    Hari

  24. pranav

    Scott, really great article; it is good to know about the other side of it. As I am new to virtualization and have been reading quite a few articles on the “other side” of this topic, this article is very interesting to me. I really agree with your view about application outages, because that is really the factor that determines the impact. Both technical and business reasons decide the design of the application and its consolidation ratio.

  25. slowe

    Pranav, thanks for your comment. Obviously, I heartily agree that there are multiple factors that contribute to every design decision, including consolidation ratios.
