Continuing the Consolidation Discussion

This is just a quick post inspired by Mike Laverick’s recent “Stupid IT” post, in which he weighed in on the blog discussion between Steve Chambers and me regarding “putting all your eggs in one basket,” the most common argument against high consolidation ratios and, in some cases, against consolidation in general.

Mike’s articles (part 1 and part 2) are excellent. The interesting thing here is that, when you really boil it down, my viewpoint is not that far off from Steve’s or Mike’s (spoiler warning: Mike agrees with Steve). In my blog post, I tried to focus less on whether high consolidation ratios are good or bad and more on whether high consolidation ratios, and the impact of the design decision to use them, will satisfy the needs of the business.

I agree with a number of points from Steve’s post. For example, I agree the root cause of an outage is more likely to be human error than hardware failure. I also agree that building redundancy into the infrastructure helps further reduce the possibility of an outage. Mike makes the same argument:

The truth is that hardware and software components are so reliable and redundant they hardly ever fail. In fact, so much availability software is geared towards protecting the server from hardware failure that some of my peers are beginning to question why they even buy SKUs that contain VMware HA.

So if everything is so redundant and so stable, why do people buy VMware HA? Why do people use clustering solutions like Windows Failover Clustering? Why do people use VMware FT or Neverfail or any of the rest of it?

The answer is simple: fear. Businesses are afraid of their applications being unavailable. In some cases, this fear is irrational, and from this perspective I agree wholeheartedly with both Mike and Steve: don’t use the “all my eggs in one basket” argument with me just because it scares you, just because you’re afraid of running all your workloads together.

On the other hand, this fear might be justified. What if the application or applications in question are the very lifeblood of the business? If you are an online-only organization, keeping your web site up and accessible is crucial. If the web site is down, the business loses revenue. In this situation, the fear of being unavailable is justified. It’s not irrational; it’s based on a keen understanding of the needs of the business and the impact of the outage upon the business. And in those cases, where consolidation ratios are deliberately kept low in order to satisfy the needs of the business, I’ll accept the “all my eggs in one basket” argument.

Come to me with the “all my eggs in one basket” argument backed by irrational fear and a lack of information, and I’ll argue against it every time. Come to me with the “all my eggs in one basket” argument backed by an understanding of how IT aligns with the business and the impact of an outage on the business, and I’ll listen to, and possibly even agree with, your position. As I and so many others have stated on numerous occasions, don’t pursue high consolidation ratios for the sake of high consolidation ratios. Pursue them because they make the most sense for the business.

In the end, I guess my point is that both Steve and Mike have missed the point. Not that their viewpoints are irrelevant; quite the opposite! Both of them make very good points that are quite relevant and pertinent to the discussion of “Why not higher consolidation ratios?” Unfortunately, that’s not the question that needs to be asked or answered. The question should be, “What is best for the business?” In that context, putting “all your eggs in one basket” isn’t always the best answer.

Courteous comments welcome!

8 comments

  1. Steve Chambers

    It’s always good to leave a point out (completely different from missing it!) to see if someone can “fill it in”. The ITIL guys failed miserably, whereas you’ve done much better!

    There’s another factor that seems to be hard for enterprises to swallow and that’s “consolidating workloads over time”, or put another way, reducing the size of the trough (of no workload) by over-selling infrastructure capacity a-la-Cloud to get an even higher ROI by – AGHAST! – squashing even more eggs into the basket over a period of time.

    Extending a Hadoop cluster into “spare capacity”, or any other app depending on how good you are, is a way to add more eggs.

    There are other facets to this argument that we’ve left out (not missed!) that include the criticality of eggs, mixing different eggs together, overselling eggs… it’s a big, interesting topic that I am sad enough to enjoy…

    I think this would make a great debate but, sadly for us three (Me, Mike, Scott), I think we all really agree… we need someone who just doesn’t get it at all… any takers? :-)

  2. Anon

    I used the comparison between recovering a physical server from tape and recovering VMs with VMware as a big driver for moving to VMware, along with the fact that most hardware spend is wasted on applications that mostly sit idle, to counter the fear of running in a shared environment. With HA, if a physical server does die, my VMs are restarted very quickly on another host. In addition, using DRS to move VMs away from VMs that are using a very high CPU percentage also addresses some of the ‘all eggs in a basket’ concerns about a shared environment.

    However, the one central pain point that I think we all must explore solving in our designs is that central storage (the SAN) is a single point of failure, even with an active/active design. There is typically software in those SANs that may be one bug or software version away from causing lots of problems. In bigger enterprises you may have more frames to work with, but in small and even medium-sized businesses that can be difficult to achieve.

    As we build the internal cloud, we need capacity that scales both up and out, in a way that means we don’t need to think about what is happening on a host-by-host or SAN-by-SAN basis. Both need to scale to their capacity and then tell us to add more. As with virtualization in general, it’s all about abstracting at different levels, but when there are core, deep-in-the-guts issues that need solving, they quickly bubble up and slow the adoption process. We seem to be solving some of those with each major version of vSphere, but more work is needed on the core software to make it an easy proposition to just add capacity and not think about the details on a host-by-host basis. Storage still seems to be a major pain point for all.

  3. PiroNet

    Whilst Steve and Mike are arguing the issue from their own, quite opposite, points of view, you’re more like the referee counting the points. But you also raised a very fundamental point: fear, as in FUD.

    “What is best for the business?” as in “What is best for the company CFO?” or as in “What is best for the company CIO?” Neither of them understands what a consolidation ratio is. They just know FUD; that’s why the CFO says “Does it fit my budget?” and the CIO says “Does it fit my KPIs?” The “What is best for the business?” question stops there.

    What remains? You, me, us: the virtualization architects and designers who will enlighten the managers cited above. The real question is: what’s your acceptable consolidation ratio? 2:1, 10:1, 50:1?

    No offense Scott, just sharing thoughts on a hot subject :)

    Cheers,
    Didier

  4. slowe

    Steve,

    You’re right—there are so many aspects of this discussion that we haven’t even touched yet! And you’re right again that Mike, you, and I agree more than we disagree; that was a key point that I was trying to make here.

    Anon,

    (Disclosure: I work for EMC.) EMC’s storage federation vision is trying to help address that problem of “just adding capacity” from the storage perspective. Pay close attention to announcements coming out of EMC World 2010 (May 10-13, 2010) for more details.

    Didier,

    No offense taken! You are correct that each person’s view of the business needs is different and distinct, but that’s an organizational challenge that no technology can resolve. To paraphrase Steve Chambers from his post that started all this, if you were bad at IT before virtualization, you’ll be even worse at IT after virtualization. Thanks for your comment!

  5. Nate

    There has been a lot of discussion through these posts about redundancy, but I feel there is another concern in IT folks’ minds about “eggs in one basket”: security. We all hear from the vendors that the hypervisor is secure, but we also all know attackers are not stupid. As more workloads are moved to virtualized platforms, that hypervisor will become a prime target. One of the things that “keeps me up at night” is the thought of which workloads could or should be intermingled with each other based on security profile. Redundancy actually scares me far less. There are redundancy options for the virtualization platforms, and just because you are using virtualization does not mean you cannot still use the same application-level redundancy you would have used on physical boxes (something else I think is too often ignored in these types of conversations).

  6. slowe

    Nate,

    You have a good point—security is yet another factor that can and should be included when considering your consolidation ratio. Again, design decisions like VM placement, security zones, and consolidation ratios need to be driven by the business. Thanks for your comment!

  7. Nick

    We regularly deploy VMCO Appliances with 256 to 512GB of RAM and 100+ live VMs. Some of our smaller clients run pretty much their whole estate off 3 or 4 physical systems of this type, with appropriate redundancy in network, storage, and capacity at the VMware, Xen, and Solaris Zones layer.

    Is this fear of “too many eggs in 1 basket” an x86-64 admin’s affliction? As an old-time Solaris guy, the thought of multiple workloads, even entire multiple business divisions, running on one chunk of hardware doesn’t worry me. There are probably some guys at IBM who feel the same way. Unless of course you are talking about running on whatever $3000 special offer is going at Dell this month; now that does give me nightmares :-)

    The Hypervisor itself is going to come under huge scrutiny, and rightly so. My hope is that more of its functionality and role in partitioning a system can be subsumed by the CPU and subject to formal methods and proof of correctness. Leaving “the software guys” to get on with providing functionality. My worry is this sounds rather like the old network guy vs application guy argument of 20 years ago. “How was I to know IP addresses could be spoofed, I thought it was YOUR job to deliver me only the packets that belong to ME!?!”.

  8. adam baum

    I think I am going to have to fall into Scott’s camp on this one. Yes, equipment is better these days and yes, if you have bad practices, virtualization isn’t going to fix them. However, I have had equipment just fail. Just because equipment is better does not mean equipment is perfect. Just because you can do something doesn’t mean you should.

    I will also posit that even if you take HA, good design, and all the other factors into account, there are always going to be external influences which may drag you down. Anyone remember the VM timeout fiasco from a few years ago?

    Items from my job description include: maximize efficiencies, maximize savings/value, maximize up-time, and minimize exposure/risk. It’s a balancing act that changes on a weekly basis as new workloads come online. As part of it, I do have some fear of having “just enough” equipment to handle all my workloads, so I have designed my environment with spare capacity. I’ve also separated various workloads onto different ESX clusters, based on some internally developed criteria, to minimize the exposure related to an outage.

    This is one topic where the end result is going to be “we agree to disagree”. There is no end-all-be-all for everyone.

    Just my two cents.

    adam
