blog.scottlowe.org

The weblog of an IT pro specializing in virtualization, storage, and servers

More on Memory Overcommitment

March 18th, 2008 by slowe

After a brief mention of this topic in Virtualization Short Take #4, the battle between Citrix, Microsoft, and VMware over memory overcommitment has heated up.

The latest round comes from VMware, who provides some real-world statistics on memory overcommitment. In addition, I’ll draw readers’ attention to this comment on VMware’s original article, in which a VMware customer describes the benefits his organization is seeing from memory overcommitment. (BTW, this commenter apparently also started a VMware Communities thread which was, in turn, the basis for this article by Duncan over at Yellow Bricks. My, what a tangled web we weave!)

In any case, VMware’s response uses real data from a customer; only the names have been changed to protect the innocent. In the case study, a 64GB server has been oversubscribed to support VMs configured with a total of 89GB of RAM, yet only 20GB of the server’s 64GB is actually in use. So, by reducing the RAM configured in the server, VMware comes up with a way to show that (in this very specific example, at least) it is cheaper to buy VMware than to add RAM to the server. Looks like they called Microsoft’s bluff:

If someone can show me a customer who is running, in production, a VMware VI3 Enterprise system with a 2:1 memory overcommit ratio on all the VMs, where spending the cost of VMware on RAM wouldn’t remove the need to use overcommitment then I’ll give… lets say $270 to their choice of charity.

Apparently, VMware feels they’ve met those criteria:

So, James, the charity of choice is One Laptop Per Child. And just in case you believe that we’ve cherry picked a use case we’ll be more than happy to connect you directly via phone to any one of the numerous customers we have leveraging memory overcommitment in their environment today.
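For readers who want to check the numbers, here is a minimal sketch of the arithmetic behind the case study figures quoted above (64GB physical, 89GB configured, 20GB in use). The function names are my own, and the "RAM needed for 2:1" calculation is only an illustration of the idea of shrinking the server's RAM; it is not necessarily how VMware's article did the math.

```python
def overcommit_ratio(configured_gb, physical_gb):
    """Configured guest RAM divided by physical host RAM."""
    return configured_gb / physical_gb


def physical_for_ratio(configured_gb, target_ratio):
    """Physical RAM at which the same guests would sit at target_ratio."""
    return configured_gb / target_ratio


# Figures taken from the case study as described in the post.
configured_gb = 89   # total RAM configured across all VMs
physical_gb = 64     # RAM installed in the host
active_gb = 20       # RAM actually in use on the host

ratio = overcommit_ratio(configured_gb, physical_gb)
utilization = active_gb / physical_gb
ram_for_2_to_1 = physical_for_ratio(configured_gb, 2.0)

print(f"overcommit ratio as built: {ratio:.2f}:1")
print(f"physical RAM in use: {utilization:.0%}")
print(f"host RAM that would give 2:1: {ram_for_2_to_1} GB")
```

As built, the host sits well under the 2:1 ratio named in the challenge, which is why the argument depends on configuring the server with less RAM; the 20GB actually in use would still fit under the smaller figure.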

Now things are really getting interesting. Stay tuned!


This entry was posted on Tuesday, March 18th, 2008 at 11:01 pm and is filed under Microsoft, Virtualization.

11 responses about “More on Memory Overcommitment”

  1. William Bishop said:

    That was a seriously stupid thing to bluff on. Any shop running VMware, or even their own testers, could have easily told him this is not only easily done, it’s almost a daily routine at most places. That wasn’t even that extreme an example. Why the hell don’t they test something before calling someone out? Microsoft should change their name to “Ballmer will embarrass us, or one of our spokesmen will stick his foot in his mouth in a big way today”.

    Research first, people; then you get to avoid embarrassing moments like the one highlighted.

  2. William Bishop said:

    What are the odds the guy will actually pay up?

  3. Brent O. said:

    Yeah, but try rebooting say 1/4 of those guests at once, and watch what happens to memory use. We had several horrendous experiences during our Patch Nights, evenings when we apply Windows & app patches to the entire datacenter. In the process, we had to reboot most of the machines, and there were plenty of times where we just couldn’t patch as many machines as we wanted because VM couldn’t stabilize the memory fast enough, and couldn’t free up enough to boot the next round of machines.

    I’m not saying I don’t love memory overcommitment - I do, and it does work - but just like CPU overcommitment or disk overcommitment, there are drawbacks, and you just have to be careful.

  4. RobC said:

    I use VMware ESX to back rack PCs, i.e. all of my VMs are user-controlled, full-time production XP workstations instead of physical machines next to the user. I pretty much hate memory overcommitment in that it aggressively harvests memory for you even when you don’t need/want it.

    The user experience when you come back to your XP workstation that has been idle for a while is a pretty nasty lag whilst the memory gets reallocated.

    Regardless of how clever it is and how many units they sell off the back of this feature, I sure wish they offered an opt-out option…..

  5. Mike said:

    RobC
    You can easily opt out of memory overcommitment, on either a per-host or a per-VM basis. Just set the memory reservation according to your need.

  6. James O'Neill said:

    @ William,
    I laid down a challenge for *a customer who is running, in production, a VMware VI3 Enterprise system with a 2:1 memory overcommit ratio on all the VMs, where spending the cost of VMware on RAM wouldn’t remove the need to use overcommitment*

    Actually the example that Mike at VMware came back with (http://blogs.vmware.com/virtualreality/2008/03/memory-overcomm.html) was “customer who is getting ready to implement a very large VDI environment”

    So not in production then.

    If, as you say “it’s almost a daily routine with most places” - then pick your charity and put a customer in touch with me.

    The thing is, when you add $5000 worth of RAM to a system you can run a huge number of VMs. So you have to start VMware off with a vast amount of RAM and a high level of overcommit to make the numbers work. Then you need 4 processors, so the cost of VMware goes up to $14,000 - which buys a lot of RAM. The example that was quoted used 178 VMs. But VMware say they only support 128 (http://www.vmware.com/pdf/vi3_35/esx_3/r35/vi3_35_25_config_max.pdf)

    Look at the other comments for cautionary notes on using overcommit. People have assumed I was saying overcommit was always evil. I wasn’t, but the original article was dishonest in suggesting that every workload could use 2x overcommit.

  7. William Bishop said:

    Yes, it’s not EVERY workload that could use a 2x overcommit (only 99%). But when does ANY “rule” dictate 100% use? As for the previous poster’s comment, “but try to boot 1/4 of them at once and see what happens”: why would I normally boot 1/4 of my guests at once? But to set his mind at ease, I’ve overloaded my VDI guests, and rebooted significant portions of them with no negative results. Am I going to use 2x overcommit on my Oracle guests? Not if I don’t have to, but if I need it, it WILL work. Am I going to do it on my SQL guests? No, the same answer. Am I going to do it on VDI? You’re damned skippy I will. We build so that we don’t massively (3X) overcommit unless an HA event happens (so there is always room for a large number of HA failures), but I’ve punished hosts to no end to ensure that when and if that happens, it will work with no issues. By default I overcommit at 1.5X on all my desktop hosts. Will I use it on low and moderate demand Windows and Linux guests? Yep. Same answer. Will I do it in production? I do it every day. Yes, I trust it, yes it works, yes it works well.

    You’re basically selecting a specific set of circumstances, which is, to my mind, invalid. All around, you’ll see no problems with a 2x overcommit. It’s not for every instance; nothing in the world is, that I’m aware of. There are always some unique circumstances that preclude it (I don’t know what they would be; you design so you don’t have that happen).

    Why add that much memory, when you can add another cheap dual proc(4 core) box and 16 gigs of ram and divide your workload? I don’t personally agree with the build up approach, I’d rather have a bladecenter full of 7 grand blades(8 cores), than several 4way boxes with a lot of memory. It’s cheaper, and more functional. It also gets you higher availability and better HA.

    And to help you with whether or not I can speak to the question, we’re not a customer ABOUT to implement a large VDI setup. We implemented well over a year ago, will have around 3000 XP instances by this July, and have over 70 production servers and 30 test servers on ESX. And those include Oracle, SQL, email servers, clinical systems, and about every other type of server you typically see.

    I may have been hesitant in the beginning, but I’m a CONVERT now.

  8. William Bishop said:

    I honestly believe 95% of problems and gotchas would not be encountered if people did more R&D in-house. Like Brent said, you have to be careful. Personally, I think that if you do your design with caution in mind, and test all your scenarios, you’ll find those things and plan for them before you build out.

    Test, test, test. I’ll get 3x overcommit if I lose an entire datacenter. It will work as well, and as cleanly as if I had not had a failover. How do I know this? I tested it. Rigorously. I tested to find out exactly how many guests the host would take, then calculated how many I could put on all of the hosts if I lost half(if I were not worried about COMPLETE failover, I would run them at 2x all day every day, without hesitation). I can remain operational for as long as necessary to rebuild the datacenter that fails.

    BTW, you can send that money to OLPC. I am a customer. Unless of course you wish to make the guidelines so strict that I don’t qualify (4 proc instead of dual, stacked boxes instead of blades, per-processor licensing instead of ELA or chassis). If it will make you happy, I’ll go load up a few of the servers so that it stays at 2x instead of 1.5x. And no, you can’t raise the memory enough vs the cost of me adding hosts. The problem with your argument is that most people who are running overcommit at 2x already have the boxes fairly loaded out with RAM, and to do away with overcommit means replacing what they have, which can get prohibitively expensive, often relying on high density modules.

    The build out approach with blades is far superior to the build up approach with 4 or 8 way boxes.

  9. James O'Neill said:

    So, William, have you got boxes where you couldn’t get rid of overcommit if you spent the price of VMware on RAM? Because the prize is still waiting for someone who can say that to get in touch.

    You say “Am I going to use 2x overcommit on my oracle guests? Not if I don’t have to”, but go back and read the post that kicked all this off, and tell me that doesn’t imply you can and should do it as a matter of routine. THAT was where I thought the post was dishonest.

    The “Why add that much memory?” question is simple…
    You buy a base box. You want to run a certain number of VMs on it. Do you spend $5K adding VMware or $5K adding RAM? Of course, if it’s a quad box it’s $14,000.
    So I have 4 × 4-core chips, and it doesn’t seem unreasonable to have 10 machines per core: 160 machines; let’s say they’re 1GB machines.
    Now if I run 2x overcommit on VMware I can run in 80GB - but compared with 512MB machines these are going to have fewer sharable pages and performance will get sucky. If I take the $14K VMware would cost, I can buy an extra 80GB with money left over. (And being over 128 virtual CPUs means it’s unsupported on VMware.)
    Split it over two boxes? OK, that’s fine, but now you’re buying two lots of VMware at $5K each…

    For task workers running the same apps and very little else I’d normally go for a terminal services / citrix presentation Server solution rather than the overhead of running (and maintaining) a whole centralized OS per user.

  10. William Bishop said:

    If the original stated that you should run at 2x NO MATTER WHAT, then it is wrong, obviously (only in that it would mean piss poor planning for failover). I don’t use that number because that would mean 4X (or more) in the event of a failure of an entire datacenter. I do run at 1.5x on desktops, and only about 1.2-1.3x on Oracle and SQL. I did say that I could do 2x easily, and had tested it to work. Like I said, if you like I can go load up a few servers and satisfy your requirements. Would I routinely run 2x on those boxes if my environment weren’t mission critical and required to suffer a halving of resources? Yep, even the Oracle and SQL boxes (what do I care if they run 2x if I never have to worry about some down time? They’re not going to suffer any performance constraints unless it goes higher than that).

    But my environment requires that I do have that amount of failover. If I had a total fail I would be over 4x which is beyond my comfort zone–And I could not guarantee it to work for as long as I might need for some catastrophic event(You’ve failed an entire datacenter…but during the following week, what happens if you lose a few more hosts? You’re now certainly going to have to lose some guests aren’t you?)

    We don’t disable page sharing on guests, so, without a doubt, ALL of our guests share to some extent. I consider it wasteful to only assign the memory of the host; you do that and you’re wasting a lot of resources. It’s embarrassing to have a host using only 25% of its memory. That’s half the reason to go to virtualization in the first place, to stop waste.

    Again, you’re using a specific set of circumstances to tilt the calculation in your favor. From my licensing and hardware purchase prices, I can buy and license another blade cheaper than doubling the memory on a blade(talk about expensive)…Although it’s closer now that memory has dropped so drastically of late(what would you have done if that question had come up 6 months ago when prices were higher?)

    As to Citrix or TS: we looked at those alternatives early on (and others as well), and VMware on bladecenters came out on top for a lot of reasons. We are a non-profit organization, so a strong business case and thorough research have to be done prior to a selection being made.

    Months of planning and testing culminated in a setup where overcommitment was a part of the design.

  11. Duncan said:

    I guess we can end this discussion, because rumor has it that Dell will be giving ESX 3i away for free with their 2950 servers soon. So much for the “expensive” discussion.
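
Two bits of arithmetic recur in the comments above: how an everyday overcommit ratio grows when half the hosts are lost (William's 1.5x becoming 3x, and 2x becoming 4x), and how much physical RAM James's 160 × 1GB example needs at 2x. A minimal sketch of both follows. The function names are hypothetical, and the failover formula assumes two equally sized sites with every guest restarting on the surviving hosts, which is my reading of the comments rather than a description of anyone's actual environment.

```python
def ratio_after_failover(normal_ratio, surviving_fraction):
    """Overcommit ratio on the surviving hosts once all guests restart there.

    surviving_fraction is the share of host RAM still available,
    e.g. 0.5 after losing one of two equally sized datacenters.
    """
    if not 0 < surviving_fraction <= 1:
        raise ValueError("surviving_fraction must be in (0, 1]")
    return normal_ratio / surviving_fraction


def physical_gb_needed(vm_count, gb_per_vm, overcommit_ratio):
    """Physical RAM needed to host vm_count guests at a given overcommit ratio."""
    return vm_count * gb_per_vm / overcommit_ratio


# William's planning: ratios after losing half of the hosts.
for normal in (1.5, 2.0):
    degraded = ratio_after_failover(normal, 0.5)
    print(f"{normal}x day to day -> {degraded}x after losing half the hosts")

# James's example: 160 guests of 1GB each at 2x overcommit.
print(physical_gb_needed(160, 1, 2.0), "GB of physical RAM")
```

The same division explains why William stays at 1.5x rather than 2x: the everyday ratio is chosen by working backwards from the worst ratio he is willing to run after a failure.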
