By Aaron Delp
Twitter: aarondelp
FriendFeed (Delicious, Twitter, & all my blogs in one spot): aarondelp
I wanted to relay some information regarding choosing memory speeds and types for the new Intel Xeon 5500 (Nehalem family) processors. As stated in my previous article on the Nehalem CPUs, there are some decisions that need to be made when choosing the memory and processor combinations. Let’s start off with what the memory architecture looks like.
- The current Xeon 5500 family is a two-socket configuration.
- Memory will run at 1333 MHz, 1066 MHz, and 800 MHz.
- Memory is currently produced in single, dual, and quad rank configurations. Dual rank is faster than single rank; quad rank is currently limited to 1066 MHz.
- Each CPU socket has 3 memory channels for a total of 6 channels per server.
- Each channel can accept up to 3 DIMMs. This is why servers are currently made with either 12 DIMM sockets (2 DIMMs per channel x 3 channels per processor x 2 processor sockets) or 18 DIMM sockets (3 DIMMs per channel x 3 channels per processor x 2 processor sockets).
- Some servers come in a 16 DIMM arrangement. Please see this IBM Paper for more information.
- The maximum memory speed is limited by processor. For example, the X5570 has a max memory speed of 1333 MHz, the E5540 has a max memory speed of 1066 MHz, etc.
- As more memory is added to a channel, the memory will slow down.
- Better performance is achieved when the memory is “balanced” (the total amount of memory across channels is the same).
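The slot arithmetic in the list above can be sketched in a few lines of Python. The constants come straight from the bullets; the function name `dimm_slots` is just an illustrative choice.

```python
# Topology figures taken from the list above.
SOCKETS = 2
CHANNELS_PER_SOCKET = 3


def dimm_slots(dimms_per_channel, sockets=SOCKETS,
               channels_per_socket=CHANNELS_PER_SOCKET):
    """Total DIMM slots in a server with the given slots per channel."""
    return dimms_per_channel * channels_per_socket * sockets


for dpc in (2, 3):
    print(f"{dpc} DIMMs per channel -> {dimm_slots(dpc)} slots")
```

The 16 DIMM servers mentioned above do not fit this simple product, so they are not modeled here.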
Take a look at the HP Quick Specs for the BL460 G6 server in the Memory section. I found this to be a great source.
So, what does all of that mean? It means that for best performance you should install the memory using the following guidelines:
- Ideally, install DIMMs in sets of 6, 1 per channel (populate both sockets with CPUs!). Use DIMMs that are dual rank and have the fastest speed you can purchase that the processor supports.
- Populate the first slot in all channels first, then populate the 2nd slots in all channels, etc. Don’t put all three DIMMs in one channel and leave other channels empty.
- Balance the amount of memory in each channel whenever possible (for example, 3 x 4GB on two channels and 1 x 4GB plus 1 x 8GB on the last channel, so every channel holds 12GB).
- If at all possible, try to keep the system away from the 800MHz memory speed.
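As a minimal sketch of the population and balance guidelines above, the following Python fills slot 1 in every channel before slot 2, and checks whether the capacity per channel matches. The function names and the default of 6 channels with 3 slots each are assumptions for illustration; real servers may number or order their slots differently, so check the vendor documentation.

```python
def populate(dimm_count, channels=6, slots_per_channel=3):
    """Return the number of DIMMs in each channel when the first slot of
    every channel is filled before any second slot, and so on."""
    if dimm_count > channels * slots_per_channel:
        raise ValueError("more DIMMs than available slots")
    counts = [0] * channels
    for i in range(dimm_count):
        counts[i % channels] += 1  # round-robin across channels
    return counts


def is_balanced(channel_gb):
    """True when every channel holds the same total capacity in GB."""
    return len(set(channel_gb)) == 1


print(populate(6))    # one DIMM in each of the six channels
print(populate(12))   # two DIMMs in each of the six channels
# Three channels of one processor: 3 x 4GB, 3 x 4GB, and 4GB + 8GB.
print(is_balanced([3 * 4, 3 * 4, 4 + 8]))
```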
Here is a link to an awesome IBM white paper explaining everything.
Here’s an example 12 DIMM slot Nehalem configuration:
| Processor (clock speed) | Max memory speed | Bank 1 in channel populated | Bank 2 in channel populated |
|---|---|---|---|
| X5570 (2.93 GHz) | 1333 MHz | 1333 MHz | 1066 MHz * |
| X5560 (2.80 GHz) | 1333 MHz | 1333 MHz | 1066 MHz * |
| X5550 (2.66 GHz) | 1333 MHz | 1333 MHz | 1066 MHz * |
| E5540 (2.53 GHz) | 1066 MHz | 1066 MHz | 1066 MHz |
| E5530 (2.40 GHz) | 1066 MHz | 1066 MHz | 1066 MHz |
| E5520 (2.26 GHz) | 1066 MHz | 1066 MHz | 1066 MHz |
| E5506 (2.13 GHz) | 800 MHz | 800 MHz | 800 MHz |
| E5504 (2.00 GHz) | 800 MHz | 800 MHz | 800 MHz |
| E5502 (1.86 GHz) | 800 MHz | 800 MHz | 800 MHz |
Here’s an example 18 DIMM slot Nehalem configuration:
| Processor (clock speed) | Max memory speed | Bank 1 in channel populated | Bank 2 in channel populated | Bank 3 in channel populated |
|---|---|---|---|---|
| X5570 (2.93 GHz) | 1333 MHz | 1333 MHz | 1066 MHz * | 800 MHz |
| X5560 (2.80 GHz) | 1333 MHz | 1333 MHz | 1066 MHz * | 800 MHz |
| X5550 (2.66 GHz) | 1333 MHz | 1333 MHz | 1066 MHz * | 800 MHz |
| E5540 (2.53 GHz) | 1066 MHz | 1066 MHz | 1066 MHz | 800 MHz |
| E5530 (2.40 GHz) | 1066 MHz | 1066 MHz | 1066 MHz | 800 MHz |
| E5520 (2.26 GHz) | 1066 MHz | 1066 MHz | 1066 MHz | 800 MHz |
| E5506 (2.13 GHz) | 800 MHz | 800 MHz | 800 MHz | 800 MHz |
| E5504 (2.00 GHz) | 800 MHz | 800 MHz | 800 MHz | 800 MHz |
| E5502 (1.86 GHz) | 800 MHz | 800 MHz | 800 MHz | 800 MHz |
* According to the HP Quick Spec for the BL460 G6, they are able to keep the speed at 1333 MHz with 2 DIMMs per channel. A BIOS update is required to achieve this. This is HP specific.
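The two tables above reduce to a simple rule: the memory runs at the lower of the processor's maximum memory speed and the ceiling set by how many DIMMs sit in each channel. Here is a rough Python sketch of that rule. The lookup values are read off the tables; the function name and the `quad_rank` flag are illustrative assumptions, and the vendor-specific 1333 MHz exception in the footnote is deliberately not modeled.

```python
# Speed ceiling (MHz) by DIMMs populated per channel, from the tables above.
SPEED_BY_DIMMS_PER_CHANNEL = {1: 1333, 2: 1066, 3: 800}

# Maximum memory speed (MHz) per processor, from the tables above.
CPU_MAX_SPEED = {
    "X5570": 1333, "X5560": 1333, "X5550": 1333,
    "E5540": 1066, "E5530": 1066, "E5520": 1066,
    "E5506": 800, "E5504": 800, "E5502": 800,
}

QUAD_RANK_LIMIT = 1066  # quad rank DIMMs are currently limited to 1066 MHz


def memory_speed(cpu, dimms_per_channel, quad_rank=False):
    """Expected memory speed in MHz for a processor and channel loading."""
    speed = min(CPU_MAX_SPEED[cpu],
                SPEED_BY_DIMMS_PER_CHANNEL[dimms_per_channel])
    if quad_rank:
        speed = min(speed, QUAD_RANK_LIMIT)
    return speed


for dpc in (1, 2, 3):
    print("X5570 with", dpc, "DIMM(s) per channel:",
          memory_speed("X5570", dpc), "MHz")
```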
Common Questions:
Q: What kind of performance decrease will I see by lowering the clock speed of my memory? For example, using 6 x 2GB DIMMs (running at 1333 MHz) vs 12 x 1GB DIMMs (running at 1066 MHz) to save a little money.
A: According to the IBM white paper listed above, we have two main areas of performance to worry about: latency and throughput. The latency difference between 1333 MHz and 800 MHz is about 10%. Memory throughput is another story though. The difference between 1333 MHz and 1066 MHz is about 9%. The difference from 1066 MHz to 800 MHz is 28%!
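To get a feel for what the two throughput steps add up to, here is a back-of-the-envelope Python sketch. It assumes each percentage is a drop measured relative to the faster speed and that the two steps simply compound; the figures above do not say which way they are measured, so treat the result as a rough estimate rather than a number from the IBM paper.

```python
# Throughput drops quoted above, read as fractions of the faster speed
# (this reading is an assumption, not something the source states).
DROP_1333_TO_1066 = 0.09
DROP_1066_TO_800 = 0.28

# Throughput at 800 MHz relative to 1333 MHz if the two steps compound.
relative_800_vs_1333 = (1 - DROP_1333_TO_1066) * (1 - DROP_1066_TO_800)
total_drop = 1 - relative_800_vs_1333

print(f"800 MHz throughput relative to 1333 MHz: {relative_800_vs_1333:.2f}")
print(f"Combined drop: {total_drop:.0%}")
```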
Q: What kind of performance increase will I see in a “balanced” (same amount of memory per channel) system?
A: Again, according to the IBM paper, you will see a performance increase if the system is balanced. An exact number isn’t given.
Q: Which is fastest? Single, dual, or quad rank DIMMs?
A: According to the IBM white paper, dual rank outperforms single rank by 7% in SPECjbb2005. Quad rank DIMMs decrease the clock speed to 1066 MHz, so they are not faster at this time.
Q: What if I only populate one processor?
A: You want to populate both sockets if performance is a concern. Adding the second processor not only makes the second set of DIMM sockets available, it also doubles the memory bandwidth.
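The bandwidth doubling can be illustrated with theoretical peak numbers. This is a sketch only: it assumes the standard 64-bit (8 byte) DDR3 channel width and treats the quoted speeds as transfer rates, and it gives a theoretical ceiling, not the measured throughput discussed in the IBM paper. The function name is hypothetical.

```python
BYTES_PER_TRANSFER = 8   # a DDR3 channel is 64 bits wide
CHANNELS_PER_SOCKET = 3  # from the architecture list at the top of the post


def peak_bandwidth_gb_s(speed_mt_s, sockets,
                        channels_per_socket=CHANNELS_PER_SOCKET):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_second = (speed_mt_s * 1e6 * BYTES_PER_TRANSFER
                        * channels_per_socket * sockets)
    return bytes_per_second / 1e9


for sockets in (1, 2):
    bandwidth = peak_bandwidth_gb_s(1333, sockets)
    print(f"{sockets} socket(s) at 1333: {bandwidth:.1f} GB/s peak")
```

Each populated socket brings its own three channels, which is why the peak figure scales with the socket count.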
-
Nice write up Aaron.
This is a perfect example why the half-height Cisco UCS blade is a 12 DIMM configuration rather than 18 DIMM, because it keeps the system speed at 1066 MHz or higher for any configuration of high performance DIMMs. The performance penalty of adding the 3rd DIMM per channel doesn’t seem to make a lot of sense in these new higher performance systems.
Cheers,
Brad -
@Brad – this is only true until you need that extra memory that those extra 6 DIMMs will give you. Once you start swapping out to disk the performance hit from that will far exceed the 800Mhz bus speeds of having 18 DIMMs and if you don’t need the extra DIMMs or want to leave yourself some expansion room just populate 2 per channel and presto you are back at 1066 or even 1333Mhz with the BL490c.
-
@Tony – if your system is swapping to disk so much that knocking the memory bus down to 800Mhz actually results in long term overall system improvement then it is fair to say you have over provisioned that system and have a larger design problem at hand. Such a design challenge would be a good fit for the Cisco UCS B-250 series blade with 48 DIMM slots at 1066Mhz.
Sure, in any system there can be temporary spikes resulting in occasional swapping, but why affect the permanent system performance with an 800Mhz configuration? It doesn’t make much sense to me.
Better to react to such situations with good design philosophy, rather than applying band aids that actually lower the bar.
Cheers,
Brad -
@Brad I think I have to agree with Tony, having the ability to drop in more ram without upgrading was his main point down the road. I don’t think he is saying buy a system with 18 slots and fill it up from the start. It makes sense to buy equipment with the capacity to expand down the road. It does not make sense to buy equipment which is maxed out when you first turn it on, in that case I do agree with you in saying there is a larger design problem at hand.
Brad
-
@Brad – I think you missed the point. I wasn’t implying Tony wanted to fill up all 18 DIMMs day one. If Tony is running a system with 12 DIMMs @1066Mhz (or as it sounds in the HP case, 1333MHz), and he decides that his system needs more memory, he is faced with the consequence of lowering permanent memory bandwidth by 28% (or in the HP case, 40%). When does a 28-40% decrease in permanent memory bandwidth make sense? That’s a tough decision to make. You would probably only do that if the majority of operations are being swapped to disk. In this case it’s fair to say this is a memory bound application. And in such a memory intensive application environment, upgrading your system from 96GB to 144GB will likely not be a long lasting solution – not to mention the significant performance tax paid.
In this scenario, I agree with you that having 18 slots vs 12 slots provides an option for a temporary short term solution otherwise not available. Point taken. However it is exactly that, a short term solution with consequences that must be evaluated.
In this case, your larger design problem for this memory bound application is having a system that is limited to 18 DIMM slots. The good news is, better choices now exist.
Cheers,
Brad -
@Brad Hedlund – I’m trying to address the question of is running memory at 800Mhz actually bad. If your application is solely bound by memory bandwidth, not memory capacity, memory latency, disk, network, CPU or anything else then the answer is yes, but in that case 1066=bad as well. So go with a system that can run 12 DIMMS at 1333Mhz.
If memory bandwidth is not your issue then running memory at a lower speed will not affect your performance and can actually decrease your power consumption. This is true even on a 6 or 12 DIMM system and you can take advantage of this with BIOS options that allow you to select memory speed.
More than likely there will be multiple factors affecting the performance of any system, including budget. We’d all love unlimited budgets but not in today’s environment. If I have an app, for example, that needs 4GB per core today and will grow to 8GB per core that’s hard to do cost effectively in a 12 DIMM system. I can use 4GB DIMMs in an 18 DIMM system and get the amount of memory I need, and having the right amount of memory per core will provide optimum performance. It might indeed be the case that I am memory bandwidth constrained at that point, but I could equally well be CPU, disk or network bound or I could be hitting some kind of application or programming limitation. So the simplistic assumption that having 800MHz memory=bad is incorrect.
You could of course buy a bigger blade and use up more bays in the enclosure to get at an even larger memory footprint, but if your application needs 4-8GB per core installing or having the ability to install 16GB per core won’t do you any good. You’ve increased your costs as the enclosure and infrastructure gets amortized across all the blades, and to take advantage of that extra memory support you need more cores. So unless you’re going to a 4 socket blade, in this environment the larger memory footprint serves no purpose no matter what bus speed it runs at.
The trick is to balance the system, cores, memory size, memory bandwidth, memory latency, disk performance, network I/O etc for your application within your budget and to understand that you can never eliminate all bottlenecks they just move around the system.
-
@Tony Harvey – You are spot on that budgets are tight and every dollar will be scrutinized, and rightfully so. Not only will the cost of the compute resources (cpu & memory) be driven down, the larger architecture of support infrastructure propping up the compute environment will be targeted for cost reductions as well (switches, cables, adapters, management, licensing costs, etc.)
As for reducing costs of compute resources — Your statement that 4 sockets are needed for a larger memory footprint is simply not true anymore. This year buyers will be able to purchase blades with 48 DIMM slots in a 2 socket system running at 1066MHz fully populated.
Given your example of an application requirement that grows from 4GB to 8GB per core, this can be done with 32 x 2GB DIMMs @ 1066 MHz for a much lower cost and higher performance than 16 x 4GB DIMMs @ 800MHz. Compute costs are further reduced in per socket licensing costs; for example, a 2 socket system has much lower Oracle or VMWare licensing costs than a 4 socket system.
As for reducing the costs of support infrastructure — The time has come to start carefully scrutinizing the total cost of any components surrounding the compute resources, including the server adapters, fibre channel and ethernet switches, management modules, and management software licensing costs.
While lowering your system performance to 800MHz will save some power, there is much more power and costs to be saved in the support infrastructure, such as lowering power and cost consumed by switches and adapters with a unified fabric. You can even lower power consumed by fans with an efficient airflow design through the enclosure. How much power is consumed by the management modules in each enclosure, and are those really necessary anymore?
Before reducing the processing power of the compute resources (in the name of saving power), it makes more sense to me to optimize every watt in the surrounding support infrastructure before I punish what really matters – the compute resources executing the business applications.
Thanks for the great discussion.
Cheers,
Brad -
@Brad Hedlund – To deal with points in no particular order. I was not claiming that to get bigger memory footprints you have to go to 4-Sockets I was pointing out that with an application that needs 8GB per core having the ability to go to 16GB per core on a 2-Socket 8-Core system is of no benefit. You need more cores to take advantage of the extra memory.
Your assumption on the cost of 4GB R-DIMMs vs 2GB R-DIMMs is no longer correct. I just checked on CDW http://tinyurl.com/ppearj and a 4GB R-DIMM is $219 (or $190 for an 800Mhz only DIMM) a 2GB R-DIMM $116 so it’s actually cheaper to use 4GB DIMMs and you will use a lot less power. A 4GB DIMM only uses about a third more power than a 2GB DIMM reducing overall power consumption significantly.
The system infrastructure power is something that needs to be looked at and worked on, and it is, but the overwhelming amount of power taken by a system is in the CPU and memory sub-systems. A 2 socket blade with 80W CPUs and with 32 2GB DIMMs uses 224W just to run the CPU and memory.
The CPU/memory complex makes up the biggest power consumer in an enclosure, or any other system for that matter. It’s also where you can get the biggest power savings by carefully choosing components to exactly match your needs and not overbuying.
The surrounding infrastructure power consumption and cost should be looked at carefully. Reduced airflow and decreased fan power is something blades have been working on for some time now. One of the key things is that the infrastructure cost and power consumption of the infrastructure needs to be amortized over as many compute resources as possible. Reducing the number of server nodes, ultimately reducing the number of compute cores, increases those fixed costs significantly.
I agree on the opportunities for network convergence, that is why today I can create a solution for consolidating multiple 1Gbit networks onto a single 10Gbit link within the enclosure. This reduces the switch requirements in enclosure and by working inside the enclosure it works with any network vendors core switch solution. As CEE moves to an industry standard then FCoE becomes an option for further infrastructure consolidation.
Tony
-
Scott, you say a ROM update is required to support 1333Mhz with 2 DIMMs per channel in HP systems. It only requires a ROM setting to allow this capability (unique to HP).
Brad, according to my reading it may be a little premature to claim that a 48 DIMM system can support 1066Mhz. According to http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=217600100 it has “not yet determined whether it will run the memory bus at the full 1,066 MHz data rate or at a lower 800 MHz rate”. Add to this the increased latency an additional memory controller will add, and it’s easy to understand why we haven’t seen any benchmarks from 48 DIMM systems.
Why is Cisco running their 12 DIMM system at 1066Mhz when HP can run theirs at 1333Mhz?
-
Here’s a whitepaper that includes lots of performance & power measurements on DDR3 memory configs with the Xeon 5500 processor:
http://h20195.www2.hp.com/v2/GetPDF.aspx/c01750914.pdf
Probably the most useful are the charts that compare the bandwidth, latency, and power of the various ways you could populate 24GB.
…Daniel
-
Just an FYI. In planning for our next cluster, we used your post to question why Dell had listed 12×8 banks @ 1333 MHz (obviously this fills the second bank and should run at 1066MHz). We questioned our rep and he mentioned a BIOS update as of 3wks (1.4?) that allows the second filled banks to run at 1333MHz. Something about updating the controller chip….he was vague.
Thanks for writing this! very very helpful!!
-
First of all; great article! Really sums up the important facts regarding nehalem and the memory architecture.
Second; I’m having a hard time figuring out the following: I’m in the market to buy a new server and I’m currently deciding between a motherboard with 6 memory banks or 12 memory banks. The amount of memory is not important to me (24GB should be more than enough), but speed is. Will 6 x 4GB QUAD RANK (@1066MHZ) have the same performance as 12 x 2GB DUAL RANK (@1066MHZ)? With only one QUAD RANK K3-kit per cpu you will not experience downclocking to 800MHz, so that’s why I’m wondering.
-
You annotated this in your blog about the memory speed:
* According to the HP Quick Spec for the BL460 G6, they are able to keep the speed at 1333 MHz with 2 DIMMS. A BIOS update is required to achieve this. This is HP specific.
Actually, this is also Dell specific. Dell also has a BIOS update that will allow the memory to stay at 1333MHz in both channels.
-
Do you know if Cisco’s UCS platform has also implemented the 1333 MHz spec for the RAM?
-
As you add memory, the achievable bandwidth goes down.
CSCO UCS overcomes this by using an ASIC-on-motherboard to allow memory to run at 1066 MHz at full memory-loading (using the “Catalina ASIC” in the UCS M1 models it seems), while the UCS M2 ones supposedly support full speed 1333 MHz at full memory loading.
After the demise of MetaRAM (conceding to Netlist), there are not that many alternatives available apart from Netlist’s HyperCloud memory (which has a patented ASIC-on-memory-module approach) to give 1333 MHz at full memory (384GB for a dual-socket system).
Question is, are HP and others planning to use NLST HyperCloud to counteract CSCO UCS ?
Currently NLST HyperCloud is undergoing qualification at HP and many other OEMs.
You can follow some of the conversation on this topic on the NLST yahoo board:
http://messages.finance.yahoo.com/Stocks_%28A_to_Z%29/Stocks_N/threadview?m=te&bn=51443&tid=23494&mid=23510&tof=1&frt=2#23510
Re: Cisco’s UCS -
quote:
vicl2010v2, are you an employee or partner of NLST? Full disclosure is expected here.
I am a Netlist shareholder. The link above is for the yahoo NLST board where there can be some good discussion occasionally.
CSCO UCS is important (for NLST shareholders) because it uses an ASIC-on-motherboard approach to improve the memory-loading/memory-bandwidth tradeoff problem.
NLST is able to put ASIC/buffer chip on the memory module itself, with HyperCloud memory modules that are plug and play and require no BIOS updates.
NLST HyperCloud promised:
- ability to load higher memory
- runs at full 1333 MHz
- lower power (does something with turning off power for ranks not in use ?)
- uses “lower dollar per bit” memory chips to emulate “higher dollar per bit” memory chips which improves economics for NLST (in a way similar to CSCO use of large number of memory slots)
NLST has demoed HyperCloud at exhibitions using DELL, HP and IBM machines.
On HP machines:
http://www.prnewswire.com/news-releases/netlist-demonstrates-new-hypercloud-memory-modules-at-supercomputing-09-70174702.html
Netlist Demonstrates New HyperCloud Memory Modules at Supercomputing 09
On IBM machines:
http://www.prnewswire.com/news-releases/netlist-introduces-low-voltage-hypercloud-industrys-first-135v-virtual-rank-memory-module-92165154.html
Netlist Introduces Low Voltage HyperCloud, Industry’s First 1.35V Virtual Rank Memory Module
According to company, they had a dozen OEMs qualifying HyperCloud, which has been upped to “3 dozen” companies now.
Won some qualifications recently (SuperMicro, Viglen), with testimonials:
http://finance.yahoo.com/news/Supermicro-Qualifies-Netlists-prnews-2550444882.html?x=0&.v=1
Supermicro Qualifies Netlist’s HyperCloud Memory on High-Density Servers
http://finance.yahoo.com/news/Viglen-Selects-Netlists-prnews-2806446361.html?x=0&.v=1
Viglen Selects Netlist’s HyperCloud Memory for HPC Applications
Another area of concern for shareholders is litigation.
However there have been some successes recently with wins over:
- MetaRAM (Fred Weber’s outfit) where they conceded related IP to NLST (the rest was sold to GOOG)
- Texas Instruments (which was accused of leaking info obtained under NDA from NLST to JEDEC) – settled recently (reportedly favorably for NLST)
The most visible case is GOOG vs. NLST where judge has ruled in favor of NLST claims construction. After TXN settlement, NLST has asked court that following testimony obtained from JEDEC lawyer and GOOG employee Rob Sprinkle they have all the “facts” in place and matter can go to summary judgement (i.e. no need for jury trial which is primarily used to ascertain “facts”). Discovery is complete and judge has been urging parties to settle. So summary judgement request is likely to be approved, thereby moving timeline from Nov 2010 to a bit earlier (summary judgement hearing Sept 14).
GOOG vs. NLST is interesting because it was started by GOOG to protect itself against potential injunction against its servers, but judge forced GOOG to turn over its server to NLST lawyers where it was found to be using JEDEC “Mode C” proposed standard. JEDEC had earlier issued letters to members and GOOG that “Mode C” potentially infringed NLST IP.
It was in response to this that NLST vs. GOOG was started.
It turns out that GOOG had its own hardware division that was contracting for the buffer chips and GOOG was essentially manufacturing its own memory modules. It is unclear just how prevalent use of this type of memory is at GOOG.
GOOG will settle eventually, however such settlement will probably include future business with NLST.
Other litigation is against Inphi (which is seeking to manufacture buffer chips based on JEDEC “Mode C” proposed standard). Unlike MetaRAM which held significant IP, Inphi holds little IP in this area – they have been using a Sanmina-SCI patent application and other patents to provoke interference at USPTO for patent reexamination (which are routinely accepted by USPTO).
However Sanmina-SCI patent application has gotten rejected again.
Inphi litigation however is still very early (not even started discovery) so they will mature much after GOOG litigation and related JEDEC stuff is settled and done.
Anyway, so this is the short intro to NLST
-
More recent version of IBM White Paper:
Understanding Intel Xeon 5600 Series Memory Performance and Optimization in IBM System x and BladeCenter Platforms
How to get the best performance out of your new System x or BladeCenter Server
May 2010 -
I wonder if any folks here are starting to hear mention of Netlist’s HyperCloud memory esp. for high-memory loading virtualization work (where HyperCloud allows full speed use).
It is now listed on VMWare website as one of the memory partners (other one is Kingston) – click Hardware – System Boards.



