Original Link: https://www.anandtech.com/show/2851



The answer was simple: "You cannot compare a truck with a sports car." The question? "What is the point of the hex-core Xeon MP now that we have servers based on the Xeon 5500 series?" The excellent performance of the new Xeon DP platform has put Intel's own quad socket platform in an uneasy spot, and the marketing people now have to resort to "fuzzy logic" to combat the feeling that the Xeon MP platform has been obsolete since the end of March 2009. Comparing the dual socket Nehalem servers with sports cars and the heavy quad socket "Dunnington" systems with trucks might look like a decent analogy at first, but both the DP and MP platforms are made for the same reason: to do your processing work. There is nothing that prevents you from using a Xeon DP X5550 server instead of a Xeon 7460 one as a database backend: they can perform the same tasks. However, moving your furniture with a Lamborghini instead of a truck might prove to be quite a challenge.

Does it matter to you, the IT professional? Yes, and the reason is once again… virtualization. Choosing between a dual and quad socket server used to be simple: use the latter for the heavy backend applications, the former for everything else. But do you build your virtual infrastructure on top of quad socket or dual socket machines? The dual socket servers are much cheaper than the quads: you can easily get two and still save some money compared to buying one quad machine. On the other hand, two dual machines need four power supplies if you want redundancy, and when you are running 10 critical applications on a machine, redundancy is something you cannot afford to ignore. Most quad socket machines are more redundant and reliable. Since the quad market is less ruled by price/performance and slower to evolve, manufacturers can afford to spend more time and money working on the reliability of their machines. One 2U quad machine will also have more expansion slots than two 1U dual socket machines.

Do you get a few quad socket machines or (slightly less than) twice as many dual socket servers? It is not as clear cut a decision as it used to be. This article will compare the power and performance of the current AMD and Intel quad and dual platforms, giving you some of the information you need to make a well-informed decision. Please share your own experiences with the dual versus quad socket question; we are eagerly awaiting them.



AMD's dual and quad platform: consistency

AMD's PR is making a lot of noise about consistency, and rightly so. The quad socket and dual socket processors are - besides the obviously different multiprocessor capabilities - exactly the same. In the case of virtualization, this allows you to optimize your virtual machines and hypervisors once and then clone them as much as you like. There are fewer worries when moving virtual machines around, and there is no fiddling with masking processor capabilities. This is well illustrated when you check which mode VMware ESX virtual machines run in. The table is pretty simple for VMs running on top of an AMD processor: virtual machines running on the dual-core Opterons will use software virtualization, while the quad-cores will almost always run in the fastest mode (hardware virtualization combined with hardware assisted paging). The same is true for Hyper-V: it won't run on the dual-core Opterons, and it runs at full speed on the quad-cores. It is remarkably simple compared to the complete mess Intel made: some of the old Pentium 4 based CPUs support VT-x, some don't; some of the lower end Xeons launched in 2007 and 2008 don't; and so on.

There is some inconsistency in HyperTransport and L3 cache speeds, but that will only cause small performance variations, not software management troubles. Of course, AMD's very consistent dual and quad socket platform is not without flaws either. The NVIDIA MCP55 Pro chipset was at times pretty quirky when installing new virtualization software. Most of the time a patch took care of that, and the Opteron servers ran rock solid afterwards, but in the meantime a lot of valuable time was wasted. Also, the current platform has not evolved for years and is starting to show its age: we found that the motherboards consume a bit more power than they should. In 2010, all Opteron server platforms will use AMD chipsets only.

The core part of the new hex-core Opteron is identical to that of the quad-core, but the "uncore" part has some improvements. With the exception of the 2.8GHz 2387/8387 and 2.9GHz 2389/8389, most quad-core Opterons still connect with 1GHz HyperTransport links. The hex-core Opteron's HyperTransport links run at speeds between 2 and 2.4GHz, and in a multi-socket server it always connects to the other CPUs via 2.4GHz links. That makes little difference in a 2P server, but performance quickly gets limited by interconnect speed in 4P. Even at 2.4GHz (9.6GB/s per direction), probe broadcasting can limit performance, and that is why you can reserve up to 1MB of L3 cache for a snoop filter. These improvements make the hex-core Opteron a more interesting choice than the quad-core Opterons for quad socket servers - even at lower clock speeds.
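As a back-of-the-envelope check, the per-direction bandwidth figures follow directly from the link clock: HyperTransport is double-pumped (two transfers per clock), and we assume the standard 16-bit-per-direction links these Opterons use for coherent traffic. A quick sketch:

```python
def ht_bandwidth_gbps(link_clock_ghz, width_bits=16):
    """Per-direction HyperTransport bandwidth in GB/s.

    HT is double-pumped: two transfers per clock cycle,
    width_bits / 8 bytes per transfer.
    """
    transfers_per_sec = link_clock_ghz * 2        # DDR signaling
    return transfers_per_sec * width_bits / 8     # bytes per second, in GB/s

# 2.4GHz links on the hex-core Opteron: 9.6GB/s per direction
assert round(ht_bandwidth_gbps(2.4), 1) == 9.6
# 2.2GHz links on the 2.9GHz quad-core 8389: 8.8GB/s (see the spec table below)
assert round(ht_bandwidth_gbps(2.2), 1) == 8.8
# The older 1GHz links: only 4GB/s per direction
assert ht_bandwidth_gbps(1.0) == 4.0
```

The same formula reproduces the 8.8GB/s figure listed for the Opteron 8389 later in this article.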

In fact, we feel that besides the very low power Opteron 2377 EE, the quad-core Opterons are of little use. If your application scales relatively badly, the X55xx series offers much better "per thread" performance. If your application scales well, two 2.6GHz Opteron 2435s will offer 15% better (and sometimes more) performance than two 2.9GHz Opteron 2389s at the same power consumption. Using relatively "old" technology such as DDR2, the hex-core Opteron based servers are very affordable, especially if you compare them with similar Xeon servers.

The Intel Dual socket platform: pricey performance and performance/watt champion

We have already tested the new dual socket "Nehalem" Xeon platform. It is the platform with the fastest interconnects, the most threads per socket (thanks to Hyper-Threading), the most bandwidth (triple-channel DDR3), and the most modern virtualization features (Intel VT-d). Even the top models are far from power hogs: at full load, the X5570 offers an excellent performance/watt ratio. The low-power L5520 at 2.26GHz was a real champion in our performance per watt tests and is available at reasonable prices.

The relatively new platform (chipset, DDR3) is still on the expensive side: a similarly configured Dell R710 (two Xeon X5550 2.66GHz, 8 x 4GB 1066MHz DDR3) costs about one third more than a Dell R805 (two Opteron 2435, 8 x 4GB 800MHz DDR2): $5047 versus $3838 (pricing at the end of September 2009). If you choose the Xeon platform, you should be aware that Intel's low end is much less interesting: the best Xeon 55xx CPUs have clock speeds between 2.26 and 2.93GHz, while the low end models, the E5504 and E5506, are pretty crippled, with no Hyper-Threading, no Turbo Boost, and only half as much L3 cache (4MB). These crippled CPUs can keep up with the quad-core Opterons at about 2.5GHz, but they are the worst Xeons when you look at idle and full load power. The performance per watt of these low end Xeons is pretty bad compared to the more expensive parts.

The Intel Quad socket platform

There is no quad socket version of Intel's excellent "Nehalem" Xeon platform yet; we will have to wait until the Nehalem-EX servers ship at the beginning of 2010. At that point, servers with the eight-core, 24MB L3 cache CPU will almost certainly end up in a higher price class than the current quad socket servers - one indication is that Intel positions the Nehalem-EX as a RISC market killer. Then again, Intel might well bring out quad-core versions too. We will have to wait and see.

So there's no Hyper-Threading, Turbo Boost, EPT, NUMA, or fast interconnects for the current Xeon "Dunnington" platform, which is still based on a "multiple independent FSB" topology. In theory it has massive amounts of bandwidth (up to 21GB/s), but unfortunately less than 10GB/s is really available: snooping traffic consumes lots of bandwidth and increases the latency of cache accesses. The 16MB L3 cache should lessen the impact of the relatively slow memory subsystem, but it is clocked at only half the clock speed of the cores. A painful 100-cycle latency is the result, but luckily every two cores also share a fast 3MB L2 cache.

When it was first launched, the Xeon MP defeated the AMD alternatives by a good margin in ERP and heavy database loads. It reigned supreme in TPC-C and broke a few new records. More importantly, it took back 9% of market share in the quad socket market according to the IDC Worldwide Server Tracker. But at that time the 2.66GHz hex-core had to compete with a 2.5GHz quad-core Opteron backed by a paltry 2MB of shared L3, and AMD has been working hard on a comeback. The massive Intel chip (503mm²) now faces a competitor with three times as much L3 cache and 50% more cores at higher clock speeds, and that is not all: the DDR2-800 DIMMs deliver up to 42GB/s, four times as much bandwidth, to the four AMD chips. At the same time, the Xeon behemoth has to outpace the ultra modern dual Xeon platform by a decent margin to justify its much higher price.



What Intel and AMD Are Offering

Before we dive into benchmarks, it is good to see how the vendors position their CPUs. Here's a quick spec sheet overview of the most important AMD and Intel CPUs.

Processor Speed and Cache Comparison
Model                Cores  Clock speed  L2 Cache   L3 Cache  Interconnect Bandwidth (one direction)
AMD Opteron 8439 SE  6      2.8GHz       6 x 512KB  6MB       9.6GB/s
Intel Xeon X7460     6      2.66GHz      3 x 3MB    16MB      via FSB & chipset
AMD Opteron 8435     6      2.6GHz       6 x 512KB  6MB       9.6GB/s
Intel Xeon E7450     6      2.4GHz       3 x 3MB    12MB      via FSB & chipset
AMD Opteron 8431     6      2.4GHz       6 x 512KB  6MB       9.6GB/s
Intel Xeon E7440     4      2.4GHz       2 x 3MB    16MB      via FSB & chipset
AMD Opteron 8389     4      2.9GHz       4 x 512KB  6MB       8.8GB/s
Intel Xeon E7430     4      2.13GHz      2 x 3MB    12MB      via FSB & chipset
Intel Xeon E7420     4      2.13GHz      2 x 3MB    8MB       via FSB & chipset

Excluding the low power models, AMD offers three hex-core CPUs and Intel offers two. The gap between the top Xeon models and the midrange is remarkable: the E7440 only has four cores, which means there is probably - roughly estimated - a 30 to 50% performance gap between the E7440 and the E7450. That gap does not exist in the AMD line-up: the Opteron 8389 also has four cores, but it clocks 21% higher than the hex-core 8431, so the performance gap is small. The pricing reflects our remarks:

Pricing
Intel Xeon Model  Speed (GHz) / TDP (W)  Price    AMD Opteron Model  Speed (GHz) / ACP (W)  Price
Hex-Core
X7460             2.66 / 130W            $2729    8439 SE            2.8 / 105-125W         $2649
                                                  8435               2.6 / 75-115W          $2649
E7450             2.4 / 90W              $2301    8431               2.4 / 75-115W          $2149
Quad-Core
E7440             2.4 / 90W              $1980    8389               2.9 / 75-115W          $2149
E7430             2.13 / 90W             $1391    8387               2.7 / 75-115W          $1865
E7420             2.13 / 90W             $1177    8378               2.4 / 75-115W          $873
Dual Socket
X5570             2.93 / 95W             $1386
X5550             2.66 / 95W             $958     2435               2.6 / 75-115W          $989

AMD feels that the E7450 is no match for the 8435; as a result, the latter comes with a pretty heavy price tag. Whether this is justified is easy to check, even though we do not test the E7450 in this review: as the E7450 is the same die as the X7460 at a slightly lower voltage and clock speed, it should be about 7 to 8% slower than the X7460. The 2.4GHz 8378 is quite interesting: still clocked at a decent speed, it is by far the cheapest quad socket processor. As the number of VMs that you can run on a server is often limited by the amount of memory and not by processor power, a quad 8378 setup might make sense.

The question remains whether the best dual socket processors of Intel and AMD are a threat to the quad socket servers. Two X5570s will set you back less than $2800, while four Xeon E7420s start at about $4700. Even a relatively entry-level E7430 2.13GHz based server (32GB, 4 CPUs) will cost in the range of $13,000. That is more than three times as much as a similar dual 2435 server and 2.6 times as much as a dual X5550 machine. That is why we include the fastest dual socket machines in this test too.
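A quick sanity check of the arithmetic behind these price comparisons, using the CPU list prices from the tables above and the rough server prices quoted in this article (ballpark figures, not formal quotes):

```python
# CPU list prices from the pricing table
x5570, e7420 = 1386, 1177

assert 2 * x5570 == 2772          # two X5570: just under $2800
assert 4 * e7420 == 4708          # four E7420: roughly $4700

# Rough server prices quoted in this article
quad_e7430_server = 13000         # entry-level quad socket, 32GB, 4 CPUs
dual_x5550_server = 5047          # Dell R710, two X5550
dual_2435_server  = 3838          # Dell R805, two Opteron 2435

print(round(quad_e7430_server / dual_x5550_server, 1))  # ~2.6x
print(round(quad_e7430_server / dual_2435_server, 1))   # ~3.4x
```

The ratios line up with the "2.6 times" and "three times as much" claims in the text.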



Benchmark Methods and Systems

It is impossible to use exactly the same benchmark methodology as we do for dual socket servers. We limit ourselves to our virtualization benchmarking, as we are sure these tests are able to saturate 24 cores. To add some perspective, we add industry standard benchmarks such as SAP and VMware VMmark scores.

Benchmark configuration

None of our benchmarks required more than 20GB of RAM. Database files were placed on a three-drive RAID0 of Intel X25-E SLC 32GB SSDs, with log files on a single Intel X25-E SLC 32GB. Adding more drives improved performance by only 1%, so we are confident that storage is not our bottleneck.

Xeon Server 1: ASUS RS700-E6/RS4 barebone
Dual Intel Xeon "Gainestown" X5570 2.93GHz
ASUS Z8PS-D12-1U
6x4GB (24GB) ECC Registered DDR3-1333
NIC: Intel 82574L PCI-E Gbit LAN
PSU: Delta Electronics DPS-770 AB 770W

Xeon Server 2: Supermicro SC818TQ-1000 Chassis
2x - 4x Intel Xeon X7460 at 2.66GHz
Supermicro X7QCE
64GB (16x4GB) ATP Registered FBDIMM DDR2-667 CL 5 ECC
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 1000W w/PFC (Model PWS-1K01-1R)

Opteron Server 1 (Quad CPU): Supermicro 818TQ+ 1000
Quad AMD Opteron 8435 at 2.6GHz
Quad AMD Opteron 8389 at 2.9GHz
Supermicro H8QMi-2+
64GB (16x4GB) DDR2-800
NIC: Dual Intel PRO/1000 Server NIC
PSU: Supermicro 1000W w/PFC (Model PWS-1K01-1R)

Opteron Server 2 (Dual CPU): Supermicro A+ Server 1021M-UR+V
Dual Opteron 2435 "Istanbul" 2.6GHz
Dual Opteron 2389 2.9GHz
Supermicro H8DMU+
32GB (8x4GB) 800MHz
PSU: 650W Cold Watt HE Power Solutions CWA2-0650-10-SM01-1

vApus/DVD Store/Oracle Calling Circle Client Configuration
Intel Core 2 Quad Q6600 2.4GHz
Foxconn P35AX-S
4GB (2x2GB) Kingston DDR2-667
NIC: Intel PRO/1000



Decision Support Benchmark: Nieuws.be

Decision Support benchmark: Nieuws.be
Operating System: Windows 2008 Enterprise RTM (64-bit)
Software: SQL Server 2008 Enterprise x64 (64-bit)
Benchmark software: vApus + real-world "Nieuws.be" database
Database size: > 100GB
Typical error margin: 1-2%

The Nieuws.be site is sitting on top of a pretty large database - more than 100GB and growing. This database consists of a few hundred separate tables, which have been carefully optimized by our lab (the Sizing Servers Lab). We have described our testing methods in more detail previously. As some of our readers suggested, we upgraded from SQL Server 2005 SP3 to SQL Server 2008.

 

[Chart: Nieuws.be MS SQL Server 2008]

In our "hex-core Opteron" review, we noticed excellent scaling from the quad-core Opteron "Shanghai" to the hex-core Opteron "Istanbul": 50% more cores resulted in a 40% performance increase. Even with 24 cores, the scaling remains outstanding: as we add another 12 "Istanbul" cores (a 100% increase), we get 65% more queries per second at a response time of 1000 ms. The quad Xeon X7460 has a small but noticeable lead over the powerful dual Xeon X5570. The quad Opteron 8435 outperforms the latter by 42%, which is quite impressive. The Microsoft SQL Server team also deserves a pat on the back: few "native" applications - even databases - scale well to 24 cores.
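The scaling claims can be condensed into a crude efficiency figure: the fraction of the added cores that actually shows up as added throughput. A small sketch of that arithmetic:

```python
def scaling_efficiency(core_ratio, perf_ratio):
    """Fraction of the extra cores that translates into extra throughput."""
    return (perf_ratio - 1) / (core_ratio - 1)

# Shanghai (4 cores) -> Istanbul (6 cores): +50% cores, +40% performance
print(round(scaling_efficiency(1.5, 1.4), 2))   # 0.8
# Dual Istanbul (12 cores) -> quad Istanbul (24 cores): +100% cores, +65% queries/s
print(round(scaling_efficiency(2.0, 1.65), 2))  # 0.65
```

Even the 24-core figure is remarkably good for a database workload; perfect scaling (1.0) is rare once cache coherency traffic grows with the socket count.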



SAP S&D 2-Tier

SAP S&D 2-Tier
Operating System: Windows 2008 Enterprise Edition
Software: SAP ERP 6.0 Enhancement Package 4
Benchmark software: Industry standard benchmark, version 2009
Typical error margin: very low

The SAP SD (sales and distribution, 2-tier internet configuration) benchmark is interesting as it is a real world client-server application. We decided to take a look at SAP's benchmark database. The results below were all run on Windows 2003 Enterprise Edition with an MS SQL Server 2005 database (both 64-bit). Every "2-tier Sales & Distribution" benchmark was performed with SAP's latest ERP 6 Enhancement Package 4. These results are NOT comparable with any benchmark performed before 2009: the new "2009" version of the benchmark obtains scores that are 25% lower. We analyzed the SAP benchmark in depth in one of our previous server oriented articles. The profile of the benchmark has remained the same:

 

  • Very parallel resulting in excellent scaling
  • Low to medium IPC, mostly due to "branchy" code
  • Somewhat limited by memory bandwidth
  • Likes large caches (memory latency!)
  • Very sensitive to sync ("cache coherency") latency

 

And here are the results:

 

[Chart: SAP Sales & Distribution 2-Tier benchmark]

Some of you may have already made this analysis: the one year old quad Xeon platform is outperformed by servers that are three times cheaper. The best dual Xeon makes the quad Xeon look ridiculous, outrunning it by 15%. The quad Opteron 8389 at 2.9GHz takes a beating too, but its bigger brother, the Opteron 8435, takes revenge by running circles around the Intel hex-core: it is no less than 50% faster!

While performance is not the only factor to consider, the least you expect from a quad platform is that it offers somewhat better performance than a cheaper dual socket server. This is exactly what we have been discussing in our "General IT" blog: the hex-core Opteron may tip the balance back in favor of a quad socket platform for part of the server market. We are not impressed by the 30% performance advantage of 24 cores over 8, and those looking for the highest raw performance will probably be disappointed. But for a large part of the market, performance is only one of the factors, and the 30% extra may well be good enough to convince people to consider a quad socket platform. Other factors - more memory and expansion slots, slightly better RAS capabilities, and less power for the same number of applications - might make a quad socket server the better choice for those people.



The Number One Reason for Quad Socket

VMmark - which we discussed in great detail here - tries to measure typical consolidation workloads: a combination of a light mail server, database, file server, and website with a somewhat heavier Java application. One VM just sits idle, representative of workloads that have to be online but perform very little work (for example, a domain controller). In short, VMmark targets the scenario where you want to consolidate lots and lots of smaller apps on one physical server.

 

[Chart: VMware VMmark]

The VMmark scores of the Xeon X5570 make some of the quad socket platforms look silly - once again. The 16 "Shanghai" cores are 13% slower and the 24 "Dunnington" cores are 15% slower than the eight SMT-enabled cores of the X5570. While raw processing power and the excellent optimizations for Hyper-Threading are the main reasons why the X5570 is superior, we suspect they are not the only ones; we'll discuss this in more detail later in this article. Luckily for AMD, the quad Opteron 8435 stays out of reach of Intel's best server platform.



vApus Mark I: Performance-Critical applications virtualized

You might remember from our previous article that vApus Mark I, our in-house developed virtualization benchmark, is designed to measure the performance of "heavy", performance-critical applications. Virtualization vendors are very actively promoting the idea that you should virtualize these OLTP databases and heavy websites too, so that the virtualization software can manage them dynamically. In other words, if you want high availability, load balancing, and low power (by shutting down servers which are not used), everything should be virtualized.

That is where vApus Mark I comes in: one OLAP database, one OLTP database, and two heavy websites are combined in one tile. These are the kind of demanding applications that still received their own dedicated, natively running machine a year ago; vApus Mark I shows what will happen if you virtualize them. If you want to fully understand our benchmark methodology, vApus Mark I has been described in great detail here. We enabled large pages, as this is generally considered a best practice with AMD's RVI and Intel's EPT.

vApus Mark I uses four VMs with four server applications:

 

  • An SQL Server 2008 x64 database running on Windows 2008 64-bit, stress tested by our in-house developed vApus test.
  • Two heavy duty MCS eFMS portals running PHP, IIS on Windows 2003 R2, stress tested by our in-house developed vApus test.
  • One OLTP database, based on the Oracle 10g Calling Circle benchmark by Dominic Giles.

 

The beauty is that vApus (stress testing software developed by the Sizing Servers Lab) uses actions performed by real people (as recorded in logs) to stress test the VMs, not some benchmarking algorithm.

To make things more interesting, we enabled and disabled HT Assist on the quad Opteron 8435 platform. HT Assist (described here in detail) takes 1MB from the L3 cache, reducing it to 5MB, and uses that 1MB as a very fast directory that eliminates a lot of snoop traffic. Eliminating snoop traffic reduces the "bandwidth pressure" on the CPU interconnects (hence the name HyperTransport Assist), but more importantly it reduces the latency of a cache request.

 

[Chart: vApus Mark I 2 tile test - ESX 4.0]

Thanks to HT Assist, the 24 Opteron cores communicate and perform about 9% faster. That is not huge, but it widens the gap with the dual Xeon somewhat. The dual Xeon X5570 keeps up with the much more expensive quad socket Intel server: eight cores are just as fast as 24.

Two tiles, 4 VMs per tile, and 4 vCPUs per VM: a total of 32 vCPUs were active in the previous test. 32 vCPUs are hard to schedule on hex-core CPUs, especially with only 24 physical cores in total. So let us see what happens if we reduce the total number of vCPUs to 24.


8 VMs, 2 tiles of vApus Mark I, 24 vCPUs

We reduced the number of vCPUs on the web portal VMs from 4 to 2. That means that we have:

 

  • Two times 4 vCPUs for the OLAP test
  • Two times 4 vCPUs for the OLTP test
  • Four times 2 vCPUs for the web tests (two web VMs per tile)

 

That makes a total of 24 vCPUs. The 32 vCPU test is somewhat biased towards the quad-core CPUs such as the Xeon X5570 while the test below favors the hex-cores.
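For clarity, the vCPU arithmetic for both configurations, with each tile holding one OLAP VM, one OLTP VM, and two web VMs:

```python
TILES = 2

def total_vcpus(olap, oltp, web):
    """Total vCPUs across all tiles; per tile: 1 OLAP VM + 1 OLTP VM + 2 web VMs."""
    per_tile = olap + oltp + 2 * web
    return TILES * per_tile

# Original test: every VM gets 4 vCPUs
assert total_vcpus(olap=4, oltp=4, web=4) == 32
# Reduced test: web portal VMs cut from 4 to 2 vCPUs
assert total_vcpus(olap=4, oltp=4, web=2) == 24
```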

 

[Chart: vApus Mark I 2 tile test - 24 vCPUs - ESX 4.0]

The "Dunnington" platform beats the 16-thread, 8-core Nehalem server, but it is nothing to write home about: the 24-core machine outperforms Intel's latest dual socket by just 6%. The advantage of the Opteron 8435 over the Xeon X7460 shrinks from 28% to 21%, but that is still a tangible performance advantage. Our understanding of virtualization performance is growing; take a look at the table below.

Virtualization Testing Results
Server System Comparison                      vApus Mark I (24 vCPUs)  vApus Mark I (32 vCPUs)  VMmark
Quad Xeon X7460 vs. Dual Xeon X5570 2.93      6%                       -2%                      -15%
Quad Opteron 8435 vs. Dual Xeon X5570 2.93    29%                      26%                      21%
Quad Opteron 8435 vs. Quad Xeon X7460         21%                      28%                      42%
Dual Xeon X5570 2.93 vs. Dual Opteron 2435    11%                      30%                      54%

Notice how the VMmark benchmark absolutely prefers the new "Nehalem" platform: there, the dual Xeon X5570 is 54% faster than the dual Opteron 2435, while its lead is only 11-30% in vApus Mark I. The quad Opteron 8435 is up to 29% faster than Intel's speed demon in vApus Mark I, while VMmark indicates only a 21% lead. Notice also that vApus Mark I is friendlier towards the Intel hex-core: VMmark tells us that eight Nehalem cores are 15% faster than 24 Dunnington cores, while vApus Mark I tells us that the quad X7460 is about as fast as the dual Xeon X5570. So why is VMmark so much happier on the Xeon X5570 server? The answer might be found in the table below.

One VMmark tile generates about 21,000 interrupts per second, 22MB/s of storage I/O, and 55Mbit/s of network traffic. We have profiled vApus Mark I in depth before. The table below compares both benchmarks from a hypervisor point of view.

Virtualization Benchmarks Profiling
                             vApus Mark I (Dual Xeon X5570)  VMmark (Dual Xeon X5570)
Total interrupts per second  2 x 19K/s = 38K/s               17 x 21K/s = 357K/s
Storage                      2 x 4.1MB/s = 8.2MB/s           17 x 22MB/s = 374MB/s
Network                      2 x 50Mbit/s = 100Mbit/s        17 x 55Mbit/s = 935Mbit/s

VMmark places a lot more stress on the hypervisor and the way it handles I/O: it produces about 10 times more interrupts and over 45 times more storage I/O. We know from our profiling that vApus Mark I does a lot of page management, which is a logical result of the application choice (databases that open and close connections) and the amount of memory per VM.
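Scaling the per-tile numbers by the tile counts used on each benchmark makes the I/O gap explicit; a quick sketch based on the profiling table above:

```python
# Per-tile load on the dual X5570 (from the profiling table) and tiles per run
vapus  = {"tiles": 2,  "interrupts_k": 19, "storage_mb": 4.1, "net_mbit": 50}
vmmark = {"tiles": 17, "interrupts_k": 21, "storage_mb": 22,  "net_mbit": 55}

def totals(bench):
    """Multiply every per-tile metric by the number of tiles."""
    return {k: bench["tiles"] * v for k, v in bench.items() if k != "tiles"}

va, vm = totals(vapus), totals(vmmark)
print(va)  # {'interrupts_k': 38, 'storage_mb': 8.2, 'net_mbit': 100}
print(vm)  # {'interrupts_k': 357, 'storage_mb': 374, 'net_mbit': 935}
print(round(vm["interrupts_k"] / va["interrupts_k"]))  # ~9x the interrupts
print(round(vm["storage_mb"] / va["storage_mb"]))      # ~46x the storage I/O
```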

The result is that VMmark, with its huge number of VMs per server (up to 102 VMs!), places a lot of stress on the I/O systems. The Intel Xeon X5570's crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation is that the VMDq implementation (multiple queues and offloading of the virtual switch to the hardware) of the Intel NICs is better than that of the Broadcom NICs typically found in AMD based servers.



Power Consumption

We have already done fairly comprehensive power consumption comparisons; note, however, that the numbers for the dual Xeon X5570 are higher than in that article. The reason is that we now use the 1U ASUS RS700-E6 and not one of the nodes of the ASUS twin server. The latter has been sent back to ASUS, and the 1U model is of course a better choice if you want to compare with the other servers in this review, which are all single node servers. Our ASUS RS700-E6 consumes about 12W more than a node of the ASUS twin, which has a smaller motherboard with slightly fewer features.

As the dual Xeon X5570 had only 24GB, we determined how much one DIMM adds to the idle and full load power, and added that to the measured numbers. So all dual socket servers are represented with 32GB of RAM, while the quad socket servers have 64GB. The reason is that we want to know what happens if we replace two dual socket servers with one quad socket server.

 

[Chart: Power consumption at idle]

A software bug in ESX is the reason why most servers - except Intel's own - are not able to use EIST when running ESX 4.0 (VMware vSphere 4). This bug will be solved in ESX 4.0 SP1, which will be out at the end of this year. ASUS did not want to wait and circumvented the problem by adapting their BIOS; thumbs up to ASUS for addressing this problem at such short notice. As a result, all the servers in this review were able to use their power saving features on ESX, since the AMD platform does not suffer from the bug at all. To our surprise, enabling EIST did not decrease power significantly: we went from 165W to 162W, while enabling AMD's PowerNow! causes a drop of about 10%. As far as we know, there are no tools that can read out CPU clock and voltage data under ESX, so we have no way of verifying what is going on. A possible explanation is that EIST does work, but that the PCU (Power Control Unit) of the Xeon X55xx already shuts down so many parts of the die that the CPU sips very little power at idle, leaving little room for further improvement. Our measurements were confirmed when we measured with and without EIST on Intel's low power optimized "Willowbrook"/Chenbro server.

The problem of the quad socket servers is immediately apparent: they consume close to twice as much as the dual socket platforms. That is rather bad news: it means that buying one large server instead of two duals does not result in any tangible power savings. One of the reasons is that our dual hex-core Opterons, for example, work together with a very efficient 650W power supply, while our quad socket platforms use a relatively heavy 1000W PSU. Quad socket platforms still have a lot of room for improvement when it comes to power efficiency. What is worse is that the performance/watt of the dual servers is clearly better. Let us check the power numbers at full load.

 

[Chart: Power consumption at full load]

Again, the quad machines do not really convince us that they can save us a lot of power, especially from the performance/watt point of view. If you are not performance limited but memory limited, the quad machines might still make some sense.



Conclusion

When the Xeon X7460 "Dunnington" was launched in September 2008, our first impression was that the 503mm² chip was a brute force approach to crush AMD out of its last stronghold, the quad socket server. In hindsight, the primary reason why this server CPU impressed was the poor execution of AMD's "Barcelona" chip: still stuck at 2.3GHz and backed by a very meager 2MB L3 cache, the AMD server platform was performing well below its true capabilities. The advantage that AMD still held was that its fast NUMA interconnect platform was capable of much more; it was just a matter of improving the CPUs. Intel, on the other hand, has pushed the multiple FSB platform beyond its limits and needs to roll out a completely new server platform, a "QuickPath" quad socket platform. AMD has already improved its quad socket CPUs twice in one year, while Intel's updated quad platform will not be available before the beginning of 2010.

The end result is that servers based on a quad hex-core Opteron are about 20% to 50% faster, and at the same time consume 20% less, than the Intel hex-core alternatives. The E7450 has a slightly better performance/watt ratio than the X7460, but simple math shows that no matter which hex-core Xeon you choose, it is going to look bad. The X7460 and its brothers are toast; the Intel quad platform will not be attractive until Nehalem-EX arrives.

Until then, we have a landslide victory for the AMD quad Opteron platform - if only the pesky dual Xeon X5570 didn't spoil the party. Servers based on the X55xx series are the most expensive of the dual socket market, but they still cost about half (or even less than half) as much as the quad hex-core Opteron based servers. The memory slot advantage is also shrinking: an X55xx based server can realistically use 18 x 4GB or 72GB (maximum: 144GB). A quad Opteron based server typically has 32 slots and can house up to 128GB of RAM if you use affordable 4GB DIMMs (maximum: 256GB).
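The memory arithmetic behind those numbers, assuming 4GB DIMMs for the "affordable" figures and 8GB DIMMs for the maximums:

```python
def capacity_gb(slots, dimm_gb):
    """Total RAM when every slot is filled with identical DIMMs."""
    return slots * dimm_gb

# Dual Xeon X55xx: 18 realistically usable slots
assert capacity_gb(18, 4) == 72    # affordable 4GB DIMMs
assert capacity_gb(18, 8) == 144   # expensive 8GB DIMMs (maximum)

# Quad Opteron: typically 32 slots
assert capacity_gb(32, 4) == 128   # affordable 4GB DIMMs
assert capacity_gb(32, 8) == 256   # maximum
```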

Before you go for quad sockets, make sure your application scales beyond 16 cores. Most applications don't; we picked only applications (large databases, ERP, and virtualization) that typically scale well and that are the target applications for quad socket servers.

So who wins? Intel's dual socket, AMD's dual socket, or AMD's quad socket platform? The answer is that it depends on your performance/RAM ratio. The more performance you require per GB, the more interesting the dual Nehalem platform gets. The more RAM you need to obtain a certain level of performance, the more interesting the AMD quad platform gets.

 

For example, a small, intensively used database will probably sway you towards the dual Xeon X55xx server, as it is quite a bit cheaper to acquire and its performance/watt and performance/$ ratios are better. A very large database or a virtualization consolidation scenario requiring more than 72GB of RAM will probably push you towards the quad Istanbul: once you need more than 64-72GB, memory gets really expensive on the Intel dual socket platform. There are two reasons for this: 8GB DIMMs are five times more expensive than 4GB DIMMs, and DDR3 is still more costly than DDR2 (especially in large DIMMs).

So there you have it: the latest quad socket hex-core Opteron scales and performs so well that it beats its "natural" enemy, the Xeon X7460, by a large margin, especially from a performance/watt point of view. At the same time, it has to sweat very hard to shake off the dual socket Intel Xeon in quite a few applications. Servers with 24 of those fast cores can only really justify their higher price by offering more memory, and ironically cheaper memory at that. Choosing between a dual socket and quad socket server is mostly a matter of knowing the memory footprint of the applications you will run on it… and your own personal vision of the datacenter.

I would like to thank my colleague Tijl Deneut for his assistance.
