The upcoming Intel Nehalem CPU has been in the spotlight for months now. In contrast, and despite its huge die size and 1.9 billion (!) transistors, the 6-core Xeon 74xx is a wallflower for both the public and Intel's marketing. However, if you've invested in the current Intel platform, the newly launched Intel 74xx series deserves a lot more attention.

The Xeon 74xx, formerly known as Dunnington, is indeed a very interesting upgrade path for the older quad socket platform. All Xeon 74xx use the same mPGA604 socket as previous Xeons and are electrically compatible with the Xeon 73xx series. The Xeon 73xx, also known as Tigerton, was basically the quad socket version of the Xeon 53xx (Clovertown) that launched at the end of 2006. The new hex-core Dunnington combines six of the latest 45nm Xeon Penryn cores on a single die. As you may remember from our dual socket 45nm Xeon 54xx review, the 45nm Penryn core is about 10% to 20% faster than its older 65nm brother (Merom). There is more: an enormous 12MB to 16MB L3 cache ensures that those six cores access high latency main memory a lot less. This huge L3 also reduces the amount of "cache syncing" traffic between the CPUs, an important bottleneck for the current Intel server platforms.


2.66GHz, 6 cores, 3x3MB L2, and 16MB L3 cache: a massive new Intel CPU

With at least 10% to 20% better performance per core, two extra cores per CPU package, and an upgrade that only requires a BIOS update, the newest Xeon 7460 should be an attractive proposition if you are short on processing power.

Six Cores?

Dunnington was announced at past IDFs as "extending the MP leadership". Readers of our last quad socket report will understand that this is a questionable claim. Since AMD introduced the Opteron 8xxx in April 2003, there has never been a moment when Intel was able to lead the dance in the quad socket server market. Sure, the Intel 73xx was able to outperform the AMD chip in some areas (rendering), but the AMD quad-core was still able to keep up with the Intel chip in Java, ERP, and database performance. When it comes to HPC, the AMD chip was clearly in the lead.

Dunnington might not be the darling of Intel marketing, but the chip itself is a very aggressive statement: let us "Bulldoze" AMD out of the quad socket market with a truly gigantic chip that only Intel can produce without losing money. Intel is probably - courtesy of the impressive ultra low leakage 45nm high-K process technology - the only one capable of producing large quantities of CPUs containing 1.9 billion transistors, resulting in an enormous die size of 503 mm². That is almost twice the size of AMD's upcoming 45nm quad-core CPU Shanghai. Even IBM's flagship POWER6 processor (up to 4.7GHz) is only 341 mm² and only has 790 million transistors.

Processor Size and Technology Comparison

| CPU | Transistor count (million) | Process | Die size (mm²) | Cores |
|---|---|---|---|---|
| Intel Dunnington | 1900 | 45 nm | 503 | 6 |
| Intel Nehalem | 731 | 45 nm | 265 | 4 |
| AMD Shanghai | 705 | 45 nm | 263 | 4 |
| AMD Barcelona | 463 | 65 nm | 283 | 4 |
| Intel Tigerton | 2 x 291 = 582 | 65 nm | 2 x 143 = 286 | 4 |
| Intel Harpertown | 2 x 410 = 820 | 45 nm | 2 x 107 = 214 | 4 |
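
To put the die size figures in perspective, the short sketch below derives transistor density and the Dunnington-to-Shanghai area ratio from the numbers in the table. The dictionary layout and variable names are our own illustration, and density is simply transistors divided by die area; it says nothing about how much of each die is cache versus logic.

```python
# Transistor count (millions) and die size (mm^2), copied from the table above.
# Tigerton and Harpertown are two-die packages, so totals are used.
chips = {
    "Intel Dunnington": (1900, 503),
    "Intel Nehalem": (731, 265),
    "AMD Shanghai": (705, 263),
    "AMD Barcelona": (463, 283),
    "Intel Tigerton": (582, 286),
    "Intel Harpertown": (820, 214),
}

# Density in millions of transistors per mm^2.
density = {name: count / area for name, (count, area) in chips.items()}

for name, value in density.items():
    print(f"{name:18s} {value:.2f} M transistors/mm^2")

# How much larger is the Dunnington die than the Shanghai die?
area_ratio = chips["Intel Dunnington"][1] / chips["AMD Shanghai"][1]
print(f"Dunnington / Shanghai die area: {area_ratio:.2f}x")
```

The area ratio works out to roughly 1.9, which matches the "almost twice the size" remark above.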

The huge, somewhat irregular die - notice how the two cores in the top right corner are further away from the L3 cache than the other four - raises some questions. Such an irregular die could introduce extra wire delays, reducing the clock speed somewhat. Why did Intel not choose to go for an 8-core design? The basic explanation that Patrick Gelsinger, General Manager of Intel's Digital Enterprise Group, gave was that simulations showed that a 6-core design with 16MB L3 outperformed an 8-core design with a smaller L3 in the applications that matter the most in the 4S/8S space.


Layout of the new hex-core

TDP was probably the most important constraint that determined the choice of six cores, since core logic consumes a lot more power than cache. An 8-core design would make it necessary to reduce the clock speed too much. Even at 65nm, Intel was already capable of producing caches that needed less than 1W/MB, so we can assume that the 16MB cache consumes around 16W or less. That leaves more than 100W for the six cores, which allows decent clock speeds at very acceptable TDPs as you can see in the table below.
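
The power budget reasoning above can be written out as a back-of-the-envelope calculation. This is only a sketch under the assumptions stated in the text: the 1 W/MB figure is the upper bound quoted for Intel's 65nm caches, the 130 W TDP is that of the X7460, and the L2 caches, bus interface, and other non-core logic are ignored, so the real per-core figure is lower.

```python
# Rough power budget for the Xeon X7460, using the figures from the text.
TDP_W = 130             # X7460 thermal design power
L3_MB = 16              # L3 cache size
CACHE_W_PER_MB = 1.0    # assumed upper bound: "less than 1W/MB"
CORES = 6

l3_power_w = L3_MB * CACHE_W_PER_MB   # at most about 16 W for the L3
core_budget_w = TDP_W - l3_power_w    # what is left for everything else
per_core_w = core_budget_w / CORES    # naive even split across six cores

print(f"L3 cache:       <= {l3_power_w:.0f} W")
print(f"Left for cores: >= {core_budget_w:.0f} W")
print(f"Per core:       about {per_core_w:.0f} W")
```

Under these simplifications, the six cores share at least 114 W, or about 19 W each.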

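Processor Speed and Cache Comparison

| Xeon model | Speed (GHz) | Cores | L2 Cache (MB) | L3 Cache (MB) | TDP (W) |
|---|---|---|---|---|---|
| X7460 | 2.66 | 6 | 3x3 | 16 | 130 |
| E7450 | 2.4 | 6 | 3x3 | 12 | 90 |
| X7350 | 2.93 | 4 | 2x4 | 0 | 130 |
| E7440 | 2.4 | 4 | 2x3 | 12 | 90 |
| E7340 | 2.4 | 4 | 2x4 | 0 | 80 |
| E7330 | 2.4 | 4 | 2x4 | 0 | 80 |
| E7430 | 2.13 | 4 | 2x3 | 12 | 90 |
| E7420 | 2.13 | 4 | 2x3 | 8 | 90 |
| L7455 | 2.13 | 6 | 3x3 | 12 | 65 |
| L7445 | 2.13 | 4 | 2x3 | 12 | 50 |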

The other side of the coin is that Dunnington probably uses an L3 cache that runs at half the clock speed of the cores. We recorded a 103 cycle latency, measured with a 2.66GHz CPU (39 ns), for the L3 cache.


Dunnington cache hierarchy

In comparison, the - admittedly much smaller - L3 cache of the quad-core Opteron needs 48 cycles (using a 2.5GHz chip, or 19 ns). Dunnington's L3 cache is thus about half as fast as the one found in the Barcelona core, so the L3 is a compromise where the engineers traded speed for size and power consumption.
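
The nanosecond figures quoted for both L3 caches follow directly from the cycle counts and clock speeds. A minimal sketch of the conversion, where the function name `cycles_to_ns` is our own:

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


dunnington_ns = cycles_to_ns(103, 2.66)  # Xeon X7460 L3, as measured above
opteron_ns = cycles_to_ns(48, 2.5)       # quad-core Opteron (Barcelona) L3

print(f"Dunnington L3: {dunnington_ns:.1f} ns")
print(f"Opteron L3:    {opteron_ns:.1f} ns")
print(f"Dunnington takes {dunnington_ns / opteron_ns:.2f}x as long")
```

The ratio comes out at roughly 2, which is where the "about half as fast" comparison comes from.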


34 Comments


  • duploxxx - Thursday, November 13, 2008 - link

    your virtualisation life was very short, perhaps marketing can keep you alive for a while since on paper you are better with the amount of cores.

    your 24 cores @2,66ghz are just killed by 16cores @2,7ghz

    http://www.vmware.com/products/vmmark/results.html
  • synergyek - Wednesday, October 15, 2008 - link

    Why only testing scanline render? It's a slow and old monster. Can you add mental ray render to your tests or, maybe, vray, which is used in arch. visualizations? Also you can use Maya 32/64-bit (software, hardware, mental ray tests) for both windows and linux platforms. Mental ray on Vray uses all cores available in the system, and results must be much better, than ordinary scanline.
  • duploxxx - Saturday, September 27, 2008 - link

    Nice article, altough in virtualisation with VMmark it was already clear that the new dunnington had more headroom with the additional cores.

    only few remark, since you are talking about a retail price of +25000euro you could at least add for information that there are 8socket barcelona for about 5000euro more that scale again way better then dunnington with its 32 cores. So indeed intel did a step up again after there tigertown was heavy beaten by new barcelona in 4s even in low speed but at a certain cost of platform, afterall this dunnington is not cheap. it will be the question what a 4s shangai @3.0 ghz will do against this 6 core giant, afterall it is a huge die and the shangai will be way cheaper and consume less.

    lets hope you update this nice article with the soon to be released shangai.
  • Sirlach - Friday, September 26, 2008 - link

    From my research when the hex cores were announced the super micro boards came with an x16 slot. Is it possible to see how CPU restricted multithreaded games perform on this monster? Since it is running server 2008 this is theoretically possible!
  • BaronMatrix - Thursday, September 25, 2008 - link

    It seems like a better comparison would be with the number of cores the same. You could take a 4S and remove one chip and match that against a 2S Dunnington.

    From what I saw, it is nowhere near 50% faster though it has 50% more cores plus 4 times the cache. It looks like Intel may NEVER catch up with Opteron. Shanghai will just increase the difference.

    It's just a shame Hector decided to have a "devalue the brand name" fire-sale or we'd be much closer to Bulldozer and SSE5.
  • trivik12 - Thursday, September 25, 2008 - link

    4S has been one market where AMD dominated even after conroe's release. With Tigerton intel chipped away AMD's market share bcos of barcelona issues. with Dunnington Intel has a performance advantage. U dont look at per core performance but overall platform performance. AMD needs to catch up soon bcos with beckton AMD will be behind 8th ball in that market as well.
  • snakeoil - Wednesday, September 24, 2008 - link

    intel is cannibalizing nehalem this are desperate measures from a desperate man.
    this is a dead end road, sooner or later intel will have to dump the front side bus,but its evident that intel is not very confident about nehalem and quick path.
    these processor are the last kick of an agonizing technology.
    this is just a souped up old car. nothing more.
  • kingmouf - Wednesday, September 24, 2008 - link

    Although a good thing for testing, I'm wondering if by making artificially the VMs more processing intensive rather than memory intensive, one is getting a quite wrong idea of the power consumption between the Intel and the AMD systems.

    Off-chip activity (coming from signal amplifiers, sensors, external buses etc) results in great power consumption. Actually one should expect it to be a crucial part of the total consumption of a system. In this case, I believe the AMD system has an advantage with the memory controller being incorporated in the processor chip. To one extend this also becomes clear in your testing.

    Comparing the Intel CPUs one may observe that the 6-core part has a huge cache memory that seriously limits the main memory accesses. In the case of the 6VMs, there will also be reduced inter-socket communication. Both result in very serious reductions in off-chip activity, which materialises in a whopping 25% reduction in power usage.

    Therefore I believe that making the benchmarking process more memory intensive, as you point is the real-world scenario, AMD could earn quite a few points there.


    On a more general argument now, I can't stop thinking that chips like the 74xx Xeons are somewhat a waste of transistors. Intel is simply following the "bully" path rather than the "smart" path. I cannot stop thinking what would the results look like if instead of the two extra cores and the huge amount of cache, Intel added a TCP offload engine, a true hardware RAID controller, a block cipher accelerator, a DSP engine or an extra FP processor core (I'm not mentioning a memory controller because someone will pop up and say that they have already done that in the nehalem). All these things - and one could add much more - are integral to any server or HPC system and I believe can offer much more countable results than the two extra cores. Better performance and definitely better power usage. On the other hand, considering the weaknesses of AMD, maybe that is the company that should really get down to it.

    Not long there was a lot of hype of AMD opening up their socket and coherent HyperTransport so that people could actually produce accelerators. What has happened with that? Are there any products on that market? It would be interesting to see some benchmarking with these things. :)
  • JohanAnandtech - Thursday, September 25, 2008 - link

    "Im wondering if by making artificially the VMs more processing intensive rather than memory intensive, one is getting a quite wrong idea of the power consumption between the Intel and the AMD systems. "

    You are right that most virtualized workloads (including the OLTP ones) will need a lot more memory *space*, but they are not necessarily more memory intensive. It is good practice for example to use another scheduler to make it more CPU intensive: you are getting more transactions per second on the same machine. It is pretty bad to lose your watts on anything else but transactions.
  • jedz - Wednesday, September 24, 2008 - link

    It's pretty obvious that AMD is not competing neck to neck in the server arena with their current opteron offerings because of the fact that they are way behind Intel's, and it's not right to compare the opteron to an intel7460 in terms of performance/watt. Why don't you wait for AMD's Shanghai and then redo this benchmarking process.

    Maybe it will do justice for AMD....
