Intel's Xeon E5-2600 V2: 12-core Ivy Bridge EP for Servers
by Johan De Gelas on September 17, 2013 12:00 AM EST

Other Improvements
One of the improvements that caught our attention was the "Fast Access of FS & GS base registers". We were under the impression that segment registers were not used in a modern OS with 64-bit flat addressing (with the exception of the Binary Translation VMM of VMware), but the promise of "Critical optimization for large thread-count server workloads" in Intel's Xeon E5-2600 V2 presentation seems to indicate otherwise.
Indeed, modern operating systems make almost no use of the segment registers, but FS and GS are the exception. The GS register (for 64-bit; FS for 32-bit x86) points to the Thread Local Storage descriptor block. That block stores information unique to each thread and is accessed quite a bit when many threads are running concurrently.
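To make the role of those registers concrete, here is a minimal C sketch (our illustration, not Intel's code): every access to a thread-local variable is resolved relative to the thread's TLS block, i.e. through a segment-relative address, so the FS/GS base ends up on the hot path of any workload that runs many threads.

```c
/* tls_example.c -- hypothetical illustration of thread-local storage access.
 * Each thread gets its own copy of "counter"; the compiler addresses it
 * relative to the thread's TLS block, i.e. via a segment-prefixed access
 * (FS- or GS-relative, depending on OS and mode).
 * Build with: gcc -O2 -pthread tls_example.c
 */
#include <stdio.h>
#include <pthread.h>

static __thread unsigned long counter = 0;   /* one instance per thread */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                           /* segment-relative load/store */
    printf("thread %lu: counter = %lu\n",
           (unsigned long)pthread_self(), counter);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```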
That sounds great, but unfortunately operating system support alone is not sufficient to benefit from this. An older Intel presentation states that this feature is implemented by adding "Four new instructions for ring-3 access of FS & GS base registers". GCC 4.7 (and later) has a flag ("-mfsgsbase") that lets recompiled code make use of these instructions. So although Ivy Bridge could make user thread switching a lot faster, it will take a while before commercial code actually takes advantage of this.
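For what it is worth, GCC exposes the new instructions as intrinsics in immintrin.h when building with that flag. The sketch below only illustrates the intrinsic names; it assumes an FSGSBASE-capable CPU and an operating system that has actually enabled the feature (CR4.FSGSBASE), otherwise the instructions fault.

```c
/* fsgsbase_example.c -- hedged sketch of ring-3 base-register access.
 * Build with: gcc -O2 -mfsgsbase fsgsbase_example.c
 * NOTE: this only runs if the OS has set CR4.FSGSBASE; on kernels without
 * that support the instructions raise an illegal-instruction fault.
 */
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int main(void)
{
    /* Read the FS and GS base registers directly from user mode --
     * previously this required a system call or a ring-0 MSR access. */
    uint64_t fs_base = _readfsbase_u64();
    uint64_t gs_base = _readgsbase_u64();

    printf("FS base: 0x%016llx\n", (unsigned long long)fs_base);
    printf("GS base: 0x%016llx\n", (unsigned long long)gs_base);
    return 0;
}
```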
Other ISA optimizations (Float16 to and from single-precision conversion) will be useful for some image/video processing applications, but we cannot imagine that many server applications will benefit from them. HPC and render farms, on the other hand, may find them useful.
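For completeness, the half-precision conversions (F16C) are also exposed as compiler intrinsics; a minimal sketch, assuming GCC with -mf16c on an F16C-capable CPU (Ivy Bridge and later):

```c
/* f16c_example.c -- sketch of Float16 <-> single-precision conversion.
 * Build with: gcc -O2 -mf16c f16c_example.c
 * Assumes the CPU reports F16C support.
 */
#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    float in[4] = { 1.0f, 0.5f, -2.25f, 3.14159f };

    /* Pack four floats into four 16-bit half-precision values ... */
    __m128  sp = _mm_loadu_ps(in);
    __m128i hp = _mm_cvtps_ph(sp, _MM_FROUND_TO_NEAREST_INT);

    /* ... and expand them back to single precision. */
    __m128 back = _mm_cvtph_ps(hp);

    float out[4];
    _mm_storeu_ps(out, back);

    for (int i = 0; i < 4; i++)
        printf("%f -> %f\n", in[i], out[i]);
    return 0;
}
```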
The Uncore
The uncore has received some modest improvements too. The snoop directory now has three states (Exclusive, Modified, Shared) instead of two, and it now improves server performance in 2-socket configurations as well. In Sandy Bridge the snoop directory was disabled in 2-socket configurations as it hampered performance (disabling it is also a best practice on the Opterons).
Snoop broadcasting has also become a lot more "opportunistic". If lots of traffic is going on, broadcasts are avoided; if very little is going on, it "snoops away". If it is likely that the snoop directory will not have the entry, the snoop is issued before the directory responds. Opportunistic snooping reduces snoop traffic, and as a result multi-core performance scales better, which is quite important when you are dealing with up to 24 physical cores in a system.
Wrapping up, maximum PCI Express bandwidth when performing two-thirds reads and one-third writes has been further improved from 80GB/s (using quad-channel 1600 MT/s DDR3) to 90GB/s. There are now two memory controllers instead of one to reduce latency, and bandwidth is also improved thanks to support for DDR3-1866. Lastly, the half-width QPI mode is disabled in turbo mode, as it is very likely that there is a lot of traffic over the interconnects between the sockets; turbo mode is after all triggered by heavy CPU activity.
Comments
Kevin G - Tuesday, September 17, 2013 - link
I'd be careful about using Java benchmarks on those SPARC chips for an overall comparison. The results on the SPARC side are often broken. x86 has been ahead of SPARC for many years; only with the most recent chips has Oracle produced a very performance-competitive chip.
The only other architecture that outruns Intel's best x86 chips is the POWER7/POWER7+. When the POWER8 ships, it is expected to be faster still.
Brutalizer - Thursday, September 19, 2013 - link
@Kevin G"...The results on the SPARC side are often broken..."
What do you mean by that? The Oracle benchmarks are official and anyone can see how they were run. Regarding SPARC T5 performance, it is very fast, typically more than twice as fast as Xeon CPUs. Just look at the official, accepted benchmarks on the site I linked to.
Kevin G - Friday, September 20, 2013 - link
@Brutalizer: There is a SPEC subtest whose result on SPARC is radically higher than on other platforms, and the weight of that one test skews the overall score. It has been a few years since I read up on this, and SPARC as a platform has genuinely become performance competitive again.
Phil_Oracle - Friday, February 21, 2014 - link
Are you talking about libquantum?
http://www.spec.org/cpu2006/Docs/462.libquantum.ht...
I believe IBM is the worst culprit on this subtest, showing a significant difference between base and peak. More so than any other vendor.
http://www.spec.org/cpu2006/results/res2012q4/cpu2...
But today, I believe all vendors have figured out how to improvise (cheat) on this test, even Xeon based.
http://www.spec.org/cpu2006/results/res2014q1/cpu2...
That’s why I believe SPEC CPU2006 is outdated and needs replacing, and I suggest looking at more realistic, recent (real world) benchmarks like SPECjbb2013, SPECjEnterprise2010 or even TPC-H.
Phil_Oracle - Friday, February 21, 2014 - link
x86 was clearly ahead of SPARC until about the SPARC T4 timeframe, when Oracle took over R&D on SPARC. The SPARC T4 allowed Oracle to level the playing field, especially in areas like database where the SPARC T4 really shines, considering the many world-record benchmarks that were released. When the SPARC T5 came out last year, it increased performance by 2.4x, clobbering practically every other CPU out there. Today, you'll be hard pressed to find a real-world benchmark, one that is fully audited, where the SPARC T5 is not in a leadership position, whether Java based like SPECjbb2013 or SPECjEnterprise2010 or database like TPC-C or TPC-H.

psyq321 - Tuesday, September 17, 2013 - link
I know that the EX will be using the (scalable) memory buffer, which is probably the main reason for the separate pin-out. I guess they could still keep both memory controllers in and fuse off the appropriate one depending on whether it is an EX or EP SKU, if that would still make sense from a production perspective.

Kevin G - Tuesday, September 17, 2013 - link
It wouldn't make much sense, as the EX lineup moves the DDR3 physical interface off to the buffer chip. There is a good chunk of die space used for the 256-bit wide memory interface in the EP line. Going the serial route, the EX line essentially doubles the memory bandwidth while using the same number of IO pins (though at the cost of latency).

The number of PCIe lanes and QPI links also changes between the EP and EX lineups. The EP has 40 PCIe lanes whereas the EX has 32. There are 4 QPI links in the EX lineup, making those chips ideal for 4- and 8-socket systems, whereas the EP line has 2 QPI links, good for dual-socket or a poor quad-socket configuration.
psyq321 - Wednesday, September 18, 2013 - link
Hmm, this source: http://www.3dcenter.org/news/intel-stellt-ivy-brid... claims that HCC is a 12-15 core design. They also have a die shot of a 15-core variant.
Kevin G - Thursday, September 19, 2013 - link
I'll one-up you: Intel technical reference manuals (PDF).
http://www.intel.de/content/dam/www/public/us/en/d...
http://www.intel.de/content/dam/www/public/us/en/d...
It does appear to be 15 cores, judging from the mask in the CSR_DESIRED_CORES register.
However, there is no indication that the die supports serial memory links to a buffer chip or the >3 QPI links that an EX chip would have.
psyq321 - Thursday, September 19, 2013 - link
Well, I guess without Intel openly saying so or somebody laser-cutting the die, it is not possible to know for sure whether the HCC EP and EX share a die.

However, there are lots of "hints" that the B-package Ivy Bridge EP hides more cores, like the ones you linked. If that is the case, it is really a shame that Intel did not enable all the cores in the EP line. There would still be plenty of room for differentiation between the EX and EP lines, since the EX contains RAS features without which the target enterprise customers would probably not even consider the EP, even if it had the same number of cores.
Also, Ivy EX will have some really high TDP parts.