ARM Challenging Intel in the Server Market: An Overview
by Johan De Gelas on December 16, 2014 10:00 AM EST
Cavium Thunder-X
A few months ago, we talked briefly with the people of Cavium. Cavium specializes in designing MIPS SoCs that enable intelligent networking, communications, storage, video, and security applications. The picture below sums it all up: present and future.
Cavium's "Thunder Project" started from Cavium's existing Octeon III network SoC, the CN78xx. Cavium's bread and butter has been integrating high-speed network capabilities in SoCs, so you will be able to choose between SoCs that have 100Gbit Ethernet and 10Gbit Ethernet. PCI Express root complexes and multiple SATA ports are all integrated. There is no doubt that Cavium can design a highly integrated, feature-rich SoC, but what about the processing core?
The MIPS cores inside the Octeon are much simpler – dual-issue in-order – but also much smaller and need very little power compared to a typical server core. Four (28nm) MIPS cores can fit in the space of one (32nm) Sandy Bridge core.
Replace the MIPS decoders with ARMv8 decoders and you are almost there. However, while the Cavium Thunder-X is definitely not made to run SAP, server workloads are a bit more demanding than network processing, so Cavium needed to beef up the Octeon cores. The new Thunder-X cores are still dual-issue, but they're now out-of-order instead of in-order, and the pipeline length has been increased from eight to nine stages to allow for higher clocks. Each core has a 78KB L1 instruction cache and a 32KB data cache.
The 39-way 78KB L1 I-cache is certainly odd, but it might be more than just "network processor heritage". Our own testing and a few academic studies have shown that scale-out workloads such as memcached have a higher than normal (meaning the typical SPECint_rate2006 characterization) I-cache miss rate. The reason is that these applications run a lot of kernel code, and more specifically the code of the network stack. As a result, the software footprint is much larger than expected.
Another reason why we believe Cavium has done its homework is the fact that more die area is spent on cores (up to 48) than on large caches; an L3 cache is nowhere to be found. The Thunder-X has only one centralized, relatively low-latency 16MB L2 cache running at full core speed. A lot of academic studies have confirmed that a large L3 cache is a waste of transistors for scale-out workloads. Besides the most used instructions that reside in the I-cache, there is a huge amount of less frequently used kernel code that does not fit in an L3 cache. In other words, an L3 cache just adds more latency to requests that missed the L1 cache and that will end up in the DRAM anyway. That is also the reason why Cavium made sure that a beefy memory subsystem is available: the Thunder-X comes with four 72-bit DDR3/4 memory controllers and it currently supports the fastest DRAM available for servers: DDR4-2133.
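To put the "beefy memory controller" claim in numbers, here is a back-of-the-envelope sketch of the theoretical peak bandwidth of four DDR4-2133 channels. It assumes that 64 of the 72 bits on each channel carry data and the remaining 8 carry ECC, which is the usual layout for server DIMMs but is not stated by Cavium; the function name and the per-core split are ours, for illustration only. Sustained bandwidth in practice will be lower than this peak.

```python
# Rough peak-bandwidth estimate for the memory configuration described above.
# Assumption (not from Cavium): each 72-bit channel carries 64 data bits + 8 ECC bits.

def peak_bandwidth_gb_per_s(channels, mega_transfers_per_s, data_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = data_bits // 8
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

CHANNELS = 4          # four memory controllers, per the article
DDR4_MT_PER_S = 2133  # DDR4-2133
CORES = 48            # top core count, per the article

total = peak_bandwidth_gb_per_s(CHANNELS, DDR4_MT_PER_S)
per_core = total / CORES

print(f"Peak bandwidth per socket: {total:.1f} GB/s")
print(f"Peak bandwidth per core:   {per_core:.2f} GB/s")
```

Under these assumptions the socket tops out at roughly 68 GB/s, or a bit over 1.4 GB/s per core when all 48 cores are streaming from memory at once.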
On the flip side, having 48 cores with a relatively small 32KB D-cache that access one centralized 16MB L2 cache also means that the Thunder-X is less suited for some "traditional" server workloads such as SQL databases. So a Thunder-X core is simpler and probably quite a bit weaker than an ARM Cortex-A57 in some ways, let alone an X-Gene core. The fact that the Thunder-X spends far fewer transistors on cache than on cores clearly indicates that it is targeting other workloads. Single-threaded performance is likely to be lower than that of the AMD Seattle and X-Gene, but it could be close enough: the Thunder-X will run at 2.5GHz, courtesy of GlobalFoundries' 28nm process technology. Cavium is claiming that even the top SKU will keep the TDP below 100W.
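A quick sketch of the arithmetic behind that cache trade-off, using only the figures quoted in the article (48 cores, 16MB shared L2, 78KB L1 I-cache and 32KB L1 D-cache per core). The even split of the L2 is purely an illustration of the average share; a shared cache is not actually partitioned this way, and the variable names are ours.

```python
# Average on-chip cache available per core, using the figures from the article.
# The even L2 split is illustrative only: a shared L2 is not statically partitioned.

CORES = 48
L2_TOTAL_KB = 16 * 1024  # one centralized 16MB L2
L1I_KB = 78              # per-core L1 instruction cache
L1D_KB = 32              # per-core L1 data cache

def l2_share_kb(l2_total_kb, cores):
    """Average slice of the shared L2 per core, in KB."""
    return l2_total_kb / cores

l2_per_core = l2_share_kb(L2_TOTAL_KB, CORES)
cache_per_core = L1I_KB + L1D_KB + l2_per_core

print(f"Average L2 share per core:    {l2_per_core:.0f} KB")
print(f"Total cache per core (L1+L2): {cache_per_core:.0f} KB")
```

With all 48 cores busy, each core averages about 341KB of L2, which illustrates why data-heavy workloads such as SQL databases are not the target.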
There is more. The Thunder-X uses Cavium's proprietary Cavium Coherent Processor Interconnect (CCPI) and can thus work in a dual-socket NUMA configuration. As a result, a Thunder-X based server can have up to 96 cores and is capable of supporting 1TB of memory, 512GB per socket. Multiple 10/40GbE, PCIe Root Complex, and SATA controllers are integrated in the SoC. Depending on the SKU, TCP/IPSec offload and SSL accelerators are also integrated.
The recent launch of Cavium's Thunder-X SKUs makes it clear that Cavium is trying to compete with the venerable Xeon E5 in some niche but large markets:
- ThunderX_CP: For cloud compute workloads such as public and private clouds, web caching, web serving, search, and social media data analytics.
- ThunderX_ST: For cloud storage, big data, and distributed databases.
- ThunderX_NT: For telecom/NFV server and embedded networking applications.
- ThunderX_SC: For secure computing applications.
Considering Cavium's background and expertise, it is pretty obvious that ThunderX_NT and SC should be very capable challengers to the Xeon E5 (and Xeon-D), but only a thorough review will tell how well the ThunderX_CP will do. One of the strongest points of Calxeda was the highly integrated fabric that lowered the total power consumption and network latency of such a server cluster. Just like AMD/Seamicro, Cavium is well positioned to make sure that the Thunder-X based server clusters also have this high level of network/compute integration.
78 Comments
jhh - Tuesday, December 16, 2014 - link
SPARC and Power have had trouble keeping up with Moore's law, as neither sold enough to amortize R&D to push out innovation at the same rate as Intel. As Moore's law comes to an end, this will stop being a unique Intel advantage. It just might be too late for both of them. One can see the pressure on IBM, with their opening the Power architecture in similar ways to ARM. Both POWER and SPARC have to keep up with porting drivers to their Unix implementations, while the device manufacturers either write drivers for Linux or don't get volume. I just can't see either POWER or SPARC being cost effective over the long run. And, when others see the same thing, they aren't going to be excited about porting application software to those platforms.
ARM needs to have a good performance/power and performance/cost ratio to get people excited to buy something other than Intel. They are certainly getting enough volume from the low end to invest in high-end parts. So far, I'm not excited enough to recommend any ARM proof-of-concept though.
Kevin G - Wednesday, December 17, 2014 - link
IBM always had a licensing model similar to ARM with PowerPC cores. The only thing really new here is that IBM is licensing out their flagship POWER chip in the same manner. Despite Intel having a process advantage, IBM was able to keep up in performance. (The 45 nm based 8 core POWER7 was generally faster than the 32 nm 10 core Westmere-EX.) There will always be a market for top performance but you are correct that sustaining on just that customer base is unwise.
IBM does realize that their software licensing model to subsidize hardware R&D was not sustainable. So while you can't run AIX, you can get a POWER8 box for less than $3k now.
OreoCookie - Wednesday, December 17, 2014 - link
Really, just $3000? Wow, how times have changed, I remember ~12 years ago that a single Alpha CPU cost that much (the department I was working for had a workstation fail, fortunately under warranty, because otherwise they would have had to pay for 2 new CPUs and new RAM worth about 15,000 German Marks).
Ratman6161 - Wednesday, December 17, 2014 - link
"The general lower cost of Linux and open source software" While it's true that the cost of a Linux OS including support is lower than an equivalent Windows OS, in the larger scheme of things the cost of Windows and even VMware becomes little more than background noise in the total cost of operations. Try pricing out an Oracle DB for example and you find that the cost of that software dwarfs the price of the hardware it's running on as well as whatever the OS is costing. Ditto with most "enterprise software".
lefty2 - Tuesday, December 16, 2014 - link
Intel has another big advantage over ARM, which everyone seems to have forgotten about, and that is software compatibility. 64-bit ARM server software is still a work in progress. The stuff that's being worked on at the moment is open source. Once that's finished you still have to convince clients to convert their proprietary software to ARM.
JohanAnandtech - Tuesday, December 16, 2014 - link
Don't you think that the open source software that has been/is being ported now is enough? Apache/PHP/MySQL, Memcached and Hadoop...that is a massive server market. And there is little stopping Microsoft from investing in ARM software too. Just VMware might be a bit tricky, but I don't think the software is a problem.
Kevin G - Wednesday, December 17, 2014 - link
Actually VMware has said some less than flattering things about ARM. Xen is the main hypervisor on ARM for the moment.
goop666666 - Thursday, December 25, 2014 - link
Yeah, recompiling is so very hard. Essentially what you're saying is that Intel is for legacy systems and software that is poorly written. That is a large enough market, but doesn't apply to hyperscale deployments, which are the future.
gostan - Tuesday, December 16, 2014 - link
great article by Johan as always.
but the argument is muted. we have heard this tune before.
the hardware might be cheaper. the power bill might be cheaper. wait until you see the software maintenance cost. custom software needs 'custom' pricing.
besides, arm has no cutting edge fab process to back them.
JohanAnandtech - Tuesday, December 16, 2014 - link
You do not need expensive software to create a server market these days. Just look how many webservers are running the LAMP stack.