ARM Challenging Intel in the Server Market: An Overview
by Johan De Gelas on December 16, 2014 10:00 AM EST
Cavium Thunder-X
A few months ago, we talked briefly with the people of Cavium. Cavium specializes in designing MIPS SoCs that enable intelligent networking, communications, storage, video, and security applications. The picture below sums it all up: present and future.
Cavium's "Thunder Project" started from Cavium's existing Octeon III network SoC, the CN78xx. Cavium's bread and butter has been integrating high speed network capabilities in SoCs, so you will be able to choose between SoCs that have 100 Gbit Ethernet and 10 Gbit Ethernet. PCI-Express roots and multiple SATA ports are all integrated. There is no doubt that Cavium can design a highly integrated feature-rich SoC, but what about the processing core?
The MIPS cores inside the Octeon are much simpler – dual-issue in-order – but also much smaller and need very little power compared to a typical server core. Four (28nm) MIPS cores can fit in the space of one (32nm) Sandy Bridge core.
Replace the MIPS decoders with ARMv8 decoders and you are almost there. However, while the Cavium Thunder-X is definitely not made to run SAP, server workloads are a bit more demanding than network processing, so Cavium needed to beef up the Octeon cores. The new Thunder-X cores are still dual-issue, but they're now out-of-order instead of in-order, and the pipeline length has been increased from eight to nine stages to allow for higher clocks. Each core has a 78KB L1 Instruction cache and a 32KB data cache.
The 37-way 78KB L1 I cache is certainly odd, but it might be more than just "network processor heritage". Our own testing and a few academic studies have shown that scale-out workloads such as memcached have a higher than normal (meaning the typical SPECIntRate2006 characterization) I-cache miss rate. The reason is that these applications run a lot of kernel code, and more specifically the code of the network stack. As a result, the software footprint is much higher than expected.
Another reason why we believe Cavium has done its homework is the fact that more die area is spent on cores (up to 48) than on large caches; an L3 cache is nowhere to be found. The Thunder-X has only one centralized, relatively low latency 16MB L2 cache running at full core speed. A lot of academic studies have confirmed that a large L3 cache is a waste of transistors for scale-out workloads. Besides the most used instructions that reside in the I-cache, there is a huge amount of less frequently used kernel code that does not fit in an L3 cache. In other words, an L3 cache just adds more latency to requests that missed the L1 cache and that will end up in the DRAM anyway. That is also the reason why Cavium made sure that a beefy memory controller is available: the Thunder-X comes with four DDR3/4 72-bit memory controllers and it currently supports the fastest DRAM available for servers: DDR4-2133.
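To put the "beefy memory controller" claim into numbers, here is a back-of-the-envelope sketch of the peak theoretical DRAM bandwidth per socket. It assumes that each of the four controllers drives one channel and that each 72-bit channel carries 64 data bits plus 8 ECC bits, as is usual for ECC memory; neither detail is confirmed by Cavium in the text above. Sustained bandwidth in practice will be lower than this peak.

```python
# Peak theoretical DRAM bandwidth per Thunder-X socket (rough estimate).
# Assumptions (not confirmed by the article): one channel per controller,
# and 64 of the 72 bits per channel are data, the other 8 are ECC.

TRANSFERS_PER_SECOND = 2133e6   # DDR4-2133: 2133 million transfers per second
DATA_BITS_PER_CHANNEL = 64      # 72-bit channel minus 8 ECC bits (assumed)
CHANNELS_PER_SOCKET = 4         # four memory controllers per SoC
CORES_PER_SOCKET = 48           # top core count mentioned in the article


def peak_bandwidth_gb_s(transfers_per_s, data_bits, channels):
    """Return peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = data_bits / 8
    return transfers_per_s * bytes_per_transfer * channels / 1e9


per_socket = peak_bandwidth_gb_s(
    TRANSFERS_PER_SECOND, DATA_BITS_PER_CHANNEL, CHANNELS_PER_SOCKET
)
per_core = per_socket / CORES_PER_SOCKET

print(f"Peak per socket: {per_socket:.1f} GB/s")            # 68.3 GB/s
print(f"Peak per core (48 cores): {per_core:.2f} GB/s")     # 1.42 GB/s
```

Under these assumptions, the 48 cores share roughly 68 GB/s of peak bandwidth, or about 1.4 GB/s each when all of them are busy.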
On the flip side, having 48 cores with a relatively small 32KB D-cache that access one centralized 16MB L2 cache also means that the Thunder-X is less suited for some "traditional" server workloads such as SQL databases. So a Thunder-X core is simpler and probably quite a bit weaker than an ARM Cortex-A57 in some ways, let alone an X-Gene core. The fact that the Thunder-X spends far fewer transistors on cache than on cores clearly indicates that it is targeting other workloads. Single-threaded performance is likely to be lower than that of the AMD Seattle and X-Gene, but it could be close enough: the Thunder-X will run at 2.5GHz, courtesy of GlobalFoundries' 28nm process technology. Cavium is claiming that even the top SKU will keep the TDP below 100W.
There is more. The Thunder-X uses the proprietary Cavium Coherent Processor Interconnect (CCPI) and can thus work in a dual socket NUMA configuration. As a result, a Thunder-X based server can have up to 96 cores and is capable of supporting 1TB of memory, 512GB per socket. Multiple 10/40GbE, PCIe Root Complex, and SATA controllers are integrated in the SoC. Depending on SKU, TCP/IP Sec offload and SSL accelerators are also integrated.
The recent launch of Cavium's Thunder-X SKUs makes it clear that Cavium is trying to compete with the venerable Xeon E5 in some niche but large markets:
- ThunderX_CP: For cloud compute workloads such as public and private clouds, web caching, web serving, search, and social media data analytics.
- ThunderX_ST: For cloud storage, big data, and distributed databases.
- ThunderX_NT: For telecom/NFV server and embedded networking applications.
- ThunderX_SC: For secure computing applications.
Considering Cavium's background and expertise, it is pretty obvious that ThunderX_NT and SC should be very capable challengers to the Xeon E5 (and Xeon-D), but only a thorough review will tell how well the ThunderX_CP will do. One of the strongest points of Calxeda was the highly integrated fabric that lowered the total power consumption and network latency of such a server cluster. Just like AMD/Seamicro, Cavium is well positioned to make sure that the Thunder-X based server clusters also have this high level of network/compute integration.
78 Comments
beginner99 - Tuesday, December 16, 2014 - link
Agree. I just don't see it. What wasn't mentioned, or I might have missed, is Intel's turbo technology. Does ARM have anything similar? Single-threaded performance matters. If a website takes double the time to be built by the server, the user can notice this. And given the complexity of modern web sites this is IMHO a real issue. Latency or "service time" is greatly affected by single-threaded performance. That's why virtualization is great. Put tons of low-usage stuff on the same physical server and yet each request profits from the single-threaded performance.
Now these ARM guys are targeting this high single-threaded performance, but why would any company change? The whole software stack would have to change as well, and don't forget the software usually costs way, way more than the hardware it runs on. So if you save 10% on the SoC you maybe save less than 1% on the total BOM including software. They can't win on price, and on performance/watt Intel still has the best process. So no, I don't see it except for niche markets like these MIPS SoCs from Cavium.
Ratman6161 - Wednesday, December 17, 2014 - link
"Xeon performance at ridiculous prices" I just don't get the "ridiculous prices" comment. To me, it seems like hardware these days is so cheap they are practically giving it away. I remember in the days of NT 4.0 Servers we paid $40K each for dual socket Dell systems with 16 GB RAM.
A few years later we were doing Windows 2000 Server on Dell 2850's that were less than half the price.
Then in 2007 we went the VMWare route on Dell 2950's where the price actually went up to $23K but we were getting dual sockets/8 cores and 32GB of RAM so they made the $40K servers we bought years before look like toys.
Four years later we got R-710's that were dual socket/12 cores and 64GB of RAM and made the $23K 2950's look like clunkers, but the price was once again almost half at about $12K.
Today we are looking at replacing the R-710's with the latest generation which will be even more cores and more RAM for about the same price.
So to me, the prices don't seem ridiculous at all. The servers themselves now make up only a fraction of our hardware costs with the expensive items being SAN storage. But that too is a lot cheaper. We are looking at going from our two SANs with 4GB fiber channel connections to a single SAN with 10GB Ethernet and more storage than the two old units combined...but still costing less than the old SANs did for just one. So prices there are expensive but less than half of what we paid in 2007 for more storage.
The real costs in the environment are in software licensing, and I'm not talking about Microsoft or even VMware. Licensing those products is chump change compared to the Enterprise Software crooks...that's where the real costs are. The infrastructure of servers, storage and "plumbing" sorts of software like Windows Server and VMware are cheap in comparison.
mrdude - Tuesday, December 16, 2014 - link
Great article, Johan
I think the last page really describes why so many people, myself included, feel that ARM servers/vendors have a very good chance of entrenching themselves in the market. Server workloads are more complex and varied today than they ever have been in the past and it isn't high volume either: the Facebook example is a good one. These companies buy hardware by the truckload and can benefit immensely from customization that Intel may not have on offer.
To add to that, what wasn't mentioned is that ARM, due to its 'license everything' business model, provides these same companies the opportunity to buy ready-made bits of uArch and, with a significantly smaller investment, build their own as-close-to-ideal SoC/CPU/co-processor that they need.
Competition is a great thing for everyone.
JohanAnandtech - Tuesday, December 16, 2014 - link
True. Although it seems that only AMD really went for the "license almost everything" model of ARM.
mrdude - Tuesday, December 16, 2014 - link
Yep. And that's likely due to the budget/timing constraints. I think they were gunning for the 'first to market' branding but they couldn't meet their own timelines. Something of a trend with that company. I'm curious as to why we haven't heard a peep from AMD or partners regarding performance or perf-per-watt. Iirc, we were supposed to see Seattle boards in Q3 of 2014.
I also feel like ARM isn't going to stop at the interconnect. There's still quite a bit of opportunity for them to expand in this market.
cjs150 - Tuesday, December 16, 2014 - link
Ultimately, my interest in servers is limited but I would like a simple home server that would tie together all my computers, NAS, tablets and the other bits and bobs that a geek household has.
witeken - Tuesday, December 16, 2014 - link
Anyone interested in Intel's data center strategy can watch Diane Bryant's recent presentation (including PDF): http://intelstudios.edgesuite.net/im/2014/live_im.... The Q&A from 2013 also has some comments about ARM servers: http://intelstudios.edgesuite.net/im/2013/live_im....
Kevin G - Tuesday, December 16, 2014 - link
"Now combine this with the fact that Windows on Alpha was available." - Except that Windows NT was available for Alpha. There was a beta for Windows 2000 in both 32 bit and 64 bit flavors for the curious.
I disagree with the reason why Intel beat the RISC players. Two of the big players were defeated by corporate politics: Alpha and PA-RISC were under the control of HP, who was planning to migrate to Itanium. That leaves POWER, SPARC, MIPS and Intel's own Itanium architecture at the turn of the millennium. Of those, POWER and SPARC are still around as they continue to execute. So the only two victims that can be claimed by better execution are MIPS and Intel's own Itanium.
While IBM and Oracle are still executing on hardware, the Unix market as a whole has decreased in size. The software side isn't as strong as it used to be. Linux has risen and proven itself to be a strong competitor to the traditional Unix distribution. Open source software has emerged to fill many of the roles Unix platforms were used for. Furthermore, many of these applications like Hadoop and Cassandra are designed to be clustered and tolerate node failures. No need to spend extra money on big iron hardware if the software doesn't need that level of RAS for uptime. The general lower cost of Linux and open source software (though they're not free due to the need for support) combined with further tightening of budgets during the great recession has made many businesses reconsider their Unix platforms.
JohanAnandtech - Tuesday, December 16, 2014 - link
My main argument was that the RISC market was fragmented, and not comparable to what the x86 market is now (Intel dominating with a very large software base).
While I agree with many of your points, you cannot say that SPARC is not a victim. In the '90s, Sun had a very broad product range from entry-level workstation to high-end server. The same is true for the Power CPUs.
Kevin G - Wednesday, December 17, 2014 - link
The RISC market was fragmented on both hardware and software. The greatest example of this would be HP, which had HPUX, Tru64, OpenVMS, and Nonstop as operating systems and tried to get them all migrated to a common hardware platform: Itanium. How each platform handled backwards compatibility with their RISC roots was different (and Tru64 was killed in favor of HPUX).
The midrange RISC workstation suffered the same fate as the dual socket x86 workstation market: good enough hardware and software existed for less. The race to 1GHz between Intel and AMD cut out the performance advantage RISC platforms carried. Not to say that the RISC chips didn't improve performance, but vendors never took steps to improve their price. Windows 2000 and the rise of Linux early in the 2000's gave x86 a software price advantage too while having good enough reliability.
Sun's hardware business did suffer some horrible delays which helped lead the company into Oracle's acquisition. Notable was the Rock chip, which featured out-of-order execution but also out-of-order instruction retirement. Sun was never able to validate any prototype silicon and ship it to customers.