ARM Challenging Intel in the Server Market: An Overview
by Johan De Gelas on December 16, 2014 10:00 AM EST
Cavium Thunder-X
A few months ago, we talked briefly with the people at Cavium. Cavium specializes in designing MIPS SoCs that enable intelligent networking, communications, storage, video, and security applications. The picture below sums it all up: present and future.
Cavium's "Thunder Project" started from Cavium's existing Octeon III network SoC, the CN78xx. Cavium's bread and butter has been integrating high-speed network capabilities in SoCs, so you will be able to choose between SoCs that have 100Gbit Ethernet and 10Gbit Ethernet. PCI-Express root complexes and multiple SATA ports are all integrated. There is no doubt that Cavium can design a highly integrated, feature-rich SoC, but what about the processing core?
The MIPS cores inside the Octeon are much simpler – dual-issue in-order – but also much smaller and need very little power compared to a typical server core. Four (28nm) MIPS cores can fit in the space of one (32nm) Sandy Bridge core.
Replace the MIPS decoders with ARMv8 decoders and you are almost there. However, while the Cavium Thunder-X is definitely not made to run SAP, server workloads are a bit more demanding than network processing, so Cavium needed to beef up the Octeon cores. The new Thunder-X cores are still dual-issue, but they're now out-of-order instead of in-order, and the pipeline length has been increased from eight to nine stages to allow for higher clocks. Each core has a 78KB L1 instruction cache and a 32KB data cache.
The 37-way 78KB L1 I cache is certainly odd, but it might be more than just "network processor heritage". Our own testing and a few academic studies have shown that scale-out workloads such as memcached have a higher than normal (meaning the typical SPECIntRate2006 characterization) I-cache miss rate. The reason is that these applications run a lot of kernel code, and more specifically the code of the network stack. As a result, the software footprint is much higher than expected.
Another reason why we believe Cavium has done its homework is the fact that more die area is spent on cores (up to 48) than on large caches; an L3 cache is nowhere to be found. The Thunder-X has only one centralized, relatively low-latency 16MB L2 cache running at full core speed. A lot of academic studies have confirmed that a large L3 cache is a waste of transistors for scale-out workloads. Besides the most used instructions that reside in the I-cache, there is a huge amount of less frequently used kernel code that does not fit in an L3 cache. In other words, an L3 cache just adds more latency to requests that missed the L1 cache and that will end up in the DRAM anyway. That is also the reason why Cavium made sure that a beefy memory controller is available: the Thunder-X comes with four DDR3/4 72-bit memory controllers and it currently supports the fastest DRAM available for servers: DDR4-2133.
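To put "beefy" in perspective, the theoretical peak bandwidth of four DDR4-2133 channels can be worked out in a few lines. This is a back-of-the-envelope sketch, not a measured figure: it assumes that 64 of the 72 bits per controller carry data (the remaining 8 being ECC), and the function and constant names are our own.

```python
# Theoretical peak DRAM bandwidth per socket (back-of-the-envelope sketch).
CHANNELS = 4                 # four memory controllers, as stated in the article
TRANSFERS_PER_SEC = 2133e6   # DDR4-2133: 2133 million transfers per second
DATA_BITS = 64               # ASSUMPTION: 64 of the 72 bits are data, 8 are ECC


def peak_bandwidth_gbs(channels=CHANNELS, transfers=TRANSFERS_PER_SEC,
                       data_bits=DATA_BITS):
    """Return theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = data_bits / 8
    return channels * transfers * bytes_per_transfer / 1e9


per_channel = peak_bandwidth_gbs(channels=1)
per_socket = peak_bandwidth_gbs()

print(f"per channel: {per_channel:.1f} GB/s")
print(f"per socket:  {per_socket:.1f} GB/s")
```

Sustained bandwidth in real workloads will be lower than this ceiling; the number is only useful for comparing platforms on paper.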
On the flip side, having 48 cores with a relatively small 32KB D-cache that access one centralized 16MB L2 cache also means that the Thunder-X is less suited for some "traditional" server workloads such as SQL databases. So a Thunder-X core is simpler and probably quite a bit weaker than an ARM Cortex-A57 in some ways, let alone an X-Gene core. The fact that the Thunder-X spends far fewer transistors on cache than on cores clearly indicates that it is targeting other workloads. Single-threaded performance is likely to be lower than that of the AMD Seattle and X-Gene, but it could be close enough: the Thunder-X will run at 2.5GHz, courtesy of Global Foundries' 28nm process technology. Cavium is claiming that even the top SKU will keep the TDP below 100W.
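A quick calculation shows how thin the cache budget per core is on the 48-core part. The sketch below simply divides the shared L2 evenly across cores, which is a simplification: a shared cache is not partitioned this way, and a single busy core can use far more than its "share". The variable names are ours.

```python
# Rough per-core cache budget on the 48-core Thunder-X (illustrative only).
CORES = 48
L2_KB = 16 * 1024   # one shared 16MB L2, expressed in KB
L1D_KB = 32         # private L1 data cache per core, in KB

# ASSUMPTION: an even split of the shared L2, purely for comparison purposes.
l2_share_kb = L2_KB / CORES
data_cache_per_core_kb = L1D_KB + l2_share_kb

print(f"L2 share per core:   {l2_share_kb:.1f} KB")
print(f"L1D + L2 share/core: {data_cache_per_core_kb:.1f} KB")
```

A few hundred kilobytes per core is fine for scale-out workloads with small per-request working sets, but it is little for a database that wants to keep large indexes close to the core.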
There is more. The Thunder-X uses the proprietary Cavium Coherent Processor Interconnect (CCPI) and can thus work in a dual socket NUMA configuration. As a result, a Thunder-X based server can have up to 96 cores and is capable of supporting 1TB of memory, 512GB per socket. Multiple 10/40GbE, PCIe Root Complex, and SATA controllers are integrated in the SoC. Depending on SKU, TCP/IP Sec offload and SSL accelerators are also integrated.
The recent launch of Cavium's Thunder-X SKUs makes it clear that Cavium is trying to compete with the venerable Xeon E5 in some niche but large markets:
- ThunderX_CP: For cloud compute workloads such as public and private clouds, web caching, web serving, search, and social media data analytics.
- ThunderX_ST: For cloud storage, big data, and distributed databases.
- ThunderX_NT: For telecom/NFV server and embedded networking applications.
- ThunderX_SC: For secure computing applications.
Considering Cavium's background and expertise, it is pretty obvious that ThunderX_NT and SC should be very capable challengers to the Xeon E5 (and Xeon-D), but only a thorough review will tell how well the ThunderX_CP will do. One of the strongest points of Calxeda was the highly integrated fabric that lowered the total power consumption and network latency of such a server cluster. Just like AMD/Seamicro, Cavium is well positioned to make sure that the Thunder-X based server clusters also have this high level of network/compute integration.
78 Comments
jjj - Tuesday, December 16, 2014 - link
If you look at phones and tabs, we might be getting some rather big custom cores in 2015 and 2016. Apple and Nvidia already have that, ofc much smaller than Intel's core when adjusting for process (actually that's an assumption when it comes to Denver since don't think we've seen any die shots). Intel at the same time in consumer is pushing for more non-CPU/GPU compute units and low power and they might face a tough question about core size and even process (if they target low clocks, low power, or the opposite). Got to wonder if at some point they'll have to go for a big core just for server. Would make things even more interesting.
Might not matter but Apple kinda has the perf for an ARM Macbook Air if they go quad. Not something worth doing for such low volume but doable when they go quad on all ipads or sooner if they launch a bigger ipad. Could be a trigger for others pushing more ARM based Chromebooks and beyond. That would set the stage for even bigger ARM cores.
Also got the feeling Nintendo will go ARM in 2016 and not many reasons for Sony and M$ not to go that way if they ever make a new gen- just another market for bigger ARM cores, any significant revenue helps with dev costs so it matters.
CajunArson - Tuesday, December 16, 2014 - link
1. The Core-m is widely derided as not being fast enough for the MacBook Air.
2. The Core-m is easily twice as fast as the A8X in benchmarks that count... even Anandtech's own benchmarks show that. Furthermore, when you step away from web browsers and get to use the advanced features of the Core-m like AVX, that advantage jumps to about 8x faster in compute-heavy benchmarks like Linpack.
3. Even the mythical A9 coming in 2015 is expected to have roughly a 20% performance boost over the A8x.
4. Any real computer using an ARM chip would have to have a translation layer just like the old Rosetta to run the huge library of x86 software out there. Rosetta sort of worked because the Core 2 chips from Intel were *massively* faster than the PowerPC parts they replaced. Now you expect to run the translation overhead on an A9 chip that is slower -- by a large margin -- than the Core-m parts you've already derided as not being good enough?
Yeah, I'm not holding my breath.
fjdulles - Tuesday, December 16, 2014 - link
You may be right, but remember that ARM chips using the same power budget as Intel core i* will no doubt be clocked higher and perform that much better. Not sure if that will be competitive but it would be interesting to see.
wallysb01 - Tuesday, December 16, 2014 - link
Only if you want a glorified tablet as a laptop. The software most people use in real work on laptops/desktops is not going to be ported over to ARM at any speed, even if ARMs could do that work reasonably well.
Kevin G - Wednesday, December 17, 2014 - link
I'm under the impression that a good chunk has already been ported. MS Office for example is native ARM on Windows RT. Various Linux distributions have ARM ports completed with ARM based office and desktop software. The main thing missing are some big commercial applications like Photoshop etc.
The server side of things is similar with Linux and open software ports. MS is weirdly absent but I suspect that an ARM based version of Windows 2012/2014 is waiting on major hardware to be released. Much of the Windows base is already ported over to ARM due to Windows RT.
Kevin G - Wednesday, December 17, 2014 - link
Indeed. Performance of ARM platforms once power constraints have been removed is a very open question. So far all the core designs in products have been used in mobile where SoC power consumption is less than 5 W. What a 100 W product would look like is an open and very interesting question.
Ratman6161 - Wednesday, December 17, 2014 - link
If they "use the same power budget as an Intel core i*" then what would be the point?
jjj - Tuesday, December 16, 2014 - link
Ok you are focusing on the wrong thing but let's do that anyway.
I have never claimed that Apple's own SoC would beat Intel's current SoCs, just that the perf would be enough if they go quad and obviously higher clocks.
When you talk Core M you should remember that the price at launch was $281 so it's not good enough for anything.
Anyway how about you compare a possible Apple SoC with a MacBook Air from 2011, let's face it the Air is a crap machine anyway, not much perf and TN panel for w/e ridiculous price it costs now and its users are certainly not doing any heavy lifting with it.
At the same time Apple's own 15- 20$ SoC would allow them a much cheaper machine and a presence in a price segment they never competed in, adding at least 5B of revenue per year (including cannibalization) and a share gain in PC of 2-3%.
But then again the point was that there are a bunch of trends that could favor bigger ARM cores.
Morawka - Wednesday, December 17, 2014 - link
it might cost them $20 for the A8X in fab cost, but the R&D for that chip is in the 10's of millions. Factor that in, to however many they ship, and it adds at least another $20 per chip
jospoortvliet - Wednesday, December 17, 2014 - link
Even more obvious then that this would save them money by spreading out the fixed costs over more devices...