Apple's Cyclone Microarchitecture Detailed
by Anand Lal Shimpi on March 31, 2014 2:10 AM EST
The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time I assumed Apple simply addressed some low hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had more knowledge of what was underneath the hood:
As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.
With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.
Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.
Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).
Apple Custom CPU Core Comparison
Specification | Apple A6 | Apple A7
CPU Codename | Swift | Cyclone
ARM ISA | ARMv7-A (32-bit) | ARMv8-A (32/64-bit)
Issue Width | 3 micro-ops | 6 micro-ops
Reorder Buffer Size | 45 micro-ops | 192 micro-ops
Branch Mispredict Penalty | 14 cycles | 16 cycles (14 - 19)
Integer ALUs | 2 | 4
Load/Store Units | 1 | 2
Load Latency | 3 cycles | 4 cycles
Branch Units | 1 | 2
Indirect Branch Units | 0 | 1
FP/NEON ALUs | ? | 3
L1 Cache | 32KB I$ + 32KB D$ | 64KB I$ + 64KB D$
L2 Cache | 1MB | 1MB
L3 Cache | - | 4MB
As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.
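The issue-width argument can be sketched as a simple bound: a repeating mix of independent operations can go no faster than the machine width allows, and no faster than the busiest class of execution unit allows. The Python sketch below is only an illustrative model, not a measurement. The unit counts come from the comparison table, except Swift's FP/NEON pipe count, which the table lists as unknown and is assumed to be 1 here. The function name and dictionary layout are invented for the example.

```python
from fractions import Fraction

# Per-core resources from the comparison table. Swift's FP/NEON pipe count is
# listed as "?" in the table, so the value 1 below is an assumption.
CYCLONE = {"width": 6, "units": {"int": 4, "fp": 3, "mem": 2, "branch": 2}}
SWIFT = {"width": 3, "units": {"int": 2, "fp": 1, "mem": 1, "branch": 1}}


def peak_ops_per_clock(core, mix):
    """Upper bound on sustained ops per clock for a repeating instruction mix.

    mix maps a unit class to the number of independent ops of that class
    in one loop iteration.
    """
    total = sum(mix.values())
    # Cycles per iteration are bounded by the machine width and by the
    # most heavily loaded class of execution unit.
    cycles = max(
        [Fraction(total, core["width"])]
        + [Fraction(n, core["units"][cls]) for cls, n in mix.items()]
    )
    return total / cycles


mix = {"int": 4, "fp": 2}  # four integer adds plus two FP adds
print(float(peak_ops_per_clock(CYCLONE, mix)))  # 6.0
print(float(peak_ops_per_clock(SWIFT, mix)))    # 3.0
```

This is an upper bound only. The measured Swift result was below 3 operations per clock, which the model cannot capture because it ignores restrictions on co-issuing to integer and FP pipes.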
I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192-entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.
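One rough way to read the reorder buffer numbers is to ask how many cycles each core could keep issuing at full width before its ROB fills, assuming nothing retires in the meantime (for example, because the oldest micro-op is a load waiting on memory). The sketch below is back-of-the-envelope arithmetic on the table's figures; the function name is made up for illustration.

```python
# Issue width and ROB size from the comparison table.
CORES = {
    "Swift": {"width": 3, "rob": 45},
    "Cyclone": {"width": 6, "rob": 192},
}


def cycles_to_fill_rob(width, rob):
    """Cycles of full-width issue before the ROB is full, assuming no
    micro-op retires during that time (a simplifying assumption)."""
    return rob / width


for name, core in CORES.items():
    print(name, cycles_to_fill_rob(core["width"], core["rob"]))
# Swift 15.0
# Cyclone 32.0
```

Under that assumption Cyclone can run ahead roughly twice as many cycles as Swift despite issuing twice as many micro-ops per cycle, which is one way to see why the ROB had to grow by more than 4x.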
On the execution side Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including 3 FP/NEON adds). The third FP/NEON pipe is used for div and sqrt operations, so the machine can only execute two FP/NEON muls in parallel.
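The FP/NEON arrangement described above can be written down as a small capability table. The exact mapping of operations to pipes below is an assumption. It was chosen only so that it matches the stated behavior: three adds per clock, two muls per clock, and div/sqrt on the third pipe. The names `FP_PIPES`, `peak_per_clock` and `can_co_issue` are hypothetical.

```python
from itertools import permutations

# Assumed pipe-to-operation mapping, consistent with the description above
# but not confirmed pipe by pipe.
FP_PIPES = [
    {"add", "mul"},
    {"add", "mul"},
    {"add", "div", "sqrt"},
]


def peak_per_clock(op):
    """Number of pipes that can start the given operation in one cycle."""
    return sum(op in pipe for pipe in FP_PIPES)


def can_co_issue(ops):
    """True if every op in the list can be placed on a distinct pipe."""
    if len(ops) > len(FP_PIPES):
        return False
    return any(
        all(op in pipe for op, pipe in zip(ops, pipes))
        for pipes in permutations(FP_PIPES, len(ops))
    )


for op in ("add", "mul", "div", "sqrt"):
    print(op, peak_per_clock(op))
# add 3
# mul 2
# div 1
# sqrt 1

print(can_co_issue(["mul", "mul", "add"]))  # True
print(can_co_issue(["mul", "mul", "mul"]))  # False
```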
I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:
Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor; it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.
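To illustrate the difference between statically partitioned buffers in front of each port and a unified scheduler, here is a toy Python comparison. The buffer sizes are hypothetical and are not the values from Apple's commits. The model also assumes that dispatch is in order and stalls at the first micro-op whose buffer is full, which is a simplification rather than a confirmed property of Cyclone.

```python
def dispatch_partitioned(uops, capacity):
    """Count micro-ops dispatched in order before one hits a full per-port buffer."""
    occupancy = {port: 0 for port in capacity}
    dispatched = 0
    for port in uops:
        if occupancy[port] == capacity[port]:
            break  # assumed: in-order dispatch stalls at the first full buffer
        occupancy[port] += 1
        dispatched += 1
    return dispatched


def dispatch_unified(uops, total_entries):
    """Count micro-ops dispatched when all ports share one pool of entries."""
    return min(len(uops), total_entries)


# Hypothetical buffer sizes, chosen only for illustration.
capacity = {"int": 4, "fp": 4, "mem": 4}
burst = ["int"] * 6 + ["fp"] * 2  # an integer-heavy burst of micro-ops

print(dispatch_partitioned(burst, capacity))            # 4
print(dispatch_unified(burst, sum(capacity.values())))  # 8
```

With the same total number of entries, the partitioned design stops on the integer-heavy burst while FP and memory entries sit empty. The usual trade-off is that smaller per-port buffers are simpler to build than one large shared scheduler.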
Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.
It wasn't until I wrote this piece that Apple's codenames started to make sense. Swift was quick, but Cyclone really does stir everything up. The earlier than expected introduction of a consumer 64-bit ARMv8 SoC caught pretty much everyone off guard (e.g. Qualcomm's shift to vanilla ARM cores for more of its product stack).
The real question is where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious example would be through frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air). Assuming Apple moves to a 20nm process later this year, it should be possible to gain some performance by raising clock speeds without a power penalty. I suspect Apple has more tricks up its sleeve than that however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).
Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.
182 Comments
techconc - Monday, March 31, 2014 - link
While the Mac Pro isn't the machine for me, I can still recognize the genius in the design. The integrated thermal core is an interesting design that allows for incredible cpu/gpu power without noisy fans, etc. As for expansion, with 6 Thunderbolt 2 ports, it's a matter of external expansion vs. internal. Likewise, the expansion exists whether you recognize it as such or not.
Kevin G - Monday, March 31, 2014 - link
Until Thunderbolt can officially accommodate a GPU without any bandwidth compromise like an internal expansion card, then the Mac Pro is crippled over the long term. High speed PCIe based IO is great for other use-cases (especially in laptops) but the GPU upgradability is important. That's the deal breaker for me.
The rest of the system isn't that impressive either. Only 4 memory slots offers less expandability than the previous generation Mac Pro. CPU upgradability is also hampered with the new Mac Pro (though it wasn't officially sanctioned in the previous Mac Pros, it was rather straightforward to do).
techconc - Tuesday, April 1, 2014 - link
Agree that GPU would be an exception for Thunderbolt, though I believe Thunderbolt is viable for every other expansion use case. The fixit / breakdowns of the product show that both the CPU and the GPU can be easily swapped out. The only real problem is that it doesn't take off the shelf parts. Perhaps that's an opportunity for third party vendors. This system was clearly designed for heavy use of OpenCL, etc. Final Cut Pro X demonstrates that well enough.
Again, it's not the machine for me either, but I do still see genius in the overall design. I suspect this type of design is the future. I'm not sure how much of a future these large towers really have. As an example, something like the Razer modular gaming design is another similar direction away from the conventional towers. It may take a few years, but this type of thing will likely replace the conventional towers we use today.
Kevin G - Tuesday, April 1, 2014 - link
Even though the CPU can be replaced with off the shelf parts, I'd call the procedure a bit more involved than simply 'easy'. The GPUs are a bit easier to remove but without any upgrade kits even announced, what good does that do? Nothing.
For OpenCL usage, what prevents an owner of the previous generation of Mac Pro from plugging in two GPUs to get the same effect? The only thing that prevents the old Mac Pro surpassing the 2013 model in OpenCL is that it only has two internal 6 pin PCIe power connectors. If you are willing to deal with an external power supply for the video cards, then the 2013 never surpassed the 2010 model's potential OpenCL capabilities. This will be particularly true when the next generation of AMD and nVidia GPUs hit the market later this year.
Thunderbolt can be used for bulk storage but with NVMe drives around the corner, it won't be the fastest solution. Native PCIe based storage is going to be damn fast at the high end which the old Mac Pro can use to their full potential.
I do like the engineering of the 2013 Mac Pro but it feels more like it'd be the hypothetical machine to sit between the Mac Mini and a new Mac Pro tower. In this regard, the only thing I'd change is have the video cards use MXM style video cards instead of the proprietary ones. This way there would be some potential avenue for upgrades for the warranty be damned DIY crowd and it'd re-use parts from the iMac.
robinthakur - Thursday, April 3, 2014 - link
"for the warranty be damned DIY crowd" Does this crowd routinely spend £7000 on a Mac Pro? Would you want to carry out an unsanctioned hardware modification on such an expensive device which voids the support warranty? This is a professional workstation intended for companies and money-no-object prosumers to buy, they aren't going to be overclocking the CPU cores and buying aftermarket GPU parts on ebay...
Penti - Monday, March 31, 2014 - link
Once they go for the desktop they have to compete with Intel on performance and fabrication; they exited that fight (that they waged through IBM and Freescale) when they switched to Intel CPUs. While they have a really big core, it takes a lot more effort to have the same type of chip around 4GHz rather than 1.3-1.4GHz; it's really too hard to even compete in the notebook space where we have Intel's chips with a base freq of 1.7GHz but turbo of 3.3GHz. They don't really want to invest in fabs to compete with Intel. What Intel has shown as the ARM chips get more advanced is that there isn't really any penalty to have x86 everywhere. If they go back to ARM they need to invest in compilers and tech for ARM as well as x86; now they have basically chosen x86 for everything and don't need to support an ecosystem for other platforms with the same kind of resources.
JDG1980 - Monday, March 31, 2014 - link
x86 will own the professional desktop market forever due to legacy lock-in. Almost every sizable business has some program as a part of their workflow that can only run on (x86) Windows, for which the source code is unavailable.
ARM already beat out x86 for the low-end, non-technical user. These users have, by and large, already switched to iPads or their Galaxy equivalents.
It's the server market where a lot of competition will take place in the future. Architecture isn't quite as important with an open-source LAMP stack, or with Java (though there will still be some servers that need closed-source x86 binary stuff). But Intel will have to keep on its toes in terms of price/performance/power to stay on top.
coder543 - Monday, March 31, 2014 - link
This simply isn't true. If the company has lost the source code (that's just bad.) then the program's system requirements aren't going to be getting any heavier. There have been demonstrations of x86-to-ARM binary translators that only have like a 40% to 60% performance degradation. ARM chips will get powerful enough to do that kind of binary translation on the handful of legacy programs a company might need to support.
dmunsie - Monday, March 31, 2014 - link
More likely than the company having lost the code is that they are using a commercial package that is no longer maintained/updated by the original vendor. But that said, binary translation would work just fine in that case. In fact, we've even seen it work in the past with older Mac software during each of the two major transitions they've done. There were a few older 68k apps that had stopped working on more recent 68k CPUs that started working again on the PowerPC with Apple's emulator.
Kevin G - Monday, March 31, 2014 - link
Apple isn't too far behind Intel in terms of IPC. Apple is floating in the Westmere range in terms of IPC. Impressive for a design that is used in a phone. The catch with the A7 is its low clock speed and lack of Turbo. A good portion of Haswell's performance is the high turbo frequencies that it can hit under a single-threaded load.
The real threat to Intel is if Apple decided to move upward into the laptop or even desktop space. They seemingly have the talent to produce a design that matches the best of what Intel can offer. And Apple isn't afraid of a platform transition in the desktop space having done so twice in the past. (68k -> PPC and PPC -> x86)