Apple's Cyclone Microarchitecture Detailed
by Anand Lal Shimpi on March 31, 2014 2:10 AM EST
The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time I assumed Apple simply addressed some low hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had more knowledge of what was underneath the hood:
As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.
With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.
Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.
Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).
Apple Custom CPU Core Comparison

| | Apple A6 | Apple A7 |
|---|---|---|
| ARM ISA | ARMv7-A (32-bit) | ARMv8-A (32/64-bit) |
| Issue Width | 3 micro-ops | 6 micro-ops |
| Reorder Buffer Size | 45 micro-ops | 192 micro-ops |
| Branch Mispredict Penalty | 14 cycles | 16 cycles (14 - 19) |
| Load Latency | 3 cycles | 4 cycles |
| Indirect Branch Units | 0 | 1 |
| L1 Cache | 32KB I$ + 32KB D$ | 64KB I$ + 64KB D$ |
As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.
I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192 entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.
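For a rough sense of what the table figures imply, the back-of-envelope sketch below divides the reorder buffer size by the issue width to estimate how many cycles of work a full window holds at peak issue rate, and multiplies the typical mispredict penalty by the issue width to estimate the issue slots lost per mispredict. The function names and the model itself are illustrative assumptions (real cores rarely sustain peak issue), not anything taken from Apple's LLVM commits.

```python
# Figures taken from the comparison table above (Swift = A6, Cyclone = A7).
CORES = {
    "Swift":   {"issue_width": 3, "rob_entries": 45,  "mispredict_cycles": 14},
    "Cyclone": {"issue_width": 6, "rob_entries": 192, "mispredict_cycles": 16},
}

def window_cycles(core):
    """Cycles of work a full reorder buffer holds at peak issue rate."""
    return core["rob_entries"] / core["issue_width"]

def slots_lost_per_mispredict(core):
    """Issue slots left unused while the pipeline refills after a mispredict."""
    return core["mispredict_cycles"] * core["issue_width"]

for name, core in CORES.items():
    print(f"{name}: window holds ~{window_cycles(core):.0f} cycles of work, "
          f"~{slots_lost_per_mispredict(core)} issue slots lost per mispredict")
```

Under these assumptions Cyclone's window holds roughly twice as many cycles of work as Swift's (32 versus 15), while each mispredict costs more than twice as many issue slots (96 versus 42).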
On the execution side Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including 3 FP/NEON adds). The third FP/NEON pipe is used for div and sqrt operations; the machine can only execute two FP/NEON muls in parallel.
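The per-class limits quoted so far can be turned into a simple throughput bound. The sketch below is a minimal model, not a simulator: it assumes every operation in a repeating mix is independent, treats each operation class as having its own pipes (ignoring that FP adds and muls share the same FP/NEON pipes), and caps the result at the 6-wide issue limit. The names `CYCLONE_PORTS` and `peak_ops_per_cycle` are hypothetical.

```python
# Cyclone figures quoted in the article; pipe sharing between classes is ignored.
CYCLONE_ISSUE_WIDTH = 6
CYCLONE_PORTS = {
    "int_alu": 4,     # up to four integer adds per clock
    "load_store": 2,  # up to two loads or stores per clock
    "fp_add": 3,      # up to three FP/NEON adds per clock
    "fp_mul": 2,      # only two FP/NEON muls per clock
}

def peak_ops_per_cycle(mix, ports=CYCLONE_PORTS, width=CYCLONE_ISSUE_WIDTH):
    """Upper bound on sustained ops/cycle for a repeating mix of independent ops.

    mix maps an op class to how many ops of that class appear per loop iteration.
    """
    total = sum(mix.values())
    if total == 0:
        return 0.0
    # Cycles per iteration are bounded by the issue width and by the busiest class.
    cycles = max([total / width] + [n / ports[cls] for cls, n in mix.items()])
    return total / cycles

print(peak_ops_per_cycle({"int_alu": 4, "fp_add": 2}))      # the mix used to verify issue width
print(peak_ops_per_cycle({"fp_mul": 4}))                    # limited by the two mul-capable pipes
print(peak_ops_per_cycle({"int_alu": 3, "load_store": 3}))  # limited by the two load/store units
```

With these inputs the model gives 6.0, 2.0 and 4.0 ops per cycle respectively: the first mix saturates the issue width, while the other two are held back by a single class of execution unit.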
I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:
Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor; it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.
Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.
It wasn't until I wrote this piece that Apple's codenames started to make sense. Swift was quick, but Cyclone really does stir everything up. The earlier than expected introduction of a consumer 64-bit ARMv8 SoC caught pretty much everyone off guard (e.g. Qualcomm's shift to vanilla ARM cores for more of its product stack).
The real question is where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious example would be through frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air). Assuming Apple moves to a 20nm process later this year, it should be possible to gain some performance by raising clock speeds without a power penalty. I suspect Apple has more tricks up its sleeve than that, however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).
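To put the frequency argument in numbers, the sketch below computes theoretical peak micro-op throughput as issue width times clock, using the shipping clocks quoted above. It is only an upper bound under the assumption that the core issues 6 micro-ops every cycle; real workloads fall short of it and do not scale linearly with clock. The helper names are hypothetical.

```python
CYCLONE_ISSUE_WIDTH = 6  # micro-ops per clock, per the LLVM commits

# Shipping clocks quoted in the article, in MHz.
SHIPPING_CLOCKS_MHZ = {
    "iPhone 5s / iPad mini with Retina Display": 1300,
    "iPad Air": 1400,
}

def peak_mops(clock_mhz, width=CYCLONE_ISSUE_WIDTH):
    """Theoretical peak in millions of micro-ops per second at a given clock."""
    return width * clock_mhz

def uplift_percent(old_mhz, new_mhz):
    """Peak-throughput gain from a clock bump at unchanged issue width."""
    return (new_mhz - old_mhz) / old_mhz * 100

for device, mhz in SHIPPING_CLOCKS_MHZ.items():
    print(f"{device}: {peak_mops(mhz)} M micro-ops/s peak")

print(f"1300 -> 1400 MHz: {uplift_percent(1300, 1400):.1f}% higher peak")
```

Because width is fixed, any clock increase maps one-to-one onto this peak figure, which is why a process shrink that allows higher clocks at the same power is the obvious first lever.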
Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.
Khato - Monday, March 31, 2014
Well, I'd disagree that Apple has designed a 'better' CPU than their competition. They designed it for a different target, but it's nothing special in the overall picture. The only point of surprise with respect to their CPU design is the pace of iterations thus far, though it's still unknown if such is just the result of a 'backlog' or if they'll actually maintain a one year cadence for major architecture changes.
techconc - Tuesday, April 1, 2014
I think this article makes it clear that Apple set the bar higher than others with the current round. Sure, other vendors can put together 8 core chips of lesser complexity that will be faster under some MP workloads. I'm not sure that makes for a "better" CPU either. To your point, it will be interesting to see if Apple can continue at the pace they have in the past. Logically, if history has demonstrated they can, then we should have no reason to doubt they will in the future. Given the percentage of revenue that comes from iOS based devices, coupled with the vast resources Apple has, one would expect that they will continue this pace. Also, given the differences between the A6 and the A7, it would seem that Apple has multiple hardware teams working concurrently on chip designs. That of course is just speculation on my part though.
Khato - Wednesday, April 2, 2014
Ahhhh, but history thus far hasn't really demonstrated such. For all we know cyclone could have started development back when Apple acquired PA semi 6 years ago and been the primary focus of the core design team with swift being a simpler version that was supposed to be out the door much earlier than it was. I don't really expect that such is the case, but simply having two major core iterations in a row doesn't tell anything about the future pattern.
Regardless it'll be quite interesting to see where all the other players fall in terms of performance. We can make a reasonable estimate for Intel and vanilla ARM cores, but NVIDIA and Qualcomm? No idea if they'll follow Apple's path of high IPC with low frequencies to keep power in check or continue with the status quo. Both are perfectly viable and can end up with the same efficiency.
Kevin G - Monday, March 31, 2014
The iPad replacing the low end laptop/desktops is more of a software issue than hardware at this point. On that note, for wide spread adoption into the corporate world Apple needs to reinvest into the server software market so that companies can better manage large iPad deployments remotely.
Doormat - Monday, March 31, 2014
I've always assumed that the goal is that one day, your phone or tablet wirelessly tethers with your monitor, keyboard and mouse and runs the regular, desktop version of OSX. You don't need a separate computer unless you're a gamer or professional who needs CAD or 3D or something intense.
The A7 is the first chip that seems to fit in with that realization.
PeteH - Tuesday, April 1, 2014
I've been under the same assumption, but I just thought of one possible side effect. If your phone is your only real computing device and you carry your phone with you everywhere, does that mean cloud storage becomes unnecessary?
Kidster3001 - Monday, April 7, 2014
This is simply the evolution of computing. The idea has been around forever. It's just a matter of who can market it AND make money that will get the credit for "creating it".
errorr - Monday, March 31, 2014
The Qualcomm in US Samsung phones is about the millions it costs to get approved by the big carriers who won't let a phone attach unless it has gone through strict validation. It can cost a lot of money especially considering the $25-$50 patent license penalty added on the 3G bands.
I think we will start to see Samsung chips in 2015 when they are LTE only and the carriers have VoLTE up and running which will start later this year. If they only need to validate the LTE they won't need to spend the money on proving each new chip has a quality legacy stack.
Qualcomm has already done the legwork and testing needed so as long as the process is long and expensive it is almost certainly cheaper for Samsung to use their chips.
ahassan - Monday, March 31, 2014
But at the same time, owning your own chip IP is extremely expensive. Your chip R&D department also has to be on top of their game. In the end, for many companies, it's more worthwhile to just buy chip designs from other companies.
name99 - Monday, March 31, 2014
All this has happened before. Read Sutherland on the Wheel of Reincarnation or Christensen on Disruption Theory.
There is a time when doing everything in house is optimal and, luckily for Apple, mobile is at that time. Basically, doing everything in house is a win when the gains from tight integration (for example in a smaller device, or lower power usage, or SW making optimal use of all the HW features) outweigh the costs of not being able to spread R&D expenses across the ENTIRE community of buyers, no matter what the company. With mobile we're at that point and we're likely to stay there for a while, given that size will always be an issue, and that power seems likely to be an issue for a while.