35 Comments
beginner99 - Tuesday, March 21, 2017 - link
I always wondered why Intel hasn't done something like this yet: a pair of Core2-like cores and a pair of Atom (Silvermont+) cores for a laptop. This removes the need for stuff like Core M.
The tricky part is how to assign threads to cores. It probably needs a new API and scheduler: each core communicates its performance and power use, and the OS schedules background tasks to the slow, power-saving cores and user threads to the fast cores.
tuxRoller - Tuesday, March 21, 2017 - link
A good chunk of the work is done. The model is called EAS.
https://www.linaro.org/blog/core-dump/energy-aware...
https://lwn.net/Articles/706374/
For each SoC, part of the bring-up includes supplying the performance (normalized to the fastest core's highest P-state), power states, and state-change costs. The scheduler then uses those values to decide what to place where.
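To make that concrete, here is a rough sketch of the idea, not the actual EAS code. It assumes a hypothetical per-SoC table of operating points, with performance normalized so the fastest core's top P-state is 1024 (just a convention picked for this example) and power figures that are invented. The names `OperatingPoint`, `CoreType` and `place_task` are made up for illustration, and state-change costs are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    capacity: int    # performance, normalized so the fastest core's top P-state is 1024
    power_mw: float  # hypothetical active power at this operating point


@dataclass(frozen=True)
class CoreType:
    name: str
    points: tuple    # OperatingPoints sorted by increasing capacity


# Hypothetical bring-up table; every number here is invented for illustration.
LITTLE = CoreType("little", (OperatingPoint(200, 40.0), OperatingPoint(400, 110.0)))
BIG = CoreType("big", (OperatingPoint(500, 250.0), OperatingPoint(1024, 900.0)))


def cheapest_point(core, load):
    """Lowest operating point of `core` whose capacity covers `load`, else None."""
    for point in core.points:
        if point.capacity >= load:
            return point
    return None


def estimated_power(point, load):
    """Power scaled by utilisation: a crude stand-in for a real energy model."""
    return point.power_mw * load / point.capacity


def place_task(load, cores=(LITTLE, BIG)):
    """Return the name of the core type estimated to run `load` most cheaply."""
    best_name, best_cost = None, float("inf")
    for core in cores:
        point = cheapest_point(core, load)
        if point is None:
            continue  # this core type cannot fit the task at any P-state
        cost = estimated_power(point, load)
        if cost < best_cost:
            best_name, best_cost = core.name, cost
    return best_name
```

With this made-up table, a light background task (load 150) lands on the little cores, while anything above the little cluster's top capacity (say, load 450) has to go to the big cores. A real scheduler would also weigh the state-change costs and what is already running on each core.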
StevoLincolnite - Tuesday, March 21, 2017 - link
Cost, probably.
Intel would rather give you a small, efficient processor at idle that can also clock high when the need arises.
Krysto - Tuesday, March 21, 2017 - link
Because Intel cares first and foremost about profit margin. Core + Atom > just low-clocked Core.
Meteor2 - Tuesday, March 21, 2017 - link
So why does ARM do it? Hopefully they care about profit too. If they don't they'll go out of business.
Rakib - Tuesday, March 21, 2017 - link
They sell licenses, cpu's.
saratoga4 - Tuesday, March 21, 2017 - link
The reason ARM does it is explained in this article. Having heterogeneous cores is more power efficient if you need to scale between multiple performance levels.
saratoga4 - Tuesday, March 21, 2017 - link
x86 cores are now so small the cost of the core itself is negligible.
I think the real reason is that Intel tries to use one design for a lot of niches, and they don't want to start optimizing heavily for one application the way ARM does. I suspect they may have to eventually, however.
BurntMyBacon - Wednesday, March 22, 2017 - link
Keep in mind, the cost of the core and the price Intel charges for it are two different things.
Leaving aside whether the cost/price for an x86 core is insignificant, I'm not fully convinced that big.LITTLE is the best approach in the first place. It seems to me that, rather than put in the work needed to let cores use lower power states and even off states, ARM simply decided that it was more profitable to design a new low power core to sell as the solution and make their customers deal with the increased die area. There is also an associated cost to the end consumer that I certainly wouldn't welcome when half my cores' worth of die space is always inactive. Kind of reminds me of how much money I'm wasting on the IGP die space of an i7-7700K that is effectively dead space. Apple's A9 and predecessors did quite well despite not using big.LITTLE.
Perhaps big.LITTLE would make more sense to me if there were a much larger difference in the processing power of the big and LITTLE cores. That said, I'm not sure what application would care. Phones can't make use of high TDP processors. Tablets may be able to, but not for any meaningful length of time. Laptops can work with higher TDP processors (relative to phones), and an extremely low TDP processor (like a phone's) would be unbearably slow, though perhaps there is a niche here. I've played the "use a phone / tablet as a full computer" game on iOS, Android, and Windows and come away less than impressed for anything beyond basic media consumption (even that is lacking in some ways). The associated battery life gains would often go unappreciated anyway, given that many laptops can already achieve 8 hrs (north of 12 hrs in some cases) of battery life. Desktops and workstations don't care for the low performance cores. Some servers may like low performance cores, but would benefit far more if all the cores were usable simultaneously (so not big.LITTLE).
I do like where this DynamIQ is going, though. Heterogeneous sets of cores with their own sets of compute characteristics that can be active for any appropriate workload regardless of what the other cores are doing makes a lot of sense. Though, I feel like this will get used mainly for high / low power cores, when it would make more sense to use a general purpose sequential core / highly parallel core / special purpose (media decoding, security processing, etc.) split.
extide - Wednesday, May 10, 2017 - link
FYI, current implementations of big.LITTLE do allow all cores to be used at the same time, not just cores from one cluster at a time. Only some of the first chips out, and Nvidia's chips, are limited to one cluster at a time.
edzieba - Tuesday, March 21, 2017 - link
I'm fully expecting a future chip to dump the IGP in favour of a cluster of Xeon-Phi-like cores.
prisonerX - Tuesday, March 21, 2017 - link
LOL
Alexvrb - Tuesday, March 21, 2017 - link
Maybe for headless applications that also don't have strict TDP requirements. Servers, perhaps. Otherwise no.
lopri - Tuesday, March 21, 2017 - link
"Increased efficiency from shared memory between CPUs" (slide pg.14) Does that mean past and present big.LITTLE SoCs do not share memory between the clusters? Is that why big.LITTLE SoCs have such a poor showing in memory-bound benchmarks?
name99 - Tuesday, March 21, 2017 - link
It means a shared LLC, so you don't have to pay the cost of copying data from one cluster to the other.
jjj - Tuesday, March 21, 2017 - link
Do they add an L3$?
Guess gen 2 would be integrating accelerators in a cluster.
boeush - Tuesday, March 21, 2017 - link
"As the tide of progress washes against the sure, ARM is today announcing the next step on the sandy beach with DynamIQ."
Wow, Ian. I mean, it's you and all, but this one just floored me... I do believe you are the new Shakespeare of tech reporting ;-p
Meteor2 - Tuesday, March 21, 2017 - link
I just love that a typo made it in there too, just as Ian was doing his eloquent thing. It's like The Guardian round here :)
boeush - Tuesday, March 21, 2017 - link
It's the guardIan, alright.
I still shed tears of joy (? well... mirth, anyway, at least) every time I reread that sentence. I almost want to frame and hang it up on a wall as an immortal work of some kind of art...
Ian Cutress - Tuesday, March 21, 2017 - link
Ooops :D
Alexvrb - Tuesday, March 21, 2017 - link
I thought it was intentional. Like technological progress vs certainty.
g1011999 - Tuesday, March 21, 2017 - link
My wild guess is that the Apple A10 already has a similar design: big and little cores in the same cluster sharing the same L2 cache, with a hardware-based system monitor that automatically switches thread execution to either core.
Meteor2 - Tuesday, March 21, 2017 - link
Well, this seems to validate MediaTek's Tri-Cluster and CorePilot technologies. But it always strikes me that while the ARM ISA is popular, ARM cores and SoC tech are not. Apple have their own cores, Qualcomm does, Nvidia does. Rumours are that the X30 is getting very few design wins; most phone makers are buying 821s or 835s.
So maybe ARM's own IP isn't that great.
Mobile-Dom - Tuesday, March 21, 2017 - link
You say that, but:
Apple makes only custom cores.
Qualcomm does, but Kryo cores are only used in their top performing chips; everything else is standard ARM IP.
Samsung mixes their custom IP with standard ARM IP.
Nvidia created their own custom core (Denver), which sadly failed spectacularly, but also pairs it with standard ARM IP.
MediaTek *only* uses standard ARM IP.
For all but the extremely high performance stuff, where custom is more efficient and effective, everyone uses standard ARM IP.
Meteor2 - Tuesday, March 21, 2017 - link
Yeah I was referring to high-end (because that's the most interesting).
Is Denver a failure? It's in Jetson and DrivePX, and Nvidia are apparently rolling another custom core for Xavier.
saratoga4 - Tuesday, March 21, 2017 - link
Qualcomm and Nvidia are moving towards ARM designs from full custom though. If anything, ARM is gaining ground.
senecarr - Tuesday, March 21, 2017 - link
I'm not sure I'd say Qualcomm is moving towards ARM designs. The change happened when Apple shifted the market by introducing 64-bit ahead of what people expected, and Qualcomm didn't have a 64-bit design in their near term roadmap. Since getting their first 64-bit ARM standards out, they've been shifting back to their business as usual custom designs.
Marc GP - Tuesday, March 21, 2017 - link
Qualcomm is moving towards ARM designs. The Snapdragon 835 uses a customization of the Cortex-A73, not an evolution of the Kryo cores on the Snapdragon 820.
serendip - Wednesday, March 22, 2017 - link
And looking at midrange designs like the 650/652, those plain ARM A72 cores pack one heck of a punch while still keeping power consumption reasonable. It might not make sense in the future to devote resources to a custom core when an ARM core is good enough while allowing faster time to market.
serendip - Wednesday, March 22, 2017 - link
I guess it comes down to how fine the hairs can be split... You could have 10 clusters at each power level and you'd see little performance penalty if all cache was shared. A tricluster design could work well if there was enough power/performance differentiation for the middle cluster. Looking at MediaTek's latest stuff, it looks like they're still struggling with efficiency and adding another cluster didn't improve things.
Meteor2 - Wednesday, March 22, 2017 - link
Has the X30 been reviewed? Seems a little harsh to say they're still struggling if not. It could be a belter with A73 and A35 in there.
serendip - Wednesday, March 22, 2017 - link
The X20 and X25 still can't compete with the Snapdragon 650/652 on an ancient 28nm process.
The Redmi Note 4 with the X20 (4x A53 + 4x A53 + 2x A72) has 25% less battery life compared to the Redmi Note 3 Pro with the Snapdragon 650 (4x A53 + 2x A72), as shown in AnandTech reviews. These two phones have similar components and battery capacity, yet the older phone is more efficient and has higher performance even though the Snapdragon 650 is on an ancient 28nm process. MediaTek chips consume a lot more power and have slow GPUs compared to midrange Qualcomm stuff.
Meteor2 - Thursday, March 23, 2017 - link
Granted. But I'm happy to wait and see how the X30 does when it's released. MediaTek claim to have made a jump in performance; maybe they have.
chipped - Saturday, March 25, 2017 - link
Great, Samsung and MediaTek will love these: get a license and slap together another lame SoC instead of innovating and customising themselves like Qualcomm and Apple.
I'm going to call this "Cluster Fucks"
warreo - Tuesday, March 28, 2017 - link
Qualcomm doesn't really customize their chips all that much anymore either. SD835's Kryo cores are part of the new "Built on ARM Cortex Technology", which means that they are basically lightly optimized ARM A53 and A72 cores. That is why SD835, Kirin 960 and Exynos 8895 are all benchmarking in the same general performance range, because they are essentially the same architectures these days.