The Neoverse E1 CPU: A Small SMT Core for the Data-Plane

We’ve talked a lot about the Neoverse N1, but today’s announcement actually covers two new platforms. This also isn’t the first time we’ve encountered the second platform, the new Neoverse E1: codenamed Helios, the E1 CPU is derived from the Cortex-A65AE, which we briefly covered back in December.

Arm broadly differentiates between two workload types in infrastructure deployments: “compute” use-cases that need arithmetically capable CPU cores such as the N1, and “throughput” workloads that are mainly about shifting large amounts of data around. The latter category is what the new Neoverse E1 targets: a specialised CPU core able to handle such tasks efficiently and cost-effectively.

Arm’s First SMT Core

As some might have read in the Cortex-A65AE announcement piece, the Helios microarchitecture, used both in the Cortex chip and in the Neoverse E1, represents Arm’s first foray into designing and bringing to market an SMT (simultaneous multi-threading) core. At first glance it may sound odd to introduce SMT in a smaller CPU core, as traditionally we think of SMT as a way to raise back-end execution unit utilisation in larger microarchitectures, but it also makes a lot of sense in smaller cores whose workloads live predominantly in the data plane.

In data-processing workloads, cache misses dominate the CPU’s work cycles. The data inherently doesn’t have a long residence time on the CPU core: it either streams from sensors, as in the Cortex-A65AE’s use-case, or, for infrastructure workloads, streams in from the network. This means the CPU has to deal with long-latency memory accesses, which introduce stalls in the pipelines.

The E1 CPU is a small out-of-order design with SMT: in data-plane-heavy workloads, the ability to handle a second thread on the core represents an almost free throughput gain for the microarchitecture, as it can fill execution cycles that would otherwise have gone to waste. In optimal workloads this can essentially double throughput, on top of the single-threaded microarchitectural gains the E1 brings over predecessors such as the Cortex-A55.
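
To put rough numbers on that intuition, here’s a minimal back-of-the-envelope model (our own illustration, not Arm’s figures): if a single thread keeps the pipelines busy only a fraction of the time, a second thread can soak up the idle cycles until the core saturates.

    /* Minimal back-of-the-envelope model of why SMT helps a stall-bound core.
     * All numbers and names here are illustrative, not Arm figures. */
    #include <stdio.h>

    /* Fraction of cycles a single thread keeps the pipelines busy. */
    static double single_thread_utilisation(double stall_fraction) {
        return 1.0 - stall_fraction;
    }

    /* With two threads, the second thread can issue during cycles where the
     * first is stalled on memory, up to the point where the core saturates. */
    static double two_thread_utilisation(double stall_fraction) {
        double busy = 2.0 * (1.0 - stall_fraction);
        return busy > 1.0 ? 1.0 : busy;
    }

    int main(void) {
        for (double stall = 0.3; stall <= 0.8; stall += 0.1) {
            double st = single_thread_utilisation(stall);
            double mt = two_thread_utilisation(stall);
            printf("stall %.0f%%: 1T util %.2f, 2T util %.2f, gain %.2fx\n",
                   stall * 100.0, st, mt, mt / st);
        }
        return 0;
    }

At a 50% stall rate the model saturates the core and the gain hits the 2x ceiling mentioned above; real workloads will land somewhere below that.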

At the cluster level, the Neoverse E1 CPU configuration options are extremely similar to what we’ve seen deployable in DynamIQ systems. The cluster itself looks very much like a DSU, with up to eight cores, a cluster-level snoop filter, and an L3 cache configurable up to 4MB.
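
As a purely illustrative sketch of the knobs involved (not an Arm API or configuration tool), the deployable options essentially boil down to a core count and a shared L3 size per cluster, within the limits quoted above:

    /* Illustrative sketch (not an Arm interface) of the cluster-level options
     * mentioned in the text: core count and shared L3 size, with the stated limits. */
    #include <stdbool.h>
    #include <stdio.h>

    struct e1_cluster_cfg {
        int cores;   /* 1..8 cores per cluster                  */
        int l3_kib;  /* shared L3 cache, configurable up to 4MB */
    };

    static bool cfg_is_valid(const struct e1_cluster_cfg *c) {
        return c->cores >= 1 && c->cores <= 8 &&
               c->l3_kib >= 0 && c->l3_kib <= 4096;
    }

    int main(void) {
        struct e1_cluster_cfg cfg = { .cores = 8, .l3_kib = 2048 };
        printf("valid: %d\n", cfg_is_valid(&cfg));
        return 0;
    }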

The E1 CPU’s Pipelines

The E1’s CPU pipeline is a brand-new design that (the A65 aside) we haven’t seen employed before. What Arm has done here is take the foundation of the in-order Cortex-A55 microarchitecture and turn it into a minimally narrow out-of-order CPU. Moving to a basic out-of-order design was essential to getting more throughput out of the core, as it avoids stalling the whole pipeline whenever a long load is outstanding.

The fetch/decode/dispatch stages of the pipeline are 2-wide. In terms of the out-of-order window, we see a small 40-entry ROB (re-order buffer).
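
A quick, hedged illustration of why such a small window still leaves the core exposed to memory latency: with an assumed DRAM miss latency of around 150 cycles (our placeholder, not an Arm figure), a 2-wide front-end fills the 40-entry ROB long before the miss returns, which is exactly the gap SMT is meant to fill.

    /* Rough illustration of how quickly a small OoO window is exhausted by a
     * long-latency miss. The 2-wide dispatch and 40-entry ROB come from Arm's
     * disclosed figures; the miss latency is an assumed placeholder. */
    #include <stdio.h>

    int main(void) {
        const int dispatch_width = 2;   /* instructions per cycle (front end) */
        const int rob_entries    = 40;  /* re-order buffer depth              */
        const int miss_latency   = 150; /* assumed DRAM miss latency, cycles  */

        /* Once a load misses, the ROB fills in roughly this many cycles... */
        int cycles_to_fill = rob_entries / dispatch_width;
        /* ...after which the thread stalls until the miss returns. */
        int stalled_cycles = miss_latency - cycles_to_fill;

        printf("window fills in ~%d cycles, thread then idles ~%d of %d cycles\n",
               cycles_to_fill, stalled_cycles, miss_latency);
        return 0;
    }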

On the back-end we find a similar execution unit count and layout to the Cortex-A55. One big difference is that the integer ALUs are now partitioned differently: one ALU is dedicated to simple arithmetic operations only, while the second unit takes over both integer multiplication and division, tasks that were split between the two ALUs on the A55. Interestingly, Arm also depicts the branch port as being shared with this second, complex integer ALU, no longer representing it as a dedicated port as on the A55.

The SIMD/NEON pipelines seem to remain the same as on the A55 in terms of their high-level capabilities, though Arm may have improved the cycle latencies of some instructions between the two generations.

Finally, the load and store ports also match the capabilities of the A55.

SMT on the Neoverse E1 is enabled through the duplication of architectural state components of the core. This means the CPU has double the general purpose, vector and system registers and their corresponding structures on the physical core.

At the software level, this simply appears as two separate CPU cores, and the two threads can run at different exception levels or even run different OSes.
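
On Linux, for instance, the two hardware threads enumerate as separate logical CPUs, and the standard sysfs topology files show which of them share a physical core. The following small sketch just walks those files; this is plain Linux sysfs, nothing E1-specific:

    /* Each hardware thread shows up as its own logical CPU; the kernel exposes
     * which logical CPUs share a physical core via sysfs. */
    #include <stdio.h>

    int main(void) {
        char path[128], buf[64];

        for (int cpu = 0; ; cpu++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                     cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                break;                      /* no more CPUs */
            if (fgets(buf, sizeof buf, f))
                printf("cpu%d shares a core with: %s", cpu, buf);
            fclose(f);
        }
        return 0;
    }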

Performance partitioning between the two threads is handled by a simple round-robin instruction fetch mechanism, ensuring that both threads get the same amount of attention.
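
Conceptually, the policy looks something like the sketch below: a strict round-robin arbiter that falls back to the other thread whenever the preferred one has nothing to fetch. This is our illustration of the idea, not Arm’s actual implementation.

    /* Conceptual sketch of round-robin fetch arbitration between two hardware
     * threads: alternate strictly unless one thread is stalled, in which case
     * the other gets the slot. Illustration only, not Arm's RTL. */
    #include <stdbool.h>
    #include <stdio.h>

    static int last_thread = 1;  /* thread that fetched last cycle */

    /* Returns which thread (0 or 1) gets the fetch slot this cycle, or -1. */
    static int pick_fetch_thread(bool t0_ready, bool t1_ready) {
        int preferred = 1 - last_thread;          /* alternate by default */
        if ((preferred == 0 && t0_ready) || (preferred == 1 && t1_ready))
            last_thread = preferred;
        else if (t0_ready)
            last_thread = 0;
        else if (t1_ready)
            last_thread = 1;
        else
            return -1;                            /* both stalled this cycle */
        return last_thread;
    }

    int main(void) {
        /* Thread 1 is stalled for the first three cycles. */
        bool t1_ready_by_cycle[6] = { false, false, false, true, true, true };
        for (int cycle = 0; cycle < 6; cycle++)
            printf("cycle %d -> thread %d\n", cycle,
                   pick_fetch_thread(true, t1_ready_by_cycle[cycle]));
        return 0;
    }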

The front-end of the CPU has seen improvements across the board and adopts many of the state-of-the-art branch prediction and prefetch mechanisms we find in more recent Arm out-of-order cores, just in a more limited implementation optimised for the smaller scale of the E1.

As mentioned earlier, although the throughput of the execution pipelines hasn’t fundamentally changed, Arm has updated the execution units to newer designs with shorter cycle latencies. In particular, FMAC latency has been halved, while the integer division unit has been updated to a radix-16 design. It should be noted that these improvements are slightly misleading, as Arm is making comparisons to the Cortex-A53: the newer Cortex-A55 already employed the aforementioned changes, including the Armv8.2 double-throughput FP16 instructions as well as Int8 dot-products.
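
For reference, the Int8 dot-product support mentioned here is the Armv8.2 SDOT/UDOT extension, which compilers expose through NEON intrinsics. A minimal sketch (built with something like -march=armv8.2-a+dotprod) would look as follows; the data and sizes are purely illustrative:

    /* Minimal use of the Armv8.2 int8 dot-product (SDOT) through NEON
     * intrinsics. Each vdotq_s32 call accumulates four 4-element int8 dot
     * products into 32-bit lanes. */
    #include <arm_neon.h>
    #include <stdint.h>
    #include <stdio.h>

    int32_t dot_s8(const int8_t *a, const int8_t *b, int n) {
        int32x4_t acc = vdupq_n_s32(0);
        int i = 0;
        for (; i + 16 <= n; i += 16)
            acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
        int32_t sum = vaddvq_s32(acc);      /* horizontal add of the 4 lanes */
        for (; i < n; i++)                  /* scalar tail */
            sum += a[i] * b[i];
        return sum;
    }

    int main(void) {
        int8_t a[16], b[16];
        for (int i = 0; i < 16; i++) { a[i] = (int8_t)i; b[i] = 2; }
        printf("%d\n", dot_s8(a, b, 16));   /* 2 * (0+1+...+15) = 240 */
        return 0;
    }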

On the memory system side of things, we again see very close similarities to the Cortex-A55, and the E1 enjoys the same improvements made by that microarchitecture. The only other notable changes are that the data prefetchers as well as the L2 TLB are now multi-thread aware, optimised to operate better with the SMT nature of the core.

Comments

  • Antony Newman - Thursday, February 21, 2019 - link

    (Arbitrary example)

    If an SoC can run at 5GHz when a single core is active, but throttles down to 2.5GHz when 16 cores are active - then it cannot scale (due to the TDP limit).

    If ARM are designing their CPUs so that 128 (i.e. all) of them can run flat out without requiring throttling, then ARM's single-core performance is indicative of the overall performance.

    If ARM increase their single-core performance by 1.7 times in two years - and keep this same MO (of no throttling needed to stay within the TDP) - it will be more than just data centres that want to buy into this new architecture.

    AJ
  • wumpus - Thursday, February 21, 2019 - link

    Very few problems scale without penalty. Having high single-core performance (for each core in a multichip server CPU, obviously; the Intel result using all of its cache on one core is irrelevant and why it was so anomalous vs. AMD) means far fewer cores are needed when scaling up. Also, adding more and more cores requires as much cache or more. If not, your bandwidth will scale even worse.

    Single-core performance is absolutely critical for servers, and is why it is taking ARM so long to break in. IBM is the exception that proves the rule: but they rely on weird licensing rules and on making sure all the threads can access the same cache.
  • eastcoast_pete - Thursday, February 21, 2019 - link

    I actually think we are in agreement. While this borders on semantics, per-core performance is, of course, very important for servers, while high single (one) core performance is not. As you point out, Intel getting really high one-core performance from an 18-core Xeon by running a strictly single-core/thread test while allocating all the cache and much of the thermal envelope to that one core is an artificial situation for a server.
  • The_Assimilator - Wednesday, February 20, 2019 - link

    Remember when "system on chip" meant IO too? Apparently Arm doesn't.

    Remember when Arm chips didn't need HSFs to run? Pepperidge Farm remembers.

    I'm going to enjoy it when this, like all of Arm's previous attempts at the high-end, fails once again. Or when Lakefield eats Arm's lunch, whichever comes first.
  • wumpus - Wednesday, February 20, 2019 - link

    When your volume is 1400 chips (not all the same design) over 4 years, you use FPGA for anything you can. Doing anything else is pretty dumb. I'm surprised they bothered with an actual layout, but I suspect that they've been bitten by tiny details in FPGA simulation that never quite worked the same at speed.

    HSF? You want the MIPS, you burn the Watts. Presumably this is your "tell" in your troll.

    When has ARM made a previous attempt at the high-end? Certainly more than a few of their architectural licensees have, but there's a huge difference between a server architecture backed by ARM and even one backed by Qualcomm. For one thing, they pretty much need to standardize remote administration to Intel levels (possibly circa ~2008ish to get off the ground). That's a lot of pesky little details, but something they absolutely need standardized to allow server use in the datacenter (yes, the Big Boys can roll their own, but everybody else needs a common server definition).
  • Antony Newman - Wednesday, February 20, 2019 - link

    Fascinating article.

    Do you think Ampere, Huawei, Cavium and Amazon will all switch to the Neoverse?

    In terms of IPC - do you have a view on whether ARM have caught up with Apple's Vortex yet?

    Is there any reason why a mobile phone (or tablet) maker wouldn't use the ARM ‘server’ chip in a fondleslab?

    AJ
  • ballsystemlord - Wednesday, February 20, 2019 - link

    Spelling and grammar corrections:
    ...the actual real-life performance improvements will higher due other SoC-level improvements as well as software improvements that aren't available in existing actual A72 silicon products.
    Missing be:
    ...the actual real-life performance improvements will be higher due other SoC-level improvements as well as software improvements that aren't available in existing actual A72 silicon products.

    The figured weren't run actual silicon but rather estimated on Arm's server farm in an emulation environment with RTL.
    Miswritten sentence:
    The figures weren't calculated on actual silicon but rather estimated on Arm's server farm in an emulation environment with RTL.

    The E1's CPU pipeline actually represents a brand new-design which (besides the A65) haven't seen employed before.
    Missing we:
    The E1's CPU pipeline actually represents a brand new-design which (besides the A65) we haven't seen employed before.

    Here we have to clusters of 8 cores in a small CMN-600 2x4 mesh network, ...
    Wrong 2:
    Here we have two clusters of 8 cores in a small CMN-600 2x4 mesh network, ...

    I was half asleep when I read it so there might be more.
  • sohntech43 - Wednesday, February 20, 2019 - link

    Could someone help me understand why the Spec CPU2006 results are so different from those recorded for the AMD 7601 (1000 - 1200 vs. 690.63) and Xeon Platinum results (1300+ vs 730) in the Spec data base?

    https://www.spec.org/cpu2006/results/cpu2006.html

    They are also different from what AMD was boasting at the time of the original EPYC launch:

    https://www.microway.com/download/whitepaper/AMD-E...

    I'm probably missing something obvious...
  • Wilco1 - Wednesday, February 20, 2019 - link

    Yes you're missing the fact these are GCC8 scores using -Ofast as mentioned in the article - ie. like when you build code yourself.

    Official SPEC scores are quite different and use special trick compilers to get the highest score. For example libquantum shows a completely unrealistic result in most SPEC submissions which artificially inflates the integer score by 30+%.
  • sohntech43 - Wednesday, February 20, 2019 - link

    Thanks - was surprised by the sheer magnitude of the delta caused by the compilers. Impressive results for N1 and will be interesting to see when silicon is available.
