N1 Hyperscale Reference Design

A big part of what is defining the N1 Platform as an actual platform, is the fact that Arm is offering a full reference design with a set of IPs that is fully validated by Arm themselves.

Here we see three reference designs, a Neoverse N1 hyperscale design, which we’ll get into more detail shortly, an N1 edge design, and a Neoverse E1 edge design. Arm’s goals with the reference designs is to give vendors “sweet-spot” configuration options that they will then be able to implement with (relatively) minimal effort.

The N1 hyperscale design is what we’ll be covering in more detail as this represents Arm’s most cutting-edge and competitive product.

As covered on the previous page, at the heart we find the Neoverse N1 CPU in either 64 or 128 core configurations, integrated in a CMN-600 mesh network with either 64 or 128MB of SLC cache. We also see 128 lanes for PCIe 4 respectively CCIX interfaces which provide plenty of I/O bandwidth.

In terms of memory controllers, Arm employs 8x DD4 interfaces up to 3200MHz. Arm actually has abandoned development of its own memory controllers as customers in most cases opted for their own in-house designs or rather opted to choose IP from other third-party vendors such as Cadence or Synopsys. For the current reference designs Arm’s own DMC-520 was still up-to-date and a well-understood block for the company, although in the future newer memory controllers such as for DDR5 will have to rely on third-party IP. Naturally, the reference design targets the latest 7nm process node.

The physical implementation of the SoC would use replicable hierarchical building blocks for ease of design. A “CPU Tile” consists of the two N1 CPU cores, a slice/bank of the SLC cache as well as part of the CMN’s cross points and home-nodes. This CPU Tile is replicated to generate a “Super Tile”, what is added here is peripheral parts of the SoC such as I/O as well as memory controllers. Finally, replicating the super tile in flipped and mirrored implementations results in the final top-level mesh that is to be implemented on the SoC.

Scaling the design to 128 cores doesn’t represent an issue for the IP, although we’ll be hitting some practical limits in terms of current generation technology. Arm’s 64 core N1 reference design with 64MB of cache on a 7nm process node would result in a die size a little under 400mm², which probably is on the higher end of what vendors would want to target in terms of manufacturability. To alleviate such concerns, Arm also took a page out of AMD’s book and floated the idea of chiplet designs, where each chiplet would communicate over CCIX links. Inherently it’s up to the vendor to decide how they’ll want to design their solution, and Arm provides the essential building blocks and flexibility to enable this.

SmartNIC integration capability is also an important aspect of the design and its flexibility. To maximise compute capacity in large scale system, having accelerated network connectivity is key in actually achieving high throughput in the densest (and efficient) form-factor possible.

The CMN-600 allows for slave ports on its crosspoints: Here we can see MMUs connected with high bandwidth interfaces of up to 128GB/s. Attaching fixed-function hardware offloading IP thus would be extremely easy to implement.

CCIX is extremely important for Arm as it enables its product portfolio to integrate with third-party IP offerings. Enabling cache coherency for external IP blocks is an incredibly attractive feature to have as it massively simplifies software design for the vendors. Essentially what this means is that software simply sees a single huge block of memory, whereas non-coherent systems require drivers and software to be aware and track what part of memory is valid and what isn’t. In terms of IP integration, Arm provides the CCIX coherent gateway that integrates with the CMN-600, while on the other side it’s the onus of the third-party IP provider to provide the CCIX translation layer.

Currently Xilinx will be among the first vendors to offer CCIX-enabled end-products in Q3 2019. With AMD also fully embracing CCIX, there’s some very exciting future potential for third-party accelerator hardware, and we be seeing new use-cases that just weren’t feasible before.

Power/Performance management

While it’s a bit weird to talk about power management in the context of implementation scalability (The average reader might think of it as a thermal/cooling consideration), there’s some very interesting implications in terms of how Arm simplifies the work needed to be done by the vendor.

Along a chip’s logical design, a vendor must also implement a power delivery network that will be able to adequately support the IP. In real-world use-cases this means that the PDN needs to be as robust as to deal with the worst-case power scenario of a component. This is actually quite a headache for many vendors as the design requires complex models and in most cases the PDN will need to be over-engineered in order to offer guarantees of stability, which in turn raises the complexity and cost of the implementation.

Arm seeks to alleviate these concerns by offering extremely fine-grained DVFS mechanisms in the form of a dedicated micro-controller. The controller access detailed activity monitoring units inside the CPU cores, seeing what actual blocks and how many transistors are actually actively switching, and feeding this information back to the system controller to change DVFS states. This provides a certain level of hard-guarantee as to when the CPU enters power-virus-like workloads which can cause current spikes, and avoid them in time. This enables vendors to design their PDNs to more conservative tolerances, saving on implementation cost.

The Neoverse N1 CPU: No-Compromise Performance Performance Targets: What Are The Numbers
Comments Locked

101 Comments

View All Comments

  • Antony Newman - Thursday, February 21, 2019 - link

    (Arbitrary example)

    If a SoC can run at 5GHz when 8 cores single core, but throttles down to 2.5GHz when 16 cores are active - then it cannot scale (due to the TDP limit).

    If ARM are designing their CPUs so that 128 (ie all) of them can run flat out without requiring throttling, then ARMs single core performance is indicative of the overall performance.

    If ARM increase their single core performance by 1.7 times in two years - and keep this same MO (of no Throttling cause to keep within the TDP) - it will be more than just data centres that want to buy into this new architecture.

    AJ
  • wumpus - Thursday, February 21, 2019 - link

    Very few problems scale without penalty. Having high single core performance (for each core in a multichip server CPU, obviously. The Intel result using all of its cache on one core is obviously irrelevant and why it was so anomalous vs. AMD) means far less cores are needed, when scaling up. Also adding more and more cores require as much cache or more. If not, your bandwidth will scale even worse.

    Single core is absolutely critical for servers, and why it is taking ARM so long to break in. IBM is the exception that proves the rule: but they rely on weird licensing rules and making sure all the threads can access the same cache.
  • eastcoast_pete - Thursday, February 21, 2019 - link

    I actually think we are in agreement. While this borders on semantics, per core performance is, of course, very important for servers, while high single (one) core is not. As you point out, Intel getting really high one core performance from a 18 core Xeon by running a strictly single core/thread test while allocating all the cache and much of the thermal envelope to that one core is an artificial situation for a server.
  • The_Assimilator - Wednesday, February 20, 2019 - link

    Remember when "system on chip" meant IO too? Apparently Arm doesn't.

    Remember when Arm chips didn't need HSFs to run? Pepperidge Farm remembers.

    I'm going to enjoy it when this, like all of Arm's previous attempts at the high-end, fails once again. Or when Lakefield eats Arm's lunch, whichever comes first.
  • wumpus - Wednesday, February 20, 2019 - link

    When your volume is 1400 chips (not all the same design) over 4 years, you use FPGA for anything you can. Doing anything else is pretty dumb. I'm surprised they bothered with an actual layout, but I suspect that they've been bitten by tiny details in FPGA simulation that never quite worked the same at speed.

    HSF? You want the MIPS, you burn the Watts. Presumably this is your "tell" in your troll.

    When has ARM made a previous attempt at the high-end? Certainly more than a few of their architectural licensees have, but there's a huge difference between a server architecture backed by ARM and even one backed by Qualcom. For one thing, they pretty much need to standardize remote adminstration to Intel levels (possibly circa ~2008ish to get off the ground). That's a lot of pesky little details, but something they absolutely need standardized to allow server use in the datacenter (yes, the Big Boys can roll their own, but everybody else needs a common server definition.
  • Antony Newman - Wednesday, February 20, 2019 - link

    Fascinating article.

    Do you think Ampere, Huawei, Cavium and Amazon will all switch to the Neoverse?

    In terms of IPC - do you have a view on if ARM have Caught up with Apples Vortex yet?

    Is there any reason why a mobile phone (or Tablet) maker would’t use the ARM ‘server’ chip in a fondleslab?

    AJ
  • ballsystemlord - Wednesday, February 20, 2019 - link

    Spelling and grammar corrections:
    ...the actual real-life performance improvements will higher due other SoC-level improvements as well as software improvements that aren't available in existing actual A72 silicon products.
    Missing be:
    ...the actual real-life performance improvements will behigher due other SoC-level improvements as well as software improvements that aren't available in existing actual A72 silicon products.

    The figured weren't run actual silicon but rather estimated on Arm's server farm in an emulation environment with RTL.
    Miswritten sentence:
    The figures weren't calculated on actual silicon but rather estimated on Arm's server farm in an emulation environment with RTL.

    The E1's CPU pipeline actually represents a brand new-design which (besides the A65) haven't seen employed before.
    Missing we:
    The E1's CPU pipeline actually represents a brand new-design which (besides the A65) we haven't seen employed before.

    Here we have to clusters of 8 cores in a small CMN-600 2x4 mesh network, ...
    Wrong 2:
    Here we have two clusters of 8 cores in a small CMN-600 2x4 mesh network, ...

    I was half asleep when I read it so there might be more.
  • sohntech43 - Wednesday, February 20, 2019 - link

    Could someone help me understand why the Spec CPU2006 results are so different from those recorded for the AMD 7601 (1000 - 1200 vs. 690.63) and Xeon Platinum results (1300+ vs 730) in the Spec data base?

    https://www.spec.org/cpu2006/results/cpu2006.html

    They are also different from what AMD was boasting at the time of the original EPYC launch:

    https://www.microway.com/download/whitepaper/AMD-E...

    I'm probably missing something obvious...
  • Wilco1 - Wednesday, February 20, 2019 - link

    Yes you're missing the fact these are GCC8 scores using -Ofast as mentioned in the article - ie. like when you build code yourself.

    Official SPEC scores are quite different and use special trick compilers to get the highest score. For example libquantum shows a completely unrealistic result in most SPEC submissions which artificially inflates the integer score by 30+%.
  • sohntech43 - Wednesday, February 20, 2019 - link

    Thanks - was surprised by the sheer magnitude of the delta caused by the compilers. Impressive results for N1 and will be interesting to see when silicon is available.

Log in

Don't have an account? Sign up now