128-Core Mesh Setup & Memory Subsystem

Starting off the testing, one thing that is extremely intriguing about Ampere’s implementation of its Altra designs is that they achieve more than 64 cores whilst still using Arm’s CMN-600 mesh network IP. In our coverage earlier this year of Arm’s upcoming CMN-700 mesh network, we wrote about the fundamental structure of the CMN mesh and its building blocks, such as RN-Fs, HN-Fs, and CALs.

In a typical deployment, a mesh consists of cross-points (XPs) whose RN-Fs (fully coherent request nodes) connect either directly to a CPU, or to a CAL (component aggregation layer), which can house two CPUs.

Our initial confusion last year with the Quicksilver design was that 80 cores is more than what the CMN officially supports when configured with the maximum mesh size and two cores per CAL per XP. Ampere was generally coy about discussing the mesh setup back then, but in more recent discussions with Arm and Ampere, the companies have divulged that it’s also possible to house the CPUs inside a DSU (DynamIQ Shared Unit), the same cluster design that we find in mobile Arm SoCs with Cortex CPUs.

Ampere has since confirmed the mesh setup with regard to the CPUs: instead of attaching cores directly to the mesh via RN-Fs, or even via a CAL on each XP, the design employs two DSUs, each with two Neoverse N1 cores, connected to a CAL, connected to a single XP. That means each mesh cross-point houses four cores, vastly reducing the mesh size needed to reach such core counts (128 cores, for example, require only 32 CPU-bearing cross-points). This is valid both for the Quicksilver 80-core designs and for the new Mystique 128-core designs; the only difference with Mystique is that Ampere has simply increased the mesh size (we still don’t have official confirmation on the exact setup here).

From a topology perspective, the Altra Max is still a massive monolithic 128-core chip, with competitive core-to-core latencies within the same socket. Our better understanding of the use of DSUs in the design also explains the distinct low-latency figures of ~26ns that only occur between certain core pairs – these would presumably be two sibling cores housed within a single DSU, where coherency and communication between the two doesn’t have to go out into the mesh, which incurs higher latencies.
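
For context, figures like these are typically obtained with a simple cache-line ping-pong between two pinned threads. Below is a minimal sketch of the technique (built with -pthread); the core numbers are purely illustrative, and whether two logical CPUs end up being DSU siblings depends entirely on how the system enumerates its cores:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L

    /* Single cache line that the two threads bounce between their caches. */
    static _Atomic long flag __attribute__((aligned(64)));

    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Responder: waits for each odd value and answers with the next even one. */
    static void *pong(void *arg) {
        pin_to_core(*(int *)arg);
        for (long i = 1; i <= 2 * ITERS; i += 2) {
            while (atomic_load(&flag) != i) ;      /* spin until the ping arrives */
            atomic_store(&flag, i + 1);            /* send the pong back */
        }
        return NULL;
    }

    int main(void) {
        /* Illustrative pair: whether these are DSU siblings or two distant mesh
           nodes depends on how the OS numbers the cores. */
        int ping_core = 0, pong_core = 1;
        pthread_t t;
        pthread_create(&t, NULL, pong, &pong_core);
        pin_to_core(ping_core);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < 2 * ITERS; i += 2) {
            atomic_store(&flag, i + 1);            /* ping */
            while (atomic_load(&flag) != i + 2) ;  /* wait for the pong */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Each iteration is a full round-trip, i.e. two one-way transfers. */
        printf("core %d <-> core %d: %.1f ns one-way\n",
               ping_core, pong_core, ns / ITERS / 2.0);
        return 0;
    }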

We had discussed Ampere’s quite high inter-socket latencies in our review of the Altra last year. As a fresh reminder, this is because the design doesn’t have a single coherency protocol spanning from the local mesh network to the remote mesh of the other socket – it instead has to go through an intermediary cache-coherency protocol translation for inter-socket communication, CCIX in this case. This is particularly inefficient when two cores within a socket have to work on a cache line homed on the remote socket: communication between sibling cores within a DSU is very efficient here, but between cores on the mesh it means doing a round-trip to the remote socket, resulting in pretty awful latencies.

 

The good news for the new Altra Max design is that Ampere was able to vastly improve the inter-socket communication overhead by optimising the CCIX part of the stack. The result is that socket-to-socket core latencies have gone down from ~350ns to ~240ns, and the aforementioned core-to-core latencies within a socket on a remote cache line from ~650ns to ~450ns – still pretty bad, but undoubtedly a large improvement.

Latencies within a socket can be slightly higher at the extremes, simply due to the larger mesh. Ampere has boosted the mesh frequency from 1800MHz to 2000MHz in this generation, so there is a slight boost there, as well as in the associated bandwidth.


Looking at the memory latencies of the new part, comparing the Q80-33 to the M128-30 results at 64KB page size, the first thing that stands out is that the new Altra Max system now only has 16MB of SLC (system level cache), half of the 32MB of the Quicksilver design. This was one of the compromises the company decided to make when increasing the core count and mesh in the Mystique design.

L3/SLC latencies are also slightly up, from 30ns to 33.6ns. Some of that is due to the 10% slower CPU clock, but most of it is because of the larger mesh, with more wire distance and more cross-points for data to travel across.
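
As a reference for how such latency curves are generated: the standard approach is a dependent pointer-chase over increasingly large buffers, where each load’s address comes from the previous load. A rough sketch of the technique follows; the buffer sizes and access counts are arbitrary, and a fully random chase like this one also defeats most prefetchers:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE 64                        /* cache line size in bytes */

    static void * volatile sink;           /* keeps the final pointer live */

    /* Build a random cyclic permutation of cache lines inside the buffer, then
       chase it: every load's address depends on the previous load, so the time
       per access approximates the load-to-use latency at that footprint. */
    static double chase_ns(size_t bytes, long accesses) {
        size_t lines = bytes / LINE;
        size_t *idx = malloc(lines * sizeof(size_t));
        char *buf = aligned_alloc(4096, lines * LINE);

        for (size_t i = 0; i < lines; i++) idx[i] = i;
        for (size_t i = lines - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
            size_t j = rand() % (i + 1);
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < lines; i++)                /* link lines into one cycle */
            *(void **)(buf + idx[i] * LINE) = buf + idx[(i + 1) % lines] * LINE;

        void *p = buf;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < accesses; i++)
            p = *(void **)p;                              /* dependent load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = p;

        free(idx); free(buf);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / accesses;
    }

    int main(void) {
        for (size_t kb = 16; kb <= 256 * 1024; kb *= 2)   /* 16KB .. 256MB footprints */
            printf("%8zu KB : %5.1f ns\n", kb, chase_ns(kb * 1024, 20000000L));
        return 0;
    }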

One thing we hadn’t covered in our initial review was the chip running regular 4K pages – the most surprising aspect here is not that things look a bit different due to the 4K pages themselves, but that the prefetchers now behave totally differently. In our first review we believed that Ampere had intentionally disabled the prefetchers due to the sheer core count of the system, but looking at the 4K page results here, they appear to be in line with the behaviour we saw on Amazon’s Graviton2. Notably, at 64KB pages the area/region prefetcher no longer pulls in whole pages on patterns with strong region locality, such as the “R per R page” pattern (random cache lines within a page, followed by random page traversal). Ampere confirmed that this was not an intentional configuration at 64KB pages, though we didn’t get an exact explanation for it. I theorise that it may be a microarchitectural aspect of the N1 cores, trying to avoid increased cache pressure at larger page sizes.
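
To make the pattern in question concrete, below is a rough sketch of how an “R per R page” traversal can be generated: pages are visited in random order, and all cache lines within each page are touched in random order before moving on. The page size and footprint are just example values:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define LINE     64
    #define PAGE     4096                  /* or 65536 when running 64KB pages */
    #define LINES_PP (PAGE / LINE)

    /* Fisher-Yates shuffle of an index array. */
    static void shuffle(size_t *a, size_t n) {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % (i + 1);
            size_t t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    /* "R per R page": visit pages in random order, and within each page touch all
       cache lines in random order before moving on. A region prefetcher that picks
       up on the strong page locality can pull the whole page in ahead of time; if
       it doesn't, every line is a fresh memory access. */
    static unsigned long r_per_r_page(const char *buf, size_t pages) {
        size_t *page_order = malloc(pages * sizeof(size_t));
        size_t line_order[LINES_PP];
        unsigned long sum = 0;

        for (size_t i = 0; i < pages; i++)    page_order[i] = i;
        for (size_t i = 0; i < LINES_PP; i++) line_order[i] = i;
        shuffle(page_order, pages);

        for (size_t p = 0; p < pages; p++) {
            shuffle(line_order, LINES_PP);
            for (size_t l = 0; l < LINES_PP; l++)
                sum += buf[page_order[p] * PAGE + line_order[l] * LINE];
        }
        free(page_order);
        return sum;
    }

    int main(void) {
        size_t pages = 64 * 1024;                      /* 256MB footprint at 4KB pages */
        char *buf = aligned_alloc(PAGE, pages * PAGE);
        memset(buf, 1, pages * PAGE);                  /* fault the pages in */
        printf("%lu\n", r_per_r_page(buf, pages));     /* print so the loads aren't elided */
        free(buf);
        return 0;
    }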

This weird behaviour also explains the discrepancy in scores between Graviton2 and Altra in SPEC’s 507.cactuBSSN_r, which turns out to be due to the prefetchers working or not depending on 64KB versus 4KB pages.


It’s still possible to run the chip in either monolithic, hemisphere, or quadrant mode, segmenting the memory accesses between the various memory controller channels on the chip, as well as the SLC. Unfortunately, at 128 cores and only 16MB of SLC, quadrant mode results in only 4MB of SLC per quadrant, which is quite minuscule for a desktop machine, much less a server system. Each core still has 1MB of L2; however, as we’ll see later in the tests, there are real-world implications of such tiny SLC sizes.
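
When hemisphere or quadrant mode is exposed to the OS as NUMA nodes (as such partitioning modes typically are), software can at least keep a thread and its working set within a single quadrant. A minimal libnuma sketch of that idea follows; the node numbering is illustrative and depends on the firmware configuration:

    #include <numa.h>                      /* libnuma; link with -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        /* e.g. four nodes per socket in quadrant mode, two in hemisphere mode */
        printf("visible NUMA nodes: %d\n", numa_max_node() + 1);

        /* Keep both the thread and its working set inside one quadrant, so traffic
           stays on that quadrant's SLC slice and memory controllers. */
        int node = 0;                                  /* illustrative node choice */
        numa_run_on_node(node);
        size_t bytes = 64UL * 1024 * 1024;
        char *buf = numa_alloc_onnode(bytes, node);
        if (!buf)
            return 1;
        memset(buf, 0xA5, bytes);                      /* fault the pages in on that node */

        /* ... run the workload on buf ... */

        numa_free(buf, bytes);
        return 0;
    }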

In terms of DRAM bandwidth, the Altra system on paper is equal to AMD’s EPYC Rome or Milan, or Intel’s newest Ice Lake-SP parts, as all of them run 8-channel DDR4-3200. Ampere’s advantage comes from the fact that it is able to detect streaming memory workloads and automatically transform them into non-temporal writes, avoiding the extra memory read caused by the RFO (read for ownership) operations that “normal” designs have to go through. Intel’s newest Ice Lake-SP design has a somewhat similar optimisation, though it works more on a cache-line basis and seemingly isn’t able to extract as much bandwidth efficiency as the Arm design. AMD currently lacks any such optimisation, and software has to make explicit use of non-temporal writes to fully extract the most out of the memory subsystem – which isn’t as convenient as the generic, workload-agnostic optimisation that Ampere and Intel currently employ.
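
To illustrate what that “explicit usage” means on the software side, here’s a small x86 sketch contrasting regular cached stores, which incur the RFO read, against explicit non-temporal stores that avoid it. It’s only meant to show the mechanism, not to be a tuned fill routine:

    #include <immintrin.h>                 /* x86 AVX intrinsics; compile with -mavx */
    #include <stdint.h>
    #include <stdlib.h>

    /* Regular cached stores: each destination line is first read into the cache
       (the RFO) before being overwritten, costing extra read bandwidth. */
    static void fill_regular(uint8_t *dst, size_t bytes) {
        __m256i zero = _mm256_setzero_si256();
        for (size_t i = 0; i + 32 <= bytes; i += 32)
            _mm256_store_si256((__m256i *)(dst + i), zero);
    }

    /* Explicit non-temporal stores: whole lines are written straight to memory,
       bypassing the cache and skipping the RFO read. This is the kind of explicit
       software optimisation that current AMD parts rely on. */
    static void fill_streaming(uint8_t *dst, size_t bytes) {
        __m256i zero = _mm256_setzero_si256();
        for (size_t i = 0; i + 32 <= bytes; i += 32)
            _mm256_stream_si256((__m256i *)(dst + i), zero);
        _mm_sfence();                      /* order the weakly-ordered NT stores */
    }

    int main(void) {
        size_t bytes = 256UL * 1024 * 1024;
        uint8_t *dst = aligned_alloc(32, bytes);   /* both loops need 32B alignment */
        fill_regular(dst, bytes);
        fill_streaming(dst, bytes);
        free(dst);
        return 0;
    }

On the Altra, plain store loops of the first kind already get the benefit, as the hardware performs the conversion to non-temporal writes on its own.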

Between the Q80-33 and M128-30, we’re seeing bandwidth curves that roughly match – up to a certain core count. The new M128-30 naturally goes further, to 128 cores, but the resulting aggregate bandwidth also goes further down due to resource contention on the SoC – something very important to keep in mind as we explore more detailed workload results on the next pages.
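
Bandwidth curves of this kind come from running a streaming kernel at increasing thread counts and recording the aggregate throughput. A rough OpenMP sketch of the approach follows, with the array size and kernel being arbitrary choices (and without the NUMA-aware initialisation a proper benchmark would use):

    #include <omp.h>                       /* compile with -fopenmp */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (256UL * 1024 * 1024 / sizeof(double))   /* 256MB per array */

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));

        /* Note: a careful benchmark would also parallelise this initialisation so
           that first-touch spreads the pages across all memory controllers. */
        for (size_t i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        int max_threads = omp_get_max_threads();
        for (int t = 1; t <= max_threads; t *= 2) {
            omp_set_num_threads(t);
            double start = omp_get_wtime();

            /* Scaled-copy kernel: one read stream and one write stream per element. */
            #pragma omp parallel for
            for (size_t i = 0; i < N; i++)
                a[i] = 3.0 * b[i];

            double secs = omp_get_wtime() - start;
            double gb   = 2.0 * N * sizeof(double) / 1e9;  /* bytes read + written */
            printf("%3d threads: %6.1f GB/s\n", t, gb / secs);
        }
        free(a); free(b);
        return 0;
    }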

 

At lower core-count load, we’re seeing the M128-30 bandwidth exceed that of the Q80-33 even though it runs at lower CPU frequencies; again, this is likely due to the mesh running 11% faster on the new design. AMD’s EPYC Milan still has access to the most per-core bandwidth in low-thread situations.

 
