128 Cores Mesh Setup & Memory Subsystem

Starting off the testing, one thing that is extremely intriguing about Ampere’s Altra designs is that they achieve more than 64 cores whilst still using Arm’s CMN-600 mesh network IP. In our coverage earlier this year of Arm’s newer upcoming CMN-700 mesh network, we wrote about the fundamental structure of the CMN mesh and its building blocks, such as RN-Fs, HN-Fs, and CALs.

In a typical deployment, a mesh consists of cross-points (XPs), to which RN-Fs (fully coherent request nodes) connect – either a CPU directly, or a CAL (component aggregation layer), which can house two CPUs.

Our initial confusion last year with the Quicksilver designs was that 80 cores is more than the CMN officially supports when configured with the maximum mesh size and two cores per CAL per XP. Ampere back then was generally coy about discussing the mesh setup, but in more recent discussions with Arm and Ampere, the companies divulged that it’s also possible to house the CPUs inside a DSU (DynamIQ Shared Unit), the same cluster design we find in mobile Arm SoCs with Cortex CPUs.

Ampere has since confirmed the mesh setup with regard to the CPUs: instead of attaching cores directly to the mesh via RN-Fs, or even via a CAL on each XP, they are employing two DSUs, each with two Neoverse N1 cores, connected to a CAL, which in turn connects to a single XP. That means each mesh cross-point houses four cores, vastly reducing the mesh size needed to reach such core counts – this holds for both the Quicksilver 80-core designs and the new Mystique 128-core design. The only difference with the Mystique design is that Ampere has simply increased the mesh size (we still don’t have official confirmation on the exact setup here).
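
As a quick sanity check of how few CPU-bearing cross-points this arrangement requires, here’s a back-of-the-envelope sketch using the figures described above – not official Ampere numbers, and note that the real mesh also needs XPs for the HN-F/SLC slices, memory controllers and I/O, so the total mesh is larger than this.

```c
/* Back-of-the-envelope sketch (assumed figures, not official Ampere data):
 * how many mesh cross-points are needed for the CPU clusters when each XP
 * hosts one CAL with two 2-core DSUs behind it. */
#include <stdio.h>

int main(void) {
    const int cores_per_dsu = 2;   /* Neoverse N1 cores per DSU */
    const int dsus_per_xp   = 2;   /* two DSUs behind one CAL per cross-point */
    const int cores_per_xp  = cores_per_dsu * dsus_per_xp;   /* = 4 */

    const int quicksilver_cores = 80;
    const int mystique_cores    = 128;

    printf("Quicksilver: %d cores -> %d CPU cross-points\n",
           quicksilver_cores, quicksilver_cores / cores_per_xp);  /* 20 XPs */
    printf("Mystique:    %d cores -> %d CPU cross-points\n",
           mystique_cores, mystique_cores / cores_per_xp);        /* 32 XPs */
    return 0;
}
```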

From a topology perspective, the Altra Max is still a massive monolithic 128-core chip, with competitive core-to-core latencies within the same socket. Our better understanding of the use of DSUs in the design now also explains the unusually low latency figures of 26ns that only occur between certain core pairs – these would presumably be the two sibling cores housed within a single DSU, where coherency and communication between the two doesn’t have to go out onto the mesh, which would incur higher latencies.
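
For readers who want to reproduce this kind of measurement, the sketch below shows the general ping-pong technique behind core-to-core latency plots: two threads pinned to chosen cores bounce a flag in a shared cache line, and the round-trip time is halved. This is a minimal illustration rather than our actual test tool, and the core IDs are placeholders – picking two DSU siblings versus two far-apart cores (or, on a 2S system, cores on different sockets) is what exposes the differences discussed here and below.

```c
/* Minimal core-to-core latency sketch: a shared atomic flag is bounced between
 * two pinned threads; round-trip time / 2 approximates one-way latency. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITER 1000000

static _Atomic int flag = 0;

static void pin(int cpu) {                       /* pin calling thread to a core */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ITER; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    int cpu_a = 0, cpu_b = 1;    /* placeholder IDs: try DSU siblings vs. distant cores */
    pthread_t t;
    pthread_create(&t, NULL, pong, &cpu_b);
    pin(cpu_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITER; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns one-way between core %d and core %d\n", ns / ITER / 2, cpu_a, cpu_b);
    return 0;
}
```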

We discussed Ampere’s quite high inter-socket latencies in our review of the Altra last year; as a reminder, this is because the design doesn’t have a single coherency protocol that spans from one socket’s mesh to the remote mesh of the other socket – inter-socket communication instead has to go through an intermediary cache-coherency protocol translation, CCIX in this case. This is particularly inefficient when two cores within a socket have to work on a cache line homed on the remote socket: communication between cores within a DSU is very efficient here, but between cores on the mesh it means a round-trip to the remote socket, resulting in pretty awful latencies.

 

The good news for the new Altra Max design is that Ampere was able to vastly improve the inter-socket communication overhead by optimising the CCIX stack. The result is that socket-to-socket core-to-core latencies have gone down from ~350ns to ~240ns, and the aforementioned case of two cores within a socket working on a remote socket’s cache line has improved from ~650ns to ~450ns – still pretty bad, but undoubtedly a large improvement.

Latencies within a socket are slightly up at the extremes, simply due to the larger mesh. Ampere has boosted the mesh frequency from 1800MHz to 2000MHz in this generation, so there is a slight boost there, along with the associated bandwidth increase.


Looking at the memory latencies of the new part, comparing the Q80-33 to the M128-30 results at a 64KB page size, the first thing that stands out is that the new Altra Max system now has only 16MB of SLC (system level cache), half of the 32MB of the Quicksilver design. This was one of the compromises the company decided to make when increasing the core count and mesh size in the Mystique design.

L3/SLC latencies are also slightly up, from 30ns to 33.6ns; some of that is down to the 10% lower CPU clock, but most of it is due to the larger mesh, with more wire distance and more cross-points for data to travel across.
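
The latency curves themselves come from the classic pointer-chase technique: a buffer is linked into a randomly permuted chain of cache lines and traversed with dependent loads, so the average load-to-use time at a given footprint reflects whichever level (L2, SLC, or DRAM) it lands in. A minimal sketch follows – not our actual test suite, with the buffer size simply chosen to spill well past the 16MB SLC.

```c
/* Pointer-chase latency sketch: build a random chain of cache-line-sized nodes,
 * then walk it with serially dependent loads and report the average latency. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64                                  /* cache-line size in bytes */

struct node { struct node *next; char pad[LINE - sizeof(struct node *)]; };

int main(void) {
    size_t bytes = 64u * 1024 * 1024;            /* 64MB: well past the 16MB SLC */
    size_t count = bytes / sizeof(struct node);
    struct node *buf = malloc(bytes);
    size_t *perm = malloc(count * sizeof(size_t));

    for (size_t i = 0; i < count; i++) perm[i] = i;
    srand(1);
    for (size_t i = count - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < count; i++)           /* link the lines into a random chain */
        buf[perm[i]].next = &buf[perm[(i + 1) % count]];

    struct node *p = &buf[perm[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < count; i++)           /* serially dependent loads */
        p = p->next;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg load-to-use latency: %.1f ns (last node %p)\n", ns / count, (void *)p);
    free(buf); free(perm);
    return 0;
}
```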

One thing that we hadn’t covered in our initial review was the chip running regular 4K pages – the most surprising aspect here is not that things look a bit different due to the 4K pages themselves, but rather that the prefetchers now behave totally differently. In our first review we believed that Ampere had intentionally disabled the prefetchers due to the sheer core count of the system, but the 4K page results here appear to be in line with the behaviour we saw on Amazon’s Graviton2. Notably, the area/region prefetcher no longer pulls in whole pages in patterns with strong region locality, such as the “R per R page” pattern (random cache lines within a page, followed by traversal to random pages). Ampere confirmed that this was not an intentional configuration at 64KB pages, though we didn’t get an exact explanation for it. I theorise it may be a microarchitectural aspect of the N1 cores trying to avoid increased cache pressure at larger page sizes.
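
To make the pattern concrete, the following sketch generates an “R per R page” access stream as described above – random cache lines within one page, then a jump to another randomly chosen page. The page and line sizes are illustrative placeholders; the article looks at both 4KB and 64KB pages.

```c
/* Sketch of the "R per R page" pattern: touch the cache lines of one page in
 * random order, then move on to another randomly chosen page. */
#include <stdint.h>
#include <stdlib.h>

#define LINE 64                                      /* cache-line size in bytes */

static void shuffle(size_t *idx, size_t n) {         /* Fisher-Yates shuffle */
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}

uint64_t r_per_r_page(volatile uint8_t *buf, size_t bytes, size_t page) {
    size_t pages = bytes / page, lines = page / LINE;
    size_t *page_order = malloc(pages * sizeof(size_t));
    size_t *line_order = malloc(lines * sizeof(size_t));
    for (size_t i = 0; i < pages; i++) page_order[i] = i;
    for (size_t i = 0; i < lines; i++) line_order[i] = i;
    shuffle(page_order, pages);                      /* random page traversal order */

    uint64_t sum = 0;
    for (size_t p = 0; p < pages; p++) {
        shuffle(line_order, lines);                  /* random lines within this page */
        for (size_t l = 0; l < lines; l++)
            sum += buf[page_order[p] * page + line_order[l] * LINE];
    }
    free(page_order); free(line_order);
    return sum;                                      /* returned so reads aren't elided */
}
```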

This odd behaviour also explains the discrepancy in scores between the Graviton2 and the Altra in SPEC’s 507.cactuBSSN_r, which comes down to whether the prefetchers are active at 64KB versus 4KB pages.


It’s still possible to run the chip in monolithic, hemisphere, or quadrant modes, segmenting memory accesses between the various memory controller channels on the chip, as well as the SLC. Unfortunately, at 128 cores and only 16MB of SLC, quadrant mode results in just 4MB of SLC per quadrant, which is quite minuscule for a desktop machine, much less a server system. Each core still has 1MB of L2; however, as we’ll see later in the tests, there are real-world implications to such tiny SLC sizes.
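
Assuming the hemisphere and quadrant partitions are exposed to Linux as NUMA nodes, as such SLC/memory-controller partitions typically are – an assumption on my part, not a statement from Ampere – a quick libnuma query shows how the memory is carved up per node:

```c
/* Hedged sketch: enumerate the NUMA nodes the OS sees and their memory sizes.
 * Assumes the chip's hemisphere/quadrant partitions appear as NUMA nodes.
 * Build with: gcc numa_nodes.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();
    printf("%d NUMA node(s) configured\n", nodes);
    for (int n = 0; n < nodes; n++) {
        long long free_b;
        long long size_b = numa_node_size64(n, &free_b);   /* total and free bytes */
        printf("node %d: %lld MB total, %lld MB free\n",
               n, size_b >> 20, free_b >> 20);
    }
    return 0;
}
```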

In terms of DRAM bandwidth, the Altra system is on paper equal to AMD’s EPYC Rome or Milan, or Intel’s newest Ice Lake-SP parts, as all of them run 8-channel DDR4-3200. Ampere’s advantage comes from the fact that it is able to detect streaming memory workloads and automatically transform them into non-temporal writes, avoiding the extra memory read caused by the RFO (read for ownership) operations that “normal” designs have to go through. Intel’s newest Ice Lake-SP design has a somewhat similar optimisation, though it works more on a cache-line basis and seemingly isn’t able to extract as much bandwidth efficiency as the Arm design. AMD currently lacks any such optimisation, and software has to explicitly use non-temporal writes to fully extract the most out of the memory subsystem – which isn’t as convenient as the generic, workload-agnostic optimisations that Ampere and Intel currently employ.
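
For reference, this is roughly what “explicit usage of non-temporal writes” looks like on the x86 side – a generic sketch rather than code from any of our benchmarks:

```c
/* Illustration (x86, SSE2) of explicit non-temporal stores: the stores bypass
 * the cache hierarchy and avoid the read-for-ownership of the destination
 * lines, which is what the automatic streaming detection achieves in hardware. */
#include <emmintrin.h>   /* _mm_set1_epi64x, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fill a 16-byte-aligned buffer without first pulling it into the cache. */
void fill_streaming(void *dst, uint64_t value, size_t bytes) {
    __m128i v = _mm_set1_epi64x((long long)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
        _mm_stream_si128(&p[i], v);   /* non-temporal store: no RFO, no cache fill */
    _mm_sfence();                     /* order the weakly-ordered streaming stores */
}
```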

Between the Q80-33 and M128-30, we’re seeing bandwidth curves that roughly match – up to a certain core count. The new M128-30 naturally scales further, to 128 cores, but the resulting aggregate bandwidth actually goes back down due to resource contention on the SoC – something very important to keep in mind as we explore more detailed workload results on the next pages.

 

At lower core-count loads, we’re seeing the M128-30’s bandwidth exceed that of the Q80-33 even though it runs at lower CPU frequencies; again, this is likely because the mesh is running 11% faster on the new design. AMD’s EPYC Milan still has access to the most per-core bandwidth in low thread-count situations.

 


60 Comments


  • Jurgen B - Thursday, October 7, 2021 - link

    Love your thorough article and testing. This is some serious firing power from the Ampere and makes some great competition for Intel and AMD. I really like the 256T runs on the AMD dual-socket EPYCs (they really are serving me well in floating point research computing), but it seems that the future holds some nice innovations in the field!
  • mode_13h - Thursday, October 7, 2021 - link

    Lack of cache seems to be a serious liability, though. For many, it'll be a deal breaker.
  • Wilco1 - Friday, October 8, 2021 - link

    Yet it still beats AMD's 7763 with its humongous 256MB L3 in all the multithreaded benchmarks. Sure, it would be even faster if it had a 64MB L3 cache, however it doesn't appear to be a serious liability. Doing more with far less silicon at a lower price (and power) is an interesting design point (and apparently one that cloud companies asked for).
  • Jurgen B - Friday, October 8, 2021 - link

    Yes, Cache will play a role for many. However, people buying such servers likely have a very specific workload in mind. And thus they now have more choices which of the manufacturer options they prefer, and these choices are really good to see. Compared to 10 years ago, when AMD was much less competitive, it is wonderful to see the innovation.
  • schujj07 - Friday, October 8, 2021 - link

    That isn't true at all. The SPEC java benchmarks have the Epyc ahead, SPECint Base Rate-N Estimated they are almost equal (despite having half the cores), FP Base Rate-N Estimated the Epyc is ahead, compiling the Epyc is ahead. Anything that taxes the memory subsystem by not fitting into the small cache of the Altra sees lower performance on the Altra. Per core performance isn't even close.
  • mode_13h - Saturday, October 9, 2021 - link

    Thanks for correcting the record, @schujj07.

    The whole concept of adding 60% more cores while halving cache is mighty suspicious. In the most charitable view, this is intended to micro-target specific applications with low memory bandwidth requirements. From a more cynical perspective, it's merely an exercise in specsmanship and maybe trying to gin up a few specific benchmark numbers.
  • Wilco1 - Saturday, October 9, 2021 - link

    If you're that cynical one could equally claim that adding *more* cache is mighty suspicious and gaming benchmark numbers. Obviously nobody would spend a few hundred million on a chip just to game benchmarks. The fact is there is a market for chips with lots of cores. Half the SPEC subtests show huge gains from 60% extra cores despite the lower frequency and halved L3. So clearly there are lots of applications that benefit from more cores and don't need a huge L3.
  • Wilco1 - Saturday, October 9, 2021 - link

    The Altra Max wins the more useful critical-jOPS benchmark by over 30%. It also wins the LLVM compile test and SPECINT_rate by a few percent. The 7763 only wins SPECFP by 18% (not Altra's market) and max-jOPS by 13%.

    So yes my point is spot on, the small cache does not look at all like a serious liability. Per-core performance isn't interesting when comparing a huge SMT core with a tiny non-SMT core - you can simply double the number of cores to make up for SMT and still use half the area...
  • mode_13h - Saturday, October 9, 2021 - link

    > Per-core performance isn't interesting when comparing ...

    Trying to change the subject? We didn't mention that. We were talking only about cache.

    > The Altra Max wins the more useful critical-jOPS benchmark by over 30%.

    That's really about QoS, which is a different story. Surely, relevant for some. I wonder if x86 CPUs would do better on that front with SMT disabled.

    > the small cache does not look at all like a serious liability.

    Of course it's a liability! It's just a very workload-dependent one. You need only note the cases where Max significantly underperforms, relative to its 80-core sibling, to see where the cache reduction is likely an issue.

    The reason why there are so many different benchmarks is that you can't just seize on the aggregate numbers to tell the whole story.
  • mode_13h - Saturday, October 9, 2021 - link

    Apologies, I now see where schujj07 mentioned per-core performance. I even searched for "per-core" but not "per core".
