128-Core Mesh Setup & Memory Subsystem

Starting off the testing, one thing that is extremely intriguing about Ampere’s Altra designs is that they achieve more than 64 cores whilst still using Arm’s CMN-600 mesh network IP. In our coverage earlier this year of Arm’s upcoming CMN-700 mesh network, we wrote about the fundamental structure of the CMN mesh and its building blocks, such as RN-Fs, HN-Fs, and CALs.

In a typical deployment, a mesh consists of cross-points (XPs) whose RN-Fs (fully coherent request nodes) connect either directly to a CPU, or to a CAL (Component Aggregation Layer), which can house two CPUs.

Our initial confusion last year with the Quicksilver designs was that 80 cores was more than what the CMN would actually support when configured with the maximum mesh size and two cores per CAL per XP – at least officially. Ampere back then was generally coy about discussing the mesh setup, but in more recent discussions with Arm and Ampere, the companies have divulged that it’s also possible to house the CPUs inside a DSU (DynamIQ Shared Unit), the same cluster design that we find in mobile Arm SoCs with Cortex CPUs.

Ampere has since confirmed the mesh setup in regard to the CPUs: instead of attaching cores directly to the mesh via RN-Fs, or even via a CAL on each XP, they are employing two DSUs, each with two Neoverse-N1 cores, connected to a CAL, which in turn connects to a single XP. That means each mesh cross-point houses four cores, vastly reducing the mesh size needed to get to such core numbers – this is valid for both the Quicksilver 80-core designs and the new Mystique 128-core designs. The only difference with the Mystique design is that Ampere has simply increased the mesh size (we still don’t have official confirmation on the exact setup here).
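To put that hierarchy into numbers, here’s a small back-of-the-envelope sketch in C of how the CAL/DSU arrangement keeps the cross-point count manageable. The per-XP composition follows Ampere’s description above; the 8x8 mesh comparison at the end is our own assumption, not a figure confirmed by Ampere.

/* Back-of-the-envelope sketch of the Altra core hierarchy as described
 * above: each CPU cross-point hosts a CAL, which aggregates two DSUs of
 * two N1 cores each. The 8x8 mesh comparison at the end is an assumption,
 * not a figure confirmed by Ampere. */
#include <stdio.h>

int main(void) {
    const int cores_per_dsu = 2;                             /* two Neoverse-N1 cores per DSU */
    const int dsus_per_cal  = 2;                             /* two DSUs behind one CAL       */
    const int cores_per_xp  = cores_per_dsu * dsus_per_cal;  /* = 4 cores per cross-point     */

    int targets[] = { 80, 128 };                             /* Quicksilver, Mystique         */
    for (int i = 0; i < 2; i++) {
        int cpu_xps = targets[i] / cores_per_xp;
        printf("%3d cores -> %2d CPU cross-points (plus XPs for HN-Fs, memory and I/O)\n",
               targets[i], cpu_xps);
    }
    /* 128 / 4 = 32 CPU XPs: even an assumed 8x8 (64 XP) mesh would leave
     * half of its cross-points free for home nodes and other agents. */
    return 0;
}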

From a topology perspective, the Altra Max is still a massive monolithic 128-core chip, with competitive core-to-core latencies within the same socket. Our better understanding of the use of DSUs in the design now also explains the unique low-latency figures of 26ns which only appear between certain core pairs – these would presumably be two sibling cores housed within a single DSU, where coherency and communication between the two doesn’t have to go out into the mesh, which incurs higher latencies.
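For reference, the core-to-core figures come from a cache-line “ping-pong” style of measurement. The snippet below is a minimal sketch of that idea rather than our actual benchmark: it bounces a single cache line between two pinned threads and reports the round-trip time. The core IDs are placeholders – pinning the pair to two DSU siblings, to two cores on different cross-points, or to cores on different sockets is what separates the ~26ns, regular in-mesh, and inter-socket cases.

/* Minimal core-to-core "ping-pong" latency sketch (Linux/GCC assumed).
 * Not the benchmark used for the article's numbers - just the idea. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static _Alignas(64) atomic_int flag = 0;   /* one shared cache line */

static void pin(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *responder(void *arg) {
    pin(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;   /* wait for ping */
        atomic_store_explicit(&flag, 2, memory_order_release);             /* send pong     */
    }
    return NULL;
}

int main(void) {
    int cpu_a = 0, cpu_b = 1;   /* placeholder core IDs: DSU siblings, two XPs, or two sockets */
    pthread_t t;
    pthread_create(&t, NULL, responder, &cpu_b);
    pin(cpu_a);

    struct timespec ts, te;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);             /* ping          */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2) ;   /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &te);
    pthread_join(t, NULL);

    double ns = (te.tv_sec - ts.tv_sec) * 1e9 + (te.tv_nsec - ts.tv_nsec);
    printf("average round-trip: %.1f ns\n", ns / ITERS);
    return 0;
}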

We had discussed Ampere’s quite high inter-socket latencies in our review of the Altra last year. As a fresh reminder, this is because the design doesn’t have a single coherency protocol that spans from the local mesh network to the remote mesh of the other socket – instead it has to go through an intermediary cache-coherency protocol translation for inter-socket communication, CCIX in this case. In particular, this wasn’t very efficient when two cores within a socket have to work on a remote socket’s cache line – communication between two cores sharing a DSU is very efficient, but between cores across the mesh it means doing a round-trip to the remote socket, resulting in pretty awful latencies.


The good news for the new Altra Max design is that Ampere was able to vastly improve the inter-socket communication overhead by optimising the CCIX stack. The result is that socket-to-socket core-to-core latencies have gone down from ~350ns to ~240ns, and the aforementioned within-socket access to a remote cache line from ~650ns to ~450ns – still pretty bad, but undoubtedly a large improvement.

Latencies within a socket are up slightly at the extremes, simply due to the larger mesh. Ampere has boosted the mesh frequency from 1800MHz to 2000MHz in this generation, so there is a slight latency benefit there, along with the associated bandwidth increase.


Looking at the memory latencies of the new part, comparing the Q80-33 to the M128-30 results at 64KB page size, the first thing that stands out is that the new Altra Max system now only has 16MB of SLC (system level cache), half of the 32MB of the Quicksilver design. This is one of the compromises the company decided to make when increasing the core count and mesh in the Mystique design.

L3/SLC latencies are also slightly up, from 30ns to 33.6ns; some of that is down to the 10% slower CPU clock, but most of it is because of the larger mesh, with more wire distance and more cross-points for data to travel across.
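For context on how these latency curves are generated, the test sweeps a pointer-chasing pattern across increasing buffer sizes, so that each data point reflects one level of the cache/memory hierarchy. The snippet below is a simplified sketch of that technique (full access-pattern and page-size control are omitted), not the tool behind the article’s graphs; the buffer sizes and iteration counts are illustrative.

/* Simplified pointer-chase latency sketch: walk a randomly shuffled chain
 * of cache lines and time the dependent loads. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64

static double chase_ns(size_t bytes, size_t iters) {
    size_t lines = bytes / LINE;
    size_t *order = malloc(lines * sizeof *order);
    char *buf = aligned_alloc(LINE, lines * LINE);

    /* Build one big random cycle through all cache lines of the buffer. */
    for (size_t i = 0; i < lines; i++) order[i] = i;
    for (size_t i = lines - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < lines; i++)
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % lines] * LINE;

    void *p = buf + order[0] * LINE;
    struct timespec ts, te;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    for (size_t i = 0; i < iters; i++)
        p = *(void **)p;                       /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &te);

    double ns = (te.tv_sec - ts.tv_sec) * 1e9 + (te.tv_nsec - ts.tv_nsec);
    free(order);
    free(buf);
    return p != NULL ? ns / iters : 0.0;       /* use p so the chain isn't optimised out */
}

int main(void) {
    for (size_t kb = 32; kb <= 256 * 1024; kb *= 2)   /* 32KB up to 256MB */
        printf("%8zu KB : %5.1f ns per load\n", kb, chase_ns(kb * 1024, 20000000));
    return 0;
}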

One thing that we hadn’t covered in our initial review was the chip running regular 4K pages – the most surprising aspect here is not that things look a bit different due to the 4K pages themselves, but rather that the prefetchers now behave totally differently. In our first review we believed that Ampere had intentionally disabled the prefetchers due to the sheer core count of the system, but looking at the 4K page results here they appear to be in line with the behaviour we saw on Amazon’s Graviton2. Notably, the area/region prefetcher no longer pulls in whole pages in patterns which have strong region locality, such as the “R per R page” pattern (random cache lines within a page, followed by random page traversal). Ampere confirmed that this was not an intentional configuration at 64KB pages, though we didn’t get an exact explanation for it. I theorise it may be a microarchitectural aspect of the N1 cores trying to avoid increased cache pressure at larger page sizes.

This odd behaviour also explains the discrepancy in scores between the Graviton2 and the Altra in SPEC’s 507.cactuBSSN_r, which comes down to whether the prefetchers are active at 64KB versus 4KB pages.
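To make the “R per R page” pattern more concrete, the sketch below generates that access order – cache lines visited randomly within each page, with the pages themselves traversed in random order. It’s exactly the kind of pattern where a region prefetcher that pulls in whole pages either helps enormously or does nothing at all, which is what splits the 64KB and 4KB results. The sizes here are illustrative, and this is not the article’s actual test code.

/* Sketch of the "R per R page" pattern: pages in random order, and random
 * cache lines within each page. Illustrative, with assumed 4KB pages. */
#include <stdlib.h>

#define PAGE 4096
#define LINE 64
#define LINES_PER_PAGE (PAGE / LINE)

static void shuffle(size_t *a, size_t n) {
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

static unsigned long r_per_r_page(const volatile char *buf, size_t npages) {
    size_t *pages = malloc(npages * sizeof *pages);
    size_t lines[LINES_PER_PAGE];
    unsigned long sum = 0;

    for (size_t i = 0; i < npages; i++) pages[i] = i;
    shuffle(pages, npages);                        /* random page traversal      */

    for (size_t p = 0; p < npages; p++) {
        for (size_t l = 0; l < LINES_PER_PAGE; l++) lines[l] = l;
        shuffle(lines, LINES_PER_PAGE);            /* random lines within a page */
        for (size_t l = 0; l < LINES_PER_PAGE; l++)
            sum += (unsigned char)buf[pages[p] * PAGE + lines[l] * LINE];
    }
    free(pages);
    return sum;
}

int main(void) {
    size_t npages = 16384;                         /* 64MB worth of 4KB pages */
    char *buf = calloc(npages, PAGE);
    return (int)(r_per_r_page(buf, npages) & 1);
}

With 64KB pages the same pattern covers sixteen times as much data per page, so a region prefetcher reacting to it would pull in sixteen times as many lines – which lines up with the cache-pressure theory above.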


It’s still possible to run the chip in monolithic, hemisphere, or quadrant mode, segmenting the memory accesses between the various memory controller channels on the chip, as well as the SLC. Unfortunately, at 128 cores and only 16MB of SLC, quadrant mode results in only 4MB of SLC per quadrant, which is quite minuscule for a desktop machine, much less a server system. Each core still has 1MB of L2; however, as we’ll see later in the tests, there are real-world implications of such tiny SLC sizes.
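The mode the chip is running in shows up to the operating system through the NUMA topology – to our understanding, hemisphere and quadrant modes expose two and four nodes per socket respectively, much like Intel’s SNC or AMD’s NPS settings. A quick libnuma check like the sketch below (a generic Linux query, not anything Ampere-specific) is enough to see what a given system is actually exposing.

/* Generic Linux sanity check of the exposed NUMA topology via libnuma
 * (link with -lnuma). Useful for spotting monolithic/hemisphere/quadrant
 * style partitioning; the node-count expectations are our assumption. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("libnuma: NUMA not available on this system\n");
        return 1;
    }
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    for (int n = 0; n <= numa_max_node(); n++) {
        long long freep = 0;
        long long size = numa_node_size64(n, &freep);
        printf("  node %d: %lld MB total, %lld MB free\n",
               n, size >> 20, freep >> 20);
    }
    /* A 2-socket Altra Max in quadrant mode would presumably show 8 nodes,
     * each fronted by a 4MB slice of SLC - an inference from how SNC/NPS
     * style modes normally appear, not a confirmed detail. */
    return 0;
}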

In terms of DRAM bandwidth, the Altra system on paper is equal to AMD’s EPYC Rome or Milan and Intel’s newest Ice Lake-SP parts, with all of them running 8-channel DDR4-3200. Ampere’s advantage comes from the fact that it is able to detect streaming memory workloads and automatically transform them into non-temporal writes, avoiding the extra memory read caused by RFO (read for ownership) operations that “normal” designs have to go through. Intel’s newest Ice Lake-SP design has a somewhat similar optimisation, though it works more on a cache-line basis and seemingly isn’t able to extract as much bandwidth efficiency as the Arm design. AMD currently lacks any such optimisation, and software has to make explicit use of non-temporal writes to fully extract the most out of the memory subsystem – which isn’t as optimal as the generic, workload-agnostic optimisations that Ampere and Intel currently employ.
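On paper that works out to 8 channels × 3200 MT/s × 8 bytes ≈ 204.8 GB/s of theoretical bandwidth per socket for all three platforms. To illustrate the RFO point, the sketch below shows what “explicit usage of non-temporal writes” looks like on the x86 side – a plain store loop causes each destination line to be read into the cache before being overwritten, while streaming stores write lines out directly. This is a generic illustration (AVX2, assumed 32-byte-aligned buffer), not code from our bandwidth test.

/* Regular stores vs explicit non-temporal stores on x86 (compile with
 * -mavx2). Generic illustration of avoiding read-for-ownership traffic
 * on streaming writes; not the article's bandwidth benchmark. */
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain fill: each destination cache line is read into the cache (RFO)
 * before being overwritten, costing read + write DRAM traffic. */
static void fill_regular(uint8_t *dst, size_t bytes) {
    for (size_t i = 0; i < bytes; i++)
        dst[i] = 0;
}

/* Streaming fill: non-temporal stores bypass the cache and skip the RFO
 * read, so only the write traffic hits DRAM. Assumes 32-byte alignment. */
static void fill_streaming(uint8_t *dst, size_t bytes) {
    __m256i zero = _mm256_setzero_si256();
    for (size_t i = 0; i + 32 <= bytes; i += 32)
        _mm256_stream_si256((__m256i *)(dst + i), zero);
    _mm_sfence();   /* order the weakly-ordered NT stores before continuing */
}

int main(void) {
    size_t bytes = (size_t)1 << 28;             /* 256MB */
    uint8_t *dst = aligned_alloc(32, bytes);
    fill_regular(dst, bytes);
    fill_streaming(dst, bytes);
    free(dst);
    return 0;
}

Ampere’s SoC applies this kind of transformation automatically in hardware when it detects a streaming pattern, which is why its measured bandwidth efficiency is higher without any software changes.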

Between the Q80-33 and M128-30, we’re seeing bandwidth curves that roughly match – up to a certain core count. The new M128-30 naturally scales further, to 128 cores, but beyond a point the aggregate bandwidth actually goes back down due to resource contention on the SoC – something very important to keep in mind as we explore more detailed workload results on the next pages.


At lower core-count load, we’re seeing the M128-30’s bandwidth exceed that of the Q80-33 even though it runs at lower CPU frequencies; again, this is likely because the mesh is running 11% faster on the new design. AMD’s EPYC Milan still has access to the most per-core bandwidth in low thread-count situations.
