As we wrap up 2020, one last large review item for the year is Ampere’s long-promised new Altra Arm server processor. This has indeed been the year of the Arm server breakthrough: Arm’s new Neoverse-N1 CPU core is the IP designer’s first truly dedicated server core, promising focused performance and efficiency for the datacentre.

Earlier in the year we had the chance to test the first Neoverse-N1 silicon in the form of Amazon’s Graviton2, inside the AWS EC2 cloud compute offering. The Graviton2 seemed like a very impressive design, but it was rather conservative in its goals, and it’s also a piece of hardware that the general public cannot access outside of Amazon’s own cloud services.

Ampere Computing was founded in 2017 by former Intel president Renée James and built upon the initial IP and design talent behind AppliedMicro’s X-Gene CPUs. With Arm Holdings becoming an investor in 2019, it is at this moment in time the sole “true” merchant silicon vendor designing and offering Neoverse-N1 server designs.

To date, the company has shipped a few products in the form of the eMAG chips, but with rather disappointing performance figures – understandable, given that those were essentially legacy products based on the old X-Gene microarchitecture.

Ampere’s new Altra product line, on the other hand, is the culmination of several years of work and close collaboration with Arm – and the company’s first “true” product that can be viewed as being of Ampere’s own pedigree.

Today, with hardware in hand, we’re finally taking a look at the very first publicly available high-performance Neoverse based Arm server hardware, designed for nothing less than maximum achievable performance, aiming to battle the best designs from Intel and AMD.

Mount Jade Server with Altra Quicksilver

Ampere has supplied us with the company’s server reference design, dubbed “Mount Jade”, a 2-socket 2U rack server. The server came supplied with two Altra Q80-33 processors, Ampere’s top-of-the-line SKU, each featuring 80 cores running at up to 3.3GHz with a TDP of up to 250W per socket.

The server was designed in close collaboration with Wiwynn for this dual-socket variant, and with GIGABYTE for the single-socket variant, as previously hinted at by the two companies’ announcements of leading hyperscale deployments of the Altra platform. The Ampere-branded Mount Jade DVT reference motherboard comes in a typical server blue colour scheme and features two sockets with up to 16 DIMM slots per socket, reaching up to 4TB of DRAM capacity per socket, although our review unit came equipped with 256GB per socket across 8 DIMMs to fully populate the chip’s 8-channel memory controllers.

This is also our first look at Ampere’s first-generation socket design. The company doesn’t market the socket under any particular name, but it’s a massive LGA4926 socket with a pin count in excess of any other commercial server socket from AMD or Intel. The holding mechanism is somewhat similar to that of AMD’s SP3 platform, tensioned by a 5-point screw system.

The chip itself is absolutely humongous and, amongst currently publicly available processors, the biggest in the industry, out-sizing AMD’s SP3 form-factor packaging at around 77 x 66.8mm – about the same length but considerably wider than AMD’s counterparts.

Although it’s a massive chip with a huge IHS, the Mount Jade server surprised me with its cooling solution, as the included 250W-class cooler only makes contact with about a quarter of the heat spreader’s surface area.

Ampere doesn’t have a recessed “lip” around the IHS for the mounting bracket to hold onto the chip, as on AMD or Intel systems, so the IHS surface sits recessed relative to the bracket, meaning a flat-surface cooler design can’t make contact across the whole of the chip.

Instead, the included 250W cooler uses a huge vapour chamber with a “pedestal” that makes contact with the chip. Ampere explains that they experimented with different designs and found that a smaller-area pedestal actually worked better for heat dissipation – siphoning heat off the actual chip die, which is notably smaller than the IHS and chip package.

The cooler design is quite complex: vertical fin stacks dissipate heat directly off the vapour chamber, while additional large horizontal fins dissipate heat from six U-shaped heat pipes that also draw from the vapour chamber. It’s definitely a more complex and higher-end design than what we’re used to seeing in server coolers.

Although the Mount Jade server is definitely a very interesting piece of hardware, our focus today lies on the new Altra processors themselves, so let’s dive into the new Q80-33 80-core chip next.

1st Generation Neoverse-N1 80-Core Server SoC

Comments

  • Brane2 - Saturday, December 19, 2020

    Meh. Nothing special. It has been benchmarked on Phoronix and it performed more or less on par with Rome. 80 newest ARM cores against 64 mature x86 cores within a constrained power envelope.
    Naples is just about to come out and I suspect some time after that AMD will have something like really wide new RISC-V cores.
  • Wilco1 - Saturday, December 19, 2020

    It won most benchmarks on Phoronix while using significantly less power. Yes Milan is about to be released, and it will have to compete with the 128-core Altra Max. Which do you believe is going to win - 64 SMT cores or 128 real cores?
  • mode_13h - Sunday, December 20, 2020

    It actually won less than half of the benchmarks on Phoronix, since a number of those graphs just re-state the results in score/W. There are also questions over some of the compiler options used on those benchmarks, since many of the tests are compiled with options that won't enable AVX on benchmarks where it should be beneficial (yet, not having SVE, the N1 cores are at no such disadvantage).
  • Wilco1 - Monday, December 21, 2020

    "should be beneficial" -> "might help in a few limited cases". AVX/AVX512 isn't that useful for general C/C++ code. You typically only see large gains when people optimize using intrinsics.
  • mode_13h - Monday, December 21, 2020

    Intrinsics don't compile if they're for a CPU arch beyond what the compiler is being instructed to target. So, even packages where people take the time to optimize with intrinsics need to guard them with compile-time checks to ensure the CPU target is capable of executing those instructions.
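
    As a minimal sketch of the kind of compile-time guard being described (the add_arrays function and the flag choices are illustrative, not code from TNN or any particular package):

```cpp
// The AVX2 path only builds when the compiler is actually targeting AVX2
// (e.g. -mavx2 or -march=haswell and newer); otherwise the portable fallback
// is compiled instead, so the same file also builds for baseline x86-64 or AArch64.
#include <cstddef>

#if defined(__AVX2__)
#include <immintrin.h>
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_loadu_ps(a + i),
                                                _mm256_loadu_ps(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail
}
#else
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];   // portable fallback
}
#endif
```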

    Compilers do generate vectorized code. I don't know how well GCC is doing on that front, lately, but the TNN tests should be a good way to see that. Too bad those tests don't use -march=native.
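
    For reference, a trivial loop like the illustrative one below is something GCC can auto-vectorize at -O3; the vector width it may use (SSE2 on generic x86-64, AVX2 with -march=native on a Haswell-or-newer host, NEON on AArch64) follows from the target the compiler is told to assume:

```cpp
// Illustrative auto-vectorization candidate: with g++ -O3 this loop is
// vectorized automatically; the instructions emitted (SSE2, AVX2, NEON, ...)
// depend on the -march / default target rather than on the source code.
void scale_add(float* out, const float* in, float k, int n) {
    for (int i = 0; i < n; ++i)
        out[i] += k * in[i];
}
```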

    What's interesting about TNN is that I'm looking at the exact source revision Phoronix is using, and it seems they've completely dropped their backend for x86. The source/tnn/device/x86/ directory is simply missing. So, I wonder if they decided the compiler was good enough that they didn't need to bother with their own hand-optimized code for it, or if they just decided they don't care how fast their stuff runs on it.

    See:
    * https://openbenchmarking.org/innhold/83a730ed41d4e...
    * https://github.com/Tencent/TNN/tree/v0.2.3
  • Wilco1 - Monday, December 21, 2020

    TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.

    Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU. The makefile selects the right ISA variant for any files using intrinsics. But none of this is used in the TNN test.
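
    For what it's worth, one common way to express that runtime-check pattern on x86 with GCC or Clang looks roughly like the sketch below; the function names are illustrative, and real projects often place the ISA-specific variants in separate files built with the matching -m flags, as described above:

```cpp
// Runtime dispatch sketch: the target attribute lets the AVX2 variant use AVX2
// intrinsics even though the file as a whole is compiled for baseline x86-64,
// and __builtin_cpu_supports() (a GCC/Clang builtin backed by CPUID) picks the
// fastest variant the host CPU can actually execute.
#include <immintrin.h>
#include <cstddef>

__attribute__((target("avx2")))
static void add_arrays_avx2(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_loadu_ps(a + i),
                                                _mm256_loadu_ps(b + i)));
    for (; i < n; ++i) out[i] = a[i] + b[i];
}

static void add_arrays_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    if (__builtin_cpu_supports("avx2"))
        add_arrays_avx2(a, b, out, n);
    else
        add_arrays_scalar(a, b, out, n);
}
```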
  • mode_13h - Monday, December 21, 2020

    > TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.

    At this point, I probably will.

    > Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU.

    That's a whole additional level of effort for the developers. For them to bother compiling and conditionally calling different versions only makes sense if they think their main userbase aren't going to bother recompiling specifically for their hardware. In the case of specialized packages, it's reasonable to expect your users to take a little trouble for the best performance. It's really things like very low-level libs or multimedia code where you tend to see the sort of elaborate runtime detection and dynamic codepath selection that you're describing.
  • mode_13h - Monday, December 21, 2020

    I think Basis Universal and High Performance Conjugate Gradient are some other cases where the wider SIMD of Zen2 and Skylake-SP should confer significant benefit.
  • Wilco1 - Monday, December 21, 2020

    "should give significant benefit" -> "might give some benefit". I suggest you try out. Autovectorization is not nearly as good as you seem to believe, and the overall speedup is often disappointing even if some loops are 10-20x faster.
  • vinayshivakumar - Saturday, December 19, 2020

    I am a bit puzzled why none of these processors support SMT... Can someone shed light on why this is the case?
