SarahKerrigan - Friday, May 22, 2020 - link
The X-Gene microarchitecture was never particularly stellar and by the time eMag rolled around it was woefully obsolete. I did some testing on eMag a few months back and it was pretty dire. When I spent some time on Graviton2 last week, it was like night and day compared to eMag (frequently 2+ times the single-thread perf despite a much lower clock), so I have high hopes for Altra.
SarahKerrigan - Friday, May 22, 2020 - link
By the way, Andrei, you may want to correct the ST SPECFP subtest result graph - it looks like you used Graviton as a template and forgot to change the labels to eMag, because right now it only mentions Graviton1, Graviton2, and Intel, not eMag.
Andrei Frumusanu - Friday, May 22, 2020 - link
Thanks, good catch.
Flunk - Friday, May 22, 2020 - link
Interesting to see, even if this hardware only makes sense for very specialized purposes. ARM processors have gone from only applicable to mobile devices to something that would have made sense in a server a few years ago.
SarahKerrigan - Friday, May 22, 2020 - link
This isn't exactly a good representative of ARM processors; chips like Graviton2 are competitive for server workloads today, and make eMag look like a toy by comparison.
eastcoast_pete - Friday, May 22, 2020 - link
Thanks Andrei, good and in-depth review! You and others here have already commented on the great difference between this legacy CPU and Ampere's Altra or Amazon's Graviton 2. What I am also very curious about is Fujitsu's ARM-based multicore CPU (A64FX). Amongst other features, it supports the Scalable Vector Extension (SVE) at 512-bit width, the same width as Intel's AVX512. I wonder if someone at Fujitsu reads AnandTech and maybe sends you a setup for review, although a PRIMEHPC might be out of scope here. Still, that's an ARMv8 design that should beat the Graviton 2 and the Altra, especially if applications can make use of the wide SVE vectors.
anonomouse - Friday, May 22, 2020 - link
Based on what we know of the A64FX, it'll almost certainly *only* beat Graviton 2/Altra in cases where it can heavily utilize wide vectors. In all other scenarios it really doesn't have a lot of execution width, and it only runs at 2.2 GHz. The disclosures in their microarchitecture guide also don't showcase anything impressive-looking on the branch predictor, which is fine for the typical HPC workloads it will run. That thing is very heavily purpose-designed for HPC, and it's clear they focused on that and not general performance.
SarahKerrigan - Friday, May 22, 2020 - link
Indeed. It's a specialized chip. I would expect no miracles from it on general-purpose loads.
eastcoast_pete - Friday, May 22, 2020 - link
Agree with you and anonomouse on general-purpose loads; my interest in wide vectors is mainly due to their utility for video processing and encoding, if (!) the software supports it. For those applications, AVX512 is what keeps Intel competitive with EPYC in the x64 space. As a question: is anything like an AV1 encoder even available for ARMv8, and specifically one that uses wide SVE?
Wilco1 - Saturday, May 23, 2020 - link
There are many AV1 codecs which have AArch64 optimizations, but most focus on older mobile phone cores (eg. http://www.jbkempf.com/blog/post/2019/dav1d-0.5.0-... ), so likely need further work on latest microarchitectures with up to 4 128-bit Neon pipes.
It's early days for SVE, the first version (as in A64FX) is aimed at HPC. Video codecs will be optimized for SVE2 when hardware becomes available.
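To make that concrete, here is a minimal sketch (a hypothetical example, not code from dav1d or any real codec) of the kind of 128-bit NEON kernel such AArch64 optimizations are built from; a core with four Neon pipes can keep several of these vector operations in flight per cycle:

```c
/* Hypothetical sketch (not dav1d code): saturating addition of two rows of
 * 8-bit pixels, processed 16 pixels at a time with 128-bit NEON vectors. */
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void add_rows_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    /* Main loop: one 128-bit load/add/store per 16 pixels. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));   /* saturating add */
    }
    /* Scalar tail for any leftover pixels. */
    for (; i < n; i++) {
        unsigned s = a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}
```

An SVE/SVE2 version of the same loop would use predicated, length-agnostic vectors rather than fixed 16-byte chunks, which is part of why codec work is waiting on SVE2 hardware.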
vFunct - Friday, May 22, 2020 - link
They really need ARM systems that are a little higher-end than a Raspberry Pi but a little lower than x86, perhaps in the $100-$200 price range, for personal network appliances.
Death666Angel - Friday, May 22, 2020 - link
I'd be interested in what you would use that one for? And why exactly those specs? Lower power than x86 at "good enough" performance levels? If that is the base, why not do an undervolted / down clocked x86 build? Ryzen can get to some pretty great voltage/frequency levels. :D Or is it the ATX form factor as well? That one is a bit trickier, either go with a 12/19V native motherboard or get a nice pico PSU with ATX cables and a 12/19V input. :) Unless I'm way off base in my assumptions. :D
lmcd - Friday, May 22, 2020 - link
I haven't used an RPi 4 yet but I'd be willing to bet the 4GB variant would meet vFunct's needs.
vFunct - Friday, May 22, 2020 - link
Network file server with ZFS, or a mail server. Need storage & memory, but don't need an intense CPU.
Wilco1 - Friday, May 22, 2020 - link
It's worth pointing out for future reviews that GCC 10 is out and shows a 10.5% performance gain on Neoverse N1: https://community.arm.com/developer/tools-software...
SarahKerrigan - Friday, May 22, 2020 - link
Doesn't mean much for eMag, though.
Wilco1 - Saturday, May 23, 2020 - link
Indeed, eMag is quite old, so it won't benefit nearly as much as the latest microarchitectures.
GreenReaper - Sunday, May 24, 2020 - link
The graph at the end suggests that 10.0 was a significant regression for many tests, though, so that should probably be taken with a pinch of salt. <^_^>
There are some tests (mostly vectorization-related) where it's really helped, though.
mrvco - Friday, May 22, 2020 - link
Out of curiosity, how would the performance of the eMag compare to a typical single-board ARM computer? My reference point would be the RPi3 or 4, but there seem to be a variety of others ranging up to a couple hundred dollars with (allegedly) 'better' performance than the RPi.
lmcd - Friday, May 22, 2020 - link
You can't just reference the RPi 3 and 4 interchangeably. RPi 4 ranges from 2x to 10x faster than the RPi 3 depending on workload. Most SBCs surpassed the RPi 3 merely by choosing an SoC without its terrible I/O constraints. A few have 2xA72. The RPi 4 has 4xA72 at a better process node -> better clockspeed for the same thermal constraints, and no FSB limitations. Its CPU performance is ahead of all but the top-end hardware development kit boards.
lmcd - Friday, May 22, 2020 - link
Apparently I'm a moron that didn't see the ODROID-N2 release. That CPU is noticeably better.
SarahKerrigan - Friday, May 22, 2020 - link
It would likely win by a small to moderate amount against the Pi4 on ST, and obviously by a factor of several times on MT.
Altra will increase those numbers considerably, since it should be doing 2-3x the ST eMag and a much larger factor for MT due to the core count increase.
Dodozoid - Saturday, May 23, 2020 - link
Would have been interesting if AMD's planned K12 had worked out. Any idea if any part of that architecture is still alive?
AnarchoPrimitiv - Sunday, May 24, 2020 - link
There's a decent amount of spelling errors and wrong word errors in this article, for example:
"... having an Arm system like this is the fact that it enables YOUR (I think you mean "you") native software development, without having to worry about cross-compiling code and all of the kerfuffle that that ENTRAILS (I think you mean "entails")"
There's a few of those on every page; did anyone even proofread this once before publishing?
LordConrad - Sunday, May 24, 2020 - link
"...without having to worry about cross-compiling code and all of the kerfuffle that that entrails."Wow, who did you have to disembowel to get the cross-compiling done?
abufrejoval - Sunday, May 24, 2020 - link
He quite exaggerated the effort, because it makes little difference whether you compile GCC for the host architecture or a different one: it's just a matter of configuration and that's it.
You have to understand that pretty much every compiler has to compile itself, because nobody wants to code it in machine binary or assembly. The code for all supported target architectures comes with the compiler source tree and you just need to pick the proper parts to use.
It's just a tad more involved than simply running cc off the shelf.
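As an illustration of how little changes between a native and a cross build, here is a hedged sketch; the file name, toolchain triplet, and configure line are examples for illustration, not what Andrei actually ran:

```c
/* Hypothetical sketch: the same trivial source builds natively or for an
 * ARM64 target purely by choice of compiler/configuration.
 *
 *   Native build:        gcc -O2 hello.c -o hello
 *   Cross build (ARM64): aarch64-linux-gnu-gcc -O2 hello.c -o hello
 *
 * Building the cross-compiler itself is likewise mostly configuration, e.g.:
 *   ../gcc/configure --target=aarch64-linux-gnu ... && make && make install
 */
#include <stdio.h>

int main(void)
{
#if defined(__aarch64__)
    printf("Built for AArch64\n");   /* predefined macro on ARM64 targets */
#else
    printf("Built for the host architecture\n");
#endif
    return 0;
}
```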
Fataliity - Sunday, May 24, 2020 - link
"You must first compile the compiler, to then compile your code"
Sounds pretty crazy. Isn't the compiler itself also written in C++, the same language it's compiling?
My brain hurts.
GreenReaper - Sunday, May 24, 2020 - link
It involves a frequently non-trivial, multi-step process called bootstrapping: https://en.wikipedia.org/wiki/Bootstrapping_(compi...
abufrejoval - Sunday, May 24, 2020 - link
Well, recursion really grows on you after a bit of use :-)
While I am pretty sure gcc is written in C++ these days, obviously the first C++ compiler still had to be written in C, because otherwise there was nothing to compile it with. Only after the C++ compiler had been compiled and was ready to run could the compiler itself be refactored in C++, which I am pretty sure was done rather gradually, perhaps never fully.
These days I doubt that the GNU Fortran, Objective-C, Go or plain old C compilers are written in anything but C++, because there would be no benefit in writing each in its own language. But of course, it could be done (I wouldn't want to write a compiler in Fortran, but I guess some of the early ones were, perhaps with lots of assembly sprinkled in).
The GNU bootstrapping was done a long time ago, perhaps with a K&R compiler, and you don't typically have to go through the full process described in the article GreenReaper linked to. Pretty sure LLVM was bootstrapped with GCC, and now you could do the same the other way around, if you didn't know what else to do with your day.
I hear the Rust guys want to do a full bootstrap now, but so far their compiler was probably just done in C++. Not that they really have to, probably just because "eat your own dogfood" gets on their nerves.
The process Andrei had to use is pretty much whatever the guy who put 'cc/c++' on the shelf of your Unix/Linux had to do, except that Andrei had to explicitly configure an ARM64 v8 target during the compile, while by default the Makefile or script will pick the host architecture.
Really a pretty minor effort, trivial if you are used to building Unix/Linux applications or even a kernel or distribution from source.
And if you are developing for Android, that's what's happening under the hood there all the time: so far nobody wants to build Android on an Android device, because the build is rather slow already, even on a big server with dozens of cores.
mode_13h - Sunday, May 24, 2020 - link
Heh, yeah. Hopefully, just a typo.
mode_13h - Sunday, May 24, 2020 - link
Cool review. Thanks.
If you'd asked me 5-7 years ago, I'd have said I'd already be running an ARM-based server or workstation by now. Maybe I was off by a few years?
Anyway, I think we'll look back on this as a milestone. It's not the very first ARM-based workstation I've seen (for that, check out https://www.phoronix.com/scan.php?page=article&... ), but certainly the most compelling.
KAlmquist - Monday, May 25, 2020 - link
Linking is, in principle, pretty parallelizable. Static libraries are a problem for parallel linking because you have to know which symbols are referenced but not defined by files preceding the library before you can determine which object files in the library are needed, but these days people use shared libraries instead of static libraries. Generating the memory layout is a single threaded operation, but a quick one.
There are a lot of companies that would benefit from the existence of a parallelized linker. Avantek would have a more compelling product. Any company that does lots of software development would benefit from shorter build times. So I expect that eventually someone will fund the development of such a linker.
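A hedged sketch of the archive-ordering problem being described (file and symbol names are made up for illustration); it shows why the linker must know the undefined symbols accumulated so far before it can decide which members of a static library to pull in:

```c
/* Hypothetical example: main.c references sum(), which lives in the static
 * archive libutil.a as the member sum.o; another member, unused.o, defines
 * symbols nothing references.
 *
 *   gcc -c main.c
 *   gcc main.o libutil.a -o app   # works: sum is undefined when the archive
 *                                 # is scanned, so sum.o is pulled in and
 *                                 # unused.o is skipped
 *   gcc libutil.a main.o -o app   # classic failure with single-pass linkers:
 *                                 # no symbol is undefined yet when the
 *                                 # archive is scanned, so nothing is pulled
 *                                 # in and sum stays unresolved
 *
 * Deciding which members are needed therefore depends on everything linked
 * before the archive, which is part of what makes this stage awkward to
 * parallelize. */
extern int sum(int a, int b);   /* defined in libutil.a(sum.o) */

int main(void)
{
    return sum(2, 3);
}
```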
schm121 - Tuesday, May 26, 2020 - link
We always enjoy your articles its inspired a lot by reading your articles day by day. So please accept my thanks and congrats for success of your latest series.
https://www.schmhyd.edu.in/
futurepastnow - Thursday, June 11, 2020 - link
The next Mac Pro?