53 Comments
mdriftmeyer - Monday, July 13, 2015 - link
HBM 2.0 GPGPUs for AMD FirePro and Nvidia will make this platform far less compelling.
SaberKOG91 - Monday, July 13, 2015 - link
Especially considering the barrier to entry. With AMD you have OpenCL and Nvidia has CUDA/OpenCL. These Phi cards are closer to complete systems than coprocessors. Programming for them is non-trivial and requires a lot of sifting through documentation.
beginner99 - Tuesday, July 14, 2015 - link
No, not anymore. As the slide states, and as has been known for a while, Knights Landing is binary compatible to Haswell (except for TSX instructions). Makes sense, because after all these are Silvermont (Atom) cores with some added FP64 punch. That's also why it is available as a socketed variant. This thing can run Windows/Linux and any other x86 code your current Xeon is running.
It's exactly the opposite of what you claimed. The barrier to entry is much lower with this new Xeon Phi.
coolbho3k - Tuesday, July 14, 2015 - link
>Knights Landing is binary compatible to Haswell (except for TSX instructions)
Haswell supports TSX instructions now? News to me. ;)
close - Tuesday, July 14, 2015 - link
"Intel is announcing that they have teamed up with HP"
Whenever I see Intel teaming up with HP I think Itanium :).
Refuge - Tuesday, July 14, 2015 - link
Haha, I had the exact same thought when I read that. :P
Samus - Tuesday, July 14, 2015 - link
Itanium IA64 wasn't bad technology, but where Intel screwed up was pressuring HP to push it into the consumer space where it never belonged. AMD knew, and executed on, an x86-compatible 64-bit instruction set for consumer applications.
So by binary compatible, is the Phi x86-based or does it emulate x86? I assumed ditching P54c indicates they ditched x86, but I'm not familiar with Silvermont's architecture. Does it just emulate x86 instructions?
LukaP - Tuesday, July 14, 2015 - link
Silvermont is a native x86-64 architecture. :) It's used in Atom CPUs and, with the extra SIMD units, in KLs :)
68k - Tuesday, July 14, 2015 - link
Haswell based Xeon E7 do support TSX
http://techreport.com/news/28225/xeon-e7-v3-boasts...
It is however broken on consumer products, and according to the TR-article, also broken on Haswell based Xeon E5.
Refuge - Friday, July 17, 2015 - link
As far as I know it has always been supported by Haswell chips.
SaberKOG91 - Tuesday, July 14, 2015 - link
I'm sorry, but that's simply not true.
You must have the Intel Compiler Collection and Intel Parallel Studio to use it, which are not free for non-academic use (not to mention the usual headache of Intel licensing). The Linux that the Phi runs is a highly specialized BSP built by Intel with a ton of proprietary code. It's not as if you can just throw Ubuntu on this thing and start programming. They had to patch the kernel to allow this card to NFS-boot the OS, and for the host processor to move data back and forth.
By comparison, CUDA is a package install away and is free for anyone to use for any purpose. OpenCL for AMD requires a little bit of compiling for most distros or a single package on Debian. As a bonus, writing OpenCL code allows me to target both Nvidia and AMD gpus with a negligible performance impact (<5%) compared to CUDA native.
68k - Tuesday, July 14, 2015 - link
OpenCL is supported even on existing Xeon Phi
https://int2-software.intel.com/en-us/articles/ope...
gcc support to offload OpenMP to Xeon Phi is being worked on for Knights Landing
https://gcc.gnu.org/wiki/Offloading
There is a version of GCC for Knights Corner, but it is not recommended for compute-intensive code yet, as the compiler is not very good at utilizing the vector capabilities of Xeon Phi.
And regarding ease of programming: I was under the impression that Xeon Phi cannot match Nvidia's/AMD's top-of-the-line GPUs in theoretical FLOPS, but Xeon Phi was still a success in the HPC area just because it was a lot easier to program (i.e. much easier to reach a certain fraction of the theoretical maximum).
Jaybus - Wednesday, July 15, 2015 - link
Yes. The cores are Silvermont with FP and vector enhancements. They will run code that normal Silvermont cores run now. The Intel compiler is needed to take advantage of the enhancements, at least until gcc catches up.
What isn't being looked at is that it doesn't exactly compare 1:1 with GPGPU solutions. There is nothing preventing the use of both at the same time. A system could certainly be built with both Phi and GPUs. GPUs are more vector oriented and require thread grouping (SIMD), whereas Phi is MIMD (mostly) and so has an advantage for problems that are more serial or iterative by nature.
patrickjp93 - Tuesday, July 14, 2015 - link
Not true. You can use OpenCL or OpenMP under GCC, Clang, or ICC.
patrickjp93 - Tuesday, July 14, 2015 - link
I forgot to mention they also work with OpenACC.
darthscsi - Tuesday, July 14, 2015 - link
This isn't true. I've routinely used gcc to cross-compile native Phi binaries. One can use pthread or OpenMP without problems. From there you ssh into the Phi, run your app from the NFS-mounted filesystem you set up, etc. Given the mostly self-contained nature of the old Phis, it is easy to see why Intel is moving to dropping the host system.
As a side note, the Phi networking is run over the in-kernel non-transparent PCIe bridge driver.
beginner99 - Wednesday, July 15, 2015 - link
Reading comprehension? What you say is true for the current Xeon Phi, which is only available as an add-on card. Here we are talking about the next iteration, code-named Knights Landing. It is also available as a socketed version and also acts as the CPU. It literally can run Windows or Ubuntu. The cores in this new Xeon Phi are beefed-up Silvermont cores, which are small x86 64-bit capable cores already used in the Atom series of processors.
Quote:
Probably most important: Knights Landing is *not* an add-on product that requires a regular system with a regular CPU to run... it is a completely self-hosting Xeon processor that boots & runs Linux or Windows and makes 100% of the computing resources visible to regular software running on the system. Any multi-threaded program that can scale out to the core count and is setup to use the AVX-512 instructions can push this card up to those theoretical numbers.
Anything not using AVX-512 can still use all 72 cores. You can take your existing x86 multi-threaded application and it will run on 72 cores if it is programmed to use all cores. This is huge and a huge threat for NV/AMD.
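A minimal sketch of what "programmed to use all cores" could look like in practice: the program asks the OS how many logical CPUs are visible instead of hard-coding a count, so the same code would spread over 4 cores or 72. The helper names `split_range` and `parallel_sum_of_squares` are made up for illustration, and Python is used only for brevity.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_range(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def parallel_sum_of_squares(n_items, n_workers=None):
    """Sum i*i for i in range(n_items), one chunk per worker."""
    # Ask the OS for the visible logical CPU count rather than hard-coding it.
    n_workers = n_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda chunk: sum(i * i for i in chunk),
                            split_range(n_items, n_workers))
        return sum(partials)


if __name__ == "__main__":
    print(os.cpu_count(), parallel_sum_of_squares(1_000_000))
```

Note that CPython threads will not actually run this arithmetic in parallel; only the sizing logic is the point here, and it would carry over to OpenMP or pthreads code, which is what an HPC application would more likely use.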
Loki726 - Friday, July 17, 2015 - link
It depends on what you mean by compatible. If you want to run existing multithreaded or AVX assembly-tuned applications on a bunch of Atom cores, then it is much easier to port to Xeon Phi. If you are trying to write a new application that gets good utilization out of thousands of vector lanes, I think GPUs actually have a better programming model.
Senti - Tuesday, July 14, 2015 - link
AMD has mostly (but not completely) working OpenCL, NV has working CUDA and sometimes working OpenCL.
If Intel delivers the usual x86 platform without bugs, it _will_ be very interesting for serious computations.
doids - Wednesday, July 15, 2015 - link
Xeon Phi supports OpenCL, so it is as trivial as using CUDA/OpenCL on graphics cards.
patrickjp93 - Tuesday, July 14, 2015 - link
Really? Because the current Knights Corner chips are spanking the Nvidia Tesla out of contracts left and right.
marraco - Monday, July 13, 2015 - link
How pathetic is the number of cores sold to the desktop consumer.
SaberKOG91 - Monday, July 13, 2015 - link
Most consumer workloads benefit from fewer, faster threads. These cards run at lower frequencies, but benefit from data-parallel execution units. That's useful when you want a good mix of integer and floating point HPC performance, but not at all helpful for highly sequential client software.
SirNuke - Monday, July 13, 2015 - link
Different horses for different courses. Desktop workloads usually only need a few simultaneous threads and only infrequently take advantage of parallelism.
SaberKOG91 - Monday, July 13, 2015 - link
That said, we could easily see efficiency improvements from 4-way SMT with how wide the cores are. Disappointed that we still only see 2-way SMT over a decade later.
testbug00 - Monday, July 13, 2015 - link
And who would be writing that scheduler? SMT2 isn't that hard relative to no SMT. Go beyond SMT2 and things get much trickier, from my understanding.
SaberKOG91 - Tuesday, July 14, 2015 - link
The SMT scheduling is part of the out-of-order execution engine, not the kernel, so it's a lot easier to scale that in hardware. Xeon Phi cores are 4-way, making it easier to port that scheduling to another microarchitecture. Power7 was 4-way, and Power8 is 8-way.
This paper is a solid read if you want to know how difficult it is: http://lazowska.cs.washington.edu/SMT.pdf
Frankly, the biggest limitation is cache bandwidth. You need a much wider cache to keep the threads happy. However, the efficiency gains are tremendous. P4 with HT saw up to 28% improvement in performance for as little as 5-6% area bump. And that was considered a poor implementation.
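To put the Pentium 4 figures above in perf-per-area terms, here is a quick back-of-the-envelope sketch. It assumes the quoted best case (a 28% speedup) and treats the 5-6% area bump as the only cost; `perf_per_area_gain` is a hypothetical helper for illustration, not something taken from the linked paper.

```python
def perf_per_area_gain(speedup_pct, area_pct):
    """Relative performance per unit of die area vs. the baseline core.

    1.0 means no change; 1.2 means 20% more performance per unit of area.
    """
    return (1 + speedup_pct / 100) / (1 + area_pct / 100)


if __name__ == "__main__":
    for area_pct in (5, 6):
        gain = perf_per_area_gain(28, area_pct)
        print(f"+28% perf for +{area_pct}% area => {gain:.2f}x perf per area")
```

Under those assumptions the core delivers roughly 1.2x the performance per unit of area, which is the sense in which the efficiency gain is large even for an implementation considered poor.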
tipoo - Monday, July 13, 2015 - link
Look at the die size of the simpler cores in the Phi, vs Haswell or Broadwell cores.
tipoo - Monday, July 13, 2015 - link
I almost wonder what an Intel dedicated gaming GPU would be like; they have a lot of the right pieces in play. Their GPU architecture is decent enough, and with their on-package memory, I wonder what that could be if they scaled up the EUs and bandwidth accordingly.
DigitalFreak - Monday, July 13, 2015 - link
Phi was originally supposed to be available in a video card configuration as well, but that idea was dropped prior to launch.
SaberKOG91 - Monday, July 13, 2015 - link
Source? The Larrabee microarch was to be used for GPU designs, but I don't remember seeing anything about a discrete GPU version of Phi.
testbug00 - Monday, July 13, 2015 - link
Why wouldn't Intel sell a product to a market if they had a viable product? ???
ImSpartacus - Tuesday, July 14, 2015 - link
They would.
So it probably wasn't viable. Intel is pretty spectacular at predicting the success of a given product. They just don't fuck up.
SaberKOG91 - Tuesday, July 14, 2015 - link
Exactly. They only recently managed to catch up with AMD in GPU performance for their on-chip engines, and that's only by having their 128MB eDRAM cache in the Iris Pro designs.
Refuge - Tuesday, July 14, 2015 - link
And AMD put 4GB of HBM on their Fury X and still lose to their chosen opponent.
Intel has made good ground, and is catching up quick.
It doesn't matter how they get it done, as long as they get it done. If you can do better, then do it yourself and be rich and famous.
DanNeely - Tuesday, July 14, 2015 - link
They've been on a roll recently, but the P4 architecture and the Itanic are proof that Intel can screw the pooch on released hardware as badly as anyone else.
Refuge - Friday, July 17, 2015 - link
Haha, that was a difficult time in my household...
Ktracho - Monday, July 13, 2015 - link
As they say, good enough is the enemy of great - Intel doesn't see the need for great graphics because theirs is already good enough.
Refuge - Tuesday, July 14, 2015 - link
I seem to remember owning an Intel GPU when I was younger... much younger... like late 90's...
I also remember being much happier when I upgraded to a VOODOO and was finally able to play Final Fantasy 7. :D
Klimax - Wednesday, July 15, 2015 - link
i740. Have it in collection...
Refuge - Friday, July 17, 2015 - link
+1 good sir! :D
tuxRoller - Monday, July 13, 2015 - link
Maybe, but I'm told that Xeon Phi's maximum GFLOPS is more easily achievable than on current GPGPUs.
Obviously, HSA should be bringing GPGPUs closer to CPUs in terms of context-switching and minimum schedulable elements, but they're not there yet.
testbug00 - Monday, July 13, 2015 - link
Getting 90% of 3TFLOP versus getting 70% of 6-8+TF... I'll take the 70%.
It is a lot more interesting for FP64, where the only current GPUs that challenge it are professional Hawaii cards. Although, even at that point, Intel with 100% utilization (I believe their FP64 rate is 1/2 FP32) for about 1500GFLOPs relies on AMD's cards failing to hit over 55%.
Of course, there are applications where the ability to have more CPU-like performance will serve better than a more GPU like performance.
testbug00 - Monday, July 13, 2015 - link
The 90% and 70% I made up. Safe to say, I'm guessing the difference isn't much larger than that. Also, talking to someone who went to a workshop, that was not the impression that fully utilizing it was not as easy as GPUs, but that can change rapidly depending on software.
cmikeh2 - Tuesday, July 14, 2015 - link
Xeon Phi does get 3 TFLOPS on FP64. Each core has two AVX-512 vector units, so with each capable of an FMA for each of the eight FP64 lanes you get 32 FLOPs per core per cycle. 72 cores a chip gets you 2304 FLOPs per cycle total. At the estimated frequency of 1.3 GHz that comes out to be 2.995 TFLOPS of FP64.
SaberKOG91 - Tuesday, July 14, 2015 - link
That's peak throughput though. It will drop dramatically as the number of non-FMAC instructions increases. Unless you are doing a lot of matrix operations or DSP algorithms, it will be hard to realize even half of that. For example:
90/10 split => 95% efficiency
80/20 split => 90%
70/30 split => 85%
50/50 split => 75%
20/80 split => 60% efficiency
Of course it is impressive that the throughput is so high in the first place, but the FirePro S9170 was just launched with 2.62 TFLOP/s of FP64 at no more than 275W. And all my OpenCL code will just run faster than it used to. No porting necessary.
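A rough sketch of the arithmetic in the two comments above. The peak figure uses the numbers quoted in the thread (72 cores, two AVX-512 units per core, eight FP64 lanes, an estimated 1.3 GHz). The efficiency model is an assumption that happens to reproduce the listed percentages: an FMA counts as 2 FLOPs per lane, any other vector instruction counts as 1, and every instruction issues at full rate. The function names are made up for illustration.

```python
def peak_flops_per_cycle(cores, vector_units, fp64_lanes, flops_per_lane=2):
    """Chip-wide FP64 FLOPs per cycle; an FMA counts as 2 FLOPs per lane."""
    return cores * vector_units * fp64_lanes * flops_per_lane


def fma_mix_efficiency(fma_pct):
    """Percent of peak reached when fma_pct% of vector instructions are FMAs.

    Assumed model: FMAs deliver 2 FLOPs per lane, everything else delivers 1,
    so the result is the weighted average relative to the all-FMA peak of 2.
    """
    other_pct = 100 - fma_pct
    return (2 * fma_pct + 1 * other_pct) / 2


if __name__ == "__main__":
    per_cycle = peak_flops_per_cycle(cores=72, vector_units=2, fp64_lanes=8)
    peak_tflops = per_cycle * 1.3e9 / 1e12  # 1.3 GHz is the estimate quoted above
    print(f"{per_cycle} FLOPs/cycle, about {peak_tflops:.3f} TFLOPS FP64")
    for fma_pct in (90, 80, 70, 50, 20):
        print(f"{fma_pct}/{100 - fma_pct} split => {fma_mix_efficiency(fma_pct):.0f}%")
```

Under this model the efficiency is simply (100 + FMA share) / 2, so it bottoms out at 50% of peak for code with no FMAs at all. Memory stalls, branches and scalar work would push real code lower than that, which the model does not attempt to capture.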
Ian Cutress - Tuesday, July 14, 2015 - link
There's also a register benefit with Xeon Phi, allowing for much larger per-thread kernel executions with data in-flight. On GPGPU it's a little restrictive in that sense, and you have to make threads lightweight. Xeon Phi threads can be super branchy and still be efficient.
MrSpadge - Tuesday, July 14, 2015 - link
This opens up the door for algorithms which could not yet be accelerated by GPUs at all. That's what Intel is targeting: more flexibility, new algorithms and binary compatibility. Those are things the current GPUs can't do, despite matching the raw horsepower (Hawaii).
CajunArson - Tuesday, July 14, 2015 - link
Uh... I'd take 100% of 100TFlops myself since your numbers are completely made-up and I can make better made-up numbers.
In the real world, 3 Tflops of double precision is what Nvidia is hoping to achieve in 2016 with Pascal, and AMD isn't even participating in this market. That's not even taking into account the benefits of having real CPU hardware in your parallel processor instead of having to rely on shaders that were intended to play video games and were re-tasked for the HPC world.
Don't let marketing nonsense from AMD about mythical "8Tflop" compute performance on the Furry fool you. First of all, that's *single precision* and these cards are meant for real computers doing double precision workloads, not glorified video game systems. At DP a Knights Landing part is between 5 and 6 times faster than the Furry-X that only clocks in at about 0.5 TFlops DP.
MrSpadge - Tuesday, July 14, 2015 - link
To be fair, AMD's Hawaii with 1/2 FP64 rate is actually participating at 2.6 DP TFlops peak performance in the currently fastest FirePro.
karthik.hegde - Wednesday, July 15, 2015 - link
Sorry, I disagree. A significant amount of compute in HPC can make do with SP, and GPUs can do a great job at it. Upcoming needs like huge Deep Learning and NN implementations need just SP. Wherever DP is a must, I still have CPUs to do it!
That's why GPUs are accelerators, and accompany the CPU. Plus they provide power advantages as well. Xeon Phi is useful, I think, only when there are many threads and each thread is heavy in compute requirement.
Ktracho - Tuesday, July 14, 2015 - link
Any mention of performance per watt or how power efficiency compares to current solutions?
Refuge - Friday, July 17, 2015 - link
Not that I've seen yet, no.
If you do find some, take them with a grain of salt at the moment though.