33 Comments
mode_13h - Thursday, August 13, 2020 - link
As always, thanks for the deep coverage.
Not finished reading, but I already have one complaint:
> Gen11’s smallest wavefront width is 8 threads wide (SIMD8), so it can take multiple clock cycles to execute a single wavefront, with Intel interleaving multiple threads as a form of latency hiding.
Wow. Mixing 2 different definitions of "thread" in the same sentence? Please don't.
Last I checked Nvidia is the only one talking about SIMD lanes as if they're threads. In Intel's Gen 9 whitepaper, it uses "threads" in a manner equivalent to CPU threads, and they talk about SIMD lanes as SIMD lanes.
And speaking of Gen 9, they claim it has 7-way SMT. Did they ever specify this, for Gen 11? I don't recall seeing it in their Gen 11 whitepaper, which went into significantly less detail on the EUs than previous whitepapers.
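To make the two meanings of "thread" concrete, here is a small Python sketch. It only assumes the figures mentioned above (7 hardware threads per EU, as claimed for Gen 9, and SIMD8 instructions). The class and function names are made up for illustration and do not come from any Intel document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionUnit:
    """Toy model of one EU; the field names are illustrative, not Intel's."""
    hardware_threads: int  # SMT contexts: "threads" in the CPU sense
    simd_width: int        # lanes per instruction: "threads" in Nvidia's sense


def count_threads(eu: ExecutionUnit) -> dict:
    """Count "threads" on one EU under both definitions."""
    return {
        "cpu_style_threads": eu.hardware_threads,
        "lane_style_threads": eu.hardware_threads * eu.simd_width,
    }


# Assumed figures: 7-way SMT (Gen 9 claim) and SIMD8 instructions.
gen9_like_eu = ExecutionUnit(hardware_threads=7, simd_width=8)
counts = count_threads(gen9_like_eu)
print(counts)
```

The same hardware yields two very different "thread" counts depending on which definition is in play, which is why mixing them in one sentence is confusing.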
mode_13h - Thursday, August 13, 2020 - link
I guess your article could be self-consistent by replacing the second use of "thread" in that quoted sentence with "wavefront"?
Although, "wavefront" is an AMD term (Nvidia calls them "Warps"). However, Intel's slides suggest they still call them "threads".
Ryan Smith - Thursday, August 13, 2020 - link
"I guess your article could be self-consistent by replacing the second use of "thread" in that quoted sentence with "wavefront"?"
You are correct, sir! That was supposed to be "wavefront".
And Intel tends to use "wave" in its literature, though I prefer to collapse it down to just wavefront to keep things reasonably consistent. We don't need 2 nearly-identical terms for the same thing.
mode_13h - Thursday, August 13, 2020 - link
Cool. Thanks for the reply!
BTW, I don't mind the term "wavefront" - I said that more to point it out to those who might not know.
mode_13h - Thursday, August 13, 2020 - link
IMO, the reason Nvidia has long called their Warp elements "threads" is so they can claim that each SIMD lane is a "core", to make their GPUs *sound* more impressive.
Since Volta finally fixed their per-lane IP register (which is basically just a fancy form of branch predication), there's almost a touch of truth in that characterization, and I'd finally agree that their ISA is more than just a straightforward combination of SIMD + SMT.
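For readers unfamiliar with branch predication, the following is a rough Python sketch of how a plain SIMD machine handles an if/else across lanes with an execution mask. It is only an illustration: the function name and the example branch are invented here, and it does not model Volta's per-lane program counters or any real ISA.

```python
def run_predicated(values):
    """Apply `if v is even: v //= 2, else: v = 3 * v + 1` to every lane.

    Both sides of the branch are issued for the whole vector; the mask
    decides which lanes are allowed to commit a result on each pass.
    """
    mask = [v % 2 == 0 for v in values]

    # Pass 1: the "then" side. Only lanes with the mask set commit.
    after_then = [v // 2 if m else v for v, m in zip(values, mask)]

    # Pass 2: the "else" side. Only lanes with the mask clear commit.
    result = [v if m else 3 * v + 1 for v, m in zip(after_then, mask)]

    # A pass can be skipped only if no lane needs it, so a divergent
    # vector costs two passes while a uniform one costs one.
    passes_issued = int(any(mask)) + int(not all(mask))
    return result, passes_issued


divergent, divergent_passes = run_predicated([1, 2, 3, 4, 5, 6, 7, 8])
uniform, uniform_passes = run_predicated([2, 4, 6, 8, 10, 12, 14, 16])
print(divergent, divergent_passes)
print(uniform, uniform_passes)
```

Each lane behaves as if it took its own branch, but the lanes never run independently: they share one instruction stream and simply sit idle on the side they did not take.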
xenol - Thursday, August 13, 2020 - link
AMD feels more confusing. Their base unit is a "stream processor", which seems to suggest something larger than it really is. But a group of stream processors is called a Compute Unit, which seems to suggest something smaller than it really is.
Though looking at some of the programming literature for GPUs, I can see where the "thread" terminology comes from. So this looks more like a problem of someone coming up with their own language instead of the industry coming together to standardize on it. However, given that NVIDIA, AMD, and Intel each have their own way of doing things, it may not be possible to do that, and for the sake of clarity, having their own terminology is more or less correct.
mode_13h - Thursday, August 13, 2020 - link
Since Nvidia's Fermi and AMD's GCN, their architectures basically amount to SIMD + SMT. I'm not sure exactly when Intel added SMT.
Anyway, I wouldn't characterize their architectures as fundamentally different. Intel is traditionally the most distinct, among the three.
jim bone - Friday, August 14, 2020 - link
recent editions of Hennessy and Patterson have a nice table mapping the CPU terminology to nvidia’s GPU terminology:
https://books.google.ca/books?id=cM8mDwAAQBAJ&...
jim bone - Friday, August 14, 2020 - link
and yes for reasons nvidia calls a vertical slice of simd instructions a thread
kpx86 - Thursday, August 13, 2020 - link
I believe the SW libraries like DirectX and OpenGL use threads this way.
From MSFT website: The maximum number of threads is limited to D3D11_CS_4_X_THREAD_GROUP_MAX_THREADS_PER_GROUP (768) per group.
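As a rough illustration of how API-level "threads" map onto hardware wavefronts, here is a short Python sketch. It takes the 768-thread limit quoted above and divides it by a few SIMD widths. The list of widths and the helper name are assumptions picked for illustration, not values from the Direct3D documentation.

```python
import math

# The per-group limit quoted above from Microsoft's documentation.
MAX_THREADS_PER_GROUP = 768


def wavefronts_per_group(threads_per_group: int, simd_width: int) -> int:
    """Number of wavefronts needed to cover one thread group.

    A partially filled wavefront still occupies a full slot, hence the ceiling.
    """
    return math.ceil(threads_per_group / simd_width)


# Illustrative SIMD widths only; which width a driver picks is up to the vendor.
for width in (8, 16, 32, 64):
    print(f"SIMD{width}: {wavefronts_per_group(MAX_THREADS_PER_GROUP, width)} wavefronts")
```

The API counts individual lanes as threads, so the number of hardware-scheduled units behind a group depends entirely on the SIMD width the implementation chooses.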
mode_13h - Thursday, August 13, 2020 - link
I can't speak to Direct 3D, but OpenGL talks about work group invocations. I don't believe "threads" is mentioned anywhere in the API.
Dolda2000 - Thursday, August 13, 2020 - link
Admittedly I haven't read the whole article yet, but it strikes me how the presentation seems to be comparing the new GPU to the previous GPU, rather than presenting it as a new architecture. Does this confirm that using the "Xe" moniker for this product is just marketing, and that it in fact is an evolution of previous Gen architectures?
I mean, I don't mind if that's the case, I just wish they wouldn't overmarket it.
Ryan Smith - Thursday, August 13, 2020 - link
" is an evolution of previous Gen architectures?"
It is an evolution of the previous Gen architectures. A major evolution, but an evolution nonetheless. Not even Intel is going to do a clean sheet design when they have bits and pieces that already work fine.
Dolda2000 - Thursday, August 13, 2020 - link
Certainly, they're not going to create a new clean-slate ALU design just for the sake of it, but it has always been my impression that Xe (at least Xe-HPC) was going to be a more-or-less new architecture. Maybe that has just been my misunderstanding the whole time, and Xe-HPC too is going to be fundamentally Gen-based (though I seem to recall that being explicitly denied at some point), but what I was getting at here was that Xe-HPC is going to be the new architecture, and meanwhile this is "merely" an evolution of Gen for which they're just borrowing the product name of their higher-end offering to make it seem like more than what it is.
mode_13h - Thursday, August 13, 2020 - link
You should distinguish between the ISA and uArch of the shader cores (EUs) vs. the macro-architecture of the GPU (e.g. buses, memories, caches, fixed-function units, etc.).
So, you can have a macro-architecture that's *very* different, even while the ISA is a small evolution and the uArch of the EUs is somewhere in between.
tipoo - Thursday, August 13, 2020 - link
RDNA 1 still has significant GCN bits in it, and I'm sure Nvidia does the same for a few generations in a row. There's no necessary contention between it being an evolution and it being marked as something substantially new.
abufrejoval - Thursday, August 13, 2020 - link
IMHO the overhead of multi GPU rendering with an iGPU and dGPU can't really be offset by the small contribution the iGPU is likely to make to a beefy dGPU.
More likely will be dGPU via Thunderbolt 4 and very seamless transitions on docking/undocking and that's good enough.
Too bad that won't work nearly as well with Ryzen notebooks so there again consumer choice goes down the drain somewhat. Not that I believe TB dGPU is really an attractive market unless prices change dramatically.
mode_13h - Thursday, August 13, 2020 - link
Agreed. I think it would work much better to task the iGPU with other compute tasks that involve less communication bandwidth with the dGPU. Things like physics, AI, audio processing, etc.
brucethemoose - Thursday, August 13, 2020 - link
Maybe post processing? Like an Intel version of ReShade? IIRC the frames have to come back to the IGPU's display block anyway.
tipoo - Thursday, August 13, 2020 - link
In this case the IGP would be nearly equivalent to DG1
regsEx - Thursday, August 13, 2020 - link
HPG will use EM cores for ray tracing?
Mr Perfect - Thursday, August 13, 2020 - link
"On the capacity front, the L3 cache can now be as large as 16MB"
I apologize for being off topic, but I just had a surreal moment realizing that this piddly little iGPU can have the same amount of L3 cache as my Voodoo 3 had video ram. How far we've come.
Brane2 - Thursday, August 13, 2020 - link
As usual, no useful info.
They'll make a GPU that looks every bit like... GPU.
What a shocker.
Who knew ?
GreenReaper - Thursday, August 13, 2020 - link
"As a result, integer throughput has also doubled: Xe-LP can put away 8 INT32 ops or 32 INT16 ops per clock cycle, up from 4 and 16 respectively on Gen11." -- but the graph says 4 and 8 respectively on Gen11. (The following line also appears odd as a result.)
Ryan Smith - Thursday, August 13, 2020 - link
Thanks! That was meant to be 16 ops for Gen11 in the table.
neogodless - Thursday, August 13, 2020 - link
> from reviews of Ice Lake and Ryzen 3000 “Renoir” laptops,
It is my understanding that the Renoir codename refers to what are commercially Ryzen 4000 mobile APUs, like the 4700U, 4800H and 4900HS.
FullmetalTitan - Thursday, August 13, 2020 - link
In addition to groaning at the joke at the end of page 1, I find the timing to be perfect as I just last night got my partner to start watching the Stargate series.
Valantar - Friday, August 14, 2020 - link
As always here on AT, an absolutely excellent article, distilling a pile of complex information down to something both understandable and interesting. I'm definitely looking forward to seeing how Tiger Lake's Xe iGPU performs, and the DG1 too. I doubt their drivers will be up to par for a few years, but a third contender should be good for the GPU market (though with a clear incumbent leader there's always a chance of the small fish eating each other rather than taking chunks out of the bigger one). Looking forward to the next few years of GPUs, this is bound to be interesting!
onewingedangel - Friday, August 14, 2020 - link
The approach taken with DG1 seems a little odd. It's too similar to the iGPU by itself, just with more power/thermal headroom and less memory contention.
Unless it works in concert with the IGP, you'd think it better to either remove the iGPU from the CPU entirely (significantly reducing die size) and package DG1 with the CPU die when a more powerful GPU is not going to be used, or to add an HBM controller to the CPU and make the addition of an HBM die the graphics upgrade option when the base iGPU is not quite enough.
Digidi - Friday, August 14, 2020 - link
Nice article! The front end looks huge. 2 rasterizers for only 700 shaders is a massive change.
Cooe - Saturday, August 15, 2020 - link
Soooooo much die space & transistors needed for just barely better performance than Renoir's absolutely freaking MINUSCULE Vega 8 iGPU block.... Consider me seriously unimpressed. The suuuuuper early DDR5 support on the IMC is incredibly intriguing and I'm really curious to see what the performance gains from that will be like, but other than that.... epic yawn. Wake me up when it doesn't take Intel half the damn die for them to compete with absolutely teeny-tiny implementations of AMD's 2-3 year old GPU tech....
Makes sense now why AMD's going for Vega again for Cézanne. Some extra frequency & arch tweaks are all they'd need to one-up Intel again, & going RDNA/2 would have had a SIGNIFICANTLY larger die space requirement (an RDNA dCU ["Dual Compute Unit"] is much, MUCH larger than 2x Vega II /"Enhanced" CU's), that just doesn't really make much sense to make until DDR5 shows up with Zen 4 and such a change can be properly taken advantage of.
(Current iGPU's are ALREADY ridiculously memory bandwidth bottlenecked. A beefy RDNA 2 iGPU block would bring even 3200MHz DDR4 to its absolute KNEES, & LPDDR4X is just too uncommon/expensive to bank on it being used widely enough for the huge die space cost vs iterating Vega again to make sense. Also, as we saw with Renoir; with some additional TLC, Vega has had a LOT more left in the tank than probably any of us would have thought).
Oxford Guy - Tuesday, August 18, 2020 - link
MadTV is back, with an episode called Anandtech Literally?
Oxford Guy - Tuesday, August 18, 2020 - link
"It’s worth noting that this change is fairly similar to what AMD did last year with its RDNA (1) architecture, eliminating the multi-cycle execution of a wavefront by increasing their SIMD size and returning their wavefront size. In AMD’s case this was done to help keep their SIMD slots occupied more often and reduce instruction latency, and I wouldn’t be surprised if it’s a similar story for Intel."
Retuning or returning to?