The most important part is what Anand highlighted from this walkthrough: namely that the Rogue series has a chip that is on equal footing with, if not even stronger than, Tegra K1.
That remains to be seen - but it does appear to be competitive at this point. We also don't know what the entire Denver+K1 package will do.
What kind of surprises me about K1 though is that Maxwell has already been released for the desktop. I would think "M1" (to guess at a name) would be the architecture to build on in the next year.
Uhhh, no it doesn't. The article didn't mention this important piece of information, but you realize the 6XT series probably won't come to market before 2015, right? Likely 2H 2015, according to an earlier article published here on AnandTech.
That's because there is none. It was an estimate given by Ryan Smith based on prior Imagination GPU announcements, and the time it usually takes chipmakers to integrate the design and bring a device to market.
"Finally, while Imagination doesn’t provide a timeframe for consumer availability (since they only sell designs to chipmakers), based on the amount of time needed to integrate these designs into new products and then get those products in the hands of consumers, we should be looking at a timetable similar to the original Series6 designs. In which case Series6XT equipped SoCs would start appearing in 2015, likely in the latter half."
Really? You don't think their biggest customer, Apple, could pull it off? Apple has shown the ability to beat the entire industry to market by almost a year (64-bit ARMv8), and it was one of the first PVR Series 6 customers to market as well. Anand thought PVR6 wouldn't show up in the market until 2014, and the A9600 that was supposed to show up in 2013 never did (Apple's A7 did, though!)
So why do you rule out the real possibility that the Apple A8 would ship with a 6XT this year?
My theory for the ~18 month estimate is that it's about how long Apple took to integrate their current Series 6 GPU and bring it to market in the A7. I suppose if any company had the resources to accelerate that schedule, it would be Apple. But then the question becomes: why, and does it make sense? There will be faster Series 6 SKUs available for the A8 in the interim.
But don't forget, Apple has been involved in this design since long before Imagination's public reveal. In fact, Apple informed Imagination's design with their real-world experience and their own needs. I always expect Apple to get a 6 to 12 month lead. And Apple has shown themselves to put a very high value on their SoC GPU leadership.
... and the 64-bit architecture in the A7 is a completely different story. Apple wasn't first to market with 64-bit because they had an accelerated development schedule compared to other chipmakers. Apple was first to market because they started development first, ahead of everyone else.
Apple doesn't develop Series 6, they license it, and we know when it becomes available for integration and when devices based on it come to market. We also know how long it usually takes between tapeout of an SoC, production, and final availability in devices, and based on this it would be very difficult, if not impossible, for Apple to put a 6XT in the A8 if they keep to their regular release schedule. I think going from an ~18 month to an ~8 month schedule is a bit much even for Apple, especially considering the new process shrink.
It's available for license now. Doesn't that normally mean that, as soon as the computer finishes synthesizing the design, it can be taped out now as well? The issue then is how much work needs to be done to get the design to work at the desired power envelope and clock, and if said work is worth it.
Considering the highly parallel nature of graphics processing, PowerVR's low core count and non-linear arrangement will make them weak in real-world gaming but strong in synthetic GFLOPS, since their cores are stronger.
For example, ATI has the weakest GPU cores of everybody, but they cram over 2000 of them onto a GPU, and that seems to be working out pretty well for their performance.
Your example is irrelevant, since ATI has no mobile solution, and PowerVR has been the strongest in real-world gaming since... well, until the K1 ships, PowerVR has had no real competition.
Adreno is technically competent, but Qualcomm isn't seeking to push their transistor budget too high.
So they have 2x Float32 ALU cores, 4x Float16 ALU cores, and an SFU core. There's no mention of integer cores. Am I to assume that integers won't run on the F32/F16 cores but instead on the SFU core, so using integers will be 1/6th the speed of floats?
Seems like a large drawback; Mali and Adreno both run integers at the same speed as 32-bit floats.
I'm not sure what exactly is being babbled on about with tile-based deferred rendering. It's just software; anyone can write and run it. Go onto a friendly GPU programming forum and they'll take you through it step by step.
Deferred rendering is a software solution. Tile-based deferred rendering is a hardware solution. The GPU cuts up the triangles into a set of tiles. Inside the GPU there is a superfast 'framebuffer' the size of a tile (think of it as a special L1 cache). The GPU renders one tile at a time into this buffer, solving overdraw very quickly and efficiently, then burst-writes the tile out to the framebuffer in video memory. PowerVR has been using this technology since the early days of 3D acceleration (I have a PowerVR PCX2 card myself, and did a blog on it a while ago: http://scalibq.wordpress.com/2012/12/18/just-keepi... ). I suggest you read up on it; it is very interesting technology, and unlike any competing GPU.
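If it helps to see the shape of that flow in code, here is a toy software model of what's described above, with pre-rasterized fragments standing in for triangles; the names and tile size are made up for illustration, and real PowerVR hardware does all of this with dedicated units, not software:

```python
# Toy model of tile-based deferred rendering (illustrative only).

def bin_fragments(frags, tile_size):
    """frags: iterable of (x, y, z, material). Returns {tile: [fragment]}."""
    bins = {}
    for x, y, z, material in frags:
        tile = (x // tile_size, y // tile_size)
        idx = (y % tile_size) * tile_size + (x % tile_size)
        bins.setdefault(tile, []).append((idx, z, material))
    return bins

def render_tile(tile_frags, tile_size, shade):
    """Resolve visibility in a small on-chip-sized buffer first, then shade
    only the surviving fragment per pixel (the 'deferred' part)."""
    depth = [float("inf")] * (tile_size * tile_size)
    winner = [None] * (tile_size * tile_size)
    for idx, z, material in tile_frags:          # hidden-surface removal
        if z < depth[idx]:
            depth[idx] = z
            winner[idx] = material
    # One shade per visible pixel, regardless of overdraw; the finished
    # tile would then be burst-written to the framebuffer in memory.
    return [shade(m) if m is not None else (0, 0, 0) for m in winner]

# Example: two overlapping fragments at the same pixel; only the nearer
# one ever gets shaded.
frags = [(5, 5, 0.8, "red"), (5, 5, 0.2, "blue")]
tiles = bin_fragments(frags, tile_size=16)
out = render_tile(tiles[(0, 0)], 16, shade=lambda m: m)
print(out[5 * 16 + 5])   # 'blue'
```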
No it isn't. Anyone with the proper feature set can do tile-based deferred; most next-gen games are going to be culling light lists on something like an 8x8-pixel-per-tile basis, whether that's for forward rendering or deferred. Which sounds exactly like what you described.
It might be nice that there's some special little cache for it in PowerVR. But the basic idea as you've described it sounds exactly the same in principle as what DICE/EA's Frostbite does, as well as what any number of other papers and upcoming games do.
No, I don't think you quite get it. Culling lights in tiles is something different. In this case the geometry is batched up before drawing, then binned to tiles, and then the visibility (z-order) is solved on a per-tile basis.
It may sound the same as deferred rendering tricks in software, but it is not quite the same. These software tricks depend on multiple rendering passes, with z/stenciltesting to determine which pixels to shade. PowerVR can do it in a single pass (as far as the software is concerned).
In fact, the PowerVR PCX2 card did not even need a z-buffer in videomemory at all. What "feature set" on a regular GPU would be able to render properly without a z-buffer?
Exactly, the Z-buffer is on chip. Incidentally, multisampling AA increases your Z-buffer and framebuffer bandwidth requirements by a factor of 4 (for 4x AA). What if that were all on chip?
I can't believe IMGTEC haven't made more noise about this.
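Some rough back-of-envelope numbers on that point, ignoring framebuffer compression, caching and depth traffic, and assuming a 1080p60 target purely for illustration:

```python
# Hypothetical 1080p @ 60 fps, 32-bit color, 4x MSAA.
w, h, fps, bytes_per_px, samples = 1920, 1080, 60, 4, 4

offchip_color_writes = w * h * samples * bytes_per_px * fps / 1e9   # ~2.0 GB/s
resolved_writes_only = w * h * bytes_per_px * fps / 1e9             # ~0.5 GB/s

print(offchip_color_writes, resolved_writes_only)
# Keeping the multisampled tile (and its Z) on chip means only the resolved
# pixels ever hit DRAM; the naive approach also pays for depth traffic on top.
```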
So that is a pretty decent GPU even from a desktop perspective. But why don't we see this being used in laptops or desktops? It doesn't seem hard to scale the Imagination PowerVR GX6650 up to NVIDIA GTX 650 level.
Imagination used to build graphics processors for the desktop; they were unable to compete with ATI, nVidia, Matrox, S3, 3dfx, NEC etc. Instead they shifted their focus to a niche, the low-power market. If only the other players had known how big that market would eventually grow.
Intel has also used PowerVR graphics chips for its IGPs in the past, like the GMA 3600 in the Intel Atom. In general they are far from ideal; they leave much to be desired in the drivers department. One of the earlier PowerVR chips in the Intel Atom still doesn't have its decoder functioning.
Imagination is losing the war for the exact same reason they lost last time: their tile-based rendering, which was only meant for low-end "embedded" chips. But the chips are becoming "desktop class" these days and need to work on much more advanced content at super high resolutions, and that's why Imagination's tile-based rendering will fail. Tile-based rendering is meant for simple operations, and that's where it shows its greater efficiency. The more complex those operations (the games) get, the harder it will be for the PowerVR architecture to keep up.
It used to be that their competitors couldn't even touch them. Now every single one will match or exceed their performance and features, and I imagine next year's 16nm FinFET Mobile Maxwell will leave it in the dust (wouldn't surprise me to see higher performance than Xbox One in it, or at least 1 TF).
You mean Imagination is still winning the war because everyone else only just realized there was a market in SoCs? Intel is only barely in the game, AMD isn't, and Mali and Adreno are the only real competitors in terms of unit share. Unlike the discrete GPU market, this market is tied to the success of your SoC, and PVR has a strong ally in Apple unless NVIDIA can convince Apple to license some of their GPU tech.
Apple ships something like 1 in 5 smartphones, 1 in 2 tablets, etc. They have a huge presence in the market right now. Qualcomm definitely ships more SoC, but their GPUs don't all sit in the high end performance space.
How do you figure that TBDR is only for simple operations? It actually excels at more complex pixel operations, because it defers most of the shading and texturing until after visibility has been solved.
Haven't you been saying the same thing for the past two years with the release of Tegra3 and Tegra4? And nVidia is still way behind.
Series 6 is out now in products you can actually buy. Tegra K1 isn't.
Series 6 XT will be out in the next year-ish. Tegra K1 will probably be out by then.
The follow-up to 6XT will probably be out in two years. Who knows when the next Tegra after K1 will actually be out?
Until there are actual, physical devices out there with an nVidia chipset that better the other actual, physical devices out there, you're just blowing smoke.
The reason PowerVR left the desktop market was simply that ST Microelectronics sold their graphics division. The KYRO was an extremely successful card and would have continued to be. IIRC, VIA tried to buy it up and carry on selling the KYRO 3 but could not reach a licence deal with STMicro, who claimed copyright on the chip design (the non-PowerVR parts).
They could... But Imagination is just like ARM: they don't build GPUs themselves, they only license the designs. It has been possible for years to license a PowerVR design and scale it up to an interesting desktop GPU. It's just that so far, no company has done that. Probably too big a risk to take, trying to compete with giants such as nVidia and AMD. The last desktop PowerVR cards mainly failed because of poor software support. Aside from the drivers not being all that mature, there was also the problem that many games made assumptions that simply would not hold on a TBDR architecture, and rendering bugs were the result. If you were to build a PowerVR-based desktop solution today, chances are that you run into quite a lot of incompatibilities with existing games.
I didn't want the word Apple in it, to refrain from trolls and flame wars, so I didn't write it out clearly the first time. The sole reason why PowerVR failed in the first place was their drivers, and the same reason applies to most other GPU companies that failed as well, much like S3. Drivers in the GPU market mean literally everything. It doesn't matter if your GPU is insanely great; if it doesn't run any of the latest games, and throws error upon error, it simply won't sell. Unlike CPUs, where you actually get down to programming the metal.
Nvidia famously pointed out that they have far more software engineers than hardware engineers. Writing decent-performing drivers takes time and money, hence why not many GPU manufacturers survive; most of them don't have enough resources to scale. The same goes for PowerVR. I still remember the Kyro graphics card I loved, right up until it didn't work on the games I wanted to play.
But this time it is different. The mobile market has already exceeded the PC market and will likely exceed the total GPUs shipped in PCs and consoles combined! And the drivers you write for iOS can in many cases effectively be reused on Mac OS X as well, which is why using PowerVR on the Mac makes an appealing case.
Maybe the industry leaders view tablets/mobile phones plus consoles as the next trend, while PC and Mac simply retreat from gaming?
"The sole reason why PowerVR failed in the first place were their Drivers"
As I said, it was not necessarily the drivers themselves. A nice example is 3DMark2001. Some scenes did not work correctly because of illegal assumptions about z-buffer contents. When 3DMark2001SE was released, one of the changes was that it now worked correctly on Kyro cards.
It is unclear where PowerVR stands today, since both their hardware and the 3D APIs and engines have changed massively. The only thing we know for sure is that there are various engines and games that work correctly on iPhone/iPad.
I wonder how AAPL will handle the FP16 cores... they are moving to 64-bit in CPUs, and they would have hoped to move to FP64 in the GPU... it would have given them a real talking point in the keynote for the iPad 6 (or whatever they call it): "next-gen 192-core FP64 GPU architecture... 4x graphics power", etc. etc. :P
Couple of small errors/typos: Under How Rogues Get Executed: Wavefronts & Superscalar ILP - the diagram should probably have 16, not 20 pipelines - looks like an extra row slipped in!
The page before: " With Series 6, Imagination has an interesting setup where there FP16 ALUs can process up to 3 operations in one cycle." There should read their.
Bottom of page 2 "And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unfired Shading Cluster." - Unfired should read Unified.
One question for Ryan: you said that having 12 ROPs is a little bit strange given the bandwidth constraints in the mobile world. In an earlier article (2011: http://www.anandtech.com/show/4686/samsung-galaxy-... ) Anand drew a picture explaining the bandwidth savings of a TBDR architecture: http://images.anandtech.com/reviews/smartphones/sa... Do you think that, taking these bandwidth savings into account, those 12 ROPs would make more sense? Thank you in advance, and sorry for my English.
If the front end runs at half the ALU clock, I wonder if the ROPs might also run at half clock? In that case it would make sense to put more of them in.
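To put rough numbers on why 12 ROPs sounds aggressive for mobile (the clock below is a placeholder, since Imagination doesn't quote one):

```python
rops, clock_mhz, bytes_per_px = 12, 600, 4          # 600 MHz is hypothetical

gpixels_per_s = rops * clock_mhz / 1000             # 7.2 Gpixels/s peak fill
color_write_gb_s = gpixels_per_s * bytes_per_px     # 28.8 GB/s at full rate

print(gpixels_per_s, color_write_gb_s)
# Running the ROPs at half the ALU clock, or soaking up most of those writes
# in the on-chip tile buffer, is what would bring this back within reach of a
# mobile memory bus.
```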
From everything I've seen so far, the PowerVR Series 5 was further ahead of the competition than Series 6 is right now. In fact Nvidia has already surpassed them, especially when you consider the full OpenGL 4.4 API support, and Adreno and Mali have become very competitive too; Mali T760 should also have around 380 GFLOPS of performance, along with hardware-assisted global illumination.
I think the days of PowerVR/Apple devices having higher GPU performance than competition are behind us, and it's for the best.
Architecture-wise, PowerVR seems more like AMD's VLIW than nvidia's Kepler (or G200 or Fermi or Maxwell). That means PowerVR is going to have the same issues AMD had with VLIW in general compute performance and ILP. There are also many interesting points that could be analysed:

1. AMD went from 5 computing ALUs to 4 to improve efficiency before switching to a completely new architecture (GCN). PowerVR went from 5 to 7 ALUs (if we count them all as separate units; are you sure it can process 16-bit instructions together with 32-bit ones, and not that those 32-bit units can each execute 2x16-bit instructions alternatively?)

2. PowerVR is using the same marketing approach AMD used to count their computing cores. AMD showed they had more computing cores than nvidia's competing architecture, but in the end, because they couldn't keep all of them fed, they were less efficient.

3. nvidia, going from desktop Kepler to mobile Kepler, removed ROPs and TMUs. So they probably think their architecture (and mobile GPUs in general) is less bottlenecked in those terms. PowerVR went on increasing them, so they possibly think ROPs and TMUs are more important than shaders... which is which? Are both of them trying to hide some deficiency of their respective architectures?

4. We do not really know anything about PowerVR's geometry power. nvidia's Kepler SMX has special function units (the PolyMorph Engine) connected directly to the shaders. That seems to give an enormous boost to geometry performance (especially tessellation) that scales with the number of active SMXes. PowerVR seems to have chosen AMD's implementation, with an off-core tessellator that does not scale automatically. How is PowerVR going to behave with future games that may need more geometry performance?

5. Again, as someone already asked, tile-based rendering was used on the desktop but was soon abandoned, as it could not give any real advantage over the raw power of other architectures, which grew much faster than PowerVR could optimize their algorithms, making tile-based rendering less and less profitable. What makes that scenario different from what we are witnessing now, when mobile resolutions are growing to be even bigger than desktop monitors and game complexity is going to increase with the arrival of these really powerful GPUs (K1 in primis)?

6. We lack a die area comparison. How big is a 6-module Rogue with respect to nvidia's K1? If it is, just to say, twice nvidia's die area, that would be a problem even though the power consumption is the same. If it is half, that would mean PowerVR could deliver double K1's performance (if we believe Rogue's 192 shaders perform like Kepler's 192). That would mean nvidia is in trouble just before the high-end socket race begins.

7. It seems PowerVR is behaving a bit like 3DFX did back in the day, until it died. They were pushing their advanced but old technology to the extreme, so they rendered at 16 bits instead of 32, used a 16-bit Z-buffer instead of 24, and many more "tricks" that were meant to hide what was quite clear: 3DFX didn't have the right architecture to compete with newer companies like nvidia and ATI, which started out on the right foot with much more powerful architectures (the TNT2 simply destroyed the Voodoo3 on all points, and beware, I was an unfortunate Voodoo3 owner).

Will PowerVR come to the same end, trying to force the use of obsolete techniques while all the other competitors are clearly betting on constantly increasing raw power with no trade-offs (or minimal ones)?
AMD's shift from VLIW5 to VLIW4 was driven by the decline of DX9. DX9 was explicitly designed around a 5-step path; VLIW5 was tied directly to that. DX10's more flexible workflow rarely allowed for a 5-wide execution path.
For VLIW4, AMD tied functional units together more than Imagination appears to have done here. They have 4 normal ALUs that match up with the 4x 16-bit ALUs in Rogue; but to do a special-function operation they used 3 of the 32-bit ALUs instead of dedicated hardware. The tradeoff was that a special function cost a lot more normal processing capacity than it did before. PowerVR doesn't appear to have put enough general-purpose computing power in place to do the same, and is required to use a dedicated SFU by default (even assuming they felt the tradeoff was worth it, as AMD did).
The main thing I'm curious about is if the 16 and 32bit ALUs are separate hardware; or if they implemented them similar to how SSE/AVX are done on x86 where the same hardware can do 2 32 (16) bit or 1 64 (32) bit operation.
"The main thing I'm curious about is if the 16 and 32bit ALUs are separate hardware; or if they implemented them similar to how SSE/AVX are done on x86 where the same hardware can do 2 32 (16) bit or 1 64 (32) bit operation."
They're separate hardware. Just as how NVIDIA uses separate FP32 and FP64 CUDA cores.
We're nothing like VLIW4/5, mobile Kepler still has ROPs and texture hardware, the area is absolutely nowhere near where you think it is and the architectural features we have in the front-end remain class leading and entirely sensible for mobile.
Sorry, maybe I was not that clear. I didn't mean they removed ROPs and TMUs completely; I was hinting at the fact that they decreased their number in a mobile SMX with respect to a desktop SMX. ROPs are tied to memory channels, and that may be the cause. But TMUs are not, so they could be the same number as in the desktop implementation. It seems nvidia sees that many ROPs and TMUs as bottlenecked by RAM bandwidth, so they save space and power by not adding them. PowerVR, on the contrary, has a much higher ROP and TMU ratio with respect to shaders (or computing cores). One of them made the wrong assumption (also tied to the memory controller width, which can be as wide as you want but costs die size and power). I'm curious to know which.
Ah, I see. Our ALU:TEX:ROP is different to Kepler (and again to Maxwell), yes. We're focused on still being strong for the basics (texturing, pixel fill) while still having a lot of shading to go with it. I can't speak for NV's design choices, just that both have pros and cons depending on market.
The rest of your comment still has a lot of problems in respect to the PowerVR Rogue architecture and how it works, how it works in mobile, and how it compares to K1 and pre-GCN AMD.
"Architecture wise, PowerVR seems more alike AMD's VLIW then nvidia's Kepler (or G200 or Fermi or Maxwell). That means PowerVR is going to have the same issues AMD had with VLIW and general computing performances and ILP."
To be honest I had the same thought at first. We've known that Rogue has multiple slots per pipeline since the Apple A7 came out, so when I first heard that I had the same thought. Given the greater simplicity of mobile SoCs, it would certainly make sense.
That said, after finally having access to IMG's technical details, it's clear to me that this is not the case, which was part of the reason I was so excited to work on this article. It's sort of like Fermi and it's sort of like VLIW5, but in reality it's neither.
The most important point is that in AMD's VLIW designs they had 4/5 ALUs all alike (for the sake of this discussion we'll ignore the T-unit). So to maximize a Streaming Processor's utilization, you needed to be able to extract a full 4-5 instructions of ILP out of a thread. Which was easy to do under DX9 (RGBA + lighting) and a lot harder to do under DX10.
Rogue on the other hand doesn't have ILP requirements nearly as high due to the fact that the 6 ALUs are not identical and are rarely all going to be in use at once (we don't even count the FP16 units in our GFLOPs calculations). They do have ILP requirements, unlike GCN, but for FP32 it's only 2 instructions for the 2 FP32 ALUs. This is in fact rather similar to Kepler (but not Maxwell) in that NVIDIA has a similar reliance on ILP to keep all of their CUDA cores fed. Half of the threads on Kepler need to co-issue another FP32 op to fill the other 64 CUDA cores in an SMX; Rogue is a bit worse in this regard since every thread needs to co-issue to fill every second FP32 ALU.
FP16 on the other hand is trickier, since that's a full 4 ALU setup. Worst case scenario is that IMG needs to pull off 4 instructions of ILP to maximize their utilization, but this is a bit murkier since we don't know why Series 6 had the unusual 3 operator FP16 ALUs in the first place. As such I'm less familiar with where FP16 is being used in mobile today, so it's harder to draw comparisons for what FP16 utilization may be like. That said, there's also the unknown of die size and power requirements of using FP16 units for FP16 math versus using FP32 units for the same task. I'm not sure if IMG has reason to be worried about FP16 utilization if they can pack 2x as much hardware in the same die size and power envelope.
Ultimately I'd classify Rogue as being closer to Fermi/Kepler than VLIW, which is why those are the comparisons we went with in the article. The 2 wide FP32 pipeline isn't nearly as narrow as AMD's VLIW, and the instructions themselves aren't the inflexible chaos that was VLIW as a language.
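A toy way to see the co-issue point from a couple of paragraphs up; this is my own simplification, not IMG's or NVIDIA's numbers:

```python
def fp32_utilization(issue_width, coissue_rate):
    """Fraction of FP32 ALUs kept busy when each thread always issues one
    FP32 op per clock and finds an independent second op to co-issue
    `coissue_rate` of the time (0.0 - 1.0)."""
    issued = 1 + coissue_rate * (issue_width - 1)
    return issued / issue_width

# Rogue-style pipeline: 2 FP32 ALUs, so every thread must co-issue to hit 100%.
print(fp32_utilization(2, 0.0))   # 0.5  -> no ILP found, half the ALUs sit idle
print(fp32_utilization(2, 0.7))   # 0.85 -> a second op found 70% of the time
```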
"Again, as someone altready asked, tile based rendering was used on the desktop but was soon abandoned as it could not give any real advantages over the raw power of other architectures that grew much faster that what PowerVR could optimize their algorithms, making tile based rendering less and less profitable. What makes that scenario different that what we are witnessing in this period where mobile resolutions are growing to be even bigger than desktop monitors and that games complexity is gonig to increase for the arrival of these really powerful GPUs (K1 in primis)?"
One of the problems IMG faced in the old days was that DirectX and Windows weren't very well suited for their TBDR design; they pretty much had to fight the API at times to get what they wanted. For iOS/Android it's difficult to draw comparisons - though I'd note iOS has always been driven by IMG GPUs and hence always used TBDR - but Windows for its part has since gotten much better. In particular there are API hooks to allow applications to see if the GPU is TBDR. I'm not sure if that's enough, but it does mean things have changed at least a little bit since the old days.
I didn't realize that in counting those GFLOPS you ignored the 16-bit ALUs. Issuing two instructions should be much easier than issuing 3, 4 or even 5, not to speak of 6 or 7. However, I bet that PowerVR's next architecture (or the one after that) will remove those 16-bit ALUs and introduce a couple of units able to issue one 32-bit or two 16-bit instructions, so that they can pack more shaders into the same area.
About the TBDR design... the alternative to DirectX is OpenGL. Is it more suited to a TBDR architecture than "brute force" ones? Still, I perceive the PowerVR architecture as something from the past that has survived until now, when the big players have entered the mobile game for real. Kepler is a very efficient architecture, and Maxwell has demonstrated that it can be even better. How is PowerVR going to fight against an architecture as flexible as nvidia's, which can also be used for CUDA computing and thus be adopted into markets (and for other tasks) that PowerVR cannot reach with their current architecture? Not to forget that nvidia can now easily bring to their mobile parts whatever engine exists for their desktop GPUs. Will extreme (but not so flexible) efficiency win against something not quite as efficient but able to do many more things in an easier way? Will mobile game engines bet more on compute shader power or on memory bandwidth? Will new DX10/DX11-class engines (whose features are supported by the new architectures) still be TBDR-friendly? Does the TBDR design still scale for today's ultra-high-resolution displays, or will "brute force" (simply more power, more performance) win out, as it did on the desktop?
"the alternative to DirectX is OpenGL. Is it more suited for TBDR architecture than "brute force" ones?"
In theory, D3D used to be more suited to TBDR than OpenGL, because it had explicit BeginScene() and EndScene() markers. But those have been dropped after D3D9. I can't really think of anything off the top of my head that would make one API more suited than the other these days. They're both very similar: just bind your textures/buffers/shaders, update the constants, and fire off your geometry.
"How is PowerVR going to fight against an architecture as flexible as nvidia ones that can also be used for CUDA computing and thus being adopted into markets (and for other tasks) PowerVR cannot with their current architecture?"
PowerVR supports OpenCL, so they too can play the GPGPU game. It all depends on who delivers the best package in terms of features, performance, power usage, etc.
> Again, as someone already asked, tile-based rendering was used on the desktop but was soon abandoned, as it could not give any real advantage over the raw power of other architectures, which grew much faster than PowerVR could optimize their algorithms, making tile-based rendering less and less profitable. What makes that scenario different from what we are witnessing now, when mobile resolutions are growing to be even bigger than desktop monitors and game complexity is going to increase with the arrival of these really powerful GPUs (K1 in primis)?
It is a common misconception that PowerVR's desktop parts were not competitive or had compatibility problems. The last card they produced, the Kyro II, was actually very competitive with the offerings from both NVidia and ATi. The claims of incompatibility were largely unfounded marketing FUD from competitors, with later drivers running the majority of content without problems. Further, they did not leave the market because they could not compete on performance; instead their partner at the time, STM, decided to pull out of the market for unstated reasons, although this was most likely due to them not being able to invest the amount of money needed to take the high ground.
Mobile is VERY different from the desktop space. NV and ATi were able to brute-force their way to the top, as both power consumption and memory bandwidth had extremely wide envelopes; this is not the case in the mobile space. With the Kyro II, PowerVR demonstrated that they were able to compete with considerably higher-specification parts (for clock and memory BW) from both NV and AMD, with considerably lower memory BW and power requirements. Although NV and AMD have evolved, so has PowerVR, so there is no reason to assume that they don't still have advantages.
> It seems PowerVR is behaving a bit like 3DFX did back in the day, until it died. They were pushing their advanced but old technology to the extreme, so they rendered at 16 bits instead of 32, used a 16-bit Z-buffer instead of 24, and many more "tricks" that were meant to hide what was quite clear: 3DFX didn't have the right architecture to compete with newer companies like nvidia and ATI, which started out on the right foot with much more powerful architectures (the TNT2 simply destroyed the Voodoo3 on all points, and beware, I was an unfortunate Voodoo3 owner).
This simply makes no sense in the context of the market these cores are being targeted at. At the fundamental level, the primary API currently used in mobile is OGLES 2.0, which does not mandate anything higher than FP16 within fragment shaders. This means that the vast majority of current mobile content only uses FP16 in fragment shaders; in those circumstances, do you think it makes sense to throw area at higher-precision paths? Of course it doesn't! Further, it's not like the PowerVR architecture looks like a slouch at FP32.
At the end of the day, the truth will only be seen in benchmarks on actual devices, not in marketing claims and FUD from various companies.
BTW, you do realise that NV runs many of these benchmarks at 16-bit Z and a 16-bit frame buffer, don't you? They do this because they become even less competitive when forced to use 32 bits. So who is actually using old "technology" to hide real deficiencies in their architecture?
If you can't see a difference due to using 16 bit somewhere along the rendering path, then it's actually the smart thing to do. It saves power, which can be better used elsewhere (higher clocks). Well, that was actually 3DFX's argument for why Glide with 16 bit + dithering was the smart choice back then. But "they've got the bigger bits" won (along with other advantages of the early nVidias).
OpenCL tests on a GPU that isn't even in production? Before you say "test it on the iPhone 5S": there are no public OpenCL libraries available on iOS, as far as I know.
While this is surely important, the raw number of registers doesn't tell you that much either. In fact, it could be very misleading. How many registers you need first and foremost depends on the out-of-order window (in CPUs) or, here, the number of threads in flight, which is something they didn't tell us either. Also, different cache sizes and latencies would determine how bad it is to run out of register space.
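A back-of-envelope version of that point, with made-up numbers since the register file size hasn't been disclosed:

```python
regfile_regs_per_pipeline = 1024   # hypothetical 32-bit registers available
regs_per_thread           = 32     # whatever the shader compiler allocated

threads_in_flight = regfile_regs_per_pipeline // regs_per_thread
print(threads_in_flight)           # 32 threads available to hide memory latency

# Halve the per-thread register use and you double the threads that can cover
# a cache miss, which is why a raw register count on its own says very little.
```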
Flexible hardware in the context I used it would mean programmable.
TMUs aren't flexible. They fetch texels, apply filtering, and that's it.
Shaders are flexible. They accept threads of instructions and the result on a pixel/vertex will be whatever the program dictates, as opposed to a fixed outcome.
I believe your math is off in the last table. A GTX 650 would produce 230.4 GFLOPS @ 300MHz, not 330.4. It's interesting to me that these mobile designs are so close to full desktop performance (albeit at the low 300MHz clock). But memory bandwidth, power, and clocks will always hold these SoCs back in the real world. Thanks for the great article, as always.
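The arithmetic, for anyone checking along (GTX 650 is GK107 with 384 CUDA cores, and an FMA counts as 2 FLOPs):

```python
cuda_cores, flops_per_core_per_clk, clock_mhz = 384, 2, 300
print(cuda_cores * flops_per_core_per_clk * clock_mhz / 1000)   # 230.4 GFLOPS

# Counting the GX6650 the way the article does (FP32 ALUs only, FP16 ignored)
# lands in the same place: 6 USCs x 32 pipelines x 2 FP32 ALUs x 2 FLOPs.
print(6 * 32 * 2 * 2 * 300 / 1000)                              # 230.4 GFLOPS
```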
I love these kinds of articles, simply because of the explanations of how various hardware and underlying systems work. Probably the best part of Anandtech reviews.
Gotta give Imagination Tech some credit. They have the highest-performing GPUs in the SoC market. Very flexible too: they make 2-core, 3-core, 4-core and 6-core cluster versions. If only we could have the best CPU performance (Qualcomm's Snapdragon Krait CPUs) mixed with the best GPU (the 6-core PowerVR 6XT) on the same SoC. Though nvidia's dual-core Denver design, which throws away all the extra cores and devotes more die space to 2 higher-performing, more complex cores, might take the CPU crown, since there aren't many workloads on a phone that need more than 2 threads going.
Are you sure that Rogue is superscalar rather than VLIW?
Briefly, the difference between the two comes down to when independent instructions (ones that can execute in parallel) are identified. In a VLIW design it happens at compile time, while in a superscalar design it happens at runtime. It would actually surprise me if Rogue does runtime dependency analysis for such a wide backend. If I had to bet, I'd say "VLIW".
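A toy illustration of that distinction, using a made-up three-address ISA that has nothing to do with Rogue's actual instruction format:

```python
# (op, dest, src1, src2)
insts = [("mul", "r2", "r0", "r1"),   # r2 = r0 * r1
         ("add", "r3", "r0", "r1"),   # independent of the mul
         ("add", "r4", "r2", "r3")]   # reads r2, so it must wait

def independent(a, b):
    """True if b doesn't read a's destination (RAW hazard check only)."""
    return a[1] not in b[2:]

# VLIW: the *compiler* packs provably independent ops into one wide bundle
# ahead of time (simplified here: only checking against the first op).
bundle = [insts[0]] + [i for i in insts[1:] if independent(insts[0], i)]
print([i[0] for i in bundle])   # ['mul', 'add'] issue together; the last add waits

# Superscalar: the *hardware* re-does this same dependency check every cycle
# at runtime, which costs scheduler logic but adapts to dynamic behaviour.
```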
Is there a similar article comparing Nvidia to AMD? I've seen block diagrams of Nvidia chips on their website but haven't found any for AMD. Even if I did, an article like this one would be better than me trying to make inferences and decode the vendor spin. I want to buy a compute engine, and I keep getting the impression AMD's offering is better, but I would like to be convinced by a technical discussion rather than stats about game performance and unpacking textures.
I wonder how many instructions they can dispatch per clock. That's a significant factor when discussing how to feed up to 7 execution units. Actually, I'd be surprised if it's more than 2, maybe 3 under special circumstances... which would make me wonder how they're feeding 4 FP16 ALUs. But then I also wonder if these are truly 4 independent units... I guess not.
Re tile-based rendering: the Mali guys outlined on their blog their thought process on why they chose it.
Ultimately it is all about power. TBR allows them to avoid memory reads and writes by keeping the color and Z-buffers on chip. They also keep the multisample buffer on chip, which allows them to do 16x MSAA at very low power cost.
The biggest power hog, though, is textures, which still reside in memory. This means that complex or large textures will basically overwhelm any power savings. Basically, TBR becomes less efficient when pushed to the max of its capabilities, as in many benchmarks.
Oh, I also wanted to note that the ARM Mali blog announced that they were going to talk about their shader architecture in an upcoming blog post, the last part in a series about the latest Mali and OpenGL ES compliance. The previous part talked about tile-based rendering and how to get power savings from it.
The interesting things I've learned from the blog include some great stuff not even discussed by Imagination.
Mali has a way of comparing the current state of the framebuffer with incoming tiles; if they are identical, then the new tile is discarded instead of taking power to write it to the framebuffer (a rough sketch of the idea is below).
Also, texture compression can save a lot of memory.
The biggest eye-opener, though, was that the way Android handles draw calls means that both the CPU and GPU are starved for work, and DVFS isn't responsive enough or granular enough to allow for power gating. It seems the problem has to do with screen orientation and Android preventing ideal asynchronous processing of work.
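And here is the rough shape of the tile-comparison trick mentioned above (ARM calls it Transaction Elimination); the real hardware uses a per-tile CRC-style signature, and this is just a sketch of the idea:

```python
import zlib

def write_tile(tile_signatures, tile_index, tile_bytes, dram_write):
    """Skip the DRAM write when the tile's contents haven't changed."""
    sig = zlib.crc32(tile_bytes)
    if tile_signatures.get(tile_index) == sig:
        return False                     # identical tile: no write, power saved
    tile_signatures[tile_index] = sig
    dram_write(tile_index, tile_bytes)   # burst-write the changed tile out
    return True

# Example: the second submission of an unchanged tile is dropped.
sigs = {}
print(write_tile(sigs, 0, b"\x00" * 1024, lambda i, b: None))   # True
print(write_tile(sigs, 0, b"\x00" * 1024, lambda i, b: None))   # False
```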
jjj - Monday, February 24, 2014 - link
Far from ideal timing with so many news around, guess i'll have to read it after MWC.rpg1966 - Monday, February 24, 2014 - link
Thank you for sharing.Mondozai - Monday, February 24, 2014 - link
The most important part is what Anand highlighted from this walkthrough; namely that the Rogue series has a chip that is on equal balance if not even stronger than Tegra K1.Poor Nvidia.
Sabresiberian - Monday, February 24, 2014 - link
That remains to be seen - but it does appear to be competitive at this point. We also don't know what the entire Denver+K1 package will do.What kind of surprises me about K1 though is that Maxwell has already been released for the desktop. I would think "M1" (to guess at a name) would be the architecture to build on in the next year.
dragonsqrrl - Monday, February 24, 2014 - link
Uhhh no it doesn't. The article failed to mention this important piece of information, but you realize the 6XT series probably won't come to market before 2015 right? Likely 2H 2015 according to an earlier article published here on Anandtech.grahaman27 - Monday, February 24, 2014 - link
Interesting. Which article mentioned that? Would you mind linking me to it?I can't find an expected release date for this chip anywhere.
dragonsqrrl - Tuesday, February 25, 2014 - link
That's because there is none. It was an estimate given by Ryan Smith based on prior Imagination GPU announcements, and the time it usually takes chipmakers to integrate the design and bring a device to market."Finally, while Imagination doesn’t provide a timeframe for consumer availability (since they only sell designs to chipmakers), based on the amount of time needed to integrate these designs into new products and then get those products in the hands of consumers, we should be looking at a timetable similar to the original Series6 designs. In which case Series6XT equipped SoCs would start appearing in 2015, likely in the latter half."
http://www.anandtech.com/show/7629/imagination-tec...
michael2k - Tuesday, February 25, 2014 - link
Really? You don't think their biggest customer, Apple, which has shown the ability to beat an entire industry to market by almost a year (64 bit ARMv8) and one of the first PVR6 customers to market as well? Anand though it would be 2014 when PVR6 would show up to market, and the A9600 that was supposed to show up in 2013 never did (Apple's A7 did though!)So why do you rule out the real possibility that the Apple A8 would ship with a 6XT this year?
dragonsqrrl - Tuesday, February 25, 2014 - link
I don't know, ask Ryan Smith.My theory for the ~18 month estimate is that it's about how long Apple took to integrate and bring their current series 6 GPU to market in the A7. I suppose if any company had the resources to accelerate that schedule, it would be Apple. But then the question becomes why and does it make sense? There will be faster Series 6 SKU's available for the A8 in the interim.
stingerman - Wednesday, February 26, 2014 - link
But don't forget, Apple has been involved in this design long before Imagination's public reveal. In fact, Apple informed Imaginations design with their real world experience and their own needs. I always expect Apple to get a 6 to 12 month lead. And, Apple has shown themselves to put a very high value on their SoC GPU leadership.dragonsqrrl - Tuesday, February 25, 2014 - link
... and the 64-bit architecture in the A7 is a completely different story. Apple wasn't first to market with 64-bit because they had an accelerated development schedule compared to other chipmakers. Apple was first to market because they started development first, ahead of everyone else.Apple doesn't develop Series 6, they license it, and we know when it becomes available for integration and when devices based on it come to market. We also know how long it usually takes between tapeout of an SOC, production, and final availability in devices, and based on this it would be very difficult if not impossible for Apple to put a 6XT in the A8 if they keep to their regular release schedule. I think going from an ~18 to ~8 month schedule is a bit much even for Apple, especially considering the new process shrink.
michael2k - Wednesday, February 26, 2014 - link
It's available for license now. Doesn't that normally mean that, as soon as the computer finishes synthesizing the design, it can be taped out now as well? The issue then is how much work needs to be done to get the design to work at the desired power envelope and clock, and if said work is worth it.Samus - Monday, February 24, 2014 - link
Considering the highly parallel nature of graphics processing, PowerVR's low core count and non-linear arrangement will make then weak in real-world gaming but strong in synthetic GFLOPS since their cores are stronger.For example, ATI has the weakest GPU cores of everybody but they cram over 2000 onto a GPU, seems to be doing pretty good for their performance.
michael2k - Tuesday, February 25, 2014 - link
Your example is irrelevant since ATI has no mobile solution and PowerVR has been the strongest in real world gaming since, well, until the K1 ships, PowerVR has had no real competition.Adreno is technically competent, but Qualcomm isn't seeking to push their transistor budget too high.
extide - Friday, February 28, 2014 - link
Wow, so you basically didn't read the article at all did you? This nonsense is exactly what this article is trying to prevent. Lol.Anders CT - Tuesday, February 25, 2014 - link
And what chip is using a GX6650 GPU? None existing that I know of.The Tegra K1 is a chip. Kepler and PowerVR 6XT is GPU architectures. And kepler has been around for several years.
dragonsqrrl - Wednesday, February 26, 2014 - link
Not sure how to respond to this, other than you completely misinterpreted my comment.dragonsqrrl - Wednesday, February 26, 2014 - link
Apologies, you weren't responding to my comment. I was wondering why it made no sense in the context of what I said.The comments section on Anandtech makes it really difficult sometimes to see who's responding to who, especially when it gets really long like this.
Sonicadvance1 - Monday, February 24, 2014 - link
So they have 2x Float32 ALU cores, 4x Float16 ALU cores, and a SFU core.This has no mention of Integer cores.
Am I to assume that integers won't run on the F32/F16 cores but instead of the SFU core so using integers will be 1/6th the speed of floats?
Seems like a large drawback, Mali and Adreno both run integers at the same speed as 32bit floats.
ryszu - Monday, February 24, 2014 - link
Integer happens on the F32 hardware.Sonicadvance1 - Tuesday, February 25, 2014 - link
Thanks for the response. Good to know. The article didn't really note anything about it.ryszu - Tuesday, February 25, 2014 - link
I misspoke actually, integer is a separate pipe in Rogue.Sonicadvance1 - Thursday, February 27, 2014 - link
Alright, then how much slower is Integer performance compared to floating point? Integer performance is an area that Nvidia struggles with as well.MrSpadge - Saturday, March 1, 2014 - link
This sounds different from any material Ryan showed or discussed. Could you elaborate, may directly to Ryan and have him update the article?Frenetic Pony - Monday, February 24, 2014 - link
I'm not sure what exactly is being babbled on about with tile based deferred rendering. It's just software, anyone can write and run it. Go onto a friendly GPU programming forum and they'll take you through it step by step.Scali - Monday, February 24, 2014 - link
Deferred rendering is a software solution. Tile-based deferred rendering is a hardware solution. The GPU cuts up the triangles in a set of tiles. Inside the GPU, there is a superfast 'framebuffer' the size of a tile (think of a special L1-cache). The GPU renders one tile at a time into this buffer, solving overdraw very quickly and efficiently, then it burst-writes the tile out to the framebuffer in videomemory. PowerVR has been using this technology since the early days of 3D acceleration (I have a PowerVR PCX2 card myself, and did a blog on it a while ago: http://scalibq.wordpress.com/2012/12/18/just-keepi...I suggest you read up on it, it is very interesting technology, and unlike any competing GPU.
Frenetic Pony - Monday, February 24, 2014 - link
No it isn't. Anyone with the proper feature set can do tile based deferred, most next gen games are going to be culling light lists out on something like an 8x8 pixel per tile basis, whether that's for forward rending or deferred. Which sounds exactly like what you described.It might be nice that there's some special little cache for it in PowerVr. But the basic idea as you've described it sounds exactly the same in principle as what DICE/EA's Frostbite does, as well as any number of other papers and games coming do.
Scali - Monday, February 24, 2014 - link
No, I don't think you quite get it. Culling lights in tiles is something different.In this case the geometry is batched up before drawing, then binned to tiles, and then the visibility (z-order) is solved on a per-tile basis.
It may sound the same as deferred rendering tricks in software, but it is not quite the same. These software tricks depend on multiple rendering passes, with z/stenciltesting to determine which pixels to shade. PowerVR can do it in a single pass (as far as the software is concerned).
Again, I suggest you read up on it.
Scali - Monday, February 24, 2014 - link
In fact, the PowerVR PCX2 card did not even need a z-buffer in videomemory at all. What "feature set" on a regular GPU would be able to render properly without a z-buffer?MrPoletski - Sunday, March 9, 2014 - link
Exactly, the Z-buffer is on chip. Incidentally, Multi-sampling AA increases your Z-buffer and framebuffer bandwidth requirements by a factor of x (for 4x AA). What if that were all on chip?I can't believe IMGTEC haven't made more noise about this.
MrPoletski - Sunday, March 9, 2014 - link
Factor of 4X, where is the edit button?iwod - Monday, February 24, 2014 - link
So that is a pretty decent GPU even from Desktop perspective. But Why we dont see this being used on Laptop or Desktop? It doesn't seem hard to scale the Imagination PVR GX6650 to NVIDIA GTX 650 level.StevoLincolnite - Monday, February 24, 2014 - link
Imagination used to build graphics processors for the Desktop, they were unable to compete with ATI, nVidia, Matrox, S3, 3dfx, NEC etc'. - Instead they shifted their focus to a niche market, the low-powered market, if only the other players knew how big that market would eventually grow to.Intel has also used PowerVR graphics chips for it's IGP's in the past like the GMA 3600 in the Intel Atom.
In general, they are far from ideal, they leave much to be desired in the drivers department.
One of the earlier PowerVR chips in the Intel Atom still doesn't have it's decoder functioning.
Krysto - Monday, February 24, 2014 - link
Imagination is losing the war for the exact same reason they lost the last time - their tile-based rendering, that was only meant for low-end "embedded" chips. But the chips are becoming "desktop class" these days, and need to work on a lot more advanced content with super high resolutions - and that's why Imagination's tile-based rendering will fail. Tile-based rendering is meant for simple operations, and that's where it shows its greater efficiency. The more complex those operations (the games) get the harder it will be for the PowerVR architecture to keep up.It used to be that their competitors couldn't even touch them. Now every single one will match or exceed their performance and features, and I imagine next year's 16nm FinFET Mobile Maxwell will leave it in the dust (wouldn't surprise me to see higher performance than Xbox One in it, or at least 1 TF).
michael2k - Monday, February 24, 2014 - link
You mean Imagination is still winning the war because everyone else only just realized there was a market in SoC? Intel is only barely in the game, AMD isn't, and Mali and Adreno is the only real competitor in terms of unit share. Unlike GPU, this market is tied to the success of your SoC, and PVR has a strong ally in Apple unless NVIDIA can convince Apple to license some of their GPU tech.Apple ships something like 1 in 5 smartphones, 1 in 2 tablets, etc. They have a huge presence in the market right now. Qualcomm definitely ships more SoC, but their GPUs don't all sit in the high end performance space.
ryszu - Monday, February 24, 2014 - link
Our TBDR front-end is absolutely not just designed for simple operations. Pure FUD, it scales very well.Scali - Monday, February 24, 2014 - link
How do you figure that TBDR is only for simple operations? It actually excels at more complex pixel operations, because it defers most of the shading and texturing until after visibility has been solved.khanov - Monday, February 24, 2014 - link
Pure nonsense. Go back to the kiddy table.Tile-Based Deferred Rendering eliminates overdraw and the performance gains it achieves INCREASE with scene complexity.
phoenix_rizzen - Friday, February 28, 2014 - link
Haven't you been saying the same thing for the past two years with the release of Tegra3 and Tegra4? And nVidia is still way behind.Series 6 is out now in products you can actually buy. Tegra K1 isn't.
Series 6 XT will be out in the next year-ish. Tegra K1 will probably be out by then.
The follow-up to 6XT will probably be out in two years. Who knows when the next Tegra after K1 will actually be out?
Until there's actual, physical devices out there with an nVidia chipset in it that betters the other, actual, physical devices out there, you're just barking smoke.
MrPoletski - Sunday, March 9, 2014 - link
You are completely wrong!MrPoletski - Sunday, March 9, 2014 - link
The reason PoewrVR left the desktop market was simply because ST Microelectronics sold their graphics division. The KYRO was an extremely successful card and would have continued to be. IIRC Via tried to buy it up and carry on selling the KYRO 3 but could not reach a licence deal with STMicro - who claimed copyright on the chip design (the non powervr parts)Scali - Monday, February 24, 2014 - link
They could... But Imagination is just like ARM: they don't build GPUs themselves, they only license the designs.It has been possible for years to license a PowerVR design and scale it up to an interesting desktop GPU. It's just that so far, no company has done that. Probably too big a risk to take, trying to compete with giants such as nVidia and AMD.
The last desktop PowerVR cards mainly failed because of poor software support. Aside from the drivers not being all that mature, there was also the problem that many games made assumptions that simply would not hold on a TBDR architecture, and rendering bugs were the result.
If you were to build a PowerVR-based desktop solution today, chances are that you run into quite a lot of incompatibilities with existing games.
iwod - Monday, February 24, 2014 - link
I didn't want the word Apple in it to retrain from trolls and flame war, so i didn't write it out clearly the first time.The sole reason why PowerVR failed in the first place were their Drivers And the same reason why most other GPU company failed as well. Much like S3. Drivers in the GPU market means literally everything. It doesn't matter if their GPU is insanely great if it doesn't run any of the latest games and error upon error it simply wont sell. Unlike CPU which you actually get down to the mental programming.
Nvidia famously pointed out they have much more software engineers then Hardware. Writing a decent performing drivers takes time and money. Hence why not many GPU manufacturer survive. Most of them dont have enough resources to scale. Same goes with PowerVR. I still remember my Kyro Graphics Card I love, until it doesn't work on games I want to play.
But this time it is different. The Mobile Market has already exceed the PC market and will likely exceed the total GPU shipped in PC + Console Combined! Since the drivers you are writing for Mobile iOS can in many case effectively be used on MacOSX as well. That is why using PowerVR on Mac makes an appealing case.
May be the industry leader view Tablet / Mobile Phone + Console being the next trend, while PC & Mac will simply relinquish from Gaming?
Scali - Tuesday, February 25, 2014 - link
"The sole reason why PowerVR failed in the first place were their Drivers"As I said, it was not necessarily the drivers themselves. A nice example is 3DMark2001. Some scenes did not work correctly because of illegal assumptions about z-buffer contents. When 3DMark2001SE was released, one of the changes was that it now worked correctly on Kyro cards.
It is unclear where PowerVR stands today, since both their hardware and the 3D APIs and engines have changed massively. The only thing we know for sure is that there are various engines and games that work correctly on iPhone/iPad.
Sushisamurai - Monday, February 24, 2014 - link
Typo: "one thread per shader care, which like the shader cores are grouped together into what we call wavefronts." Should be shader core?In "Background: how GPU's work"
Ryan Smith - Monday, February 24, 2014 - link
Indeed it was. Thank you for pointing that out.chinmaythosar - Monday, February 24, 2014 - link
i wonder how AAPL will handle the FP16 cores ... they are moving to 64bit in CPUs and they would have hoped to move to FP64 in GPU ... it would have given them real talking point in the keynote for iPad 6 (or whatever they call it) .. " next-gen 192 core GPU FP64 architechture ..4x graphics power etc etc .. :PMrSpadge - Saturday, March 1, 2014 - link
Not sure what AAPL is, but pure FP64 for graphics would be horrible. You don't need the precision but waste lot's of die space and power.xeizo - Monday, February 24, 2014 - link
They would be more competitive and interesting to use if they published open drivers instead of "open architecture" pics .... :(Eckej - Monday, February 24, 2014 - link
Couple of small errors/typos:Under How Rogues Get Executed: Wavefronts & Superscalar ILP - the diagram should probably have 16, not 20 pipelines - looks like an extra row slipped in!
The page before: " With Series 6, Imagination has an interesting setup where there FP16 ALUs can process up to 3 operations in one cycle." There should read their.
Bottom of page 2 "And with that behind us, we can now take a look at the PowerVR Series 6/6XT Unfired Shading Cluster." - Unfired should read Unified.
Sorry to be picky.
Ryan Smith - Monday, February 24, 2014 - link
Aww geeze. I can't believe we put a whole extra row in there...Thank you for pointing that out. It has been corrected (along with everything else you mentioned).
boostern - Monday, February 24, 2014 - link
One question for Ryan: you said that having 12 ROPs is alittle bit strange given the bandwidth constraints in the mobile world. In an earlier article (2011 http://www.anandtech.com/show/4686/samsung-galaxy-... ) anand draw a picture explaing the savings in term of bandwidth of a TBDR architecture: http://images.anandtech.com/reviews/smartphones/sa...Do you think that taking into account this bandwidth savings, those 12 ROPs would make more sense?
Thank you in advance and sorry for my english.
ryszu - Monday, February 24, 2014 - link
The fillrate we have is largely agnostic of the TBDR and its bandwidth savings. It's there for high resolution UIs more than anything else.boostern - Monday, February 24, 2014 - link
Thank you Rys.Are you the Rys of Beyond 3D that works for IMGTec?
ryszu - Monday, February 24, 2014 - link
Guilty as charged!
boostern - Monday, February 24, 2014 - link
What an honour :)
MrPoletski - Sunday, March 9, 2014 - link
By the way, how does the PowerVR architecture do at cryptocoin mining?
MrSpadge - Saturday, March 1, 2014 - link
If the front end runs at half the ALU clock, I wonder if the ROPs might also run at half clock? In that case it would make sense to put more of them in.
Krysto - Monday, February 24, 2014 - link
From everything I've seen so far, the PowerVR Series 5 was further ahead of the competition than the PowerVR Series 6 is right now. In fact NVIDIA has already surpassed them, especially when you consider the full OpenGL 4.4 API support, and Adreno and Mali have become very competitive too; the Mali T760 should also have around 380 GFLOPS of performance, along with hardware-assisted global illumination.
I think the days of PowerVR/Apple devices having higher GPU performance than the competition are behind us, and it's for the best.
ryszu - Monday, February 24, 2014 - link
Why is it for the best?
grahaman27 - Monday, February 24, 2014 - link
Because Apple can't have all the cool stuff!
CiccioB - Monday, February 24, 2014 - link
Architecture-wise, PowerVR seems more like AMD's VLIW than NVIDIA's Kepler (or G200, Fermi, or Maxwell). That means PowerVR is going to have the same issues AMD had with VLIW in terms of general compute performance and ILP.
There are also many interesting facts that could be analysed:
1. AMD went from 5 compute ALUs to 4 to improve efficiency before switching to a completely new architecture (GCN). PowerVR went from 5 to 7 ALUs (if we count them all as separate units; are you sure it can process 16-bit instructions together with 32-bit ones, and not that each of those 32-bit units can alternatively execute 2x16-bit instructions?)
2. PowerVR is using the same marketing tactics AMD used to count their compute cores. AMD showed they had more compute cores than NVIDIA's competing architecture, but in the end, because they couldn't keep all of them fed, they were less efficient.
3. NVIDIA, going from desktop Kepler to mobile Kepler, removed ROPs and TMUs. So they probably think their architecture (and mobile GPUs in general) is less bottlenecked in those terms. PowerVR went and increased them, so they presumably think ROPs and TMUs are more important than shaders... which is it? Are both of them trying to hide some deficiency of their respective architectures?
4. We do not really know anything about PowerVR's geometry power. NVIDIA's Kepler SMX has special function units (the PolyMorph Engine) connected directly to the shaders. That seems to give an enormous boost to geometry performance (especially tessellation) that scales with the number of active SMXes. PowerVR seems to have chosen AMD's approach, with a tessellator outside the compute cores that does not scale automatically. How is PowerVR going to behave with future games that may need more geometry performance?
5. Again, as someone already asked, tile-based rendering was used on the desktop but was soon abandoned, as it could not give any real advantage over the raw power of other architectures, which grew much faster than PowerVR could optimize their algorithms, making tile-based rendering less and less profitable. What makes that scenario different from what we are witnessing now, where mobile resolutions are growing to be even bigger than desktop monitors and game complexity is going to increase with the arrival of these really powerful GPUs (K1 first among them)?
6. We lack a die area comparison. How big is a 6-cluster Rogue with respect to NVIDIA's K1? If it is, say, twice NVIDIA's die area, that would be a problem even though the power consumption is the same. If it is half, that would mean PowerVR could deliver double K1's performance (if we believe 192 Rogue shaders perform like 192 Kepler ones). That would mean NVIDIA is in trouble just as the high-end SoC race begins.
7. It seems PowerVR is behaving a bit like 3dfx did back in the day, until it died. They pushed their advanced but old technology to the extreme: they rendered at 16-bit instead of 32-bit, used a 16-bit Z-buffer instead of 24-bit, and many more "tricks" to try to hide what was quite clear: 3dfx didn't have the right architecture to compete with newer companies like NVIDIA and ATI, which started out on the right foot with much more powerful architectures (the TNT2 simply destroyed the Voodoo3 on all points, and beware, I was an unfortunate Voodoo3 owner). Will PowerVR meet the same end, trying to force the use of obsolete techniques while all the other competitors are clearly aiming for constantly increasing raw power with no (or minimal) trade-offs?
DanNeely - Monday, February 24, 2014 - link
AMD's shift from VLIW5 to VLIW4 was driven by the decline of DX9. DX9 was explicitly designed around a 5-step path; VLIW5 was tied directly to that. DX10's more flexible workflow rarely allowed for a 5-wide execution path.
For VLIW4, AMD tied functional units together more than Imagination appears to have done here. They have 4 normal ALUs that match the 4x 16-bit ALUs in Rogue; but to do a special function operation they used 3 of the 32-bit ALUs instead of dedicated hardware. The tradeoff was that a special function cost a lot more normal processing capacity than it did before. PowerVR doesn't appear to have put enough general purpose compute power in place to do the same, and is required to use a dedicated SFU by default (even assuming they felt the tradeoff was worth it like AMD did).
The main thing I'm curious about is whether the 16-bit and 32-bit ALUs are separate hardware, or whether they are implemented similarly to how SSE/AVX are done on x86, where the same hardware can do two 32 (16)-bit operations or one 64 (32)-bit operation.
http://www.anandtech.com/show/4061/amds-radeon-hd-...
Ryan Smith - Monday, February 24, 2014 - link
"The main thing I'm curious about is if the 16 and 32bit ALUs are separate hardware; or if they implemented them similar to how SSE/AVX are done on x86 where the same hardware can do 2 32 (16) bit or 1 64 (32) bit operation."They're separate hardware. Just as how NVIDIA uses separate FP32 and FP64 CUDA cores.
ryszu - Monday, February 24, 2014 - link
We're nothing like VLIW4/5, mobile Kepler still has ROPs and texture hardware, the area is absolutely nowhere near where you think it is and the architectural features we have in the front-end remain class leading and entirely sensible for mobile.
CiccioB - Monday, February 24, 2014 - link
Sorry, maybe I was not that clear. I didn't mean they removed ROPs and TMUs completely; I was hinting at the fact that they decreased their number in the mobile SMX with respect to the desktop SMX. ROPs are tied to memory channels, and that may be the cause. But TMUs are not, so they could have kept the same number as in the desktop implementation.
It seems NVIDIA sees that many ROPs and TMUs as bottlenecked by RAM bandwidth, so they save space and power by not adding them.
PowerVR, on the contrary, has a much higher ratio of ROPs and TMUs to shaders (or compute cores). One or the other made the wrong assumption (also tied to the memory controller width, which can be as wide as you want but costs die size and power). I'm curious to know who made it.
ryszu - Monday, February 24, 2014 - link
Ah, I see. Our ALU:TEX:ROP is different to Kepler (and again to Maxwell), yes. We're focused on still being strong for the basics (texturing, pixel fill) while still having a lot of shading to go with it. I can't speak for NV's design choices, just that both have pros and cons depending on market.
The rest of your comment still has a lot of problems in respect to the PowerVR Rogue architecture and how it works, how it works in mobile, and how it compares to K1 and pre-GCN AMD.
Ryan Smith - Monday, February 24, 2014 - link
"Architecture wise, PowerVR seems more alike AMD's VLIW then nvidia's Kepler (or G200 or Fermi or Maxwell).That means PowerVR is going to have the same issues AMD had with VLIW and general computing performances and ILP."
To be honest I had the same thought at first. We've known that Rogue has multiple slots per pipeline since the Apple A7 came out, and VLIW was the first thing that came to mind when I heard it. Given the greater simplicity of mobile SoCs, it would certainly make sense.
That said, after finally having access to IMG's technical details, it's clear to me that this is not the case, which was part of the reason I was so excited to work on this article. It's sort of like Fermi and it's sort of like VLIW5, but in reality it's neither.
The most important point is that in AMD's VLIW designs they had 4/5 ALUs all alike (for the sake of this discussion we'll ignore the T-unit). So to maximize a Streaming Processor's utilization, you needed to be able to extract a full 4-5 instructions of ILP out of a thread. Which was easy to do under DX9 (RGBA + lighting) and a lot harder to do under DX10.
Rogue on the other hand doesn't have ILP requirements nearly as high due to the fact that the 6 ALUs are not identical and are rarely all going to be in use at once (we don't even count the FP16 units in our GFLOPs calculations). They do have ILP requirements, unlike GCN, but for FP32 it's only 2 instructions for the 2 FP32 ALUs. This is in fact rather similar to Kepler (but not Maxwell) in that NVIDIA has a similar reliance on ILP to keep all of their CUDA cores fed. Half of the threads on Kepler need to co-issue another FP32 op to fill the other 64 CUDA cores in an SMX; Rogue is a bit worse in this regard since every thread needs to co-issue to fill every second FP32 ALU.
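To illustrate the point with a crude example (ordinary scalar code, not IMG's actual ISA): a pipeline with two FP32 ALUs only fills both slots in a given cycle when the compiler can find two adjacent, independent operations within the same thread.

    // Crude illustration of the ILP requirement (plain C++, not a real shader ISA):
    // a pipeline with two FP32 ALUs fills both slots in a cycle only when two
    // adjacent operations in the thread do not depend on each other.
    float shade(float a, float b, float c, float d)
    {
        float t0 = a * b;   // independent of t1, so the compiler can pair these two
        float t1 = c + d;   // fills the second FP32 ALU in the same cycle
        return t0 * t1;     // depends on both, so it issues alone and one ALU sits idle
    }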
FP16 on the other hand is trickier, since that's a full 4 ALU setup. Worst case scenario is that IMG needs to pull off 4 instructions of ILP to maximize their utilization, but this is a bit murkier since we don't know why Series 6 had the unusual 3 operator FP16 ALUs in the first place. As such I'm less familiar with where FP16 is being used in mobile today, so it's harder to draw comparisons for what FP16 utilization may be like. That said, there's also the unknown of die size and power requirements of using FP16 units for FP16 math versus using FP32 units for the same task. I'm not sure if IMG has reason to be worried about FP16 utilization if they can pack 2x as much hardware in the same die size and power envelope.
Ultimately I'd classify Rogue as being closer to Fermi/Kepler than VLIW, which is why those are the comparisons we went with in the article. The 2 wide FP32 pipeline isn't nearly as narrow as AMD's VLIW, and the instructions themselves aren't the inflexible chaos that was VLIW as a language.
"Again, as someone altready asked, tile based rendering was used on the desktop but was soon abandoned as it could not give any real advantages over the raw power of other architectures that grew much faster that what PowerVR could optimize their algorithms, making tile based rendering less and less profitable. What makes that scenario different that what we are witnessing in this period where mobile resolutions are growing to be even bigger than desktop monitors and that games complexity is gonig to increase for the arrival of these really powerful GPUs (K1 in primis)?"
One of the problems IMG faced in the old days was that DirectX and Windows weren't very well suited for their TBDR design; they pretty much had to fight the API at times to get what they wanted. For iOS/Android it's difficult to draw comparisons - though I'd note iOS has always been driven by IMG GPUs and hence always used TBDR - but Windows for its part has since gotten much better. In particular there are API hooks to allow applications to see if the GPU is TBDR. I'm not sure if that's enough, but it does mean things have changed at least a little bit since the old days.
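A minimal sketch of what such a check looks like on the Windows side (this uses the D3D11 architecture-info query linked in the next comment, and assumes an ID3D11Device has already been created and a Windows 8-era SDK that defines D3D11_FEATURE_ARCHITECTURE_INFO):

    // Minimal sketch: ask D3D11 whether the GPU is a tile-based deferred renderer.
    // Assumes 'device' was created elsewhere (e.g. via D3D11CreateDevice).
    #include <d3d11.h>

    bool IsTileBasedDeferredRenderer(ID3D11Device* device)
    {
        D3D11_FEATURE_DATA_ARCHITECTURE_INFO info = {};
        HRESULT hr = device->CheckFeatureSupport(
            D3D11_FEATURE_ARCHITECTURE_INFO, &info, sizeof(info));
        // TRUE on TBDR hardware; an engine could pick a render-pass strategy accordingly.
        return SUCCEEDED(hr) && info.TileBasedDeferredRenderer != FALSE;
    }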
Scali - Monday, February 24, 2014 - link
In D3D11 there is now a flag to indicate whether you are running on a TBDR device or not: http://msdn.microsoft.com/en-us/library/windows/de...
CiccioB - Monday, February 24, 2014 - link
I didn't realize that in counting those GFLOPS you ignored the 16-bit ALUs. Issuing two instructions should be much easier than issuing 3, 4 or even 5, not to speak of 6 or 7. However, I bet that PowerVR's next architecture (or the one after that) will remove those 16-bit ALUs and introduce a couple of units able to issue one 32-bit OR two 16-bit instructions, so that they can pack more shaders in the same area.
About TBDR design... the alternative to DirectX is OpenGL. Is it more suited to TBDR architectures than to "brute force" ones?
Still, I perceive the PowerVR architecture as something from the past that has survived until now, when the big players have entered the mobile game for real. Kepler is a very efficient architecture and Maxwell has demonstrated that it can be even better. How is PowerVR going to fight against an architecture as flexible as NVIDIA's, which can also be used for CUDA computing and thus be adopted into markets (and for other tasks) that PowerVR cannot address with their current architecture? Not to forget that NVIDIA can now easily bring to their mobile parts whatever engine exists for their desktop GPUs.
Will extreme (but not so flexible) efficiency win against something less efficient but able to do many more things in an easier way?
Will mobile game engines bet more on compute shader power or on memory bandwidth?
Will new DX10/DX11-style engines (whose features the new architectures support) still be TBDR-friendly? Does a TBDR design still scale to modern ultra-high-resolution displays, or will "brute force" (simply more power, more performance) win out as it did on the desktop?
I think this year will tell us a lot.
Scali - Monday, February 24, 2014 - link
"the alternative to DirectX is OpenGL. Is it more suited for TBDR architecture than "brute force" ones?"In theory, D3D used to be more suited to TBDR than OpenGL, because it had explicit BeginScene() and EndScene() markers. But those have been dropped after D3D9.
I can't really think of anything off the top of my head that would make one API more suited than the other these days. They're both very similar: just bind your textures/buffers/shaders, update the constants, and fire off your geometry.
"How is PowerVR going to fight against an architecture as flexible as nvidia ones that can also be used for CUDA computing and thus being adopted into markets (and for other tasks) PowerVR cannot with their current architecture?"
PowerVR supports OpenCL, so they too can play the GPGPU game. It all depends on who delivers the best package in terms of features, performance, power usage, etc.
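As a rough, vendor-neutral sketch of what that looks like from the host side, assuming the platform actually ships an OpenCL driver and ICD loader (error handling omitted), the same query code runs unchanged on a Rogue, an Adreno or a Kepler:

    // Minimal OpenCL host-side sketch: find the first GPU device and print basic info.
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform = nullptr;
        cl_device_id   device   = nullptr;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

        char    name[256]    = {};
        cl_uint computeUnits = 0;
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(computeUnits), &computeUnits, nullptr);
        printf("GPU: %s, compute units: %u\n", name, computeUnits);
        return 0;
    }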
Jhwzz - Tuesday, February 25, 2014 - link
> Again, as someone already asked, tile-based rendering was used on the desktop but was soon abandoned, as it could not give any real advantage over the raw power of other architectures, which grew much faster than PowerVR could optimize their algorithms, making tile-based rendering less and less profitable. What makes that scenario different from what we are witnessing now, where mobile resolutions are growing to be even bigger than desktop monitors and game complexity is going to increase with the arrival of these really powerful GPUs (K1 first among them)?
It is a common misconception that PowerVR's desktop parts were not competitive or had compatibility problems. The last card they produced, the Kyro II, was actually very competitive with the offerings from both NVIDIA and ATI. The claims of incompatibility were largely unfounded marketing FUD from competitors, with later drivers running the majority of content without problems. Further, they did not leave the market because they could not compete on performance; instead their partner at the time, STM, decided to pull out of the market for unstated reasons, although this was most likely because they could not invest the amount of money needed to take the high ground.
Mobile is VERY different from the desktop space. NVIDIA and ATI were able to brute-force their way to the top because both power consumption and memory bandwidth had extremely wide envelopes; this is not the case in the mobile space. With the Kyro II, PowerVR demonstrated that they could compete with considerably higher-specification parts (in clocks and memory bandwidth) from both NVIDIA and ATI, with considerably lower memory bandwidth and power requirements. Although NVIDIA and AMD have evolved, so has PowerVR, and as such there is no reason to assume they don't still have those advantages.
> It seems PowerVR is behaving a bit like 3dfx did back in the day, until it died. They pushed their advanced but old technology to the extreme: they rendered at 16-bit instead of 32-bit, used a 16-bit Z-buffer instead of 24-bit, and many more "tricks" to try to hide what was quite clear: 3dfx didn't have the right architecture to compete with newer companies like NVIDIA and ATI, which started out on the right foot with much more powerful architectures (the TNT2 simply destroyed the Voodoo3 on all points, and beware, I was an unfortunate Voodoo3 owner).
This simply makes no sense in the context of the market these cores are being targeted at. At the fundamental level, the primary API currently used in mobile is OpenGL ES 2.0, which does not mandate anything higher than FP16 within fragment shaders. This means that the vast majority of current mobile content only uses FP16 in fragment shaders; in these circumstances, do you think it makes sense to throw area at higher precision paths? Of course it doesn't! Further, it's not like the PowerVR architecture looks like a slouch at FP32.
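For context, this is roughly what that choice looks like at the source level: an illustrative (made-up) OpenGL ES 2.0 fragment shader held as a C string, the way it would be handed to glShaderSource(). The mediump default maps naturally onto FP16 ALUs, while highp support in fragment shaders is optional in ES 2.0.

    // Illustrative GLES2 fragment shader (hypothetical), stored as a C string for
    // glShaderSource(). The precision qualifier sets the minimum precision the
    // hardware must honour for floats in this shader.
    static const char* kFragmentShaderSrc =
        "precision mediump float;                          \n"  // FP16-class precision
        "uniform sampler2D u_tex;                          \n"
        "varying vec2 v_uv;                                \n"
        "void main() {                                     \n"
        "    gl_FragColor = texture2D(u_tex, v_uv) * 0.5;  \n"
        "}                                                 \n";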
At the end of the day, the truth will only be seen in benchmarks on actual devices, not in marketing claims and FUD from various companies.
Jhwzz - Tuesday, February 25, 2014 - link
BTW, you do realise that NV runs many of these benchmarks with a 16-bit Z-buffer and a 16-bit frame buffer, don't you? They do this because they become even less competitive when forced to use 32 bits. So who is actually using old "technology" to hide real deficiencies in their architecture?
MrSpadge - Saturday, March 1, 2014 - link
If you can't see a difference due to using 16 bits somewhere along the rendering path, then it's actually the smart thing to do. It saves power, which can be better used elsewhere (higher clocks). Well, that was actually 3dfx's argument for why Glide with 16 bit + dithering was the smart choice back then. But "they've got the bigger bits" won (along with other advantages of the early NVIDIA cards).
MrPoletski - Sunday, March 9, 2014 - link
It might be the smart thing to do, but when it's in a benchmark that's supposed to run at 32 bits, then I call that cheating!
allanmac - Monday, February 24, 2014 - link
If a new GPU architecture "deep dive" doesn't include the number of registers per multiprocessor then it's bordering on worthless. Intel, AMD and NVIDIA all publish these numbers, so the other mobile GPU vendors should as well.
Please dig up these numbers, so that we can begin to compare these next-gen mobile GPUs.
I suspect ARM, ImgTech and QCOM simply won't tell you... but you might be able to find the answer through a series of OpenCL tests.
boostern - Monday, February 24, 2014 - link
OpenCL tests on a GPU that isn't even in production? Before you say "test it on the iPhone 5S", there are no public OpenCL libraries available on iOS, as far as I know.
allanmac - Monday, February 24, 2014 - link
The first option is to ask the vendor.
boostern - Monday, February 24, 2014 - link
Yes, maybe :D
MrSpadge - Saturday, March 1, 2014 - link
While this is surely important, the raw number of registers doesn't tell you that much either. In fact, it could be very misleading. How many registers you need first and foremost depends on the out-of-order window (in CPUs) or, here, the number of threads in flight. Which is something they didn't tell us either. Also, different cache sizes and latencies would determine how bad it is to run out of register space.
vladz - Monday, February 24, 2014 - link
So what does flexible hardware or flexible rendering mean exactly?
Ryan Smith - Monday, February 24, 2014 - link
Flexible hardware in the context I used it would mean programmable. TMUs aren't flexible. They fetch texels, apply filtering, and that's it.
Shaders are flexible. They accept threads of instructions and the result on a pixel/vertex will be whatever the program dictates, as opposed to a fixed outcome.
ifrit39 - Monday, February 24, 2014 - link
I believe your math is off in the last table. A GTX 650 would produce 230.4 GFLOPS @ 300MHz, not 330.4.
It's interesting to me that these mobile designs are so close to full desktop performance (albeit at the low 300MHz clock). But memory bandwidth, power, and clocks will always hold these SoCs back in the real world.
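(For reference, assuming the GTX 650's GK107 with 384 CUDA cores, the corrected figure follows directly: 384 cores × 2 FLOPs per clock (FMA) × 0.3 GHz = 230.4 GFLOPS.)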
Thanks for the great article, as always.
hoboville - Monday, February 24, 2014 - link
I love these kinds of articles, simply because of the explanations of how various hardware and underlying systems work. Probably the best part of AnandTech reviews.
Laststop311 - Tuesday, February 25, 2014 - link
Gotta give Imagination Tech some credit. They have the highest-performing GPUs in the SoC market. Very flexible too: they make 2-, 3-, 4- and 6-cluster versions. If only we could have the best CPU performance (Qualcomm Snapdragon Krait CPUs) mixed with the best GPU (the 6-cluster PowerVR 6XT) on the same SoC. Though NVIDIA's dual-core Denver design, which throws away all the extra cores and devotes more die space to 2 higher-performing, more complex cores, might take the CPU crown, since there aren't many workloads on a phone that need more than 2 threads going.
nosirrah123 - Tuesday, February 25, 2014 - link
Wow, this is an amazingly written article, good work!
patrickjchase - Tuesday, February 25, 2014 - link
Are you sure that Rogue is superscalar rather than VLIW? Briefly, the difference between the two comes down to when independent instructions (ones that can execute in parallel) are identified. In a VLIW design it happens at compile time, while in a superscalar design it happens at runtime. It would actually surprise me if Rogue does runtime dependency analysis for such a wide backend - if I had to bet I'd say "VLIW".
D16700605001 - Wednesday, February 26, 2014 - link
Is there a similar article comparing NVIDIA to AMD? I've seen block diagrams of NVIDIA chips on their web site but haven't found any for AMD. Even if I did, an article like this one would be better than me trying to make inferences and decode the vendor spin. I want to buy a compute engine and I keep getting the impression AMD's offering is better, but I would like to be convinced by a technical discussion rather than stats about game performance and unpacking textures.
Bawl - Saturday, March 1, 2014 - link
Great article, thank you so much. PowerVR has so much power inside; I can only think their power is underutilized because of the other GPUs.
MrSpadge - Saturday, March 1, 2014 - link
I wonder how many instructions they can dispatch per clock. That's a significant factor when discussing how to feed up to 7 execution units. Actually I'd be surprised if it's more than 2, maybe 3 under special circumstances... which would make me wonder how they're feeding 4 FP16 ALUs. But then I also wonder if those are truly 4 independent units... I guess not.
errorr - Sunday, March 2, 2014 - link
Re tile-based rendering: the Mali guys outlined on their blog their thought process on why they chose tile-based rendering.
Ultimately it is all about power. TBR allows them to avoid memory reads and writes by keeping the color and Z-buffers on chip. They also keep the multisample buffer on chip, which allows them to do 16x MSAA at very low power cost.
The biggest power hog, though, is textures, which still reside in memory. This means that complex or large textures will basically overwhelm any power savings. Basically, TBR becomes less efficient when pushed to the max of its capabilities, as in many benchmarks.
http://community.arm.com/groups/arm-mali-graphics/...
errorr - Sunday, March 2, 2014 - link
Oh, I also wanted to note that the ARM Mali blog announced they were going to talk about their shader architecture in an upcoming post, the last part in a series about the latest Mali and OpenGL ES compliance. The previous part talked about tile-based rendering and how to get power savings from it. The interesting things I've learned from the blog include some great stuff not even discussed by Imagination.
Mali has a way of comparing incoming tiles with the current state of the frame buffer; if they are identical, the new tile is discarded instead of spending power writing it out to the frame buffer.
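A toy sketch of that idea (not ARM's actual implementation, just the general concept, with a made-up per-tile signature standing in for whatever the hardware really computes):

    // Toy sketch: keep a small signature per screen tile and skip the DRAM write
    // when this frame's signature matches last frame's for the same tile.
    #include <cstdint>
    #include <vector>

    class TileWriteEliminator {
    public:
        explicit TileWriteEliminator(size_t tileCount) : lastHash_(tileCount, 0) {}

        // Returns true if the tile changed and must be written out to memory.
        bool shouldWrite(size_t tileIndex, uint64_t newHash) {
            if (lastHash_[tileIndex] == newHash)
                return false;              // identical tile: elide the memory write
            lastHash_[tileIndex] = newHash;
            return true;                   // changed tile: pay for the write
        }

    private:
        std::vector<uint64_t> lastHash_;   // one signature per screen tile
    };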
Also, texture compression can save a lot of memory.
The biggest eye-opener, though, was that the way Android handles draw calls means that both the CPU and GPU are starved for work, and DVFS isn't responsive or granular enough to allow for power gating. It seems the problem has to do with screen orientation and Android preventing ideal asynchronous processing of work.
hi.wonjoon - Sunday, May 4, 2014 - link
Does this post say that the Rogue architecture is doing something like Intel's Hyper-Threading? I'm just wondering if I understood it the right way.