Tegra X1's GPU: Maxwell for Mobile

Going into today’s announcement of the Tegra X1, while NVIDIA’s choice of CPU had been something of a wildcard, the GPU was a known quantity. As announced back at GTC 2014, Erista – which we now know as Tegra X1 – would be a future Tegra product with a Maxwell GPU.

Maxwell of course already launched on the PC desktop last year, first in the Maxwell 1 based GM107 and later in the Maxwell 2 based GM204. However, despite this otherwise typical GPU launch sequence, Maxwell marks a significant shift in GPU development for NVIDIA, one that is only now coming to completion with the launch of the X1. Starting with Maxwell, NVIDIA has embarked on a “mobile first” design strategy for their GPUs; unlike Tegra K1 and its Kepler GPU, Maxwell was designed for Tegra from the start rather than being ported after the fact.

By going mobile-first NVIDIA has been able to reap a few benefits. On the Tegra side in particular, mobile-first means that NVIDIA’s latest and greatest GPUs are appearing in SoCs earlier than ever before – the gap between Maxwell 1 and Tegra X1 is only roughly a year, versus nearly two years for Kepler in Tegra K1. But it also means that NVIDIA is integrating deep power optimizations into their GPU architectures at an earlier stage, which for their desktop GPUs has resulted in chart-topping power efficiency, and these benefits are meant to cascade down to Tegra as well.

Tegra X1 then is the first SoC to be developed under this new strategy, and for NVIDIA this is a very big deal. From a feature standpoint NVIDIA gets to further build on their already impressive K1 feature set with some of Maxwell’s new features, and meanwhile from a power standpoint NVIDIA wants to build the best A57 SoC on the market. With everyone else implementing (roughly) the same CPU, the GPU stands to be a differentiator and this is where NVIDIA believes their GPU expertise translates into a significant advantage.

Diving into the X1’s GPU then, what we have is a Tegra-focused version of Maxwell 2. Compared to Kepler before it, Maxwell 2 introduced a slew of new features into the NVIDIA GPU architecture, including 3rd generation delta color compression, streamlined SMMs with greater efficiency per CUDA core, and graphics features such as conservative rasterization, volumetric tiled resources, and multi-frame sampled anti-aliasing (MFAA). All of these features are making their way into Tegra X1, and for brevity’s sake rather than rehashing all of this we’ll defer to our deep dive on the Maxwell 2 architecture from the launch of the GeForce GTX 980.

For X1 in particular, while every element helps, NVIDIA’s memory bandwidth and overall efficiency increases are going to be among the most important of these improvements, since they address two of the biggest performance bottlenecks facing SoC-class GPUs. Memory bandwidth has long been a bottleneck at higher performance levels and resolutions; the straightforward solution is to build a wider (96-bit or 128-bit) memory bus, which is very effective but also drives up the cost and complexity of the SoC and the supporting hardware. NVIDIA is sticking with a 64-bit memory bus for X1, which makes memory compression all the more important. Coupled with a generous increase in raw memory bandwidth from the move to LPDDR4, this helps ensure that X1’s more powerful GPU won’t immediately get starved at the memory stage.
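
To put rough numbers on the raw bandwidth side, here is a quick back-of-the-envelope sketch (host-side C++, using the memory clocks from the spec table further down; real-world sustained bandwidth will of course come in lower):

    // Back-of-the-envelope memory bandwidth, based on the memory clocks in the
    // spec table below. LPDDR3/LPDDR4 transfer data twice per clock, and both
    // SoCs use a 64-bit (8-byte) bus.
    #include <cstdio>

    int main() {
        const double bus_bytes = 64.0 / 8.0;                   // 64-bit bus
        const double k1_gbps = 930e6  * 2 * bus_bytes / 1e9;   // ~14.9 GB/s (LPDDR3)
        const double x1_gbps = 1600e6 * 2 * bus_bytes / 1e9;   // ~25.6 GB/s (LPDDR4)
        std::printf("K1: %.1f GB/s, X1: %.1f GB/s\n", k1_gbps, x1_gbps);
        return 0;
    }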

Meanwhile just about everything about SoC TDP that can be said has been said. TDP is a limiting factor in all modern mobile devices, which means decreased power consumption directly translates into increased performance, especially under sustained loads. Coupled with TSMC’s 20nm SoC process, Maxwell’s power optimizations will further improve NVIDIA’s SoC GPU performance.

Double Speed FP16

Last but certainly not least, X1 will also be launching with a new mobile-centric GPU feature not found on desktop Maxwell. For X1 NVIDIA is implementing what they call “double speed FP16” support in their CUDA cores, which is to say that the GPU can execute FP16 operations at a higher rate in limited circumstances.

As with Kepler and Fermi before it, Maxwell only features dedicated FP32 and FP64 CUDA cores, and this is still the same for X1. However in recognition of how important FP16 performance is, NVIDIA is changing how they are handling FP16 operations for X1. On K1 FP16 operations were simply promoted to FP32 operations and run on the FP32 CUDA cores; but for X1, FP16 operations can in certain cases be packed together as a single Vec2 and issued over a single FP32 CUDA core.

There are several special cases here, but in a nutshell NVIDIA can pack together FP16 operations as long as they’re the same operation, e.g. both FP16s are undergoing addition, multiplication, etc. Fused multiply-add (FMA/MADD) is also a supported operation here, which is important for how frequently it is used and is necessary to extract the maximum throughput out of the CUDA cores.
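
To make the packing concrete, here is a minimal, hypothetical CUDA sketch of the kind of code that benefits. It uses the half2 type and __hfma2 intrinsic from cuda_fp16.h, which target GPUs with native FP16 arithmetic; this is an illustration of packed FP16 in general, not NVIDIA's internal implementation:

    // Two FP16 values ride in one 32-bit __half2 register, so a single __hfma2
    // issues the same fused multiply-add for both halves - two FP16 FMAs through
    // one FP32 CUDA core, which is where the "double speed" figure comes from.
    #include <cuda_fp16.h>

    __global__ void saxpy_fp16x2(int n, __half2 a, const __half2 *x, __half2 *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            y[i] = __hfma2(a, x[i], y[i]);  // y = a*x + y on both packed halves
        }
    }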

In this respect NVIDIA is playing a bit of catch up to the competition, and overall it’s hard to escape the fact that this solution is a bit hack-ish, but credit where credit is due to NVIDIA for at least recognizing and responding to what their competition has been doing. Both ARM and Imagination have FP16 capabilities on their current generation parts (be it dedicated FP16 units or better ALU decomposition), and even AMD is going this route for GCN 1.2. So even if it only works for a few types of operations, this should help ensure NVIDIA doesn’t run past the competition on FP32 only to fall behind on FP16.

So why are FP16 operations so important? The short answer is for a few reasons. FP16 operations are heavily used in Android’s display compositor due to the simplistic (low-precision) nature of the work and the power savings, and FP16 operations are also used in mobile games at certain points. More critical to NVIDIA’s goals however, FP16 can also be leveraged for computer vision applications such as image recognition, which NVIDIA needs for their DRIVE PX platform (more on that later). In both of these cases FP16 does present its own limitations – 16-bits just isn’t very many bits to hold a floating point number – but there are enough cases where it’s still precise enough that it’s worth the time and effort to build in the ability to process it quickly.

Tegra X1 GPU By The Numbers

Now that we’ve covered the X1’s GPU from a feature perspective, let’s take a look at the GPU from a functional unit/specification perspective.

Overall the X1’s GPU is composed of 2 Maxwell SMMs inside a single GPC, for a total of 256 CUDA cores. This compares very favorably to the single SMX in K1, as it means certain per-SMM/SMX resources such as the geometry and texture units have been doubled. Furthermore, Maxwell’s more efficient CUDA cores mean that X1 is capable of further extending its lead over Kepler, as we’ve already seen in the desktop space.

NVIDIA Tegra GPU Specification Comparison
                           K1                   X1
CUDA Cores                 192                  256
Texture Units              8                    16
ROPs                       4                    16
GPU Clock                  ~950MHz              ~1000MHz
Memory Clock               930MHz (LPDDR3)      1600MHz (LPDDR4)
Memory Bus Width           64-bit               64-bit
FP16 Peak                  365 GFLOPS           1024 GFLOPS
FP32 Peak                  365 GFLOPS           512 GFLOPS
Architecture               Kepler               Maxwell
Manufacturing Process      TSMC 28nm            TSMC 20nm SoC

Meanwhile outside of the CUDA cores, NVIDIA has also made an interesting move in X1’s ROP configuration. At 16 ROPs the X1 has four times the ROPs of K1, and is consequently quite ROP-heavy for an SoC; this is as many ROPs as on a GM107 GPU, for example. With that said, given NVIDIA’s overall performance goals and their desire to drive 4K displays at 60Hz, there is a definite need to go ROP-heavy to make sure they can push the necessary number of pixels. This also goes hand-in-hand with NVIDIA’s memory bandwidth improvements (efficiency and actual), which will make it much easier to feed those ROPs. Finally, it puts the ROP:memory controller ratio at 16:1, the same ratio as on NVIDIA’s desktop Maxwell parts.
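
For a rough sense of the headroom those 16 ROPs provide, the hypothetical calculation below compares peak pixel fill (at the ~1000MHz clock from the table above) against the raw pixel output of a 4K display at 60Hz; the large gap is what absorbs overdraw, blending, and anti-aliasing:

    // Rough ROP headroom estimate: peak fill rate vs. raw 4K @ 60Hz pixel output.
    #include <cstdio>

    int main() {
        const double rops = 16, gpu_clock_hz = 1.0e9;        // ~1000MHz from the table
        const double peak_fill = rops * gpu_clock_hz;        // ~16 Gpixels/s
        const double uhd_60hz  = 3840.0 * 2160.0 * 60.0;     // ~0.5 Gpixels/s
        std::printf("Peak fill: %.1f Gpix/s, 4K60 output: %.2f Gpix/s (%.0fx headroom)\n",
                    peak_fill / 1e9, uhd_60hz / 1e9, peak_fill / uhd_60hz);
        return 0;
    }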

Finally, let’s talk about clockspeeds and expected performance. While NVIDIA is not officially publishing the GPU clockspeeds for the X1, based on their performance figures it’s easy to work out. With NVIDIA’s quoted (and promoted) 1 TFLOPS FP16 performance figure for the X1, the clockspeed works out to a full 1GHz for the GPU (256 CUDA cores * 2 FP16 ops per core * 2 FLOPs per FMA * 1GHz = 1 TFLOPS).
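
The same arithmetic also reproduces the peak figures in the spec table above; a quick sanity check (keeping in mind that the 1GHz clock is derived rather than officially published):

    // Peak throughput check: 256 CUDA cores, an FMA counted as 2 FLOPs, and
    // (for FP16 only) two operations packed per core per clock.
    #include <cstdio>

    int main() {
        const double cores = 256, clock_ghz = 1.0, flops_per_fma = 2;
        const double fp32_gflops = cores * flops_per_fma * clock_ghz;  // 512 GFLOPS
        const double fp16_gflops = fp32_gflops * 2;                    // 1024 GFLOPS (~1 TFLOPS)
        std::printf("FP32: %.0f GFLOPS, FP16: %.0f GFLOPS\n", fp32_gflops, fp16_gflops);
        return 0;
    }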

This is basically a desktop-class clockspeed, and it goes without saying that this is a very aggressive GPU clockspeed for an SoC-class part. We’re going to have to see what design wins X1 lands and what the devices are like, but right now it’s reasonable to expect that mobile devices will only burst to these clocks for short periods of time at best. However NVIDIA’s fixed-platform DRIVE devices are another story; those can conceivably be powered and cooled well enough that the X1’s GPU can hit and sustain these clockspeeds.

Comments

  • Jumangi - Monday, January 5, 2015 - link

    Apple would never use Nvidia at the power consumption levels it brings. The power is pointless to them if it can't be put into a smartphone level device. Nvidia still doesn't get why nobody in the OEM market wants their tech for a phone.
  • Yojimbo - Monday, January 5, 2015 - link

    But the NVIDIA SOCs are on a less advanced process node, so how can you know that? You seem to be missing the whole point. The point is not what Apple wants or doesn't want. The point is to compare NVIDIA's GPU architecture to the PowerVR series 6XT GPU. You cannot directly compare the merits of the underlying architecture by comparing performance and power efficiency when the implementations are using different sized transistors. And the question is not the level of performance and power efficiency Apple was looking for for their A8. The question is simply peak performance per watt for each architecture.
  • OreoCookie - Tuesday, January 6, 2015 - link

    @Yojimbo
    The Shield was released with the Cortex A15-based Tegra K1, not the Denver-based K1. The former is not competitive with regards to CPU performance, the latter plays in the same league. AFAIK the first Denver-based K1 product was the Nexus 9. Does anyone know of any tablets which use the Denver-based K1?
  • lucam - Wednesday, January 7, 2015 - link

    Apple sells products that have a one-year life cycle; they don't sell chips, and therefore they don't need to do any marketing in advance the way NV does like clockwork at every CES.
  • TheJian - Monday, January 5, 2015 - link

    It's going to 16nm FinFET later this year (Parker). As noted here, it's NOT in this chip due to time to market, and there's probably not as much to be gained by shrinking to 20nm vs. going straight to 16nm FinFET anyway. Even Qualcomm went off the shelf for the S810 again for time to market.

    Not sure how you get that Denver is a disappointment. It just came out...LOL. It's a drop in replacement for anyone using K1 32bit (pin compatible), so I'm guessing we'll see many more devices pop up quicker than the first rev, but even then it will have a short life due to X1 and what is coming H2 with Denver yet again (or an improved version).

    What do you mean K1 is in ONE device? You're kidding right? Jeez, just go to amazon punch Nvidia K1 into the search. Acer, HP, NV shield, Lenovo, Jetson, Nexus9, Xiaomi (mipad not sold on amazon but you get the point)...The first 4 socs were just to get us to desktop gpu. The real competition is just starting.

    Building the cpu wasn't just for mobile either. You can now go after desktops/higher end notebooks etc with NO WINTEL crap in them and all the regular PC trimmings (high psu, huge fan/heatsink, hd's, ssd's etc etc, discrete gpu if desired, 16-32GB of ram etc). All of this timed perfectly with 64bit OS getting polished up for MUCH more complicated apps etc. The same thing that happened to low-end notebooks with chromebooks, will now happen with low end PC's at worst and surely more later as apps advance on android etc and Socs move further up the food chain in power and start running desktop models at 4ghz with fan/heatsinks (with a choice of discrete gpu when desired). With no Wintel Fee (copy of windows + Intel cpu pricing), they will be great for getting poor people into great gaming systems that do most of what they'd want otherwise (internet, email, docs, media consumption). I hope they move here ASAP, as AMD is no longer competition for Intel CPU wise.

    Bring on the ARM full PC like box! Denver was originally supposed to be x86 anyway LOL. Clearly they want in on Intel/AMD cpu territory and why not at CPU vs. SOC pricing? NV could sell an amped up SOC at 4ghz for $110/$150 vs. Intel's top end i5/i7's ($229/339). A very powerful machine for $200 less cash but roughly ~perf (when taking out the Windows fee also, probably save $200 roughly). Most people in this group won't miss the windows apps (many won't even know what windows is, grew up on a phone/tablet etc). Developing nations will love these as apps like Adobe Suite (fully featured) etc get moved making these cheap boxes powerful content creators and potent gamers (duh, NV gpu in them). If they catch on in places like USA also, Wintel has an even bigger headache and will need to drop pricing to compete with ARM and all it's ecosystem brings. Good times ahead in the next few years for consumers everywhere. This box could potentially run android, linux, steamos, chrome in a quadboot giving massive software options etc at a great price for the hardware. Software for 64bit on Arm will just keep growing yearly (games and adv apps).
  • pSupaNova - Tuesday, January 6, 2015 - link

    Agree totally with your post. Nvidia did try to put good mobile chips in netbooks with the ION & ION2, and Intel blocked them.

    Good to see that they have stuck at the job and are now in a position to start eating Intel's lunch.
  • darkich - Monday, January 5, 2015 - link

    That's just not true.

    The K1 has shipped in three high-end Android tablets - the Nvidia Shield, Xiaomi MiPad, and Nexus 9.

    Now, how many tablets got a Snapdragon 805?
    Exynos 5433?

    Tegra K1 market performance is simply the result of the fact that high end tablet market is taken up by Apple, and that it doesn't compete in mod range and low end.
  • darkich - Monday, January 5, 2015 - link

    *mid range
  • GC2:CS - Monday, January 5, 2015 - link

    It's the result of too high power consumption, which OEMs prefer to keep low.

    That's why the Tegra K1 is used only by foolish Chinese manufacturers like Xiaomi (much as Tegra 4 ended up in a phone), by Google in desperate need of a non-Apple high-end 64-bit chip (to showcase how 64-bit it is), and by Nvidia themselves.
  • Yojimbo - Monday, January 5, 2015 - link

    I think you're right that the K1 is geared more towards performance than other SOCs. The K1 does show good performance/watt, but it does so with higher performance, using more watts. And you're right that most OEMs have preferred a lower power usage. But it doesn't mean that the K1 is a poor SOC. NVIDIA is trying to work towards increasing the functionality of the platform by allowing it to be a gaming platform. That is their market strategy. It is probably partially their strategy because those are the tools they have available to them; that is their bread-and-butter. But presumably they also think mobile devices can really be made into a viable gaming platform. Thinking about it in the abstract it seems to be obvious... Mobile devices should at some point become gaming platforms. NVIDIA is trying to make this happen now.
