How Much Power?

All this hardcore testing just made us more curious. Would we be able to determine how much power the PCU of Nehalem actually saves? Let's add a little more machine code to our hardware C-state scripts. The MSR 3FCh contains the info we need. We test once again with two active chess threads.

PCU Sleep State Comparison
  Clockticks Ticks spent in C3 Ticks spent in C6 Percentage C3 Percentage C6
Core 1 2961889630 3497984 71450624 0.12% 2.41%
Core 2 2989850634 4128768 768581632 0.14% 25.71%
Core 3 3022277437 186195968 1032536064 6.16% 34.16%
Core 4 3033988899 171286528 387645440 5.65% 12.78%
Average       3.02% 18.76%

At first you may think that these measurements contradict our previous measurements even though they were measured in the same circumstances (two active threads + one measurement thread). But if you calculate how much time the cores spend on average in C6, you get 19%, in the same ballpark as our previous measurement (21%). Notice that the PCU forces the Xeon cores to move quickly from C3 to a deeper C6 sleep: only 3% (!) is spent in C3.

So this means that the ACPI C2 state consists of 13.85% C3 and 86.15% C6 (18.76/ (3.02 + 18.76). Let's take the ACPI readings again.

ACPI C-State Comparison
  % idle C1 C2 C3
Opteron 2435 86 100 0 0
Xeon L3426 81 7 93 0
Opteron 2389 72.44 100 0 0

So now we can calculate how much time the CPU actually spent in the real hardware C-states.

% time spent in C1 = 7% of 81% idle

The "software" ACPI C2 states are mapped by the Xeon CPU to two "hardware CPU" states:

  1. % time spent in C3 = 13.85% out of 93% C3, at 81% idle = +/- 10.3%
  2. % time spent in C6 = 86.15% out of 93% C3, at 81% idle = +/- 65%

So our two threads of Chess caused the L3426 cores to spend:

  • 19% in C0
  • 5.7% in C1
  • 10.3% in C3
  • 65% (!) in C6

…on average.

What effect would this have on the power consumption of the chip? Intel gives us a good idea of what each C-state consumes with the Xeon X3400 series. In the thermal specifications and design guidelines [6] we find this table.


Intel does not give us C1 power, but let's assume it is 25W on the L3426; our industry sources tell us this should be close enough. If the complex circuitry of the PCU was not available, the CPU would be limited to the C1 state to save power. Other C-states would only be available if all cores were idle or the system was idle. We assume that C0 consumes 45W, which is not far from the truth either as the CPUs with low TDP tend to be quicker.

Total power w/o PCU
= 45W * 19% (C0) + 25W * 81% (C1)
= 28.8W
Total Power with PCU
= 45W * 19% (C0) + 25W * 5.7% (C1) + 17W * 10.3% (C3) + 4W * 65% (C6)
= 14.5W

The actual absolute numbers are not that important, but our simplified calculation shows that the fact that the PCU forces the CPU to go very quickly to C6 allows the "Lynnfield" Xeon to morph from a rather mediocre low power CPU into a "real" low power CPU. 14W for four complex out of order processors is very impressive, less than 4W per core! Intel's claims are justified: the PCU enables the "Nehalem" based cores to run in a deep sleep C6 state, even if other cores are hard at work. To end with an interesting note: even with four threads active on the Xeon L3426 we found out that the cores spent 11% of the time in C6.

Analysis: What Happened? More Performance Please!
Comments Locked

35 Comments

View All Comments

  • JohanAnandtech - Monday, January 18, 2010 - link

    In which utility do you set/manage the frequency of a separate core?
  • n0nsense - Monday, January 18, 2010 - link

    Gnome panel applets. CPU frequency monitor I guess it uses cpufreq. Each instance monitors core. So i have 4 of them visible all the time. If you have enabled CPU Frequency scaling (kernel) than you can select the governor (performance, on demand, conservative etc) or a static frequency. I can do it for each core. And it displays what i have set.
    Of course processor should support frequency scaling.(power now and speed step).
    Most mainstream distributions (Ubuntu, Sabayon, Fedora) will use onedemand governor by default when processor with frequency scaling available. No user intervention required.
  • jordanclock - Monday, January 18, 2010 - link

    I really think you're mistaken. Core 2 CPUs don't have any mechanism to allow per-core frequencies. There is one FSB clock and one multiplier. There is no way to set CPU0 to a different frequency than CPU1 (or for quad core, CPU2 and CPU3) because the variables that control the clock speed are chip wide.
  • VJ - Tuesday, January 19, 2010 - link

    These people seem to be convinced of per-core Speedstep:

    https://bugs.launchpad.net/ubuntu/+source/linux-so...">https://bugs.launchpad.net/ubuntu/+source/linux-so...

    Maybe someone can ask David Tomaschik for the Intel documentation he refers to?
  • n0nsense - Monday, January 18, 2010 - link

    I heard it in past, but i still tend to believe my eyes :)
    while writing this reply, i saw any possible combination. My Q9300 has 2 states 2.0GHz and
    2.5GHz. It's not a server CPU. Have no reason to mislead you
  • VJ - Tuesday, January 19, 2010 - link

    If there's only two states, then it's possible that one core is in the C2 state while the other is in its C0 state.

    The core in state C2 may be shown to be operating at 2Ghz (its lowest frequency) while it's really off. The OS may simply be reporting the lowest possible frequency while the core is really not receiving a clock signal.

    So in general, if one core is showing its lowest frequency it may be off which still allows the other core to operate (at a different frequency).

    It would be very strange if both cores are operating greater than their lowest and less than their highest frequencies at different frequencies.

    From a different angle: Has anybody ever seen /proc/cpuinfo report a frequency less than the CPU/Core's lowest active frequency or even zero? Probably not.



  • n0nsense - Tuesday, January 19, 2010 - link

    Nice theory :)
    But in this case, I see that each core doing something. htop shows that each core somewhere in 15% usage. So the only options left, are
    1. Each core frequency can be controlled independently on C2D and C2Q (May be i3 i5 i7 too)
    2. The OS is completely unaware of whats going on :) (which is less possible)
  • mino - Thursday, January 21, 2010 - link

    "The OS is completely unaware of whats going on" is the right answer.
    :)

    BTW, only x86 CPU's able to change freq per core are >=K10 for AMD and >=Nehalem for Intel.
  • VJ - Tuesday, January 19, 2010 - link

    Not to defeat your argument/observations, rather for completeness' sake:

    It's also possible that the differences are due to the reading of the attributes. If the attributes are read in succession, then it's possible that the differences are due to the time of reading the attributes, while at any given instant, notwithstanding the allowable subtle differences in frequency described in this article, all cores are operating at the same frequency.

    There's a lot of time at the bottom.
  • JanR - Tuesday, January 19, 2010 - link

    Hi,

    I completely agree to this:

    "It's also possible that the differences are due to the reading of the attributes."

    The point is that desktop usage together with ondemand governor leads to a lot of fast frequency changes. Therefore, this is not a good scenario to decide on "per core" vs "per CPU". We did a lot of testing the following way:

    Put load on all cores using "taskset" (this avoids C-states). Switch to "userspace" governor and then set frequencies of individual cores differently. You have one control per core but the actual hardware decides what really happens - you can check this in /proc/cpuinfo or using a tool such as "mhz" from lmbench as load generator (this one calculates actual frequency based on CPI and time, it allows also measurement of turbo frequencies).

    Trying around, the results are:

    AMD K8: One clock domain, maximum of the requested frequencies is taken

    Intel Core2 Duo: Same as K8

    AMD K10: Individual clock domains, you can clock each core individually

    Intel Core 2 Quad: TWO clock domains! These CPUs are two dual core dies glued together so each die has its one multiplicator. Therefore, the cores of each die get the maximum of the requested frequencies but you can clock the two dies independendly.

    Intel Nehalem: One clock domain, maximum of requests of all cores that are not in C-state! If you set one core to, e.g., 2.66 GHz and all other to 1.6, all cores clock at 1.6 as long as the core set to 2.66 is not used, they all switch to 2.66 if you put load on that core.

    So far to our findings. "cat /proc/cpuinfo" or some funny tools are useless if you do not control the environment (userspace, manual settings). If you then enable ondemand, the system switches fast between different states and looking at it is just a snapshot, maybe taken in the middle of a transition.

    Greetings,

    Jan

Log in

Don't have an account? Sign up now