Limitations

First of all, let's discuss the limitations of this review. The benchmark we used allowed us to control the number of threads very accurately, but it is not a real-world benchmark for most IT professionals. Because it is an integer-dominated benchmark it has some relevance, but it is still not ideal. In our next article we will be using MS SQL Server 2008 R2. That will allow us to measure power efficiency at a given performance level, which is much more relevant than pure performance/watt. The low-power six-core Opteron 2419 EE is also missing: this CPU arrived in the labs just as we finished this article, so expect an update soon.

"Academic" Conclusion

The days when dynamic frequency scaling offered significant power savings are over. The reason is that you can only lower the voltage if you scale the complete package to a lower clock. In that case the power savings are considerable (P ~ V²), but we did not encounter that situation very often. Instead, both AMD and Intel favor the strategy of placing idle cores in higher C-states. The most important power savings come from fine-grained clock gating, from placing cores in a completely clock gated C-state (AMD's Smart Fetch + C1), or even better, from placing them in a power gated state (Intel's Power Gating into deep C6 sleep).
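The P ~ V² relationship can be made concrete with the standard first-order model of dynamic CMOS power, P = C·V²·f. The sketch below is illustrative only: the function name and the 20% clock and 10% voltage reductions are assumptions rather than measurements from this review, and static (leakage) power is ignored.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float = 1.0) -> float:
    """Dynamic power relative to baseline, using P ~ C * V^2 * f.

    The switched capacitance C is assumed constant, so it cancels out
    of the ratio.
    """
    return (volt_scale ** 2) * freq_scale


# Hypothetical operating points (not taken from the review's measurements).
# Clock lowered by 20%, voltage unchanged:
freq_only = relative_dynamic_power(freq_scale=0.8)
# Clock lowered by 20% and voltage lowered by 10%:
freq_and_volt = relative_dynamic_power(freq_scale=0.8, volt_scale=0.9)

print(f"frequency only:      {freq_only:.3f} of baseline dynamic power")
print(f"frequency + voltage: {freq_and_volt:.3f} of baseline dynamic power")
```

In this simplified model, lowering only the clock reduces dynamic power in proportion to frequency, but a CPU-bound task then takes proportionally longer, so the dynamic energy per task barely changes. The extra saving only appears once the voltage can drop as well.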

Practical Conclusions

Windows 2008 makes you choose between the Balanced and Performance power plans. If your application is idle most of the time and you are heavily power constrained, Balanced is always the right choice. In all other cases, we would advise using the "Performance" plan for the Opterons. For some reason, the CPU driver does not deliver the performance that is demanded: with Balanced, when you ask for 25% of the total CPU performance, you get something like 15% to 20%. The result is up to 25% less performance than the CPU delivers in "Performance" mode, without significant power savings. That's not good. We can already reveal that we saw response times increase in MS SQL Server 2008 due to this phenomenon. It is also worth noting that our new measurements confirm that the performance/watt ratio of the six-core Opterons is significantly better than that of the quad-core Opterons.

The Xeons are a different story. For the normal 95W Xeons it makes sense to run in Balanced mode. The "base" performance is excellent, and Turbo Boost adds a bit of performance but also quite a bit of power. Ideally, it should be possible to run in Balanced mode and use Turbo Boost only when your application is performing a single-threaded batch operation, but unfortunately this is not possible with the default Windows 2008 settings.

For the low-power Xeons, it is different. Those CPUs run closer to their specified TDP limit and will rarely use Turbo Boost once they are loaded at 25% or more. If your application is limited by regular single-threaded batch operations, it makes a lot of sense to choose the Performance plan. Turbo Boost pays off in that case: the clock speed is raised from a meager 1.86GHz to an impressive 3.2GHz. As Xeons based on the "Nehalem" architecture place idle cores in C6 very quickly, the Performance mode hardly consumes more than the Balanced mode. As we have shown, frequency scaling does not save much power, as most of the idle cores are power gated automatically. This aggressive "go to C6 sleep" policy allows the architecture with the highest IPC in the industry to morph into a high-performance server CPU with modest power consumption. There is a huge difference between this CPU inside a machine where it is pushed towards 100% load and inside a server where it hovers between 20% and 70% load most of the time. The latter situation allows the CPU to keep cores in C6 for a significant amount of time. As a result, the power savings in a server environment are nothing short of impressive.
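The difference between a fully loaded machine and a server hovering at 20% to 70% load can be sketched with a simple two-state model in which each core is either fully active or power gated in C6. Everything below is an assumption for illustration: the function name, the core count and all wattages are hypothetical placeholders, not values measured in this review, and the model ignores Turbo Boost and C-state transition latency.

```python
def average_package_power(utilization: float, cores: int = 4,
                          active_core_w: float = 15.0,
                          c6_core_w: float = 0.0,
                          uncore_w: float = 20.0) -> float:
    """Time-averaged package power for a simple two-state core model.

    Each core is assumed to spend `utilization` of its time fully active
    and the rest power gated in C6. All wattages are hypothetical.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    per_core = utilization * active_core_w + (1.0 - utilization) * c6_core_w
    return uncore_w + cores * per_core


# Compare a lightly loaded server, a busy server and a saturated machine.
for load in (0.2, 0.7, 1.0):
    print(f"{load:.0%} load: {average_package_power(load):.1f} W")
```

Because a power gated core contributes almost nothing in this model, average power falls nearly linearly with load down to the uncore floor. The real gap depends on measured per-core and uncore power.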

Now that we understand the nuts and bolts, we are able to move on to our next question: How can we get the best power efficiency at a certain performance point? We will follow up with a power efficiency case study based on SQL Server 2008.

References

[1] "Planet Google: One Company's Audacious Plan to Organize Everything", page 82, Randall Stross, Free Press, New York.

[2] "AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet", Publication Revision 3.07, September 2009.

[3] "Power Reduction through RTL Clock Gating", F. Emnett and M. Biegel, SNUG (Synopsys Users Group) Conference, San Jose, 2000.

[4] "45nm Next Generation Intel Core™ Microarchitecture (Penryn)", Varghese George, Principal Engineer, Intel Corp, HOT CHIPS 2007.

[5] "Analysis of Dynamic Power Management on Multi-Core Processors", W. Lloyd Bircher and Lizy K. John, The University of Texas at Austin, ICS '08, June 2008.

[6] "Intel Xeon Processor 3400 series thermal/mechanical specifications and design guidelines", December 2009.


35 Comments


  • n0nsense - Monday, January 18, 2010

    Here is what system sees ...
    only one is 2.5, other three are 2.0 :)

    nons ~ # cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 2497.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 0
    cpu cores : 4
    apicid : 0
    initial apicid : 0
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 5009.38
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 1
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 1998.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 1
    cpu cores : 4
    apicid : 1
    initial apicid : 1
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 7012.69
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 2
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 1998.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 2
    cpu cores : 4
    apicid : 2
    initial apicid : 2
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 5009.08
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    processor : 3
    vendor_id : GenuineIntel
    cpu family : 6
    model : 23
    model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
    stepping : 7
    cpu MHz : 1998.000
    cache size : 3072 KB
    physical id : 0
    siblings : 4
    core id : 3
    cpu cores : 4
    apicid : 3
    initial apicid : 3
    fpu : yes
    fpu_exception : yes
    cpuid level : 10
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
    bogomips : 5009.09
    clflush size : 64
    cache_alignment : 64
    address sizes : 36 bits physical, 48 bits virtual
    power management:
  • VJ - Tuesday, January 19, 2010

    These are mobile CPUs, however:

    With Linux on a Latitude (Intel T7200 or T7500), CPU Frequency Scaling Monitor allows one to scale the frequency of one core to its max while leaving the other core at its minimum.

    With an AMD TL62, this is not possible. The induced scaling of one core causes the frequency of the other core to follow.

    With an AMD ZM84 this is possible. Just like with the Latitude, one can have one core at its max with the other core at its minimum.

    Maybe what's shown is not what's taking place.

    Additionally;

    http://www.intel.com/technology/itj/2006/volume10i...

    "For example, in a Dual-Processor system, when the OS decides to reduce the frequency of a single core, the other core can still run at full speed. In the Intel Core Duo system, however, lowering the frequency to one core slows down the other core as well."


  • VJ - Tuesday, January 19, 2010

    Additionally, AMD's ZM84 allows each core to operate at a different frequency. The lowest frequency is 575MHz while the highest is 2300MHz.

    I can set one core to 1150MHz with the other set at 2300MHz. This is different from the Intel (Mobile) CPUs I've come across, where a difference in frequency between cores is only possible when one core is (seemingly) operating at its lowest frequency (in a dual core system).

    What is also interesting in the aforementioned cpuinfo output is that only one core is running at its max frequency while all (3) other cores are (seemingly) at their minimum frequency. Considering my previous conjecture on C2 and C0 states, it would be surprising if one can show cpuinfo output where 2 cores are running at max frequency while the other 2 cores are running at any frequency other than max frequency. That shouldn't be possible at all.

  • valnar - Thursday, May 6, 2010

    Does anyone know if this kind of power management for Lynnfield processors is available in Windows 2003?
  • hshen1 - Sunday, June 23, 2013

    This is really a good article for power management researchers like me!!
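
For readers who want to repeat the /proc/cpuinfo check from the first comment above, the per-core "cpu MHz" values can be pulled out with a few lines of Python. This is a minimal sketch: the function name `core_frequencies` and the shortened sample text are illustrative assumptions, and the value is whatever the kernel reports at that instant, which, as noted in the comments, may not reflect what the hardware is really doing.

```python
def core_frequencies(cpuinfo_text: str) -> dict[int, float]:
    """Map each logical processor number to its reported "cpu MHz" value."""
    freqs: dict[int, float] = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            freqs[current] = float(value)
    return freqs


# Shortened sample in the same layout as the output quoted above.
sample = """\
processor : 0
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
cpu MHz : 2497.000

processor : 1
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
cpu MHz : 1998.000
"""

for cpu, mhz in sorted(core_frequencies(sample).items()):
    print(f"processor {cpu}: {mhz:.0f} MHz")
```

On a Linux machine, passing the contents of /proc/cpuinfo instead of `sample` gives one line per logical processor.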
