Raw FPU power: FLOPS

Let us see if we understand floating-point performance better when we look at FLOPS. FLOPS was programmed by Al Aburto and is a very floating-point intensive benchmark. Analysis show that this benchmark contains:

  • 70% floating-point instructions
  • Only 4% branches
  • Only 34% of instructions are memory instructions

Benchmarking with FLOPS is not real world, but isolates the FPU power. Al Aburto states the following about FLOPS:

"Flops.c is a 'C' program which attempts to estimate your systems floating-point 'MFLOPS' rating for the FADD, FSUB, FMUL, and FDIV operations based on specific 'instruction mixes'. The program provides an estimate of PEAK MFLOPS performance by making maximal use of register variables with minimal interaction with main memory. The execution loops are all small so that they will fit in any cache."

FLOPS shows the maximum double precision power that the core has, by making sure that the program fits in the L1 cache. FLOPS consists of eight tests, and each test has a different but well-known instruction mix. The most frequently used instructions are FADD (addition, 1st number), FSUB (subtraction, 2nd number), and FMUL (multiplication, 3rd number). FDIV is the fourth number in the mix. We focused on four tests (1, 3, 4, and 8) as the other tests were very similar.

We compiled three versions:

  • Gcc (GCC) 4.1.2 20070115 (prerelease) (SUSE Linux)
  • Intel C Compiler (ICC) version 10.1 20070913:
    • An x87 version
    • A fully vectorized SSE2 version

Let us restate what we are measuring with FLOPS:

  • Single-threaded double precision floating performance
  • Maximum performance not constrained by bandwidth bottlenecks or latency delays
  • Perfect clock speed scaling

This allows us to calculate the FLOPS per cycle number: we divide the MFLOPs numbers reported by FLOPS by the clock speed. This way we get a good idea of the raw floating-point power of each architecture. First we test with the basic '-O3' optimization.

Don't you love the little FLOPS benchmark? It is far from a real world benchmark, but it tells us so much.

To understand this better, know that the raw speed of the AMD out of order x87 FPU (Athlon K7 and Athlon 64) used to be between 20% and 60% faster clock for clock than any P6 based CPU, including Banias and Dothan. However, this x87 FPU advantage has evaporated since the introduction of the Core architecture as the Core Architecture has a separate FMUL and FADD pipe. (Older P6 architectures could execute one FADD followed by one FMUL, or two FP operations per two clocks at the most). You can see that when we test with a mix of FADD/FSUB and FMUL, the Core architecture is slightly faster clock for clock.

The only weakness remaining in the Core x87 architecture is the FP divider. Notice how even a relatively low percentage of divisions (the 4th number in the mix) kills the performance of our 65nm Xeon. The Opteron 22xx and 23xx are 70% faster (sometimes more) when it comes to double precision FP divisions. However, the new Xeon 54xx closes this gap completely thanks to lowering the latency of a 64-bit FDIV from 32 cycles (Xeon 53xx) to 20 cycles (Xeon 54xx, Opteron 23xx). The Xeon 54xx is only 1% to 5% slower in the scenarios where quite a few divisions happen. That is because the Opterons are capable of somewhat pipelining FDIVs, which allows them to retire one FDIV every 17 cycles. The clock speed advantage of the 45nm Xeon (3.2GHz vs. 2.5GHz maximum at present) will give it a solid lead in x87 performance.

However, x87 performance is not as important as it used to be. Most modern FP applications make good use of the SSE SIMD instruction set. Let us see what happens if we "vectorize" our small FLOPS loops. Remember, FLOPS runs from the L1 cache, so L2 and memory bandwidth do not matter at all.

When it comes to raw SSE performance, the Intel architectures are 3% to 14% faster in the add/subtract/multiply scenarios. When there are divisions involved, Barcelona absolutely annihilates the 65nm Core architecture with up to 80% better SSE performance, clock for clock. It even manages to outperform the newest 45nm Xeon, but only by 8% to 18%. Notice once again the vast improvement from the 2nd generation Opteron to the 3rd generation Opteron when it comes to SIMD performance, ranging from 55% to 150% (!!).

As not all applications are compiled using Intel's high-performance compiler, we also compiled with the popular open source compiler gcc. Optimization beyond "-O3" had little effect on our FLOPS binary, so we limited our testing to one binary.

GCC paints about the same picture as ICC without vectorization. The new AMD Barcelona core has a small advantage, as it is the fastest chip in 3 out of 4 tests. Let us now see if we can apply our FP analyses of LINPACK and FLOPS to real world applications.

Floating-point Analyses (Linux 64-bit) 3ds Max 9 (32-bit Windows)
Comments Locked


View All Comments

  • Regs - Tuesday, November 27, 2007 - link

    I would not expect any from vendors and wholesalers until early next year.

    Matter of fact I wouldn't want one until then anyhow. I would at least wait until B3 stepping.
  • TA152H - Tuesday, November 27, 2007 - link


    From my understanding, x87 is now obsolete and not even supported in x86-64. Can you verify this? I know I had read it, from your article you state that Intel improved it, so I'm not as sure. I had assumed one of AMD's handicaps was the disproportionate, and nearly useless, x87 processing power their processors carried, but now I am not as sure. Is x87 supported in x86-64, and if not, why would Intel increase their x87 capabilities when it's clearly a deprecated technology?
  • JohanAnandtech - Tuesday, November 27, 2007 - link

    The x87 instructions can be used in legacy mode and long mode. But it is true that Scalar SSE instructions are preferred by AMD and Intel.

    x87 performance as many 32 bit programs are still important (look at 3DSMAx 32 bit).

    If Intel's newest Core architecture would not have improved the x87 FP it would probably have looked silly as so many 32 bit programs still use it intensively. Secondly, as you can see, things like the Radix-16 circuitry are used by both the SIMD as the x87 units.
  • Gholam - Tuesday, November 27, 2007 - link

    Do you have any plans to benchmark Opteron vs Xeon in an ESX Server environment?
  • DeepThought86 - Tuesday, November 27, 2007 - link

    This is exactly what I was thinking of too. I want to change my mode of working to run several separate VM's, one for programming, one for Office etc and really want to know how Phenom compares to Q6600 for those uses. Well, this article looks at the server versions of those chips but for VMware the performance might be more comparable than, say, SuperPi 1M benchmarks!
  • DeepThought86 - Tuesday, November 27, 2007 - link

    I forgot to add, since Phenom would presumably also have the nested table support as Barcelona, how much performance improvement would this yield? I'd love to know
  • sht - Tuesday, November 27, 2007 - link

    I was about to ask the same question after reading the concluding

    You may feel for example that using four instances in our SPECjbb test favors AMD too much, but there is no denying that using more virtual machines on fewer physical servers is what is happening in the real world.

    Since the CPUs have features that should accelerate virtualization, it would really be interesting to see how they compete there. My only addition to your request would be to add KVM as host as well (and XEN and what not as well if you care, though I really think only KVM is of interest).
  • JohanAnandtech - Tuesday, November 27, 2007 - link

    Indeed, we are working on that. The software that we described here (http://www.anandtech.com/IT/showdoc.aspx?i=2997&am...">http://www.anandtech.com/IT/showdoc.aspx?i=2997&am... is being adapted to testing virtualized applications. We are also looking into the parameters that can really influence the results of a benchmark on a virtualized server.
  • JohanAnandtech - Tuesday, November 27, 2007 - link

    Indeed, we are working on that. The software that we described here (http://www.anandtech.com/IT/showdoc.aspx?i=2997&am...">http://www.anandtech.com/IT/showdoc.aspx?i=2997&am... is being adapted to testing virtualized applications. We are also looking into the parameters that can really influence the results of a benchmark on a virtualized server.
  • AssBall - Tuesday, November 27, 2007 - link

    Thanks, Johan.

    This has been one of the clearer and better proofread articles I have read here lately. It was interesting, unbiased, and insightful. I am excited to see what you get into for your next project.

Log in

Don't have an account? Sign up now