vApus Mark I: Performance-Critical Applications Virtualized

If you virtualized your datacenter a while ago, chances are that the light loads are already virtualized. What is next? If you have been following the virtualization scene, you will know that the virtualization vendors are actively promoting the virtualization of performance-critical applications. vSphere 4 allows up to 8 vCPUs and up to 255 GB of RAM per VM; XenServer allows 8 vCPUs and 32 GB of RAM. Hyper-V is still lagging, with only 4 vCPUs per VM and a maximum of 16 CPUs (24 with the “Dunnington” hotfix) per host, but that will change in Hyper-V R2. The bottom line is that it is getting attractive to virtualize “heavy duty” applications too, if only to be able to migrate them (“VMotion”, “XenMotion”, “Live Migration”) or manage them more easily.

That is where vApus Mark I comes in: one OLAP database, one OLTP database and two heavy websites are combined in one tile. These are the kind of demanding applications that still got their own dedicated, natively running machine a year ago. vApus Mark I shows what will happen if you virtualize them. If you want to fully understand our benchmark methodology: vApus Mark I has been described in great detail here. We have changed only one thing compared to our previous benchmarking: we used large pages, as this is generally considered a best practice (with RVI and EPT). This increases performance by 4 to 5%.
 
Our other choices remain the same:
  • RVI and EPT are enabled on all VMs if possible
  • HT-Assist is off, unless indicated otherwise

vApus Mark I uses four VMs with four server applications:

- A SQL Server 2008 x64 database running on Windows 2008 64-bit, stress tested by our in-house developed vApus software.
- Two heavy-duty MCS eFMS portals running PHP and IIS on Windows 2003 R2, stress tested by our in-house developed vApus software.
- One OLTP database, based on the Oracle 10g Calling Circle benchmark of Dominic Giles.

The beauty is that vApus (stress-testing software developed by the Sizing Servers Lab) uses actions made by real people (as can be seen in logs) to stress test the VMs, not some benchmarking algorithm. First we look at the results in ESX 3.5 Update 4, at the moment the most popular hypervisor.

Sizing Servers vAPUS Mark I  - ESX 3.5

If you just plug Istanbul into your virtualized server, you can't tell whether you're running a six-core or a quad-core. You might remember from our previous article that a 2.9 GHz Opteron 2389 scored 203. It is pretty disappointing that six cores at 2.6 GHz equal four cores at 2.9 GHz. What went wrong? By default, the VMware ESX 3.5 scheduler logically partitions the available cores into groups of four, called “cells”. The objective is to always schedule a VM on the same cell, so that the VM stays in the same node and socket. This should ensure that the VM always uses local memory (instead of needing remote memory from another node) and, more importantly, that the caches stay “warm”. If you use the default cell size of four cores on a six-core CPU, one or more VMs will be split across two sockets, with lots of traffic going back and forth. Once we increase the cell size from 4 to 6 (see VMware’s knowledge base), the ugly duckling becomes a swan: the six-core Opteron keeps up with the best Xeons available!
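To make the cell problem concrete, here is a minimal sketch in Python. It assumes a simplified model in which cores are numbered socket by socket and cells are cut from that list in consecutive chunks; how ESX 3.5 really lays out its cells is not described in this article, and the function names are purely illustrative.

```python
def build_cells(sockets, cores_per_socket, cell_size):
    """Cut the cores (numbered socket by socket) into consecutive cells.

    Simplified model for illustration only, not the actual ESX algorithm.
    Each core is represented as a (socket, core) tuple.
    """
    cores = [(s, c) for s in range(sockets) for c in range(cores_per_socket)]
    return [cores[i:i + cell_size] for i in range(0, len(cores), cell_size)]


def cells_spanning_sockets(cells):
    """Return the indexes of the cells whose cores sit on more than one socket."""
    return [i for i, cell in enumerate(cells)
            if len({socket for socket, _ in cell}) > 1]


# Dual six-core Opteron: 2 sockets x 6 cores.
for cell_size in (4, 6):
    cells = build_cells(sockets=2, cores_per_socket=6, cell_size=cell_size)
    print(cell_size, len(cells), cells_spanning_sockets(cells))
```

In this model, a cell size of 4 yields three cells, and the middle one contains two cores from each socket, so a 4-vCPU VM placed there straddles both sockets. A cell size of 6 yields two cells that each map onto exactly one socket.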

The Xeon X55xx is, however, somewhat crippled in this case, as ESX 3.5 Update 4 does not support EPT and does not make optimal use of HyperThreading. You can see from our measurements above that HyperThreading improves the score by about 17%. According to our OEM sources, VMmark improves by up to 30% on ESX 4.0, which suggests that ESX 4.0 makes better use of HyperThreading. So let us see some ESX 4.0 numbers!

Server System Based On    OLAP VM   Webportal VM2   Webportal VM3   OLTP VM
Reference                 175.3     45.8            45.8            155.3
Dual Xeon X5570 2.93      103%      50%             51%             95%
Dual Opteron 2435 2.6     91%       43%             43%             90%
Dual Opteron 2377 2.3     82%       36%             35%             53%

Sizing Servers vAPUS Mark I  - ESX 4.0

The Nehalem-based Xeon moves forward, but does not make a huge jump. Performance of the six-core Opteron decreased by 2%, which is inside the error margin of this benchmark. It is still an excellent result for the latest Opteron: this result means it will have no trouble competing with the 2.66 GHz Xeon X5550. VMmark tells us that the latest “Nehalem” Xeon starts to shine when you dump huge numbers of VMs on top of the server, so we decided to test with 8 VMs. It is very unlikely that you will consolidate more than 10 performance-critical applications on one physical server, so we feel that 8 VMs should tell the whole story. We changed only one thing: we decreased the amount of memory for the webportals from 4 to 2 GB, to make sure that the benchmark fits within the maximum of 24 GB that we had on the Xeon X5570. To keep things readable, we have averaged each pair of identical VMs (so OLAP VM = (OLAP VM1 + OLAP VM5)/2).
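The averaging of identical VMs can be sketched as follows. This is only an illustration: it assumes the eight VMs are listed in order VM1 to VM8, with VM1 to VM4 forming the first tile and VM5 to VM8 the second, and the input numbers are placeholders, not measurements.

```python
def average_identical_vms(scores, vms_per_tile=4):
    """Average VM i of every tile with its twin in the other tiles.

    `scores` lists the per-VM results in order VM1, VM2, ...; with two
    tiles of four VMs, VM1 is paired with VM5, VM2 with VM6, and so on.
    """
    tiles = [scores[i:i + vms_per_tile]
             for i in range(0, len(scores), vms_per_tile)]
    return [sum(twins) / len(twins) for twins in zip(*tiles)]


# Placeholder values only (NOT benchmark results): VM1..VM8.
placeholder_scores = [10, 20, 30, 40, 30, 40, 50, 60]
print(average_identical_vms(placeholder_scores))
```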

Server System Based On    OLAP VM   Webportal VM2   Webportal VM3   OLTP VM
Reference                 175.3     45.8            45.8            155.3
Dual Xeon X5570 2.93      79%       34%             32%             47%
Dual Opteron 2435 2.6     71%       23%             23%             38%
Dual Opteron 2377 2.3     76%       19%             19%             28%

vAPUS Mark I 2 tile test  - ESX 4.0

Notice that HT-Assist is a performance killer in 2P configurations: it takes away 1 MB of L3 cache on each of the two CPUs, which is a bad idea with 8 VMs hitting them. It is interesting to see that the Xeon X5570 starts to break away as we increase the number of VMs: the dual Xeon X5570 is about 30% faster than the dual Opteron 2435. This gives us a clue as to why the VMmark scores are so extreme: the huge number of VMs might, for example, overemphasize world switch times. But even with light loads, it is very rare to find more than 20 VMs on top of a dual-processor (DP) server.
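For readers who want to check the 30% figure against the per-VM percentages of the 2-tile table, here is a small sketch. Combining the four percentages with a geometric mean is an assumption made for this illustration; this page does not state how the per-VM results are folded into one score.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return prod(values) ** (1.0 / len(values))


# Per-VM results (percent of the native reference) from the 2-tile table:
# OLAP VM, Webportal VM2, Webportal VM3, OLTP VM.
two_tile = {
    "Dual Xeon X5570 2.93": [79, 34, 32, 47],
    "Dual Opteron 2435 2.6": [71, 23, 23, 38],
}

xeon = geometric_mean(two_tile["Dual Xeon X5570 2.93"])
opteron = geometric_mean(two_tile["Dual Opteron 2435 2.6"])
advantage = xeon / opteron - 1.0

print(f"Xeon advantage over the Opteron 2435: {advantage:.1%}")
```

Under that assumption, the Xeon's advantage lands very close to the roughly 30% quoted above.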

There is more. In the 2-tile test the ESX scheduler has to divide 16 logical CPUs among 32 vCPUs. That is a lot easier than dividing 12 physical cores among 32 vCPUs, which might create co-scheduling issues on the six-core Opteron.
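The difference is easy to quantify. The sketch below simply divides the number of vCPUs by the number of CPUs the scheduler can place them on; the function name is illustrative, and the ratio says nothing about how the ESX scheduler actually co-schedules vCPUs.

```python
def vcpus_per_cpu(total_vcpus, schedulable_cpus):
    """How many vCPUs compete, on average, for each schedulable CPU."""
    return total_vcpus / schedulable_cpus


total_vcpus = 32              # 2 tiles x 4 VMs x 4 vCPUs
xeon_logical_cpus = 2 * 4 * 2   # 2 sockets x 4 cores x 2 threads (HyperThreading)
opteron_cores = 2 * 6           # 2 sockets x 6 cores

print(f"Xeon X5570:   {vcpus_per_cpu(total_vcpus, xeon_logical_cpus):.2f} vCPUs per logical CPU")
print(f"Opteron 2435: {vcpus_per_cpu(total_vcpus, opteron_cores):.2f} vCPUs per core")
```

The Xeon ends up with exactly two vCPUs per logical CPU, while the six-core Opteron has to juggle about 2.67 vCPUs per core.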

So our 2-tile test was somewhat “biased” towards the Xeon X5570.

We reduced the number of vCPUs on the webportal VMs from 4 to 2. That means that we have:

- Two times 4 vCPUs for the OLAP test
- Two times 4 vCPUs for the OLTP test
- Four times 2 vCPUs for the webportal test

Or a total of 24 vCPUs. This test is thus biased towards the “Istanbul” processor. Remember that our reference score was based on a “native” score obtained with 4 CPUs, so we adjusted the reference score of the webportals to one obtained with 2 native CPUs. The reference scores for the OLTP and OLAP tests remained unchanged. The results below are not comparable with the ones you have seen so far; this is an experiment to understand our scores better. To keep things readable, we have averaged each pair of identical VMs (so OLAP VM = (OLAP VM1 + OLAP VM5)/2).
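The vCPU totals of both 2-tile configurations can be tallied as below. The sketch assumes two webportal VMs per tile, so four in the 2-tile test, which is what the totals of 32 and 24 vCPUs imply; the dictionary layout is just an illustration.

```python
def total_vcpus(layout):
    """Sum vCPUs over a layout of {workload: (number_of_vms, vcpus_per_vm)}."""
    return sum(vm_count * vcpus for vm_count, vcpus in layout.values())


# 2-tile test: each tile holds one OLAP VM, one OLTP VM and two webportal VMs.
original_layout = {"OLAP": (2, 4), "OLTP": (2, 4), "webportal": (4, 4)}
reduced_layout = {"OLAP": (2, 4), "OLTP": (2, 4), "webportal": (4, 2)}

print("original 2-tile test:", total_vcpus(original_layout), "vCPUs")
print("reduced 2-tile test: ", total_vcpus(reduced_layout), "vCPUs")
```

With 24 vCPUs, the twelve Opteron cores each get exactly two vCPUs, which is why this variant leans towards “Istanbul”.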

Server System Based On    OLAP VM   Webportal VM2   Webportal VM3   OLTP VM
Reference                 175.3     45.8            45.8            155.3
Dual Xeon X5570 2.93      82%       53%             53%             43%
Dual Opteron 2435 2.6     81%       38%             38%             44%

 

vAPUS Mark I 2 tile test - 24 vCPUs - ESX 4.0

The result is that the Nehalem-based Xeon is once again only 11% faster. So remember that the relation between the number of vCPUs and the cell size matters a great deal when you are dealing with MP virtual machines. We expect that the number of VMs with more than one vCPU will increase as time goes by.

40 Comments

  • solori - Tuesday, June 2, 2009 - link

    I should have said "abundant (cheap) memory."
  • mkruer - Monday, June 1, 2009 - link

    I am disappointed that you did not bench X5550 vs 2435. This is the chip that the Opteron 2435 was designed to go up against, not the X5570 which is clocked 300MHz higher and 40% more expensive. Heaven forbid that you try to include chips at the same price point. That being said other sites that did compare based upon price, and not top of the line, show that the Opteron 2435 is indeed comparable to the X5550 at the same price point and speed. Now if AMD can up the speed of the hex core, then it will be a more direct comparison to the X5570. The X5570 is 50% faster but it is also >50% more in cost.
  • mino - Wednesday, June 3, 2009 - link

    Right.

    Actually, I have no qualms with comparing the best with the best, but the commentary is mostly out-of-place.
    I guess this was written after 3 days without sleep, but anyway.

    After an excelent vAPUS Mark 1 article I would expect better that old-school style:
    "1000 $ Pentium 4 3.2 EE is clearly (15%) better than $400 Athlon 3200+ so Athlon is clearly a piece of junk. Well maybe for games not so much but generally it is a piece of junk."

    Thank god the numbers tell their own story.
  • JohanAnandtech - Wednesday, June 3, 2009 - link

    It seems that some people like to create the impression that we did not take into account that both CPUs were not at the same pricing.

    However:

    http://it.anandtech.com/IT/showdoc.aspx?i=3571&...
    [quote]"However, as the Opteron 2435 competes with 2.66 GHz Xeon and not the Xeon 2.93 GHz, this is the first benchmark where “Istanbul” is competitive."[/quote]

    http://it.anandtech.com/IT/showdoc.aspx?i=3571&...
    [quote]"The Nehalem-based Xeon moves forward, but does not make a huge jump. Performance of the six-core Opteron was decreased by 2%, which is inside the error margin of this benchmark. It is still an excellent result for the latest Opteron: this results means it will have no trouble competing with the 2.66 Ghz Xeon X5550. "
    [/quote]

    http://it.anandtech.com/IT/showdoc.aspx?i=3571&...
    [quote]"The new Opteron 2435 at 2.6 GHz was a pleasant surprise on vApus Mark I: it keeps up with more expensive Xeons on ESX 3.5 update 4 while consuming less, and offers a competitive performance/watt and performance/price ratio on vSphere 4. The six-core Opteron is about 11 to 30% slower on vSphere 4 than the 2.93 GHz Xeon X5570 but the overall cost of the Istanbul platform is significantly lower (DDR-2 versus DDR-3) and the 2.6 GHz 2435 consumes less power in a virtualized environment "
    [/quote]

    And I have confidence that the vast majority of my readers are intelligent people who can decrease the benchmarks with 8 to 10% to see what a Xeon x5550 would do
  • mino - Thursday, June 4, 2009 - link

    No, I do not like that, nor like to create such an impression.

    The article presents the numbers reasonably well for me. It is just that your (justified) love for Nehalem is glowing through and many, many comments were out of place.
    I believe this was not intentional but cause by your love for the Nehalem platform which is otherwise great.

    All the numbers tell one thing - Istanbull is generally on par with Nehalem clock for clock +- 10% depending on the workload.

    About that glowiong love for Nehalem:
    >>>MCS eFMS 9.2
    "A single 8-thread Xeon X55xx is by far the best choice here."

    Why ? There is no 1*2435 number.
    Based on the numbers published single 2435 will get about 55-58rps which for all practical needs is identical performance to _flagship_ Nehalem.

    >>>3ds Max 2008 32b
    "We are sure that there are probably more efficient render engines out there, but it is simply not a market the AMD six-core should cater to. Nehalem-based Xeons are simply way too powerful for this kind of application. Render engines scale almost perfectly with clockspeed. So if cost is your main concern, consider the Xeon E5520 at 2.26 GHz, the cheapest CPU that still supports HT. We will test this one soon, but we expect it to deliver 67 frames per hour, which is still more than 20% better than any Opteron."

    OK, so first bash(rightfully) the application fo it rigid resource use pattern, than say that for Nehalem is "way too powerfull for this KIND of application" for Opteron to compete with.
    You managed to contradict your own reasoning to promote Nehalem for rendering while the numbers speak about single improperly optimized app.
    Which it is pretty certain SW vendor will take care of in due time. These numbers are just a result of no (affordable) 6-core presence on the market up to now.

    By these 2 comments you took the article balance from "Instanbul is generally about 5% slower per_clock than Nehalem, in certain apps it is on par or better while in other loses about 15%" - which is what the numbers tell - to "Instanbul is good for VMware, forget about it elsewhere".

    Which is about as much bad publicity you could give to the second fastest CPU on the market by_large_margin.

    Fact is, at a given price, Nehalem box is ALMOST IDENTICAL performance-wise to Istanbul box. While both crush everything else on the market by 30+ %.
  • lopri - Monday, June 1, 2009 - link

    Page 2, "..The most recent data is however in CPU’s L2-cache" I think you meant CPU #2?
  • JohanAnandtech - Monday, June 1, 2009 - link

    Yes, good catch. Fixed the issue.
  • classy - Monday, June 1, 2009 - link

    I skipped right to the virtualization portions. It is by far becoming the most dominate criteria for most of the IT world. The 6 core opty looks solid there, so it will come down to price. Now with the quickly developing virtual desktop infrastructures, how well a platform does virtualization makes it just two fold more important. Many folks have already virtualized mission critical apps. I know we're doing exchange in the near future. The days of seperate physical servers and desktops are going the way of the dodo bird. Its becoming all about virtualization.
  • genkk - Tuesday, June 2, 2009 - link

    why power consumption not shown here....the bench mark guys in anandtech lost the papers...or they don't want you to see

    any way go to techreport.com where istanbul wins
  • JohanAnandtech - Tuesday, June 2, 2009 - link

    More detailed power consumption numbers will be available in the next review.
