vApus Mark I: Performance-Critical Applications Virtualized

If you virtualized your datacenter a while ago, chances are that the light loads are already virtualized. What is next? Well, if you have been following the virtualization scene, you'll know that the virtualization vendors are actively promoting that you should virtualize your performance-critical applications too. vSphere 4 allows up to 8 vCPUs and up to 255 GB of RAM per VM, XenServer 8 vCPUs and 32 GB of RAM. Hyper-V is still lagging with only 4 vCPUs per VM and a maximum of 16 logical CPUs (24 with the "Dunnington" hotfix) per host, but that will change in Hyper-V R2. The bottom line is that it is getting attractive to virtualize "heavy duty" applications as well, if only to be able to migrate them ("VMotion", "XenMotion", "Live Migration") or manage them more easily.

That is where vApus Mark I comes in: one OLAP database, one OLTP database and two heavy websites are combined in one tile. These are the kinds of demanding applications that would still have had their own dedicated, natively running machine a year ago. vApus Mark I shows what happens if you virtualize them. If you want to fully understand our benchmark methodology, vApus Mark I has been described in great detail here. We have changed only one thing compared to our previous benchmarking: we used large pages, as this is generally considered a best practice (with RVI and EPT). It increases performance by 4 to 5%.
 
Our other choices remain the same:
  • RVI and EPT are enabled on all VMs if possible
  • HT-Assist is off, unless indicated otherwise

vApus Mark I uses four VMs with four server applications:

- A SQL Server 2008 x64 database (OLAP) running on Windows Server 2008 64-bit, stress tested by our in-house developed vApus software.
- Two heavy-duty MCS eFMS web portals running PHP on IIS on Windows Server 2003 R2, stress tested by our in-house developed vApus software.
- One OLTP database, based on Dominic Giles' Oracle 10g "Calling Circle" benchmark.

The beauty is that vApus (stress-testing software developed by the Sizing Servers Lab) stress tests the VMs by replaying actions performed by real people (as recorded in logs), not by running some synthetic benchmarking algorithm. First we look at the results on ESX 3.5 Update 4, at the moment the most popular hypervisor.

Sizing Servers vApus Mark I - ESX 3.5

If you just plug Istanbul into your virtualized server, you can't tell whether you're running a six-core or a quad-core. You might remember from our previous article that a 2.9 GHz Opteron 2389 scored 203. It is pretty disappointing that six cores at 2.6 GHz merely equal four cores at 2.9 GHz. What went wrong? By default, the VMware ESX 3.5 scheduler logically partitions the available cores into groups of four, called "cells". The objective is to always schedule a VM on the same cell, thereby making sure that the VM stays within the same node and socket. This should ensure that the VM always uses local memory (instead of needing remote memory from another node) and, more importantly, that the caches stay "warm". With the default cell size of four cores, one or more VMs will be split across two sockets, with lots of traffic going back and forth. Once we increase the cell size from 4 to 6 (see VMware's knowledge base), the ugly duckling becomes a swan: the six-core Opteron keeps up with the best Xeons available!
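A toy sketch makes the cell-size mismatch concrete. The core numbering below is illustrative (the real ESX scheduler is far more involved), but it shows why cells of four cores cannot line up with six-core sockets:

```python
# Toy model: a dual-socket six-core "Istanbul" box has cores 0-11,
# with cores 0-5 on socket 0 and cores 6-11 on socket 1.
def cells(total_cores, cell_size):
    """Partition core IDs into scheduler cells of `cell_size` cores."""
    cores = list(range(total_cores))
    return [cores[i:i + cell_size] for i in range(0, total_cores, cell_size)]

def socket_of(core, cores_per_socket=6):
    """Which socket a given core ID belongs to."""
    return core // cores_per_socket

# Default cell size of 4: the middle cell spans both sockets.
for cell in cells(12, 4):
    print(cell, "-> sockets", sorted({socket_of(c) for c in cell}))
# [4, 5, 6, 7] touches sockets 0 AND 1: a VM scheduled on that cell
# crosses the socket boundary (remote memory, cold caches).

# Cell size of 6: every cell maps exactly onto one socket.
for cell in cells(12, 6):
    print(cell, "-> sockets", sorted({socket_of(c) for c in cell}))
```

With a cell size of six, each cell equals one socket, which is exactly why the knowledge-base tweak restores Istanbul's performance.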

The Xeon X55xx is somewhat crippled in this case, however, as ESX 3.5 Update 4 does not support EPT and does not make optimal use of Hyper-Threading. You can see from our measurements above that Hyper-Threading improves the score by about 17%. According to our OEM sources, VMmark improves by up to 30% on ESX 4.0, which shows that ESX 4.0 makes better use of Hyper-Threading. So let us see some ESX 4.0 numbers!

Server System Based On     OLAP VM   Webportal VM2   Webportal VM3   OLTP VM
Reference (native)         175.3     45.8            45.8            155.3
Dual Xeon X5570 2.93       103%      50%             51%             95%
Dual Opteron 2435 2.6      91%       43%             43%             90%
Dual Opteron 2377 2.3      82%       36%             35%             53%

Sizing Servers vApus Mark I - ESX 4.0

The Nehalem-based Xeon moves forward, but does not make a huge jump. Performance of the six-core Opteron decreased by 2%, which is within the error margin of this benchmark. It is still an excellent result for the latest Opteron: this result means it will have no trouble competing with the 2.66 GHz Xeon X5550. VMmark tells us that the latest Xeon "Nehalem" starts to shine when you dump huge numbers of VMs on top of the server, so we decided to test with 8 VMs. It is very unlikely that you will consolidate more than 10 performance-critical applications on one physical server, so we feel that 8 VMs should tell the whole story. We changed only one thing: we decreased the amount of memory given to the web portals from 4 to 2 GB, to make sure that the benchmark fits within the maximum of 24 GB that we had on the Xeon X5570. To keep things readable, we averaged each pair of identical VMs (so OLAP VM = (OLAP VM1 + OLAP VM5)/2).
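The scoring described above can be summarized in a few lines. Each VM's throughput is expressed as a percentage of the native reference score of the same application, and in 2-tile runs the two identical VMs are averaged. The throughput value below is illustrative, chosen to reproduce one percentage from the tables in this article:

```python
# Native reference scores per application (from the tables in this article).
reference = {"OLAP": 175.3, "Web2": 45.8, "Web3": 45.8, "OLTP": 155.3}

def relative_score(throughput, ref):
    """Express a VM's measured throughput as a percentage of the native reference."""
    return 100.0 * throughput / ref

def average_tiles(score_vm_a, score_vm_b):
    """In a 2-tile run, each pair of identical VMs is averaged:
    e.g. OLAP VM = (OLAP VM1 + OLAP VM5) / 2."""
    return (score_vm_a + score_vm_b) / 2

# Example: a (hypothetical) OLAP throughput of 138.5 against the 175.3 reference
pct = relative_score(138.5, reference["OLAP"])
print(round(pct))  # 79, i.e. the kind of "79%" figure shown in the 2-tile table
```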

Server System Based On     OLAP VM   Webportal VM2   Webportal VM3   OLTP VM
Reference (native)         175.3     45.8            45.8            155.3
Dual Xeon X5570 2.93       79%       34%             32%             47%
Dual Opteron 2435 2.6      71%       23%             23%             38%
Dual Opteron 2377 2.3      76%       19%             19%             28%

vApus Mark I 2-tile test - ESX 4.0

Notice that HT Assist is a performance killer in 2P configurations: you remove two times 1 MB of L3 cache, which is a bad idea with 8 VMs hitting your two CPUs. It is interesting to see that the Xeon X5570 starts to break away as we increase the number of VMs: it is about 30% faster than the dual Opteron 2435. This gives us a clue as to why the VMmark scores are so extreme: the huge number of VMs might overemphasize world switch times, for example. But even with light loads, it is very rare to find more than 20 VMs on top of a DP server.

There is more. In the 2-tile test the ESX scheduler has to divide 32 vCPUs among 16 logical CPUs. That is a lot easier than dividing 32 vCPUs among 12 physical cores, which might create co-scheduling issues on the six-core Opteron.

So our 2-tile test was somewhat “biased” towards the Xeon X5570.

We reduced the number of vCPUs on the web portal VMs from 4 to 2. That means that we have:

- Two times 4 vCPUs for the OLAP test
- Two times 4 vCPUs for the OLTP test
- Four times 2 vCPUs for the web portal test

Or a total of 24 vCPUs. This test is thus biased towards the "Istanbul" processor. Remember that our reference score was based on a native score with 4 CPUs, so we adjusted the reference score of the web portals to one obtained with 2 native CPUs. The reference scores for the OLTP and OLAP tests remained unchanged. The results below are therefore not comparable with the ones you have seen so far; this is an experiment to understand our scores better. To keep things readable, we averaged each pair of identical VMs (so OLAP VM = (OLAP VM1 + OLAP VM5)/2).
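As a sanity check on the vCPU count and on the resulting overcommit ratios (a trivial sketch, using only the configuration described above):

```python
# Per tile: one OLAP VM (4 vCPUs), one OLTP VM (4 vCPUs)
# and two web portal VMs (2 vCPUs each, after the reduction from 4 to 2).
tiles = 2
vcpus_per_tile = 4 + 4 + 2 * 2
total_vcpus = tiles * vcpus_per_tile
print(total_vcpus)  # 24

# vCPU-to-CPU ratios on the two test systems:
print(total_vcpus / 16)  # dual Xeon X5570: 16 logical CPUs -> 1.5
print(total_vcpus / 12)  # dual Opteron 2435: 12 physical cores -> 2.0
```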

Server System Based On     OLAP VM   Webportal VM2   Webportal VM3   OLTP VM
Reference                  175.3     45.8            45.8            155.3
Dual Xeon X5570 2.93       82%       53%             53%             43%
Dual Opteron 2435 2.6      81%       38%             38%             44%


vApus Mark I 2-tile test - 24 vCPUs - ESX 4.0

The result is that the Nehalem-based Xeon is once again only 11% faster. So it is important to remember that the relation between the number of vCPUs and the cell size is quite important when you are dealing with MP virtual machines. We expect the number of VMs with more than one vCPU to increase as time goes by.


40 Comments

  • duploxxx - Wednesday, June 3, 2009 - link

    ESX 4 should add IOMMU support to the AMD Istanbul platform; not sure how far this is implemented in the beta ESX 4 builds.

    Are you using the paravirtualized SCSI driver in the new ESX 4 platform? I would expect bigger differences between 3.5 and 4, and not just because EPT is included in ESX 4 together with enhanced HT.

    For the rest, a very good, thorough review.

    The only thing I always miss in reviews is that although it is good to test the fastest out there, it is nowhere near the most deployed platform. You should rather look at the 5520-5530 against the 2387-2431 as the mid-range platforms that will be deployed in a wide range of systems; these have a much healthier performance/price/power balance than the top bin. Even the 5570 is not supported in all OEM platforms because of its TDP range.
  • Adul - Monday, June 1, 2009 - link

    I do not see Oracle running on top of Windows all that often; it is normally running on some *nix OS. How about running the same benchmark on, say, RHEL instead?
  • InternetGeek - Monday, June 1, 2009 - link

    There's actually an odd bug in Oracle's DB that makes it run faster on Windows than on Linux. Search the internet and you'll find info about it.

    On the other hand, in my now 9 years in the IT industry I've only come across one Oracle DB running on HP-UX. Everything else (Sybase, MySQL, etc.) runs on Windows.
  • LizVD - Friday, June 5, 2009 - link

    Could you provide us with a link for that? I'd like to see if this "bug" corresponds with the behaviour we're seeing in our tests.
  • Nighteye2 - Monday, June 1, 2009 - link

    You give a good description of how HT Assist works and how much benefit it brings, but then you benchmark only dual-socket servers?

    It would be fairer to also test and compare octo-socket servers, to see the real impact of that HT Assist feature.
  • phoenix79 - Monday, June 1, 2009 - link

    Completely agreed (I was typing up a comment about this too when yours popped up)

    I'd love to see some 4-way VMWare scores
  • ltcommanderdata - Monday, June 1, 2009 - link

    Yes. Nehalem is in a great position in the DP market, but isn't yet available in MP. It'd be great to see six-core Dunnington and six-core Istanbul go head to head. Conveniently, their highest models have similar clock speeds at 2.66 GHz and 2.6 GHz respectively, although Dunnington would be a lot more power hungry and, though I don't remember the prices, probably more expensive too.
  • JohanAnandtech - Tuesday, June 2, 2009 - link

    Dunnington vs. Istanbul coming up... but we are going to take some time to address the shortcomings of this "deadline" article, such as better power consumption readings.
  • solori - Monday, June 1, 2009 - link

    "Notice that HT-assist is a performance killer in 2P configurations: you remove two times 1 MB of L3-cache, which is a bad idea with 8 VM’s hitting your two CPUs."

    BIOS guidance suggests that HT Assist be disabled by default on 2P systems, and enabled only for specialized workloads. So that begs the question: were the vApus tests performed with or without HT Assist in the 2P configuration? It was not clear.

    I assume AMD-V and RVI were enabled for ALL workloads in ESX 3.5 and 4.0 (forced for 32-bit workloads.) Is this accurate? Based on the number of ESX 3.5 installations out there, this probably should be clearly stated...

    I do want to take issue with your memory sizing and estimates on vCPU loading. Let me put it this way: while Nehalem-EP has better memory bandwidth and SMT threads, Opteron has access to abundant memory. Therefore, it does not make sense - for example - to be OK with enabling SMT but then constrain the benchmark to 24GB due to a Xeon memory limitation.

    I would urge you to look at 48GB configurations on Xeon and Istanbul for your comparison systems. By the way, in consolidation numbers, this makes a significant reduction in $/VM with only a minor increase in per-system CAPEX.

    Another interesting issue you touched on is tuning and load balance. Great job here. These are "black magic" issues that - as you noted - can have serious effects on virtualization performance (ok, scheduling efficiency.) Knowing your platform's balance point(s) is critical to performance sensitive apps but not so critical for light-load virtualization (i.e. not performance sensitive.)

    It sounds like you're learning - through experimentation with vApus - that virtualization testing does not predict similar results from "similarly configured machines" where performance testing is concerned. In fact, the "right balance" of VMs, memory and vCPU/CPU loading for one system may be on the wrong side of the inflection point for another.

    All in all, a very good article.
  • JohanAnandtech - Tuesday, June 2, 2009 - link

    "this probably should be clearly stated... "

    Good suggestion. I adapted the article. RVI and EPT are always on if possible (so also for 32-bit). HT Assist is always on "Auto" (so off) unless we indicate otherwise.

    "Therefore, it does not make sense - for example - to be OK with enabling SMT but then constrain the benchmark to 24GB due to a Xeon memory limitation. "

    1) You must know that vApus Mark I allocates more memory than the web portals really need. They can run without any performance loss in 2 GB, even 1 GB. So as we move up in the number of tiles we run, it is best to reclaim the wasted memory.

    2) I agree that a price comparison should include copious amounts of memory (48 GB or so).

    3) We don't have more than 24 GB DDR-3 available right now. It would be unfair to force the system to swap in a performance comparison.

    "Opteron has access to abundant memory". What do you mean by this? Typical 2P Opterons have 64 GB, 2P Nehalems 72 GB as upper limit?

    "In fact, the "right balance" of VM's, memory and vCPU/CPU loading for one system may be on the wrong side of the inflection point for another"

    Great comment. Yes, that makes it even more complex to compare two systems. That is why we decided to show 2 datapoints for the 2 tile systems.

    Collin, thanks for the excellent comments. It is very rewarding to notice that people take the time to dissect our hard work, even if that means you find wrinkles that we have to iron out. Great feedback.
