Understanding the VMmark Score

Before we try to demystify the published VMmark scores, let me state upfront that the VMmark benchmark has its flaws, but we know from firsthand experience how hard it is to build a decent virtualization benchmark. It would be unfair and arrogant to call VMmark a bad benchmark. The benchmark first arrived back in 2006. The people at VMware were pioneers and solved quite a few problems, such as running many applications simultaneously and distilling one score out of many different benchmarks, each reporting results in different units. The benchmark results are consistent and the mix of applications more or less reflects the real world.

Let's refresh your memory: VMware VMmark is a benchmark of consolidation. It consolidates several virtual machines performing different tasks, creating a tile. A VMmark tile consists of:

  • MS Exchange VM
  • Java App VM
  • Idle VM
  • Apache web server VM
  • MySQL database VM
  • SAMBA fileserver VM

The first three run on a Windows 2003 guest OS and the last three run on SUSE SLES 10.


Now let's list a few of its flaws:

  • The six virtual machines in one tile, applications included, only need 5GB of RAM. Most e-mail servers running right now will probably use 4GB or more on their own! The vast majority of MySQL database servers and Java web servers have at least 2GB at their disposal.
  • It uses SPECjbb, which is an "easy to inflate" benchmark. Published SPECjbb scores are obtained with extremely aggressively tuned JVMs, the kind of tuning you won't find on a typical virtualized, consolidated Java web server.
  • SysBench only performs transactions on one table and is thus an oversimplified OLTP test.
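To put the memory remark in numbers, here is a minimal sketch that uses only the figures quoted in the list above. The dictionary of "typical" minimums is our own rough estimate from the text, not a measured value, and the names are purely illustrative.

```python
# Memory figures quoted in the text, in GB.
TILE_MEMORY_GB = 5   # what one complete VMmark tile (six VMs) needs
VMS_PER_TILE = 6

# Rough "typical" minimums from the text (estimates, not measurements).
TYPICAL_MIN_GB = {"e-mail server": 4, "MySQL server": 2, "Java web server": 2}

# Three typical servers alone, before counting the other three VMs.
typical_three_gb = sum(TYPICAL_MIN_GB.values())
avg_per_vm_gb = TILE_MEMORY_GB / VMS_PER_TILE

print(f"Average memory per VM in a tile: {avg_per_vm_gb:.2f} GB")
print(f"Three typical servers alone: {typical_three_gb} GB "
      f"versus {TILE_MEMORY_GB} GB for the whole tile")
```

Even under these conservative estimates, three of the six workloads would already need more memory than VMmark gives the entire tile.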

Regarding our SysBench remark: good OLTP benchmarks are very hard to build, so we also use SysBench and we are very grateful for the efforts of Alexey Kopytov. SysBench is in many cases close enough for native situations. The problem is that some effects that a real world OLTP database has on a hypervisor (such as network connections and complex buffering that requires a lot more memory management) may not show up if you run a benchmark on such an oversimplified model.

The VMmark benchmark is also starting to show its age with its very low memory requirements per server. To limit the amount of development time, the creators of VMmark also went with some industry standard benchmarks, which have been starting to lose their relevance as vendors have found ways to artificially inflate the scores. VMmark needs an update, but as VMware is involved in the SPEC Virtualization Committee to develop a new industry standard virtualization benchmark, it does not make sense to further develop VMmark.

The easiest way to see that VMmark is showing its age is in the consolidation ratio of the VMmark runs. Dual CPU machines are consolidating 8 to 17 tiles. With six virtual machines per tile, one of which is idle, a dual CPU system at 17 tiles is running 102 virtual machines, of which 85 are actively stressed! How many dual CPU machines have you seen that even operate half that many virtual machines?
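The arithmetic behind those numbers is simple enough to sketch. The snippet below assumes only the tile layout described earlier (six VMs per tile, one of them idle); the function name is hypothetical.

```python
VMS_PER_TILE = 6        # Exchange, Java, idle, Apache, MySQL, Samba
IDLE_VMS_PER_TILE = 1   # the idle VM is counted but not stressed


def vm_counts(tiles):
    """Return (total VMs, actively stressed VMs) for a given tile count."""
    total = tiles * VMS_PER_TILE
    active = tiles * (VMS_PER_TILE - IDLE_VMS_PER_TILE)
    return total, active


for tiles in (8, 17):
    total, active = vm_counts(tiles)
    print(f"{tiles} tiles: {total} VMs, {active} actively stressed")
```

At the low end of the published range (8 tiles) that is 48 VMs with 40 under load; at the high end (17 tiles) it is the 102 and 85 mentioned above.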

That said, we'll have to work with VMmark until something better comes up. That brings up two questions. How can you spot reliable and unreliable VMmark scores? Can you base decisions on the scores?

23 Comments

  • duploxxx - Sunday, May 10, 2009 - link

    time for some OEM beta hardware :)
  • has407 - Saturday, May 9, 2009 - link

    With a single socket (assuming quad core), I'd think you should be able to do it with less memory, maybe 32-48GB? (unless your IO subsystem is slow) Even if that's beyond reach, a relative measure with a smaller number of tiles might be interesting (ok, it won't be strictly by-the-VMark-book).
  • tshen83 - Friday, May 8, 2009 - link

    So,I take it that you want to discredit VMmark as being a relevant benchmark for virtualization?

    You know VMmark isn't the only benchmark that says Nehalem is twice as efficient in performance/watt than Shanghais right?

    In the last paragraph, you said "give us a few more days". To do what? To selectively choose a few benchmarks that show that Shanghai is a better CPU for virtualization workloads? Good luck with that.

    Sometimes I want to find a ruler and measure just how deep you stuck your head up AMD's rear end. Sometimes I also wonder why a Belgian is so freaking adamant about AMD. Anand got too cheap I guess to outsource an important job to Belgium I guess.
  • 7Enigma - Monday, May 11, 2009 - link

    What I find so funny about this post is that since the Nehelem launch the author has chronically been labeled pro-Intel in the comments section of the majority of his articles. Just goes to show you; you can please all of the people some of the time, some of the people all of the time, and in the case of Intel/AMD, none of the people all of the time. :)
  • whatthehey - Friday, May 8, 2009 - link

    Or perhaps he just has some information on a new virtualization benchmark suite, which may or may not show Shanghai in a better light.

    I think it's pretty easy to conclude that the two year old design of VMmark is aging and not as relevant as when it first came out (if it was even truly relevant then). So let's wait a few more days, eh?

    Sometimes I want to pull out a ruler to measure just how far up Intel's ass tshen83 has shoved his head so that he can't even consider any viewpoint that doesn't state that Intel is unequivocally the best. Seriously, look at any AMD or Intel article, and he's there espousing the virtues of Intel and trashing everything AMD does. It's not all black and white, dude... except when you get paid by Intel to do what you do, of course.
  • Viditor - Monday, May 11, 2009 - link

    My own guess is that tshen83 has become a 100% Intel ass, the 2 things have merged in this reality...:)

    I would like to see how things compare on the larger boxes though. There are an awful lot of 4 and 8 way VM machines going out there right now...
  • noxipoo - Friday, May 8, 2009 - link


    Plenty of companies have old servers that doesn't need much. NT4 to 2000 servers can easily range to those numbers at most corporations.
  • JohanAnandtech - Saturday, May 9, 2009 - link

    Possible, but still an exception. Windows 2000 and NT4 servers have become a minority, probably less than 5% of the installed base.

    And you are probably not too concerned about CPU performance when consolidating those servers.
  • duploxxx - Friday, May 8, 2009 - link

    Finally some good article to breach this new vmmark scores.

    Although it is clear that the new Nehalem based system is better then current shanghai this is mostly due to the 3 mem controllers (which in the end provides more mem/cpu) and faster memory. The HT feature is the main VMmark whoop score cause here, it is already stated by many Vmware performance representatives that people should take care about the HT core as a real core in production, if you do this the performance will get bad just as previous HT, although the ESX sw is no much more aware of this feature (esx 3.5u4 and esx4), but is seems like the vmmark is not able to see the difference since there is not enough load on the system.

    All other features are now equal while shanghai switching time was way better then harpertown the nehalem is more or less equal, also the ept/npt or rapid V or whatever you want to call it is now implemented.

    so a final vmmark performance score you stated around 16-17 sounds very reasonable.

    the performance enhancements in esx4 are not really for HT rather the core coherency features like vmware wants to call this, iommu which will be first introduced by amd istanbul and most important the paravirtualized scsi driver and off course more cpu/vm and a lot of memory, scheduling improvement.....etc....

    Perhaps you should contact there are aware of this VMmark real world difference and are working on a new version.
  • lopri - Friday, May 8, 2009 - link

    Yet again from Johan. Johan never disappoints! I have just had a quick read, but I will take a thorough read later. Thank you much and I'd like the follow-up articles very much, too.