IT @ AnandTech.com Blogs


  November 17, 2009

vApus for Open Source: Creating a virtualized stress test
blog post by Liz van Dijk
If you've been keeping up with our articles for a while, you might have picked up on vApus Mark I: the virtualized stress test we created for internal use at the Sizing Servers testlab.

As detailed in Johan's article, this bench consists of 3 separate applications, all of which we are very familiar with due to extensive optimization and stress testing efforts. Although we believe the results published based on this bench speak for themselves, the problem remained that it was impossible for anyone outside our lab to verify them, seeing as two of the three applications used were owned by private companies and were entrusted to our lab under rather strict conditions (permission to distribute them to the rest of the world sadly not being among them).

Secondly, since vApus M1 is a bench that focuses on fairly heavy VM's, we felt the need to create another point of reference: one that will back up the results of the original, but with a completely different mix of VM's.

Thus began the process of creating vApus For Open Source, or vApus FOS, as we like to call it in the lab.

The idea behind vApus FOS is that the VM's can be freely distributed to any vendors that wish to verify our results, and our lab can provide a version of the actual in-house developed vApus benching software to generate the load.

I am happy to say that the preliminary 1-tile testing for this new benchmark has just completed, and so far everything has been running quite smoothly. The results are reproducible, the VM's stable... looks like our 4-tile (16 VM's in total) testing can begin!

The fun part is that a lot of the ideas we incorporated into the new setup we owe to you, our readers! Thanks to the feedback we got on vApus M1, we were able to combine some new workloads into an interesting mix:

As it stands, one tile consists of 4 VM's, all of which run a basic, minimal CentOS 5.4.

VM1 runs an Apache webserver and MySQL database, hosting a phpbb3-forum. The VM is given 2 vCPU's and 1GB RAM.

VM2 runs the same setup as VM1, but is only given 1 vCPU.

VM3 runs a fully configured mailserver using Postfix, Courier and a Squirrelmail frontend. This VM is assigned 2 vCPU's and 1GB RAM.

VM4 runs a separate MySQL OLAP database, using InnoDB as its storage engine. This machine is also assigned 2 vCPU's and 1GB RAM.

The goal is currently to get a 4-tile test going on a 16-core machine, meaning that the hypervisor will have to account for 28 vCPU's in total (7 per tile). This should prove to be a very interesting exercise for the scheduler. Of course, this VM setup can be made to work perfectly fine in an OpenVZ environment too, meaning we can finally do some real world testing on alternative Linux-based virtualization solutions as well.
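For readers who want to check the vCPU arithmetic for other tile counts, here is a minimal Python sketch. The per-VM vCPU counts come from the tile description above; the function names and the notion of an "overcommit ratio" are our own illustration and not part of the vApus software.

```python
# vCPU count per VM in one vApus FOS tile, as listed in the post.
TILE_VCPUS = {"VM1": 2, "VM2": 1, "VM3": 2, "VM4": 2}


def total_vcpus(tiles: int) -> int:
    """Total vCPUs the hypervisor must schedule for a given number of tiles."""
    return tiles * sum(TILE_VCPUS.values())


def overcommit_ratio(tiles: int, physical_cores: int) -> float:
    """vCPUs per physical core; above 1.0 the cores are oversubscribed."""
    return total_vcpus(tiles) / physical_cores


if __name__ == "__main__":
    print(total_vcpus(4))           # 4 tiles x 7 vCPUs
    print(overcommit_ratio(4, 16))  # vCPUs per core on a 16-core host
```

With 4 tiles on 16 cores the scheduler has more vCPUs than cores to place, which is exactly why this should be an interesting exercise.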

We thought we'd keep you updated on the progress of our research. As any experienced IT professional will know, well thought-out server technology testing takes time, and it's important to realize the number of steps required to produce results that can immediately be applied in the real world.

Stay tuned for our first testing results; they should be rolling in very soon now!



November 17, 2009, 8 comments
  November 3, 2009

Choosing the right foundation: which hypervisor do you evaluate?
blog post by Johan De Gelas
First of all, we were pretty excited to see so many comments and votes (5000!) on our last IT poll. It is good to see that professional IT is so much alive at Anandtech.com. So yes, we should have updated this blog more quickly to keep the momentum going. The reason why this update comes rather late is, once again, that we are working on the much delayed hypervisor comparison. Hundreds of tests have already been done, but we have added more tests to check important I/O performance factors such as VMDq and iSCSI performance.
 
And of course, the virtualization market is evolving fast. There is a new kid on the block: KVM. Two of the three most important Linux vendors, Red Hat and Canonical, have ripped Xen out of their distributions in favor of KVM. KVM has an interesting philosophy: it simply adds two kernel modules to the Linux kernel to turn the latter into a hypervisor. As a result, KVM can leverage the huge number of Linux drivers and Linux kernel improvements such as power management. Still, a virtualization solution needs to mature quite a bit before it is ready, and that is more than a cliche. Xen's support for Windows VMs, for example, was supposed to work at the beginning of 2007, as Xen introduced support for Hardware Virtual Machines at the end of 2006. But it was only around the middle of 2008 that we felt confident enough to say that Windows virtual machines work well on Xen. We reported:
 
"Xen 3.2.0 which can be found in the newest Novell SLES 10 SP2, is capable of running Windows 2003 R2 under heavy stress."
So it took Xen several major revisions to really get it right. It is unlikely that KVM will do this much quicker. We will be giving KVM some heavy stress testing so we can tell you more than just hearsay.
 
In the meantime, a new survey by Centrify shows a still dominant VMware, but it also tells us that Hyper-V and Xen are making a lot of progress, growing strong enough to be dangerous opponents in the near future. I have been talking to dozens of Small and Medium Enterprises (SMEs) in Belgium and the Netherlands. Our own tests show that VMware ESX is still the most robust hypervisor and most people concur. However, VMware's half-hearted attempts to make vSphere more attractive to SMEs do not create a lot of enthusiasm. If VMware does not create a more budget-friendly solution for SMEs (and VMware, newsflash: most SMEs have more than 3 servers), we have the impression it may lose the server virtualization battle in the SME world, where everything is still possible. But those are my personal impressions. At the end of the day, what will happen in your working environment determines who will prevail. So let us know what you are planning...
 


November 3, 2009, 30 comments
  October 7, 2009

The basic
blog post by Johan De Gelas

If you read our last article, it is clear that when your applications are virtualized, you have a lot more options to choose from in order to build your server infrastructure. Let us know how you would build up your "dynamic datacenter" and why!


October 7, 2009, 48 comments
  May 27, 2009

Intel talking about the 16-thread RISC killer
blog post by Johan De Gelas
Take two Nehalem dies, turn them 90 degrees, add a lot of system interface logic and 8 MB extra of L3-cache and you get - very oversimplified - the impressive Nehalem EX, alias "Beckton". The new Xeon MP is an impressive monster, just like its predecessor Dunnington. Dunnington consisted of 1.9 billion transistors; the Xeon MP based on the "Nehalem" architecture will feature up to 2.3 billion transistors.
 
 
Those 2.3 billion transistors are needed for:
  • Up to eight cores, 16 threads thanks to SMT
  • Up to 24MB of shared L3 cache
  • four QuickPath links
  • four memory channels, which support up to 16 memory modules per socket
Intel calls the chips to drive the DDR-3 modules "Scalable Memory Buffer" chips, which means that Intel figured out that it is best to move the power gobbling AMB chip from the FBDIMMs to the systemboard. As you need only one chip to drive several registered DDR-3 modules, it consumes a lot less power than placing an AMB chip on each DIMM.
 
 
 
 
In the second half of this year, Intel will have an IBM Power 6 killer and a server platform to match. The irony is that when it comes to "Intel Scalable Memory Buffers", IBM has the right to say "what took you so long to figure out that FB-DIMMs were a pretty bad idea?" Back in 2005, IBM's X3 chipset already featured a solution that allowed large memory capacities with lower latency and much lower power consumption than FBDIMMs.
 
It will be interesting to see what IBM's response to the Nehalem EX will be, as Intel's first octal core is going to enter the last market where RISC CPUs still hold their ground: 8 sockets and more. There have been previous attempts, but this time it is for real: more than 15 8+ socket designs are being readied. More irony: IBM will probably design the servers with the highest socket counts, which will really give the Power servers a run for their money...
 
As Intel gave its octal core CPU RAS features (MCA) that once belonged to the RISC and Itanium families only, it seems that the last stronghold of the non-x86 servers is going to fall..."mainframe slowly"  but steadily. Only the Ultrasparc T2 with its radically different architecture may survive this assault.
 
The Machine Check Architecture is of course ultra important for the future Xeon MP systems. Even a quad socket system will contain 32 cores and probably up to 512 GB of RAM. That kind of machine simply cries out for large databases and virtualization consolidation. In the latter case, MCA should allow hypervisors such as ESX to overcome critical errors in one of the VMs, instead of shutting down tens of VMs. 
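As a quick sanity check on those quad socket numbers, the sketch below multiplies out the per-socket limits listed earlier. Note that the 8 GB DIMM size is our assumption, chosen because it is the module size that makes 64 slots add up to 512 GB; the post itself does not state it.

```python
CORES_PER_SOCKET = 8     # "up to eight cores" per Nehalem EX
THREADS_PER_CORE = 2     # SMT
DIMMS_PER_SOCKET = 16    # "up to 16 memory modules per socket"
DIMM_SIZE_GB = 8         # ASSUMPTION: module size is not given in the post


def platform_totals(sockets: int) -> dict:
    """Upper bounds for a Nehalem EX system with the given socket count."""
    cores = sockets * CORES_PER_SOCKET
    dimm_slots = sockets * DIMMS_PER_SOCKET
    return {
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "dimm_slots": dimm_slots,
        "ram_gb": dimm_slots * DIMM_SIZE_GB,
    }


if __name__ == "__main__":
    print(platform_totals(4))
```

Plugging in 8 sockets instead of 4 shows why the 8+ socket designs mentioned above are aimed squarely at the big RISC boxes.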
 
On a different note, Intel claims that by August 2009 50% of its DP server processors sold will be "Nehalem" based. So even though AMD is executing very well and introducing the hex-core "Istanbul" soon, it is not a minute too soon as the Opterons are under heavy attack.
 
Update:  Anand also talked about Nehalem EX in his lab update here.
 

May 27, 2009, 8 comments
  May 19, 2009

quick update from the
blog post by Johan De Gelas
We promised you a new datapoint, a new independent virtualization benchmark in "a few days". Those "few days" have become a week in good "IT at Anandtech" tradition. :-) But this Wednesday, unless Murphy strikes us hard, the article will be online. It will offer a refreshing look at virtualization performance, the result of months of work. Liz will follow up quickly with a "performance optimization for virtualization" article.

Until then, we have updated two articles. We told you in one of our "Intel Nehalem vs AMD Istanbul" blogs that you would have to wait for ESX 4.0 for EPT support. However, we found that "forcing hardware VMMU" (= EPT) improves performance tangibly, so we wrote that ESX 3.5 update 4 has support for EPT. That is not true, at least not officially. EPT is only officially supported on ESX 4.0 (the hypervisor of vSphere 4.0). Check out the updates that we did to the last article, as they clarify some of the VMmark benchmarking. Our thanks go to Scott Drummonds of VMware for the excellent info.
 
The last update can be found in our "The Best Server CPUs part 2" article. We solved the problems with our Shanghai "exchange" server and managed to get some Opteron numbers. The newest quadcore "Shanghai" Opterons are clock for clock as fast as the quadcore "Harpertown" Xeons. In other words, Microsoft Exchange runs faster on the Xeon 54xx thanks to its clockspeed advantage, and the Xeon 55XX is still by far the MS Exchange champion. You can find the benchmarks here. So expect a lot of new content soon... New CPUs, new servers, new storage. The second part of May and June should be fun.
 
 
 
 

May 19, 2009, 0 comments
  April 7, 2009

The million dollar question: how do you upgrade your datacenter
blog post by Johan De Gelas
 
"the challenge for AMD and Intel is to convince the rest of the market - that is 95% or so - that the new platforms provide a compelling ROI (Return On Investment). The most productive or intensively used servers in general get replaced every 3 to 5 years. Based on Intel's own inquiries, Intel estimates that the current installed base consists of 40% dual-core CPU servers and 40% servers with single-core CPUs."
 
At the end of his presentation, Pat Gelsinger (Intel) makes the point that replacing nine servers based on the old single core Xeons with one Xeon X5570 based server will result in a quick payback. Your lower energy bill will pay back your investment in 8 months, according to Intel.
 
Why these calculations are quite optimistic is beyond the scope of this blogpost, but suffice it to say that Specjbb is a pretty bad benchmark to base ROI calculations on (it can be "inflated" too easily) and that Intel did not consider the amount of work it takes to install and configure those servers. However, Intel does have a point that replacing the old power hungry Xeons (irony...) will deliver a good return on investment.
 
In contrast, John Fruehe (AMD) is pointing out that you could upgrade dualcore Opteron based servers (the ones with four numbers in their modelnumbers and DDR-2) with hex-core AMD "Istanbul" CPUs. I must say that I have encountered few companies who would actually bother upgrading CPUs, but his arguments make some sense as the CPU will still use the same kind of memory: DDR-2. As long as your motherboard supports it, you might just as well upgrade the BIOS, pull out your server, replace the 1 GB DIMMs with 4 GB DIMMs and replace the dual cores with hex-cores instead of replacing everything. It seems more cost effective than redoing the cabling, reconfiguring a new server and so on...
 
There were two reasons why few professional IT people bothered with CPU upgrades:
  1. You could only upgrade to a slightly faster CPU. Upgrading a CPU to a higher clocked, but similar CPU rarely gave any decent performance increase that was worth the time. For example, the Opteron was launched at 1.8 GHz, and most servers you could buy at the end of 2003 were not upgradeable beyond 2.4 GHz.
  2. You could not make use of more CPU performance. With the exception of the HPC people, higher CPU performance rarely delivered anything more than even lower CPU percentage usage. So why bother?
AMD also has a point that both things have changed. The first reason may not be valid anymore if hex-cores do indeed work in a dualcore motherboard. The second reason is no longer valid as virtualization allows you to use the extra CPU horsepower to consolidate more virtual servers on one physical machine, on the condition of course that the older server allows you to replace those old 1 GB DIMMs with a lot of 4 GB ones. I checked for example the HP DL585G2 and it does allow up to 128 GB of DDR-2.
 
So what is your opinion? Will replacing CPUs and adding memory to extend the lifetime of servers become more common? Or should we stick to replacing servers anyway?
 

April 7, 2009, 23 comments
  February 27, 2009

Istanbul versus Nehalem, some extra notes
blog post by Johan De Gelas

My last post generated quite a bit of discussion, some of it based on misunderstandings. In this post I'll try to make a few things more clear. In a previous post, I pointed out that there are good indications that a dual Nehalem EP has a 40 to 100% advantage over Shanghai (depending on the application, based on the SAP and Core i7 workstation benchmarks).

If Istanbul is introduced in the early part of H2 2009, AMD will have a small window of opportunity of competing with a hex-core versus a quad-core (Intel's Nehalem EP). Time will tell of course how small, large or non-existing this window will be.

In well threaded applications, the best a "hex-core Shanghai" can do is give about a 30-40% boost to performance compared to the current Shanghai, which is most likely not enough to close the gap with the upcoming Nehalem CPU (let alone the 32 nm hex-core version). However, Istanbul is more than a hex-core Shanghai. The improved memory controller and HT-assist can lower the latency of inter-CPU syncing and increase the effective memory bandwidth. For that reason, Istanbul will do better than just "a Shanghai with 2 added cores" in many applications such as SAP, OLTP databases, virtualization scenarios and HPC. Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem. It is clear that the hex-core "Westmere", which will have a slightly improved architecture, will be a different matter.

But back to the "this higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons" comment. It is very embarrassing, and simply bad PR if a quad socket platform is beaten by a dual socket platform in any benchmark. This is something we have witnessed in the early SAP numbers. That is why I commented that the improved "uncore" will help the quad socket Istanbul to stay out of the reach of the dual Nehalem EP. I was and am not implying that people who would consider a dual Nehalem EP are suddenly going to consider a quad Istanbul.

It is clear those looking for a 4S and 2S server are in a slightly overlapping but mostly different market. Quad socket is mostly chosen for large back end applications such as OLTP databases or for virtualization consolidation. The number of DIMM slots in that case is a very important factor. However, even with the advantage of having more DIMM slots, better RAS etc., a quad socket platform that cannot outperform a dual socket platform will leave a bad taste in the mouth of potential buyers. It is important that there is a minimal performance advantage.

The fact that the performance/power ratio of such a quad server will be worse than a dual socket server is an entirely different discussion. IBM's market research (see the picture below) shows which form factor is bought mostly for consolidating VMs. As you can see, it comes down to some people being convinced that a number of 4-socket rack servers is the best way, while others are firm believers that about twice as many low power 2-socket blades is the way to go. It is very hard to convince the latter or former group to switch sides and that is why I feel that 2S and 4S servers are mostly in different markets.

In many cases, the number of virtual machines you can consolidate on one physical server is mostly a function of the amount of RAM. If the number of DIMM slots allows you to consolidate twice as many virtual machines on the quad socket machine, the consumed energy might be better than using two DP machines with the same number of DIMMs.

So despite the fact that the two DP machines have a lot more CPU power, the "scale up" buyers still prefer to go for a large box with more memory; they are not limited by raw CPU power, but by the amount of RAM that they can put in this server. It is these people that AMD will target with their 4S platform, a platform which has - especially for virtualization - a number of advantages over the current Intel 4S "Dunnington" platform... at least until Intel's octal-core arrives. Whether you choose the 2S blades or 4S rack servers depends on whether you believe in the "scale up" or "scale out" philosophy.

The conclusion is that many 4S rack servers are not only bought for raw CPU performance, but for the amount of RAM, their RAS features, and so on. However, it is clear that a 4S server should still outperform 2S servers so that the group of buyers who are believers in the "scale up" philosophy feel good about their purchase.


February 27, 2009, 18 comments
  February 27, 2009

VMware's Fault Tolerance feature explained
blog post by Liz van Dijk
Now that the actual conference is behind us, and we've found our way back to the lab, it's time to finish what we started. First off, an apology for our radio silence on day 3: our schedule turned out to be quite a bit more packed than we thought it was, so finding our way to the quiet of the press room proved to be more of a hurdle than originally expected. 

Since our main objective in attending the conference was to learn as much about virtualization as possible, rather than simply cover news flashes, we spent a lot of time in the breakout sessions, and I'm hoping to pour those into an article (or series of blogposts) for you as soon as possible.

On with the show! Last blog, I wrote about the first part of VMware's cloud strategy, being vCenter and vSphere, the continuations of today's Virtual Center and Virtual Infrastructure. Back then, I wondered just how exactly Fault Tolerance would be implemented, and in case you missed the comments of reader duploxxx and my own, I'll repeat what we learned here.

Essentially, most of the Fault Tolerance technology was leveraged from the Record/Replay feature present in VMware Workstation 6, which allows users to record a certain set of actions on a VM and reproduce them exactly. As Lionel Cavalliere explained to us, what it comes down to is the hypervisor logging every single CPU instruction happening in the primary VM, while a floating IP (think of failover clusters) helps vCenter's virtual switch pass traffic on to the correct machine. In between the two machines, a private (preferably as fast as possible) network should be set up for the primary vSphere to send the recorded instructions to the one carrying the shadow VM. In the breakout session, it was explained that no IO is ever performed by the primary VM without the shadow VM first acknowledging the instructions. Primary and shadow VM then both perform the IO, but the shadow's actions are suppressed by its hypervisor.

As vCenter's task is to monitor the state of all the vSpheres in the network, it will notice when the primary VM goes down due to a hardware failure and will issue a broadcast on the network for all traffic to be rerouted to the now operational shadow VM.
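To make the record, acknowledge and fail over sequence a bit more tangible, here is a toy Python model of it. This is purely our own illustration of the idea as it was explained to us, not VMware's implementation: the class names are hypothetical, the "VM state" is just a running sum, and each recorded event stands in for the logged instructions.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowVM:
    state: int = 0
    suppressed_io: int = 0

    def replay(self, event: int) -> bool:
        """Apply a recorded event and acknowledge it to the primary."""
        self.state += event
        return True

    def do_io(self) -> None:
        # The shadow performs the same IO, but its hypervisor drops it.
        self.suppressed_io += 1


@dataclass
class PrimaryVM:
    shadow: ShadowVM
    state: int = 0
    emitted: list = field(default_factory=list)

    def step(self, event: int) -> None:
        self.state += event
        # Ship the log entry over the private link and wait for the ack.
        if not self.shadow.replay(event):
            raise RuntimeError("shadow did not acknowledge; IO withheld")
        self.emitted.append(self.state)  # IO is released only after the ack
        self.shadow.do_io()


def run_with_failover(events, fail_after):
    """Process events; the primary's hardware dies before event `fail_after`."""
    shadow = ShadowVM()
    primary = PrimaryVM(shadow)
    for i, event in enumerate(events):
        if i == fail_after:
            # The shadow takes over from the exact state it has replayed so far.
            state = shadow.state
            for remaining in events[i:]:
                state += remaining
            return state
        primary.step(event)
    return primary.state


if __name__ == "__main__":
    print(run_with_failover([1, 2, 3, 4], fail_after=2))
```

Whether or not the primary fails halfway, the final state is the same, which is the whole point: the shadow has seen every event the primary acted on before any IO left the machine.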

 
Thanks to Tijl Deneut for this image of the Fault Tolerance module in vCenter! 

As expected from this sort of heavy duty logging, there is to be quite a noticeable performance hit, and at this point, complexity issues have made it impossible to enable this feature on a virtual machine running on more than a single vCPU, leaving quite a lot of room for improvement.

We've been asked before whether this feature, when fully functional, will remove the need for any other High Availability measures. The answer to that is a pretty conclusive "no". VMware told us they have no interest in making their software "intrusive" to the point where they are able to provide a failover solution for applications. Fault Tolerance is meant to keep the VM safe from an unexpected hardware failure. Software failures will simply be reproduced on the shadow VM, rendering it useless for recovery. Clustering applications will at this point still be necessary, it seems.

Check back soon for part 2 of the second day's keynote!

February 27, 2009, 2 comments
  February 25, 2009

Day 2 at VMworld Europe 2009 - Part 1
blog post by Liz van Dijk
Here we are once more, blogging away after a very interesting second keynote by VMware CTO Stephen Herrod, delving a bit deeper into the actual changes being pushed into different levels of the software. At this point, the amount of information available might actually fill up an entire article, but alas, time constraints force me to keep this short.

First and foremost, Herrod talked about the performance leaps that have been made over the past year, stressing the importance of moving every aspect of the data center into the cloud, to fully utilize its possibilities. He quoted performance studies using both a heavy OLTP database (using Oracle) and SPEC's very own SPECweb2005 bench to prove that performance hits are quickly becoming a non-issue (weren't they saying this last year as well, though?). Oracle was claimed to run at 24000 transactions per second, while the webserver was able to maintain up to 3 billion pageviews a day. Not too shabby compared to eBay's average of 1 billion pageviews. The image below displays Oracle's virtual performance when using 1, 2, 4 and 8 vCPU's. The green bar is its native performance on an 8-core machine; VMware claims the performance loss is now limited to 15%.



The most interesting part of the keynote, however, was getting a more technical explanation of exactly what will change in VMware's current offerings to promote their vision of Cloud Computing.

As stated in yesterday's blog, VMware's input on the virtualized data center is threefold: the vSphere forming the management foundation for the internal cloud, the vCloud allowing for federation and interoperability between clouds both internal and external, and the vClient making all of these technologies available to actual end-users through various means.

vSphere is essentially the successor to the current Virtual Infrastructure suite, encompassing all of VMware's software that is actually installed on a physical server meant for virtualization (ESX, VMware HA, VMotion, etc.). The suite has been improved on many levels, some of which we'll be describing here. The big word to keep in mind through all of this is "automation". A lot of this stuff is technically already possible, but requires a very heavy monitoring set and team of very script-savvy administrators.

First of all, the release of VMotion introduced a few logistical problems on the networking level. While every piece of information inside the VM could be transferred to a completely different hardware system without a glitch, each ESX maintains its own virtual switch, which can be specifically configured for a VM running on it. When moving to a new physical server, this information was lost in the past, prompting VMware to develop the vNetwork Distributed Switch. This technology will allow the vSphere to maintain a centralized management of the virtual network, providing the administrator the ability to manage the virtual network of his cloud just as easily as a physical one, while allowing VMotion to happen without the loss of network states. Herrod mentioned switch vendors like Cisco are already working on plugins for the vNDS, helping out network admins to manage their virtual switch with the same tools they're used to for physical switches.

Another great aspect of the vSphere will be Distributed Power Management. The vSphere will be able to monitor the load put on its cloud and scale its physical buildup accordingly. The picture below demonstrates this effect quite nicely. While a regular data center will definitely see its power consumption change according to the load dropped on it, there's nothing quite as power saving as simply VMotioning little used VM's and powering off a machine. Automating this process will be a big hit for companies trying to reduce their carbon footprint. Note: I edited the picture a bit to make the graph more clear, since our seating during the keynote left a lot to be desired.
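To illustrate the consolidation idea, here is a small Python sketch that packs VM loads onto as few hosts as possible and reports which hosts end up empty. We use a simple first-fit decreasing heuristic of our own choosing; VMware did not detail the algorithm behind Distributed Power Management, and the load and capacity units here are arbitrary.

```python
def consolidate(vm_loads, host_capacity, host_count):
    """Pack VM loads onto as few hosts as possible (first-fit decreasing)."""
    hosts = [[] for _ in range(host_count)]
    free = [host_capacity] * host_count
    for load in sorted(vm_loads, reverse=True):
        for i in range(host_count):
            if load <= free[i]:
                hosts[i].append(load)
                free[i] -= load
                break
        else:
            raise ValueError(f"no host can take a VM with load {load}")
    return hosts


def hosts_to_power_off(hosts):
    """Indices of hosts left without any VM after consolidation."""
    return [i for i, vms in enumerate(hosts) if not vms]


if __name__ == "__main__":
    placement = consolidate([10, 20, 5, 15], host_capacity=50, host_count=4)
    print(placement)
    print(hosts_to_power_off(placement))
```

During a quiet night the four lightly loaded VM's in this example fit on a single host, so the other three can be switched off until the load picks up again.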


Updates of existing features of Virtual Infrastructure include the long-expected High Availability improvements. VMware HA was until this point mostly limited to VMotion for planned downtimes, and plain reboots of the VM at a different location for unplanned downtimes. This system left applications that couldn't handle a plain reboot out in the cold, so developers were still forced to implement failover systems. All of that is about to change, however, with Fault Tolerance, vSphere's new HA-related feature. Essentially, Fault Tolerance builds a shadow copy for VM's that require the highest possible availability. The shadow copy is a shielded off machine that is completely inaccessible for as long as the original VM is still running, but is synchronized "clock per clock" (or so we've been told) with it. Ideally, from the moment the original VM goes down, the shadow copy is pushed forward and is able to continue the work from the very CPU instruction the original dropped the ball on.

Exactly how this is achieved over a standard network connection is something that remains a bit of a mystery to us. More on that later, hopefully.

On the security level, VMSafe is built into the hypervisor to function as a sort of virtual firewall, allowing third party developers to plug into its functionality and release a virtual appliance that handles security for VM's whichever way they prefer, acting as an external antivirus that is incorruptible from within the VM itself.

While all of this greatly improves standalone systems just as well, as virtualization continues its steady growth, many administrators find themselves required to manage not just one, but large numbers of physical servers, hosting thousands of virtual machines. To their aid, VMware has another product called vCenter, which essentially plugs into all available machines, and allows for central management of multiple virtual hosts. This product has been improved on several levels as well: implementing high availability by using vCenter in failover, using heartbeats to make sure everything is performing as it should.

While vCenter is limited in its management capabilities (up to 200 physical hosts, running up to 2000 virtual machines), it will be possible to link up to 10 vCenters to one another, allowing for central management of up to 20000 virtual machines in total. VMware saw it fit to implement a nifty little search function into their client, to keep things more manageable.
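For a rough idea of how many linked vCenters a given environment would need, the limits quoted above can be turned into a small Python helper. The per-vCenter limits and the maximum of 10 linked instances come from the keynote; the function names are ours, and the combined host count is simply our extrapolation of the per-vCenter host limit.

```python
import math

MAX_HOSTS_PER_VCENTER = 200
MAX_VMS_PER_VCENTER = 2000
MAX_LINKED_VCENTERS = 10


def linked_capacity(vcenters: int) -> dict:
    """Combined limits for a number of linked vCenters."""
    if not 1 <= vcenters <= MAX_LINKED_VCENTERS:
        raise ValueError("between 1 and 10 vCenters can be linked")
    return {
        "hosts": vcenters * MAX_HOSTS_PER_VCENTER,  # extrapolated, see above
        "vms": vcenters * MAX_VMS_PER_VCENTER,
    }


def vcenters_needed(hosts: int, vms: int) -> int:
    """Smallest number of linked vCenters that covers both limits."""
    needed = max(
        math.ceil(hosts / MAX_HOSTS_PER_VCENTER),
        math.ceil(vms / MAX_VMS_PER_VCENTER),
        1,
    )
    if needed > MAX_LINKED_VCENTERS:
        raise ValueError("environment exceeds what 10 linked vCenters can manage")
    return needed


if __name__ == "__main__":
    print(linked_capacity(10))
    print(vcenters_needed(hosts=300, vms=4500))
```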

Thirdly, vCenter will allow the use of vCenter Host Profiles, enabling administrators to build a set of basic configuration rules for every host to be checked against. This way, it's possible to push mass configuration changes to several physical hosts at once, forcing them into compliance with the new rules with just a right-click > Apply.
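Conceptually, a host profile boils down to "compare every host against one set of rules, then overwrite whatever deviates". The Python sketch below shows that idea with plain dictionaries. The setting names and values are made up for the example and do not correspond to actual vCenter settings or APIs.

```python
def find_violations(profile: dict, host_config: dict) -> dict:
    """Return {setting: (expected, actual)} for every rule the host breaks."""
    return {
        key: (expected, host_config.get(key))
        for key, expected in profile.items()
        if host_config.get(key) != expected
    }


def apply_profile(profile: dict, host_config: dict) -> dict:
    """Return a new host config forced into compliance with the profile."""
    return {**host_config, **profile}


if __name__ == "__main__":
    # Hypothetical rules and hosts, for illustration only.
    profile = {"ntp_server": "10.0.0.1", "ssh_enabled": False}
    hosts = {
        "esx01": {"ntp_server": "10.0.0.1", "ssh_enabled": False},
        "esx02": {"ntp_server": "10.0.0.9", "ssh_enabled": True},
    }
    for name, config in hosts.items():
        print(name, find_violations(profile, config))
    # The "right-click > Apply" step: push the profile to every host at once.
    hosts = {name: apply_profile(profile, cfg) for name, cfg in hosts.items()}
```

Settings that the profile does not mention are left alone, so each host keeps its own identity while the shared rules are enforced.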

In any case, that's as far as time permits me to write this morning. Time to follow some more sessions and somehow cram more hours into a day to finish the rest. See you at the next blog!

February 25, 2009, 7 comments
  February 25, 2009

How AMD's Istanbul might close the gap with Nehalem EP
blog post by Johan De Gelas
The Istanbul cores are the same as those that can be found in AMD's latest Shanghai CPU. But the "uncore" part of Istanbul is more interesting. By now, you have probably heard about AMD's "HT-assist" technology, a probe or snoop filter. Every time a new cacheline is brought into the L3-cache of for example CPU 1 on the current Shanghai Platform, a broadcast message is sent to all L3-caches of all CPUs, and CPU 1 has to wait until those CPUs answer.
 
In the case of Istanbul, the CPU will simply check its snoop filter in its own L3-cache, and if none of the other CPUs have that certain cacheline, it can go ahead. This lowers the latency of bringing in a new cacheline and raises the effective bandwidth.
 
To better understand this, we combined our own stream benchmarking with the one that AMD presented. All AMD systems are using DDR-2 800.
 
Stream Triad benchmark
 
As each Stream thread works on its own data, there is no reason to send out coherency synchronization requests. These requests slow the process of getting new cachelines in the L3 and hence lower effective memory bandwidth. What is interesting is that this will not only benefit the applications that use the HT interconnects a lot for coherency traffic, but also applications like Stream which do not need the HT interconnects. Also notice that HT 3.0 does not improve memory bandwidth, as Stream will try to keep its thread data local. Our testing used SUSE SLES 10 SP2 and AMD used Windows 2008. Both OSs are well optimized and NUMA aware.
 
This means that especially HPC applications, with many threads all working on their own data, will benefit from the higher effective bandwidth. Besides HT assist, AMD has now confirmed to us that the memory controller has been tuned quite a bit. This higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons in many HPC applications.
 
HT assist might also improve the SAP and OLTP scores quite a bit, but for a different reason. SAP and OLTP applications perform a lot of cache coherency synchronization requests, so the snoop filter will substantially lower the average latency of such requests as in some cases:
  • the CPU will only wait on one other CPU (instead of waiting for all responses to come back)
  • the CPU won't have to wait at all, as the other CPUs don't have this line.
Secondly, this will also lower memory latency, which is a bonus for almost every multi-threaded application.
 
Lower memory latency, higher bandwidth, lower "cache coherency" latency and more interconnect bandwidth: the improved "uncore" of Istanbul will be vital to close the gap with Nehalem. Much will depend on how quickly Intel introduces its own hexacore 32 nm Xeons, but that probably won't happen before 2010. Istanbul is shaping up to be a really good alternative for Intel's quadcore Nehalem. We might see a good fight after all...
 
Don't forget to check it.anandtech.com (IT portal) often, as many of our blogposts (for example the VMworld 2009 coverage) are not published on the frontpage of Anandtech.com.
 


February 25, 2009, 40 comments
  February 24, 2009

Live from the bloggers' room at VMworld Europe 2009
blog post by Liz van Dijk
Seeing as virtualization is a technology that is still expanding exponentially, and our research is not of the kind that drops a subject once the novelty has worn off, the Belgian IT department of Anandtech is once again attending VMworld Europe, with high hopes of greatly improving our knowledge of the vast number of fields virtualization has seeped into.
 
So here we are, once more in lovely Cannes, joining 4700 other attendees (up 200 from last year) at undoubtedly one of the best conference locations in Europe. And rather than trying to pour all the information into a big mega-article, we have decided to try some daily blogging as a way of channeling the content of the sessions to our readers.
 
 
 
As it is, the first breakout sessions started a mere 20 minutes ago, and we are excited to see what the day has to offer. As explained by VMware's CEO Paul Maritz in the keynote that kicked off the conference 2 hours ago, VMware's focus is shifting to completely changing the way data centers are structured. Of course, this is a process that was kicked off years ago with the release of server hypervisors in 2000. However, now that the technology has matured sufficiently to allow for amazing breakthroughs like VMotion and Dynamic Resource Scheduling, VMware feels it's time to take data centers to the next level: Cloud Computing. "Providing IT as a Service" or building a "software mainframe" are two of the very nice publication-ready terms he used to describe the idea, and they actually capture the technology quite well.
 
Their strategy to achieve a Cloud Computing environment in any server room rests on 3 basic principles:
  • The Virtual Data Center OS (or VDC-OS, in short)
  • vCloud
  • vClient
Each keyword represents one of the big fields VMware chooses to focus future developments on. The VDC-OS stands for their current Virtual Infrastructure technology, allowing global management of the ESX machines in a network. With vCloud, they intend to extend the management of data centers beyond that of the internal architecture, allowing workloads to be distributed to either the internal or an external cloud (offered by ISPs, for example). vClient adds desktop virtualization to the mix, allowing regular users a spot in this newly virtualized landscape, by further improving their VDI technology.

 
 
Tomorrow's keynote is promised to discuss these new technologies a bit more in-depth, so be sure to check back if you are just as curious about VMware's new advances as we are.

Now off to follow some breakout sessions we go, see you at the next blog!

February 24, 2009, 1 comments
  February 23, 2009

AMD fighting back with hexacore Istanbul and
blog post by Johan De Gelas
Last Friday, AMD gave a good answer to the approaching Intel Xeon Nehalem EP thunderstorm. AMD demonstrated to a handful of journalists (Charley and Scott) an up and running dual and quad socket hexacore Istanbul system. Istanbul, which should be ready in the autumn of this year, is basically a six core version of the current AMD Opteron "Shanghai". While we could not attend the Istanbul demo, we had a long phone conversation with the AMD people. A few interesting points came up during that phone conversation, and we would love to share them with you.

AMD seems to recognize that the best Nehalem EP will be between 40 and 100% faster than their flagship CPU, but claims there will be many more benchmarks near the 40% than the 100% mark. AMD however believes that Intel will only be able to steal back the "performance is everything" HPC market, as it will counter Nehalem by launching an Energy Efficient version of the current Shanghai CPU. AMD firmly believes that the 95W Nehalem EPs (2.66 to 2.93 GHz) will not be very attractive to many datacenters. AMD also points out that even the low power versions of Nehalem (up to 2.26 GHz) need 60W. We will see whether AMD can offer higher clockspeeds with lower energy consumption.

It is interesting to hear that AMD firmly targets the low power market. According to AMD, many customers are already putting "power caps" (a BIOS feature) on their CPUs to prevent the server from exceeding a certain power consumption level. This means that the CPU stays in the lower p-states and is never able to run at full clockspeed. Power capping is used by many customers that do not buy low power CPUs.
 
Secondly, AMD believes that the total number of servers based on Nehalem EP will probably amount to a small percentage of the total servers shipped in Q2. Buyers will oppose the high price of DDR-3 according to AMD. We are rather sceptical:
 
So the price difference is small to non-existent on a $3000-$4000 dual socket server.
 
Still, Nehalem is a completely new platform and it will take some effort from the system administrator to verify if the currently running applications run well with Hyperthreading and Turbo Mode. Also AMD's RVI is already well supported in ESX 3.5, while we'll have to wait for VMware's vSphere ("ESX 4.0") before EPT will be supported. That means that the realworld performance of Nehalem running ESX will probably be lower than the published benchmarks in 2009. Yes, we are at VMworld 2009 remember!
 
The Shanghai platform is basically the same as the Barcelona one, so that earns AMD a few points in the "easier to integrate and upgrade to" department. AMD is thus hoping that by the time Nehalem EP really takes off (Q3?), Istanbul will be ready to answer the threat. And there is something interesting about Istanbul... but we'll discuss that in a later post.
 
 
 
 

February 23, 2009, 11 comments
  February 12, 2009

Will Nehalem conquer the server world by storm?
blog post by Johan De Gelas
A dramatic turn of events is the best way to describe what we'll witness in a few weeks. But let us first talk about the current situation. As we pointed out in our last server CPU comparison, AMD's latest quadcore Opteron was a very positive surprise. Sure, you can show a few server benchmarks where the Intel CPU wins, like Black Scholes or some exotic HPC benchmark, but the server applications that really make the difference, like webservers and database servers, run faster on the latest AMD "Shanghai" CPU. It all depends on what kind of application is important for you, of course. But let us look at the complete picture: performing more than 30% faster in virtualization benchmarks is the final proof that AMD's latest is overall the best server CPU at this point in time.

But a few weeks from now, that will all change. As always we cannot disclose benchmark information before a certain date, but if you have looked around here at this site, you will have been able to discern the omens. The K10 architecture of Shanghai is a well rounded architecture, but one that misses some really crucial weapons to keep up with Nehalem:
  • Simultaneous multithreading (Hyperthreading) offers a performance boost that IPC improvements are not capable of delivering (up to 45%!).
  • Memory latency: Nehalem's memory latency is up to 40% lower.
  • Memory bandwidth: 3 channels is complete overkill for desktop apps, but it does wonders for many HPC and, to a lesser degree, server applications.
  • A really aggressive integer engine.
Nehalem will use somewhat more expensive DDR-3 DIMMs, which hardly offer any real performance boosts (as compared to DDR-2). So moving to DDR-3 will not help AMD much.
 
Istanbul? 
The details on the six-core Istanbul are still sketchy. But the dual socket Xeon "Westmere" will get six cores too and will appear in the same timeframe as AMD's hexacore. Only if AMD added SMT very secretly to Istanbul, they will be able to turn the tide. Considering that this would be a first for AMD, it is very unlikely SMT made it to Istanbul.
 
A dent in Nehalem's armour?
Does AMD have a chance in the server market in 2009 (and possibly 2010)? I must say it was not easy to find a weakness in Nehalem's architecture. The challenge made it very attractive to search anyway :-). So what follows is a big "IF-IF" story and you should take it with a big grain of salt... as you should always do with forward looking articles.
 
There is one market where AMD has really been the leader and that is virtualization, thanks to the IMC and the support for segments (four privilege levels) in the AMD64 Instruction Set Architecture. AMD's performance running VMware ESX in the "good old" ESX binary translation mode (software virtualization) was better than running an Intel on the latest hardware virtualization hypervisor. VMware only uses hardware virtualization on an AMD server if NPT (or RVI or HAP) is present. In contrast, hardware virtualization slowed the Xeons of 2005 and 2006 down a bit but was absolutely necessary to run 64 bit guests on a hypervisor on top of a Xeon server.
 
Nehalem is catching up with EPT and VPID (see here), and while it is well implemented, one thing is lacking: the TLB is rather small. I pointed this out about a year ago: while the TLB got AMD a lot of bad press, it will probably be the one thing that keeps AMD somewhat in Intel's slipstream. Let me make that clearer:
 
CPU                                 | L1 TLB Data           | L1 TLB Instr           | L2 TLB
AMD Shanghai / Opteron 238x or 838x | 48 (4 KB), 48 (large) | 48 (4 KB), 48 (large)  | 512 (4 KB), 128 (large)
Intel Penryn / Xeon 54xx            | 16 (4 KB), 16 (large) | 128 (4 KB), 8 (large)  | 256 (4 KB), 32 (large)
Intel Nehalem / Xeon 55xx           | 64 (4 KB), 32 (large) | 128 (4 KB), 14 (large) | 512 (4 KB), 0 (large)
 
Notice that in case you use large pages, the Nehalem TLB has few entries. So, let us now do a thought experiment. Currently, most of the virtualization benchmarks like VMmark (VMware) and VConsolidate (Intel) use relatively small VMs. VMs are for example a small Apache webserver and MySQL server which get between 512 MB and 2 GB of RAM. As a result most of them run with large pages off (page size = 4 KB). These benchmarks are very similar to the daily practice of an enterprise which uses IT mostly for "infrastructure purposes" such as authenticating its employees and giving them access to mail, ftp, fileserver, print serving and web browsing.
 
It becomes totally different when you are an IT firm that offers its services to a relatively large number of customers on the internet. You need a large database with many, probably pretty heavy, web portals which offer a good interactive experience.
 
So you are not going to consolidate something like  84 (14 tiles x 6 VMs) tiny VMs on one physical machine, but rather 5 to 10 "fat" VMs. With fat VMs I mean VMs that get 4 GB and more of RAM, 2 to 4 vCPUs, run a 64 bit guest OS and so on.
 
Those applications also open tons of connections, which they have to destroy and recreate after some time. In other words, lots of memory activity going on. 
 
EPT and NPT can offer between 10 and 35% better performance when lots of memory management activity is going on. Compared to the shadow page table technique, each change in the page tables does not cause a trap and  the associated overhead (which can be 1000s of cycles). So you could say that going to the TLB of your CPU is a lot smoother. But if the TLB fails to deliver, the hardware page walk is very costly.
 
In search of the real page table
A hardware page walk consists of searching in several tables which allow the CPU to find the real physical address, as the running software always supplies a virtual address. With a normal OS, the OS has set the CR3 register to contain a physical address where the first table is located. The first table converts the first part of the virtual address into a physical one, a pointer towards the physical address where the next table is located. With large pages, it takes about 3 steps to translate the virtual address to the physical one.
 
With EPT/NPT, the Guest OS gives a (CR3) address which is in fact virtual and which must be converted into a real physical address. All the Guest OS tables contain pointers to virtual addresses. So each table gives you a virtual address for the next table. But the next table is not located at this virtual address, so we need to go out and search for the real address. So instead of 3 accesses to the memory, we need 3x3 accesses. If this happens too many times, EPT will actually reduce performance instead of improving it!
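
The 3 versus 3x3 count can be written down as a small sketch. It follows the simplified counting used in this post (every guest table pointer costs one full walk through the host tables) and ignores caching of intermediate results; real hardware counts differ somewhat. The function names are made up for the illustration.

```python
def native_walk_accesses(levels: int) -> int:
    """Native walk: one memory access per page table level."""
    return levels


def nested_walk_accesses(guest_levels: int, host_levels: int) -> int:
    """Nested walk, simplified: each guest table lives at a guest address,
    so finding it costs one full walk through the host tables."""
    accesses = 0
    for _ in range(guest_levels):
        accesses += host_levels  # translate the pointer to this guest table
    return accesses


print(native_walk_accesses(3))     # 3 accesses with large pages
print(nested_walk_accesses(3, 3))  # 9 accesses, the "3x3" case
print(nested_walk_accesses(4, 4))  # 16 accesses if both sides use 4 KB pages
```

The last line assumes a four-level walk for 4 KB pages on both guest and host, which shows how quickly the cost of a TLB miss grows when large pages are off.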
 
It is good practice to use large pages with large databases. Now remember we are moving towards a datacenter where almost everything is virtualized, databases included. In that case, Nehalem's TLB can only make sure that about 32 x 2 MB or only 64 MB of data and 28 MB of code is covered by the TLB. As a result, lots of relatively heavy hardware page walks will happen. Luckily, Intel caches the real physical page tables in the L3-cache, so it should not be too painful.
 
The latest quadcore Opteron has a much more potent TLB. As instructions take a lot less space than data, it is safe to say that the data TLB can cover up to 176 (48 + 128) times 2 MB or 352 MB of data. Considering that virtualized machines easily have between 32 and 128 GB and are much better utilized (60-80% CPU load), it is clear that the AMD chip has an advantage there. How much difference can this make? We have to measure it, but based on our profiling and early benchmarking we believe that "an overflowing TLB" can decrease virtualized performance by as much as 15%. To be honest: it is too early to tell, but we are pretty sure it is not peanuts in some important applications.
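
The coverage figures in the last two paragraphs are simply the number of large page TLB entries multiplied by the 2 MB page size. A minimal sketch of that arithmetic, using the entry counts from the table earlier in this post; the 4 GB comparison is just the lower bound of a "fat" VM as defined above, and the names are illustrative.

```python
LARGE_PAGE_MB = 2  # large page size assumed throughout this post


def tlb_reach_mb(entries: int, page_mb: int = LARGE_PAGE_MB) -> int:
    """Memory covered by a TLB: number of entries times the page size."""
    return entries * page_mb


nehalem_data = tlb_reach_mb(32)         # L1 data TLB, large pages
nehalem_code = tlb_reach_mb(14)         # L1 instruction TLB, large pages
shanghai_data = tlb_reach_mb(48 + 128)  # L1 data + L2 TLB, large pages

fat_vm_mb = 4 * 1024  # a 4 GB "fat" VM

print(nehalem_data, nehalem_code, shanghai_data)  # 64 28 352
print(f"Nehalem covers {100 * nehalem_data / fat_vm_mb:.1f}% of a 4 GB VM")
print(f"Shanghai covers {100 * shanghai_data / fat_vm_mb:.1f}% of a 4 GB VM")
```

Either way only a small part of a fat VM's memory is covered, which is why the cost of each miss matters as much as the number of entries.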
 
So what are we saying? Well, it is possible that the Opteron might be able to do some "damage control" compared to Nehalem when we try out a benchmark with large and fat VMs (Like we have done here). But there are a lot of "IF"s.  Firstly, AMD must also cache the page tables in the caches. If for some reason they keep the page tables out of the caches, the advantage will probably be partly negated. Secondly, if the applications running on the physical machine demand a lot of bandwidth, the fact that the Nehalem platform has up to 70% more bandwidth might spoil the advantage too.
 
The last AMD Stronghold?
So should Intel worry about this? Most likely not. For simplicity's sake, let us assume that both cores, Shanghai and Nehalem, offer equal crunching power. They more or less do when it comes to pure raw FP power, but SpecInt makes it clear that Nehalem is faster in integer loads.
 
But let us forget that, as most server applications are unable to use all that superscalar power anyway. The AMD chip is still disadvantaged by the fact that it does not have SMT. Considering that most server apps have ample threads and that virtualization makes it easier to load each logical CPU up to 80%, that remains a hard to close gap. Secondly, many of these applications do not fit entirely in the cache, so the fact that AMD's memory latency is up to 40% higher is not helping either. Thirdly, all top Xeons (2.66 GHz and higher) are capable of adding 2 extra speedbins even if all 4 cores are busy (as was the case in SAP). It will be interesting to see how much power this costs, and if Turbo mode is possible with an 80% loaded virtualized machine.
 
In a nutshell: expect Nehalem with its ample bandwidth and EPT to do very well in VMmark. However, we think that AMD might stay in the slipstream of the Intel flagship in some virtualization setups. It is possible that AMD counters with an even better optimized memory controller in Istanbul, but it is going to be tough.
 
Return to Linpack
The benchmarks where AMD will be able to stay close should have no use for massive amounts of memory bandwidth, SMT or Turbo mode. Feel free to educate us, but so far we have only found one benchmark that answers this profile: Linpack. Linpack probably achieves the highest IPC rates of almost all software. That means the Nehalem Xeon will be consuming peak power, and will not be able to use Turbo mode. Linpack (with MKL or ACML) is also so carefully optimized that it runs almost completely in the caches, and SMT or hyperthreading only disturbs the carefully placed code lines. Considering that a 2.7 GHz Shanghai CPU with registered RAM was only a tiny bit slower than a Nehalem CPU with non registered RAM, you may expect to see both CPUs very close in this benchmark.
 
Outlook to 2009
The AMD quadcore is now the server CPU to get, but it is not going to stay that way very long. Until AMD comes up with SMT or another form of multi-threading and a faster memory controller, Intel's newest platform and CPU will force AMD to make the quadcore Opteron very cheap. We expect that the AMD quadcore will only be competitive in Linpack and some virtualization scenarios.
 
And unless Istanbul has a very nice surprise for us, it is not going to change soon. Agreed, to our loyal readers, this does not come as a surprise...
 


February 12, 2009, 17 comments
  February 11, 2009

Nehalem Xeon EP  update: too good but true
blog post by Johan De Gelas
We were quite amazed, even slightly suspicious, when HP and Fujitsu-Siemens published their SAP numbers. These numbers showed that the newest Xeon X5570 (Nehalem EP) offers an enormous performance boost over the Xeon X5470 (Harpertown). After all, an almost 100% improvement at a slightly lower speed (2.93 GHz vs 3.3 GHz) is nothing short of amazing. It turns out that the real clockspeed is 3.2 GHz (2.93 GHz + 266 MHz turbo) but that does not alter the fact that these are truly incredible performance numbers.

I can now confirm that there are no tricks behind these numbers: they paint the right picture of the Xeon Nehalem EP. Talking to SAP benchmarking specialists, it became clear that few tuning tricks exist that are not known to the big OEMs. The benchmark has been analyzed and tuned so well that even the use of a different database (for example MS SQL instead of DB2) only makes a 2 to 3% difference most of the time. So you might even compare SAP numbers which are obtained on different databases. To summarize, the SAP numbers can only be really boosted by better hardware (CPU-memory).
 
Now why am I talking so much about SAP benchmarking numbers? It is not like the expensive ERP software is run by everyone.
 
Well, the SAP numbers are showing a dual 2.93 GHz (or 3.2 GHz) Xeon beating the only quad AMD 8384 (Shanghai at 2.7 GHz) score of 22000 we have so far. Granted, a blade server is most of the time a bit slower. But four AMD 8384 2.7 GHz will be in the same league as a dual Xeon X5570, which will be out very soon now.
 
Even worse for AMD is that the SAP benchmark is not some exotic exceptional benchmarking case for the Xeon 55xx series. It should be no surprise that the HPC numbers will be very impressive too. So it looks like AMD is in a tough spot.
 
What happened? 
As the SAP threads are sharing a lot of data (as is typical for these kinds of database driven applications), hyperthreading cannot be the only explanation why Nehalem is simply doubling performance and annihilating the competition. SAP benchmarking specialists expect hyperthreading to be good for about one third of the performance boost. We tend to believe these people, who have performed this benchmark for years now. The reason why it is not one of the "top cases" for hyperthreading on Nehalem is that this OLTP based benchmark spends a lot of time on shared data. Our own Nehalem OLTP benchmarking (Oracle and MySQL) also points in that direction.
 
As we have pointed out before the benchmark also
  • responds very well to low cache and memory latency
  • does not care too much about memory bandwidth
  • and is very sensitive to "syncing latency".
Since the AMD Shanghai CPU has the same fast way to sync between cores (via the L3-cache) as Nehalem, sync latency cannot explain why AMD falls behind. Another explanation is of course that these benchmarks are run on a CPU which uses turbo, which explains about a 5% advantage as the Nehalem CPU actually runs at 3.2 GHz.
 
Nehalem has faster access to the memory than AMD's latest quadcore (70 ns vs 110 ns), which is probably the second reason why Shanghai falls behind. But AMD will probably have to redesign its integer execution pipeline significantly before it will catch up with Nehalem (think memory disambiguation for example). Basically, AMD's better NUMA - integrated memory controller platform was hiding this disadvantage. Now that the new Intel platform does not put "the brakes" on the integer execution engine anymore, the superiority of Intel's integer engine is showing.
 
The lack of any form of multi-threading is hurting AMD badly. It is well known that most of these business applications achieve very low IPC (0.2-0.6) and that modern superscalar CPUs have ample execution resources for running two threads in these applications. The result is that Simultaneous Multi Threading offers a typical 20 to 40% performance advantage. And that is huge, considering that you need 25 to 50% more clockspeed to counter that. It is basically a mission impossible for a modern CPU without SMT to outperform a similar superscalar CPU with SMT in OLTP, Java, webserver, rendering and ERP workloads. AMD really dropped the ball there: SMT should have been part of the K10 architecture.
 
Difficult times ahead for AMD
Even if AMD is able to speed up beyond 3 GHz, chances are slim that AMD will be able to compete with the new Nehalem Xeons. Add Turbo mode, hyperthreading, a lower latency memory controller and a better integer core together and you get a performance gap the size of the "Grand Canyon".
 
So does AMD have any chance at all before a new architecture arrives in 2011? Is it over and out for AMD in 2009 and 2010? Adding 2 cores at the end of 2009 is a good step in the right direction. But even if AMD executes flawlessly, the 32 nm Xeon Westmere will only give a window of a few months to the AMD hexacore "Istanbul". Istanbul should appear at the end of 2009, the Westmere Xeon is scheduled for very early 2010.
 
Westmere has few performance optimizations, it seems to be a pretty straight forward shrink. Slightly higher clockspeeds, about 20% lower power consumption, and yet another addition to the ridiculously long list of SSE-instructions in the form of seven new instructions (six instructions are for crypto/AES acceleration). Westmere is only an evolutionary step forward, but the "Grand Canyon" gap that Nehalem EP has made is probably large enough.

 

It is certain that we'll see better (lower) switching times from virtual machine to hypervisor and some small tweaks in AMD's Istanbul CPU, but it remains unclear if there are any significant performance boosters in the core. So it looks like Intel will own the dual socket space throughout 2009 and 2010, if we may believe the current roadmaps.
 
As the SAP numbers indicate, even the slowest Intel Xeons will show a large performance gap with the best AMD Opterons. Is AMD doomed completely? In a large part of the market, yes. AMD's Istanbul will make the gap a bit smaller but probably not small enough.
 
There are some unknown factors that, together with one of the few remaining weaknesses (or rather less strong points) of Nehalem, might make it possible for AMD's Opteron to come close enough in a particular area of the market. In my next post, I will clarify the one and only opportunity that I see for AMD in the next two years. Until then, don't shoot the messenger :-).

February 11, 2009, 35 comments
  January 9, 2009

Improving our articles with help from the IT community
blog post by Liz van Dijk
As some of our regular readers might have noticed, at Anandtech IT, we have been trying to spice up our articles and benchmarks by taking them a bit further than the usual rundown of standardized tests.
 
Instead of limiting ourselves to tests that can be found anywhere on the net, we are always eager to figure out how an application or technology will perform in a real-world situation, since in the end, that is what really matters. To this effect, we use a range of applications in combination with our in-house developed suite of benchmarking software (vApus), to get an idea of how a real application would interact with and benefit from whichever technology we're testing.
 
A look at vApus, our in-house developed software suite
 
This is a strategy we have adopted during our lab work for several small to medium enterprises (SME's), since these companies usually want to see clearcut results in realistic situations and solutions they can apply for immediate profit. Giving them the opportunity of being able to test their very own specific application on their very own specific hardware has allowed us to provide them with the most relevant results possible, while giving us access to a plethora of software types that stress every part of a server.
 
As our IT community grows here at Anandtech, now and then we'd like to see some input for our articles as well. For example: one of our upcoming pieces will be a comparison between the possibly underestimated container-based virtualization solution and the popular hypervisor-based one. To make sure that we're not simply comparing apples and oranges, we will be performing tests with several types of applications at different VM densities. Though we have several reliable database tests lying around, we're currently still looking for a solid, "realistic" web application to run on a Linux-based system. For it to be usable in our testing setup, we would prefer the actual web application to be mostly autonomous (e.g. backed by a database, but independent from a host of other external services), but large enough for there to be varied usage patterns.
 
Rather than just coming up with our own, we would like to learn from your experiences as well. If you, as IT professional, have any suggestions that might point us in the right direction, or happen to have the perfect site for us to test on hand, feel free to post a comment with your ideas or mail me directly. Our goal is to turn the IT section of Anandtech into a truly professional community for IT'ers, where contributions of regulars help set the course of our research and journalism. Now is your chance to help us start that dynamic!

January 9, 2009, 14 comments
  December 16, 2008

Intel Xeon 5570: Smashing SAP records (scoop!)
blog post by Johan De Gelas
We have emphasized it more than once: the Nehalem architecture is all about regaining the performance crown in servers and HPC; desktop and mobile use were sometimes a bonus, sometimes an afterthought. Today it becomes almost painfully obvious. Just read Anand's thoughts about the Core i7:
 
"The Core i7's general purpose performance is solid, you're looking at a 5 - 10% increase in general application performance at the same clock speeds as Penryn"
and now look at the graph below.

 
Intel has apparently allowed HP and Fujitsu-Siemens to break the NDA on the Xeon 5570 processor for PR reasons, as both companies have published SAP numbers on a Dual Xeon 5570. The Xeon 5570 is based on the same architecture as the Core i7. It is a 2.93 GHz quadcore CPU with 4 times a 256 KB L2-cache and one huge shared 8 MB L3.
 
 
SAP Sales & Distribution 2 Tier benchmark
 
The SAP numbers are absolutely astonishing, as Intel's dual socket is able to outperform quad socket Opteron machines. Based on the scaling of Barcelona, we speculate that a quad Shanghai at 2.7 GHz would obtain the performance of the Dual Xeon 5570 w/o HT. The new Xeon 5570 outperforms the "old" 5450 by 119%!!!
 
These numbers are so high, that we checked and checked again. The database used is the same (SQL Server 2005), so unless there is some incredible tuning parameter that HP and FS have discovered and that we have yet to hear about, that is not it.
 
At this point we have no idea how it is possible that a 3 GHz Nehalem outperforms the latest Opteron by a margin as high as 80% and more. But we can give it a try. In a previous server oriented article, we summed up a rough profile of SAP S&D:

• Very parallel resulting in excellent scaling
• Low to medium IPC, mostly due to “branchy” code
• Not really limited by memory bandwidth
• Likes large caches
• Sensitive to Sync (“cache coherency”) latency
 
One of the biggest bottlenecks for Intel has been the sync latency. It is possible that once the "sync" bottleneck was removed, the Intel architecture was able to show its real integer crunching power thanks to the out of order loads (memory disambiguation) and better branch prediction. Those are two areas where the Opteron architecture is still weak.
 
The slightly lower latency of the L3-cache of Nehalem helps too. This kind of software also makes the buffers fill up due to the long dependency chains. Those OOO buffers have been increased and the dependency chains have been shortened by a very low latency L2 cache and relatively fast L3.
 
Still we are absolutely amazed that the difference is this large. We would have expected Nehalem to outperform Shanghai by lower margins. Although we still are a bit skeptical that the difference is this large ("too good to be true" syndrome), we do not see how you could artificially inflate a SAP benchmark. It sure is not as easy as SPECJBB or SPECfp/int. 
 
 
Update (a few hours later): It seems that the SAP page was wrong about HT. It reported 8 threads on 8 cores on the Fujitsu Siemens Primergy Server. The certification page says otherwise: 16 threads on 8 cores. So hyperthreading (SMT) probably plays an important role in this benchmark as the SAP application has very low IPC and is very parallel. So this completely annihilating performance comes from combining a wide superscalar CPU with an excellent Simultaneous Multithreading implementation. Hats off to the Intel engineers...
 
 
 

December 16, 2008, 29 comments


Copyright © 1997-2009 AnandTech, Inc. All rights reserved. Terms, Conditions and Privacy Information.