IT @ AnandTech.com Blogs


  February 17, 2010

Databases and power management, not a perfect fit
blog post by Johan De Gelas
In our last article, I showed that current power management does not seem to work well with the Windows scheduler. We got tons of interesting suggestions and superb feedback, as well as several excellent academic papers from two universities in Germany which confirm our findings and offer a lot of new insights. More about that later. The thing that is really haunting me once again is that our follow-up article is long overdue. And it is urgent, because some people feel that the benchmark we used undermines all our findings. We disagree: we chose the Fritz benchmark not because it was real-world, but because it let us control the amount of CPU load and the number of threads so easily. But the fact remains, of course, that the benchmark is hardly relevant for any server. Pleading guilty as charged.
 
So how about SQL Server 2008 Enterprise x64 on Windows 2008 x64? That should interest a lot more IT professionals. We used our "Nieuws.be" SQL Server test; you can read about our testing methods here. That is the great thing about the blog: you do not have to spend pages on benchmark configuration details :-). Hardware configuration details: a single Opteron 2435 2.6 GHz running in the server we described here. This test is as real-life as it gets: we test with 25, 50, 100 and so on users who fire off queries at an average rate of one per second. Our vApus stress test makes sure that all those queries are not sent at the same time but within a certain time delta, just like real users. So this is much better than putting the CPU under 100% load and measuring maximum throughput. Remember that in our first article, we stated that the real challenge of a server was to offer a certain number of users a decent response time, preferably at the lowest cost. And the lowest cost includes the lowest power consumption, of course.
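To make the idea of spreading queries over a time delta concrete, here is a minimal Python sketch. It is not how vApus is implemented (the post does not describe that); the function name, the per-user exponential think time and the fixed seed are all assumptions made purely for illustration.

```python
import random

def query_schedule(users, duration_s, mean_interval_s=1.0, seed=0):
    """Return a sorted list of (send_time, user_id) pairs.

    Assumption: each simulated user waits a random, exponentially
    distributed time between queries, averaging mean_interval_s seconds.
    """
    rng = random.Random(seed)
    events = []
    for user in range(users):
        # Random initial offset so the users do not all start at t = 0.
        t = rng.uniform(0.0, mean_interval_s)
        while t < duration_s:
            events.append((t, user))
            t += rng.expovariate(1.0 / mean_interval_s)
    events.sort()
    return events

schedule = query_schedule(users=25, duration_s=60)
print(len(schedule) / 60)  # average queries per second across all users
```

With 25 simulated users the printed rate should hover around 25 queries per second, but no two users are forced to send at the same instant, which is the property the paragraph above describes.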
 
While I am keeping some of the data for the article, I would like to draw your attention to a few very particular findings when comparing the "balanced" and "performance" power plans of Windows 2008. Remember that the "balanced" power plan is the one that should be the best: in theory, it adapts the frequency and voltage of your CPU to the demanded performance with only a small performance hit. And when we looked at the throughput, or queries per second, figures, this was absolutely accurate. But throughput is just throughput. Response time is the one we care about.
 
Let us take a look at the graph below. The response time and power usage of the server when set to performance (maximum clock all the time) is equal to one. The balanced power and response time are thus relative to the numbers we saw in performance.  Response time is represented by the columns and the first Y-axis (on the left), Power consumption is represented by the line and by the second Y-axis (on your right).
 
 
 
The interesting thing is that reducing the frequency and voltage never delivers more than 10% of power savings. One reason is that we are testing with only one six-core CPU. The power savings would obviously be better on a dual or even quad CPU system. Still, as the number of cores per CPU increases, systems with fewer CPUs become more popular. If you have been paying attention to what AMD and Intel are planning in the next month(s), you'll notice that they are adapting to that trend. You'll see even more evidence next month.
 
What is really remarkable is that our SQL Server 2008 server took twice as much time to respond when the CPU was using DVFS (Dynamic Voltage and Frequency Scaling) as when it was not. It clearly shows that in many cases, heavy queries were scheduled on cores which were running at a low frequency (0.8 - 1.4 GHz).
 
I am not completely sure whether or not CPU load measurements are completely accurate when you use DVFS (Powernow!), but the CPU load numbers tell the same story.
   
 
The CPU load on the "balanced" server is clearly much higher. Only when the CPU load was approaching 90%, was the "balanced" server capable of delivering the same kind of performance as when running in "performance" mode. But then of course the power savings are insignificant. So while power management makes no difference for the number of users you can serve, the response time they experience might be quite different. Considering that most servers run at CPU loads much lower than 90%, that is an interesting thing to note.

February 17, 2010, 24 comments
  January 22, 2010

Setting up a high performance OpenVZ container
blog post by Liz van Dijk
As promised for a long time, we've been working on pitting Xen and OpenVZ against each other in a little "battle of the free virtualization solutions". (If you can't quite recall what this OpenVZ business is all about, we suggest you go read our article on container-based virtualization.)

Though development of our vApus FOS benchmark suite is moving along quite diligently, it takes time to create a realistic testing setup that will prove useful and relevant for a while in a world where cores are multiplying like a pair of rabbits. As it turns out, our test client is up for a thorough rewrite and optimization as well in the face of the upcoming Magny-Cours and 64-core Nehalem systems, so we definitely have our work cut out for us.

In preparation for the "official" rollout of vApus FOS, we have been using our beta versions to test the performance of both CentOS 5.4 Xen and OpenVZ, meanwhile figuring out just how easy it is to set up a large scale realistic testing environment in OpenVZ.

As with many extensive open source software packages, OpenVZ comes with quite a few hefty man-pages and very minimal basic configuration, making the learning curve quite steep. 

Having a repeatable test ready, however, helps quite a lot in tracking down possible bottlenecks in your container setup, and because our greatest issues came up when trying to configure a container for a relatively heavily queried MySQL database, here are some pointers for our readers out there trying to do the same.

  • While testing, keep a very close eye on /proc/user_beancounters. The very last column of this table displays the failcount of a certain resource in the container. When you start noticing problems, check user_beancounters first to get a better idea of what's going wrong.
  • Problematic resource counters to look out for are the following:
numproc - This is the number of processes (and threads) the container is allowed to create. In MySQL, every connection gets its own thread, which counts against this limit, so make sure you allow for at least the value you entered for max_connections in my.cnf, plus the usual number of processes in a container. For a test with 900 users, we just set this to 1000 to be sure.

numtcpsock - Same as above: you need to increase this to at least the number of users you want to allow at the same time. Each of them will need a TCP socket.

kmemsize - When allowing a container access to a certain amount of memory, not all of it will be used in the same way. kmemsize is the number of bytes that will be used for kernel activity of that specific container. Creating a large number of processes requires quite some kernel intervention, so make sure it gets the memory it needs to keep track of the processes' data structures. Though it's best to experiment somewhat to figure out which setting is optimal, a good starting point is to look at your number of processes and multiply it by 50 KB, then downscale or upscale as necessary. This is something you can easily keep track of by watching /proc/user_beancounters.

numfile - Again, this parameter depends on the type of application you use, how many users use it, how many tables they access (in the case of MySQL) and even which storage engine you use. Giving pointers here can become quite complex, but what worked for us was simply multiplying the base value by two to start with and examining the maxheld column in /proc/user_beancounters to downsize the amount to what we required.

tcpsndbuf & tcprcvbuf - These two buffers can be a little tricky, and the problems they cause are easy to miss. When the difference between the barrier and the limit of these buffers is too small, some connections can in fact be made, but some of them simply won't send or receive anything, and keep silent. This was very confusing to vApus, which opens its full amount of connections before starting the test, on the assumption that the successful creation of all connections would allow transmission of data, however slow. Instead, quite a few of its connections simply stalled indefinitely, for no apparent reason. The rule of thumb in this case is that, no matter the amount of memory you want to allow the container for networking purposes, the difference between the barrier and limit for these buffers should always allow for 2.5kB per connection, i.e. per socket allowed by numtcpsock. For our environment, this came down to 2500kB. As such, you can set the barrier value for these buffers as low as you like, but the limit should be set to at least barrier + numtcpsock * 2.5kB.
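As a rough illustration of the "check the failcount first" advice, the following Python sketch lists every resource with a non-zero failcount. It assumes the usual column layout of the file (uid, resource, held, maxheld, barrier, limit, failcnt); the function name and the sample text, including all numbers in it, are made up for the example.

```python
def failing_resources(beancounters_text):
    """Return (container_id, resource, failcnt) for every non-zero failcnt."""
    failures = []
    container_id = None
    for line in beancounters_text.splitlines():
        fields = line.split()
        # Skip blank lines and the two header lines.
        if not fields or fields[0] in ("Version:", "uid"):
            continue
        # The first row of each container starts with "<id>:".
        if fields[0].endswith(":"):
            container_id = int(fields[0][:-1])
            fields = fields[1:]
        if len(fields) != 6:
            continue  # not a "resource held maxheld barrier limit failcnt" row
        resource, failcnt = fields[0], int(fields[5])
        if failcnt > 0:
            failures.append((container_id, resource, failcnt))
    return failures

# Hypothetical excerpt; the values are invented for illustration only.
SAMPLE = """\
Version: 2.5
       uid  resource      held  maxheld  barrier    limit  failcnt
      101:  kmemsize   2752512  2977792 11055923 11377049        0
            numproc        240      240      240      240       17
            numtcpsock     120      360      360      360        3
"""
print(failing_resources(SAMPLE))
```

On a real host you would feed it the contents of /proc/user_beancounters instead of the sample string, which normally requires root.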

The easiest way to tweak these settings is by simply updating your containers' config files. In our case, they were located at /etc/vz/conf/[containerid].conf in the host container's filesystem.
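The rules of thumb above can be turned into a small calculation. The Python sketch below is only a starting point under stated assumptions: "kB" is read as 1024 bytes, numtcpsock is set equal to numproc, the headroom of 100 processes and the 1 MB buffer barrier are arbitrary, and the NAME="barrier:limit" output is assumed to match the lines in your container's config file. The function name is hypothetical.

```python
KB = 1024  # assumption: the post's "kB" is treated as 1024 bytes

def suggest_limits(users, headroom=100, tcp_barrier=1024 * KB):
    """Starting values derived from the rules of thumb in the text."""
    numproc = users + headroom          # e.g. 900 users -> 1000
    numtcpsock = numproc                # assumption: one socket per allowed process
    kmemsize = numproc * 50 * KB        # roughly 50 KB of kernel memory per process
    # Limit must exceed the barrier by 2.5 kB per allowed socket.
    tcp_limit = tcp_barrier + numtcpsock * 5 * KB // 2
    return {
        "NUMPROC": (numproc, numproc),
        "NUMTCPSOCK": (numtcpsock, numtcpsock),
        "KMEMSIZE": (kmemsize, kmemsize),
        "TCPSNDBUF": (tcp_barrier, tcp_limit),
        "TCPRCVBUF": (tcp_barrier, tcp_limit),
    }

for name, (barrier, limit) in suggest_limits(900).items():
    print(f'{name}="{barrier}:{limit}"')
```

Treat the printed values as a first guess only; as described above, the maxheld and failcount columns in /proc/user_beancounters tell you whether to scale them up or down.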

Well, it's back to the grindstone for me, time to show these multicore monsters what we're made of. 

January 22, 2010, 6 comments
  January 20, 2010

Internet Served TV versus Cable and Satellite TV
blog post by Loyd Case
I've been using Dish Network for quite a few years now. Recently, I went through a forced upgrade to their latest ViP 722 high definition DVR. (I say "forced" because the older ViP 622 I had died, and Dish no longer supported the older unit. I didn't have to extend my contract, though.)

I haven't paid a great deal of attention to how rapidly IPTV services have been coming to the living room, built into consumer electronics devices. I've certainly used Hulu, plus the dedicated streaming services from individual "legacy" networks -- NBC and the like. I've also watched shows on Revision 3 and others of the new generation of Internet-only video.
 
About the only regular IPTV viewing we do here at the Case House as a family is the Netflix Watch Instantly service through the Xbox 360. Overall, that's been a pretty positive experience. We did have a couple of burps, however. A few months ago, we transitioned from Comcast consumer broadband to Comcast Business. I mostly wanted faster upstream bandwidth, but we had also encountered the dreaded bandwidth cap when using the Consumer service. When we hit the cap, we ended up watching Netflix videos in a highly compressed, worse-than-standard-def mode. Ugh.
 
But most of my internet TV viewing has been through the PC. Watching videos on a high performance PC is necessarily different than watching on a TV in the living room. PC users tend to be more forgiving than your average TV watcher. If you get a momentary pause as more data is buffered on the PC, you'll tend to accept it as routine. When that happens in the living room, there's usually a chorus of groans.

Nevertheless, we've seen a whole bunch of IPTV services integrated into consumer electronics devices in the last 18 months or so. Netflix Watch Instantly and YouTube have been the most common, but Amazon.com's service has garnered a few wins.

At the recent CES 2010 show, even more devices had Internet video services integrated -- even networks, like CBS, CNN, ESPN and others were integrated directly into devices. Companies like Panasonic, Sony, Samsung, Sherwood and others now have IPTV right in the box.

From what I can see, users will encounter a number of different problems. Network configuration issues will probably become a major problem. Most of these devices purport to work wirelessly, over 802.11n. My brother-in-law can't keep his run-of-the-mill Linksys router working. I can just imagine him struggling with streaming services on his TV.

There will also be the inevitable security issues, though no one seems to know what form that will take.
 
Internet TV services are also struggling with their business models. Hulu is already poised to start charging for their service. Will a TV owner with Hulu built in pony up the subscription fee?

On the other hand, these are very early services, and as the infrastructure becomes more robust, delivery and networking issues will gradually subside, though I suspect that will take years. What will happen to the cable and satellite delivery services then? One thing they do offer is content aggregation -- users pay one company for access to a variety of networks. Will customers want to manage a variety of different payments to different services?

Nevertheless, delivering video services over the Internet will gradually become one of the accepted delivery vehicles. Whether the cable and satellite companies can adapt will be interesting to watch.

January 20, 2010, 37 comments
  December 23, 2009

Cloud computing in 2010: let us get practical
blog post by Johan De Gelas
Cloud Computing was probably the most popular buzzword of 2009. There was a lot of hype, but basically, cloud computing is about using the large datacenters of the Internet to your advantage: by copying the methods they use to be very scalable and available and applying them in your own datacenter (what VMware is partly trying to do with their "private Cloud", "vCloud"), by outsourcing your infrastructure (PaaS, SaaS) to an external datacenter via the Internet, or, most likely, by some hybrid form.
 
In 2010, all the hype and buzz should materialize. Will you use a form of cloud computing?
 

December 23, 2009, 16 comments
  December 6, 2009

the x86 instruction proprietary extensions: a waste of time, money and energy
blog post by Johan De Gelas
Agner Fog, a Danish expert in software optimization, is making a plea for an open and standardized procedure for x86 instruction set extensions. At first sight, this may seem a discussion that does not concern most of us. After all, the poor souls that have to program the insanely complex x86 compilers will take care of the complete chaos called "the x86 ISA", right? Why should the average developer, system administrator or hardware enthusiast care?

Agner goes into great detail about why the incompatible SSE-x.x additions and other ISA extensions were and are a pretty bad idea, but let me summarize it in a few quotes:
  • "The total number of x86 instructions is well above one thousand" (!!)
  • "CPU dispatching ... makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs."
  • "the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are"
  • The costs of supporting obsolete instructions is not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution.
Summarized: Intel and AMD's proprietary x86 additions cost us all money. How much is hard to calculate, but our CPUs are consuming extra energy and underperforming because decoders and execution units are unnecessarily complicated. The software industry is wasting quite a bit of time and effort supporting different extensions.
 
Not convinced, still thinking that this only concerns the HPC crowd? The virtualization platforms contain up to 8% more code just to support the incompatible virtualization instructions which are offering almost exactly the same features. Each VMM is 4% bigger because of this. So whether you are running Hyper-V, VMware ESX or Xen, you are wasting valuable RAM space. It is not dramatic of course, but it is unnecessary waste. Much worse is that this unstandardized x86 extension mess has made it a lot harder for datacenters to make the step towards a really dynamic environment where you can load balance VMs and thus move applications from one server to another on the fly. It is impossible to move (vmotion, live migrate) a VM from Intel to AMD servers, or from newer to (some) older ones, and you need to fiddle with CPU masks in some situations just to make it work (and read complex tech documents). Should 99% of the market lose money and flexibility because 1% of the market might get a performance boost?

The reason why Intel and AMD still continue with this is that some people inside feel that they can create a "competitive edge". I believe this "competitive edge" is negligible: how many people have bought an Intel "Nehalem" CPU because it has the new SSE 4.2 instructions? How much software is supporting yet another x86 instruction addition?
 
So I fully support Agner Fog in his quest for a (slightly) less chaotic and more standardized x86 instruction set.

December 6, 2009, 110 comments
  November 17, 2009

vApus for Open Source: Creating a virtualized stress test
blog post by Liz van Dijk
If you've been keeping up with our articles for a while, you might have picked up on vApus Mark I: the virtualized stress test we created for internal use at the Sizing Servers testlab.

As detailed in Johan's article, this bench consists of 3 separate applications, all of which we are very familiar with due to extensive optimization and stress testing efforts. Although we believe the results published based on this bench speak for themselves, the problem remained that it was impossible for anyone outside our lab to verify the results, seeing as how two out of three of the applications used were owned by private companies and were entrusted to our lab under rather strict conditions (distributing them to the rest of the world sadly not being one of them).

Secondly, vApus M1 being a bench that focuses on fairly heavy VM's, we feel the need to create another point of reference. One that will back up the results of the original, but with a completely different mix of VM's.

Thus began the process of creating vApus For Open Source, or vApus FOS, as we like to call it in the lab.

The idea behind vApus FOS is that the VM's can be freely distributed to any vendors that wish to verify our results, and our lab can provide a version of the actual in-house developed vApus benching software to generate the load.

I am happy to say that the preliminary 1-tile testing for this new benchmark has just completed, and so far everything has been running quite smoothly. The results are reproducible, the VM's stable... looks like our 4-tile (16 VM's in total) testing can begin!

The fun part is that a lot of the ideas we incorporated into the new setup we owe to you, our readers! Thanks to the feedback we got on vApus M1, we were able to combine some new workloads into an interesting mix:

As it stands, one tile consists of 4 VM's, all of which run a basic, minimal CentOS 5.4.

VM1 runs an Apache webserver and MySQL database, hosting a phpbb3-forum. The VM is given 2 vCPU's and 1GB RAM.

VM2 runs the same setup as VM1, but is only given 1 vCPU.

VM3 runs a fully configured mailserver using Postfix, Courier and a Squirrelmail frontend. This VM is assigned 2 vCPU's and 1GB RAM.

VM4 runs a separate MySQL OLAP database, using InnoDB as its storage engine. This machine is also assigned 2 vCPU's and 1GB RAM.

The goal is currently to get a 4-tile test going on a 16-core machine, meaning that the hypervisor will have to account for 28 vCPU's in total. This should prove to be a very interesting exercise for the scheduler. Of course, this VM setup can be made to work perfectly fine in an OpenVZ environment too, meaning we can finally do some real world testing on alternative Linux-based virtualization solutions as well.
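The vCPU count follows directly from the tile definition above, as this small Python sketch shows. The dictionary and function names are just illustrative; the per-VM vCPU counts and the 4-tile, 16-core target come from the text.

```python
# vCPUs per VM in one tile, as listed above.
TILE = {"VM1": 2, "VM2": 1, "VM3": 2, "VM4": 2}

def totals(tiles, physical_cores):
    """Return (number of VMs, number of vCPUs, vCPUs per physical core)."""
    vms = tiles * len(TILE)
    vcpus = tiles * sum(TILE.values())
    return vms, vcpus, vcpus / physical_cores

vms, vcpus, ratio = totals(tiles=4, physical_cores=16)
print(vms, vcpus, ratio)  # 16 28 1.75
```

In other words, the scheduler has to place 1.75 vCPUs on every physical core, which is why the 4-tile run is the interesting one.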

We thought we'd keep you updated on the progress of our research. As any experienced IT professional will know, well thought-out server technology testing takes time, and it's important to realize the amount of steps required to produce results that can immediately be applied in the real world.

Stay tuned for our first testing results, they should be rolling in very soon now!



November 17, 2009, 11 comments
  November 3, 2009

Choosing the right foundation: which hypervisor do you evaluate?
blog post by Johan De Gelas
First of all, we were pretty excited to see so many comments and votes (5000!) on our last IT poll. It is good to see that professional IT is so much alive at Anandtech.com. So yes, we should have updated this blog quicker, to keep the momentum going. The reason why this update comes rather late is, once again, that we are working on the much delayed hypervisor comparison. Hundreds of tests have already been done, but we have added more tests to check important I/O performance factors such as VMDq and iSCSI performance.
 
And of course, the virtualization market is evolving fast. There is a new kid on the block: KVM. Two of the three most important Linux vendors, Red Hat and Canonical, have ripped Xen out of their distributions in favor of KVM. KVM has an interesting philosophy: it simply adds two kernel modules to the Linux kernel to turn the latter into a hypervisor. As a result, KVM can leverage the huge amount of Linux drivers and the Linux kernel improvements such as power management. Still, a virtualization solution needs to mature quite a bit before it is ready. And that is more than a cliche. Xen's support for Windows VMs was, for example, supposed to work at the beginning of 2007, as Xen introduced support for Hardware Virtual Machines at the end of 2006. But only around the middle of 2008 did we feel confident enough to say that Windows virtual machines work well on Xen. We reported:
 
"Xen 3.2.0 which can be found in the newest Novell SLES 10 SP2, is capable of running Windows 2003 R2 under heavy stress."
So it took Xen several major revisions to really get it right. It is unlikely that KVM will do this much quicker. We will be giving KVM some heavy stresstesting so we can tell you more than just hearsay.
 
In the meantime, a new survey by Centrify shows a still dominant VMware, but it also tells us that Hyper-V and Xen are making a lot of progress, growing strong enough to be dangerous opponents in the near future. I have been talking to tens of Small and Medium Enterprises (SMEs) in Belgium and the Netherlands. Our own tests show that VMware ESX is still the most robust hypervisor and most people concur. However, VMware's half-hearted attempts to make vSphere more attractive to SMEs do not create a lot of enthusiasm. If VMware does not create a more budget-friendly solution for SMEs (and VMware, newsflash: most SMEs have more than 3 servers), we have the impression it may lose the server virtualization battle in the SME world, where everything is still possible. But those are my personal impressions. At the end of the day, what will happen in your working environment determines who will prevail. So let us know what you are planning...
 


November 3, 2009, 32 comments
  October 7, 2009

The basic
blog post by Johan De Gelas

If you read our last article, it is clear that when your applications are virtualized, you have a lot more options to choose from in order to build your server infrastructure. Let us know how you would build up your "dynamic datacenter" and why!


October 7, 2009, 49 comments
  May 27, 2009

Intel talking about the 16-thread RISC killer
blog post by Johan De Gelas
Take two Nehalem dies, turn them 90 degrees, add a lot of system interface logic and 8 MB extra of L3-cache and you get - very oversimplified - the impressive Nehalem EX, alias "Beckton". The new Xeon MP is an impressive monster, just like its predecessor Dunnington. Dunnington consisted of 1.9 billion transistors; the Xeon MP based on the "Nehalem" architecture will feature up to 2.3 billion transistors.
 
 
Those 2.3 billion transistors are needed for
  • Up to eight cores, 16 threads thanks to SMT
  • Up to 24MB of shared L3 cache
  • four QuickPath links
  • four memory channels which support up to 16 memory modules per socket
Intel calls the chips to drive the DDR-3 modules "Scalable Memory Buffer" chips, which means that Intel figured out that it is best to move the power gobbling AMB chip from the FBDIMMs to the systemboard. As you need only one chip to drive several registered DDR-3 modules, it consumes a lot less power than placing an AMB chip on each DIMM.
 
 
 
 
In the second half of this year, Intel will have an IBM Power 6 killer and a server platform to match. The irony is that when it comes to "Intel Scalable Memory Buffers", IBM has the right to say "what took you so long to figure out that FB-DIMMs were a pretty bad idea?" Back in 2005, IBM's X3 chipset already featured a solution that allowed large memory capacities with lower latency and much lower power consumption than FBDIMMs.
 
It will be interesting to see what IBM's response to the Nehalem EX will be, as Intel's first octal core is going to enter the last market where RISC CPUs still hold their ground: 8 sockets and more. There have been previous attempts, but this time it is for real: more than 15 8+ socket designs are being readied. More irony: IBM will probably design the servers with the highest socket counts, which will really give the Power servers a run for their money...
 
As Intel gave its octal core CPU RAS features (MCA) that once belonged to the RISC and Itanium families only, it seems that the last stronghold of the non-x86 servers is going to fall..."mainframe slowly"  but steadily. Only the Ultrasparc T2 with its radically different architecture may survive this assault.
 
The Machine Check Architecture is of course ultra important for the future Xeon MP systems. Even a quad socket system will contain 32 cores and probably up to 512 GB of RAM. That kind of machine simply cries out for large databases and virtualization consolidation. In the latter case, MCA should allow hypervisors such as ESX to overcome critical errors in one of the VMs, instead of shutting down tens of VMs. 
 
On a different note, Intel claims that by August 2009 50% of its DP server processors sold will be "Nehalem" based. So even though AMD is executing very well and introducing the hex-core "Istanbul" soon, it is not a minute too soon as the Opterons are under heavy attack.
 
Update:  Anand also talked about Nehalem EX in his lab update here.
 

May 27, 2009, 11 comments
  May 19, 2009

quick update from the
blog post by Johan De Gelas
We promised you a new datapoint, a new independent virtualization benchmark in "a few days". Those "few days" have become a week in good "IT at Anandtech" tradition. :-) But this Wednesday, unless Murphy strikes us hard, the article will be online. It will offer a refreshing look at virtualization performance, the result of months of work. Liz will follow up quickly with a "performance optimization for virtualization" article.

Until then, we have updated two articles. We told you in one of our "Intel Nehalem vs AMD Istanbul" blogs that you would have to wait for ESX 4.0 for EPT support. However, we found that "forcing hardware VMMU" (= EPT) improves performance tangibly, so we wrote that ESX 3.5 update 4 has support for EPT. That is not true, at least not officially. EPT is only officially supported on ESX 4.0 (the hypervisor of vSphere 4.0). Check out the updates that we made to the last article, as they clarify some of the VMmark benchmarking. Our thanks go to Scott Drummonds of VMware for the excellent info.
 
The last update can be found in our "The Best Server CPUs part 2" article. We solved the problems with our Shanghai "exchange" server and managed to get some Opteron numbers. The newest quad-core "Shanghai" Opterons are clock for clock as fast as the quad-core "Harpertown" Xeons. In other words, Microsoft Exchange runs faster on the Xeon 54xx thanks to its clock speed advantage, and the Xeon 55xx is still by far the MS Exchange champion. You can find the benchmarks here. So expect a lot of new content soon... New CPUs, new servers, new storage. The second part of May and June should be fun.
 
 
 
 

May 19, 2009, 1 comments
  April 7, 2009

The million dollar question: how do you upgrade your datacenter
blog post by Johan De Gelas
 
"the challenge for AMD and Intel is to convince the rest of the market - that is 95% or so - that the new platforms provide a compelling ROI (Return On Investment). The most productive or intensively used servers in general get replaced every 3 to 5 years. Based on Intel's own inquiries, Intel estimates that the current installed base consists of 40% dual-core CPU servers and 40% servers with single-core CPUs."
 
At the end of his presentation, Pat Gelsinger (Intel) makes the point that replacing nine servers based on the old single-core Xeons with one Xeon X5570 based server will result in a quick payback. According to Intel, your lower energy bill will pay back your investment in 8 months.
 
Why these calculations are quite optimistic is beyond the scope of this blogpost, but suffice it to say that SPECjbb is a pretty bad benchmark for ROI calculations (it can be "inflated" too easily) and that Intel did not consider the amount of work it takes to install and configure those servers. However, Intel does have a point that replacing the old power hungry Xeons (irony...) will deliver a good return on investment.
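A simple payback calculation shows why leaving out the installation work matters. The Python sketch below is not Intel's model; the function name is invented and all cost figures are hypothetical, chosen only so that the first case lands on the 8 months quoted above.

```python
def payback_months(purchase_cost, monthly_energy_savings, setup_cost=0.0):
    """Months until the energy savings cover the total investment."""
    if monthly_energy_savings <= 0:
        raise ValueError("no savings, so the investment never pays back")
    return (purchase_cost + setup_cost) / monthly_energy_savings

# Hypothetical figures, not taken from Intel's presentation.
print(payback_months(8000, 1000))        # 8.0
print(payback_months(8000, 1000, 2000))  # 10.0
```

Adding a setup cost equal to a quarter of the purchase price stretches the payback from 8 to 10 months in this made-up case, which is the kind of effect the paragraph above hints at.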
 
In contrast, John Fruehe (AMD) is pointing out that you could upgrade dual-core Opteron based servers (the ones with four numbers in their model numbers and DDR-2) with hex-core AMD "Istanbul" CPUs. I must say that I have encountered few companies who would actually bother upgrading CPUs, but his arguments make some sense as the CPU will still use the same kind of memory: DDR-2. As long as your motherboard supports it, you might just as well upgrade the BIOS, pull out your server, replace the 1 GB DIMMs with 4 GB DIMMs and replace the dual cores with hex-cores instead of replacing everything. It seems more cost effective than redoing the cabling, reconfiguring a new server and so on...
 
There were two reasons why few professional IT people bothered with CPU upgrades:
  1. You could only upgrade to a slightly faster CPU. Upgrading a CPU to a higher clocked, but similar CPU rarely gave any decent performance increase that was worth the time. For example, the Opteron was launched at 1.8 GHz, and most servers you could buy at the end of 2003 were not upgradeable beyond 2.4 GHz.
  2. You could not make use of more CPU performance. With the exception of the HPC people, higher CPU performance rarely delivered anything more than even lower CPU percentage usage. So why bother?
AMD also has a point that both things have changed. The first reason may not be valid anymore if hex-cores do indeed work in a dual-core motherboard. The second reason is no longer valid as virtualization allows you to use the extra CPU horsepower to consolidate more virtual servers on one physical machine. On the condition, of course, that the older server allows you to replace those old 1 GB DIMMs with a lot of 4 GB ones. I checked, for example, the HP DL585G2 and it does allow up to 128 GB of DDR-2.
 
So what is your opinion? Will replacing CPUs and adding memory to extend the lifetime of servers become more common? Or should we stick to replacing servers anyway?
 

April 7, 2009, 25 comments
  February 27, 2009

Istanbul versus Nehalem, some extra notes
blog post by Johan De Gelas

My last post generated quite a bit of discussion, some of it based on misunderstandings. In this post I'll try to make a few things more clear. In a previous post, I pointed out that there are a good indications that a dual Nehalem EP has a 40 to 100% advantage over Shanghai (depending on the application, based on the SAP and Core i7 workstation benchmarks).

If Istanbul is introduced in the early part of H2 2009, AMD will have a small window of opportunity to compete with a hex-core against a quad-core (Intel's Nehalem EP). Time will tell, of course, how small, large or non-existent this window will be.

In well threaded applications, the best a "hex-core Shanghai" can do is give about a 30-40% boost to performance compared to the current Shanghai, which is most likely not enough to close the gap with the upcoming Nehalem CPU (let alone the 32 nm hex-core version). However, Istanbul is more than a hex-core Shanghai. The improved memory controller and HT-assist can lower the latency of inter-CPU syncing and increase the effective memory bandwidth. For that reason, Istanbul will do better than just "a Shanghai with 2 added cores" in many applications such as SAP, OLTP databases, virtualization scenarios and HPC. Depending on the application, Istanbul might prove to be competitive with the quad-core Nehalem. It is clear that the hex-core "Westmere", which will have a slightly improved architecture, will be a different matter.

But back to the "this higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons" comment. It is very embarrassing, and simply bad PR if a quad socket platform is beaten by a dual socket platform in any benchmark. This is something we have witnessed in the early SAP numbers. That is why I commented that the improved "uncore" will help the quad socket Istanbul to stay out of the reach of the dual Nehalem EP. I was and am not implying that people who would consider a dual Nehalem EP are suddenly going to consider a quad Istanbul.

It is clear those looking for a 4S and 2S server are in a slightly overlapping but mostly different market. Quad socket is mostly chosen for large back end applications such as OLTP databases or for virtualization consolidation. The number of DIMM slots in that case is a very important factor. However, even with the advantage of having more DIMM slots, better RAS etc., a quad socket platform that cannot outperform a dual socket platform will leave a bad taste in the mouth of potential buyers. It is important that there is a minimal performance advantage.

The fact that the performance/power ratio of such a quad server will be worse than that of a dual socket server is an entirely different discussion. IBM's market research (see the picture below) shows which form factor is bought mostly for consolidating VMs. As you can see, it comes down to some people being convinced that a number of 4-socket rack servers is the best way, while others are firm believers that about twice as many low power 2-socket blades is the way to go. It is very hard to convince either group to switch sides, and that is why I feel that 2S and 4S servers are mostly in different markets.

In many cases, the number of virtual machines you can consolidate on one physical server is mostly a function of the amount of RAM. If the number of DIMM slots allows you to consolidate twice as many virtual machines on the quad socket machine, the consumed energy might be better than using two DP machines with the same number of DIMMs.
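To make the RAM argument concrete, here is a minimal Python sketch. Every number in it (DIMM size, RAM per VM, hypervisor reserve and especially the wattages) is a hypothetical placeholder chosen for illustration, not a measurement, and the function names are made up for this example.

```python
def vms_per_host(dimm_slots: int, dimm_gb: int, vm_ram_gb: float,
                 reserve_gb: float = 4.0) -> int:
    """VMs that fit in a host when RAM is the only limit.

    reserve_gb is an assumed amount of memory kept aside for the hypervisor.
    """
    usable_gb = dimm_slots * dimm_gb - reserve_gb
    return max(0, int(usable_gb // vm_ram_gb))


def watts_per_vm(host_watts: float, host_count: int, total_vms: int) -> float:
    """Power drawn per consolidated VM across a group of identical hosts."""
    return host_watts * host_count / total_vms


# Hypothetical comparison: one 4S box with 32 DIMM slots versus
# two 2S boxes with 16 slots each (the same number of DIMMs in total).
quad_vms = vms_per_host(dimm_slots=32, dimm_gb=4, vm_ram_gb=4)
dual_vms = 2 * vms_per_host(dimm_slots=16, dimm_gb=4, vm_ram_gb=4)

quad_cost = watts_per_vm(host_watts=600, host_count=1, total_vms=quad_vms)
dual_cost = watts_per_vm(host_watts=350, host_count=2, total_vms=dual_vms)
```

With these assumed figures the single quad socket box hosts 31 VMs and the two dual socket boxes host 30 together, so the outcome hinges entirely on how the real power draw of one 4S server compares to that of two 2S servers.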

So despite the fact that the two DP machines have a lot more CPU power, the "scale up" buyers still prefer to go for a large box with more memory; they are not limited by raw CPU power, but by the amount of RAM that they can put in this server. It is these people that AMD will target with their 4S platform, a platform which has - especially for virtualization - a number of advantages over the current Intel 4S "Dunnington" platform... at least until Intel's octal-core arrives. Whether you choose the 2S blades or 4S rack servers depends on whether you believe in the "scale up" or "scale out" philosophy.

The conclusion is that many 4S rack servers are not only bought for raw CPU performance, but for the amount of RAM, their RAS features, and so on. However, it is clear that a 4S server should still outperform 2S servers so that the group of buyers who are believers in the "scale up" philosophy feel good about their purchase.


February 27, 2009, 20 comments
  February 27, 2009

VMware's Fault Tolerance feature explained
blog post by Liz van Dijk
Now that the actual conference is behind us, and we've found our way back to the lab, it's time to finish what we started. First off, an apology for our radio silence on day 3: our schedule turned out to be quite a bit more packed than we thought it was, so finding our way to the quiet of the press room proved to be more of a hurdle than originally expected. 

Since our main objective in attending the conference was to learn as much about virtualization as possible, rather than simply cover news flashes, we spent a lot of time in the breakout sessions, and I'm hoping to pour those into an article (or series of blogposts) for you as soon as possible.

On with the show! Last blog, I wrote about the first part of VMware's cloud strategy, being vCenter and vSphere, the continuations of today's Virtual Center and Virtual Infrastructure. Back then, I wondered just how exactly Fault Tolerance would be implemented, and in case you missed the comments of reader duploxxx and my own, I'll repeat what we learned here.

Essentially, most of the Fault Tolerance technology was leveraged from the Record/Replay feature present in VMware Workstation 6, which allows users to record a certain set of actions on a VM and reproduce them exactly. As Lionel Cavalliere explained to us, what it comes down to is the hypervisor logging every single CPU instruction happening in the primary VM, while a floating IP (think of failover clusters) helps vCenter's virtual switch pass traffic on to the correct machine. Between the two machines, a private network (preferably as fast as possible) should be set up for the primary vSphere to send the recorded instructions to the one carrying the shadow VM. In the breakout session, it was explained that no IO is ever performed by the primary VM without the shadow VM first acknowledging the instructions. The primary and shadow VM then both perform the IO, but the shadow's actions are suppressed by its hypervisor. 

As vCenter's task is to monitor the state of all the vSpheres in the network, it will notice when the primary VM goes down due to a hardware failure and will issue a broadcast on the network for all traffic to be rerouted to the now operational shadow VM.
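The flow described above can be sketched as a toy model in Python. This is only an illustration under heavy assumptions, not VMware's implementation: the class names, the two-operation "instruction set" and the single integer standing in for VM state are all invented for this example.

```python
class ShadowVM:
    """Replays the primary's log; its IO is suppressed until it is promoted."""

    def __init__(self):
        self.state = 0
        self.pending = []      # log entries received but not yet replayed
        self.replayed = 0      # total entries replayed so far (used as the ack)
        self.live = False
        self.output = []

    def receive(self, entry):
        self.pending.append(entry)

    def replay_pending(self):
        for op, value in self.pending:
            if op == "add":
                self.state += value
            elif op == "io" and self.live:
                # Only a promoted shadow lets its IO reach the outside world.
                self.output.append(self.state)
            self.replayed += 1
        self.pending.clear()
        return self.replayed

    def promote(self):
        """Failover: catch up on the log, then start acting as the primary."""
        self.replay_pending()
        self.live = True


class PrimaryVM:
    """Logs every operation and waits for the shadow's ack before doing IO."""

    def __init__(self, shadow):
        self.state = 0
        self.shipped = 0
        self.shadow = shadow
        self.output = []

    def _ship(self, entry):
        self.shadow.receive(entry)
        self.shipped += 1

    def add(self, value):
        self.state += value
        self._ship(("add", value))

    def io(self):
        self._ship(("io", None))
        acked = self.shadow.replay_pending()
        if acked != self.shipped:
            raise RuntimeError("shadow has not acknowledged the log; IO withheld")
        self.output.append(self.state)
```

In this sketch, calling `promote()` on the shadow after the primary disappears leaves it with exactly the state the primary had, while nothing the shadow did before promotion was ever visible externally.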

 
Thanks to Tijl Deneut for this image of the Fault Tolerance module in vCenter! 

As expected from this sort of heavy duty logging, there is quite a noticeable performance hit, and at this point, complexity issues have made it impossible to enable this feature on a virtual machine running on more than a single vCPU, leaving quite a lot of room for improvement.

We've been asked before whether this feature, when fully functional, will remove the need for any other High Availability measures. The answer to that is a pretty conclusive "no". VMware told us they have no interest in making their software "intrusive" to the point where they are able to provide a failover solution for applications. Fault Tolerance is meant to keep the VM safe from an unexpected hardware failure. Software failures will simply be reproduced on the shadow VM, rendering it useless for recovery. Clustering applications will at this point still be necessary, it seems.

Check back soon for part 2 of the second day's keynote!

February 27, 2009, 5 comments
  February 25, 2009

Day 2 at VMworld Europe 2009 - Part 1
blog post by Liz van Dijk
Here we are once more, blogging away after a very interesting second keynote by VMware CTO Stephen Herrod, delving a bit deeper into the actual changes being pushed into different levels of the software. At this point, the amount of information available might actually fill up an entire article, but alas, time constraints force me to keep this short.

First and foremost, Herrod talked about the performance leaps that have been made over the past year, stressing the importance of moving every aspect of the data center into the cloud to fully utilize its possibilities. He quoted performance studies using both a heavy OLTP database (using Oracle) and SPEC's very own SPECweb2005 bench to prove that performance hits are quickly becoming a non-issue (weren't they saying this last year as well, though?). Oracle was claimed to run at 24000 transactions per second, while the webserver was able to maintain up to 3 billion pageviews a day. Not too shabby compared to eBay's average of 1 billion pageviews. The image below displays Oracle's virtual performance when using 1, 2, 4 and 8 vCPUs. The green bar is its native performance on an 8-core machine; VMware claims the performance loss is now limited to 15%.



The most interesting part of the keynote, however, was getting a more technical explanation of exactly what will change in VMware's current offerings to promote their vision of Cloud Computing.

As stated in yesterday's blog, VMware's input on the virtualized data center is threefold: the vSphere forming the management foundation for the internal cloud, the vCloud allowing for federation and interoperability between clouds both internal and external, and the vClient making all of these technologies available to actual end-users through various means.

vSphere is essentially the successor to the current Virtual Infrastructure suite, encompassing all of VMware's software that is actually installed on a physical server meant for virtualization (ESX, VMware HA, VMotion, etc.). The suite has been improved on many levels, some of which we'll be describing here. The big word to keep in mind through all of this is "automation". A lot of this stuff is technically already possible, but requires a very heavy monitoring set and team of very script-savvy administrators.

First of all, the release of VMotion introduced a few logistical problems on the networking level. While every piece of information inside the VM could be transferred to a completely different hardware system without a glitch, each ESX maintains its own virtual switch, which can be specifically configured for a VM running on it. When moving to a new physical server, this information was lost in the past, prompting VMware to develop the vNetwork Distributed Switch. This technology will allow the vSphere to maintain a centralized management of the virtual network, providing the administrator the ability to manage the virtual network of his cloud just as easily as a physical one, while allowing VMotion to happen without the loss of network states. Herrod mentioned switch vendors like Cisco are already working on plugins for the vNDS, helping out network admins to manage their virtual switch with the same tools they're used to for physical switches.

Another great aspect of the vSphere will be Distributed Power Management. The vSphere will be able to monitor the load put on its cloud and scale its physical buildup accordingly. The picture below demonstrates this effect quite nicely. While a regular data center will definitely see its power consumption change according to the load dropped on it, there's nothing quite as power saving as simply VMotioning little-used VMs and powering off a machine. Automating this process will be a big hit for companies trying to reduce their carbon footprint. Note: I edited the picture a bit to make the graph clearer, since our seating during the keynote left a lot to be desired.
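The idea of packing lightly loaded VMs onto fewer hosts and switching off the rest can be sketched as a simple first-fit-decreasing placement in Python. This is an assumption about how one could approach the problem, not a description of VMware's algorithm; the function name, the single "load" number per VM and the uniform host capacity are all hypothetical.

```python
def plan_power_down(vm_loads, host_capacity, host_count):
    """Pack VMs onto as few hosts as first-fit-decreasing manages.

    vm_loads maps a VM name to its load in arbitrary units (hypothetical).
    Returns (placement, idle_hosts): which host each VM lands on, and the
    hosts left empty, which are the candidates for powering off.
    """
    free = [host_capacity] * host_count
    placement = {}
    # Place the heaviest VMs first; each goes on the first host with room.
    for vm, load in sorted(vm_loads.items(), key=lambda kv: kv[1], reverse=True):
        for host in range(host_count):
            if load <= free[host]:
                free[host] -= load
                placement[vm] = host
                break
        else:
            raise ValueError(f"no host has room for {vm}")
    used = set(placement.values())
    idle_hosts = [h for h in range(host_count) if h not in used]
    return placement, idle_hosts
```

A real scheduler would also have to weigh memory, migration cost and headroom for load spikes before powering anything down; the sketch only shows the consolidation step.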


Updates of existing features of Virtual Infrastructure include the long-expected High Availability improvements. Until now, VMware HA was mostly limited to VMotion for planned downtime, and plain reboots of the VM at a different location for unplanned downtime. This system left applications that couldn't handle a plain reboot out in the cold, so developers were still forced to implement failover systems. All of that is about to change with Fault Tolerance, vSphere's new HA-related feature. Essentially, Fault Tolerance builds a shadow copy for VMs that require the highest possible availability. The shadow copy is a shielded-off machine that is completely inaccessible for as long as the original VM is still running, but is synchronized "clock per clock" (or so we've been told) with it. Ideally, from the moment the original VM goes down, the shadow copy is pushed forward and is able to continue the work from the very CPU instruction the original dropped the ball on.

Exactly how this is achieved over a standard network connection remains a bit of a mystery to us; more on that later, hopefully.

On the security level, VMSafe is built into the hypervisor to function as a sort of virtual firewall, allowing third party developers to plug into its functionality and release a virtual appliance that handles security for VMs whichever way they prefer, acting as an external antivirus that cannot be corrupted from within the VM itself.

While all of this greatly improves standalone systems as well, as virtualization continues its steady growth, many administrators find themselves required to manage not just one, but large numbers of physical servers, hosting thousands of virtual machines. To their aid, VMware has another product called vCenter, which essentially plugs into all available machines and allows for central management of multiple virtual hosts. This product has been improved on several levels as well. First, it implements high availability by running vCenter itself in failover, using heartbeats to make sure everything is performing as it should.

Secondly, while a single vCenter is limited in its management capabilities (up to 200 physical hosts, running up to 2000 virtual machines), it will be possible to link up to 10 vCenters to one another, allowing for central management of up to 20000 virtual machines in total. VMware saw fit to implement a nifty little search function into their client, to keep things more manageable.

Thirdly, vCenter will allow the use of vCenter Host Profiles, enabling administrators to build a set of basic configuration rules for every host to be checked against. This way, it's possible to push mass configuration changes to several physical hosts at once, forcing them into compliance with the new rules with just a right-click > Apply.
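The check-then-apply idea behind such profiles can be sketched in Python. This does not use, or claim to resemble, the real vCenter API: the function names, the dictionary layout and setting names such as "ntp_server" are assumptions made purely for illustration.

```python
def compliance_report(profile, hosts):
    """List, per host, every setting that deviates from the profile.

    profile: dict of setting name -> required value (hypothetical names).
    hosts:   dict of host name -> that host's current settings.
    Returns {host: {setting: (current_value, required_value)}} for
    non-compliant hosts only.
    """
    report = {}
    for name, config in hosts.items():
        diffs = {
            setting: (config.get(setting), required)
            for setting, required in profile.items()
            if config.get(setting) != required
        }
        if diffs:
            report[name] = diffs
    return report


def apply_profile(profile, hosts):
    """Force every host into compliance by overwriting the profiled settings."""
    for config in hosts.values():
        config.update(profile)
```

Running `compliance_report` before and after `apply_profile` mirrors the workflow described above: first see which hosts deviate, then push the rules to all of them in one go.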

In any case, that's as far as time permits me to write this morning. Time to follow some more sessions and somehow cram more hours into a day to finish the rest. See you at the next blog!

February 25, 2009, 10 comments
  February 25, 2009

How AMD's Istanbul might close the gap with Nehalem EP
blog post by Johan De Gelas
The Istanbul cores are the same as those that can be found in AMD's latest Shanghai CPU. But the "uncore" part of Istanbul is more interesting. By now, you have probably heard about AMD's "HT-assist" technology, a probe or snoop filter. Every time a new cacheline is brought into the L3-cache of, for example, CPU 1 on the current Shanghai platform, a broadcast message is sent to all L3-caches of all CPUs, and CPU 1 has to wait until those CPUs answer. 
 
In the case of Istanbul, the CPU will simply check its snoop filter in its own L3-cache, and if none of the other CPUs have that particular cacheline, it can go ahead. This lowers the latency of bringing in a new cacheline and raises the effective bandwidth.
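The difference between broadcasting and consulting a filter can be illustrated with a small probe counter in Python. It is a deliberately simplified model built on assumptions: one global directory stands in for the per-CPU snoop filters, cachelines are never evicted, and reads and writes are not distinguished. The function name is made up for this example.

```python
def count_probes(requests, num_cpus, use_filter):
    """Count coherency probes sent for a stream of cacheline fills.

    requests:   iterable of (cpu, cacheline_address) pairs.
    use_filter: False models broadcasting to every other CPU on each fill;
                True models probing only the CPUs known to hold the line.
    """
    directory = {}  # cacheline address -> set of CPUs holding a copy
    probes = 0
    for cpu, line in requests:
        holders = directory.setdefault(line, set())
        if use_filter:
            probes += len(holders - {cpu})
        else:
            probes += num_cpus - 1
        holders.add(cpu)
    return probes


# Stream-like workload: four CPUs, each touching only its own private lines.
private = [(cpu, cpu * 1000 + i) for cpu in range(4) for i in range(100)]
# Shared workload: three CPUs touching the same line one after another.
shared = [(0, 1), (1, 1), (2, 1)]
```

For the private workload the filtered model sends no probes at all while the broadcast model sends three per fill, which is the effect the Stream numbers below hint at. For the shared line the filter still saves probes, because only the CPUs that actually hold the line are asked.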
 
To better understand this, we combined our own stream benchmarking with the one that AMD presented. All AMD systems are using DDR-2 800.
 
Stream Triad benchmark
 
As each Stream thread works on its own data, there is no reason to send out coherency synchronization requests. These requests slow the process of getting new cachelines into the L3 and hence lower effective memory bandwidth. What is interesting is that this will not only benefit the applications that use the HT interconnects a lot for coherency traffic, but also applications like Stream which do not need the HT interconnects. Also notice that HT 3.0 does not improve memory bandwidth, as Stream will try to keep its thread data local. Our testing used SUSE SLES 10 SP2 and AMD used Windows 2008. Both OSs are well optimized and NUMA aware.
 
This means that especially HPC applications, with many threads all working on their own data, will benefit from the higher effective bandwidth. Besides HT assist, AMD has now confirmed to us that the memory controller has been tuned quite a bit. This higher amount of bandwidth will allow the quad Istanbul to stay out of the reach of the dual Nehalem EP Xeons in many HPC applications.
 
HT assist might also improve the SAP and OLTP scores quite a bit, but for a different reason. SAP and OLTP applications perform a lot of cache coherency synchronization requests, so the snoop filter will substantially lower the average latency of such requests, as in some cases:
  • the CPU will only wait on one other CPU (instead of waiting for all responses to come back)
  • the CPU won't have to wait at all, as the other CPUs don't have this line.
Secondly, this will also lower memory latency, which is a bonus for almost every multi-threaded application.
 
Lower memory latency, higher bandwidth, lower "cache coherency" latency and more interconnect bandwidth: the improved "uncore" of Istanbul will be vital to close the gap with Nehalem. Much will depend on how quickly Intel introduces its own hexacore 32 nm Xeons, but that probably won't happen before 2010. Istanbul is shaping up to be a really good alternative for Intel's quadcore Nehalem. We might see a good fight after all...
 
Don't forget to check it.anandtech.com (IT portal) often, as many of our blogposts (for example the VMworld 2009 coverage) are not published on the frontpage of Anandtech.com.
 


February 25, 2009, 42 comments
  February 24, 2009

Live from the bloggers' room at VMworld Europe 2009
blog post by Liz van Dijk
Seeing as how virtualization is a technology that is still expanding exponentially, and our research is not of the kind that drops a subject once the novelty has worn off, the Belgian IT department of Anandtech is once again attending VMworld Europe, with high hopes of greatly improving our knowledge of the vast number of fields virtualization has seeped into.
 
So here we are, once more in lovely Cannes, joining 4700 other attendees (up 200 from last year) at undoubtedly one of the best conference locations in Europe. And rather than trying to pour all the information into a big mega-article, we have decided to try some daily blogging as a way of channeling the content of the sessions to our readers.
 
 
 
As it is, the first breakout sessions started a mere 20 minutes ago, and we are excited to see what the day has to offer. As explained by VMware's CEO Paul Maritz in the keynote that kicked off the conference 2 hours ago, VMware's focus is shifting to completely changing the way data centers are structured. Of course, this is a process that was kicked off years ago with the release of server hypervisors in 2000. However, now that the technology has matured sufficiently to allow for amazing breakthroughs like VMotion and the Distributed Resource Scheduler, VMware feels it's time to take data centers to the next level: Cloud Computing. "Providing IT as a Service" or building a "software mainframe" are two of the very nice publication-ready terms he used to describe the idea, and they actually capture the technology quite well.
 
Their strategy to achieve a Cloud Computing environment in any serverroom rests on 3 basic principles:
  • The Virtual Data Center OS (or VDC-OS, in short)
  • vCloud
  • vClient
Each keyword represents one of the big fields VMware is choosing to focus future development on. The VDC-OS stands for their current Virtual Infrastructure technology, allowing global management of the ESX machines in a network. With vCloud, they intend to extend the management of data centers beyond that of the internal architecture, allowing workloads to be distributed to either the internal or an external cloud (offered by ISPs, for example). vClient adds desktop virtualization to the mix, allowing regular users a spot in this newly virtualized landscape by further improving their VDI technology.

 
 
Tomorrow's keynote is promised to discuss these new technologies a bit more in-depth, so be sure to check back if you are just as curious about VMware's new advances as we are.

Now it's off to follow some breakout sessions we go. See you at the next blog!

February 24, 2009, 2 comments





Copyright © 1997-2010 AnandTech, Inc. All rights reserved. Terms, Conditions and Privacy Information.