My guess is that they are targeting high-end HPC workloads, which is where IBM POWER and x86 processors currently reign. Both can also provide the large I/O bandwidth (memory and interconnect) required to feed the beasts.
While there are a couple of ARM64 processors being designed that can scale up in speed and I/O, those processors are not out yet, and the ones that are out don't meet the I/O requirements or provide enough compute horsepower. The best bet would be for someone (maybe NVIDIA?) to build something like an 8- or 16-core ARM64 processor with native NVLink and >= 40GbE or 40Gb InfiniBand.
Not to mention, power consumption of an IBM POWER8 or a pair of Broadwell Xeons is still relatively small compared to 4, 8 or 16 of those NVIDIA compute modules.
I'm not sure how you could make a business case for something like that, given the huge development and support cost and how small the market would be. Of course, there's Tegra, but I doubt that's what you're referring to.
"Each POWER8 processor supports up to 1 TB of DDR3 or DDR4 memory with up to 230 GB/s sustained bandwidth (by comparison, Intel’s Xeon E5 v4 chips “only” support up to 76.8 GB/s of bandwidth with DDR4-2400)."
"My guess is that they are targeting high-end HPC workloads, which is where IBM POWER and x86 processors currently reign. Both also can provide large I/O bandwidth (memory and interconnect) that is required to feed the beasts."
-------------------------------------------------
Wow! 230 GB/sec sustained bandwidth for POWER8 is really a brutal number! And the x86 supports 80 GB/sec, which is also good. The SPARC M7 specs state only 160 GB/sec bandwidth.
Let us look at memory bandwidth benchmarks, not a theoretical number. How well do they perform in real benchmarks? In official STREAM benchmarks ( https://blogs.oracle.com/BestPerf/entry/20151025_s... ) we see that:
-POWER8 gets 320 GB/sec for a 4-socket system. This translates to 80 GB/sec for one POWER8 cpu. Quite far away from the stated "230GB/sec sustained memory bandwidth", eh?
-E5-2699v3 gets 112 GB/sec for a 2 socket system. This gives 56 GB/sec per x86 cpu.
-SPARC M7 cpu gets 145 GB/sec which is close to the stated number of 160 GB/sec.
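The per-socket arithmetic above is easy to check (a quick sketch using the figures as quoted in this thread, not re-measured):

```python
# Per-socket bandwidth implied by the multi-socket STREAM results quoted above
# (total GB/s, number of sockets).
results = {
    "POWER8 (4-socket)": (320, 4),
    "E5-2699v3 (2-socket)": (112, 2),
    "SPARC M7 (1-socket)": (145, 1),
}

for name, (total_gbs, sockets) in results.items():
    print(f"{name}: {total_gbs / sockets:.0f} GB/s per socket")
```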
Who do you think is most "dishonest" with their numbers? Oracle?
"...IBM says the sustained or delivered bandwidth of the IBM POWER8 12-core chip is 230 GB/s. This number is a peak bandwidth calculation: 230.4 GB/sec = 9.6 GHz * 3 (r+w) * 8 byte. A similar calculation is used by IBM for the POWER8 dual-chip-module (two 6-core chips) to show a sustained or delivered bandwidth of 192 GB/sec (192.0 GB/sec = 8.0 GHz * 3 (r+w) * 8 byte). Peaks are the theoretical limits used for marketing hype, but true measured delivered bandwidth is the only useful comparison to help one understand delivered performance of real applications...."
IBM states that POWER8 has 230 GB/sec _sustained_ bandwidth, by looking at the peak theoretical bandwidth of 230.4 GB/sec. I don't really know when peak theoretical bandwidth became the same thing as sustained bandwidth. In real-life benchmarks we see that IBM POWER8 reaches 80 GB/sec, that is, 1/3 of the stated number. And someone called Oracle benchmarks "dishonest". Hmmm...
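The quoted peak formula is easy to reproduce; assuming it really is clock * 3 transfers (2 read + 1 write) * 8 bytes, the numbers come out exactly as IBM states:

```python
# Peak (not sustained) bandwidth from the formula quoted above:
# GHz * 3 transfers (2 read + 1 write) * 8 bytes per transfer.
def peak_bandwidth_gbs(ghz, transfers=3, bytes_per_transfer=8):
    return ghz * transfers * bytes_per_transfer

print(f"{peak_bandwidth_gbs(9.6):.1f}")  # 230.4 -> the 12-core "sustained" claim
print(f"{peak_bandwidth_gbs(8.0):.1f}")  # 192.0 -> the dual-chip-module claim
```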
BTW, when Oracle uses 16 GB DIMMs, the SPARC M7 reaches 169 GB/sec in real-life benchmarks. Just look at the bottom of the link I provided. That is quite close to the stated bandwidth of 160 GB/sec.
Let's use math. Let's assume a standard 160GB/s per socket with a 12-core CPU vs the same bandwidth with a 32-core CPU. Which system will provide better performance at the core level?
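To make the core-level question concrete (same assumed 160 GB/s per socket in both cases):

```python
# Per-core share of a fixed 160 GB/s socket, for a 12-core vs a 32-core chip.
per_socket_gbs = 160.0

for cores in (12, 32):
    print(f"{cores} cores: {per_socket_gbs / cores:.1f} GB/s per core")
```

All else being equal, the 12-core chip has about 2.7x the bandwidth per core, which is the point being made.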
Now let's throw in some GPUs, because, you know, HPC does not run Oracle and makes great use of them. And to make things more interesting, let's use Linux for HPC... because, you know, HPC shops mostly run Linux, not Solaris.
Well, one socket with TWO power8 cpus reaches 160GB/sec (read my post on page2), just the same as one SPARC M7. So the SPARC M7 has at least twice the memory bandwidth. The POWER8 S822L server has two sockets, but four POWER8 cpus. And if you look at benchmarks, the POWER8 is slower than x86. Whilst being more expensive. That is quite bad. https://blogs.oracle.com/BestPerf/entry/201510_spe...
Entry P8 systems use a DCM for cost reasons. Since each die in the DCM has half its memory controllers enabled, the memory performance is effectively identical to the SCM used in other P8 systems.
Of course, you knew this, since you've been told it at length at RWT and other places - but don't let me get in the way of your multi-decade Sun/Oracle promotion...
Your argument is that IBM must lie about POWER8 but Oracle wouldn't lie about Sparc? Is this because Oracle is known as an unbiased and independent source of information when it comes to Sparc?
No, that is not my argument. My argument is: check the benchmarks and convince yourself that four POWER8 cpus in two sockets reach 320GB/sec, just the same as TWO SPARC M7 cpus. So the SPARC M7's bandwidth is twice as high.
And IBM tries to trick people into believing that two sockets in the S822L server correspond to two POWER8 cpus, when actually it has four POWER8 cpus. That is very ugly, because IBM tries to trick you into believing that the POWER8 is twice as good as it is. Read page two here, and see my comment on "320GB/sec for two socket POWER8". Very very ugly FUD from IBMers indeed.
There are two different POWER8 dies, one with 6 cores and another with 12 cores. The 6-core die has four memory channels, whereas the 12-core die has eight memory channels. Aggregate bandwidth is the same whether you have a dual-chip module or a single-chip module.
IBM hasn't done so but it is feasible they could release the single die, 12 core part in the lower end S812L/S814/S822L/S824 systems.
Who would have thought that a 32 core SPARC CPU will come out better than a 10 core POWER8 CPU, right? And mind you, a 32 core CPU comes out 2 times faster than a 10 core CPU.
The reason why sustained is close to theoretical peak is the Centaur memory buffer chip. The actual bandwidth from the DIMMs to the Centaur chips is 410 GB/s. The link between the POWER8 and its eight Centaur chips is 230 GB/s. In addition, IBM places 16 MB of L4 cache on each Centaur chip, so there is a quick buffer tied to each memory channel individually. (A Centaur chip cannot cache any data that resides outside of its own memory channel.) Between the 16 MB of cache per memory channel and the higher raw DRAM bandwidth to the buffer than to the CPU, it isn't surprising that the two figures are relatively close.
The Oracle link you cite doesn't list the full S824 hardware configuration. For memory bandwidth tests, the S824 can only obtain maximum memory bandwidth when all of its memory slots are occupied. This is due to the S824 using proprietary DIMMs that include the Centaur memory buffer. Thus it is easily possible to handicap overall memory bandwidth by using fewer but higher-capacity DIMMs. In fact, Oracle could have configured the S824 with only 4 DIMMs to get the 512 GB capacity noted ( https://blogs.oracle.com/BestPerf/resource/stream/... ) and thus hinder performance.
Well, the Oracle benchmark is a joke. It's a competitor submission, and there are quite a few things that make you raise an eyebrow. The tester has done several things differently from what he/she would normally have done for a submission on Oracle's own product. First, the firmware level of the machine is backlevel, to put it mildly: the level used is SV810_087, while current is SV840_087, which makes it 13 versions behind the current firmware level. Brrrr... In Oracle's own STREAM submissions, they normally MAX out the memory, which they haven't done here; they don't even write down the memory configuration, which they normally do for their own submissions. And last, he/she configures the machine with 96 logical cores but only uses 24. *COUGH* In Oracle's own submission, they configure and run over 1024 threads for the T5-8. So, even though they try to stack the cards, the S824 still puts up quite impressive numbers for such an old machine. // Jesper
And here we go again. I think of ALL the benchmarks I showed you, EVERY SINGLE ONE OF THEM, you have rejected, because of this or that. And you did this years ago, and it is a bit worrying that you have not changed a bit today. I mean, don't you realize that when you expect everyone to accept your IBM benchmarks but reject Oracle benchmarks, you seem like a bit of a biased fanboy? Some would say a VERY biased fanboy.
I support SPARC myself, yes, but I am guided by benchmarks. I support the best technology. And when POWER7 came and was fastest, I congratulated IBM on the benchmarks and agreed POWER7 was fastest. Today SPARC M7 is superior, the best technology. POWER8 is lagging far behind, even slower than x86. Just look at the benchmarks.
Anyway, on that site with benchmarks I gave you, there are 25-ish benchmarks where SPARC M7 is 2-3x faster than POWER8. I understand you reject all of them because of this or that, as you did earlier when we discussed SPARC T2+ vs POWER6 years back. I find it fascinating that your conclusion is that in every single one of those 25-ish benchmarks, POWER8 is faster than SPARC M7, because of this or that. I have never understood your logic; hard numbers say that SPARC M7 cpus are 5.5x faster than POWER8 cpus in OLTP workloads: https://blogs.oracle.com/BestPerf/entry/20160317_s... but still you vehemently claim that POWER8 is faster, because of this or that. Amazing.
I remember when I showed you benchmarks where x86 beat POWER6 in Linpack, and you claimed that POWER6 was the faster cpu because one POWER6 core was faster. Amazing. How do you reason with a man displaying flawed logic like that? x86 scores higher in Linpack, but still POWER6 is faster in Linpack, because one POWER6 core is faster! Amazing.
And when I pointed out that one single core being faster does not make the entire cpu faster, you tried to dribble that away too by talking about this or that: BIOS patch level, RAM timing, etc. Your conclusion was: if you want the fastest Linpack performance, you get a POWER6, even though a Xeon scored higher. Amazing.
And now, on this STREAM benchmark where SPARC M7 is 2x faster than POWER8, your conclusion is that POWER8 is faster than SPARC M7 because of... firmware level? And Oracle didn't max out the memory? That is why POWER8 somehow magically increases from 80 GB/sec up to and beyond 160 GB/sec: because of firmware level and not maxing the RAM. Great. (Question: do you really believe this yourself? Doesn't it sound a bit... hollow and unconvincing? No?)
The Power8 STREAM numbers are from IBM themselves, as KevinG supplied in the link.
Let's look at the rest of the 25-ish benchmarks where SPARC M7 is 2-3x faster, up to 11x faster, than POWER8. How will you explain that POWER8 is faster than SPARC M7 in each of them? Different RAM? RAM latency? Power supply? Keyboard and mouse?
What you don't understand is that in my professional life I don't really care if a server is Blue, Brown or Purple. I am not a fanboi. What I do care about is what gives my Company the biggest ROI. That is my job. And actually just a few months ago I finished a 'paper' in which I came up with the strategic solution stack for my company's Oracle Platform.
And it wasn't POWER, it wasn't Windows, it wasn't AIX... I would have liked it to be M7 for technical reasons, but my cold, hard, unbiased analysis showed that the best platform for my Company was x86 box mover XXX using Intel Xeon Ex-xxxx v4, with OVM on top and Oracle Linux. The reason the M7 lost was mostly in-house skills, because it scored better than x86 in my analysis. Currently I am looking at what platform to choose for an IBM solution stack, and honestly nobody comes even close to Linux on POWER, except perhaps AIX on POWER, but here my Company's skill base is more suited to Linux. So I am pretty convinced that the paper I am writing will point towards Linux on POWER (with PowerVM, not KVM) as the platform of choice for IBM software products.
I cannot limit myself to 'car magazine' IT as you do, or Vendor FUD.
Again, the STREAM submission stated: "Date result produced: Thu Oct 22 2015. Questions?: Gnanakumar.Rajaram@Oracle.com". That is hardly an IBM submission.
I would be just as critical towards an IBM STREAM submission on an M7-based machine where you only ran 1 thread per core. It would most likely have horrific numbers.
And .. well you are kind of beyond normal reason. I don't dislike the M7, I think it is a wonderful chip, unfortunately the business case for using it compared to Xeon's with OVM and Oracle Linux just isn't there for us.
@Brutalizer "And here we go again. In think in ALL benchmarks I showed you, EVERY SINGLE ONE OF THEM, you have rejected them all because of this or that. [...] The Power8 STREAM numbers are from IBM themselves, as KevinG supplied in the link."
Don't you think it is a bit hypocritical to cite me since two posts below this you call me a troll, that I'm not serious and that "I don't know what I'm talking about"?
Also the link I supplied ( https://blogs.oracle.com/BestPerf/resource/stream/... ) if you haven't noticed is from Oracle, not IBM. If "Oracle" in the URL wasn't a big enough hint, the disclaimer on the first page of that PDF indicates that it was Oracle doing the testing themselves. That PDF is Oracle's detailed testing configuration and as I pointed out, omits a few technical details that are relevant to memory bandwidth. IE we don't know how many DIMMs were installed in the S824. This is a very fair criticism of Oracle's testing and disclosure since it directly impacts performance.
@Kevin G Sorry, but I cannot "discuss" with you, as you 1) are not serious, 2) are trolling, and 3) have no clue about the stuff you talk about.
For instance, we had a "discussion" about the SGI UV2000 server, which is only used for scale-out clustered HPC-like workloads, and you vehemently claimed the UV2000 was also used for scale-up workloads like large POWER or SPARC servers. Scale-up workloads such as SAP cannot be run on clusters; they can only be run on scale-up servers (look at the SAP top list). The UV2000 is in practice a cluster, and there are no SAP entries for that cluster. Still you insisted the UV2000 could replace a large Unix server for SAP. When I asked for proof and links, you answered: 1) You can boot SAP on a UV2000. 2) It is probably possible to run SAP on a UV2000. 3) Something silly I don't remember.
You did not present any links to a customer running SAP on a UV2000, no SAP benchmarks, no nothing. I even emailed SGI and asked about customers running SAP on the UV2000, and there was NO such customer. And even after all this information, you kept insisting that the UV2000 could replace large Unix scale-up servers such as POWER and SPARC. No customer runs SAP on the UV2000; they only run clustered workloads on the x86 UV2000 Linux server.
And I tried to explain to you why it is not possible to run SAP on a cluster: it is because the code branches everywhere, and I gave you links to SGI where they confirm this. And you asked several times "what does it mean when code branches everywhere", and finally I explained in a long essay that ERP software serves thousands of clients, so the server is doing lots of things simultaneously: accounting, bookkeeping, payrolls, etc., so the code can never be cached. All these clients are doing all sorts of things, so the code branches everywhere, and cache is not that important in scale-up servers. That is why SPARC M7 has brutal bandwidth, and not so good cache. Bandwidth is more important than cache in scale-up servers.
OTOH, clusters run HPC workloads, that is, a tight for loop doing calculations on the same set of grid points, so the code does not branch and you can cache it well. Cache is important, as the software seldom goes out to RAM, so bandwidth is not as important. And if you look at x86, it has low bandwidth and good cache, making it useless for scale-up workloads.
Scale-up workloads thrash the cache. That is why the UV2000 cannot be used for SAP. I wrote a long essay explaining all this about cache and RAM, etc. And still, after this long explanation, you did not understand anything about thrashing the cache. You kept writing stuff that showed you did not understand why a cluster cannot run this kind of software.
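The cache-thrashing argument above can be illustrated with a toy model (a deliberately crude sketch: a small fully-associative LRU cache, a tight-loop access pattern vs. random accesses over a large working set; all sizes are made up for illustration):

```python
import random

def hit_rate(addresses, cache_lines=512, line_size=64):
    """Toy fully-associative LRU cache: returns the fraction of accesses that hit."""
    cache, hits = [], 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.remove(line)       # move to MRU position
        elif len(cache) >= cache_lines:
            cache.pop(0)             # evict least recently used
        cache.append(line)
    return hits / len(addresses)

random.seed(0)
n = 10_000
# HPC-style: repeated sweeps over a small (32 KB) working set -> high hit rate
hpc = [(i % 4096) * 8 for i in range(n)]
# ERP-style: effectively random accesses over a huge (1 GB) working set -> thrashing
erp = [random.randrange(0, 1 << 30) for _ in range(n)]

print(f"tight-loop hit rate:    {hit_rate(hpc):.2f}")
print(f"random-access hit rate: {hit_rate(erp):.2f}")
```

The tight loop hits cache almost every time, while the random pattern almost never does, which is why bandwidth matters more than cache for the latter kind of workload.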
So, it is useless to try to explain to you how things really are. You are given links and explanations, and still you don't get it. You are just very uneducated about computer tech. That makes it difficult to talk to you, because you don't get it when someone shows you links or explains why you are wrong. You don't understand.
Or you do understand but pretend not to, so you are trolling to make me write long texts to you. I can't decide which: whether you are trolling and pretending to be obtuse, or whether you are obtuse for real.
Anyway, I advise anyone not to go into technical discussions with you, because you don't understand what you are talking about. If I point out an error in your post, you don't get it. I even post links, and you read them, and STILL you don't get it. I used to say "there are no dumb people, only ignorant", but in your case, I am not really sure anymore...
Actually POWER8 has 320GB/s with a 2 SOCKET system (check link below pg 15)... so it would be 160GB/s per physical chip (you know, despite whatever concept you have, the silicon thing you can pick and plug on a socket).
The POWER8 for 2-socket machines is rated at 192GB/s, while its high-end counterpart is rated at 230GB/s. So the difference here is only 16%.
And... who on earth still considers a closed platform like SPARC an HPC alternative? The only SPARC in HPC is the K computer, which uses Fujitsu's SPARC64.
So the 2-socket 10-core POWER8 S822L server has 320 GB/s bandwidth? So therefore ONE power8 cpu has 160GB/sec? Well, this again shows how ugly and FUDing the IBM people are.
This S822L server has two sockets, but FOUR cpus. So one socket has 160 GB/sec but one POWER8 cpu has 80 GB/sec. You knew this, and tried to trick me into believing that one POWER8 cpu is as fast as SPARC M7. That is the reason you talk about "physical chip..." nonsense.
First I actually wrote a post acknowledging that the POWER8 cpu's bandwidth is as high as the SPARC M7's, but as I wanted to add a small detail I checked it up, and lo and behold, the S822L server has FOUR cpus. Very very ugly of you. Why am I not surprised by the IBM people? You do know that the term FUD was coined by one of IBM's customers? https://en.wikipedia.org/wiki/Fear,_uncertainty_an...
And finally, if you read the link (it is a research paper) you gave me, it says that the researchers ran the STREAM benchmark 1,000 times, and only 10% of the time did the POWER8 server reach 320GB/sec. The rest of the time: 23% of the runs reached 245GB/sec, 16% reached 260GB/sec, and 24% reached 290GB/sec. So, only a small fraction of the time did the POWER8 reach 320GB/sec.
So, 10% of the time, FOUR POWER8 cpus reached the same bandwidth as two SPARC M7 cpus. The rest of the time, it was far lower. POWER8 RAM bandwidth is not impressive at all.
And regarding "who on earth still considers a closed platform like SPARC an HPC alternative?": well, have you heard about OpenSPARC, which is GPL-licensed? Is POWER8 open like that? No? Aha. https://en.wikipedia.org/wiki/OpenSPARC
.
And if you really want HPC, then SPARC M7 is the fastest computing cpu out there, reaching 1200 SPECint2006 and 832 SPECfp2006, whereas POWER8 reaches 642 SPECint2006 and 468 SPECfp2006, which is actually slower than x86. I don't understand why POWER8 is much more expensive than x86 whilst being slower. Here are 25-ish benchmarks where SPARC M7 is 2-3x faster than POWER8 and Intel Xeon E5-2699v3 (all the way up to 11x faster): https://blogs.oracle.com/BestPerf/entry/201510_spe...
RTFM, Kebabbert. Read section 2.2 in http://www.redbooks.ibm.com/redpapers/pdfs/redp513... Each socket in an S824 (L or no L) has 8 memory channels, so putting in a DCM (dual-chip module), a single-chip module, or a quad-chip module (if such a thing existed for POWER8) will all give you the same memory bandwidth.
You just have to read the documentation... it's all there. No conspiracy, no magical cheating.
It is amazing that you have kept up your rant for so many years.
Fine, but it is still TRUE that in the server that IBM benchmarked, there were FOUR POWER8 cpus, not two. So it does not matter how much you try to dribble away the fact that it was FOUR POWER8 cpus by talking about irrelevant things.
It is amazing you still keep on dribbling away facts after all these years. I remember when I showed you benchmarks of the SPARC T2+ 1.6GHz besting the POWER6 4.7GHz, to which you replied: "No, the throughput benchmark you show is irrelevant because POWER6 had lower latency, and that is what counts. So POWER6 is in general faster than SPARC T2+." Some time later I showed you a benchmark where SPARC T2+ had lower latency, and you replied: "No, latency is not important. The only thing that is important is throughput, so POWER6 is faster again."
It IS really amazing that you STILL keep on doing this. One would hope you have matured after all these years. I keep on showing you benchmark after benchmark, which you all reject because of this or that. And when you show me benchmarks, I accept them. But still you reject ALL oracle benchmarks "cherrypicking, etc etc" and when you show me IBM benchmarks you expect me to accept them all. Which I do. And I have asked you why I must accept your benchmarks, but why you reject my benchmarks - to which you reply "Oracle cherrypicking".
Enough of this, I am very well aware of your "debate" technique since many years back. Show us benchmarks where IBM POWER8 cpus have higher RAM bandwidth than SPARC M7. Show us facts and hard numbers, that is what counts. Not your dribbling of facts. If you can not show benchmarks, then I suggest you ventilate your opinions on IBM fan pages instead.
.
And besides, regarding the benchmark I posted, where Oracle compares SPARC M7 to POWER8 in STREAM bandwidth, and which some imply means Oracle lies about the POWER8 result: well, that POWER8 result is from IBM themselves. Read the Oracle link and see that the reference points to IBM's web page. So Oracle is not lying about POWER8 results.
No. There were 4 processors each with 6 cores and 4 memory channels in the benchmark that Oracle ran.
Eh... show you numbers where POWER8 has higher STREAM benchmark numbers than a SPARC M7? Kind of hard, as there are no official STREAM submissions for either the M7 or POWER8.
Now, surely the per-chip throughput of the M7 is very hard to match; it is IMHO the undisputed king of the hill when it comes to chip throughput. It has around 33% higher throughput on many workloads than the best that POWER8 or Xeons can produce. But again, it comes at a price: you have to have a large number of threads to exploit the potential of the monster M7 chip.
And not all applications are able to handle such a huge number of threads (256 for the M7 and up to 96 for POWER8). We do know, because IBM releases the numbers, what effect running only 1 thread per core has on the POWER8 chip: basically, it halves the throughput, going from 8 threads per core to 1 thread per core. For the M7, with its simpler HW threading, the impact surely must be much bigger.
Again, reality is more complex than your car-magazine IT world.
Anyone remember how Maxwell was supposed to include 64-bit ARM cores on the flagship GPUs to accelerate performance? It quietly fell off the roadmap about 6 months before Maxwell 1 released.
They did have this on their roadmap several years ago. Between Project Denver being lackluster and TSMC's 20 nm process not being suitable for large GPUs, nVidia's roadmaps have radically changed since then. In fact, Pascal wasn't even part of nVidia's roadmap then.
Well... some things run better in parallel, and that is where a GPU with 3584 cores shines brilliantly. Other things run sequentially, and then you need the fastest core available to get rid of the contention.
NVLink is primarily for GPUs to exchange data between themselves. This avoids going through the PCIe bus and passing by the CPU just to get to the other GPU in the system.
CAPI, on the other hand, is built for the kind of thing you mention with IB. Mellanox 100Gbps IB (2016) and 200Gbps IB (2017) will leverage CAPI to allow direct memory access without going through the CPU. Can you imagine RDMA over 100Gbps with the adapter writing directly to the destination memory address?!?
NVLink is actually designed as a processor-to-processor link with NUMA for memory sharing (think Intel QPI or AMD HyperTransport). And I thought one of the points of using POWER8 with Pascal was to link the POWER8 CPU and the GPU using NVLink instead of PCIe. That would give better bandwidth and probably better latency, while allowing shared memory.
POWER8+ is supposed to support NVLink natively and is supposed to launch later this year. It is odd that IBM is showing off Pascal connected via the PCIe bus to POWER8 instead of directly to POWER8+.
I wonder if Intel will make a REAL server CPU again one day. I know they tried, and failed, with Itanium, but if they kept the x86 arch and built it massively wide, put huge memory bandwidth on it, and targeted TDPs similar to what POWER CPUs typically have (250+ W)... it could be interesting. As it is now, Intel server chips are just a crapton of little mobile cores hooked together (granted, they do use a pretty fancy ring-bus network and all).
They are cornered in technology terms. The latest chip only got 5% better at the core level, so they packed in more cores. When you have a GPU at your side, you have to be pristine on single thread. It might do the trick for cloud with several virtual machines, but for HPC it is not the way to handle it.
They can pull it off, but so far being the do-it-all cpu is not looking good.
Intel can do higher clock speeds and even higher IPC. The problem is that power consumption would increase at a rate higher than the performance gains from the additional clocks or IPC. Intel has an internal design rule that any major change can only increase power consumption by 1% only if it increases performance by 2% or more. This has forced Intel to focus on efficiency, not absolute raw performance.
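A minimal formalization of that design rule as described (my sketch of the criterion, not Intel's actual internal process):

```python
# The "1% power for 2% performance" rule described above: a change is
# accepted only if it delivers at least 2% performance per 1% added power.
def change_accepted(perf_gain_pct, power_cost_pct):
    return perf_gain_pct >= 2 * power_cost_pct

print(change_accepted(5.0, 2.0))  # True:  5% perf for 2% power clears the bar
print(change_accepted(3.0, 2.0))  # False: 3% perf for 2% power does not
```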
Things are slowly changing as SkyLake-EP is a slightly different core than the consumer SkyLake chips currently on the market. The workstation/server chip gets AVX-512 and believed to carry 512 KB of L2 cache per core for example. I can see Intel implementing a few IPC increases that don't adhere to their current design rules just to push the market forward.
The "Intel is cornered" song is so old nobody even registers it consciously anymore. Believing the absolute technology leader can't eke out more than a 5% IPC increase yearly because of anything other than marketing considerations is plain dumb.
I worked on a contemporary cluster computer, 16 TFLOPS (linpack) and consumed slightly over 1MW. GPGPU was starting to take off; we had some test systems with dual 7870s.
Pretty impressive that you can get this much performance on your desk (or more likely a rolling rack you can tuck somewhere far away.. these will still be pretty loud).
Maybe you have more knowledge, but last time I checked, there were some ARM server-ready boards with 48 or 96 CPU cores, but the problem was that an individual core was relatively weak. Good for web servers, but not for workloads that are not easily parallelized. Also, ARM virtualization still looks like something that is not happening yet: VMware doesn't support ARM at all, and there is no big company behind it, like IBM with LPAR virtualization or HP with vPars.
As far as I know, the strongest cores are in the Tegra X1, which has 4 "strong" cores (A57) + 4 weak ones (A53) in a big.LITTLE design. Even if I ignore the weak ones, that is 4 cores in a 15W TDP envelope, which means they are 15/4 ≈ 4W cores, in comparison with x86, where we have a 65W package (Broadwell/Skylake) for 4 cores, so one core uses ~16W.
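The back-of-the-envelope arithmetic spelled out (upper bounds, since package TDP also covers the GPU, uncore and I/O):

```python
# Rough per-core power from package TDP, as in the estimate above
# (package watts, core count).
chips = {
    "Tegra X1 (4x A57)": (15, 4),
    "Broadwell/Skylake quad-core": (65, 4),
}

for name, (tdp_w, cores) in chips.items():
    print(f"{name}: ~{tdp_w / cores:.0f} W per core")
```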
So in my eyes, the inability to create just one beefier ARM core is the main reason we are not using ARM in our desktops, servers or consoles. I'm really curious how hard it is to create this beefy ARM core: does it go against ARM architecture fundamentals, or is it just a matter of adding more L2/L3 cache and silicon and increasing the frequency?
The whole point of using GPU's for highly parallised workloads is to do the parallel work and leave the single-threaded work to a CPU. It's pointless to have more CPU cores with lower single-threaded performance in conjunction with GPU's. The best system will have fast single-threaded performance in fewer CPU cores and leave the parallel work to the GPU's.
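Amdahl's law makes this concrete: once the parallel part runs on the GPU, the serial part dominates, so faster cores beat more cores. A sketch with illustrative numbers (not measurements):

```python
# Amdahl's law: overall speedup when a fraction of the work is parallelized.
def speedup(parallel_fraction, parallel_speedup):
    return 1 / ((1 - parallel_fraction) + parallel_fraction / parallel_speedup)

# 95% of the work offloaded to a big GPU: capped near 20x by the serial 5%.
print(f"{speedup(0.95, 1000):.1f}x")
# Doubling GPU throughput barely moves the needle...
print(f"{speedup(0.95, 2000):.1f}x")
# ...so the remaining serial part, i.e. CPU single-thread speed, is what matters.
```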
Who in their right mind thought that putting a flat-chested lady wearing a male shirt next to a rack of switches would make for a good PR image? WTF is it supposed to represent anyway?
Pinn - Wednesday, April 6, 2016 - link
Any reason not to use ARM64?questionlp - Wednesday, April 6, 2016 - link
My guess is that they are targeting the high-end HPC workloads, which is where IBM POWER and x86 processors currently reign. Both also can provide large I/O bandwidth (memory and interconnect) that is required to feed the beasts.While there are a couple of ARM64 processors being designed that can scale up in speed and I/O, those processors are not out yet; and, the ones that are out, don't meet the I/O requirements or provide enough compute horsepower. The best bet would be for someone (maybe NVIDIA?) to build something like a 8 or 16 core ARM64 processor with native NVLink and >= 40GbE or 40Gb IB)
Not to mention, power consumption of an IBM POWER8 or a pair of Broadwell Xeons is still relatively small compared to 4, 8 or 16 of those NVIDIA compute modules.
Pinn - Wednesday, April 6, 2016 - link
Makes sense. Would love to see that combination arm64 and pascal/volta. Thanks!Ktracho - Wednesday, April 6, 2016 - link
I'm not sure how you could make a business case for something like that, given the huge development and support cost and how small the market would be. Of course, there's Tegra, but I doubt that's what you're referring to.Brutalizer - Wednesday, April 6, 2016 - link
"Each POWER8 processor supports up to 1 TB of DDR3 or DDR4 memory with up to 230 GB/s sustained bandwidth (by comparison, Intel’s Xeon E5 v4 chips “only” support up to 76.8 GB/s of bandwidth with DDR4-2400).""My guess is that they are targeting high-end HPC workloads, which is where IBM POWER and x86 processors currently reign. Both also can provide large I/O bandwidth (memory and interconnect) that is required to feed the beasts."
-------------------------------------------------
Wow! 230 GB/sec sustained bandwidth for POWER8 is really a brutal number! And the x86 supports 80 GB/sec which is also good. The SPARC M7 only states only 160 GB/sec bandwidth in the specs.
Let us look at memory bandwidth benchmarks, and not a theoretical number. How well do they perform in real benchmarks? In official STREAM benchmarks, we see that
https://blogs.oracle.com/BestPerf/entry/20151025_s...
-POWER8 gets 320 GB/sec for a 4-socket system. This translates to 80 GB/sec for one POWER8 cpu. Quite far away from the stated "230GB/sec sustained memory bandwidth", eh?
-E5-2699v3 gets 112 GB/sec for a 2 socket system. This gives 56 GB/sec per x86 cpu.
-SPARC M7 cpu gets 145 GB/sec which is close to the stated number of 160 GB/sec.
Who do you think is most "dishonest" with their numbers? Oracle?
"...IBM says the sustained or delivered bandwidth of the IBM POWER8 12-core chip is 230 GB/s. This number is a peak bandwidth calculation: 230.4 GB/sec = 9.6 GHz * 3 (r+w) * 8 byte. A similar calculation is used by IBM for the POWER8 dual-chip-module (two 6-core chips) to show a sustained or delivered bandwidth of 192 GB/sec (192.0 GB/sec = 8.0 GHz * 3 (r+w) * 8 byte). Peaks are the theoretical limits used for marketing hype, but true measured delivered bandwidth is the only useful comparison to help one understand delivered performance of real applications...."
Brutalizer - Wednesday, April 6, 2016 - link
IBM states that POWER8 has 230 GB/sec _sustained_ bandwidth, by looking at the peak theoretical bandwidth of 230.4 GB/sec. I don't really know when peak theoretical bandwidth became the same thing as sustained bandwidth? In real life benchmarks we see that IBM POWER8 reaches 80 GB/sec, that is 1/3 of their stated number. And someone called Oracle benchmarks "dishonest". Hmmm...
Brutalizer - Wednesday, April 6, 2016 - link
BTW, when Oracle uses 16 GB DIMMs, SPARC M7 reaches 169 GB/sec in real life benchmarks. Just look at the bottom of the link I provided. It is quite close to the stated bandwidth of 160 GB/sec.
JohnMirolha - Wednesday, April 6, 2016 - link
Let's use math. Let's assume a standard 160 GB/s per socket with a 12-core CPU vs the same bandwidth with a 32-core CPU.
Which system will provide better performance at the core level ?
Now let's throw some GPUs, because, you know, HPC does not run Oracle and makes great use of them.
Now, to make things more interesting, let's use Linux for HPC ... because, you know, they mostly run on Linux not Solaris.
SPARC M = irrelevant
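John's per-core question can be made concrete. Assuming his hypothetical 160 GB/s per socket and an even split across cores (a simplification - real per-core demand is workload-dependent):

```python
def per_core_gbps(socket_gbps, cores):
    """Even split of socket memory bandwidth across cores (a simplification:
    real per-core bandwidth demand depends on the workload)."""
    return socket_gbps / cores

print(f"12-core chip: {per_core_gbps(160, 12):.1f} GB/s per core")
print(f"32-core chip: {per_core_gbps(160, 32):.1f} GB/s per core")
```

At equal socket bandwidth, the 12-core chip has roughly 2.7x the per-core bandwidth headroom of the 32-core chip, which is the trade-off behind his question.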
Brutalizer - Thursday, April 7, 2016 - link
Well, one socket with TWO POWER8 cpus reaches 160 GB/sec (read my post on page 2), just the same as one SPARC M7. So the SPARC M7 has at least twice the memory bandwidth. The POWER8 S822L server has two sockets, but four POWER8 cpus. And if you look at benchmarks, the POWER8 is slower than x86, whilst being more expensive. That is quite bad.
https://blogs.oracle.com/BestPerf/entry/201510_spe...
SarahKerrigan - Friday, April 8, 2016 - link
Entry P8 systems use a DCM for cost reasons. Since each die in the DCM has half its memory controllers enabled, the memory performance is effectively identical to the SCM used in other P8 systems. Of course, you knew this, since you've been told it at length at RWT and other places - but don't let me get in the way of your multi-decade Sun/Oracle promotion...
close - Thursday, April 7, 2016 - link
Your argument is that IBM must lie about POWER8 but Oracle wouldn't lie about Sparc? Is this because Oracle is known as an unbiased and independent source of information when it comes to Sparc?
Brutalizer - Thursday, April 7, 2016 - link
No, that is not my argument. My argument is: check the benchmarks and convince yourself that four POWER8 cpus in two sockets reach 320 GB/sec, just the same as TWO SPARC M7 cpus. So the SPARC M7 bandwidth is twice as fast. And IBM tries to trick people into believing that two sockets in the S822L server correspond to two POWER8 cpus, but actually it has four POWER8 cpus. That is very ugly, because IBM tries to trick you into believing that the POWER8 is twice as good as it is. Read page two here, and see my comment on "320GB/sec for two socket POWER8". Very very ugly FUD from IBMers indeed.
Kevin G - Thursday, April 7, 2016 - link
There are two different POWER8 dies, one with 6 cores and another with 12 cores. The 6 core die has four memory channels whereas the 12 core die has eight memory channels. Aggregate bandwidth is the same whether you have a dual chip module or a single chip module. IBM hasn't done so, but it is feasible they could release the single die, 12 core part in the lower end S812L/S814/S822L/S824 systems.
close - Friday, April 8, 2016 - link
Who would have thought that a 32 core SPARC CPU would come out better than a 10 core POWER8 CPU, right? And mind you, a 32 core CPU comes out 2 times faster than a 10 core CPU. Good comparison, boy.
JohnMirolha - Wednesday, April 6, 2016 - link
POWER8 started rolling out support for 2 TB of memory per socket, starting with high-end systems.
Kevin G - Thursday, April 7, 2016 - link
The reason why sustained is close to theoretical peak is the Centaur memory buffer chip. The actual bandwidth from the DIMMs to the Centaur chips is 410 GB/s. The link between the POWER8 and its eight Centaur chips is 230 GB/s. In addition, IBM places 16 MB of L4 cache on each Centaur chip, so there is a quick buffer tied to each memory channel individually. (The Centaur chips cannot cache any data that resides outside of their own memory channel.) Between the 16 MB of cache per memory channel and the higher raw DRAM bandwidth to the buffer than to the CPU, it isn't surprising that the two figures are relatively close.
Oracle's link you cite doesn't list the full S824 hardware configuration. For memory bandwidth tests, the S824 can only obtain maximum memory bandwidth when all of its memory slots are occupied. This is due to the S824 using proprietary DIMMs that include the Centaur memory buffer. Thus it is easily possible to handicap overall memory bandwidth performance by using fewer but higher capacity DIMMs. In fact, Oracle could have configured the S824 with only 4 DIMMs to get the 512 GB capacity noted ( https://blogs.oracle.com/BestPerf/resource/stream/... ) and thus hinder performance.
jesperfrimann - Friday, April 8, 2016 - link
Well, the Oracle benchmark is a joke. It's a competitor submission. There are quite a few things that make you raise an eyebrow. The tester has done quite a few things differently from what he/she would normally have done for one of Oracle's own product submissions. First, the firmware level of the machine is backlevel, to put it mildly: the level used is 810_087, while current is SV840_087, which makes it 13 versions behind the current firmware level. Brrrr... On Oracle's own STREAM submissions, they normally MAX the memory, which they haven't done here; they don't even put down the memory configuration, which they normally do for their own submissions.
And last, he/she configures the machine with 96 logical cores, but only uses 24. *COUGH* On Oracle's own submission he configures and runs over 1024 threads for the T5-8.
So, even though they try to stack the cards, the S824 still puts up quite impressive numbers for such an old machine.
// Jesper
Brutalizer - Friday, April 8, 2016 - link
And here we go again. I think in ALL benchmarks I showed you, EVERY SINGLE ONE OF THEM, you have rejected them all because of this or that. And you did this years ago, and it is a bit worrying you have not changed a bit still today. I mean, don't you realize that as you expect everyone to accept your IBM benchmarks, but reject Oracle benchmarks, you seem like a biased fanboy? Some would say a VERY BIASED fanboy.
I myself support SPARC, yes, but I am guided by benchmarks. I support the best technique. And when POWER7 came and was fastest, I congratulated the IBM benchmarks and agreed POWER7 was fastest. Today SPARC M7 is superior, and the best technique. POWER8 is lagging far behind, even slower than x86. Just look at the benchmarks.
Anyway, on that site with benchmarks I gave you, there are 25ish benchmarks where SPARC M7 is 2-3x faster than POWER8. I understand you reject all of them because of this or that. As you did earlier when we discussed SPARC T2+ vs POWER6, years back. I think it is fascinating that your conclusion is that in every single one of those 25ish benchmarks, POWER8 is faster than SPARC M7 because of this or that. I have never understood your logic; hard numbers say that SPARC M7 cpus are 5.5x faster than POWER8 cpus in OLTP workloads:
https://blogs.oracle.com/BestPerf/entry/20160317_s...
but still you vehemently claim that POWER8 is faster, because of this or that. Amazing.
I remember when I showed you benchmarks where x86 beat POWER6 in Linpack, and you claimed that POWER6 was the faster cpu because one POWER6 core was faster. Amazing. How do you reason with a man displaying flawed logic like that? x86 scores higher in Linpack, but still POWER6 is faster in Linpack - because one POWER6 core is faster! Amazing.
And when I pointed out that one single core being faster does not make the entire cpu faster, you tried to dribble that away too with talk about this or that: BIOS patch level, RAM timing, etc etc etc. Your conclusion was: if you want the fastest Linpack performance, you get a POWER6, even though a Xeon scored higher. Amazing.
And now, on this STREAM benchmark where SPARC M7 is 2x faster than POWER8, your conclusion is that POWER8 is faster than SPARC M7 because of... firmware level? And Oracle didn't max the memory? That is why POWER8 somehow magically increased from 80 GB/sec up to and beyond 160 GB/sec: because of firmware level and no maxing of RAM. Great. (Question: do you really believe this yourself? Doesn't it sound a bit... hollow and unconvincing? No?)
The POWER8 STREAM numbers are from IBM themselves, as Kevin G supplied in the link.
Let's look at the rest of the 25ish benchmarks where SPARC M7 is 2-3x faster, up to 11x faster than POWER8. How will you explain that POWER8 is faster than SPARC M7 in each of them? Different RAM? RAM latency? Power supply? Keyboard and mouse?
jesperfrimann - Friday, April 8, 2016 - link
What you don't understand is that in my professional life I don't really care if a server is blue, brown or purple. I am not a fanboi. What I do care about is what gives my company the biggest ROI. That is my job. And actually, just a few months ago I finished a 'paper' in which I came up with the strategic solution stack for my company's Oracle platform.
And it wasn't POWER, it wasn't Windows, it wasn't AIX... I would have liked it to be M7 for technical reasons, but my cold, hard, unbiased analysis showed that the best platform for my company was x86 box mover XXX using Intel Xeon Ex-xxxx v4, with OVM on top and Oracle Linux. The reason the M7 lost was mostly down to in-house skills, because it actually scored better than x86 in my analysis.
Currently I am looking at what platform to choose for an IBM solution stack, and to be honest nobody comes even close to Linux on POWER, except perhaps AIX on POWER, but here my company's skill base is more suited to Linux. So I am pretty convinced that the paper I am writing will point towards Linux on POWER (with PowerVM, not KVM) as the platform of choice for IBM software products.
I cannot limit myself to 'car magazine' IT as you do, or to vendor FUD.
Again, the STREAM submission stated:
Date result produced : Thu Oct 22 2015 Questions? : Gnanakumar.Rajaram@Oracle.com
That is hardly an IBM submission.
I would be just as critical towards an IBM STREAM submission on an M7 based machine where you only ran 1 thread per core. It would most likely have horrific numbers.
And... well, you are kind of beyond normal reason. I don't dislike the M7; I think it is a wonderful chip. Unfortunately, the business case for using it compared to Xeons with OVM and Oracle Linux just isn't there for us.
// Jesper
Meteor2 - Friday, April 8, 2016 - link
We came to the same conclusion.
Kevin G - Friday, April 8, 2016 - link
@Brutalizer: "And here we go again. I think in ALL benchmarks I showed you, EVERY SINGLE ONE OF THEM, you have rejected them all because of this or that.
[...]
The Power8 STREAM numbers are from IBM themselves, as KevinG supplied in the link."
Don't you think it is a bit hypocritical to cite me since two posts below this you call me a troll, that I'm not serious and that "I don't know what I'm talking about"?
Also, the link I supplied ( https://blogs.oracle.com/BestPerf/resource/stream/... ), if you haven't noticed, is from Oracle, not IBM. If "Oracle" in the URL wasn't a big enough hint, the disclaimer on the first page of that PDF indicates that it was Oracle doing the testing themselves. That PDF is Oracle's detailed testing configuration and, as I pointed out, it omits a few technical details that are relevant to memory bandwidth, i.e. we don't know how many DIMMs were installed in the S824. This is a very fair criticism of Oracle's testing and disclosure since it directly impacts performance.
Brutalizer - Friday, April 8, 2016 - link
@Kevin G: Sorry, but I cannot "discuss" with you, as you
1) are not serious.
2) trolling.
3) have no clue about the stuff you talk about.
For instance, we had a "discussion" about the SGI UV2000 server, which is only used for scale-out clustered HPC-like workloads - and you vehemently explained that the UV2000 was also used for scale-up workloads, like large POWER or SPARC servers. Scale-up workloads such as SAP can not be run on clusters; they can only be run on scale-up servers (look at the SAP top list). The UV2000 is in practice a cluster, and there are no SAP entries for that cluster. Still you insisted the UV2000 could replace a large Unix server on SAP. When I asked for proof and links, you answered:
1) You can boot SAP on a UV2000
2) It is probably possible to run SAP on UV2000
3) something silly I don't remember
You did not present any links to a customer running SAP on UV2000, no SAP benchmarks, no nothing. I even emailed SGI and asked about customers running SAP on UV2000 and there was NO such customer. And even after all this information, you kept insisting that the UV2000 could replace large Unix scale-up servers such as POWER and SPARC. No customer runs SAP on UV2000; they only run clustered workloads on the x86 UV2000 Linux server.
And I tried to explain to you why it is not possible to run SAP on a cluster: it is because the code branches everywhere, and I gave you links to SGI where they confirm this. And you asked several times "what does it mean when code branches everywhere", and finally I explained in a long essay that ERP software serves 1000s of clients, so the server is doing lots of things simultaneously: accounting, book keeping, payrolls, etc - so the code can never be cached. All these clients are doing all sorts of things, so the code branches everywhere; cache is not that important in scale-up servers. That is why SPARC M7 has a brutal bandwidth, and not so good cache. Bandwidth is more important in scale-up servers than cache.
OTOH, clusters run HPC workloads, that is, a tight for loop with calculations on the same set of grid points, so the code does not branch and you can cache it well. Cache is important, as the software seldom goes out to RAM. So bandwidth is not important. And if you look at x86, it has low bandwidth and good cache, making it useless for scale-up workloads.
Scale-up workloads thrash the cache. That is why the UV2000 can not be used for SAP. I wrote a long essay explaining all this about cache and RAM etc etc. And still, after this long explanation, you did not understand anything about thrashing the cache. You kept writing stuff that showed you did not understand why a cluster can not run tightly coupled software.
So, it is useless to try to explain to you how things really are. You are given links and explanations, and still you don't get it. You are just very uneducated about computer tech. That makes it difficult to talk to you, because you don't get it when someone shows you links or explains why you are wrong. You don't understand.
Or, you do understand but pretend not to, so you are trolling to make me write long texts to you. I can't decide which: if you are trolling and pretending to be obtuse, or if you are obtuse for real.
Anyway, I advise anyone not to go into technical discussions with you, because you don't understand what you are talking about. If I point out an error in your post, you don't get it. I even post links, and you read them, and STILL you don't get it. I used to say "there are no dumb people, only ignorant" - but in your case, I am not really sure anymore...
JohnMirolha - Wednesday, April 6, 2016 - link
Actually, POWER8 has 320 GB/s in a 2-SOCKET system (check link below, pg 15)... so it would be 160 GB/s per physical chip (you know, despite whatever concept you have, the silicon thing you can pick up and plug into a socket). The POWER8 for 2-socket machines is rated at 192 GB/s, while its high-end counterpart is rated at 230 GB/s. So the difference here is only 16%.
And... who on earth still considers a closed platform like SPARC an HPC alternative? The only SPARC in HPC is the K computer, which is Fujitsu's SPARC64.
http://openpowerfoundation.org/wp-content/uploads/...
Brutalizer - Thursday, April 7, 2016 - link
So the 2-socket 10-core POWER8 S822L server has 320 GB/s bandwidth? So therefore ONE POWER8 cpu has 160 GB/sec? Well, this again shows how ugly and FUDding the IBM people are. This S822L server has two sockets, but FOUR cpus. So one socket has 160 GB/sec, but one POWER8 cpu has 80 GB/sec. You knew this, and tried to trick me into believing that one POWER8 cpu is as fast as SPARC M7. That is the reason you talk about "physical chip..." nonsense.
First I actually wrote a post acknowledging that the POWER8 cpu's bandwidth is as high as SPARC M7's, but as I wanted to add a small detail I checked it up, and lo and behold, the S822L server has FOUR cpus. Very very ugly of you. Why am I not surprised by the IBM people? You do know that the term FUD was coined by one of IBM's customers?
https://en.wikipedia.org/wiki/Fear,_uncertainty_an...
And finally, if you read the link (it is a research paper) you gave me, it says that the researchers ran the STREAM benchmark 1,000 times, and 10% of the time the POWER8 server reached 320 GB/sec. The rest of the time: 23% of the time the benchmark reached 245 GB/sec, 16% of the time it reached 260 GB/sec, and 24% of the time it reached 290 GB/sec. So only a very small fraction of the time did the POWER8 reach 320 GB/sec.
So, 10% of the time FOUR POWER8 cpus reached the same bandwidth as two SPARC M7s. The rest of the time, it was far lower. POWER8 RAM bandwidth is not impressive at all.
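Taking the quoted run fractions at face value (note they sum to only 73%, so the remaining runs are unaccounted for in this comment), the average over the listed buckets lands well below the 320 GB/sec peak:

```python
# (fraction of runs, GB/s) as quoted from the paper; the fractions sum to 0.73,
# so this is only a sketch over the listed buckets, not the full distribution.
buckets = [(0.10, 320), (0.23, 245), (0.16, 260), (0.24, 290)]

listed = sum(f for f, _ in buckets)
avg_listed = sum(f * bw for f, bw in buckets) / listed

print(f"{listed:.0%} of runs listed; their average is {avg_listed:.0f} GB/s")
```

So even among the listed runs, the typical result sits around 270-275 GB/sec, roughly 15% below the 320 GB/sec best case.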
And regarding "And... who on earth still considers a closed platform like SPARC an HPC alternative ?"
Well, have you heard about OpenSPARC, which is GPL licensed? Is POWER8 open like that? No? Aha.
https://en.wikipedia.org/wiki/OpenSPARC
.
And if you really want HPC, then SPARC M7 is the fastest computing cpu out there, reaching 1200 SPECint2006 and 832 SPECfp2006, whereas POWER8 reaches 642 SPECint2006 and 468 SPECfp2006, which is actually slower than x86. I don't understand why POWER8 is much more expensive than x86 whilst being slower? Here are 25ish benchmarks where SPARC M7 is 2-3x faster than POWER8 and Intel Xeon E5-2699v3 (all the way up to 11x faster):
https://blogs.oracle.com/BestPerf/entry/201510_spe...
jesperfrimann - Friday, April 8, 2016 - link
RTFM, Kebabbert. Read section 2.2 in http://www.redbooks.ibm.com/redpapers/pdfs/redp513...
Each socket in an S824 (L or no L) has 8 memory channels; populating it with a DCM (dual chip module), a single chip module, or a quad chip module (if such a thing existed for POWER8) will all give you the same memory bandwidth.
You just have to read the documentation... it's all there... no conspiracy, no magical cheating.
It is amazing that you have kept up your rant for so many years.
// Jesper
Brutalizer - Friday, April 8, 2016 - link
Fine, but it is still TRUE that in the server that IBM benchmarked, there were FOUR POWER8 cpus. Not two. So it does not matter how much you try to dribble away the fact that it was FOUR POWER8 cpus by talking about irrelevant things. It is amazing you still keep on dribbling away facts after this many years. I remember when I showed you SPARC T2+ 1.6GHz besting POWER6 4.7GHz benchmarks, to which you replied:
-No, the throughput benchmark you show is irrelevant because POWER6 had lower latency. And that is what counts. So POWER6 is in general faster than SPARC T2+.
Some time later I showed you a benchmark where SPARC T2+ had lower latency, and you replied:
-No, latency is not important. The only thing that is important is throughput, so POWER6 is faster again.
It IS really amazing that you STILL keep on doing this. One would hope you had matured after all these years. I keep on showing you benchmark after benchmark, all of which you reject because of this or that. And when you show me benchmarks, I accept them. But still you reject ALL Oracle benchmarks ("cherrypicking", etc etc), and when you show me IBM benchmarks you expect me to accept them all. Which I do. And I have asked you why I must accept your benchmarks but you reject mine - to which you reply "Oracle cherrypicking".
Enough of this; I have been well aware of your "debate" technique for many years. Show us benchmarks where IBM POWER8 cpus have higher RAM bandwidth than SPARC M7. Show us facts and hard numbers, that is what counts. Not your dribbling of facts. If you can not show benchmarks, then I suggest you ventilate your opinions on IBM fan pages instead.
.
And besides, in the benchmark I posted, where Oracle compares SPARC M7 to POWER8 in STREAM bandwidth, and which some imply means Oracle lies about the POWER8 result - well, that POWER8 result is from IBM themselves. Read the Oracle link and see the reference pointing to IBM's web page. So Oracle is not lying about POWER8 results.
jesperfrimann - Friday, April 8, 2016 - link
No. There were 4 processors, each with 6 cores and 4 memory channels, in the benchmark that Oracle ran.
Eh... show you numbers that POWER8 has higher STREAM benchmark numbers than a SPARC M7? Kind of hard, as there are no official STREAM submissions for either the M7 or POWER8.
Now surely the per-chip throughput of the M7 is very hard to match; it is IMHO the undisputed king of the hill when it comes to chip throughput. It has around 33% higher throughput on many workloads than the best that POWER8 or Xeons can produce. But again it comes at a price: you have to have a large number of threads to exploit the potential of the monster M7 chip.
And not all applications are able to handle such a huge number of threads (256 for the M7 and up to 96 for POWER8).
We do know, because IBM releases the numbers, what effect running only 1 thread per core has on the POWER8 chip. Basically it halves the throughput, going from 8 threads per core to 1 thread per core. For the M7, with its simpler HW threading, the impact surely must be much bigger.
Again, reality is more complex than your car-magazine IT world.
// Jesper
jasonelmore - Wednesday, April 6, 2016 - link
anyone remember how Maxwell was supposed to include 64-bit ARM cores on the flagship GPUs to accelerate performance? It quietly fell off the roadmap about 6 months before Maxwell 1 released.
HighTech4US - Thursday, April 7, 2016 - link
Only an idiot like you believes every idiotic rumor on the web and believes that it must be true.
Kevin G - Thursday, April 7, 2016 - link
It wasn't a rumor:
http://www.computerworld.com/article/2493074/compu...
They did have this on their roadmap several years ago. Between Project Denver being lackluster and TSMC's 20 nm process not being suitable for large GPUs, nVidia's roadmap has radically changed since then. In fact, Pascal wasn't even part of nVidia's roadmap then:
http://www.anandtech.com/show/6846/nvidia-updates-...
JohnMirolha - Wednesday, April 6, 2016 - link
Well... some things run better in parallel, and that is where a GPU with 3584 cores can perform brilliantly.
Other things run sequentially, and then you need the fastest core available to get rid of the contention.
NVLink is primarily for GPUs to exchange data between themselves. This avoids going through the PCIe bus and passing by the CPU just to get to the other GPU in the system.
CAPI, on the other hand, is built for the kind of thing you mention with IB. Mellanox 100 Gbps IB (2016) and 200 Gbps IB (2017) will leverage CAPI to allow direct memory access, without the need to go through the CPU. Can you imagine RDMA over 100 Gbps with the adapter writing directly to the destination memory address?!?
frenchy_2001 - Thursday, April 7, 2016 - link
NVLink is actually designed as a processor-to-processor link with NUMA for memory sharing (think Intel QPI or AMD HyperTransport). And I thought one of the points of using POWER8 with Pascal was to link the POWER8 CPU and the GPU using NVLink instead of PCIe. That would give better bandwidth and probably better latency, while allowing shared memory.
Kevin G - Thursday, April 7, 2016 - link
POWER8+ is supposed to support NVLink natively and is supposed to launch later this year. It is odd that IBM is showing off Pascal connected via the PCIe bus with POWER8 instead of directly with POWER8+.
jesperfrimann - Friday, April 8, 2016 - link
@Kevin G: It's IBM; if they can milk an old product a little longer, they will...
// Jesper
extide - Wednesday, April 6, 2016 - link
I wonder if Intel will make a REAL server CPU again one day. I know they tried, and failed, with Itanium, but if they kept the x86 arch and built it massively wide, put huge memory b/w behind it, and targeted similar TDPs as the POWER CPUs typically do (250+ W)... it could be interesting. As it is now, Intel server chips are just a crapton of little mobile cores hooked up (granted, they do use a pretty fancy ringbus network and all).
Pinn - Wednesday, April 6, 2016 - link
Hz died a long time ago.Shadow7037932 - Wednesday, April 6, 2016 - link
Intel will do it when they feel the heat. Right now, it looks like they aren't really facing too much competition.
JohnMirolha - Wednesday, April 6, 2016 - link
"Intel will do it when they feel the heat"... Join the dots below:
http://www.digitaltrends.com/computing/intels-lead...
http://arstechnica.com/information-technology/2016...
http://www.businessinsider.com/intel-ceo-brian-krz...
http://wccftech.com/intel-14nm-broadwell-cpu-archi...
They are cornered in technology terms. The latest chip only got 5% better at the core level, so they packed in more cores. When you have a GPU at your side, you have to be pristine at single thread.
It might do the trick for cloud with several virtual machines, but for HPC it is not the way to handle it.
They can pull it off, but so far being the do-it-all cpu is not looking good.
Kevin G - Thursday, April 7, 2016 - link
Intel can do higher clock speeds and even higher IPC. The problem is that power consumption would increase at a rate higher than the performance gains from the additional clocks or IPC. Intel has an internal design rule that any major change can only increase power consumption by 1% if it increases performance by 2% or more. This has forced Intel to focus on efficiency, not absolute raw performance.
Things are slowly changing, as SkyLake-EP is a slightly different core than the consumer SkyLake chips currently on the market. The workstation/server chip gets AVX-512 and is believed to carry 512 KB of L2 cache per core, for example. I can see Intel implementing a few IPC increases that don't adhere to their current design rules just to push the market forward.
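The design rule Kevin describes (as reported; the exact internal criterion is not public) amounts to a simple efficiency gate, sketched here:

```python
def passes_rule(perf_gain_pct, power_cost_pct):
    """Reported Intel rule of thumb: accept a change only if it buys at least
    2% performance per 1% of added power (the real criterion is internal)."""
    return perf_gain_pct >= 2 * power_cost_pct

print(passes_rule(5, 2))   # 5% perf for 2% power: accepted
print(passes_rule(3, 2))   # 3% perf for 2% power: rejected
```

Under a gate like this, raw clock-speed pushes tend to fail (power grows superlinearly with frequency and voltage), which is why the efficiency-first designs win out.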
Michael Bay - Saturday, April 9, 2016 - link
The "Intel is cornered" song is so old nobody even registers it consciously anymore. Believing the absolute technology leader can't eke out more than a 5% IPC increase yearly because of anything other than marketing considerations is plain dumb.
protomech - Wednesday, April 6, 2016 - link
"The Earth Simulator consumed 20 kW of POWER"
Should be 20 MW, I believe.
I worked on a contemporary cluster computer: 16 TFLOPS (Linpack), consuming slightly over 1 MW. GPGPU was starting to take off; we had some test systems with dual 7870s.
Pretty impressive that you can get this much performance on your desk (or more likely in a rolling rack you can tuck somewhere far away... these will still be pretty loud).
ruthan - Wednesday, April 6, 2016 - link
Maybe you have more knowledge, but last time I checked, there were some ARM server-ready boards with 48 or 96 CPU cores; the problem was that the individual cores were relatively weak. Good for web servers, but not for workloads that don't parallelise easily. Also, ARM virtualization still looks like something that is not happening - VMware doesn't support ARM at all, and there is no big company behind it, like IBM with LPAR virtualization or HP with vPars. As far as I know, the strongest cores are in the Tegra X1, which has 4 "strong" cores (A57) + 4 weak ones (A53) in a big.LITTLE design. Even if I ignore the weak ones, that's 4 cores in a 15 W TDP envelope, which means roughly 15/4 = ~4 W per core, compared with x86 where we have a 65 W package (Broadwell/Skylake) for 4 cores - about 16 W per core.
So in my eyes, the inability to create just one beefier ARM core is the main reason why we are not using ARM in our desktops, servers or consoles. I'm really curious how hard a job it is to create such a beefy ARM core - does it go against ARM architecture fundamentals, or is it just about adding some more L2/L3 cache and silicon and increasing the frequency?
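ruthan's per-core power arithmetic, cleaned up as a sketch (these are naive package-TDP divisions, not measured per-core power - uncore, iGPU and boost behaviour are all ignored):

```python
def watts_per_core(package_tdp_w, cores):
    """Naive per-core power: package TDP split evenly across cores
    (ignores uncore, integrated GPU and boost behaviour)."""
    return package_tdp_w / cores

print(f"Tegra X1 big cluster: {watts_per_core(15, 4):.2f} W/core")
print(f"65 W quad-core x86:   {watts_per_core(65, 4):.2f} W/core")
```

So the comparison being made is roughly 4 W per ARM core against 16 W per x86 core, a ~4x per-core power budget gap.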
Meteor2 - Friday, April 8, 2016 - link
Apple is creeping towards that with the A9X. Not a million miles off an Intel m3.
tygrus - Wednesday, April 6, 2016 - link
The whole point of using GPUs for highly parallelised workloads is to do the parallel work and leave the single-threaded work to a CPU. It's pointless to have more CPU cores with lower single-threaded performance in conjunction with GPUs. The best system will have fast single-threaded performance in fewer CPU cores and leave the parallel work to the GPUs.
Arnulf - Thursday, April 7, 2016 - link
Who in their right mind thought that putting a flat-chested lady wearing a male shirt next to a rack of switches would make for a good PR image? WTF is it supposed to represent anyway?SaolDan - Thursday, April 7, 2016 - link
That's funny. Had to scroll up to look at the picture. She kinda looks like Amy from The Big Bang Theory.
She's a lesbian. Good PR move.
doggface - Thursday, April 7, 2016 - link
Wow. And who says IT isn't full of sexist male pigs. Maybe she is meant to represent a scientist, who is busy sciencing things in a professional environment.
I am sorry there isn't enough bikini there for you.
Michael Bay - Saturday, April 9, 2016 - link
>muh PC
>muh workplace sexism
>muh powerful womyn that need no man
IT is a male field.
Scratch that, EVERYTHING is a male field. Deal with it.
doncornelius01 - Friday, April 15, 2016 - link
Sexism defended?