The Era of Tera: Intel Reveals More About 80-Core CPU
by Anand Lal Shimpi on February 11, 2007 5:44 PM EST
Posted in: CPUs
Clocks and Power Management
In a modern-day microprocessor, making sure the clock signal arrives at the same time across all parts of the chip can be a difficult task for a designer, especially as CPU frequencies and chip area both increase. However, it's a necessary part of chip design, as the clock needs to arrive at all parts of the chip within tight parameters in order for the CPU to behave correctly. Intel tells us that in modern-day microprocessors, clock distribution is responsible for approximately 30% of a chip's total power consumption, so any power savings made here will be significant.
The teraflop chip, however, isn't a conventional chip; as each tile is independent, the clock only needs to arrive at all parts of a 3mm^2 tile at the same time, not across the entire 275mm^2 chip. With this in mind, Intel designed the teraflop chip to allow the clock to arrive at individual tiles out of phase. This approach means that tile-to-tile communication may end up being a bit slower than it could be, but the power savings are tremendous. Intel estimates that the power required to distribute the clock to all of the tiles on the teraflop chip at 4GHz is 2.2W, or 1.2% of total power consumption under load.
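To make the phase tolerance concrete, here's a minimal software sketch of the kind of synchronizing FIFO that lets data cross between two tiles whose clocks run at the same frequency but out of phase. This is purely an illustration of the concept, not Intel's actual router logic; the FIFO depth and all names are hypothetical.

```c
/* Sketch of a mesochronous tile-to-tile link: sender and receiver
 * tick at the same rate but out of phase, so data crosses through a
 * small synchronizing FIFO instead of being latched on one global
 * clock edge. Illustrative only; sizes and names are made up. */
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 4            /* power of two keeps index math cheap   */

typedef struct {
    uint32_t slots[FIFO_DEPTH];
    volatile uint32_t wr;       /* advanced on the sender's clock edge   */
    volatile uint32_t rd;       /* advanced on the receiver's clock edge */
} sync_fifo;

/* Called from the sending tile's clock domain. */
static bool fifo_push(sync_fifo *f, uint32_t flit) {
    if (f->wr - f->rd == FIFO_DEPTH) return false; /* full: stall a cycle */
    f->slots[f->wr % FIFO_DEPTH] = flit;
    f->wr++;
    return true;
}

/* Called from the receiving tile's clock domain, one phase later. */
static bool fifo_pop(sync_fifo *f, uint32_t *flit) {
    if (f->rd == f->wr) return false;   /* empty: nothing has arrived yet */
    *flit = f->slots[f->rd % FIFO_DEPTH];
    f->rd++;
    return true;
}
```

The extra buffering stage is the "a bit slower" tile-to-tile cost mentioned above; in exchange, no carefully balanced global clock tree has to span the full 275mm^2 die.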
Obviously, if we had a network of more complicated cores, distributing the clock within a larger, more complex tile would require more power than this, but the takeaway is that in a network like this you can simplify overall chip clock distribution by only worrying about the clock within a tile.
Clock management isn't the only area where Intel looked to save power, as the teraflop chip architecture is very power-conscious in its design. Each tile is divided into 21 individual sleep regions that can be powered down independently depending on the type of instruction being executed, and individual tiles can be powered down independently of one another. And as we mentioned before, the PE and router on each tile can be powered down independently.
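As a rough illustration of how per-tile sleep regions might be managed, the sketch below models the 21 regions as bits in a mask that get gated based on the class of instruction in flight. The instruction-to-region mapping is invented for illustration; on the real chip this control lives in hardware, not software.

```c
/* Hedged sketch of per-tile sleep regions: 21 independently gated
 * regions modeled as bits in a mask. The region assignments below
 * are hypothetical. */
#include <stdint.h>

#define NUM_REGIONS 21
#define ALL_REGIONS ((1u << NUM_REGIONS) - 1)

typedef struct {
    uint32_t awake;   /* bit i set => sleep region i is powered */
} tile_power;

/* Hypothetical instruction classes and the regions each one touches. */
enum instr_class { INSTR_FPMAC, INSTR_LOADSTORE, INSTR_BRANCH };

static uint32_t regions_for(enum instr_class c) {
    switch (c) {
    case INSTR_FPMAC:     return 0x0003Fu;  /* FP datapath regions */
    case INSTR_LOADSTORE: return 0x00FC0u;  /* memory regions      */
    case INSTR_BRANCH:    return 0x3F000u;  /* control regions     */
    }
    return ALL_REGIONS;
}

/* Wake only what the next instruction needs; gate everything else. */
static void set_power_for(tile_power *t, enum instr_class c) {
    t->awake = regions_for(c);
}

/* Powering a whole tile down is just clearing every region bit. */
static void tile_sleep(tile_power *t) { t->awake = 0; }
```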
Within the router itself, each one of the five ports can be powered down independently as well. With 80 cores, the teraflop chip can also redistribute load according to thermal needs. If a handful of tiles are getting too hot, the chip can dynamically wake up a different set of tiles to begin working in order to avoid creating hotspots.
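The hotspot-avoidance behavior can be sketched as a simple rebalancing loop: when a busy tile runs hot, hand its work to the coolest sleeping tile and let the hot one rest. The temperature threshold and sensor interface below are assumptions for illustration; the actual chip handles this at the hardware level.

```c
/* Sketch of thermally aware work placement across the 80-tile mesh.
 * Threshold, sensors, and migration cost are all hypothetical. */
#include <stdbool.h>

#define NUM_TILES   80
#define HOT_LIMIT_C 85.0   /* invented threshold for illustration */

typedef struct {
    double temp_c;
    bool   busy;
} tile;

/* Find the coolest idle tile to absorb work from a hot one. */
static int coolest_idle(const tile t[NUM_TILES]) {
    int best = -1;
    for (int i = 0; i < NUM_TILES; i++) {
        if (!t[i].busy && (best < 0 || t[i].temp_c < t[best].temp_c))
            best = i;
    }
    return best;  /* -1 if every tile is already busy */
}

/* One pass of hotspot avoidance over the whole mesh. */
static void rebalance(tile t[NUM_TILES]) {
    for (int i = 0; i < NUM_TILES; i++) {
        if (t[i].busy && t[i].temp_c > HOT_LIMIT_C) {
            int j = coolest_idle(t);
            if (j < 0) break;       /* nowhere cooler to go  */
            t[j].busy = true;       /* wake the cool tile    */
            t[i].busy = false;      /* let the hot one sleep */
        }
    }
}
```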
The FPMACs remain in sleep mode until they are needed, so there's an additional latency penalty when waking them up, but it prevents power consumption from spiking as soon as there's load, which can help simplify power delivery and other elements of the chip as well. Approximately 90% of the FPMAC logic, and a total of 74% of each PE, uses sleep transistors to help reduce power consumption as described above. Intel states that sleep transistors take up, on average, 5.4% more die area than regular transistors and come with a 4% frequency penalty, but the power savings are worth it. Sleep transistors are used in other Intel processors, including the Core 2 family.
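Here's a small sketch of that wake-on-demand trade-off: the first floating-point operation after an idle period pays a wake-up penalty, while steady-state operations run at full speed, so power ramps gradually instead of spiking. The cycle counts are invented for illustration and don't reflect the real FPMAC's timing.

```c
/* Sketch of lazy FPMAC wake-up: the unit stays gated until an FP
 * operation arrives; the first op after idle pays a wake penalty.
 * Cycle counts are hypothetical. */
#include <stdint.h>
#include <stdbool.h>

#define WAKE_PENALTY_CYCLES 6   /* invented wake-up latency */

typedef struct {
    bool awake;
    uint64_t cycles;             /* running cycle count */
} fpmac_unit;

/* Issue one multiply-accumulate; returns the cycles this op cost. */
static uint32_t fpmac_issue(fpmac_unit *u) {
    uint32_t cost = 1;           /* single-cycle throughput once warm */
    if (!u->awake) {
        cost += WAKE_PENALTY_CYCLES;   /* first op after sleep is slow */
        u->awake = true;
    }
    u->cycles += cost;
    return cost;
}

/* Gate the unit again once the pipeline has drained. */
static void fpmac_idle(fpmac_unit *u) { u->awake = false; }
```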
25 Comments
F1N3ST - Monday, February 19, 2007
800 cores for 10 TFlops I say.
jiulemoigt - Wednesday, February 14, 2007
Maybe 80 un-synced in-order chips is pointless, but that stack as a mem controller...
80 socketed un-synced in-order chips is pointless, since most of the functionality comes from branch logic and out-of-order operations, and not syncing them together means that you could not pass data through them, only to them, and even then, issues with passing data would be a mess.
Yet that stack sitting underneath a modern CPU, especially if it could be used as a modern memory stack with cache-speed data access for four cores, would be a speedup many corporate customers could use. With the memory controller on the chip in the center controlling the data flow, treating system memory as a virtual extension of it, just like modern hard drives are virtual extensions of system memory, now we are talking about accessing data as fast as we can use it. Though the branch logic is going to have to get even better.
najames - Monday, February 12, 2007
Remember the Itanium and the BILLIONS of dollars Intel spent on the thing? Remember how they thought every company would buy them by the truckload? Remember how expensive they were? Intel did deliver on the Core 2, but I am still leery of anything they hype up.
Brian23 - Monday, February 12, 2007
I know that this chip won't run x86 code, but how does a Core 2 Duo 6600 compare to this as far as teraflops go?
AnnihilatorX - Monday, February 12, 2007
I believe that due to the physical structure of the silicon lattice, silicon is just not a good material candidate for a system-on-chip design. It's the exact same reason why blue laser diodes are made of gallium arsenide rather than silicon. It's time to move on to a much faster and better material than silicon.
fitten - Monday, February 12, 2007
Yes, but silicon has the advantage of being:
a) very cheap, comparatively
b) plentiful
benx - Monday, February 12, 2007
I think it is time to stop building computers around the von Neumann cycle idea. There will always be the FSB performance hit. To counter the problem, CPU builders just add more L1/L2/L3 and now maybe L4? Time to make the Intel cycle without an FSB =)
fikimiki - Monday, February 12, 2007
80 cores sounds great for a webserver, Java, or parallel processing, but how does it stand against the price and performance of 4 x QuadCore chips stacked on a single board? Intel is trying to achieve the same thing as Transmeta, or is just showing its marketing muscle once again. I'm sure that the teraflop chip is going to lose to a specialized variety of chips like nVidia, ATI, Cell, or Opteron together. You put 3-4 of those together and that's it.
We hear that the R580 (ATI) can run some calculations 20x faster than an ordinary x86 chip, and the same with Cell, so what the hell is the teraflop chip? Especially with integer-only calculations?
JarredWalton - Monday, February 12, 2007
I think you're missing the point of this article and the processor. Intel has no intention of ever releasing this particular teraflop chip into the mainstream market. This is an R&D project, nothing more, nothing less. All you have to do is look at the transistor counts to realize that performance isn't going to be competitive right now. Intel chose 80 cores simply because that was what fit within their die size constraints. If they could've fit 100 cores, they would have done that instead.
In the future, Intel is going to take some of what they've learned with this research project and apply it to other processors that they actually intend to mass produce and sell. That probably won't happen for several more years at least, and when they get around to releasing those chips you can be sure that they won't have 80 cores, and that the cores they do have won't be anything like the simple processing units on this proof of concept.
How long before anything like this ever becomes practical on desktop computers? How long before it becomes necessary? Those are both interesting questions, and software obviously has a long way to go first. I have no doubt that someday people are going to have computers with dozens of processor cores sitting on their desktops and in their laptops. Whether that's going to be in 10 years or 100 years... time will tell. I just hope I'm around long enough to see it! :-)
Andrwken - Monday, February 12, 2007
Basically they are just using it as a proving ground to show what can be done when more bandwidth is needed than traditional FSB and HyperTransport can deliver. It would definitely be worthwhile in a configuration with, say, 20 cores, using 8 for CPU, 8 for video, and 2 for physics (one example). But my question is, doesn't this kind of go along with the supposed programmable generic cores that Intel wants to use in their new discrete graphics cards? If so, it could be supposed that the code for this kind of monster is already being worked out, and one multicore chip can be programmed to use each core as necessary, finally eliminating all the discrete cards and leveraging the power of one large multicore chip as needed? (Sony came close with the PS3 but still needed a discrete graphics chip at this point.) They get the programming down with the discrete graphics cards and then use that for single-chip integration down the road. That's just how I am reading into it and I may be way off base, but this tech may be much closer to viable than we are giving it credit for. Especially in a cheap laptop or small form factor application.