Researchers Unveil Experimental 36-Core Chip
rtoz writes The more cores — or processing units — a computer chip has, the bigger the problem of communication between cores becomes. For years, Li-Shiuan Peh, the Singapore Research Professor of Electrical Engineering and Computer Science at MIT, has argued that the massively multicore chips of the future will need to resemble little Internets, where each core has an associated router, and data travels between cores in packets of fixed size. This week, at the International Symposium on Computer Architecture, Peh's group unveiled a 36-core chip that features just such a "network-on-chip." In addition to implementing many of the group's earlier ideas, it also solves one of the problems that has bedeviled previous attempts to design networks-on-chip: maintaining cache coherence, or ensuring that cores' locally stored copies of globally accessible data remain up to date.
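To make the "little Internets" idea concrete, here is a minimal Python sketch of how a packet might hop from router to router across 36 cores. The 6x6 mesh layout, the core numbering, and the X-then-Y routing rule are all assumptions made for illustration; the summary only says that each core has an associated router and that data travels in fixed-size packets.

```python
# Hypothetical sketch: dimension-ordered (XY) routing on an assumed 6x6 mesh.
# Neither the mesh layout nor the routing rule comes from the summary.

MESH_WIDTH = 6  # assumed: 6 x 6 = 36 cores


def core_to_xy(core_id, width=MESH_WIDTH):
    """Map a core number (0..35) to (x, y) grid coordinates."""
    return core_id % width, core_id // width


def xy_route(src, dst, width=MESH_WIDTH):
    """Return the cores a packet visits, moving along X first, then along Y."""
    x, y = core_to_xy(src, width)
    dst_x, dst_y = core_to_xy(dst, width)
    path = [src]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


if __name__ == "__main__":
    route = xy_route(0, 35)
    print(route)           # the routers visited, corner to corner
    print(len(route) - 1)  # number of hops
```

Under these assumptions the worst case, corner to corner, is 10 hops, whereas a single shared bus would make every core contend for the same wires.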
All this performance in just one chip. I mean, sure, it has 36 cores but let's be rational here... does it seriously expect to run Crysis?
Good people go to bed earlier.
According to the comparison table (see 4:21 in this video), this chip uses 1.1V while other standard chips use 1.0V. This difference may make it hard for chip makers to use this technology.
Really? They won't be able to specify a 1.1V VRM instead of a 1.0V VRM? Those poor, poor chip makers. They sound like a bunch of incompetent fucks.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
That doesn't matter. The power supply surrounding the socket/pads will account for whatever Vcc needs to be.
More data, damnit!
That's a fun post! 36-core is immense! As an aside: It's been a while since we've seen any decent rise in processor GHz. I remember IBM talking about functioning reasonably cool 10 GHz processors (ref needed) in the early 2000s, but no one has them in the shops yet! I'm sure this was discussed in Moore's Law lectures prior to Y2K, but mention it these days and everyone scowls! So some people can (and they run cool) and some people can't, what normally happens in computing when the faster items are released?
The purpose of existence is to make money.
http://www.adapteva.com/epipha...
64 cores, mesh network that extends off the chip, in production.
Try harder MIT :-p
The previous comments are only true, if no-one says they're wrong.
So what's special about this chip that Intel's Xeon Phi (first demonstrated in 2007 as Knights Landing with 80 or so cores) isn't already doing? Or is this just a rehash of 7 year old technology that's already in production? It sounds like a copy/paste of Intel's research.
"Intel's research chip has 80 cores, or "tiles," Rattner said. Each tile has a computing element and a router, allowing it to crunch data individually and transport that data to neighboring tiles." - Feb 11, 2007
I would be curious to know more about the architecture and all around chip specs they are using in their prototype: clock speed, memory interface, etc. The article states they are developing a version of Linux to test it on, so it's safe to say it's an established architecture. Anyway, I am excited to see the results once they have tested it on Linux. While this does not help with the density per core problem, perhaps it will help extend Moore's Law from the perspective of speed increase in respect to micro circuitry.
Brought to you by Carl's Junior.
So, in one die, it's a little interesting, though GPU stream processors and Intel's Phi would seem to suggest this is not that novel. The latter even lets you ssh in and see the core count for yourself in a very familiar way (though it's not exactly the easiest of devices to manage, it's still a very much real world example of how this isn't new to the world).
The 'not all cores are connected' is even older. In the commodity space, hypertransport and QPI can be used to construct topologies that are not full mesh. So not only is it not all cores on a bus, it is also not all cores mesh connected, the two attributes claimed as novel here.
Basically, as of AMD64 people had relatively affordable access to an implementation of the concept, and as of Nehalem both major x86 vendors had this concept in place. Each die included all the logic needed to implement a fabric, with the board providing essentially passive traces.
XML is like violence. If it doesn't solve the problem, use more.
Erlang on a chip :-)
A higher high/low voltage swing (with a reasonable amount of other stuff being equal) will be more of a thermal nuisance; but if the perks make up for it, that's hardly a dealbreaker. The toasty end of boring desktop CPUs is somewhere north of 200watts already, with a little shoving that they typically survive, so if somebody really wants 36 cache-coherent cores on-die, they'll suck it up and make it work.
For applications that don't specifically demand that, I'd be interested to know how the costs and benefits of 'dealing with the cooling demands of a smaller number of denser parts' compare with 'dealing with the cooling demands of more, cooler, parts, closer to whatever the performance per watt sweet spot is; but with more cabling, PSUs, switches, and similar interconnect and support stuff to buy and power'...
While adding an extra core or two made big jumps in performance (because you are almost always running at least two applications) there comes a point where most users won't see a performance boost. While I may now be able to throw 36 processors at a problem, you have to program all those cores to work together. Right now that's a lot of effort, and until programming languages catch up and can optimize code by making it massively parallel, this is going to be a non-starter.
pointer arithmetic, cache invalidation, and off-by-one errors
"I'd just like to emphasise that taking a million years isn't a metaphor here..." -Rich Bradshaw
Cache coherency has been one of the banes of multicore architecture for years. It's nice to see a different approach but chip manufacturers are already getting high performance results without introducing additional complexity. The Oracle (Sun) Sparc T5 architecture has 16 cores with 128 threads running at 3.6GHz. It gives a few more years to Solaris at least but it's still a hell of a processor. For you Intel fans the E7-2790 v2 sports 15 cores with 30 threads with a 37.5MB cache so they're doing something right because it screams and is capable of 85GB/s memory throughput.
I'm sure the chip architects are looking at this research but somehow I think they're already ahead of the curve because these kinds of cores/threads are jumps ahead of where we were just a few years ago. Anybody remember the first Pentium Dual Core and The UltraSparc T1?
Harrison's Postulate - "For every action there is an equal and opposite criticism"
A "new programming language" isn't a magical solution to make a non-parallel algorithm work well on a multi processor architecture.
The basic problem is this: even if just 5% of the work has to be serial, the maximum speedup is 20x. That is the theoretical maximum. YMMV, and it does. The Internet and search have opened up another vast area where a thread can do lots of work and send just a very small set of results back to the caller. Hits are so small compared to misses that you can make some headway. Even then we have found very few applications suitable for massively parallel solutions.
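The 20x figure is Amdahl's law. A minimal Python sketch of the arithmetic follows; the function name and the core counts in the loop are just illustrative choices.

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # 5% serial work, as in the comment above; core counts are arbitrary.
    for n in (2, 4, 36, 1024):
        print(n, round(amdahl_speedup(0.05, n), 2))
```

With 5% serial work, 36 cores give 36 / 2.75, or roughly 13x, and no number of cores gets past 1 / 0.05 = 20x.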
We need a big breakthrough. If you divide a 3D domain into a number of subdomains, the interfaces between the subdomains are 2D. The volume of the 3D domain represents the computational load, and the area of the interfaces represents the communication load. If we could come up with domain-division algorithms that guarantee the interfaces would be an order of magnitude smaller, even as we go from 3D to higher numbers of dimensions, and if we could organize these subdomains into hierarchies, we would be able to deploy more and more computational work and be confident the communication load would not overwhelm the algorithm. This breakthrough is yet to come. Delaunay tessellations (and their duals, Voronoi polygons) have been defined in higher dimensions. But the ratio of "cells" to "vertices" explodes in higher dimensions; last time we tried, we could not even fit a 10-dimensional mesh of 10 points into all the available memory of the machine. It did not look promising.
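A rough way to see why the interfaces get worse with dimension: take a hypercubic subdomain with `side` cells along each axis. That shape, and counting one face cell per boundary face, are simplifying assumptions for this sketch, not something taken from the comment.

```python
def comm_to_compute_ratio(side, dims):
    """Boundary-face cells divided by interior cells for a hypercubic subdomain.

    Assumes a subdomain with `side` cells along each of `dims` axes.
    Compute load is taken as the cell count, communication load as the number
    of cells on the 2 * dims boundary faces (corners counted once per face).
    """
    volume = side ** dims
    surface = 2 * dims * side ** (dims - 1)
    return surface / volume  # simplifies to 2 * dims / side


if __name__ == "__main__":
    for d in (2, 3, 6, 10):
        print(d, comm_to_compute_ratio(10, d))
```

For a fixed side of 10 cells the ratio is 0.6 in 3D but 2.0 in 10D, so under these assumptions the communication term overtakes the computation term unless subdomains grow much larger.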
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
CellBE was designed like this.
It had a shared loop where data was kicked around in both directions and each core picked out only that which was addressed to it.
And I don't even think this is an idea that was new to Cell either.
So the point here, which is not insignificant if they really have solved it, is cache coherency. Snooping other caches can become a massively expensive task in terms of round trip latency. Potentially 1000s of cycles. In low power architectures (which is certainly a consideration) moving data around is really expensive from a power perspective and is where a large portion of the power is actually spent in real world uses. IF they have solved it in a more efficient manner rather than the current brute-force approaches, then that's good research.
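For the shared loop described above, a tiny Python sketch of picking the shorter way round a bidirectional ring. The node count of 8 and the tie-break toward clockwise are assumptions for illustration and are not meant to match Cell's actual bus.

```python
def deliver_on_ring(num_nodes, src, dst):
    """Return (direction, hops) for the shorter way round a bidirectional ring.

    Hypothetical model: nodes are numbered 0..num_nodes-1 around the loop and
    a packet travels until it reaches the node it is addressed to.
    Ties go clockwise (an arbitrary choice).
    """
    clockwise = (dst - src) % num_nodes
    counterclockwise = (src - dst) % num_nodes
    if clockwise <= counterclockwise:
        return "clockwise", clockwise
    return "counterclockwise", counterclockwise


if __name__ == "__main__":
    print(deliver_on_ring(8, 0, 3))
    print(deliver_on_ring(8, 0, 6))
```

On a ring the worst-case distance grows linearly with the node count (num_nodes // 2 hops), which is one reason every hop on the way to another core's cache adds to the round trip latency mentioned above.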
No kidding... If that was all there was to it the guys at the CPU level would just do it for us.
The problem is previous results dependency. If you do not care about previous results then multi threaded programming is dead easy and a scheduling problem which is fairly well understood. It is when you need the previous results or external I/O that parallel fails.
I predict that at some point some bright spark of a CPU guy will come up with the idea of discarded results, as 'if' conditions tend to create pipeline stalls. You could go ahead and run both paths of code. Then decide which one is correct and discard the unused results. Maybe they already have... I have not followed CPU arch for a while now...
..the Transputer. Great idea, but a giant market fail.
Oh, I'm sorry sir, I thought you were referring to me, Mr. Wensleydale.
Li-Shiuan Peh, the Singapore Research Professor of Electrical Engineering and Computer Science at MIT
If he is the Singapore Research Professor of Electrical Engineering and Computer Science at MIT, who is the Research Professor of Electrical Engineering and Computer Science?
The question is - do you always need a parallel tasking software? Most tasks are bread&butter tasks, no need to chew them up. Put your energy into the few things that do need to be broken up.
But mostly it's a "hen and egg" problem - can't do multi-core software since there aren't enough serious multi-core machines, or the owners in software companies don't see a benefit in it.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
1V over 1um is a million volts per meter, assnozzle. Your incompetent and imbecilic commentary really drags down this place.
You could go ahead and run both paths of code. Then decide which one is correct and discard the unused results.
Intel is already doing partial speculative execution in the case of conditional branches. The pipeline is filled with the predicted path which is then frequently executed out of order (before the condition is known) ..
.. resources that would be completely wasted whenever there isn't a conditional branch in the pipeline...
Intel is not, however, doing the full concept you have described (eager speculative execution), and I don't think it's likely that they ever will. The best case for eager speculative execution would be when the branches are completely unpredictable, which is only very rarely true. Further, it requires significant over-provisioning of execution units to have enough to execute both paths of a conditional branch each at "best possible speed".
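A back-of-the-envelope Python sketch of that trade-off, under a deliberately crude cost model: a mispredicted branch wastes a fixed flush penalty, while eager execution always throws away the work done on the path not taken. The model, the function names, and the numbers (20-cycle flush, 10 cycles of discarded work) are all assumptions for illustration, not measurements of any real CPU.

```python
def predicted_waste(accuracy, flush_penalty):
    """Expected wasted cycles per branch when following a predicted path."""
    return (1.0 - accuracy) * flush_penalty


def eager_waste(discarded_path_work):
    """Wasted cycles per branch when both paths run and one is thrown away."""
    return float(discarded_path_work)


def break_even_accuracy(flush_penalty, discarded_path_work):
    """Prediction accuracy below which eager execution wastes less (toy model)."""
    return 1.0 - discarded_path_work / flush_penalty


if __name__ == "__main__":
    # Assumed numbers: 20-cycle flush, 10 cycles of discarded work per branch.
    print(predicted_waste(0.95, 20))    # a well-predicted branch
    print(eager_waste(10))              # eager always pays this
    print(break_even_accuracy(20, 10))  # accuracy where the two are equal
```

With those assumed numbers the break-even accuracy is 0.5, i.e. eager execution only pays off when the predictor is doing no better than a coin flip, which is the "completely unpredictable" case described above.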
"His name was James Damore."
If it was up to me, you could have ours. The refrigerators in the parking garage are really annoying.
I figure you must be some type of alcoholic or have another substance problem. Can you confirm?
According to the comparison table (see 4:21 in this video), this chip uses 1.1V while other standard chips use 1.0V. This difference may make it hard for chip makers to use this technology.
No, it's the only way to make it faster because it goes to eleven...
Browsing at +1 - no ACs, I ignore their posts. So refreshing!
Been there, done that:
http://en.wikipedia.org/wiki/Transputer
http://en.wikipedia.org/wiki/Network_on_a_chip
And still, the modern interconnects from the likes of ARM (CCN-508) are, in effect, the same thing.
And then there's this:
http://www.xmos.com/
IBM even does this with their MCM's for their high end servers & Mainframes.
Serializing things to send over to another core also costs time/transistors.
What's really needed is a novel approach in how to exploit all of this processing power and (oh by the way, as the man in the corner says) get a better SW architecture in place that can take advantage of all of this. Things today are just soooo inefficient.
Best of luck!!!
Banging my head on the table right now.
Why do people with zero actual semiconductor knowledge try to speak as an authority*?!
It's a research chip, meaning they don't need to be on the latest process node to show their proof of concept. Larger nodes (much cheaper to design a chip on) have thicker gate passivation layers and run at higher voltage. From an architecture standpoint the process node/voltage are irrelevant. So if their architecture proves out, some bigger outfit can run with it while targeting the latest-greatest itty-bitty process node to increase the clock-rate, drop the power, and reduce the area/price.
*I am not a processor designer, just a mixed signal (mostly analog) guy, but I've been working in the semiconductor industry, including doing process bake-offs for over a dozen years.
Why do people with zero actual semiconductor knowledge try to speak as an authority*?
Is this your first day on Slashdot?
There are hundreds of processors with 64 cores or more, each of them claiming to have solved the scalability problem.
The toasty end of boring desktop CPUs is somewhere north of 200watts already
Well... somewhere south of 100W, anyway, and even high end workstation/server chips are under 150W.
You people seem to forget we're dealing with chips that have features counted in individual atoms. 1V across three atoms may work, 1.1V across three atoms arcs over.
Luckily we're still dealing with features hundreds of atoms across, and not just three...
Oh look, it's Mr Heat Controls Transistors, who still hasn't provided a single source for his heat theory.
But anyways
https://www.youtube.com/watch?...
Three atoms, dickweed.
I figure you must be some type of alcoholic or have another substance problem. Can you confirm?
Yeah, I'm allergic to stupidity. I have to take a pill before I can come anywhere near slashdot, and keep an inhaler and epi pen on hand.
"You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
Because Windows programs have a habit of taking over a processor; acting like I am still using DOS.
227-3517
This is a nice little trick. This has the potential to extend shared consistent memory multiprocessor designs to far larger numbers of processors. Whether this is a performance win remains to be seen. Good idea, though. Note that the prototype chip is just a feasibility test; they used an off the shelf Power CPU design, added their interconnect network, and sent the job off to a fab. A production chip would have optimizations this does not.
We know of only two general purpose multiprocessor architectures that are broadly useful - shared consistent memory multiprocessors, and clusters of machines with no shared memory. Dozens of other schemes have been tried - SIMD machines (the Connection Machine), non-shared memory with DMA to a bigger memory (the Cell), message passing to adjacent machines in N dimensions (Hypercube), message passing over an on-chip network (several examples), cross-CPU DMA access (Infiniband) and shared memory without cache consistency (Intel experimental). In all cases, the hardware worked, the programming was a problem from hell, and the concept was dropped. The Cell in the PS3 is the only high-volume product with an exotic multiprocessor architecture, and that was such a pain that the PS4 dropped it for a more conventional architecture.
https://en.wikipedia.org/wiki/...
I don't see what the big deal is. I'm currently working with early silicon on a cache coherent 48-core 64-bit MIPS chip with NUMA support and built-in 40Gbps Ethernet support. The chip also has a lot of extended instructions for encryption and hashing plus a lot of hardware engines for things like zip compression, RAID calculations, regular expression engines and networking support among other things. It also has built-in support for content addressable memory.
It also has a network on-chip where each core or group of cores can have its own network interface to other cores. This is useful for things like virtualization or when you want to run multiple Linux kernels and other applications side by side since we also support running binaries on bare metal without an OS underneath.
http://cavium.com/OCTEON-III_C...
This post is encrypted twice with ROT-13. Documenting or attempting to crack this encryption is illegal.
Maybe Scala can be your language. It supports creating your code out of mostly immutable objects, which makes it good for parallelism.
Democracy Now! - your daily, uncensored, corporate-free
So seriously, I know it gets bitchy in the comments sometimes, but today I scrolled down to what appears to be a bunch of kids bickering.
Get a grip and try and have an intelligent discussion!
P4 Northwood processor load sink: 89W (source: my own 2.66GHz single core and board-bundle monitoring), the design spec TDP for a Northwood is 67W@2.2GHz-103W@3GHz (source: Intel). The Extreme Edition processors are pretty much unlocked and will suck in whatever power's available, the Gallatin cores are all >100W TDP. Late Prescotts are all 115W. Pentium Ds are all shy of 150W (source: Wikipedia). The server lines (eg Itanium, Xeon, Pentium Pro) are all comparable in terms of TDP to the midrange desktop processors.
Current 22nm Intel CPU cores run perfectly fine with a core voltage of 1.26V.
Current CPU's can run perfectly fine on 1.1V.
A common core voltage is currently 0.6V - 1.35V, depending on clock.
Voltage is a function of process technology, not system architecture.
Yeah, like the massive stupidity of not grasping that when you have IC features three atoms thick, that 0.1V more spells the difference between "working" and "arced over".
You must have quite a drug setup, it can't be easy being allergic to your own stupidity.
At 34:40 minutes it starts.
https://www.youtube.com/watch?...
Three. Atoms.
Shitguzzler.
Add an extra atom. Pretty simple. No reason why it has to be 3 atoms thick.
In fact I hear that a few years ago the smallest features were hundreds of atoms! Who knows how they managed to deal with this tricky issue of higher voltages.
Intel abandoned the Netburst architecture in the mid-2000s. All that hardware is a decade old, except for the Pentium Pro, which is almost two decades old.
Post Netburst, AMD is the one having TDP issues, and their current enthusiast-gamer-nutjob CPU is specced at 220 watts. Intel has their numbers down from the Prescott Pentium D days, though the use of 'TDP' rather than peak, and thermal throttling that actually works, makes it a little tricky to pin a precise ceiling value on some of them without actually getting out the test equipment.
Most are, of course, much lower, given the popularity of laptops and desktops that don't need water cooling, and so on.
My intended point, which I should have clarified better, is that 150-200watt CPUs, while the market generally doesn't like them, can, are, and have been, sold for use by relatively unskilled users running cheaply mass constructed computers under minimally controlled 'room temperature' conditions, so it is only reasonable to assume that, were a part with a moderately alarming power draw to have some virtue for server use that compensated for that, it could be made to work with relatively little fuss. It'd probably be really noisy once they got it down to 1-2U, and the hot aisle would be even less pleasant than usual; but if people wanted them no major engineering problems would have to be overcome to deliver.
From TFA:
After testing the prototype chips to ensure that they’re operational, Daya intends to load them with a version of the Linux operating system, modified to run on 36 cores, and evaluate the performance of real applications, to determine the accuracy of the group’s theoretical projections. At that point, she plans to release the blueprints for the chip, written in the hardware description language Verilog, as open-source code.
It's a Verilog RTL core, nothing about an RTL core dictates supply voltages. You can tailor the synthesis to target any supply voltage or operating frequency you want.
I can't believe this wasn't mentioned in the summary, as it's probably the most significant aspect that she intends to release it as open-source. There are other NoC processors like Tilera, who recently released a 72 core chip. What is innovative here is that it uses scoreboarding and knowledge of the chip topology to know what messages could arrive before they arrive to deal with message reordering and that the design will be open source, neither of which was mentioned in the headline or summary.
and their current enthusiast-gamer-nutjob CPU is specced at 220 watts.
I'll admit, the AMD FX was the only line I didn't check before posting. Their next closest chips are only 140W, and they've only got a couple at that. Most are 115W or lower. I didn't even know the AM3 socket was capable of 220W.
Based on the mixed reviews, it sounds like 220W is really pushing your luck unless the motherboard has some heroically overqualified VRM onboard, and your PSU is descended from an arc welder on its mother's side; but I've yet to see a single report of somebody actually fusing a pin rather than just crashing a lot, so apparently the socket is tougher than it looks. I was very surprised to see such a part being sold at that power level, though, rather than just 'unlocked, and we'll just look the other way'.
MIT is expert at making these sorts of PR stunts, where they claim they invented something novel when they replicate some boring old result from 10 years ago. Well, here it is from 30 years ago.