Researchers Unveil Experimental 36-Core Chip
rtoz writes The more cores — or processing units — a computer chip has, the bigger the problem of communication between cores becomes. For years, Li-Shiuan Peh, the Singapore Research Professor of Electrical Engineering and Computer Science at MIT, has argued that the massively multicore chips of the future will need to resemble little Internets, where each core has an associated router, and data travels between cores in packets of fixed size. This week, at the International Symposium on Computer Architecture, Peh's group unveiled a 36-core chip that features just such a "network-on-chip." In addition to implementing many of the group's earlier ideas, it also solves one of the problems that has bedeviled previous attempts to design networks-on-chip: maintaining cache coherence, or ensuring that cores' locally stored copies of globally accessible data remain up to date.
All this performance in just one chip. I mean, sure, it has 36 cores, but let's be rational here... does it seriously expect to run Crysis?
Good people go to bed earlier.
That doesn't matter. The power supply surrounding the socket/pads will account for whatever Vcc needs to be.
More data, damnit!
As an aside: It's been a while since we've seen any decent rise in processor GHz.
Just to abuse a car analogy: Maybe it's time we stop revving up and instead shift gears.
We used to have a Bill of Rights. Now, with the rights gone, all we have left is the bill.
So what's special about this chip that Intel's Xeon Phi (first demonstrated in 2007 as Knights Landing with 80 or so cores) isn't already doing? Or is this just a rehash of 7-year-old technology that's already in production? It sounds like a copy/paste of Intel's research.
"Intel's research chip has 80 cores, or "tiles," Rattner said. Each tile has a computing element and a router, allowing it to crunch data individually and transport that data to neighboring tiles." - Feb 11, 2007
Whilst I have my foot to the floor ... I still think it's a failure of science - there's nothing wrong with doing both simultaneously - to believe otherwise would be to buy into a rhetorical device based on "false opposites."
The purpose of existence is to make money.
A higher high/low voltage swing (with a reasonable amount of other stuff being equal) will be more of a thermal nuisance; but if the perks make up for it, that's hardly a dealbreaker. The toasty end of boring desktop CPUs is somewhere north of 200 watts already, with a little shoving that they typically survive, so if somebody really wants 36 cache-coherent cores on-die, they'll suck it up and make it work.
For applications that don't specifically demand that, I'd be interested to know how the costs and benefits of 'dealing with the cooling demands of a smaller number of denser parts' compare with 'dealing with the cooling demands of more, cooler, parts, closer to whatever the performance per watt sweet spot is; but with more cabling, PSUs, switches, and similar interconnect and support stuff to buy and power'...
pointer arithmetic, cache invalidation, and off-by-one errors
"I'd just like to emphasise that taking a million years isn't a metaphor here..." -Rich Bradshaw
Cache coherency has been one of the banes of multicore architecture for years. It's nice to see a different approach, but chip manufacturers are already getting high-performance results without introducing additional complexity. The Oracle (Sun) SPARC T5 architecture has 16 cores with 128 threads running at 3.6 GHz. It gives a few more years to Solaris at least, but it's still a hell of a processor. For you Intel fans, the E7-2790 v2 sports 15 cores with 30 threads and a 37.5 MB cache, so they're doing something right, because it screams and is capable of 85 GB/s memory throughput.
I'm sure the chip architects are looking at this research, but somehow I think they're already ahead of the curve, because these kinds of core and thread counts are jumps ahead of where we were just a few years ago. Anybody remember the first Pentium Dual Core and the UltraSPARC T1?
Harrison's Postulate - "For every action there is an equal and opposite criticism"
Immense? Immense, you say? Try IBM's mega-footprint z196: at over 512 mm^2, that is one big-ass chip.
Harrison's Postulate - "For every action there is an equal and opposite criticism"
The basic problem is this: even if just 5% of the work has to be serial, the maximum speedup is 20x, and that is the theoretical maximum (Amdahl's law). YMMV, and it does. Internet and search have opened up another vast area where a thread can do lots of work and send just a very small set of results back to the caller. Hits are so small compared to misses that you can make some headway. Even then, we have found very few applications suitable for massively parallel solutions.
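The 20x figure follows from Amdahl's law, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of cores. A minimal sketch in Python; the function names are illustrative, and the model assumes the parallel part scales perfectly with no communication cost:

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Speedup on n_cores when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the core count grows without limit."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # With 5% serial work the ceiling is 1 / 0.05 = 20x, however many cores you add.
    print("limit:", amdahl_limit(0.05))
    for cores in (4, 36, 1024):
        print(cores, "cores ->", round(amdahl_speedup(0.05, cores), 2))
```

Under these assumptions, 36 cores with 5% serial work land well short of the 20x ceiling, and real communication overhead only makes it worse.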
We need a big breakthrough. If you divide a 3D domain into a number of subdomains, the interfaces between the subdomains are 2D. The volume of each 3D subdomain represents the computational load, and the area of its interfaces represents the communication load. If we could come up with domain-division algorithms that guarantee the interfaces would be an order of magnitude smaller, even as we go from 3D to a higher number of dimensions, and if we could organize these subdomains into hierarchies, we would be able to deploy more and more of the computational work and be confident the communication load would not overwhelm the algorithm. This breakthrough is yet to come. Delaunay tessellations (and their duals, Voronoi diagrams) have been defined in higher dimensions. But the ratio of the number of "cells" to the number of "vertices" explodes in higher dimensions; last time we tried, we could not even fit a 10-dimensional mesh of 10 points into all the available memory of the machine. It did not look promising.
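The volume-versus-interface argument can be put into numbers with a deliberately simple model. The sketch below assumes each subdomain is a d-dimensional hypercube of equal side length on a regular grid, which is not what a Delaunay mesh looks like; the function name and the cell counts are made up for illustration:

```python
def comm_to_compute_ratio(cells_per_subdomain: float, dims: int) -> float:
    """Ratio of interface cells (communication) to interior cells (computation)
    for a hypercubic subdomain holding cells_per_subdomain cells in dims dimensions."""
    if dims < 1 or cells_per_subdomain <= 0:
        raise ValueError("need dims >= 1 and a positive cell count")
    side = cells_per_subdomain ** (1.0 / dims)   # cells along one edge
    volume = side ** dims                        # computational load
    surface = 2 * dims * side ** (dims - 1)      # 2*d faces, each side^(d-1) cells
    return surface / volume                      # simplifies to 2*d / side


if __name__ == "__main__":
    # Same amount of work per subdomain, increasing dimensionality.
    for d in (2, 3, 6, 10):
        print(d, "dimensions ->", comm_to_compute_ratio(1e6, d))
```

For a fixed amount of work per subdomain, the ratio 2d / side grows with the dimension because the edge length shrinks, which is the scaling problem the comment describes.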
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
The basic idea isn't new. What the paper is really claiming is new is their particular cache coherence scheme, which (to quote from the Conclusion) "supports global ordering of requests on a mesh network by decoupling the message delivery from the ordering", making it "able to address key coherence scalability concerns".
How novel and useful that is I don't know, because it's really a more specialist contribution than the headline claims, to be evaluated by people who are experts in multicore cache coherence schemes.
10 PRINT CHR$(205.5+RND(1)); : GOTO 10
A better analogy is that they keep adding seats and making the whole vehicle slower.
Kawasaki Ninja == 10 GHz single core (fastest way to get anywhere alone)
Ford Mustang == 4GHz quad-core (most people only use the front two seats, but if desperate you can squeeze more people in)
Chevy Suburban == 3.3 GHz 8-core (it seems like everyone wants one, but most people who have a full load just have a bunch of little kiddies)
Mercedes Sprinter == 2.7 GHz 12-core (just meant to be a grinding people hauler)
School Bus == 1.2GHz Xeon Phi (slow as hell and very specialized, no normal person would ever want one)
Double Decker Bus == Peh's stuff (probably a use for mass transit (i.e. virtualization) and as a cool novelty)
And hopefully in any lectures on Moore's Law, the students learn that Moore's Law refers to transistors on a die, not the speed of the chips. This 36-core chip probably jumps ahead of Moore's Law a bit, as it's got to be a fairly large die.
Moore's Law refers to the number of components per integrated circuit for minimum cost. Note that this is basically transistor density and is not impacted by core size. Silicon defects and transistor size determine the optimal number of components per IC.
A quote from Wikipedia,
Moore himself wrote only about the density of components, "a component being a transistor, resistor, diode or capacitor,"[26] at minimum cost.
Nope, liquid nitrogen cooling gets you past the speed limits. How about over 8 GHz on a chip that costs less than $200? Go to helium and you can get over 8.5 GHz, although both become a bit unwieldy when it comes to game play, because I don't want my hard drives to freeze. I love that last video; there's some real country-boy engineering there, including using a propane torch and a hair dryer to keep certain components from freezing.
I'm a little confused as to why you're citing the chip's low low price of "less than $200" if you need liquid nitrogen to get it to perform the way you want it to. You do realize that cooling systems cost money, too... right? There's no point in being able to use a cheap processor to get to X performance benchmark if the required additional support systems cost thousands of dollars more than a more powerful and more expensive processor that can do it out of the box. Not to mention the fact that liquid nitrogen cooling isn't exactly hassle-free, especially in a household environment. And it's worth noting that even if you boost GHz, you eventually run into choke points related to pushing data to and from the chip anyway. You can give the most important worker on an assembly line all the crystal meth they can eat, but they can't work any faster than the conveyor belt in front of them.
For your security, this post has been encrypted with ROT-13, twice.
You could go ahead and run both paths of code. Then decide which one is correct and discard the unused results.
Intel is already doing partial speculative execution in the case of conditional branches. The pipeline is filled with the predicted path which is then frequently executed out of order (before the condition is known) ..
.. resources that would be completely wasted whenever there isn't a conditional branch in the pipeline...
Intel is not, however, doing the full concept you have described (eager speculative execution), and I don't think it's likely that they ever will. The best case for eager speculative execution would be when the branches are completely unpredictable, which is only very rarely true. Further, it requires significant over-provisioning of execution units to have enough to execute both paths of a conditional branch, each at "best possible speed".
"His name was James Damore."
Nitrogen overclocking is done for contests. You can get phase change cooling, which is the next best thing and will still get your processor far below zero. The big downside to that is just power consumption. It's also bulky and noisy.
Uh, she.
My point exactly. What is a simple task on a modern Intel becomes nearly impossible on the GA144. We've already tried the idea of combining large numbers of simple processors, and it has failed every single time. If NxM simple cores together can't beat a modern Intel processor for a range of useful tasks, there's not much point in developing it.
Banging my head on the table right now.
Why do people with zero actual semiconductor knowledge try to speak as an authority*?!
It's a research chip, meaning they don't need to be on the latest process node to show their proof of concept. Larger nodes (much cheaper to design a chip on) have thicker gate passivation layers and run at higher voltage. From an architecture standpoint, the process node and voltage are irrelevant. So if their architecture proves out, some bigger outfit can run with it while targeting the latest-greatest itty-bitty process node to increase the clock rate, drop the power, and reduce the area/price.
*I am not a processor designer, just a mixed signal (mostly analog) guy, but I've been working in the semiconductor industry, including doing process bake-offs for over a dozen years.
The core count isn't the interesting thing about this chip. The cores themselves are pretty boring off-the-shelf parts too. I was at the ISCA presentation about this last week and it's actually pretty interesting. I'd recommend reading the paper (linked to from the press release) rather than the press release, because the press release is up to MIT's press department's usual standards (i.e. completely content-free and focussing on totally the wrong thing). The cool stuff is in the interconnect, which uses the bounded latency of the longest path multiplied by single-cycle one-hop delivery times to define an ordering, allowing you to implement a sequentially consistent view of memory relatively cheaply.
Since I'm here, I'll also throw out a plug for the work we presented at ISCA, "The CHERI capability model: Revisiting RISC in an age of risk". We've now open sourced (as a code dump, public VCS coming soon) our (64-bit) MIPS softcore, which is the basis for the experimentation in CHERI. It boots FreeBSD and there are a few sitting around the place that we can ssh into and run. This is pretty nice for experimentation, because it takes about 2 hours to produce and boot a new revision of the CPU.
I am TheRaven on Soylent News
You can give the most important worker on an assembly line all the crystal meth they can eat, but they can't work any faster than the conveyor belt in front of them.
Ah! The 21st Century version of the 'mythical man month' - so much more apropos for this audience than the pregnancy analogy.
Faster! Faster! Faster would be better!
Some knowledge about multicore cache coherence here. You are completely right, Slashdot's summary does not introduce any novel idea. In fact, a cache-coherent mesh-based multicore system with one router associated with each core was brought to market years ago by a startup from MIT, Tilera. Also, the article claims that today's cores are connected by a single shared bus -- that's far outdated, since most processors today employ some form of switched communication (an arbitrated ring, a single crossbar, a mesh of routers, etc).
What the actual ISCA paper presents is a novel mechanism to guarantee total ordering on a distributed network. Essentially, when your network is distributed (i.e., not a single shared bus, which covers most current on-chip networks) there are several problems with guaranteeing ordering: i) it is really hard to provide a global ordering of messages (like a bus) without making all messages cross a single centralized point, which becomes a bottleneck, and ii) if you employ adaptive routing, it is impossible to provide point-to-point ordering of messages.
Coherence messages are divided into different classes in order to prevent deadlock. Depending on the coherence protocol implementation, messages of certain classes need to be delivered in order between the same pair of endpoints, and for this, some of the virtual networks can require static routing (e.g. Dimension-Ordered Routing in a mesh). Note that a "virtual network" is a subset of the network resources which is used by the different classes of coherence messages to prevent deadlock. This is a remedy for the second problem. However, a network that provided global ordering would allow for potentially huge simplifications of the coherence mechanisms, since many races would disappear (the devil is in the details), and a snoopy mechanism would be possible -- as they implement. Additionally, this might also impact the consistency model. In fact, their model implements sequential consistency, which is the most restrictive -- yet simple to reason about -- consistency model.
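For readers unfamiliar with Dimension-Ordered Routing, here is a minimal sketch of the XY variant on a 2D mesh. It is a generic illustration, not the routing logic from the ISCA paper: the 6x6 layout, the row-major core numbering, and the function names are all assumptions made for the example.

```python
MESH_WIDTH = 6  # assumed: 36 cores arranged as a 6x6 mesh


def core_to_xy(core_id: int, width: int = MESH_WIDTH) -> tuple:
    """Map a row-major core ID (an assumed numbering) to (x, y) mesh coordinates."""
    return core_id % width, core_id // width


def xy_route(src: int, dst: int, width: int = MESH_WIDTH) -> list:
    """Return the routers visited from src to dst: travel along X first, then Y.

    The path depends only on (src, dst), so two messages between the same pair
    of endpoints follow the same routers and cannot overtake each other by
    taking different paths.
    """
    x, y = core_to_xy(src, width)
    dst_x, dst_y = core_to_xy(dst, width)
    path = [(x, y)]
    while x != dst_x:                  # finish the X dimension completely first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                  # only then move in the Y dimension
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route(0, 35))  # corner to opposite corner of the assumed 6x6 mesh
```

Because the route is fixed, this gives point-to-point ordering per virtual network, but it does nothing for global ordering across all cores, which is the harder problem described above.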
Disclaimer: I am not affiliated with their research group, and in fact, I have not read the paper in detail.
Yes, we've also released the generated Verilog for anyone who wants to use just that. If you're a university, you can easily get a free license for Bluespec. If you're not, then you either most likely don't have the resources to get a decent FPGA (the ones that can run a processor at a useable speed start at about $3K), or you can probably afford the license. We're also talking to Bluespec about open sourcing their compiler, as most of their real value is from other services on top of it, but that's likely to take some time.
We're evaluating CHISEL, which is promising, but currently there's nothing else in the open source world that comes even vaguely close to Bluespec in terms of productivity for hardware designers, and CHISEL was not available when we started.
I am TheRaven on Soylent News