Multicore Chips As 'Mini-Internets'
An anonymous reader writes "Today, a typical chip might have six or eight cores, all communicating with each other over a single bundle of wires, called a bus. With a bus, only one pair of cores can talk at a time, which would be a serious limitation in chips with hundreds or even thousands of cores. Researchers at MIT say cores should instead communicate the same way computers hooked to the Internet do: by bundling the information they transmit into 'packets.' Each core would have its own router, which could send a packet down any of several paths, depending on the condition of the network as a whole."
This technology that networks different cores can also serve another purpose, to prevent damage from core failure, and diagnose such failures. If the cores are connected to other cores, the same data can be processed by bypassing a damaged core, making over heating or manufacturing problems important, but almost treatable. Who knows, cores might even get replaceable.
I guess MIT has forgotten about the Transputer....
I started reading an immediately had flashbacks to the Transputer
Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
Ah, you're clever; but it's internets all the way down.
Errr... the internal "bus" between cores on modern x86 chips already is either a ring of point to point links or a star with a massive crossbar at the center.
The researchers can't be this far removed from the state of the art, so I am hoping that it is just a really badly written article. I hope they are comparing their newer research chips with their own previous generation of research chips. Intel and AMD aren't handing out their current chip designs to the universities, so many things have to be re-invented.
AMD uses HT and Intel has its ring bus, both of which use point-to-point links. Buses have serious trouble with the impedance jumps at the taps and clock skew between the lines, that's why nobody is using them in high speed applications any more. Even the venerable SCSI and ATA buses went the way of the dodo. The only bus I can see in my system is DDR3 (and I think that will go away with DDR4 due do the same problems.)
thegodmovie.com - watch it
Yeah, great idea. Take the very fastest communication that we have on the entire planet, and replace it with the absolute slowest communication we have on the planet. Great idea. And with it, more complexity, more caches, more lookup tables, and more things to go wrong.
The best part is that it's totally unbalanced. Internet protocols are based on a network that's ever-changing and totally unreliable. The bus, on the other hand, is best on total reliability and static.
I'd have thought that a pool concept, or a mailbox metaphor, or a message board analog would have been more appropriate. Something where streams are naturally quantized and sending is unpaired from receiving. Where a recipient can operate at it's own rate uncommon to the sender.
You know, like typical linux interactive sockets, for example. But what do I know.
As mentioned in other comments, this has been done before. The method of message passing isn't as fundamental as one key point - that it is all explicit message passing.
Intel and AMD x86/x64 CPUs use coherent cache between cores to make sure that a thread running on CPU 1 sees the same RAM as a thread running on CPU 3. This leads to horrible bottlenecks and huge amounts of die tied up in trying to coordinate the writes, maintain coherency between N cores (N-1 ^2 connections!), and it all just goes to hell pretty fast. Intel has this super new transactional memory rollback thing, but it's turd polishing.
The next step is pretty obvious (see Barrelfish) and easy: no shared coherency. Everything is done with message passing. If two threads or processes (it doesn't really matter at that point) want to communicate they need to do it with messages. It's much cleaner than dealing with shared memory synchronization, and makes program flow much more obvious (to me at least - I use message queues even on x86/x64). If you need to share BIG MEMORY between threads, which is reasonable for something like image processing, you at least use messages to explicitly coordinate access to shared memory and the cores don't have to worry about coherency.
This scales extremely well for at least a couple thousand CPUs, which is where the 'local internet' becomes useful.
Where it becomes not easy is that almost all programs written for x86/x64 assume threads can share memory at will. They'd need to be rewritten for this model or would suddenly run a whole lot slower since you'd have to lock them to one core or somehow do the coordination behind their back. It'd be worth it for me!
For low-level ccNUMA, you'd want three things:
If you were really clever, the MTU would become a CPU with a very limited instruction set (since there's no point re-inventing the rest of the architecture and external caching for CPUs is better developed than external caching for MTUs). In fact, you could slowly replace a lot of the chips in the system with highly specialized CPUs that could communicate with each other via a tunneled CPU network protocol.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I admit that despite being a technical user, I was not aware that only 2 chips are allowed to "talk" at a given time. I had (erroneously, it would seem) assumed that in order for a 3+-core chip to be fully useful, such a switch/router would have to already be in place.
For [most] current designs, Intel/AMD have multilevel cache memory. The cores run independently and fully in parallel and if they need to communicate they do so via shared memory. Thus, they all run full bore, flat out, and don't need to wait for each other [there are some exceptions--read on]. They have cache snoop logic that keeps them up-to-date. In other words, all cores have access to the entire DRAM space through the cache hierarchy. When the system is booted, the DRAM is divided up (so each core gets its 1/N share of it).
Let's say you have an 8 core chip. Normally, each program gets its own core [sort of]. Your email gets a core, your browser gets a core, your editor gets one, etc. and none of them wait for another [unless they do filesystem operations, etc.] Disjoint programs don't need to communicate much usually [and not at the level we're talking about here].
But, if you have a program designed for heavy computation (e.g. video compression or transcoding), it might be designed to use multiple cores to get its work done faster. It will consist of multiple sections (e.g. processes/threads). If a process/thread so designates, it can share portions of its memory space with other processes/threads. Each thread takes input data from a memory pool somewhere, does some work on it, and deposits the results in a memory output pool. It then alerts the next thread in the processing "pipeline" as to which memory buffer it placed the result. The next thread does much the same. x86 architectures have some locking primitives to assist this. It's a bit more complex than that, but you don't need a "router". If the multicore application is designed correctly, any delays for sync between pipeline stages occur infrequently and are on the order of a few CPU cycles.
This works fine up to about 16-32 cores. Beyond that, even the cache becomes a bottleneck. Or, consider a system were you have a 16 core chip (all on the same silicon substrate). The cache works fine there. But now suppose you want to have a motherboard that has 100 of these chips on it. That's right--16 cores/chip X 100 chips for a total of 160 cores. Now, you need some form of interchip communication.
x86 systems already have this in the form of Hypertransport (AMD) or the PCI Express Bus (Intel) [there are others as well]. PCIe isn't a bus in the classic sense at all. It functions like an onboard store-and-forward point-to-point routing system with guaranteed packet delivery. This is how a SATA host adapter communicates with DRAM (via a PCIe link). Likewise for your video controller. Most current systems don't need to use PCIe beyond this (e.g. to hook up multiple CPU chips) because most desktop/laptop systems have only one chip (with X cores in it). But, in the 100 chip example, you would need something like this and HT and PCIe already do something similar. Intel/AMD are already working on any enhancements to HT/PCIe as needed. Actually, Intel [unwilling to just use HT], is pushing "Quick Path Interconnect" or QPI.
Like a good neighbor, fsck is there
The researchers can't be this far removed from the state of the art
They aren't. The way this works is a conversation something like this:
MIT PR: We want to write about your research, what do you do?
Researcher: We're looking at highly scalable interconnects for future manycore systems.
MIT PR: Interconnects? Like wires?
Researcher: No, the way in which the cores on a chip communicate.
MIT PR: So how does that work?
Researcher: {long explanation}
MIT PR: {blank expression}
Researcher: You know how the Internet works? With packet switching?
MIT PR: I guess...
Researcher: Well, kind-of like that.
MIT PR: Our researchers are putting the Internet in a CPU!!1!111eleventyone
I am TheRaven on Soylent News