Slashdot Mirror


Multicore Chips As 'Mini-Internets'

An anonymous reader writes "Today, a typical chip might have six or eight cores, all communicating with each other over a single bundle of wires, called a bus. With a bus, only one pair of cores can talk at a time, which would be a serious limitation in chips with hundreds or even thousands of cores. Researchers at MIT say cores should instead communicate the same way computers hooked to the Internet do: by bundling the information they transmit into 'packets.' Each core would have its own router, which could send a packet down any of several paths, depending on the condition of the network as a whole."
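The per-core router idea in the summary can be sketched in a few lines. This is only an illustration, not the MIT design: it assumes a 2D mesh of cores addressed by (x, y) coordinates, a hypothetical `load` table giving each router's current queue depth, and a simple rule that picks the least-loaded neighbour among those that move the packet closer to its destination.

```python
def route(src, dst, load):
    """Return the list of mesh coordinates a packet visits from src to dst.

    At every hop the router considers only neighbours that move the packet
    closer to dst (minimal routing) and picks the one with the lowest load.
    """
    x, y = src
    path = [src]
    while (x, y) != dst:
        candidates = []
        if x != dst[0]:
            candidates.append((x + (1 if dst[0] > x else -1), y))
        if y != dst[1]:
            candidates.append((x, y + (1 if dst[1] > y else -1)))
        # load maps a router coordinate to its queue depth (0 if absent)
        x, y = min(candidates, key=lambda c: load.get(c, 0))
        path.append((x, y))
    return path


congestion = {(1, 0): 5}  # hypothetical: router (1, 0) is busy
path = route((0, 0), (2, 2), congestion)
print(path)
```

With the congested router at (1, 0), the packet leaves (0, 0) through (0, 1) instead, and still reaches (2, 2) in the minimum four hops. On a shared bus there is no such choice: every transfer waits for the same wires.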

20 of 132 comments

  1. A fault-tolerant chip? by Anonymous Coward · · Score: 5, Interesting

    This technology that networks different cores can also serve another purpose: preventing damage from core failure and diagnosing such failures. If the cores are connected to other cores, the same data can be processed by bypassing a damaged core, making overheating or manufacturing problems important, but almost treatable. Who knows, cores might even become replaceable.

    1. Re:A fault-tolerant chip? by Osgeld · · Score: 4, Interesting

      pretty good, few years ago I ran for months on a dual core with one blown out, worked fine until I fired up something that used both, then it would die.

    2. Re:A fault-tolerant chip? by AdamHaun · · Score: 4, Interesting

      This sort of technology already exists to an extent. TI's Hercules TMS570 microcontrollers have two CPUs that run in lockstep along with a bus comparison module. I think total fail-tolerance might take three CPUs, but this provides strong hardware fault detection in addition to the usual ECC and other monitoring/correction stuff.

      Note that run-time fault tolerance is mostly needed for safety-critical systems. The customers who buy these products do not do so to get better yield, they do so to guarantee that their airbags, anti-lock brakes, or medical devices won't kill anyone. As such, manufacturing quality is very high. Also, die size is significantly larger than comparable general market (non-safety) devices. This means they cost a small fortune. The PC equivalent would be MLC vs. SLC SSDs. Consumer products usually don't waste money on that kind of reliability unless they need it. Now a super-expensive server CPU, maybe...

      [Disclaimer: I am a TI employee, but this is not an official advertisement for TI. Do not use any product in safety-critical systems without contacting the manufacturer, or at least a good lawyer. I am not responsible for damage to humans, machinery, or small woodland creatures that may result from improper use of TI products.]

    3. Re:A fault-tolerant chip? by Electricity+Likes+Me · · Score: 4, Informative

      Also this is exactly what chip makers already do to a great extent: the binning of CPUs by speeds is not a targeted process. You make a bunch of chips, test them, and then sell them as whatever clock speed they are robustly stable at.

    4. Re:A fault-tolerant chip? by Joce640k · · Score: 5, Interesting

      Also this is exactly what chip makers already do to a great extent: the binning of CPUs by speeds is not a targeted process. You make a bunch of chips, test them, and then sell them as whatever clock speed they are robustly stable at.

      Nope. The markings on a chip do NOT necessarily indicate what the chip is capable of.

      Chips are sorted by ability, yes, but many are deliberately downgraded to fill incoming orders for less powerful chips. Bits of them are disabled/underclocked, even though they passed all stability tests, simply because that's what the day's incoming orders were for.

      --
      No sig today...
    5. Re:A fault-tolerant chip? by Joce640k · · Score: 3, Interesting

      This sort of technology already exists to an extent. TI's Hercules TMS570 microcontrollers have two CPUs that run in lockstep along with a bus comparison module. I think total fail-tolerance might take three CPUs....

      This is just to detect when an individual CPU has failed. To build a fault-tolerant system you need multiple CPUs.

      nb. The 'three CPUs' thing isn't done for detection of hardware faults; it's for software faults. The idea is to get three different programmers to write three different programs with a specified output. You then compare the outputs of the programs, and if one is different it's likely to be a bug.
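The compare-the-outputs step described above is essentially a majority vote. A minimal sketch, assuming three hypothetical implementations (`impl_a`, `impl_b`, `impl_c`) of the same spec and outputs that can be compared for equality:

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_dissenters) for a list of outputs.

    Raises ValueError when no value is produced by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Three hypothetical, independently written implementations of "square x".
def impl_a(x):
    return x * x


def impl_b(x):
    return x ** 2


def impl_c(x):
    return x * 2  # deliberate bug: doubles instead of squaring


result, suspects = vote([f(3) for f in (impl_a, impl_b, impl_c)])
print(result, suspects)
```

Here the voter returns the agreed value and flags the third implementation as the odd one out. With only two outputs, as in a lockstep pair, a mismatch can be detected but there is no majority to say which side is wrong.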

      --
      No sig today...
    6. Re:A fault-tolerant chip? by morgauxo · · Score: 3, Interesting

      Years ago I had a single-core chip with a damaged FPU. It took me forever to figure out the problem; my computer could only run Gentoo. Windows and Debian, both of which it had run previously, gave me all sorts of weird errors I had never seen before. I had to keep using it because I was in college and didn't have money for another one, so I just got used to Gentoo. Even in Gentoo, anything which wasn't compiled from scratch was likely to crash in weird ways (a clue). I finally diagnosed the problem a couple of years later when a family member gave me a disk that boots up and runs all sorts of tests on the hardware. It turned out Gentoo worked because when software compiled it recognized the lack of an FPU and compiled in floating point emulation like it was dealing with an old 486SX chip.

      So, anyway, if that can happen I would imagine damaging a single core of a multicore chip is quite possible.

  2. way back machine by Anonymous Coward · · Score: 5, Insightful

    I guess MIT has forgotten about the Transputer....

  3. Back to the future moment? by GumphMaster · · Score: 4, Insightful

    I started reading and immediately had flashbacks to the Transputer.

    --
    Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
    1. Re:Back to the future moment? by tibit · · Score: 4, Interesting

      Alive and well as XMOS products. I love those chips.

      --
      A successful API design takes a mixture of software design and pedagogy.
    2. Re:Back to the future moment? by jd · · Score: 3, Informative

      The Transputer was a brilliant design. Intel came up with a next-gen variant, called the iWarp, but never did anything with it and eventually abandoned the concept.

      IIRC, each Transputer had four serial lines where each could be in transmit or receive mode. They each had their own memory management (16K on-board, extendable up to 4 gigs - it was a true 32-bit architecture) so there was never any memory contention. Arrays of thousands of Transputers, arranged in a Hypercube topology, were developed and could out-perform the Cray X-MP at a fraction of the cost.

      Having a similar communications system in modern CPUs would certainly be doable. It would have the major benefit over a bus in that it's a local communications channel so you always have maximum bandwidth. Having said that, a switched network would have fewer interconnects and be simpler to construct and scale since the switching logic is isolated and not part of the core. You can also multicast and anycast on a switched network - technically doable on the Transputer but not trivial. Multicasting is excellent for MISD-type problems (multi-instruction, single-data) since you can have the instructions in the L1 cache and then just deliver the data in a single burst to all applicable cores.

      (Interestingly, although PVM and MPI support collective operations of this kind, they're usually done as for loops, which - by definition - means your network latency goes up with the number of processes you send to. Since collective operations usually end in a barrier, even the process you first send to has this extra latency built into it.)
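To put rough numbers on the for-loop point, here is a back-of-the-envelope sketch. It assumes every send takes one time unit and each process can send to only one peer per round; the function names are made up for illustration and are not PVM or MPI calls. The tree schedule is shown only as a common alternative for comparison.

```python
def linear_broadcast_rounds(n):
    """Root sends to each of the other n-1 processes one after another."""
    return max(n - 1, 0)


def tree_broadcast_rounds(n):
    """Every process that already holds the data forwards it each round,
    so the number of holders doubles until all n processes are covered."""
    rounds, holders = 0, 1
    while holders < n:
        holders *= 2
        rounds += 1
    return rounds


for n in (2, 16, 1024):
    print(n, linear_broadcast_rounds(n), tree_broadcast_rounds(n))
```

Under these assumptions a loop-based broadcast to 1024 processes needs 1023 sequential sends, while a doubling tree needs 10 rounds. A true hardware multicast would deliver to all receivers in a single step.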

      It's also arguable that it would be better if the networking in the CPU was compatible with the networking on the main bus since this would mean core-to-core communications across SMP would not require any translation or any extra complexities in the support chips. It would also mean CPU-to-GPU communications would be greatly simplified.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    3. Re:Back to the future moment? by 91degrees · · Score: 3, Interesting

      My Computer Architecture lecturer at University was David May - lead architect for the Transputer. Our architecture notes consisted of a treatise on transputer design.

      Now multi-processor is becoming standard, it's interesting to see the same problems being rediscovered, and often the same solutions reinvented. Their next problem will be contention between two cores that happen to be running processes that require a lot of communication. Inmos had a simple solution to this one as well.

      Rather a shame that Inmos came up with the technology a quarter of a century too early. I've known a lot of engineers say wonderful things about them. The reason they weren't a huge success was because nobody had found a need for them yet. Extra silicon could be used to make the current generation faster much more easily than now.

  4. But what does the internet stand on? by keekerdc · · Score: 4, Funny

    Ah, you're clever; but it's internets all the way down.

  5. Re:Say what? by hamjudo · · Score: 4, Interesting

    Errr... the internal "bus" between cores on modern x86 chips already is either a ring of point to point links or a star with a massive crossbar at the center.

    The researchers can't be this far removed from the state of the art, so I am hoping that it is just a really badly written article. I hope they are comparing their newer research chips with their own previous generation of research chips. Intel and AMD aren't handing out their current chip designs to the universities, so many things have to be re-invented.

  6. Buses are so '90s by rrohbeck · · Score: 5, Informative

    AMD uses HT and Intel has its ring bus, both of which use point-to-point links. Buses have serious trouble with the impedance jumps at the taps and clock skew between the lines; that's why nobody is using them in high speed applications any more. Even the venerable SCSI and ATA buses went the way of the dodo. The only bus I can see in my system is DDR3 (and I think that will go away with DDR4 due to the same problems.)

  7. the worst replaces the best by holophrastic · · Score: 3, Interesting

    Yeah, great idea. Take the very fastest communication that we have on the entire planet, and replace it with the absolute slowest communication we have on the planet. Great idea. And with it, more complexity, more caches, more lookup tables, and more things to go wrong.

    The best part is that it's totally unbalanced. Internet protocols are based on a network that's ever-changing and totally unreliable. The bus, on the other hand, is based on total reliability and is static.

    I'd have thought that a pool concept, or a mailbox metaphor, or a message board analog would have been more appropriate. Something where streams are naturally quantized and sending is unpaired from receiving. Where a recipient can operate at its own rate, independent of the sender.

    You know, like typical linux interactive sockets, for example. But what do I know.

  8. The important bit : No coherent shared cache by Sarusa · · Score: 5, Informative

    As mentioned in other comments, this has been done before. The method of message passing isn't as fundamental as one key point - that it is all explicit message passing.

    Intel and AMD x86/x64 CPUs use coherent cache between cores to make sure that a thread running on CPU 1 sees the same RAM as a thread running on CPU 3. This leads to horrible bottlenecks and huge amounts of die tied up in trying to coordinate the writes, maintain coherency between N cores ((N-1)^2 connections!), and it all just goes to hell pretty fast. Intel has this super new transactional memory rollback thing, but it's turd polishing.

    The next step is pretty obvious (see Barrelfish) and easy: no shared coherency. Everything is done with message passing. If two threads or processes (it doesn't really matter at that point) want to communicate they need to do it with messages. It's much cleaner than dealing with shared memory synchronization, and makes program flow much more obvious (to me at least - I use message queues even on x86/x64). If you need to share BIG MEMORY between threads, which is reasonable for something like image processing, you at least use messages to explicitly coordinate access to shared memory and the cores don't have to worry about coherency.
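A minimal sketch of the "use messages to coordinate access to shared memory" pattern, assuming two threads and a one-slot list standing in for BIG MEMORY. It only illustrates the programming model: Python's `queue.Queue` is itself built on locks, so this says nothing about how non-coherent hardware would implement the channels. The message strings and function names are made up for the example.

```python
import queue
import threading


def producer(buf, to_consumer, from_consumer):
    for value in range(3):
        from_consumer.get()        # wait until we own the buffer
        buf[0] = value             # safe: only the current owner touches buf
        to_consumer.put("filled")  # hand ownership to the consumer
    from_consumer.get()            # wait for the last hand-back
    to_consumer.put("done")


def consumer(buf, to_consumer, from_consumer, results):
    from_consumer.put("empty")     # initially give the buffer to the producer
    while to_consumer.get() != "done":
        results.append(buf[0] * 10)
        from_consumer.put("empty")  # hand the buffer back


buf, results = [None], []
to_consumer, from_consumer = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=producer, args=(buf, to_consumer, from_consumer)),
    threading.Thread(target=consumer, args=(buf, to_consumer, from_consumer, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The buffer is never read and written at the same time, because a thread touches it only between receiving a message and sending the next one. Ownership travels with the message, which is the property that would let the cores skip cache coherency for that region.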

    This scales extremely well for at least a couple thousand CPUs, which is where the 'local internet' becomes useful.

    Where it becomes not easy is that almost all programs written for x86/x64 assume threads can share memory at will. They'd need to be rewritten for this model or would suddenly run a whole lot slower since you'd have to lock them to one core or somehow do the coordination behind their back. It'd be worth it for me!

  9. Re:Sounds like... by jd · · Score: 4, Interesting

    For low-level ccNUMA, you'd want three things:

    • A CPU network/bus with a "delay tolerant protocol" layer and support for tunneling to other chips
    • An MTU-to-MTU network/bus which used a compatible protocol to the CPU network/bus
    • MTUs to cache results locally

    If you were really clever, the MTU would become a CPU with a very limited instruction set (since there's no point re-inventing the rest of the architecture and external caching for CPUs is better developed than external caching for MTUs). In fact, you could slowly replace a lot of the chips in the system with highly specialized CPUs that could communicate with each other via a tunneled CPU network protocol.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  10. Re:They aren't doing this already? by Forever+Wondering · · Score: 4, Insightful

    I admit that despite being a technical user, I was not aware that only 2 chips are allowed to "talk" at a given time. I had (erroneously, it would seem) assumed that in order for a 3+-core chip to be fully useful, such a switch/router would have to already be in place.

    For [most] current designs, Intel/AMD have multilevel cache memory. The cores run independently and fully in parallel and if they need to communicate they do so via shared memory. Thus, they all run full bore, flat out, and don't need to wait for each other [there are some exceptions--read on]. They have cache snoop logic that keeps them up-to-date. In other words, all cores have access to the entire DRAM space through the cache hierarchy. When the system is booted, the DRAM is divided up (so each core gets its 1/N share of it).

    Let's say you have an 8 core chip. Normally, each program gets its own core [sort of]. Your email gets a core, your browser gets a core, your editor gets one, etc. and none of them wait for another [unless they do filesystem operations, etc.] Disjoint programs don't need to communicate much usually [and not at the level we're talking about here].

    But, if you have a program designed for heavy computation (e.g. video compression or transcoding), it might be designed to use multiple cores to get its work done faster. It will consist of multiple sections (e.g. processes/threads). If a process/thread so designates, it can share portions of its memory space with other processes/threads. Each thread takes input data from a memory pool somewhere, does some work on it, and deposits the results in a memory output pool. It then alerts the next thread in the processing "pipeline" as to which memory buffer it placed the result. The next thread does much the same. x86 architectures have some locking primitives to assist this. It's a bit more complex than that, but you don't need a "router". If the multicore application is designed correctly, any delays for sync between pipeline stages occur infrequently and are on the order of a few CPU cycles.

    This works fine up to about 16-32 cores. Beyond that, even the cache becomes a bottleneck. Or, consider a system where you have a 16 core chip (all on the same silicon substrate). The cache works fine there. But now suppose you want to have a motherboard that has 100 of these chips on it. That's right--16 cores/chip X 100 chips for a total of 1,600 cores. Now, you need some form of interchip communication.

    x86 systems already have this in the form of Hypertransport (AMD) or the PCI Express Bus (Intel) [there are others as well]. PCIe isn't a bus in the classic sense at all. It functions like an onboard store-and-forward point-to-point routing system with guaranteed packet delivery. This is how a SATA host adapter communicates with DRAM (via a PCIe link). Likewise for your video controller. Most current systems don't need to use PCIe beyond this (e.g. to hook up multiple CPU chips) because most desktop/laptop systems have only one chip (with X cores in it). But, in the 100 chip example, you would need something like this and HT and PCIe already do something similar. Intel/AMD are already working on any enhancements to HT/PCIe as needed. Actually, Intel [unwilling to just use HT], is pushing "Quick Path Interconnect" or QPI.

    --
    Like a good neighbor, fsck is there ...
  11. Re:Say what? by TheRaven64 · · Score: 3, Insightful

    The researchers can't be this far removed from the state of the art

    They aren't. The way this works is a conversation something like this:

    MIT PR: We want to write about your research, what do you do?
    Researcher: We're looking at highly scalable interconnects for future manycore systems.
    MIT PR: Interconnects? Like wires?
    Researcher: No, the way in which the cores on a chip communicate.
    MIT PR: So how does that work?
    Researcher: {long explanation}
    MIT PR: {blank expression}
    Researcher: You know how the Internet works? With packet switching?
    MIT PR: I guess...
    Researcher: Well, kind-of like that.
    MIT PR: Our researchers are putting the Internet in a CPU!!1!111eleventyone

    --
    I am TheRaven on Soylent News