Slashdot Mirror


Multicore Chips As 'Mini-Internets'

An anonymous reader writes "Today, a typical chip might have six or eight cores, all communicating with each other over a single bundle of wires, called a bus. With a bus, only one pair of cores can talk at a time, which would be a serious limitation in chips with hundreds or even thousands of cores. Researchers at MIT say cores should instead communicate the same way computers hooked to the Internet do: by bundling the information they transmit into 'packets.' Each core would have its own router, which could send a packet down any of several paths, depending on the condition of the network as a whole."
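To make the bus-versus-packets contrast concrete, here is a minimal sketch of routing on a small grid of cores. It assumes a 2D mesh and uses plain dimension-order (XY) routing, which is deterministic and simpler than the adaptive scheme the summary describes; the function names and the grid layout are illustrative and do not come from the MIT work.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst, moving along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def links(path):
    """The set of directed router-to-router links a path occupies."""
    return set(zip(path, path[1:]))


# Two transfers in different parts of the mesh (hypothetical coordinates).
a = xy_route((0, 0), (2, 1))
b = xy_route((0, 3), (3, 3))

# On a shared bus these two transfers would have to take turns.
# On the mesh they use no common link, so they can proceed at the same time.
can_overlap = links(a).isdisjoint(links(b))
```

The point of the sketch is only that two packets whose paths share no link do not contend, whereas every transfer on a single bus contends with every other.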

33 of 132 comments

  1. A fault-tolerant chip? by Anonymous Coward · · Score: 5, Interesting

    This technology that networks different cores can also serve another purpose, to prevent damage from core failure, and diagnose such failures. If the cores are connected to other cores, the same data can be processed by bypassing a damaged core, making overheating or manufacturing problems important, but almost treatable. Who knows, cores might even get replaceable.

    1. Re:A fault-tolerant chip? by Mitchell314 · · Score: 2

      What are the chances you damage the chip without damaging enough of it to be rendered inoperable?

      --
      I read TFA and all I got was this lousy cookie
    2. Re:A fault-tolerant chip? by Osgeld · · Score: 4, Interesting

      pretty good, few years ago I ran for months on a dual core with one blown out, worked fine until I fired up something that used both, then it would die.

    3. Re:A fault-tolerant chip? by AdamHaun · · Score: 4, Interesting

      This sort of technology already exists to an extent. TI's Hercules TMS570 microcontrollers have two CPUs that run in lockstep along with a bus comparison module. I think total fail-tolerance might take three CPUs, but this provides strong hardware fault detection in addition to the usual ECC and other monitoring/correction stuff.

      Note that run-time fault tolerance is mostly needed for safety-critical systems. The customers who buy these products do not do so to get better yield, they do so to guarantee that their airbags, anti-lock brakes, or medical devices won't kill anyone. As such, manufacturing quality is very high. Also, die size is significantly larger than comparable general market (non-safety) devices. This means they cost a small fortune. The PC equivalent would be MLC vs. SLC SSDs. Consumer products usually don't waste money on that kind of reliability unless they need it. Now a super-expensive server CPU, maybe...

      [Disclaimer: I am a TI employee, but this is not an official advertisement for TI. Do not use any product in safety-critical systems without contacting the manufacturer, or at least a good lawyer. I am not responsible for damage to humans, machinery, or small woodland creatures that may result from improper use of TI products.]

      --
      Visit the
    4. Re:A fault-tolerant chip? by Electricity+Likes+Me · · Score: 4, Informative

      Also this is exactly what chip makers already do to a great extent: the binning of CPUs by speeds is not a targeted process. You make a bunch of chips, test them, and then sell them as whatever clock speed they are robustly stable at.
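As a rough sketch of what "test them, then sell them at whatever speed they are robustly stable at" could look like: the speed grades and the guard band below are hypothetical numbers chosen for illustration, not any manufacturer's real bins or process.

```python
# Hypothetical speed grades and safety margin, for illustration only.
BINS_GHZ = [2.0, 2.5, 3.0]
GUARD_BAND_GHZ = 0.2


def speed_bin(max_stable_ghz):
    """Return the highest grade the chip clears with margin, or None if it clears none."""
    usable = max_stable_ghz - GUARD_BAND_GHZ
    eligible = [grade for grade in BINS_GHZ if grade <= usable]
    return max(eligible) if eligible else None


# A part that tested stable up to 2.75 GHz is sold as a 2.5 GHz part,
# which is the headroom overclockers go looking for.
sold_as = speed_bin(2.75)
```

The guard band is where "robustly stable" and merely "stable" part ways: the chip is sold below the fastest clock at which it passed.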

    5. Re:A fault-tolerant chip? by Osgeld · · Score: 2

      yep, it's also why overclocking is popular: robustly stable and stable are 2 different things, depending on where the chips end up and on testing tolerances. That 2.5GHz chip may run at 2.7GHz just fine and dandy, but out of spec with regard to voltage or temperature, even by a little.

      you don't want Dell refusing a gigantic pile of chips because of a few bad products, causing a quality alert, which is very costly and time consuming for both parties

    6. Re:A fault-tolerant chip? by Joce640k · · Score: 5, Interesting

      Also this is exactly what chip makers already do to a great extent: the binning of CPUs by speeds is not a targeted process. You make a bunch of chips, test them, and then sell them as whatever clock speed they are robustly stable at.

      Nope. The markings on a chip do NOT necessarily indicate what the chip is capable of.

      Chips are sorted by ability, yes, but many are deliberately downgraded to fill incoming orders for less powerful chips. Bits of them are disabled/underclocked even though they passed all stability tests, simply because that's what the day's incoming orders were for.

      --
      No sig today...
    7. Re:A fault-tolerant chip? by Joce640k · · Score: 3, Interesting

      This sort of technology already exists to an extent. TI's Hercules TMS570 microcontrollers have two CPUs that run in lockstep along with a bus comparison module. I think total fail-tolerance might take three CPUs....

      This is just to detect when an individual CPU has failed. To build a fault-tolerant system you need multiple CPUs.

      nb. The 'three CPUs' thing isn't done for detection of hardware faults it's for software faults. The idea is to get three different programmers to write three different programs with a specified output. You then compare the outputs of the programs and if one is different it's likely to be a bug.
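The difference between two units and three, whatever kind of fault is being caught, can be sketched in a few lines. This is a generic comparison and majority vote, not a description of how the TMS570 or any real voter is implemented; the function names are made up for the example.

```python
from collections import Counter


def lockstep_check(a, b):
    """Two units: a mismatch proves something failed, but not which unit is wrong."""
    return a == b


def vote(outputs):
    """Return (majority value, indices of dissenting units); raise if no strict majority."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Two units can only detect the disagreement.
detected_ok = lockstep_check(42, 41)

# Three units can outvote the odd one out and say which one it was.
value, bad_units = vote([42, 42, 41])
```

With two outputs the only safe reaction to a mismatch is to stop or reset; with three, the system can keep going on the majority value and flag the dissenting unit.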

      --
      No sig today...
    8. Re:A fault-tolerant chip? by morgauxo · · Score: 3, Interesting

      Years ago I had a single core chip with a damaged FPU. It took me forever to figure out the problem; my computer could only run Gentoo. Windows and Debian, both of which it had run previously, gave me all sorts of weird errors I had never seen before. I had to keep using it because I was in college and didn't have money for another one, so I just got used to Gentoo. Even in Gentoo anything which wasn't compiled from scratch was likely to crash in weird ways. (a clue) I finally diagnosed the problem a couple years later when a family member gave me a disk that boots up and runs all sorts of tests on the hardware. It turned out Gentoo worked because when software compiled it recognized the lack of an FPU and compiled in floating point emulation like it was dealing with an old 486sx chip.

      So, anyway, if that can happen I would imagine damaging a single core of a multicore chip is quite possible.

  2. way back machine by Anonymous Coward · · Score: 5, Insightful

    I guess MIT has forgotten about the Transputer....

  3. Back to the future moment? by GumphMaster · · Score: 4, Insightful

    I started reading and immediately had flashbacks to the Transputer

    --
    Patent litigation: A doctrine of Mutually Assured Destruction... in which everyone seems willing to push the button
    1. Re:Back to the future moment? by tibit · · Score: 4, Interesting

      Alive and well as XMOS products. I love those chips.

      --
      A successful API design takes a mixture of software design and pedagogy.
    2. Re:Back to the future moment? by jd · · Score: 3, Informative

      The Transputer was a brilliant design. Intel came up with a next-gen variant, called the iWarp, but never did anything with it and eventually abandoned the concept.

      IIRC, each Transputer had four serial lines where each could be in transmit or receive mode. They each had their own memory management (16K on-board, extendable up to 4 gigs - it was a true 32-bit architecture) so there was never any memory contention. Arrays of thousands of Transputers, arranged in a Hypercube topology, were developed and could out-perform the Cray X-MP at a fraction of the cost.

      Having a similar communications system in modern CPUs would certainly be doable. It would have the major benefit over a bus in that it's a local communications channel so you always have maximum bandwidth. Having said that, a switched network would have fewer interconnects and be simpler to construct and scale since the switching logic is isolated and not part of the core. You can also multicast and anycast on a switched network - technically doable on the Transputer but not trivial. Multicasting is excellent for MISD-type problems (multi-instruction, single-data) since you can have the instructions in the L1 cache and then just deliver the data in a single burst to all applicable cores.

      (Interestingly, although PVM and MPI support collective operations of this kind, they're usually done as for loops, which - by definition - means your network latency goes up with the number of processes you send to. Since collective operations usually end in a barrier, even the process you first send to has this extra latency built into it.)
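The latency point about for-loop collectives can be illustrated by counting communication steps. The sketch below compares a root that sends to every process in turn with a binomial-tree broadcast, in which every process that already holds the data forwards it in each round. It is a toy schedule generator with made-up function names; it does not call PVM or MPI and says nothing about how a particular implementation behaves.

```python
def linear_broadcast_steps(n):
    """Root sends to each of the other n - 1 processes one after another."""
    return max(n - 1, 0)


def tree_broadcast_schedule(n):
    """Binomial tree: in round k, ranks below 2**k send to rank + 2**k (if it exists)."""
    rounds = []
    step = 1
    while step < n:
        rounds.append([(src, src + step) for src in range(step) if src + step < n])
        step *= 2
    return rounds


# For 8 processes: 7 sequential sends versus 3 rounds of parallel sends.
linear = linear_broadcast_steps(8)
tree = tree_broadcast_schedule(8)
```

The loop grows linearly with the number of receivers, while the tree grows with the logarithm of it, which is the gap the comment is pointing at.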

      It's also arguable that it would be better if the networking in the CPU was compatible with the networking on the main bus since this would mean core-to-core communications across SMP would not require any translation or any extra complexities in the support chips. It would also mean CPU-to-GPU communications would be greatly simplified.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    3. Re:Back to the future moment? by WrecklessSandwich · · Score: 2

      Yep, thought of XMOS immediately when I saw the title. 16 quad-core CPUs linked together in a 4D hypercube: https://www.xmos.com/products/development-kits/xmp-64
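For anyone who hasn't met the topology: in a hypercube, nodes are numbered in binary and two nodes are linked when their numbers differ in exactly one bit. The sketch below uses that rule for a 16-node, 4-dimensional cube and routes by fixing differing bits from lowest to highest. It is a textbook illustration with illustrative names, not the routing actually used by XMOS or Transputer hardware.

```python
def neighbors(node, dims):
    """Nodes whose label differs from `node` in exactly one bit."""
    return [node ^ (1 << d) for d in range(dims)]


def route(src, dst, dims):
    """Walk from src to dst by correcting one differing bit per hop, lowest bit first."""
    path = [src]
    cur = src
    for d in range(dims):
        bit = 1 << d
        if (cur ^ dst) & bit:
            cur ^= bit
            path.append(cur)
    return path


DIMS = 4  # 2**4 = 16 nodes

links_of_node_0 = neighbors(0, DIMS)   # [1, 2, 4, 8]
path = route(0b0000, 0b1011, DIMS)     # [0, 1, 3, 11]
```

Each node needs only `dims` links, and no route is longer than `dims` hops, which is why the layout suits parts with a small fixed number of links.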

    4. Re:Back to the future moment? by 91degrees · · Score: 3, Interesting

      My Computer Architecture lecturer at University was David May - lead architect for the Transputer. Our architecture notes consisted of a treatise on transputer design.

      Now multi-processor is becoming standard, it's interesting to see the same problems being rediscovered, and often the same solutions reinvented. Their next problem will be contention between two cores that happen to be running processes that require a lot of communication. Inmos had a simple solution to this one as well.

      Rather a shame that Inmos came up with the technology a quarter of a century too early. I've known a lot of engineers say wonderful things about them. The reason they weren't a huge success was because nobody had found a need for them yet. Extra silicon could be used to make the current generation faster much more easily than now.

  4. But what does the internet stand on? by keekerdc · · Score: 4, Funny

    Ah, you're clever; but it's internets all the way down.

  5. Say what? by Anonymous Coward · · Score: 2, Insightful

    Errr... the internal "bus" between cores on modern x86 chips already is either a ring of point to point links or a star with a massive crossbar at the center.

    1. Re:Say what? by hamjudo · · Score: 4, Interesting

      Errr... the internal "bus" between cores on modern x86 chips already is either a ring of point to point links or a star with a massive crossbar at the center.

      The researchers can't be this far removed from the state of the art, so I am hoping that it is just a really badly written article. I hope they are comparing their newer research chips with their own previous generation of research chips. Intel and AMD aren't handing out their current chip designs to the universities, so many things have to be re-invented.

    2. Re:Say what? by TheRaven64 · · Score: 3, Insightful

      The researchers can't be this far removed from the state of the art

      They aren't. The way this works is a conversation something like this:

      MIT PR: We want to write about your research, what do you do?
      Researcher: We're looking at highly scalable interconnects for future manycore systems.
      MIT PR: Interconnects? Like wires?
      Researcher: No, the way in which the cores on a chip communicate.
      MIT PR: So how does that work?
      Researcher: {long explanation}
      MIT PR: {blank expression}
      Researcher: You know how the Internet works? With packet switching?
      MIT PR: I guess...
      Researcher: Well, kind-of like that.
      MIT PR: Our researchers are putting the Internet in a CPU!!1!111eleventyone

      --
      I am TheRaven on Soylent News
  6. Sounds like... by ArchieBunker · · Score: 2

    ccNUMA?

    --
    Only the State obtains its revenue by coercion. - Murray Rothbard
    1. Re:Sounds like... by jd · · Score: 4, Interesting

      For low-level ccNUMA, you'd want three things:

      • A CPU network/bus with a "delay tolerant protocol" layer and support for tunneling to other chips
      • An MTU-to-MTU network/bus which used a compatible protocol to the CPU network/bus
      • MTUs to cache results locally

      If you were really clever, the MTU would become a CPU with a very limited instruction set (since there's no point re-inventing the rest of the architecture and external caching for CPUs is better developed than external caching for MTUs). In fact, you could slowly replace a lot of the chips in the system with highly specialized CPUs that could communicate with each other via a tunneled CPU network protocol.

      --
      It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  7. Buses are so '90s by rrohbeck · · Score: 5, Informative

    AMD uses HT and Intel has its ring bus, both of which use point-to-point links. Buses have serious trouble with the impedance jumps at the taps and clock skew between the lines; that's why nobody is using them in high speed applications any more. Even the venerable SCSI and ATA buses went the way of the dodo. The only bus I can see in my system is DDR3 (and I think that will go away with DDR4 due to the same problems.)

  8. Inefficient... by solidraven · · Score: 2

    That's just plain inefficient use of silicon area. They wish to waste some of that limited space on additional logic that isn't strictly necessary. And it will cause a significant bottleneck to be created. Did they forget about DMA controllers or something? You already need a DMA controller no matter what and it's perfectly capable of accessing the necessary memories as it is. Adding some extra capabilities to the DMA controller would be far more efficient in logic area size and most likely lead to a better performance compared to this bad idea.

    1. Re:Inefficient... by Theovon · · Score: 2

      Silicon AREA is cheap, and it's getting cheaper. Today's processors dedicate half their die space to CACHE. Transistors per die, cores per die, and transistors per core are all increasing at (different) exponential rates. And with power density increasing at a quadratic rate, we're already facing the dark silicon problem, where if we power on the entire chip at nominal voltage, we have trouble delivering the power, and we can't dissipate the heat.

      With 16 cores, a bus is tolerable. At 64, it's a liability, and we NEED a more sophisticated network.

  9. They aren't doing this already? by DaneM · · Score: 2

    I admit that despite being a technical user, I was not aware that only 2 chips are allowed to "talk" at a given time. I had (erroneously, it would seem) assumed that in order for a 3+-core chip to be fully useful, such a switch/router would have to already be in place.

    So, have Intel, AMD, and others simply tricked us into thinking that a 3+-core chip can actively use all its cores at once (as is the natural assumption), or am I misinterpreting something? If they have, why on earth didn't they include a "router" in the original designs? It seems entirely too obvious for the eggheads in R&D to have missed (or so one would think, anyway). I'm sure there are technical hurdles to overcome, but unless that can be managed, what is really the point of many-core CPUs that can't have all cores acting at once?

    1. Re:They aren't doing this already? by Anonymous Coward · · Score: 2, Informative

      You are misinterpreting it. The chips CAN work independently. It is only when one needs to talk to another or use a shared resource (hard drive, main memory, network) that this becomes a potential issue. It is like a family of three sharing a single bathroom - not such a big deal. Bump that up to 20 using the same bathroom, and you start having serious issues.

    2. Re:They aren't doing this already? by Forever+Wondering · · Score: 4, Insightful

      I admit that despite being a technical user, I was not aware that only 2 chips are allowed to "talk" at a given time. I had (erroneously, it would seem) assumed that in order for a 3+-core chip to be fully useful, such a switch/router would have to already be in place.

      For [most] current designs, Intel/AMD have multilevel cache memory. The cores run independently and fully in parallel and if they need to communicate they do so via shared memory. Thus, they all run full bore, flat out, and don't need to wait for each other [there are some exceptions--read on]. They have cache snoop logic that keeps them up-to-date. In other words, all cores have access to the entire DRAM space through the cache hierarchy. When the system is booted, the DRAM is divided up (so each core gets its 1/N share of it).

      Let's say you have an 8 core chip. Normally, each program gets its own core [sort of]. Your email gets a core, your browser gets a core, your editor gets one, etc. and none of them wait for another [unless they do filesystem operations, etc.] Disjoint programs don't need to communicate much usually [and not at the level we're talking about here].

      But, if you have a program designed for heavy computation (e.g. video compression or transcoding), it might be designed to use multiple cores to get its work done faster. It will consist of multiple sections (e.g. processes/threads). If a process/thread so designates, it can share portions of its memory space with other processes/threads. Each thread takes input data from a memory pool somewhere, does some work on it, and deposits the results in a memory output pool. It then alerts the next thread in the processing "pipeline" as to which memory buffer it placed the result. The next thread does much the same. x86 architectures have some locking primitives to assist this. It's a bit more complex than that, but you don't need a "router". If the multicore application is designed correctly, any delays for sync between pipeline stages occur infrequently and are on the order of a few CPU cycles.
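A minimal sketch of that kind of pipeline, using Python threads with a queue between each pair of stages. The queue stands in for "alert the next thread as to which buffer holds the result"; the stage functions and names are invented for the example. Python threads will not actually run compute stages in parallel on separate cores, so this shows only the structure, not the speedup.

```python
import queue
import threading

DONE = object()  # sentinel that tells a stage its input is finished


def stage(work, inbox, outbox):
    """Take items from inbox, process them, and hand results to the next stage."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # pass the shutdown signal down the pipeline
            return
        outbox.put(work(item))


def run_pipeline(items, stages):
    """Run `items` through `stages` in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Two toy stages standing in for, say, "decode" and "encode".
results = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10])
```

Each stage blocks only when its input queue is empty, which is the "sync between pipeline stages" the comment says should be rare in a well-designed application.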

      This works fine up to about 16-32 cores. Beyond that, even the cache becomes a bottleneck. Or, consider a system where you have a 16 core chip (all on the same silicon substrate). The cache works fine there. But now suppose you want to have a motherboard that has 100 of these chips on it. That's right--16 cores/chip X 100 chips for a total of 1600 cores. Now, you need some form of interchip communication.

      x86 systems already have this in the form of Hypertransport (AMD) or the PCI Express Bus (Intel) [there are others as well]. PCIe isn't a bus in the classic sense at all. It functions like an onboard store-and-forward point-to-point routing system with guaranteed packet delivery. This is how a SATA host adapter communicates with DRAM (via a PCIe link). Likewise for your video controller. Most current systems don't need to use PCIe beyond this (e.g. to hook up multiple CPU chips) because most desktop/laptop systems have only one chip (with X cores in it). But, in the 100 chip example, you would need something like this and HT and PCIe already do something similar. Intel/AMD are already working on any enhancements to HT/PCIe as needed. Actually, Intel [unwilling to just use HT], is pushing "Quick Path Interconnect" or QPI.

      --
      Like a good neighbor, fsck is there ...
  10. Re:So, why not move from "hub" to "switch"? by Osgeld · · Score: 2

    I still think switches on tiny low-traffic networks are a silly notion, though now that the cost of switches is insignificant (and when was the last time you saw a hub for sale) I just go with the flow.

      Back in the day we had a client who dumped their hubs in each branch for switches, much more expensive at the time, then whined that there was no advantage. I replied: you insisted on putting your 2 386's and a dot matrix printer on it, and even threatened to take your biz elsewhere; you got what you wanted, enjoy

  11. the worst replaces the best by holophrastic · · Score: 3, Interesting

    Yeah, great idea. Take the very fastest communication that we have on the entire planet, and replace it with the absolute slowest communication we have on the planet. Great idea. And with it, more complexity, more caches, more lookup tables, and more things to go wrong.

    The best part is that it's totally unbalanced. Internet protocols are based on a network that's ever-changing and totally unreliable. The bus, on the other hand, is based on total reliability and is static.

    I'd have thought that a pool concept, or a mailbox metaphor, or a message board analog would have been more appropriate. Something where streams are naturally quantized and sending is unpaired from receiving. Where a recipient can operate at its own rate, independent of the sender's.

    You know, like typical linux interactive sockets, for example. But what do I know.

  12. The important bit : No coherent shared cache by Sarusa · · Score: 5, Informative

    As mentioned in other comments, this has been done before. The method of message passing isn't as fundamental as one key point - that it is all explicit message passing.

    Intel and AMD x86/x64 CPUs use coherent cache between cores to make sure that a thread running on CPU 1 sees the same RAM as a thread running on CPU 3. This leads to horrible bottlenecks and huge amounts of die tied up in trying to coordinate the writes, maintain coherency between N cores ((N-1)^2 connections!), and it all just goes to hell pretty fast. Intel has this super new transactional memory rollback thing, but it's turd polishing.

    The next step is pretty obvious (see Barrelfish) and easy: no shared coherency. Everything is done with message passing. If two threads or processes (it doesn't really matter at that point) want to communicate they need to do it with messages. It's much cleaner than dealing with shared memory synchronization, and makes program flow much more obvious (to me at least - I use message queues even on x86/x64). If you need to share BIG MEMORY between threads, which is reasonable for something like image processing, you at least use messages to explicitly coordinate access to shared memory and the cores don't have to worry about coherency.
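A small sketch of the "everything is a message" discipline: one thread owns a piece of state outright, and every other thread can only send it messages through a queue. Python threads do technically share an address space, so the isolation here is by convention rather than enforced by hardware; the message names ("add", "read", "stop") are invented for the example and have nothing to do with Barrelfish's actual interfaces.

```python
import queue
import threading


def owner(inbox):
    """Only this thread ever touches `total`; everyone else must send a message."""
    total = 0
    while True:
        kind, payload = inbox.get()
        if kind == "add":
            total += payload
        elif kind == "read":
            payload.put(total)  # payload is the caller's reply queue
        elif kind == "stop":
            return


def run(n_workers, per_worker):
    """Have several workers each send `per_worker` increments, then read the total."""
    inbox = queue.Queue()
    owner_thread = threading.Thread(target=owner, args=(inbox,))
    owner_thread.start()

    def worker():
        for _ in range(per_worker):
            inbox.put(("add", 1))

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    reply = queue.Queue()
    inbox.put(("read", reply))
    result = reply.get()
    inbox.put(("stop", None))
    owner_thread.join()
    return result


total = run(4, 1000)
```

No lock appears anywhere in the worker code: the queue serializes the updates, so there is nothing for the cores to keep coherent beyond the messages themselves.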

    This scales extremely well for at least a couple thousand CPUs, which is where the 'local internet' becomes useful.

    Where it becomes not easy is that almost all programs written for x86/x64 assume threads can share memory at will. They'd need to be rewritten for this model or would suddenly run a whole lot slower since you'd have to lock them to one core or somehow do the coordination behind their back. It'd be worth it for me!

    1. Re:The important bit : No coherent shared cache by dkf · · Score: 2

      Seems like you are talking about switching from a "strong memory model" to a "weak memory model" and TBQH I know my share of developers that can barely handle multithreaded programming as it is... throwing this at them could be a disaster on the software side.

      Depends on the model. If the model is "oh, you got one big space of memory; anything goes but you'd better sprinkle a few locks in" then yes, that will suck boulders when the hardware switches to message passing, but there are other parallelism models in use in programming. Those that have each thread as being essentially isolated and only communicating with the other threads by sending messages will adapt much more easily; that's basically MPI, and that's known to scale massively. It's also a heck of a lot easier to reason about message passing parallelism; that's been known since at least the '80s. What's more, there are actually quite a lot of programmers who have experience with distributed component programming; they just tend to work at a much higher level than a single process (or single computer).

      --
      "Little does he know, but there is no 'I' in 'Idiot'!"
  13. Google's on it by bill_mcgonigle · · Score: 2

    I can't seem to find the old story or my comment on it, but when Google acquired a 'stealth' startup a year or so ago the most interesting thing about it was that the primary investigator had a few patents for packet-switched CPUs.

    --
    My God, it's Full of Source!
    OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
  14. Re:A glorified name for better bus arbitrators by dyingtolive · · Score: 2

    You'd think someone with a 7-digit UID wouldn't be so arrogant.

    --
    Support the EFF and Creative Commons. The war is coming, and they're supporting you...