Slashdot Mirror


MIT's Swarm Chip Architecture Boosts Multi-Core CPUs, Offering Up To 18x Faster Processing (gizmag.com)

An anonymous reader writes from a report via Gizmag: MIT's new Swarm chip could help unleash the power of parallel processing for up to 75-fold speedups, while requiring programmers to write a fraction of the code that is usually necessary for programs to take full advantage of their hardware. Swarm is a 64-core chip developed by Prof. Daniel Sanchez and his team that includes specialized circuitry for both executing and prioritizing tasks in a simple and efficient manner. Neowin reports: "For example, when using multiple cores to process a task, one core might need to access a piece of data that's being used by another core. Developers usually need to write code to avoid these types of conflict, and direct how each part of the task should be processed and split up between the processor's cores. This almost never gets done with normal consumer software, hence the reason why Crysis isn't running better on your new 10-core Intel. Meanwhile, when such optimization does get done, mainly for industrial, scientific and research computers, it takes a lot of effort on the developer's side and efficiency gains may sometimes still be minimal." Swarm is able to take care of all of this, mostly through its hardware architecture and customizable profiles that can be written by developers in a fraction of the time needed for regular multi-core silicon. The 64-core version of Swarm came out on top after MIT researchers tested it out against some highly-optimized parallel processing algorithms, offering three to 18 times faster processing. The most impressive result was when Swarm achieved results 75 times better than the regular chips, because that particular algorithm had failed to be parallelized on classic multi-core processors. There's no indication as to when this technology will be available for consumer devices.

12 of 55 comments

  1. Parallelization... by Timothy2.0 · · Score: 5, Insightful

    It's important for the average consumer to realize that not all processing tasks are easily parallelizable, and some downright aren't. In those cases, additional cores aren't going to give you much in the way of speed increases. Of course, your average consumer *doesn't* realize that, and when they go to their favourite big-box store for a new computer, the sales associate isn't going to sit down and discuss the reality of the situation either.
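
    A rough way to put numbers on this is Amdahl's law, which bounds the speedup when only part of a task can run in parallel. The sketch below is illustrative only: the 50% parallel fraction is an assumed figure, not a measurement of any real program.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a task can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: half of the work parallelizes perfectly, half is serial.
for cores in (2, 4, 8, 64):
    print(cores, round(amdahl_speedup(0.5, cores), 2))
```

    With half the work stuck on one core, the bound stays below 2x no matter how many cores are added.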

    1. Re:Parallelization... by HiThere · · Score: 2

      While true, multi-processor systems are considerably more responsive while busy with another task. So, e.g., you can be downloading upgrades, compressing files, and word processing all at the same time without penalty. Admittedly, it's hard to see how that particular scenario would be better with 100 cores than with 5 or 6. But a batch of them could be rendering an animation or some such.

      FWIW, I have a task in mind where 1,000 cores would not be overkill, but most users would never do it. However they might be doing local speech understanding, or image recognition. GPUs aren't the only, or even the best, way to do that. They're just currently the cheapest.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    2. Re:Parallelization... by phantomfive · · Score: 2

      The summary is bad. My understanding of what they did (after reading the article): implemented a shortest-path algorithm in software, but parallelized it by putting a priority queue into hardware to allocate tasks.
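
      For reference, here is where the priority queue sits in an ordinary, sequential shortest-path routine (Dijkstra's algorithm). This is a plain software sketch, not the Swarm implementation; the graph and function name are made up for illustration.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm; graph maps node -> list of (neighbor, weight)."""
    dist = {source: 0}
    queue = [(0, source)]  # (distance so far, node), smallest distance first
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale entry: a shorter path to node was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


# Hypothetical four-node graph.
graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
print(shortest_paths(graph, "a"))
```

      Every pop and push goes through the one queue, which is why the queue is the natural thing to speed up if the per-node work is to be spread across cores.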

      --
      "First they came for the slanderers and i said nothing."
    3. Re:Parallelization... by willy_me · · Score: 3, Informative

      Branch prediction integrated with the pipeline. Most CPUs do not execute both branches so much as they perform all the work required to quickly switch to the alternate branch should a branch not go as predicted. This implies an alternate pipeline into which the instructions for the alternate branch are queued. This might not sound like much but it actually constitutes >90% of the work a CPU must perform. The ALU is fast and simple but getting the correct data to and from the ALU is challenging.

      CPUs can also support multiple ALUs - but this is not to speed branches. Multiple ALUs are used when the CPU detects that incoming instructions are not dependent on one another and can be executed concurrently. When detected, instructions are executed in parallel. The benefits gained are limited and it comes at the cost of extra transistors. However, because you have less movement of data, power requirements are reduced.

      Look at the Apple A9 CPU compared to alternate multi-core ARM chips that are available. The A9 is just as fast while running fewer cores at a lower clock rate while consuming less power. It is able to do so by using the previously mentioned techniques. It uses billions of transistors and costs more to produce than other chips that are just as fast. Not a good choice for making devices with low profit margins, but an excellent choice if you can afford it.

    4. Re:Parallelization... by TheRaven64 · · Score: 3, Informative

      No. The P6 does branch prediction. When you get to a branch, the processor guesses which one is taken and executes that. If it guessed wrong, it throws away all of the speculative results. The grandparent is talking about executing both branches. The upside of this is that you never mispredict a branch. The downside is that it's not really feasible and gives a huge increase in power consumption. A modern superscalar processor can easily have 50 instructions in flight at once (the Pentium 4 could have 140, which is partly why it rarely hit its peak performance). You have a branch, on average, every 7 instructions. To fill a pipeline of 50 instructions, you need to speculatively execute past 7 branches. Often these are loops, so branch prediction does a good job. Now imagine that you executed every path. After 7 branches, there are 128 possible places you could be. Each one of those includes an average of 7 instructions, so to be able to do all of that you'd need 18 times as many functional units. Register renaming (which is already one of the largest costs on the chip) would become vastly more complicated. Your processor would need liquid helium poured on it to keep it at a stable temperature. And, at the end of this, you'd still not have much better performance.
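
      One way to reproduce the figures in the paragraph above, assuming the "18 times" compares the instructions on all 128 paths against the 50 already in flight (that reading is my assumption; the variable names are illustrative):

```python
pipeline_depth = 50          # instructions in flight, from the comment
instructions_per_branch = 7  # average distance between branches, from the comment
branches = 7                 # branches speculated past to fill the pipeline

paths = 2 ** branches                                      # two outcomes per branch
instructions_all_paths = paths * instructions_per_branch   # work on every path
ratio = instructions_all_paths / pipeline_depth            # versus one predicted path

print(paths, instructions_all_paths, round(ratio, 1))
```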

      And that is assuming that all branches are simple conditionals, not computed branches (C++ virtual calls, cross-library calls via a PLT, function pointer calls, and so on). You can't execute all of the possible targets for a computed branch, so you'd still need the branch predictor infrastructure to handle this case, so you're not even saving much on hardware.

      A few experimental chips have tried doing this for branches where the predictor doesn't give a high confidence of either path. In this kind of limited use, executing both branches at half speed, rather than executing one with a 50% chance of needing to discard the result, gives slightly better performance.

      --
      I am TheRaven on Soylent News
  2. Special-Purpose chips by ThosLives · · Score: 3, Insightful

    I guess the world is rediscovering that special-purpose chips will always be faster at their special purpose than a general-purpose chip will be.

    --
    "There are a dozen opinions on a matter until you know the truth. Then there is only one." - CS Lewis (paraprhase)
  3. can't this hardware be translated to software? by sittingnut · · Score: 3, Interesting

    i am dumb on this, but if 'hardware architecture' can be made to take care of avoiding conflicts and "direct how each part of the task should be processed and split up between the processor's cores", can't the same be done through software that imitates whatever the 'hardware architecture' is doing?
    if this can be done, basically this software would be another step in the compiling/assembling process?

    as i said, i am ignorant on this, but why not?

    1. Re:can't this hardware be translated to software? by complete+loony · · Score: 2

      I've only had a quick look at their press release. Is there a pre-print of their paper anywhere?

      This looks like a hardware implementation of something like "Grand Central Dispatch", combined with transactional memory.

      The basic idea seems to be that you can take a serial-ish process and break it up into tasks. Start running the first few tasks that should obviously run first. Then if you have spare CPU cores, you can also start speculatively executing later tasks. But if these speculative tasks hit a conflict in the transactional memory model, the results will be thrown away.

      So you might see a massive win from running those tasks early. But at worst, you'll still run every task in order.
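
      A toy model of that idea, as I read it: run a window of tasks against the same snapshot, commit them in program order, and redo any task whose inputs were changed by an earlier task in the window. This is a sequential simulation with made-up names, not the actual hardware design or a real transactional memory.

```python
def run_speculatively(tasks, state, cores):
    """Run ordered tasks in windows of `cores`, committing in program order.

    Each task is (reads, fn): fn takes a dict of the values it reads and
    returns a dict of values to write. Returns the number of aborted tasks.
    """
    aborts = 0
    for start in range(0, len(tasks), cores):
        window = tasks[start:start + cores]
        snapshot = dict(state)  # every task in the window starts from this view
        results = [fn({k: snapshot[k] for k in reads}) for reads, fn in window]
        written = set()  # keys committed by earlier tasks in this window
        for (reads, fn), result in zip(window, results):
            if written & set(reads):
                # Conflict: the speculative run read stale data, so discard
                # its result and re-run the task against the current state.
                aborts += 1
                result = fn({k: state[k] for k in reads})
            state.update(result)
            written |= set(result)
    return aborts


# Hypothetical tasks: the third depends on the first two, so it conflicts.
tasks = [
    (("x",), lambda v: {"x": v["x"] + 1}),
    (("y",), lambda v: {"y": v["y"] * 2}),
    (("x", "y"), lambda v: {"z": v["x"] + v["y"]}),
]
state = {"x": 1, "y": 10, "z": 0}
aborts = run_speculatively(tasks, state, cores=3)
print(state, aborts)
```

      The first two tasks commit their speculative results unchanged; the third is thrown away and redone, so the final state matches running the tasks strictly in order.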

      IMHO getting any kind of speed boost is going to depend on hardware support. But there might be a way to do something similar with OS kernel support.

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    2. Re:can't this hardware be translated to software? by Gravis+Zero · · Score: 2

      i am dumb on this, but if 'hardware architecture' can be made to take care of avoiding conflicts and "direct how each part of the task should be processed and split up between the processor's cores", can't the same be done through software that imitates whatever the 'hardware architecture' is doing?

      From reading the MIT page, I gather that it should be possible but it would result in substantial overhead. The Bloom filter alone would also need its own core.
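
      For anyone unfamiliar with the structure, a Bloom filter is a compact set that can answer "possibly present" or "definitely absent". Below is a generic software sketch; the sizes, the hashing scheme, and the idea of storing memory addresses in it are my assumptions, not details from the MIT page.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical use: remember which addresses a task has written.
written = BloomFilter()
written.add(0x1000)
print(written.might_contain(0x1000))  # True: added items are always reported
print(written.might_contain(0x2000))  # usually False, but may be a false positive
```

      Each lookup costs several hash computations in software, which gives a feel for the overhead being described.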

      if this can be done, basically this software would be another step in the compiling/assembling process?

      Yes, however, this would not be helpful for 99% of software because most software simply cannot benefit from parallel processing. The one area that benefits the most from parallel processing is graphics, specifically manipulation and rendering. That said, where this may be able to help is in creating a better GPU, so it should be no surprise that one of the professors working on this is also a senior researcher for NVIDIA.

      --
      Anons need not reply. Questions end with a question mark.
  4. Any hardware can be software. Doesn't mean it shou by raymorris · · Score: 2

    Sure it -could- be done in software. Essentially any design can be implemented as hardware, software, or a hybrid of the two. (A major problem for those complaining about "software patents".) I wouldn't be surprised if someone does take some of their ideas and implement them in software.

    In general, hardware will be faster and in some ways more reliable than a software implementation of the same algorithm. It also means software doesn't have to be recompiled for lots of different types of hardware, if the hardware hides the differences.

  5. an older paper describing Swarm by joris.w · · Score: 3, Informative
  6. Processes vs threads by DidgetMaster · · Score: 2

    There are two ways that multiple cores can help the average user. First, they allow multiple different processes to run at the same time. You can run a word processor, spreadsheet, browser, etc. all at once. Unless all of these processes are waiting on the same resource (e.g. all trying to write to the disk at the same time, or waiting for the user to press a key), they can complete tasks much faster than on a machine with fewer cores.

    Second, they allow a single program to do more than one thing at a time. Lots of programs will have a separate thread to handle the user interface while another does background tasks, but few will try and break big tasks into multiple pieces. For example, many database programs will be able to run several independent queries at the same time, but few will run a single query faster on a multi-core machine than on a single core one.

    I am working on a new data management system that does both. It can let lots of queries run at the same time, and it can break a single query into smaller pieces. The more cores the better. A query that takes 1 minute on a single core can often finish in about 1/5 the time on a quad core (8 threads).
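
    As a sketch of what "breaking a single query into smaller pieces" can look like, here is a filter-and-count query split into chunks whose partial counts are merged at the end. The function names and data are hypothetical and say nothing about how my system does it. Note that CPython threads share one interpreter lock, so this shows the structure rather than a real speedup; CPU-bound work would need processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(rows, predicate):
    """The 'query': count rows satisfying a predicate."""
    return sum(1 for row in rows if predicate(row))


def parallel_count(rows, predicate, workers=4):
    """Split one query into per-chunk scans, then merge the partial counts."""
    chunk = max(1, (len(rows) + workers - 1) // workers)
    pieces = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda piece: count_matches(piece, predicate), pieces)
        return sum(partials)


def is_multiple_of_7(n):
    return n % 7 == 0


rows = list(range(100_000))
print(parallel_count(rows, is_multiple_of_7) == count_matches(rows, is_multiple_of_7))
```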