Slashdot Mirror


MIT's Swarm Chip Architecture Boosts Multi-Core CPUs, Offering Up To 18x Faster Processing (gizmag.com)

An anonymous reader writes from a report via Gizmag: MIT's new Swarm chip could help unleash the power of parallel processing for up to 75-fold speedups, while requiring programmers to write a fraction of the code that is usually necessary for programs to take full advantage of their hardware. Swarm is a 64-core chip developed by Prof. Daniel Sanchez and his team that includes specialized circuitry for both executing and prioritizing tasks in a simple and efficient manner. Neowin reports: "For example, when using multiple cores to process a task, one core might need to access a piece of data that's being used by another core. Developers usually need to write code to avoid these types of conflict, and direct how each part of the task should be processed and split up between the processor's cores. This almost never gets done with normal consumer software, hence the reason why Crysis isn't running better on your new 10-core Intel. Meanwhile, when such optimization does get done, mainly for industrial, scientific and research computers, it takes a lot of effort on the developer's side and efficiency gains may sometimes still be minimal." Swarm is able to take care of all of this, mostly through its hardware architecture and customizable profiles that can be written by developers in a fraction of the time needed for regular multi-core silicon. The 64-core version of Swarm came out on top after MIT researchers tested it out against some highly-optimized parallel processing algorithms, offering three to 18 times faster processing. The most impressive result was when Swarm achieved results 75 times better than the regular chips, because that particular algorithm had failed to be parallelized on classic multi-core processors. There's no indication as to when this technology will be available for consumer devices.
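To make the "conflict" in the summary concrete, here is a minimal sketch of the hand-written coordination it describes on a conventional multi-core system. It is not Swarm code. The function name `parallel_sum` and the chunking scheme are hypothetical, and Python threads only stand in for cores (CPython does not run them truly in parallel).

```python
import threading

def parallel_sum(values, workers=4):
    """Sum values using several threads that share one accumulator."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)  # private work: no coordination needed
        with lock:            # shared update: must be serialized by hand
            total += partial

    # Split the input into one interleaved slice per worker.
    chunks = [values[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=work, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(parallel_sum(list(range(1000))))  # prints 499500
```

The lock and the explicit split of the work are the parts the programmer has to get right by hand. According to the summary, Swarm moves most of that burden into hardware.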

3 of 55 comments

  1. Re:Parallelization... by willy_me · · Score: 3, Informative

    Branch prediction is integrated with the pipeline. Most CPUs do not execute both branches so much as they perform all the work required to quickly switch to the alternate branch should a branch not go as predicted. This implies an alternate pipeline into which the instructions for the alternate branch are queued. This might not sound like much but it actually constitutes >90% of the work a CPU must perform. The ALU is fast and simple but getting the correct data to and from the ALU is challenging.

    CPUs can also support multiple ALUs - but this is not to speed branches. Multiple ALUs are used when the CPU detects that incoming instructions are not dependent on one another and can be executed concurrently. When detected, instructions are executed in parallel. The benefits gained are limited and it comes at the cost of extra transistors. However, because you have less movement of data, power requirements are reduced.
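A rough sketch of the independence check described above, assuming a toy in-order machine with no register renaming. The `Instr` type, the register names, and the `issue_cycles` helper are all hypothetical; real hardware performs this check in dedicated logic, not software.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read

def independent(earlier, later):
    """True if `later` can issue in the same cycle as `earlier`."""
    raw = earlier.dest is not None and earlier.dest in later.srcs  # read-after-write
    waw = earlier.dest is not None and earlier.dest == later.dest  # write-after-write
    war = later.dest is not None and later.dest in earlier.srcs    # write-after-read
    return not (raw or waw or war)

def issue_cycles(program, width=2):
    """Count cycles when up to `width` independent instructions issue together, in order."""
    cycles = 0
    i = 0
    while i < len(program):
        group = [program[i]]
        i += 1
        # Keep adding the next instruction while it conflicts with nothing in the group.
        while (i < len(program) and len(group) < width
               and all(independent(p, program[i]) for p in group)):
            group.append(program[i])
            i += 1
        cycles += 1
    return cycles

program = [
    Instr("r1", ("r2", "r3")),  # independent of the next instruction
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),  # needs r1 and r4
    Instr("r8", ("r7",)),       # needs r7
]
print(issue_cycles(program, width=1), issue_cycles(program, width=2))  # prints 4 3
```

Only the first two instructions pair up, so the second ALU saves one cycle out of four here. That matches the point that the benefit is real but limited by how much independent work the instruction stream contains.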

    Look at the Apple A9 CPU compared to alternate multi-core ARM chips that are available. The A9 is just as fast while running fewer cores at a lower clock rate while consuming less power. It is able to do so by using the previously mentioned techniques. It uses billions of transistors and costs more to produce than other chips that are just as fast. Not a good choice for making devices with low profit margins, but an excellent choice if you can afford it.

  2. an older paper describing Swarm by joris.w · · Score: 3, Informative
  3. Re:Parallelization... by TheRaven64 · · Score: 3, Informative

    No. The P6 does branch prediction. When you get to a branch, the processor guesses which one is taken and executes that. If it guessed wrong, it throws away all of the speculative results. The grandparent is talking about executing both branches. The upside of this is that you never mispredict a branch. The downside is that it's not really feasible and gives a huge increase in power consumption. A modern superscalar processor can easily have 50 instructions in flight at once (the Pentium 4 could have 140, which is partly why it rarely hit its peak performance). You have a branch, on average, every 7 instructions. To fill a pipeline of 50 instructions, you need to speculatively execute past 7 branches. Often these are loops, so branch prediction does a good job. Now imagine that you executed every path. After 7 branches, there are 128 possible places you could be. Each one of those includes an average of 7 instructions, which is roughly 900 instructions instead of 50, so to be able to do all of that you'd need 18 times as many functional units. Register renaming (which is already one of the largest costs on the chip) would become vastly more complicated. Your processor would need liquid helium poured on it to keep it at a stable temperature. And, at the end of this, you'd still not have much better performance.
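The arithmetic in the paragraph above can be checked with a few lines. This is only a back-of-the-envelope model using the comment's own figures (a 50-instruction window and a branch every 7 instructions); the function name is hypothetical.

```python
def eager_execution_cost(window=50, branch_interval=7):
    """Estimate the work needed to follow both sides of every branch."""
    # Branches crossed while filling the window along one path (50 // 7 = 7).
    branches = window // branch_interval
    # Every branch doubles the number of paths being followed.
    paths = 2 ** branches
    # Each path end contributes about one branch interval of instructions.
    instructions = paths * branch_interval
    # Work relative to following only the single predicted path.
    scale = instructions / window
    return branches, paths, instructions, scale

branches, paths, instructions, scale = eager_execution_cost()
print(branches, paths, instructions, round(scale))  # prints 7 128 896 18
```

So the "18 times" figure is 128 paths times 7 instructions, divided by the 50 instructions a single predicted path would keep in flight.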

    And that is assuming that all branches are simple conditionals, not computed branches (C++ virtual calls, cross-library calls via a PLT, function pointer calls, and so on). You can't execute all of the possible targets for a computed branch, so you'd still need the branch predictor infrastructure to handle this case, so you're not even saving much on hardware.

    A few experimental chips have tried doing this for branches where the predictor doesn't give a high confidence of either path. In this kind of limited use, executing both branches at half speed, rather than executing one with a 50% chance of needing to discard the result, gives slightly better performance.

    --
    I am TheRaven on Soylent News