Slashdot Mirror


MIT's Swarm Chip Architecture Boosts Multi-Core CPUs, Offering Up To 18x Faster Processing (gizmag.com)

An anonymous reader writes from a report via Gizmag: MIT's new Swarm chip could help unleash the power of parallel processing for up to 75-fold speedups, while requiring programmers to write a fraction of the code that is usually necessary for programs to take full advantage of their hardware. Swarm is a 64-core chip developed by Prof. Daniel Sanchez and his team that includes specialized circuitry for both executing and prioritizing tasks in a simple and efficient manner. Neowin reports: "For example, when using multiple cores to process a task, one core might need to access a piece of data that's being used by another core. Developers usually need to write code to avoid these types of conflict, and direct how each part of the task should be processed and split up between the processor's cores. This almost never gets done with normal consumer software, hence the reason why Crysis isn't running better on your new 10-core Intel. Meanwhile, when such optimization does get done, mainly for industrial, scientific and research computers, it takes a lot of effort on the developer's side and efficiency gains may sometimes still be minimal." Swarm is able to take care of all of this, mostly through its hardware architecture and customizable profiles that can be written by developers in a fraction of the time needed for regular multi-core silicon. The 64-core version of Swarm came out on top after MIT researchers tested it out against some highly-optimized parallel processing algorithms, offering three to 18 times faster processing. The most impressive result was when Swarm achieved results 75 times better than the regular chips, because that particular algorithm had failed to be parallelized on classic multi-core processors. There's no indication as to when this technology will be available for consumer devices.

55 comments

  1. Parallelization... by Timothy2.0 · · Score: 5, Insightful

    It's important for the average consumer to realize that not all processing tasks are easily parallelizable, and some downright aren't. In those cases, additional cores aren't going to give you much in the way of speed increases. Of course, your average consumer *doesn't* realize that, and when they go to their favourite big-box store for a new computer, the sales associate isn't going to sit down and discuss the reality of the situation either.
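
    One rough way to put numbers on this point is Amdahl's law, which bounds the speedup when only part of a task can be spread over cores. The sketch below is an illustration added for this thread, not something from the article; the function name and the sample fractions are assumptions chosen for the example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many cores exist;
    # only the parallel part is divided among them.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: fully serial, half parallel, mostly parallel.
    for fraction in (0.0, 0.5, 0.9, 0.99):
        row = ", ".join(
            f"{n} cores: {amdahl_speedup(fraction, n):.2f}x" for n in (2, 4, 64)
        )
        print(f"{fraction:.0%} parallel -> {row}")
```

    Under this model a task that is only half parallelizable can never run more than twice as fast, however many cores the box on the shelf advertises.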

    1. Re:Parallelization... by Anonymous Coward · · Score: 0

      Parallelization can happen at the circuit level too. Every time there's a branch, there exists the possibility to execute *both* paths and throw away the invalid result. It takes extra gates and possibly virtual registers to pull that off. Modern CPUs can have over 100 instructions in-flight.

    2. Re:Parallelization... by rrohbeck · · Score: 1

      Yup, many big cores do that today. It's called speculative execution.

    3. Re:Parallelization... by HiThere · · Score: 2

      While true, multi-processor systems are considerably more responsive while busy with another task. So, e.g., you can be downloading upgrades, compressing files, and word processing all at the same time without penalty. Admittedly, it's hard to see how that particular scenario would be better with 100 cores than with 5 or 6. But a batch of them could be rendering an animation or some such.

      FWIW, I have a task in mind where 1,000 cores would not be overkill, but most users would never do it. However they might be doing local speech understanding, or image recognition. GPUs aren't the only, or even the best, way to do that. They're just currently the cheapest.

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
    4. Re:Parallelization... by justthinkit · · Score: 1

      I think the salesman handles the 'additional cores' question when they ask "What will you be using your computer for?" (while they size up the mark and frankly work on Internal Problem #1 "How much can I upsell to this sucker?")

      --
      I come here for the love
    5. Re: Parallelization... by hackwrench · · Score: 1

      The reason I want multiple cores is due to all the other processes that go on in the background. I know there are programs that shut down background tasks, but who knows?

    6. Re:Parallelization... by Anonymous Coward · · Score: 0

      I'm fairly sure that's been around for about 20 years, ever since the P6 core.

    7. Re:Parallelization... by phantomfive · · Score: 2

      The summary is bad. My understanding of what they did (after reading the article): implemented a shortest-path algorithm in software, but parallelizing it by putting a priority queue into hardware to allocate tasks.

      --
      "First they came for the slanderers and i said nothing."
    8. Re:Parallelization... by willy_me · · Score: 3, Informative

      Branch prediction integrated with the pipeline. Most CPUs do not execute both branches so much as they perform all the work required to quickly switch to the alternate branch should a branch not go as predicted. This implies an alternate pipeline into which the instructions for the alternate branch are queued. This might not sound like much but it actually constitutes >90% of the work a CPU must perform. The ALU is fast and simple but getting the correct data to and from the ALU is challenging.

      CPUs can also support multiple ALUs - but this is not to speed branches. Multiple ALUs are used when the CPU detects that incoming instructions are not dependent on one another and can be executed concurrently. When detected, instructions are executed in parallel. The benefits gained are limited and it comes at the cost of extra transistors. However, because you have less movement of data, power requirements are reduced.

      Look at the Apple A9 CPU compared to alternate multi-core ARM chips that are available. The A9 is just as fast while running fewer cores at a lower clock rate while consuming less power. It is able to do so by using the previously mentioned techniques. It uses billions of transistors and costs more to produce than other chips that are just as fast. Not a good choice for making devices with low profit margins, but an excellent choice if you can afford it.

    9. Re:Parallelization... by Anonymous Coward · · Score: 1

      Most apps that need more processing benefit from multithreading, which you get with multiple cores. Parallel code is when a thread is broken down into mini threads and spread over multiple cores and then recombined to get the result. It creates overhead in a way, but for BIG number crunching it's especially useful.

      If you want to run a simple process as fast as possible, you just run it and it runs and it's done. You can't really benefit from parallel code for most tasks, as you say.

      But you seem to fail to realize that most apps are multithreaded, but not parallel, and that's not because parallelization is so hard to write, it's because the app is already running multiple threads on multiple cores and waiting for the user. You can't speed up waiting on the user reliably. You just have to wait.

      The average consumer should buy lots of cores because THEY ARE CHEAP, simple economics, but you still want to balance GHz and cores. MOST people should err on the side of clock frequency, but more cores are generally so cheap you may as well get them. Of course RISC also changes that by doing more per CPU cycle.

      Now we just need to get coders to remember how to do the most per cpu cycle and BOOM, that's more gain than silly embedded thread management on a chip.

    10. Re:Parallelization... by smallfries · · Score: 1

      How did they claim a 75x speedup using 64 cores?

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    11. Re:Parallelization... by TheRaven64 · · Score: 3, Informative

      No. The P6 does branch prediction. When you get to a branch, the processor guesses which one is taken and executes that. If it guessed wrong, it throws away all of the speculative results. The grandparent is talking about executing both branches. The upside of this is that you never mispredict a branch. The downside is that it's not really feasible and gives a huge increase in power consumption. A modern superscalar processor can easily have 50 instructions in flight at once (the Pentium 4 could have 140, which is partly why it rarely hit its peak performance). You have a branch, on average, every 7 instructions. To fill a pipeline of 50 instructions, you need to speculatively execute past 7 branches. Often these are loops, so branch prediction does a good job. Now imagine that you executed every path. After 7 branches, there are 128 possible places you could be. Each one of those includes an average of 7 instructions, so to be able to do all of that you'd need 18 times as many functional units. Register renaming (which is already one of the largest costs on the chip) would become vastly more complicated. Your processor would need liquid helium poured on it to keep it at a stable temperature. And, at the end of this, you'd still not have much better performance.

      And that is assuming that all branches are simple conditionals, not computed branches (C++ virtual calls, cross-library calls via a PLT, function pointer calls, and so on). You can't execute all of the possible targets for a computed branch, so you'd still need the branch predictor infrastructure to handle this case, so you're not even saving much on hardware.

      A few experimental chips have tried doing this for branches where the predictor doesn't give a high confidence of either path. In this kind of limited use, executing both branches at half speed, rather than executing one with a 50% chance of needing to discard the result, gives slightly better performance.
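
      The back-of-envelope figures in this comment (7 branches, 128 paths, roughly 18 times the functional units) can be reproduced with a few lines of arithmetic. This is only a sketch of the comment's own simplification, which counts the instructions at the end points of the path tree; the function name and defaults are assumptions made for the illustration.

```python
def eager_execution_cost(window=50, instrs_per_branch=7):
    """Estimate the cost of executing every branch path instead of predicting one.

    Follows the comment's simplification: count only the instructions at the
    2**branches end points, not the ones shared along the way.
    """
    branches = round(window / instrs_per_branch)   # 50 / 7 is about 7 branches
    paths = 2 ** branches                          # each branch doubles the paths
    instructions = paths * instrs_per_branch       # work in flight across all paths
    return branches, paths, instructions, instructions / window


if __name__ == "__main__":
    branches, paths, instructions, ratio = eager_execution_cost()
    print(f"{branches} branches -> {paths} paths, {instructions} instructions, "
          f"about {ratio:.0f}x the work of a 50-instruction window")
```

      The exponential term is the whole story: each additional branch passed doubles the number of paths, which is why predicting a single path and occasionally discarding it is the cheaper design.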

      --
      I am TheRaven on Soylent News
    12. Re:Parallelization... by TheRaven64 · · Score: 1

      I didn't read TFA, but there are quite a lot of algorithms that exhibit superlinear speedup. The costs for a parallel algorithm are typically related to communication, but there's nothing magical about a sequential algorithm that means that it doesn't have its own costs. Storing temporary results and managing the queue of work to do are still requirements for a sequential one and often the parallel version can benefit from better locality of reference and so make better use of caches.

      --
      I am TheRaven on Soylent News
    13. Re:Parallelization... by Anonymous Coward · · Score: 0

      "And in one case, Swarm achieved a 75-fold speedup on a program that computer scientists had so far failed to parallelize."

      FTFA

    14. Re:Parallelization... by cdrudge · · Score: 1

      Don't underestimate the overhead expense of context switching.

    15. Re:Parallelization... by JanneM · · Score: 1

      When you subdivide a problem, each core works on a smaller subset. If those subsets fit into a cache that the bigger problem didn't, you can easily get superlinear increase as a result. In many cases you could actually rewrite the bigger problem to be more cache-friendly and get a similar speedup, so you generally don't make much of such "extra" performance increases.

      --
      Trust the Computer. The Computer is your friend.
    16. Re:Parallelization... by Big Hairy Ian · · Score: 1

      One wonders how much of a speed boost those algorithms would have got if they'd written them to run on a fairly average GPU

      --

      Build a Man a Fire, and He'll Be Warm for a Day. Set a Man on Fire, and He'll Be Warm for the Rest of His Life.

    17. Re:Parallelization... by phantomfive · · Score: 1

      You have a branch, on average, every 7 instructions. To fill a pipeline of 50 instructions, you need to speculatively execute past 7 branches.

      Oh, that gives me an idea of spacing my branches out better to speed things up. At least experimenting with it.

      --
      "First they came for the slanderers and i said nothing."
    18. Re:Parallelization... by TheRaven64 · · Score: 1

      Spacing them out will reduce the number of branches that need to be predicted, but it will also increase the cost of a single mispredicted branch.

      --
      I am TheRaven on Soylent News
    19. Re:Parallelization... by phantomfive · · Score: 1

      A lot of times that's ok.
      In any case, if you're going for efficiency, it's worth experimenting with.

      --
      "First they came for the slanderers and i said nothing."
    20. Re: Parallelization... by Anonymous Coward · · Score: 0

      How, not repeat the same fucking thing he just read.

  2. Which algorithm? by Anonymous Coward · · Score: 0

    The most impressive result was when Swarm achieved results 75 times better than the regular chips, because that particular algorithm had failed to be parallelized on classic multi-core processors.

    It'd be nice to know what this architecture excels at.

  3. Special-Purpose chips by ThosLives · · Score: 3, Insightful

    I guess the world is rediscovering that special-purpose chips will always be faster at their special purpose than a general-purpose chip will be.

    --
    "There are a dozen opinions on a matter until you know the truth. Then there is only one." - CS Lewis (paraprhase)
    1. Re:Special-Purpose chips by Anonymous Coward · · Score: 0

      These are intended to be general purpose chips. -PCP

    2. Re:Special-Purpose chips by TheRaven64 · · Score: 1

      The win for special-purpose chips has always been obvious. The recent change is in the economics. It used to be very expensive to have any functionality in an IC. One of the driving forces behind the original RISC and VLIW chips was to devote as much of your transistor budget to execution units and remove anything that didn't directly contribute to performance. Now, the economics are quite different. Transistors are cheap but power dissipation is hard. It's easy to stick more execution units on an SoC, but it's very hard to stay within your power budget. Specialised processors that use less power for a specific task and are turned off the rest of the time are a big win. For example, a lot of ARM SoCs for mobile use include a face detection algorithm as a discrete logic block: you write image data to it and read back a list of rectangles. The ARM core and the GPU would both be sufficiently powerful to do this entirely in software, but the coprocessor uses a fraction of the power (and, in a big.LITTLE configuration, means that the photo app can run on the LITTLE core).

      --
      I am TheRaven on Soylent News
  4. can't this hardware be translated to software? by sittingnut · · Score: 3, Interesting

    i am dumb on this, but if 'hardware architecture' can be made to take care of avoiding conflicts and "direct how each part of the task should be processed and split up between the processor's cores", same can be done through software that imitate whatever 'hardware architecture' is doing?
    if this can be done, basically this software would be another step in compiling/assembling process?

    as i said, i am ignorant on this, but why not?

    1. Re:can't this hardware be translated to software? by complete loony · · Score: 2

      I've only had a quick look at their press release, is there a pre-print of their paper anywhere?

      This looks like a hardware implementation of something like "Grand Central Dispatch". Combined with transactional memory.

      The basic idea seems to be that you can take a serial-ish process, break it up into tasks. Start running the first few tasks that should obviously run first. Then if you have spare CPU cores, you can also start speculatively executing later tasks. But if these speculative tasks hit a conflict in the transactional memory model, the results will be thrown away.

      So you might see a massive win from running those tasks early. But at worst, you'll still run every task in order.

      IMHO getting any kind of speed boost is going to depend on hardware support. But there might be a way to do something similar with OS kernel support.
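
      The idea described above (start later tasks early, throw the result away on a conflict, and at worst fall back to running in order) can be emulated in software to see how it behaves. The sketch below is a sequential toy model written for this thread, not Swarm's actual mechanism or API; `Txn`, `run_speculatively`, and the sample tasks are all hypothetical names.

```python
class Txn:
    """Records what a speculative task reads and buffers what it writes."""

    def __init__(self, memory):
        self._memory = memory
        self.reads = set()
        self.writes = {}

    def read(self, key):
        if key in self.writes:            # a task sees its own buffered writes
            return self.writes[key]
        self.reads.add(key)
        return self._memory.get(key, 0)   # missing keys default to 0

    def write(self, key, value):
        self.writes[key] = value          # buffered until commit


def run_speculatively(tasks, memory, width=4):
    """Run `tasks` in program order, `width` at a time; return the abort count."""
    aborts = 0
    for start in range(0, len(tasks), width):
        batch = tasks[start:start + width]
        # Speculation phase: every task in the batch sees the same starting
        # memory, as if all of them had been launched at once on idle cores.
        txns = []
        for task in batch:
            txn = Txn(memory)
            task(txn)
            txns.append(txn)
        # Commit phase: strictly in program order.
        dirty = set()                     # keys written by earlier tasks in this batch
        for task, txn in zip(batch, txns):
            if txn.reads & dirty:
                # Conflict: the task read a value an earlier task has since
                # changed, so discard its result and re-run it on current memory.
                aborts += 1
                txn = Txn(memory)
                task(txn)
            memory.update(txn.writes)
            dirty.update(txn.writes)
    return aborts


def add(key, amount):
    """Task factory: increment `key` by `amount`."""
    def task(txn):
        txn.write(key, txn.read(key) + amount)
    return task


def copy_value(src, dst):
    """Task factory: copy the value at `src` into `dst`."""
    def task(txn):
        txn.write(dst, txn.read(src))
    return task


if __name__ == "__main__":
    memory = {}
    tasks = [add("a", 1), add("b", 2), copy_value("a", "c"), add("d", 5)]
    aborts = run_speculatively(tasks, memory)
    print(memory, "aborts:", aborts)
```

      Because commits happen in program order and conflicting tasks are re-run, the final memory always matches a plain in-order run; the only thing that varies is how much speculative work gets thrown away. In software the bookkeeping costs far more than the tiny tasks it tracks, which is the argument for doing it in hardware.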

      --
      09F91102 no, 455FE104 nope, F190A1E8 uh-uh, 7A5F8A09 that's not it, C87294CE no. Ah! 452F6E403CDF10714E41DFAA257D313F.
    2. Re:can't this hardware be translated to software? by Gravis Zero · · Score: 2

      i am dumb on this, but if 'hardware architecture' can be made to take care of avoiding conflicts and "direct how each part of the task should be processed and split up between the processor's cores", same can be done through software that imitate whatever 'hardware architecture' is doing?

      From reading the MIT page, I gather that it should be possible but it would result in substantial overhead. The bloom filter alone would also need its own core.

      if this can be done, basically this software would be another step in compiling/assembling process?

      Yes, however, this would not be helpful for 99% of software because most software simply cannot benefit from parallel processing. The one area that benefits the most from parallel processing is graphics, specifically manipulation and rendering. That said, where this may be able to help is in creating a better GPU, so it should be no surprise that one of the professors working on this is also a senior researcher for NVidia.
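
      For readers unfamiliar with the Bloom filter mentioned above, here is a minimal software version. The thread does not say how Swarm uses its filter; the assumption in this sketch is that it tracks which addresses a task has touched so that conflicts can be checked cheaply. The class name, sizes, and hashing scheme are illustrative choices, not taken from the MIT design.

```python
import hashlib


class BloomFilter:
    """Compact set membership test: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                     # a Python int used as a bit array

    def _positions(self, item):
        # Derive `num_hashes` bit positions by hashing the item with different seeds.
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "little") + data).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


if __name__ == "__main__":
    touched = BloomFilter()
    for address in (0x1000, 0x1040, 0x2000):   # hypothetical addresses
        touched.add(address)
    print(touched.might_contain(0x1000))       # an address that was added
    print(touched.might_contain(0x9999))       # almost certainly absent
```

      Each lookup costs several hash computations in software, whereas dedicated circuitry can do the same check in parallel, which is consistent with the overhead concern raised in this comment.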

      --
      Anons need not reply. Questions end with a question mark.
    3. Re:can't this hardware be translated to software? by KingMotley · · Score: 1

      I am going to assume that you are new to programming, because claiming that most software cannot benefit from parallel processing is hilariously false. It's just most programmers can't do it, or do it well that is the issue. Almost all software today can benefit from parallel processing, it's just a matter of how much, and if it is worth the expense of actually getting a programmer who can do it rather than throwing 10 code monkeys in a room to bang out barely functional code.

    4. Re:can't this hardware be translated to software? by imgod2u · · Score: 1

      http://livinglab.mit.edu/wp-co...

      They use individual cores to speculatively execute very short sequences of instructions, for instance, a function call or loop iteration. The algorithms they benchmark resemble the architecture -- where there's a lot of very small code sequences that aren't usually very dependent on each other, however the individual code sequences aren't large enough for traditional thread-based solutions with high-synchronization overhead to work.

      One wonders how this would compare to a lock-free implementation on modern multi-core server processors with transactional memory.

      One thing to note is that the algorithms they evaluate are largely biased towards huge numbers of tasks that are known ahead of time. There is a large amount of potential parallelism, but traditional synchronized thread-based methods add too much overhead since each task is very small. General-purpose computing (for instance, browser code) contains scarce little of such algorithms, though.

    5. Re:can't this hardware be translated to software? by Gravis Zero · · Score: 1

      your arrogance is astounding.

      --
      Anons need not reply. Questions end with a question mark.
    6. Re:can't this hardware be translated to software? by Anonymous Coward · · Score: 0

      In a very real way he is correct, but of course there are many issues that are serial in nature. Async concurrency can help a lot of "serial" programs and many times you can add parallelism for a speedup, but rarely major speed-ups. I know some pretty good programmers, but even most of them have issues seeing situations where something could be done in parallel.

      Parallelism really isn't that hard, but I know extremely few programmers who find it easy. Most need their hands held.

    7. Re:can't this hardware be translated to software? by Anonymous Coward · · Score: 0

      The work in these algorithms is discovered dynamically. Tasks create other tasks. What this architecture guarantees is that the execution will appear to follow a valid sequential order, as defined by the programmer using timestamps during task creation. This lets them use algorithms that rely on order for correctness which are asymptotically more efficient than the standard parallel (and unordered) variants.
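
      The programming model described here, where tasks carry timestamps, create other tasks, and appear to run in timestamp order, maps naturally onto the shortest-path example mentioned elsewhere in the thread. The sketch below is a sequential software emulation with an ordinary heap standing in for the hardware task queue; it is an illustration under those assumptions, not Swarm's API, and `ordered_sssp` is a hypothetical name.

```python
import heapq
import itertools


def ordered_sssp(graph, source):
    """Dijkstra-style shortest paths written as timestamped tasks that spawn tasks.

    `graph` maps a node to a list of (neighbor, weight) pairs with weights >= 0.
    """
    dist = {}
    tie = itertools.count()                  # tie-breaker so the heap never compares nodes
    tasks = [(0, next(tie), source)]         # each task is (timestamp, tie, node)
    while tasks:
        timestamp, _, node = heapq.heappop(tasks)   # earliest timestamp runs first
        if node in dist:
            continue                         # an earlier task already settled this node
        dist[node] = timestamp               # the timestamp is the shortest distance
        for neighbor, weight in graph.get(node, ()):
            if neighbor not in dist:
                # Child task: its timestamp is the tentative distance to the neighbor.
                heapq.heappush(tasks, (timestamp + weight, next(tie), neighbor))
    return dist


if __name__ == "__main__":
    graph = {
        "a": [("b", 4), ("c", 1)],
        "b": [("d", 1)],
        "c": [("b", 2), ("d", 7)],
        "d": [],
    }
    print(ordered_sssp(graph, "a"))
```

      The correctness of this algorithm depends on tasks appearing to run in timestamp order, which is what makes it awkward to parallelize with ordinary threads. As the comment explains, the hardware's job is to preserve that apparent order while running many of these small tasks at once.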

  5. profanity can help by Anonymous Coward · · Score: 0, Offtopic

    I once called a cable company known for cruddy service and the automated system told me the wait would be a while. I told the automated system using as much loud and colorful language as possible that I did not want their service anymore and to take the equipment out. A very nice rep immediately came on the phone, I was equally nice back to her and the problem was resolved.

    1. Re: profanity can help by Anonymous Coward · · Score: 0

      That's coincidence, not that someone silently listens to people on hold and takes the angry calls.

      Are you retarded?

  6. Once it's parallelized... by Anonymous Coward · · Score: 0

    and recompiled, Wordstar should be really, really fast on this device.

  7. awesome by Anonymous Coward · · Score: 0

    can't wait to try it.

  8. Hmm this sounds like it should be software by Crashmarik · · Score: 1

    Sounds much more like something that should be refinements to code generation than baked into chip architecture. That said it's good to see work being done on better parallel methods rather than just bigger.

    1. Re:Hmm this sounds like it should be software by Anonymous Coward · · Score: 0

      I think it has to be done in hardware to be fast enough for the specific purpose. It has to execute and manage this code faster than the CPU and memory. So they are using a special chip and their own math, let's say. I'm not sure if it's RISC or a new instruction set, but you can VASTLY lower latency of a chip when you reduce the instruction set.

      This chip could be MANY times faster at this task than any x86 chip could ever hope to be. For this application it might take that excessive amount of low latency 'decisions' making.

      People don't consider that x86 chips are very very slow. Even RISC is very very slow in what it can do per cycle because it can do so much. If you make a chip that just does one or two things, it can blow away a general purpose CPU.

      Similar to how a GPU can blow away a CPU or a bitcoin miner can blow away either.

      I think we do need much smarter compilers to get the full benefit, but basically there isn't much demand for that because CPUs are cheap and fast. It makes more sense to go back and focus on actual coding skills and compilers for performance gain... as GHz is topping out. Intel has said it's going for lower power use, not highest GHz or performance.

      The future of computing has to be in coding for now. This chip seems like a stop-gap solution or for high-demand solutions, not for desktops. The article is probably misleading us for the sake of getting more views.

  9. Any hardware can be software. Doesn't mean it shou by raymorris · · Score: 2

    Sure it -could- be done in software. Essentially any design can be implemented as hardware, software, or a hybrid of the two. (A major problem for those complaining about "software patents".) I wouldn't be surprised if someone does take some of their ideas and implement them in software.

    In general, hardware will be faster and in some ways more reliable than a software implementation of the same algorithm. It also means software doesn't have to be recompiled for lots of different types of hardware, if the hardware hides the differences.

  10. Re:Any hardware can be software. Doesn't mean it s by Anonymous Coward · · Score: 0

    This.

  11. aah? by Anonymous Coward · · Score: 0

    75 times faster than WHAT?

  12. aah? by fubarrr · · Score: 0

    75 times speedups over WHAT?

  13. Not as you describe it by Anonymous Coward · · Score: 1

    If this hardware does something that could be done at compile time, it is IMHO indeed pretty useless. That's why I hope it is "runtime-smart", meaning that it reacts to data access conflicts as they actually happen while the program is running. That would be something that is much harder to achieve, in an efficient manner at least, through software. The talk about the profiles devs have to declare doesn't sound good to me: people who don't bother writing software that uses proper locking or libs implementing Actors/DataFlow/STM or whatever will probably not care about defining proper profiles for this thing either.

  14. an older paper describing Swarm by joris.w · · Score: 3, Informative
  15. Processes vs threads by DidgetMaster · · Score: 2

    There are two ways that multiple cores can help the average user. First, they allow multiple different processes to run at the same time. You can run a word processor, spreadsheet, browser, etc. all at once. Unless each of these processes is waiting on the same resource (e.g. all trying to write to the disk at the same time, or waiting for the user to press a key) then they can complete tasks much faster than a machine with fewer cores.

    Second, they allow a single program to do more than one thing at a time. Lots of programs will have a separate thread to handle the user interface while another does background tasks, but few will try and break big tasks into multiple pieces. For example, many database programs will be able to run several independent queries at the same time, but few will run a single query faster on a multi-core machine than on a single core one.

    I am working on a new data management system that does both. It can let lots of queries run at the same time, and it can break a single query into smaller pieces. The more cores the better. A query that takes 1 minute on a single core can often do the same thing in about 1/5 the time on a quad core (8 threads).

  16. Headline has 18% More Unresolved References... by BrendaEM · · Score: 1

    ...than other articles. No, really, "more" and "less" only work when comparing things.

    --
    https://www.youtube.com/c/BrendaEM
  17. Re:Any hardware can be software. Doesn't mean it s by Immerman · · Score: 1

    Actually, not so much a problem for software patents. Software is, generally speaking, a general solution, an algorithm. That is to say, math - something explicitly exempted from patent protection because it would inherently be overbroad and cut off all further development in that direction. Hardware is a machine - a specific implementation. Make some slight modifications, and it's no longer protected by the original patent.

    If software patents followed the same rules as hardware patents they'd be far less of a problem, but that would pretty much require source code since design and implementation are practically synonymous in software. But in that case any non-trivial changes to the source code would then be recognized as a different invention not covered by the original patent. Basically copyright already offers broader protection than you would get from "legitimate" software patents adhering to the same limitations as hardware patents.

    --
    --- Most topics have many sides worth arguing, allow me to take one opposite you.
  18. Crysis? by SoftwareArtist · · Score: 1

    I'm pretty sure the developers of Crysis did put in the work to parallelize it effectively. Game engines are one of the most heavily optimized types of software out there, and CryEngine is one of the fastest game engines out there.

    --
    "I'm too busy to research this and form an educated opinion, but I do have time to tell everyone my uninformed opinion."
  19. Look up Verilog, SystemC by raymorris · · Score: 1

    To your point, look up "SystemC". It's the C programming language, used to write programs which are often compiled as pure hardware. Often, but not always - the same code can be rendered as either pure hardware or pure software. See also Verilog and PLAs. PLAs start and end their life as pure hardware devices. In between, connections in the hardware are destroyed to create a new hardware array as specified by programming language code.

    What you're missing is that any algorithm, most any code, can be compiled either as an object file (what you'd call "software"), as pure hardware (see PLAs and ASICs), or anything in between. I can write C code and I don't know which users will render it as hardware and which will render it as software. The distinction you're trying to make between hardware and software simply doesn't exist in practice.

    Fortunately it doesn't NEED to exist, because your need to distinguish the two is based on a misleading description of patent law. What I'm about to explain isn't what I WISH the law said, and it's probably not what you WISH it said, I'm going to explain what the law ACTUALLY says. Please don't bitch at me if you don't like it. I didn't write the law, I just read it.

    The law says "the laws of nature, including the LAWS of science and the LAWS OF MATHEMATICS, may not be patented." So you can't patent the laws of science, such as Newton's laws. You can't patent gravity. You CAN patent a new type of elevator, which USES gravity in a useful new way. You can't patent heat. You CAN patent a new type of oven, which uses thermodynamics in a useful new way. You can't patent the commutative law, a+b=b+a. You CAN patent a new way of predicting traffic flow which uses addition, multiplication, etc. That is not my preference, that is the law.

    1. Re:Look up Verilog, SystemC by Immerman · · Score: 1

      The thing is - the compiler could potentially generate a long list of different binaries or hardware configurations that all result in the same functionality within some performance envelope. As hardware, every one of those different assemblies would potentially require a separate patent as it does the same thing in a different manner, and hardware patents only protect specific implementations. As a software patent though, as they stand now, you don't even need to offer the source code that could generate all those implementations - and often not even a specific algorithm. As such they are *radically* broader than hardware patents, and generally prevent competitors from accomplishing the same thing in a different manner, with the resultant heavy chilling effect on progress.

      --
      --- Most topics have many sides worth arguing, allow me to take one opposite you.
  20. Mandatory comment by phrackthat · · Score: 1

    Could you imagine a Beowulf cluster of these?