Slashdot Mirror


Wintel, Universities Team On Parallel Programming

kamlapati writes in with a followup from the news last month that Microsoft and Intel are funding a laboratory for research into parallel computing at UC Berkeley. The new development is the imminent delivery of the FPGA-based Berkeley Emulation Engine version 3 (BEE3) that will allow researchers to emulate systems with up to 1,000 cores in order to explore approaches to parallel programming. A Microsoft researcher called BEE3 "a Swiss Army knife of computer research tools."

21 of 91 comments (clear)

  1. "stuck with a ...serial programming model" by BadAnalogyGuy · · Score: 5, Insightful

    It's a little disingenuous to claim that programmers are "stuck" with a serial programming model. The fact of the matter is that multi-threaded programming is a common paradigm which takes advantage of multiple cores just fine. Additionally, many algorithms cannot be parallelized.

    Even languages like Erlang which bring parallelization right to the front of the language are still stuck running serial operations serially. There is sometimes no way around doing something sequentially.

    Now, can we blow a few cycles on a few cores trying to predict which operations will get executed next? Yeah, sure, but that's not a programming problem, it's a hardware design problem.

    1. Re:"stuck with a ...serial programming model" by gdgib · · Score: 4, Insightful

      ParLab is so not about branch predictors and out-of-order execution. As you say, that's a hardware design problem and a solved one at that. Boring.

      While I'll agree that not all programmers are stuck with the serial programming model, threads aren't exactly a great solution (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html). They're heavyweight and inefficient compared to running most algorithms on e.g. bare hardware or even an FPGA. Plus they deal badly with embarrasing parallelism (http://en.wikipedia.org/wiki/Embarrassingly_parallel). And finally they are HARD to use, the programmer must explicitly manage the parallelism by creating, synchronizing and destroying threads.

      Setting aside those problems which exhibit no parallelism (for whom there is no solution but a faster CPU really), there are many classes of problems which would benefit enormously from better programming models, which are more efficiently tied to the operating system and hardware rather than going through an OS level threading package.

    2. Re:"stuck with a ...serial programming model" by robizzle · · Score: 4, Insightful

      You are right, programmers aren't currently "stuck" with a serial programming model; however, looking into the future it is pretty clear that the hardware is developing faster than the programming models. Systems with dozens and dozens of cores aren't far off but we really don't have a good way to take advantage of all the cores.

      In the very near future, we could potentially have systems with hundreds of cores that sit idle all the time because none of the software takes advantage of much more than 5-10 cores. Of course, this would never actually happen, because once the hardware manufacturers notice this to be a problem, they will stop increasing the number of cores and try to make some other changes that would result in increased performance to the end user. There will always be a bottleneck -- either the software paradigms or the hardware and right now it looks like in the near future it will be the software.

      Yes, there are some algorithms that no matter what you do have to be executed sequentially. However, there is a huge truckload of algorithms that can be rewritten, with little added complexity, to take advantage of parallel computing. Furthermore, there is a slew of algorithms that could be rewritten with a slight loss in efficiency to be parallelized but with a net gain in performance. This third type of algorithm is what I think the most interesting is for researchers -- Even though parallelizing the algorithm may introduce redundant calculations or added work, the increased number of workers outweighs this.

      In other words, what is more efficient: 1 core that performs 20,000 instructions in 1 second or 5 cores that each perform 7,000 instructions, in parallel, in 0.35 seconds. Perhaps surprisingly to you, the single core is more efficient (20,000 instructions instead of 7,000*5 = 35,000 instructions) -- BUT, if we have the extra 4 cores sitting around doing nothing anyways, we may as well introduce inefficiency and finish the algorithm about 2.9 times faster.

    3. Re:"stuck with a ...serial programming model" by asills · · Score: 2, Interesting

      Threads are harder just like memory management in C++ is harder than Java and .NET.

      It's the people who really can't program that are having significant trouble with parallelization in modern applications. That's not to say that in the future I won't love to be able to express a solution and have it automatically parallelized, but for the time being creating applications that take advantage of multiple cores well (server apps, not client apps) is not that difficult if you know what you're doing.

      Though, like C++ with memory leaking, it is possible to shoot yourself in the foot with a deadlock occasionally.

      --
      -- What did Spock find in Kirk's toilet? The captain's log.
    4. Re:"stuck with a ...serial programming model" by 0xABADC0DA · · Score: 3, Interesting

      Setting aside those problems which exhibit no parallelism (for whom there is no solution but a faster CPU really), there are many classes of problems which would benefit enormously from better programming models, which are more efficiently tied to the operating system and hardware rather than going through an OS level threading package. The programming models we have are just fine. By far the vast majority of program time is spent in a loop of some kind, but languages which could easily parallelize loops don't. There is no reason why 'foreach' or 'collect' cannot use other processors (whereas 'inorder' or 'inject' would always be sequential). So our programming models are not the problem. The real problem is trying to use them with a 40 year old operating system design.

      Current operating system could run code in parallel if they stop scheduling threads a timeslice on a processor but instead schedule a timeslice across multiple processors. Take an array of 1000 strings and a regex to match them against. If the program is allocated 10 processors it can do a simple interrupt and have them start working on 100 strings each. By having the processors allocated can you avoid the overhead of switching memory spaces and of scheduling, making this kind of fine-grained parallelism feasible.

      But the problem here is that most programs will use one or two processors most of the time and all the available processors at other times. And if your parallel operation had to synchronize at some point then you'd have all your other allocated processors doing nothing while waiting for one to finish with its current work. So there is a huge amount of wasted time by allocating a thread to more than one processor.

      A solution to the unused processor problem is to have a single memory space, and so as a consequence only run typesafe code -- an operating system like JavaOS or Singularity or JXOS. This lets any processor be interrupted quickly to run any process's code in parallel, so CPU's can be dynamically assigned to different threads. Even small loops can be effectively run across many CPUs, and there is no waste from the heavyweight allocations and clunkiness that is caused ultimately by separate memory spaces needed to protect C-style programs from each other. This is why it is the operating system, not the programming models, that is the main problem.
    5. Re:"stuck with a ...serial programming model" by CustomDesigned · · Score: 2, Insightful
      My favorite massively parallel programming system is LINDA, and the Java distributed equivalent, Javaspaces. The idea is basically a job jar. For instance, a 3D ray tracer would put each output pixel in the job jar, and worker threads grab a pixel and trace it. (Naturally, the pixel coords can be generated algorithmically rather than actually stored). Even though the time to trace a pixel varies widely, all workers are kept at capacity. Watching it raytrace a scene in a fraction of a second is like watching a random fade - the pixels appear in essentially random order.

      I suspect that the job jar is the bottle neck for the LINDA approach, and further research is required. But the concept is really easy to work with.

  2. Re:1000 cores? by rthille · · Score: 4, Informative

    The point of the Berkeley program is to come up with toolsets so you don't have to "juggle 1000 cores in your head". Instead, you describe, using the toolset, the problem in a way which is decomposable, and the tools spread the work over the 1000+ cores. No more worrying if you incremented that semaphore correctly because you're operating at a much higher level.

    --
    Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
  3. Linus Torvolds & Dave Patterson discuss it on by elwinc · · Score: 3, Interesting

    Actually, this is old news. There's a month old discussion thread on RWT Discussion forum. Berkeley proposes the "thirteen dwarfs" - 13 kinds of test algorithms they consider valuable to parallelize. Linus doesn't think the 13 dwarfs correspond well to everyday computing loads. My 2 cents: Intel & others are spending hundreds of millions of bucks per year trying to speed up single-thread style computing, so it's not a bad idea to put a few more million/year into thousand thread computing.

    --
    --- Often in error; never in doubt!
  4. Real Information by gdgib · · Score: 5, Informative
    The real websites are:
    ParLab (what's being funded): http://parlab.eecs.berkeley.edu/
    RAMP (the people who are building the architectural simulators for ParLab): http://ramp.eecs.berkeley.edu/
    BEE2 (the precursor to the not-quite-so-microsoft BEE3): http://bee2.eecs.berkeley.edu/

    The funding being announced here is for ParLab whose mission is to "solve the parallel programming problem". Basically they want to design new architectures, operating systems and languages. And before you get all "we tried that an it didn't work" there are some genuinely new ideas here and the wherewithall to make them work. ParLab grew out of the Berkeley View report (http://view.eecs.berkeley.edu/) which was the work of very large group of people to standardize on the same language and figure out what the problems in parallel computing were. This included everyone from architecture to applications (e.g. the music department).

    RAMP is a multi-university group working to build architectural simulators in FPGAs. In fact you can go download one such system right now called RAMP Blue (http://ramp.eecs.berkeley.edu/index.php?downloads). With ParLab starting up there will be another project RAMP Gold which will build a similar simulator but specifically designed for the architectures ParLab will be experimenting with.

    As a side note, keep in mind when you read articles like this that statements like the "Microsoft BEE3" are amusing when you take in to account that "B.E.E." standards for Berkeley Emulation Engine. Microsoft did a lot of the work and did a good job of it, but still...

  5. Re:cool by MightyYar · · Score: 4, Funny

    They are actually trying to model hell on earth. Microsoft provides the evil and Intel the heat.

    (Okay, the joke would have worked better in the P4 days.)

    --
    W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.
  6. Re:Why 1000 ? by gdgib · · Score: 4, Informative

    Actually RAMP Blue (the precursor to what ParLab used) had 1008. Good times. ahref=http://ramp.eecs.berkeley.edu/index.php?picturesrel=url2html-8126http://ramp.eecs.berkeley.edu/index.php?pictures>

    Basically 1000 is the goal, anything over that is a bonus. And yes, we like powers of 2 as much as you.

  7. Re:1000 cores? by SeekerDarksteel · · Score: 3, Insightful

    1) Quantum computing != parallel computing.

    2) A significant number of applications can and do run on 1000+ cores. Sure, most are scientific apps rather than consumer apps, but there is a market for it nevertheless. Go tell a high performance computing guy that there's no need for 1k cores on a single chip and watch him collapse laughing at you.

    --
    The laws of probability forbid it!
  8. Re:1000 cores? by Anonymous Coward · · Score: 4, Funny

    640 Cores should be enough for anybody.

  9. Cheap Bastards. by cyc · · Score: 4, Interesting

    Rick Merritt, who wrote the lead article also posted an opinion piece in EE Times lambasting Wintel for their lackluster funding efforts in parallel programming. I thoroughly agree with this guy. To quote:

    Wintel should not just tease multiple researchers with a $10 million grant awarded to one institution. They need to significantly up the ante and fund multiple efforts.

    Ten million is a drop in the bucket of the R&D budgets at Intel and Microsoft. You have to wonder about who is piloting the ship in Redmond these days when the company can afford a $44 billion bid for Yahoo to try to bolster its position in Web search but only spends $10 million to attack a needed breakthrough to save its core Windows business.
  10. Use your GPU by TheSync · · Score: 4, Interesting

    If you have a GeForce 8800 GT, you already have a 112 processor parallel computer that you can program using CUDA.

  11. Re:kinda silly by gdgib · · Score: 3, Insightful

    Interestingly enough, Dave Patterson http://www.eecs.berkeley.edu/Faculty/Homepages/patterson.html, once president of ACM http://membernet.acm.org/public/membernet/storypage_2.cfm?ci=June_2006&announcement=1&CFID=1668767&CFTOKEN=37941036 was once on a project to do that http://iram.cs.berkeley.edu/. Now he's working on ParLab http://parlab.eecs.berkeley.edu/. I don't always agree with him (and vice versa) but he's nobody's fool.

    Faith, young grasshopper...

    If you want a more technical reason DRAM and CPU's don't go together, spend an informative hour looking up the IC fab process for CMOS logic (CPUs) and DRAM. They're VERY VERY different. DRAM needs capacitory density to get the price-per-bit down so they use their own custom fabs optimized for that. This makes it really hard to fit lots of logic and DRAM on to one chip.

  12. Reconfigurable Computing / FPGA Acceleration by fpgaprogrammer · · Score: 3, Informative

    The BEE boards are being trumpeted as multicore experimentation environment, but the FPGA itself is a powerful computational engine in its own right. FPGAs have to overcome the inertia of their history as verification tools for ASIC designs if they want to grow into being algorithm executers in their own right.

    There's a growing community of FPGA programmers making accelerators for supercomputing applications. DRC (www.drccomputing.com) and XtremeData (www.xtremedatainc.com) both make co-processors for Opteron sockets with HyperTransport connections, and Cray uses these FPGA accelerators in their latest machines. There is even an active open standards body (www.openfpga.org).

    FPGAs and multicore BOTH suffer from the lack of a good programming model. Any good programming model for multicore chips will also be a good programming model for FPGA devices. The underlying similarity here is the need to place dataflow graphs into a lattice of cells (be they fine-grained cells like FPGA CLBs or coarse-grained cells like a multicore processor). I can make a convincing argument that spreadsheets will be both the programming model and killer-app for future parallel computers: think scales with cells.

    I've kept a blog on this stuff if you still care: fpgacomputing.blogspot.com

  13. Parallel Computing is not magic by technienerd · · Score: 4, Insightful

    I'm about to start a graduate degree in this area so I'm a little biased. However, I think a lot of problems can be solved in parallel. For example, maybe, LZW compression as it's implemented in the "zip" format might not be easily parallelizable but that doesn't prevent us from developing a compression algorithm with parallelism in mind. I did some undergraduate research in parallel search algorithms and I know for a fact that there are many, many ways you can parallelize search. Frankly, saying that you can't parallelize algorithms is a bit closed minded. Many problems don't inherently require serial solutions, it's just current algorithms handle them that way. Rather than trying to implement existing algorithms on a massively parallel processor, you want to re-tackle the problem under a new model, a model of an arbitrary number of processors. You build around the idea of data-parallelism rather than task-parallelism. Many, many things are possible under this model and I think it's naive to think otherwise. You don't need to think, how do I juggle 1000 threads around, you think, how do I take a problem, break it up into arbitrarily many chunks and distribute those chunks to an arbitrary number of processors and how do I do all that scheduling efficiently? This model doesn't work for interactive tasks mind you (where you're waiting for user input), but I'm very confident a model can be developed that can.

  14. Fine-Grain Parallelism Is the Future, Not Threads by MOBE2001 · · Score: 2, Insightful

    Instead, you describe, using the toolset, the problem in a way which is decomposable, and the tools spread the work over the 1000+ cores.

    One day soon, the computer industry will realize that, 150 years after Charles Babbage invented his idea of a general purpose sequential computer, it is time to move on and change to a new computing model. The industry will be dragged kicking and screaming into the 21st century. Threads were not originally intended to be the basis of a parallel software model but only a mechanism for running multiple sequential (not parallel) programs concurrently. The multithreading approach to parallel computing embraced by Intel, AMD and Microsoft is a disaster in the making because the future of computing is not multithreaded. See Nightmare on Core Street for more.

  15. Re:Almost by kramulous · · Score: 2, Funny

    for a old priest, a young priest and a couple of virgins Oohh, Oohh. I know this one! First you take the virgins and the old priest across, and then ... dammit! Young priest masturbates.

    First you take the ... dammit!

    I wish I had more than one boat. Taking all across concurrently would be easier.
    --
    .
  16. Re:Fine-Grain Parallelism Is the Future, Not Threa by MOBE2001 · · Score: 2, Insightful

    I don't think anyone except you said anything about threads. You may have just described exactly what the GP was describing -- point is, why should you have to break them down into individual programs yourself?

    This is precisely what is wrong with the current approach. The idea that the problem should be addressed from the top down has been tried for decades and has failed miserably. The idea that we should continue to write implicitly sequential programs and have some tool extract the parallelism by dividing the program into concurrent threads is completely ass-backwards, IMO. We should start with parallel elements and build implicitly parallel modules from those elements. Nothing needs to be broken down. Hardware engineers have been doing it for years and it works.

    Personally, I like Erlang, but the point is the same -- come up with a toolset and/or programming paradigm which makes scaling to thousands of cores easy and natural.

    Erlang is not the answer for many reasons, otherwise the entire computer world (especially the high performance parallel research community) would have jumped on it in earnest since it's been around for quite a while. One reason is that it uses a coarse-grain approach to parallelism; you can't even parallelize a quicksort routine (an ideal candidate for parallel processing) in Erlang. Consider also that Erlang is not deterministic and has no mechanism for automatic load balancing. The same goes for all the other functional programming language approaches to concurrency.

    The only problem I have yet to see addressed is how to properly test a threaded app, as it's non-deterministic.

    The solution is not to use threads at all. They are not needed for parallelism. The difficulty of programming with threads is the primary reason that the multicore industry is in a panic right now. Multithreaded programs are a nightmare to maintain, especially if you did not write the original code. Ask the folks at Intel, Microsoft and AMD. And it's not for the lack of trying. Billions of dollars have been spent on making multithreaded parallel programming easy in the last two decades. They're still at it. What's wrong with this picture?