Wintel, Universities Team On Parallel Programming
kamlapati writes in with a followup from the news last month that Microsoft and Intel are funding a laboratory for research into parallel computing at UC Berkeley. The new development is the imminent delivery of the FPGA-based Berkeley Emulation Engine version 3 (BEE3) that will allow researchers to emulate systems with up to 1,000 cores in order to explore approaches to parallel programming. A Microsoft researcher called BEE3 "a Swiss Army knife of computer research tools."
This is getting to be ridiculous. There's no way that anyone could juggle 1000 cores in their head and make a synchronous-threaded program. Put the money into quantum computing research and we'll have proper parallel computing.
how nice of microsoft to help train the next generation of google engineers.
It's a little disingenuous to claim that programmers are "stuck" with a serial programming model. The fact of the matter is that multi-threaded programming is a common paradigm which takes advantage of multiple cores just fine. Additionally, many algorithms cannot be parallelized.
Even languages like Erlang which bring parallelization right to the front of the language are still stuck running serial operations serially. There is sometimes no way around doing something sequentially.
Now, can we blow a few cycles on a few cores trying to predict which operations will get executed next? Yeah, sure, but that's not a programming problem, it's a hardware design problem.
Why not 1024, or 1000 cores will be enough ...
Actually, this is old news. There's a month-old discussion thread on the RWT discussion forum. Berkeley proposes the "thirteen dwarfs" - 13 kinds of test algorithms they consider valuable to parallelize. Linus doesn't think the 13 dwarfs correspond well to everyday computing loads. My 2 cents: Intel & others are spending hundreds of millions of bucks per year trying to speed up single-thread style computing, so it's not a bad idea to put a few more million/year into thousand-thread computing.
--- Often in error; never in doubt!
I remember working on the NOW system http://now.cs.berkeley.edu/ . It's a distributed system, and they have parallel programming languages such as Split-C or Titanium (parallel Java), and it supports MPI. I guess a network of those BEE3s would be called a bee hive?
Imagine a beowulf clus... never mind.
ParLab (what's being funded): http://parlab.eecs.berkeley.edu/
RAMP (the people who are building the architectural simulators for ParLab): http://ramp.eecs.berkeley.edu/
BEE2 (the precursor to the not-quite-so-microsoft BEE3): http://bee2.eecs.berkeley.edu/
The funding being announced here is for ParLab, whose mission is to "solve the parallel programming problem". Basically they want to design new architectures, operating systems and languages. And before you get all "we tried that and it didn't work", there are some genuinely new ideas here and the wherewithal to make them work. ParLab grew out of the Berkeley View report (http://view.eecs.berkeley.edu/), which was the work of a very large group of people to standardize on the same language and figure out what the problems in parallel computing were. This included everyone from architecture to applications (e.g. the music department).
RAMP is a multi-university group working to build architectural simulators in FPGAs. In fact you can go download one such system right now called RAMP Blue (http://ramp.eecs.berkeley.edu/index.php?downloads). With ParLab starting up there will be another project RAMP Gold which will build a similar simulator but specifically designed for the architectures ParLab will be experimenting with.
As a side note, keep in mind when you read articles like this that statements like the "Microsoft BEE3" are amusing when you take into account that "B.E.E." stands for Berkeley Emulation Engine. Microsoft did a lot of the work and did a good job of it, but still...
...does it run Linux?
Somehow, I doubt it.
Goodbye Slashdot. You've changed.
1000 core machines? Imagine a beowulf cluster of those!
Intel and other chip vendors are pushing the manycore vision as The True Path Forward. this is disingenuous, since it's merely the easy path forward for said chip vendors. everyone agrees "morecore" will be common in the future, but 1k cores? definitely not clear. is it even meaningful to call it shared-memory programming if you have 1k cores? it's not as if 1k cores can ever sanely share particular data, at least not if it's ever written. and what's the value of 1k cores all sharing the same RO data?
this is not to say that there's no good work to be done, especially in programming tools. but you can do this all today, with current hardware, even uniprocessor hardware. after all, it's _always_ most interesting to debug parallel programs on hardware platforms that do parallelism poorly, since that exaggerates your hotspots.
IMO, we'll be putting cpus into dram chips before we have widespread manycore chips.
Rick Merritt, who wrote the lead article, also posted an opinion piece in EE Times lambasting Wintel for their lackluster funding efforts in parallel programming. I thoroughly agree with this guy. To quote:
Wintel should not just tease multiple researchers with a $10 million grant awarded to one institution. They need to significantly up the ante and fund multiple efforts. Ten million is a drop in the bucket of the R&D budgets at Intel and Microsoft. You have to wonder about who is piloting the ship in Redmond these days when the company can afford a $44 billion bid for Yahoo to try to bolster its position in Web search but only spends $10 million to attack a needed breakthrough to save its core Windows business.
If you have a GeForce 8800 GT, you already have a 112 processor parallel computer that you can program using CUDA.
Microsoft has actually released a library which I would imagine is related to this work. PLINQ lets you very easily and declaratively multithread tasks. http://msdn2.microsoft.com/en-us/magazine/cc163329.aspx
The RAMP project (http://ramp.eecs.berkeley.edu/) tried that. We're actually looking for an old priest, a young priest and a couple of virgins. We haven't been able to get near the rack since we booted it up, and frankly the blood pouring from the faucets is starting to make a mess.
The BEE boards are being trumpeted as a multicore experimentation environment, but the FPGA itself is a powerful computational engine in its own right. FPGAs have to overcome the inertia of their history as verification tools for ASIC designs if they want to grow into being algorithm executors in their own right.
There's a growing community of FPGA programmers making accelerators for supercomputing applications. DRC (www.drccomputing.com) and XtremeData (www.xtremedatainc.com) both make co-processors for Opteron sockets with HyperTransport connections, and Cray uses these FPGA accelerators in their latest machines. There is even an active open standards body (www.openfpga.org).
FPGAs and multicore BOTH suffer from the lack of a good programming model. Any good programming model for multicore chips will also be a good programming model for FPGA devices. The underlying similarity here is the need to place dataflow graphs into a lattice of cells (be they fine-grained cells like FPGA CLBs or coarse-grained cells like a multicore processor). I can make a convincing argument that spreadsheets will be both the programming model and killer-app for future parallel computers: think scales with cells.
I've kept a blog on this stuff if you still care: fpgacomputing.blogspot.com
I'm about to start a graduate degree in this area, so I'm a little biased. However, I think a lot of problems can be solved in parallel. For example, LZW compression as it's implemented in the "zip" format might not be easily parallelizable, but that doesn't prevent us from developing a compression algorithm with parallelism in mind. I did some undergraduate research in parallel search algorithms, and I know for a fact that there are many, many ways you can parallelize search. Frankly, saying that you can't parallelize algorithms is a bit closed-minded. Many problems don't inherently require serial solutions; it's just that current algorithms handle them that way. Rather than trying to implement existing algorithms on a massively parallel processor, you want to re-tackle the problem under a new model, a model of an arbitrary number of processors. You build around the idea of data-parallelism rather than task-parallelism. Many, many things are possible under this model, and I think it's naive to think otherwise. You don't need to think, "How do I juggle 1000 threads around?" You think, "How do I take a problem, break it up into arbitrarily many chunks, distribute those chunks to an arbitrary number of processors, and do all that scheduling efficiently?" This model doesn't work for interactive tasks, mind you (where you're waiting for user input), but I'm very confident a model can be developed that does.
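For anyone who wants to see the "break it into arbitrarily many chunks" idea in concrete form, here is a minimal sketch in Python. It is only an illustration under assumptions: the function names (chunks, sum_of_squares, parallel_sum_of_squares) and the default worker and chunk counts are made up for this post, and the workload is a toy. It uses a thread pool from the standard library to keep the example short; in CPython you would likely swap in ProcessPoolExecutor to see a real speed-up on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(data, n_chunks):
    """Split data into at most n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when data is shorter than n_chunks
            yield data[start:end]
        start = end


def sum_of_squares(chunk):
    """The per-chunk work; it touches no shared state."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4, n_chunks=16):
    """Hypothetical helper: map the work over the chunks, then reduce."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The pool decides which worker gets which chunk; the caller
        # never juggles individual threads.
        partials = pool.map(sum_of_squares, chunks(data, n_chunks))
        return sum(partials)
```

The number of chunks is independent of the number of workers, which is the point being made above: the problem is described as many independent pieces, and the scheduling is left to the pool.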
Instead, you describe, using the toolset, the problem in a way which is decomposable, and the tools spread the work over the 1000+ cores.
One day soon, the computer industry will realize that, 150 years after Charles Babbage invented his idea of a general purpose sequential computer, it is time to move on and change to a new computing model. The industry will be dragged kicking and screaming into the 21st century. Threads were not originally intended to be the basis of a parallel software model but only a mechanism for running multiple sequential (not parallel) programs concurrently. The multithreading approach to parallel computing embraced by Intel, AMD and Microsoft is a disaster in the making because the future of computing is not multithreaded. See Nightmare on Core Street for more.
Solaris 10 goes to at least 256 cores, Nevada much higher. Why not just run it rather than simulate?
Organization? You must be joking..
We should be "stuck with a serial programming model". If your program runs too slow on a single 1 GHz CPU, lack of multicore techniques is the last thing you should be concerned about. The first thing you ought to do is optimize your damn code! There are very few applications that are CPU-bound, and in those that are, only one or two inner loops need parallelizing. The overwhelming majority of slow code is slow because you wrote it badly. So fix the software before blaming the hardware!
First you take the
I wish I had more than one boat. Taking all across concurrently would be easier.
I don't think anyone except you said anything about threads. You may have just described exactly what the GP was describing -- point is, why should you have to break them down into individual programs yourself?
Personally, I like Erlang, but the point is the same -- come up with a toolset and/or programming paradigm which makes scaling to thousands of cores easy and natural.
The only problem I have yet to see addressed is how to properly test a threaded app, as it's non-deterministic.
Don't thank God, thank a doctor!
Oh sure, Code Pink protests the military's presence in Berkeley, but leaves Microsoft free to enter the flagship university. What a bunch of commie pussies!
I don't think anyone except you said anything about threads. You may have just described exactly what the GP was describing -- point is, why should you have to break them down into individual programs yourself?
This is precisely what is wrong with the current approach. The idea that the problem should be addressed from the top down has been tried for decades and has failed miserably. The idea that we should continue to write implicitly sequential programs and have some tool extract the parallelism by dividing the program into concurrent threads is completely ass-backwards, IMO. We should start with parallel elements and build implicitly parallel modules from those elements. Nothing needs to be broken down. Hardware engineers have been doing it for years and it works.
Personally, I like Erlang, but the point is the same -- come up with a toolset and/or programming paradigm which makes scaling to thousands of cores easy and natural.
Erlang is not the answer for many reasons, otherwise the entire computer world (especially the high performance parallel research community) would have jumped on it in earnest since it's been around for quite a while. One reason is that it uses a coarse-grain approach to parallelism; you can't even parallelize a quicksort routine (an ideal candidate for parallel processing) in Erlang. Consider also that Erlang is not deterministic and has no mechanism for automatic load balancing. The same goes for all the other functional programming language approaches to concurrency.
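To make the quicksort point concrete without taking sides on Erlang: after partitioning, the two sub-lists share no data, so they can be sorted at the same time. The sketch below is in Python, not Erlang, and everything in it (the quicksort function, the depth parameter that limits how many threads get spawned) is an assumption made for illustration. It shows the structure only; plain threads in CPython will not make it faster.

```python
import threading


def quicksort(items, depth=2):
    """Return a sorted copy; sub-lists are sorted concurrently while depth > 0."""
    if len(items) <= 1:
        return list(items)

    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]

    if depth <= 0:
        # Below the cut-off, fall back to ordinary sequential recursion.
        return quicksort(less, 0) + equal + quicksort(greater, 0)

    result = {}

    def sort_less():
        result["less"] = quicksort(less, depth - 1)

    # Sort one partition in a new thread and the other in the current one.
    # The partitions are disjoint, so no locking is needed.
    helper = threading.Thread(target=sort_less)
    helper.start()
    sorted_greater = quicksort(greater, depth - 1)
    helper.join()

    return result["less"] + equal + sorted_greater
```

With depth=2 at most three helper threads are started in total; how to choose that cut-off well is exactly the load-balancing question raised above.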
The only problem I have yet to see addressed is how to properly test a threaded app, as it's non-deterministic.
The solution is not to use threads at all. They are not needed for parallelism. The difficulty of programming with threads is the primary reason that the multicore industry is in a panic right now. Multithreaded programs are a nightmare to maintain, especially if you did not write the original code. Ask the folks at Intel, Microsoft and AMD. And it's not for the lack of trying. Billions of dollars have been spent on making multithreaded parallel programming easy in the last two decades. They're still at it. What's wrong with this picture?
I really think that Intel needs to skip doing quad-core and whatever processors, and jump directly to doing a kilocore processor. Such a processor would have 1024 cores. It would be the pride of any self-respecting geek to own such a computer. Then they could improve on it by gradually going to two kilocores, four kilocores, etc. In a number of years, when the average computer processor has 250 gigacores, we'll laugh and poke fun at the good ol' days when 640 kilocores were enough for anyone.
No more worrying if you incremented that semaphore correctly because you're operating at a much higher level.
You only need to "worry" about that if you insist on programming your multi-core machine in low-level C. Better solutions have existed for decades, people just don't use them. How is the BEE3 going to change that?
The fact of the matter is that multi-threaded programming is a common paradigm which takes advantage of multiple cores just fine.
Multi-threaded programming is cumbersome. There have been better ways of doing parallel programming for a long time.
Additionally, many algorithms cannot be parallelized.
Whether algorithms can be parallelized doesn't matter. What matters is whether there are parallel algorithms that solve problems faster than serial algorithms, and in most cases there are.
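One textbook illustration of that distinction is the running total (prefix sum). The obvious loop is serial because each element depends on the previous one, yet a different algorithm, the Hillis-Steele scan, solves the same problem in about log2(n) rounds in which every element can be updated independently. The sketch below simulates those rounds sequentially in Python; the function names are made up here, and the list comprehension stands in for what would be one parallel step on real hardware.

```python
def scan_step(values, offset):
    """One round: every position i >= offset adds in the value `offset` places back.

    Each output element depends only on the previous round's list, so all
    positions could be computed at the same time.
    """
    return [
        v if i < offset else values[i - offset] + v
        for i, v in enumerate(values)
    ]


def parallel_prefix_sum(values):
    """Return (inclusive prefix sums, number of rounds used)."""
    result = list(values)
    offset = 1
    rounds = 0
    while offset < len(result):
        result = scan_step(result, offset)
        offset *= 2  # the look-back distance doubles every round
        rounds += 1
    return result, rounds
```

For the eight-element input [1, 2, 3, 4, 5, 6, 7, 8] this returns ([1, 3, 6, 10, 15, 21, 28, 36], 3): three rounds instead of seven dependent additions, at the cost of more total work.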
Even languages like Erlang which bring parallelization right to the front of the language are still stuck running serial operations serially.
They aren't "stuck" doing that, they do that because programmers find it convenient, not because they have to. There are many languages that don't even have a defined order of execution.
it's a hardware design problem.
Actually, it's a programmer education problem: most programmers have no idea what kinds of tools they have available for parallel programming, they have no idea how to use them, and they don't even understand what parallel programming paradigms exist. Like you, for example.
Maybe so, but it's certainly not what I was suggesting.
Rather, I'm suggesting that we should have tools which make it easy to write a parallel model, even if individual tasks are sequential -- after all, they are ultimately executed in sequence on each core.
Can't? Or isn't easy to?
I suspect such a mechanism would be easier to build in Erlang than in most other modern languages. And parallel programming is inherently non-deterministic.
It might not be fast, though, as what immediately came to mind is a bunch of worker threads and one master thread -- workers notify the master when they're ready for more tasks.
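Roughly what I have in mind, sketched in Python rather than Erlang: workers tell the master when they are idle, and the master hands out the next task, so faster workers naturally end up with more of them. All the names here (run_master_worker, the ready queue, the per-worker inboxes) are invented for this sketch, and it has no error handling: if func raises, the master will wait forever.

```python
import queue
import threading


def run_master_worker(tasks, func, n_workers=4):
    """Apply func to every task; workers ask the master for work when idle."""
    ready = queue.Queue()  # workers -> master: (worker_id, task_index, result)
    inboxes = [queue.Queue() for _ in range(n_workers)]  # master -> each worker
    results = [None] * len(tasks)

    def worker(wid):
        ready.put((wid, None, None))  # announce "I'm idle" with no result yet
        while True:
            job = inboxes[wid].get()
            if job is None:  # the master says there is no more work
                return
            index, arg = job
            # Reporting a result doubles as the next "I'm idle" notification.
            ready.put((wid, index, func(arg)))

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()

    next_task = 0
    active = n_workers
    while active:
        wid, index, value = ready.get()
        if index is not None:
            results[index] = value
        if next_task < len(tasks):
            inboxes[wid].put((next_task, tasks[next_task]))
            next_task += 1
        else:
            inboxes[wid].put(None)  # shut this worker down
            active -= 1

    for t in threads:
        t.join()
    return results
```

The single master is the obvious bottleneck: every task costs a round trip through it, which is why I doubt this would be fast for fine-grained work.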
I am referring to threads as the OS concept, not as a programming concept. That is: I am not talking about the Erlang processes, I'm talking about the real OS threads it uses (generally one per core, or just one). And I'm referring to threads as a generalization of OS-level processes.
Are you suggesting that a different CPU and/or OS architecture could be built which would make it possible to write deterministic, threaded programs? Or are you talking about an entirely language-level approach?
Or are you suggesting that we try to keep cranking up the clock?