Wintel, Universities Team On Parallel Programming

1000 cores? by Brian+Gordon · 2008-03-14 05:17 · Score: 1

This is getting to be ridiculous. There's no way that anyone could juggle 1000 cores in their head and make a synchronous-threaded program. Put the money into quantum computing research and we'll have proper parallel computing.

Re:1000 cores? by rthille · 2008-03-14 05:23 · Score: 4, Informative

The point of the Berkeley program is to come up with toolsets so you don't have to "juggle 1000 cores in your head". Instead, you describe, using the toolset, the problem in a way which is decomposable, and the tools spread the work over the 1000+ cores. No more worrying if you incremented that semaphore correctly because you're operating at a much higher level.

--
Awesome furniture, accessories and cabinetry in Santa Rosa, CA: http://humanity-home.com/
Re:1000 cores? by SeekerDarksteel · 2008-03-14 05:43 · Score: 3, Insightful

1) Quantum computing != parallel computing.

2) A significant number of applications can and do run on 1000+ cores. Sure, most are scientific apps rather than consumer apps, but there is a market for it nevertheless. Go tell a high performance computing guy that there's no need for 1k cores on a single chip and watch him collapse laughing at you.

--
The laws of probability forbid it!
Re:1000 cores? by Ngarrang · 2008-03-14 05:56 · Score: 1

Just think what SETI@Home could do with a 1000 core processor. Or, for something more practical and useful in the real world, Folding@Home.

--
Bearded Dragon
Re:1000 cores? by Anonymous Coward · 2008-03-14 06:17 · Score: 4, Funny

640 Cores should be enough for anybody.
Re:1000 cores? by BadHaggis · 2008-03-14 06:20 · Score: 0, Troll

Finally! A vista capable system.

--
Homo homini lupus
Re:1000 cores? by mikael · 2008-03-14 06:37 · Score: 1

And people said the same with 128+ variable CPU stack frames combined with RISC instruction sets. Nobody is going to be able to do that kind of juggling in their head, so "register scoreboarding" was built into the compilers.

You could try and have a process running on each core, but even on a university server, you will only have a few hundred processes running, so giving every user a single core is still going to underutilize 80% of those cores. And even then many of those processes are hourly cron jobs or daemons running idle most of the time.

The only alternative is to try and break up each module or subroutine into separate tasks that can be run on each core. The most data consuming tasks are signal processing and visualization (1D audio, 2D video, 3D tomography). Breaking up a fast-fourier transform into chunks so that each core can do a single row or column is one way to achieve this. And the same can be done for 3D rendering.

--
Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
Re:1000 cores? by gdgib · 2008-03-14 07:02 · Score: 1

Actually, if you check out the BEE2 website (http://bee2.eecs.berkeley.edu/, BEE2 being the precursor to BEE3) you'll notice a Casper (http://casper.berkeley.edu/) logo in the upper left. That is the SETI folks!

Except that instead of running SETI@home, they used heavily FPGA-optimized designs. Since most radio astronomy only requires a few bits of precision (2-8), modern CPUs or GPUs are incredibly wasteful for them. So instead they use heavily optimized fixed-point math circuitry. By using FPGAs they can keep them easily reconfigurable and get enough throughput to deal with e.g. Hat Creek (http://en.wikipedia.org/wiki/Hat_Creek_Radio_Observatory).
Re:1000 cores? by serviscope_minor · 2008-03-14 07:24 · Score: 1

Because all of those supercomputers with 1000 CPUs are never used by anyone for anything because no one can figure them out. If you use the same techniques on a 1000 core processor, it will work just the same. I expect that a 1000 core processor will be very NUMA, so you'll probably just resort to MPI anyway.

OK, so I know there's no "wrong" mod, but don't mod it insightful.

--
SJW n. One who posts facts.
Re:1000 cores? by zkiwi34 · 2008-03-14 07:44 · Score: 0

Hmmm, recursive process abuse springs to mind as a fun thing to do to a 1000+ core machine...
Re:1000 cores? by Fry-kun · 2008-03-14 08:04 · Score: 1

Actually, functional programming is the answer. Unlike procedural code, you don't need to think of what you want the CPUs to do, but of the result you want them to achieve - in other words, "threads" are not necessary for your code to utilize multiple processors.

--
Did you know that "FTW" ("for the win") is a direct translation of "Sieg Heil"?
Re:1000 cores? by rrohbeck · 2008-03-14 09:44 · Score: 1

640 Cores should be enough for anybody.

No, 640K cores! We're still a ways off.

I keep wondering when we're going to put processing closer to the memory again. As in, put a couple of SPUs right on the memory chips. At least an FPGA with a couple 1,000 gates, that would be very general purpose.

--
thegodmovie.com - watch it
Re:1000 cores? by Timothy+Brownawell · 2008-03-14 11:44 · Score: 1

Go tell a high performance computing guy that there's no need for 1k cores on a single chip and watch him collapse laughing at you.

It seems pretty ridiculous, unless you can also cram in enough pins for 1k memory channels. Even current 4-core chips are enough to make memory the bottleneck on some workloads, and AIUI scientific computing often tends to be that sort of workload.
Re:1000 cores? by jonaskoelker · 2008-03-14 11:51 · Score: 1

2) A significant number of applications can and do run on 1000+ cores.

If you're on Gentoo, they'll all be gcc ;)

(just teasing)
Re:1000 cores? by ultranova · 2008-03-16 04:31 · Score: 1

This is getting to be ridiculous. There's no way that anyone could juggle 1000 cores in their head and make a synchronous-threaded program.

Why would you need to? Either your program is multithreaded or it isn't; and if it is, it either is or isn't properly synchronized. The number of cores is completely irrelevant; a broken multithreaded program will fail randomly on a single-core machine too, and a singlethreaded program on a 1000-core machine won't run into any issues either.
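
To make "properly synchronized" concrete, here is a minimal Python sketch. The `Counter` class and the `hammer` helper are made-up names for illustration, not anything from the article. The point is only that the lock makes the result independent of how many threads or cores are involved.

```python
import threading


class Counter:
    """Shared counter whose increment is guarded by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The lock keeps the read-modify-write below from interleaving
        # with the same operation in another thread.
        with self._lock:
            self.value += 1


def hammer(counter: Counter, threads: int = 8, per_thread: int = 10_000) -> int:
    """Increment `counter` from several threads and return the final value."""

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=work) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

With the lock in place, `hammer(Counter())` always returns `threads * per_thread`. Take the lock out and the program is broken whether it runs on one core or a thousand, which is the point being made above.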

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:1000 cores? by Raenex · 2008-03-16 09:00 · Score: 1

I keep wondering when we're going to put processing closer to the memory again.

Isn't that essentially what the L1 and L2 caches do?
Re:1000 cores? by jimicus · 2008-03-17 00:33 · Score: 1

Yes, but they're very expensive and they only put the most recently needed instructions/data close to the core.

Granted, in 90% of day to day uses that's all you need. But the other 10% would probably love to see RAM running synchronously with the CPU.
Re:1000 cores? by Anonymous Coward · 2008-03-17 02:43 · Score: 0

You could try and have a process
to try and break up

"try to".

cool by fred+fleenblat · 2008-03-14 05:20 · Score: 1

how nice of microsoft to help train the next generation of google engineers.

Re:cool by MightyYar · 2008-03-14 05:38 · Score: 4, Funny

They are actually trying to model hell on earth. Microsoft provides the evil and Intel the heat.

(Okay, the joke would have worked better in the P4 days.)

--
W..w..W - Willy Waterloo washes Warren Wiggins who is washing Waldo Woo.

"stuck with a ...serial programming model" by BadAnalogyGuy · 2008-03-14 05:21 · Score: 5, Insightful

It's a little disingenuous to claim that programmers are "stuck" with a serial programming model. The fact of the matter is that multi-threaded programming is a common paradigm which takes advantage of multiple cores just fine. Additionally, many algorithms cannot be parallelized.

Even languages like Erlang which bring parallelization right to the front of the language are still stuck running serial operations serially. There is sometimes no way around doing something sequentially.

Now, can we blow a few cycles on a few cores trying to predict which operations will get executed next? Yeah, sure, but that's not a programming problem, it's a hardware design problem.

Re:"stuck with a ...serial programming model" by gdgib · 2008-03-14 05:50 · Score: 4, Insightful

ParLab is so not about branch predictors and out-of-order execution. As you say, that's a hardware design problem and a solved one at that. Boring.

While I'll agree that not all programmers are stuck with the serial programming model, threads aren't exactly a great solution (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html). They're heavyweight and inefficient compared to running most algorithms on e.g. bare hardware or even an FPGA. Plus they deal badly with embarrassing parallelism (http://en.wikipedia.org/wiki/Embarrassingly_parallel). And finally they are HARD to use: the programmer must explicitly manage the parallelism by creating, synchronizing and destroying threads.

Setting aside those problems which exhibit no parallelism (for which there is really no solution but a faster CPU), there are many classes of problems which would benefit enormously from better programming models, which are more efficiently tied to the operating system and hardware rather than going through an OS level threading package.
Re:"stuck with a ...serial programming model" by $RANDOMLUSER · 2008-03-14 05:51 · Score: 1

There is sometimes no way around doing something sequentially.
Like I always say (sometimes): "A process blocked on I/O is blocked, no matter how many processors you throw at it."

--
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
Re:"stuck with a ...serial programming model" by Anonymous Coward · 2008-03-14 06:00 · Score: 0

With that many processors, couldn't you try many combinations of input, then choose the right one at a later date once you know the precursors? I believe this is called speculative execution and it sounds like a good way to do serial operations on a subset of problems that have a finite set of inputs.
Re:"stuck with a ...serial programming model" by Mongoose+Disciple · 2008-03-14 06:07 · Score: 1

There is sometimes no way around doing something sequentially.

Yup. And as Amdahl's Law (paraphrased) puts it: the amount of speed increase you can achieve with parallelization is always constrained by the parts of the process that can't be parallelized.
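
For anyone who wants numbers, the usual statement of Amdahl's Law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of the runtime that parallelizes and n is the number of cores. A small Python sketch follows; the function name and the 95% figure are just assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workload: 95% of the runtime parallelizes perfectly.
for cores in (1, 4, 1000):
    print(cores, amdahl_speedup(0.95, cores))
```

With a 5% serial part the bound can never reach 1 / 0.05 = 20, no matter how many cores you add, which is why the serial portion matters so much at 1000 cores.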
Re:"stuck with a ...serial programming model" by robizzle · 2008-03-14 06:13 · Score: 4, Insightful

You are right, programmers aren't currently "stuck" with a serial programming model; however, looking into the future it is pretty clear that the hardware is developing faster than the programming models. Systems with dozens and dozens of cores aren't far off but we really don't have a good way to take advantage of all the cores.

In the very near future, we could potentially have systems with hundreds of cores that sit idle all the time because none of the software takes advantage of much more than 5-10 cores. Of course, this would never actually happen, because once the hardware manufacturers notice this to be a problem, they will stop increasing the number of cores and try to make some other changes that would result in increased performance to the end user. There will always be a bottleneck -- either the software paradigms or the hardware and right now it looks like in the near future it will be the software.

Yes, there are some algorithms that no matter what you do have to be executed sequentially. However, there is a huge truckload of algorithms that can be rewritten, with little added complexity, to take advantage of parallel computing. Furthermore, there is a slew of algorithms that could be rewritten with a slight loss in efficiency to be parallelized but with a net gain in performance. This third type of algorithm is the one I think is most interesting for researchers -- even though parallelizing the algorithm may introduce redundant calculations or added work, the increased number of workers outweighs this.

In other words, what is more efficient: 1 core that performs 20,000 instructions in 1 second, or 5 cores that each perform 7,000 instructions, in parallel, in 0.35 seconds? Perhaps surprisingly to you, the single core is more efficient (20,000 instructions instead of 7,000*5 = 35,000 instructions) -- BUT, if we have the extra 4 cores sitting around doing nothing anyways, we may as well introduce inefficiency and finish the algorithm about 2.9 times faster.
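
The arithmetic in that example can be checked with a few lines of Python. The `compare` helper is a made-up name, and the inputs are just the numbers from the paragraph above.

```python
def compare(serial_instr: int, serial_secs: float,
            cores: int, instr_per_core: int, parallel_secs: float) -> dict:
    """Compare wall-clock speedup against the extra work the parallel version does."""
    total_parallel_instr = cores * instr_per_core
    speedup = serial_secs / parallel_secs
    return {
        "total_parallel_instr": total_parallel_instr,
        "speedup": speedup,                                     # how much sooner we finish
        "work_overhead": total_parallel_instr / serial_instr,   # how much more we execute
        "efficiency": speedup / cores,                          # speedup per core used
    }


result = compare(20_000, 1.0, 5, 7_000, 0.35)
print(result)
```

The parallel version executes 1.75 times as many instructions but finishes roughly 2.86 times sooner (1 / 0.35), so each of the 5 cores is only about 57% as productive as the lone serial core. That is the trade being described: less efficient, but faster.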
Re:"stuck with a ...serial programming model" by asills · 2008-03-14 06:28 · Score: 2, Interesting

Threads are harder just like memory management in C++ is harder than Java and .NET.

It's the people who really can't program that are having significant trouble with parallelization in modern applications. That's not to say that in the future I won't love to be able to express a solution and have it automatically parallelized, but for the time being creating applications that take advantage of multiple cores well (server apps, not client apps) is not that difficult if you know what you're doing.

Though, like C++ with memory leaking, it is possible to shoot yourself in the foot with a deadlock occasionally.

--
-- What did Spock find in Kirk's toilet? The captain's log.
Re:"stuck with a ...serial programming model" by gdgib · 2008-03-14 06:35 · Score: 1

I find it amusing that the original post was by "BadAnalogyGuy" and you just used one.

Why do Java and .NET have better memory management? So that it's easier for people to work with memory. Why does ParLab exist? So that it will be easier to work with parallelism.

Part of the goal here is to make it so that, like with memory management, someone who knows what their doing (i.e. a hardcore manage-their-own-memory assembly or C++ programmer) can write a large parallel library, and someone who doesn't (i.e. a newbie-MCSE-.NET-programmer) can use it to solve a real problem.

Methings your argument supported the opposite conclusion from the one you drew.
Re:"stuck with a ...serial programming model" by gdgib · 2008-03-14 06:43 · Score: 1

Bugger. I read the GP again, and he might've been making exactly the same point I just did. Oops.
Re:"stuck with a ...serial programming model" by 0xABADC0DA · 2008-03-14 06:55 · Score: 3, Interesting

Setting aside those problems which exhibit no parallelism (for whom there is no solution but a faster CPU really), there are many classes of problems which would benefit enormously from better programming models, which are more efficiently tied to the operating system and hardware rather than going through an OS level threading package.

The programming models we have are just fine. By far the vast majority of program time is spent in a loop of some kind, but languages which could easily parallelize loops don't. There is no reason why 'foreach' or 'collect' cannot use other processors (whereas 'inorder' or 'inject' would always be sequential). So our programming models are not the problem. The real problem is trying to use them with a 40 year old operating system design.

Current operating systems could run code in parallel if they stopped scheduling threads a timeslice on a processor and instead scheduled a timeslice across multiple processors. Take an array of 1000 strings and a regex to match them against. If the program is allocated 10 processors it can do a simple interrupt and have them start working on 100 strings each. By having the processors allocated you can avoid the overhead of switching memory spaces and of scheduling, making this kind of fine-grained parallelism feasible.
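
As a rough user-level picture of that regex example, here is a Python sketch that hands 1000 strings to 10 workers, 100 strings each. It uses an ordinary thread pool, which is exactly the OS-level threading this post argues is too heavyweight, so treat it as an illustration of the data split and not of the proposed scheduler. `parallel_match` and `match_chunk` are made-up names.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def match_chunk(pattern: re.Pattern, chunk: list) -> list:
    """Match every string in one chunk; True where the pattern is found."""
    return [bool(pattern.search(s)) for s in chunk]


def parallel_match(pattern_text: str, strings: list, workers: int = 10) -> list:
    """Split `strings` into contiguous chunks and match each chunk on a worker."""
    pattern = re.compile(pattern_text)
    # Ceiling division, so 1000 strings over 10 workers gives 100 per chunk.
    size = max(1, -(-len(strings) // workers))
    chunks = [strings[i:i + size] for i in range(0, len(strings), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(match_chunk, [pattern] * len(chunks), chunks))
    # Chunks come back in submission order, so results line up with `strings`.
    return [hit for part in parts for hit in part]


strings = [f"item-{i}" for i in range(1000)]
hits = parallel_match(r"7$", strings)
print(sum(hits), "of", len(hits), "strings matched")
```

In standard CPython builds the threads take turns on pure-Python work, so this shows the shape of the split more than any real speedup.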

But the problem here is that most programs will use one or two processors most of the time and all the available processors at other times. And if your parallel operation had to synchronize at some point then you'd have all your other allocated processors doing nothing while waiting for one to finish with its current work. So there is a huge amount of wasted time by allocating a thread to more than one processor.

A solution to the unused processor problem is to have a single memory space, and so as a consequence only run typesafe code -- an operating system like JavaOS or Singularity or JXOS. This lets any processor be interrupted quickly to run any process's code in parallel, so CPU's can be dynamically assigned to different threads. Even small loops can be effectively run across many CPUs, and there is no waste from the heavyweight allocations and clunkiness that is caused ultimately by separate memory spaces needed to protect C-style programs from each other. This is why it is the operating system, not the programming models, that is the main problem.
Re:"stuck with a ...serial programming model" by CustomDesigned · 2008-03-14 07:15 · Score: 2, Insightful

My favorite massively parallel programming system is LINDA, and the Java distributed equivalent, Javaspaces. The idea is basically a job jar. For instance, a 3D ray tracer would put each output pixel in the job jar, and worker threads grab a pixel and trace it. (Naturally, the pixel coords can be generated algorithmically rather than actually stored). Even though the time to trace a pixel varies widely, all workers are kept at capacity. Watching it raytrace a scene in a fraction of a second is like watching a random fade - the pixels appear in essentially random order.
I suspect that the job jar is the bottleneck for the LINDA approach, and further research is required. But the concept is really easy to work with.
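
A bare-bones version of the job jar idea, sketched in Python with a plain in-process queue. This is not LINDA or JavaSpaces, just the same pattern; `trace_pixel` is a stand-in for real ray tracing and `render` is a made-up name.

```python
import queue
import threading


def trace_pixel(x: int, y: int) -> int:
    """Stand-in for real ray tracing: returns a fake shade for the pixel."""
    return (x * 31 + y * 17) % 256


def render(width: int, height: int, workers: int = 4) -> dict:
    jar = queue.Queue()        # the "job jar": one job per output pixel
    image = {}                 # (x, y) -> shade
    lock = threading.Lock()    # protects `image`

    # Fill the jar before any worker starts, so an empty jar means "all done".
    for y in range(height):
        for x in range(width):
            jar.put((x, y))

    def worker() -> None:
        while True:
            try:
                x, y = jar.get_nowait()   # grab whichever job is next
            except queue.Empty:
                return                    # jar is empty: this worker is finished
            shade = trace_pixel(x, y)
            with lock:
                image[(x, y)] = shade

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image
```

Workers that land on cheap pixels simply come back to the jar sooner, which is how everyone stays busy even when the per-pixel cost varies. The single shared queue is also the obvious candidate for the bottleneck mentioned above.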
Re:"stuck with a ...serial programming model" by 0xABADC0DA · 2008-03-14 10:13 · Score: 1

Until you can take

(1..100).foreach { |e| e.function() }

And turn it into

(1..50).foreach { |e| e.function() }
(51..100).foreach { |e| e.function() }

and have these run on two or more processors faster than the original runs on one, you'll never get much use out of the extra ones. Our current operating systems can't take a small loop of, say, 100 iterations, divide it up across processors and have it run faster than just doing it on one. That's the problem. Just a rough guess, but I bet .function would have to take well over 10k cycles to make it worth attempting on a modern operating system like Linux or Mac OS X.

The most benefit from parallel execution is at the really fine-grained level. We already have threads and processes. Threads are really great at doing multiple actions at once... but really bad at speeding up any one particular action. What's needed is some kind of operating system support for running small loops on multiple processors.
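
If you want to measure that overhead yourself, here is a small Python sketch that runs the same 100-iteration loop once serially and once split in half across a two-thread pool, and times both. `function` is a hypothetical tiny loop body standing in for `e.function()`; the timings will vary by machine, so no particular result is claimed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def function(e: int) -> int:
    """Hypothetical tiny loop body, standing in for e.function()."""
    return e * e


def run_serial(lo: int, hi: int) -> list:
    """Run the loop body over the half-open range [lo, hi)."""
    return [function(e) for e in range(lo, hi)]


def run_split(lo: int, hi: int, pool: ThreadPoolExecutor) -> list:
    """Run the same loop as two halves submitted to a thread pool."""
    mid = (lo + hi) // 2
    # Half-open ranges [lo, mid) and [mid, hi) cover every value exactly once.
    first = pool.submit(run_serial, lo, mid)
    second = pool.submit(run_serial, mid, hi)
    return first.result() + second.result()


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


with ThreadPoolExecutor(max_workers=2) as pool:
    serial_result, serial_secs = timed(run_serial, 1, 101)
    split_result, split_secs = timed(run_split, 1, 101, pool)

print(f"serial: {serial_secs:.6f}s  split: {split_secs:.6f}s")
```

Both versions produce the same list; the interesting part is how much work `function` has to do before the split version stops losing to the dispatch cost.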
Re:"stuck with a ...serial programming model" by SanityInAnarchy · 2008-03-14 13:04 · Score: 1

I know I built a scheme exactly like this in Perl, once. And Erlang is built around message-passing, which could be used to implement something like this.

And I don't think it needs to be a bottleneck unless it needs to be sequential. All you do is, have multiple LINDAs (or queues, or jars, or whatever), and have multiple sources to each.

For example: Suppose you wanted that raytracer to do some simple anti-aliasing which took into account the surrounding pixels. The antialiasing "jar" could be fed by all of the initial threads. Or, in Erlang terminology, you'd fire up a new process for each antialiased pixel, and after rendering each unantialiased pixel, it sends a message with the result to each antialiased pixel which is interested. As soon as you get one area sufficiently rendered, the antialiasing can start.

Maybe you're right, though -- I did notice that a benchmark designed to test exclusively Erlang's message-passing features ended up running slower when I told Erlang to run on both cores.

--
Don't thank God, thank a doctor!
Re:"stuck with a ...serial programming model" by bit01 · 2008-03-14 15:10 · Score: 1

Additionally, many algorithms cannot be parallelized.

Conventional wisdom but it's just not true. Maybe you meant to say automatically parallelizable however that's much the same as saying that it is necessary to

I work in parallel programming and I have never seen a real world problem/algorithm that was not parallelizable. Maybe there's a few obscure ones out there but I've never seen them. Anybody want to suggest even one?

In any case almost all PC's these days are already highly parallel; display cards, monitors, keyboards, USB devices, disks, printers, network cards, you name it, all have their own CPU's and memory.

---

You're a fool if you think advertising pays for anything at all.
Re:"stuck with a ...serial programming model" by BadAnalogyGuy · 2008-03-14 15:28 · Score: 1

Maybe there's a few obscure ones out there but I've never seen them. Anybody want to suggest even one?

Reading a sector of data from a hard disk.

But that's pretty obscure, I'll grant you that.
Re:"stuck with a ...serial programming model" by bit01 · 2008-03-14 17:08 · Score: 1

Reading a sector of data from a hard disk.

But that's embarrassingly parallelizable, it's why hard disks have multiple heads and multiple platters to read/write in parallel and up the throughput. RAID's make it even more parallel. Within limits and for normal volumes of hard disk data, which are much larger than a sector size, (small data is held in memory+caches) this will increase throughput proportional to the number of heads.

In any case that's a hardware limitation, nothing to do with an algorithm as such.

I do agree that if you drill down far enough you get to atomic operations that can't be parallelized; however, I'm not aware of any atomic operation in real world programming that takes any more than a tiny fraction of a second and thus causes real world impact. Even machine instructions are done in parallel these days.

(Correction to previous post: ... Maybe you meant to say automatically parallelizable however that's much the same as saying that you hobble yourself by using a non-parallel programming language; the programming language must be appropriate for the target, whether serial or parallel. We can't program in a natural language yet...)

---

Has your software been deliberately crippled?
Re:"stuck with a ...serial programming model" by BadAnalogyGuy · 2008-03-14 17:28 · Score: 1

In any case that's a hardware limitation, nothing to do with an algorithm as such.

Then you admit that there are operations upon which software must wait. There is no way for a program that relies on the read data to act upon it until it arrives in the CPU registers. How are you going to serialize that?

I'm not aware of any atomic operation in real world programming that takes any more than a tiny fraction of a second and thus causes real world impact

These things add up. You would be the first to admit, I'm sure, that an inefficient algorithm (bubble sort) takes longer to run on average than a more efficient algorithm (say, heapsort). But each individual instruction requires only the tiniest fraction of a second to run. The thing is, that these instruction execution times add up, just as waiting for HW interrupts does, just as every single thing that a computer is doing takes time.

Algorithm efficiency is important because it has real world impact. Contrary to what you are saying above.

The goal, obviously, is to get as many independent operations running at once as possible. And as great as that goal is, the limit is the extent to which we can design programs to have independent operations. The raytracing example given in a separate thread is an example of a successfully parallelized program. Blitting a framebuffer is something that is not so easily parallelized (nor does it benefit much from parallelization).
Re:"stuck with a ...serial programming model" by octogen · 2008-03-15 00:34 · Score: 1

If an operating system knew, which parts of a program can be executed in parallel, then it already COULD run ONE thread on MULTIPLE processors. But even if it knew, when you have multiple tasks or threads running in parallel, then you want every one of them to stay on the same CPU as long as possible, because scheduling every thread on every processor pollutes the caches and slows down the entire system.
Re:"stuck with a ...serial programming model" by Anonymous Coward · 2008-03-15 04:17 · Score: 0

If an operating system knew, which parts of a program can be executed in parallel, then it already COULD run ONE thread on MULTIPLE processors. But even if it knew, when you have multiple tasks or threads running in parallel, then you want every one of them to stay on the same CPU as long as possible, because scheduling every thread on every processor pollutes the caches and slows down the entire system.

I don't think you understood what I was trying to say. What I mean is, you take ONE thread running a for 0..100 loop and run that ONE thread SIMULTANEOUSLY on say FOUR processors, each doing one of 0..25, 25..50, 50..75, 75..100. Modern operating systems cannot do this because of having multiple memory spaces, so they either have to semi-permanently allocate several processors to the ONE thread and waste them, or suffer massive reallocation costs that make it not efficient to run a single thread simultaneously on multiple processors.

The reason to keep a thread on the same CPU 'as long as possible' is because you can potentially avoid having to change the memory map and flush. The caches you speak of are already polluted by switching processes in any normal system, so these are of no additional consequence.
Re:"stuck with a ...serial programming model" by asills · 2008-03-15 09:41 · Score: 1

I won't hold it against you ;)

Yes, we're making the same point, though I'm also pointing out that the doom and gloom that is always presented wrt parallelism in current programming languages isn't so. It's only so for those that don't know what they're doing.

--
-- What did Spock find in Kirk's toilet? The captain's log.
Re:"stuck with a ...serial programming model" by jgrahn · 2008-03-15 21:05 · Score: 1

Threads are harder just like memory management in C++ is harder than [in] Java and .NET.

I find memory management is trivial in most real-life C++ code. And management of non-memory resources is easier than in Java.
Re:"stuck with a ...serial programming model" by ultranova · 2008-03-16 04:55 · Score: 1

Like I always say (sometimes): "A process blocked on I/O is blocked, no matter how many processors you throw at it."

But if it is blocked because the target of the I/O operation, say, a local database server, is busy calculating the data to be returned, throwing more processors at it might indeed cause it to unblock sooner.

--
Forget magic. Any technology distinguishable from divine power is insufficiently advanced.
Re:"stuck with a ...serial programming model" by brysgo · 2008-03-16 13:55 · Score: 1

There is no reason why 'foreach' or 'collect' cannot use other processors

While it sounds like a good idea, 'foreach' loops often collect the value for each into a single instance variable to get a sum, or similar compilation of the contents of the array. If the cycles were to run at the same time, the second iteration of the loop would not have the data from the first to append to.
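
For the sum case specifically, the usual workaround is to give each worker its own partial total and combine the partials at the end, which works because addition is associative. A minimal Python sketch, with `parallel_sum` as a made-up name:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values: list, workers: int = 4) -> int:
    """Sum `values` by summing contiguous chunks separately, then combining."""
    if not values:
        return 0
    # Ceiling division so every element lands in exactly one chunk.
    size = -(-len(values) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker accumulates into its own partial total; nothing is shared.
        partials = list(pool.map(sum, chunks))
    # Only this final combine step is sequential.
    return sum(partials)


print(parallel_sum(list(range(1, 101))))
```

This only helps when the accumulation can be regrouped like that. A loop whose next step truly needs the previous step's result is still stuck running in order.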

I do believe that parallel processing could be used to improve the speed of 3d rendering and particle simulations though, and that is reason enough to be optimistic about it.
Re:"stuck with a ...serial programming model" by Anonymous Coward · 2008-03-17 04:27 · Score: 0

someone who knows what their doing

"they're".

Methings your argument

"Methinks".
Re:"stuck with a ...serial programming model" by Unoti · 2008-03-17 10:49 · Score: 1

Isn't that generally true for anything you can possibly think of with programming languages, not just parallelism? Once a language has memory, loops, conditionals, and iterative execute-- couldn't we say that every single other language feature is just for "those who don't know what they're doing"? If the language makes things easier then it makes it easier for everyone, both the elite and the unwashed masses.

Why 1000 ? by anandpur · 2008-03-14 05:22 · Score: 1

Why not 1024, or 1000 cores will be enough ...

Re:Why 1000 ? by gdgib · 2008-03-14 05:38 · Score: 4, Informative

Actually RAMP Blue (the precursor to what ParLab used) had 1008. Good times. (http://ramp.eecs.berkeley.edu/index.php?pictures)

Basically 1000 is the goal, anything over that is a bonus. And yes, we like powers of 2 as much as you.

Linus Torvalds & Dave Patterson discuss it on by elwinc · 2008-03-14 05:26 · Score: 3, Interesting

Actually, this is old news. There's a month old discussion thread on RWT Discussion forum. Berkeley proposes the "thirteen dwarfs" - 13 kinds of test algorithms they consider valuable to parallelize. Linus doesn't think the 13 dwarfs correspond well to everyday computing loads. My 2 cents: Intel & others are spending hundreds of millions of bucks per year trying to speed up single-thread style computing, so it's not a bad idea to put a few more million/year into thousand thread computing.

--
--- Often in error; never in doubt!

NOW - Network Of Workstations by c0d3r · 2008-03-14 05:27 · Score: 1

I remember working on the NOW system http://now.cs.berkeley.edu/ . It's a distributed system and they have parallel programming languages such as Split-C or Titanium (parallel Java), and it supports MPI. I guess a network of those BEE3s would be called a bee hive?

Re:NOW - Network Of Workstations by gdgib · 2008-03-14 05:47 · Score: 1

No, but that's the name of one of our computers that we use with the BEE2s (http://bee2.eecs.berkeley.edu/).

We have "beehappy", "newbee", "beehive" and even "sting". The tradition of using bad puns to name computers lives on!

Obligatory... by Anonymous Coward · 2008-03-14 05:28 · Score: 0

Imagine a beowulf clus... never mind.

Real Information by gdgib · 2008-03-14 05:32 · Score: 5, Informative

The real websites are:
ParLab (what's being funded): http://parlab.eecs.berkeley.edu/
RAMP (the people who are building the architectural simulators for ParLab): http://ramp.eecs.berkeley.edu/
BEE2 (the precursor to the not-quite-so-microsoft BEE3): http://bee2.eecs.berkeley.edu/

The funding being announced here is for ParLab whose mission is to "solve the parallel programming problem". Basically they want to design new architectures, operating systems and languages. And before you get all "we tried that and it didn't work" there are some genuinely new ideas here and the wherewithal to make them work. ParLab grew out of the Berkeley View report (http://view.eecs.berkeley.edu/) which was the work of a very large group of people to standardize on the same language and figure out what the problems in parallel computing were. This included everyone from architecture to applications (e.g. the music department).

RAMP is a multi-university group working to build architectural simulators in FPGAs. In fact you can go download one such system right now called RAMP Blue (http://ramp.eecs.berkeley.edu/index.php?downloads). With ParLab starting up there will be another project RAMP Gold which will build a similar simulator but specifically designed for the architectures ParLab will be experimenting with.

As a side note, keep in mind when you read articles like this that statements like the "Microsoft BEE3" are amusing when you take into account that "B.E.E." stands for Berkeley Emulation Engine. Microsoft did a lot of the work and did a good job of it, but still...

Yes but... by jfbilodeau · 2008-03-14 05:42 · Score: 0

...does it run Linux?

Somehow, I doubt it.

--
Goodbye Slashdot. You've changed.

Re:Yes but... by gdgib · 2008-03-14 05:53 · Score: 1

The BEE2 does: http://bee2.eecs.berkeley.edu/wiki/Bee2LinuxKernel.html. The BEE3 won't. But we'll put a Linux computer right next to it, just for you. I promise.
Re:Yes but... by Anonymous Coward · 2008-03-14 12:31 · Score: 0

The BEE3 won't.

Is that a direct result of Microsoft involvement or is there a sound technical reason? Frankly, anything microsoft touches in parallel programming turns to shit, so watch out.

Beowulf Cluster... by andrewd18 · 2008-03-14 05:50 · Score: 1

1000 core machines? Imagine a beowulf cluster of those!

Re:Beowulf Cluster... by Bugs42 · 2008-03-14 06:37 · Score: 1

And yet, it still won't be able to run Crysis.

--
Programmer: an ingenious device that converts caffeine into code.

kinda silly by markhahn · 2008-03-14 06:12 · Score: 1

Intel and other chip vendors are pushing the manycore vision as The True Path Forward. this is disingenuous, since it's merely the easy path forward for said chip vendors. everyone agrees "morecore" will be common in the future, but 1k cores? definitely not clear. is it even meaningful to call it shared-memory programming if you have 1k cores? it's not as if 1k cores can ever sanely share particular data, at least not if it's ever written. and what's the value of 1k cores all sharing the same RO data?

this is not to say that there's no good work to be done, especially in programming tools. but you can do this all today, with current hardware, even uniprocessor hardware. after all, it's _always_ most interesting to debug parallel programs on hardware platforms that do parallelism poorly, since that exaggerates your hotspots.

IMO, we'll be putting cpus into dram chips before we have widespread manycore chips.

Re:kinda silly by gdgib · 2008-03-14 06:21 · Score: 3, Insightful

Interestingly enough, Dave Patterson http://www.eecs.berkeley.edu/Faculty/Homepages/patterson.html, once president of ACM http://membernet.acm.org/public/membernet/storypage_2.cfm?ci=June_2006&announcement=1&CFID=1668767&CFTOKEN=37941036 was once on a project to do that http://iram.cs.berkeley.edu/. Now he's working on ParLab http://parlab.eecs.berkeley.edu/. I don't always agree with him (and vice versa) but he's nobody's fool.

Faith, young grasshopper...

If you want a more technical reason DRAM and CPUs don't go together, spend an informative hour looking up the IC fab process for CMOS logic (CPUs) and DRAM. They're VERY VERY different. DRAM needs capacitor density to get the price-per-bit down so they use their own custom fabs optimized for that. This makes it really hard to fit lots of logic and DRAM on to one chip.

Cheap Bastards. by cyc · 2008-03-14 06:19 · Score: 4, Interesting

Rick Merritt, who wrote the lead article, also posted an opinion piece in EE Times lambasting Wintel for their lackluster funding efforts in parallel programming. I thoroughly agree with this guy. To quote:

Wintel should not just tease multiple researchers with a $10 million grant awarded to one institution. They need to significantly up the ante and fund multiple efforts.

Ten million is a drop in the bucket of the R&D budgets at Intel and Microsoft. You have to wonder about who is piloting the ship in Redmond these days when the company can afford a $44 billion bid for Yahoo to try to bolster its position in Web search but only spends $10 million to attack a needed breakthrough to save its core Windows business.

Use your GPU by TheSync · 2008-03-14 06:20 · Score: 4, Interesting

If you have a GeForce 8800 GT, you already have a 112 processor parallel computer that you can program using CUDA.

Re:Use your GPU by gdgib · 2008-03-14 06:30 · Score: 1

Ironically the folks in ParLab (*ahem*) just arranged a field trip to meet with some of the people who do that kind of work. We read slashdot too you know...

But try putting that GeForce in your cell phone. And don't come crying to us when your ass catches on fire from the hot cell phone in your back pocket. Or for that matter when your pants fall down from carrying the battery around.

ParLab (http://parlab.eecs.berkeley.edu/) is interested in MOBILE computing as well as your desktop.
Re:Use your GPU by TheSync · 2008-03-14 08:37 · Score: 1

Ironically the folks in ParLab (*ahem*) just arranged a field trip to meet with some of the people who do that kind of work.

Super...I'm hoping that GPUs can provide a cheap way for newcomers to learn parallel programming, and it appears that the GPU makers are really waking up to general purpose uses of GPUs.

(I learned parallel programming on a Connection Machine :)

PLINQ by boatboy · 2008-03-14 06:46 · Score: 1

Microsoft has actually released a library which I would imagine is related to this work. PLINQ lets you very easily and declaratively multithread tasks. http://msdn2.microsoft.com/en-us/magazine/cc163329.aspx

Re:Almost by gdgib · 2008-03-14 06:53 · Score: 1

The RAMP project (http://ramp.eecs.berkeley.edu/) tried that. We're actually looking for an old priest, a young priest and a couple of virgins. We haven't been able to get near the rack since we booted it up, and frankly the blood pouring from the faucets is starting to make a mess.

Reconfigurable Computing / FPGA Acceleration by fpgaprogrammer · 2008-03-14 06:57 · Score: 3, Informative

The BEE boards are being trumpeted as a multicore experimentation environment, but the FPGA itself is a powerful computational engine in its own right. FPGAs have to overcome the inertia of their history as verification tools for ASIC designs if they want to grow into being algorithm executors in their own right.

There's a growing community of FPGA programmers making accelerators for supercomputing applications. DRC (www.drccomputing.com) and XtremeData (www.xtremedatainc.com) both make co-processors for Opteron sockets with HyperTransport connections, and Cray uses these FPGA accelerators in their latest machines. There is even an active open standards body (www.openfpga.org).

FPGAs and multicore BOTH suffer from the lack of a good programming model. Any good programming model for multicore chips will also be a good programming model for FPGA devices. The underlying similarity here is the need to place dataflow graphs into a lattice of cells (be they fine-grained cells like FPGA CLBs or coarse-grained cells like a multicore processor). I can make a convincing argument that spreadsheets will be both the programming model and killer-app for future parallel computers: think scales with cells.
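To make the spreadsheet-as-dataflow idea a bit more concrete, here is a minimal sketch in Python (3.9+ for `graphlib`). It treats each cell as a node in a dependency graph and evaluates every cell whose inputs are ready as one parallel "wave". The cell names, the `evaluate_sheet` function and the `(deps, formula)` layout are all made up for illustration; a real placement onto CLBs or cores would be far more involved.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def evaluate_sheet(cells, workers=4):
    """Evaluate a toy spreadsheet as a dataflow graph.

    cells maps a cell name to (deps, formula), where deps is a tuple of
    cell names and formula takes the dependency values in that order.
    Returns (values, waves); each wave lists cells computed concurrently.
    """
    graph = {name: set(deps) for name, (deps, _) in cells.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    values, waves = {}, []

    def compute(name):
        deps, formula = cells[name]
        return formula(*(values[d] for d in deps))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            # Every cell returned here has all of its inputs available,
            # so the whole wave can run at the same time.
            ready = sorted(sorter.get_ready())
            for name, value in zip(ready, list(pool.map(compute, ready))):
                values[name] = value
            sorter.done(*ready)
            waves.append(ready)
    return values, waves


sheet = {
    "A1": ((), lambda: 2),
    "A2": ((), lambda: 3),
    "B1": (("A1", "A2"), lambda a, b: a + b),
    "B2": (("A1", "A2"), lambda a, b: a * b),
    "C1": (("B1", "B2"), lambda a, b: a + b),
}
values, waves = evaluate_sheet(sheet)
print(values["C1"], waves)
```

The width of each wave is the parallelism available at that step, which is roughly the quantity a placement tool would be trying to exploit.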

I've kept a blog on this stuff if you still care: fpgacomputing.blogspot.com

Parallel Computing is not magic by technienerd · 2008-03-14 08:04 · Score: 4, Insightful

I'm about to start a graduate degree in this area so I'm a little biased. However, I think a lot of problems can be solved in parallel. For example, maybe, LZW compression as it's implemented in the "zip" format might not be easily parallelizable but that doesn't prevent us from developing a compression algorithm with parallelism in mind. I did some undergraduate research in parallel search algorithms and I know for a fact that there are many, many ways you can parallelize search. Frankly, saying that you can't parallelize algorithms is a bit closed-minded. Many problems don't inherently require serial solutions, it's just that current algorithms handle them that way. Rather than trying to implement existing algorithms on a massively parallel processor, you want to re-tackle the problem under a new model, a model of an arbitrary number of processors. You build around the idea of data-parallelism rather than task-parallelism. Many, many things are possible under this model and I think it's naive to think otherwise. You don't need to think, how do I juggle 1000 threads around, you think, how do I take a problem, break it up into arbitrarily many chunks and distribute those chunks to an arbitrary number of processors and how do I do all that scheduling efficiently? This model doesn't work for interactive tasks mind you (where you're waiting for user input), but I'm very confident a model can be developed that can.
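As a rough illustration of "break it up into arbitrarily many chunks", here is a minimal Python sketch of block-wise compression: the input is cut into fixed-size chunks and each chunk is compressed independently, so the chunks can go to however many workers are available. It uses zlib (DEFLATE) purely as a stand-in compressor; `compress_parallel`, `decompress_parallel`, the chunk size and the worker count are illustrative names and values, not anything from the projects discussed here. The trade-off is that independent blocks cannot share a dictionary, so the ratio is usually somewhat worse than compressing the whole stream at once.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut data into independent blocks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_parallel(data: bytes, chunk_size: int = 1 << 16,
                      workers: int = 4) -> List[bytes]:
    """Compress each block on its own; no block depends on another."""
    chunks = split_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so blocks stay in sequence.
        return list(pool.map(zlib.compress, chunks))


def decompress_parallel(blocks: List[bytes], workers: int = 4) -> bytes:
    """Decompress the blocks independently and stitch them back together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))


payload = b"abc" * 100000
blocks = compress_parallel(payload, chunk_size=65536, workers=4)
restored = decompress_parallel(blocks)
print(len(blocks), restored == payload)
```

Nothing in the code cares whether `workers` is 4 or 1000; the number of chunks, not the number of threads, is what the programmer reasons about.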

Re:Parallel Computing is not magic by zymano · 2008-03-14 08:36 · Score: 1

Itanium handles it. Good optimistic thread by you.
Re:Parallel Computing is not magic by technienerd · 2008-03-14 08:56 · Score: 1

Itanium? The processor? A processor can only do so much on its own. Ultimately I think we need to reinvent sorting, searching, encryption, compression, FFT, etc. and teach undergrads how to design these algorithms to run on n processors where n is some arbitrary number. We need libraries that do this stuff for us as well. We also need programming languages and platforms like RapidMind to keep us in the "data-parallel" mindset. I've attended two talks by Professor David Patterson from UC Berkeley and in each case this is roughly the argument he's been giving major corporations and universities in those research areas.

Fine-Grain Parallelism Is the Future, Not Threads by MOBE2001 · 2008-03-14 08:07 · Score: 2, Insightful

Instead, you describe, using the toolset, the problem in a way which is decomposable, and the tools spread the work over the 1000+ cores.

One day soon, the computer industry will realize that, 150 years after Charles Babbage invented his idea of a general purpose sequential computer, it is time to move on and change to a new computing model. The industry will be dragged kicking and screaming into the 21st century. Threads were not originally intended to be the basis of a parallel software model but only a mechanism for running multiple sequential (not parallel) programs concurrently. The multithreading approach to parallel computing embraced by Intel, AMD and Microsoft is a disaster in the making because the future of computing is not multithreaded. See Nightmare on Core Street for more.

Why not just *run* Solaris? by h8sg8s · 2008-03-14 09:16 · Score: 0, Offtopic

Solaris 10 goes to at least 256 cores, Nevada much higher. Why not just run it rather than simulate?

--
Organization? You must be joking..

One step further by Chemisor · 2008-03-14 10:13 · Score: 1

We should be "stuck with a serial programming model". If your program runs too slow on a single 1 GHz CPU, lack of multicore techniques is the last thing you should be concerned about. The first thing you ought to do is optimize your damn code! There are very few applications that are CPU-bound, and in those that are, only one or two inner loops need parallelizing. The overwhelming majority of slow code is slow because you wrote it badly. So fix the software before blaming the hardware!

Re:One step further by SanityInAnarchy · 2008-03-14 13:12 · Score: 1

There are two things to think about here:

First, how much effort will it take to optimize it, versus throwing another core at it? Or computer? Not always an option, but take, say, Ruby on Rails -- it wouldn't scale to 1000 cores, but it might scale to 1000 separate machines. And yes, it could probably run four or five times faster -- and thus require four or five times less hardware -- but at what cost? Ruby itself is slow, and there are certain aspects of it which will always be slow.

But, you see, the advantage is that development is quicker -- enough to justify the extra hardware expense -- and as soon as your fast(er) Perl, C, or Java needs to scale past that one machine, you have to deal with the same issues. (Did you really think Slashdot runs on a single 1GHz CPU?)

Second: There are enough applications that are CPU-bound that you shouldn't go insulting people offhand about it. Raytracing (or pretty much any rendering) is massively parallelizable -- and again, it's that same problem. If you're very good, you might squeeze another 10% out of the renderer -- so the renderfarm will be 1350 beige boxes instead of 1500.

Oh, and the overwhelming majority of code which is slow because it's bad would gain nothing from multiple cores.

--
Don't thank God, thank a doctor!
Re:One step further by jgrahn · 2008-03-15 21:20 · Score: 1

We should be "stuck with a serial programming model". If your program runs too slow on a single 1 GHz CPU, lack of multicore techniques is the last thing you should be concerned about. The first thing you ought to do is optimize your damn code!

I believe SMP (when did we decide to start calling it "multicore"?) is mostly useful (in everyday computing!) for multiple-user machines.
Right now I'm running a number of important jobs on a four-CPU Linux server. At the same time, one guy has a process hogging 100% of one CPU (running an animated screen blanker or whatever). This guy is on parental leave and will be back in August.
I am happy that this process wasn't coded to take advantage of all four cores.

Re:Almost by kramulous · 2008-03-14 11:35 · Score: 2, Funny

for an old priest, a young priest and a couple of virgins

Oohh, Oohh. I know this one! First you take the virgins and the old priest across, and then ... dammit! Young priest masturbates.

First you take the ... dammit!

I wish I had more than one boat. Taking all across concurrently would be easier.

--
.

Re:Fine-Grain Parallelism Is the Future, Not Threa by SanityInAnarchy · 2008-03-14 12:51 · Score: 1

I don't think anyone except you said anything about threads. You may have just described exactly what the GP was describing -- point is, why should you have to break them down into individual programs yourself?

Personally, I like Erlang, but the point is the same -- come up with a toolset and/or programming paradigm which makes scaling to thousands of cores easy and natural.

The only problem I have yet to see addressed is how to properly test a threaded app, as it's non-deterministic.

--
Don't thank God, thank a doctor!

Where is Code Pink? by deuist · 2008-03-14 13:42 · Score: 1

Oh sure, Code Pink protests the military's presence in Berkeley, but leaves Microsoft free to enter the flagship university. What a bunch of commie pussies!

Re:Fine-Grain Parallelism Is the Future, Not Threa by MOBE2001 · 2008-03-14 13:45 · Score: 2, Insightful

I don't think anyone except you said anything about threads. You may have just described exactly what the GP was describing -- point is, why should you have to break them down into individual programs yourself?

This is precisely what is wrong with the current approach. The idea that the problem should be addressed from the top down has been tried for decades and has failed miserably. The idea that we should continue to write implicitly sequential programs and have some tool extract the parallelism by dividing the program into concurrent threads is completely ass-backwards, IMO. We should start with parallel elements and build implicitly parallel modules from those elements. Nothing needs to be broken down. Hardware engineers have been doing it for years and it works.

Personally, I like Erlang, but the point is the same -- come up with a toolset and/or programming paradigm which makes scaling to thousands of cores easy and natural.

Erlang is not the answer for many reasons, otherwise the entire computer world (especially the high performance parallel research community) would have jumped on it in earnest since it's been around for quite a while. One reason is that it uses a coarse-grain approach to parallelism; you can't even parallelize a quicksort routine (an ideal candidate for parallel processing) in Erlang. Consider also that Erlang is not deterministic and has no mechanism for automatic load balancing. The same goes for all the other functional programming language approaches to concurrency.
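For reference, the divide step of quicksort does map onto parallel tasks fairly directly. The following is a minimal sketch in Python, not Erlang, and it makes no claim about what Erlang can or cannot do. The names `parallel_quicksort` and `depth` are invented for illustration: after each partition, one half is handed to a new thread and the other is sorted in the current one, down to a fixed depth. In CPython the GIL means this shows the structure rather than a real speedup.

```python
import threading


def partition(xs):
    """Three-way partition around the middle element."""
    pivot = xs[len(xs) // 2]
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return less, equal, greater


def quicksort(xs):
    """Plain sequential quicksort, used once the depth budget is spent."""
    if len(xs) <= 1:
        return list(xs)
    less, equal, greater = partition(xs)
    return quicksort(less) + equal + quicksort(greater)


def parallel_quicksort(xs, depth=2):
    """Sort the two partitions concurrently for the first `depth` levels."""
    if depth <= 0 or len(xs) <= 1:
        return quicksort(xs)
    less, equal, greater = partition(xs)
    results = {}

    def run(key, part):
        results[key] = parallel_quicksort(part, depth - 1)

    # The two halves share no data, so they need no locking.
    helper = threading.Thread(target=run, args=("less", less))
    helper.start()
    run("greater", greater)
    helper.join()
    return results["less"] + equal + results["greater"]


print(parallel_quicksort([5, 3, 8, 1, 9, 2, 7], depth=2))
```

With `depth=d` this uses at most 2^d concurrent sorts, so the granularity is a tunable parameter rather than something fixed by the algorithm.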

The only problem I have yet to see addressed is how to properly test a threaded app, as it's non-deterministic.

The solution is not to use threads at all. They are not needed for parallelism. The difficulty of programming with threads is the primary reason that the multicore industry is in a panic right now. Multithreaded programs are a nightmare to maintain, especially if you did not write the original code. Ask the folks at Intel, Microsoft and AMD. And it's not for the lack of trying. Billions of dollars have been spent on making multithreaded parallel programming easy in the last two decades. They're still at it. What's wrong with this picture?

The 250 gigacore Intel processor by rice_burners_suck · 2008-03-14 17:46 · Score: 1

I really think that Intel needs to skip doing quad-core and whatever processors, and jump directly to doing a kilocore processor. Such a processor would have 1024 cores. It would be the pride of any self-respecting geek to own such a computer. Then they could improve on it by gradually going to two kilocores, four kilocores, etc. In a number of years, when the average computer processor has 250 gigacores, we'll laugh and poke fun at the good ol' days when 640 kilocores were enough for anyone.

stupid by nguy · 2008-03-14 19:20 · Score: 1

No more worrying if you incremented that semaphore correctly because you're operating at a much higher level.

You only need to "worry" about that if you insist on programming your multi-core machine in low-level C. Better solutions have existed for decades, people just don't use them. How is the BEE3 going to change that?

nonsense by nguy · 2008-03-14 19:27 · Score: 1

The fact of the matter is that multi-threaded programming is a common paradigm which takes advantage of multiple cores just fine.

Multi-threaded programming is cumbersome. There have been better ways of doing parallel programming for a long time.

Additionally, many algorithms cannot be parallelized.

Whether algorithms can be parallelized doesn't matter. What matters is whether there are parallel algorithms that solve problems faster than serial algorithms, and in most cases there are.

Even languages like Erlang which bring parallelization right to the front of the language are still stuck running serial operations serially.

They aren't "stuck" doing that, they do that because programmers find it convenient, not because they have to. There are many languages that don't even have a defined order of execution.

it's a hardware design problem.

Actually, it's a programmer education problem: most programmers have no idea what kinds of tools they have available for parallel programming, they have no idea how to use them, and they don't even understand what parallel programming paradigms exist. Like you, for example.

Re:Fine-Grain Parallelism Is the Future, Not Threa by SanityInAnarchy · 2008-03-15 16:29 · Score: 1

The idea that we should continue to write implicitly sequential programs and have some tool extract the parallelism by dividing the program into concurrent threads is completely ass-backwards, IMO.

Maybe so, but it's certainly not what I was suggesting.

Rather, I'm suggesting that we should have tools which make it easy to write a parallel model, even if individual tasks are sequential -- after all, they are ultimately executed in sequence on each core.

One reason is that it uses a coarse-grain approach to parallelism; you can't even parallelize a quicksort routine (an ideal candidate for parallel processing) in Erlang.

Can't? Or isn't easy to?

Consider also that Erlang is not deterministic and has no mechanism for automatic load balancing.

I suspect such a mechanism would be easier to build in Erlang than in most other modern languages. And parallel programming is inherently non-deterministic.

It might not be fast, though, as what immediately came to mind is a bunch of worker threads and one master thread -- workers notify the master when they're ready for more tasks.
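A minimal sketch of that master/worker idea, in Python rather than Erlang and with invented names (`run_master_worker`, `worker_fn`): the master fills a shared queue, and each worker pulls the next task whenever it finishes the previous one. Pulling from the queue stands in for "notify the master I'm ready", which is a simplification of the message-passing version described above. Fast workers naturally end up taking more tasks, which is a crude form of load balancing.

```python
import queue
import threading


def run_master_worker(tasks, worker_fn, n_workers=4):
    """Run worker_fn over tasks using a shared work queue.

    Returns the results in the same order as the input tasks.
    """
    tasks = list(tasks)
    task_queue = queue.Queue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # Asking for the next task is how a worker signals "ready".
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            value = worker_fn(task)
            with results_lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [results[i] for i in range(len(tasks))]


print(run_master_worker(range(1, 6), lambda x: x * x, n_workers=3))
```

Which worker handles which task varies from run to run, but the returned list does not, which is one common way of keeping the non-determinism out of the result.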

The solution is not to use threads at all.

I am referring to threads as the OS concept, not as a programming concept. That is: I am not talking about the Erlang processes, I'm talking about the real OS threads it uses (generally one per core, or just one). And I'm referring to threads as a generalization of OS-level processes.

Are you suggesting that a different CPU and/or OS architecture could be built which would make it possible to write deterministic, threaded programs? Or are you talking about an entirely language-level approach?

Or are you suggesting that we try to keep cranking up the clock?

--
Don't thank God, thank a doctor!
