BlueGene/L Puts the Hammer Down
OnePragmatist writes "Cyberinfrastructure Technology Watch is reporting that BlueGene/L has nearly doubled its performance to 135.3 Teraflops by doubling its processors. That seems likely to keep it at no. 1 on the Top500 when the next round comes out in June. But it will be interesting to see how it does when they finally get around to testing it against the HPC Challenge benchmark, which has gained adherents as being more indicative of how a HPC system will peform with various different types of applicatoins."
Maybe this thing can keep the WoW service running.
How much processing power does one need for any given application? I know that projects like World Community Grid need massive amounts of computing power, but seriously, 135 TFlops?
...ok I couldn't resist
Imagine a beowulf cluster of these....
Does anyone else find the similarities between the computer hardware world and DragonballZ irritating? Right when you think it's finally over, the best is exposed and found worthy, yet another difficulty comes up - along with the standard unfathomed power increases and bizarre advances. Then it all happens again :/
this sig no verb
Is it just me or is 135.3 * 2 < 360?
Obviously that number's based on an unrealistic, 100% efficient scaling factor. But still. The 137 TFlop is coming from 64,000 processors.
It's fun to think about what's just around the corner.
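For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the 360 in the comment is the projected figure for the full-size machine and that going from the current system to the full one is a straight doubling; neither is stated in the story, and the function name is made up for illustration.

```python
def scaling_efficiency(measured_tflops, scale_factor, projected_tflops):
    """Fraction of the projected figure reached by linear extrapolation."""
    extrapolated = measured_tflops * scale_factor
    return extrapolated / projected_tflops


measured = 135.3   # TFlops quoted in the story for the current machine
projected = 360.0  # assumed: the full-machine figure the comment compares against

efficiency = scaling_efficiency(measured, 2, projected)
print(f"linear extrapolation: {measured * 2:.1f} TFlops")
print(f"fraction of projected figure: {efficiency:.1%}")
```

Under those assumptions, even perfect doubling of the measured number lands well short of the projected one, which is the gap the comment is pointing at.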
...host a spell check for Slashdot! ...as being more indicative of how a HPC system will peform with various different types of applicatoins."
Oh man, I *so* wanna put Windows HPC on this thing!
1) Solving linear equations. SIMD Matrix math, check.
2) DP Matrix-Matrix multiplies. IBM added DP support to their VMX set for Cell (though at 10% the execution rate), check.
3) Processor/Memory bandwidth. XDR interface at 25.6 GB/s, check.
4) Processor/Processor bandwidth. FlexIO interface at 76.8 GB/s, check.
5) "measures rate of integer random updates of memory", hmmmm... not sure.
6) Complex, DP FFT. Again, DP support at a price. check.
7) Communication latency & bandwidth. 100 GB/s total memory bandwidth, check (though this could be heavily influenced by how IBM handles its SPE threading interface)
Obviously, I'm not saying they used the HPC Challenge as a design document, but clearly Cell is meant as a supercomputer first and a PS3 second.
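Item 5 is the one left open in the list, so here is a rough sketch of what an "integer random updates of memory" kernel looks like. This is not the benchmark's actual code: the generator is a plain 64-bit linear congruential generator standing in for whatever random stream the real test uses, and the function name and table size are made up for illustration.

```python
def random_access_updates(table, num_updates, seed=1):
    """XOR pseudo-random 64-bit values into pseudo-random table slots, in place."""
    size = len(table)
    assert size >= 2 and size & (size - 1) == 0, "table size must be a power of two"
    bits = size.bit_length() - 1
    mask64 = (1 << 64) - 1
    value = seed
    for _ in range(num_updates):
        # Knuth's MMIX LCG constants; a stand-in for the benchmark's own stream.
        value = (value * 6364136223846793005 + 1442695040888963407) & mask64
        index = value >> (64 - bits)   # top bits choose the slot
        table[index] ^= value          # one read-modify-write per update
    return table


table = list(range(1 << 10))
random_access_updates(table, 4 * len(table))
```

Each update touches one word at an effectively unpredictable address, so caches and wide SIMD units help very little; the result depends mostly on memory latency and how many independent accesses the hardware can keep in flight. Running the same update stream a second time XORs every value back out, which is a handy way to check such a kernel.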
I found it odd that there aren't any pics of the machine on those sites, so I looked around... Here are some pics of the prototype at top, and the finished version at bottom. It looks like it's going to be in classic "IBM black", like the 2001 monolith : )
Some more pics of the prototype.
For comparison, the Earth Simulator and Big Mac.
Anyone know what kind of facilities Blue Gene will be housed at? The one for the Earth Simulator looks like something out of a movie; IBM better be able to compete on the 'cool factor'. : )
And does anyone else get the warm and fuzzy feelings from looking at these pics, even though there's nothing you could possibly use that much power for? Ahhh, power...
Not all problems are going to be solved faster by parallel computation. Some problems will be better solved on the 6 Tflop machine than with 10,000 slower CPUs.
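One common way to put numbers on this is Amdahl's law. The sketch below is illustrative only: the per-CPU speeds and the parallel fraction are invented, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup over one worker when only part of the job can be spread out."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def run_time(work, per_cpu_rate, parallel_fraction, workers):
    """Wall-clock time for `work` units on `workers` CPUs of speed `per_cpu_rate`."""
    single_cpu_time = work / per_cpu_rate
    return single_cpu_time / amdahl_speedup(parallel_fraction, workers)


# Hypothetical job that is 95% parallelizable.
few_fast = run_time(work=1e6, per_cpu_rate=100.0, parallel_fraction=0.95, workers=64)
many_slow = run_time(work=1e6, per_cpu_rate=1.0, parallel_fraction=0.95, workers=10_000)
print(f"64 fast CPUs:      {few_fast:.0f} time units")
print(f"10,000 slow CPUs:  {many_slow:.0f} time units")
```

With a 5% serial portion the speedup can never exceed 20x no matter how many CPUs are added, so the serial part running on a slow CPU ends up dominating the total.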
Be who you are and say what you feel, because the people who mind don't matter, and the people who matter don't mind.
and what type of frame rate do you get with Quake?
It speculatively pre-renders every possible frame for the next 90 seconds.
Stop the world; I need to get off.
Think of all the charges in a protein composed of hundreds of amino acids, each composed of dozens of atoms. Now imagine those charges interacting during protein folding, in a solution. Let's say that process takes a few milliseconds. Now imagine modeling this process at femtosecond resolution. This system is severely underpowered.
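A back-of-the-envelope version of that estimate, as a sketch only: it assumes naive all-pairs interactions with no cutoffs, a made-up cost per pair, and a machine running at its full quoted rate the whole time. The atom count, the operations per pair, and the function names are all placeholders, and solvent atoms are not included.

```python
def timesteps(simulated_seconds, timestep_seconds):
    """Number of steps needed to cover the simulated time span."""
    return simulated_seconds / timestep_seconds


def wall_clock_seconds(atoms, simulated_seconds, timestep_seconds,
                       flops_per_pair, machine_flops):
    """Run time if every pair of atoms is evaluated on every step."""
    pairs = atoms * (atoms - 1) / 2
    total_flops = timesteps(simulated_seconds, timestep_seconds) * pairs * flops_per_pair
    return total_flops / machine_flops


steps = timesteps(1e-3, 1e-15)  # one millisecond at femtosecond resolution
seconds = wall_clock_seconds(
    atoms=300 * 20,             # placeholder: 300 residues, 20 atoms each, no solvent
    simulated_seconds=1e-3,
    timestep_seconds=1e-15,
    flops_per_pair=10,          # placeholder cost of one pair interaction
    machine_flops=135.3e12,     # the Linpack figure from the story
)
print(f"{steps:.0e} steps, roughly {seconds / 86400:.0f} days at the quoted rate")
```

The step count alone is on the order of a trillion per simulated millisecond. Adding the surrounding water multiplies the atom count, and with all-pairs interactions the cost grows with the square of that.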
It could be that the competition for the top of the 500 slot is becoming less of a technological achievement and more of just who has the most $$$ to spend. Just like auto racing used to be about improvements in engines and transmissions etc, but after a point everybody could make a faster car just by buying more commonly available, well known technology than the other guys. So they put in limitations for the races: only so big a venturi, displacement, etc.
Anyway, my point is - it's becoming just "I can afford more processors than you can so I win" instead of the heyday of Seymour Cray, when you really had to be talented to capture the #1 spot from IBM.
try { do() || do_not(); } catch (JediException err) { yoda(err); }
What's the scalar performance of one of these beasties?
Can an Athlon 64 / P4 beat it on scalar code? The whole HPC world has gotten boring since Cray died. Here's why I say that:
The Cray 1 had the best SCALAR and VECTOR performance in the world.
The Cray 2 was an ass kicker, the Cray 3 was a real ass kicker (if only they could build them reliably).
Cray pushed the boundaries, he pushed them too far at some points -- designing and trying to build machines that they couldn't make reliable.
So it'll be a cold day in hell before I get all fired up over the fact that someone else managed to glue together a bazillion 'killer micros' and win at Linpack...
Now if someone would bring back the idea of transputers, or we saw some *real* efforts at Dataflow and FP then I'd be excited. I'd love a PC with 8 small, simple, fast, in-order tightly bound cpus. Don't say CELL, all indications are that they will be a *real* PITA to program to get any decent performance out of.
Well, it comes down to a few different things.
First off, Opterons are pretty mediocre at double precision floating point benchmarks; it just isn't what they were designed for. Opterons effectively have only a single FPU (technically they have two, but one only does addition, while the other handles all multiplies), while most competing chips in the HPC arena have two full FPUs. They tend to get spanked by PPCs and Itanium2s, and even Xeons can do better.
Also, you should note that the modified PPC440s in BlueGene have a disproportionate amount of floating point resources, making them about equivalent to the 970 in that area MHz for MHz, despite being massively outclassed in integer and vector ops. And the floating point units on those 440s are full 64-bit units (as FPUs are on many other ostensibly 32-bit chips, since the bit width of an FPU has nothing to do with the integer units and MMUs being 32-bit). Plus the PPC has a fused multiply-add instruction, allowing it to theoretically finish 2 FLOPs/unit/cycle instead of just one.
And finally, you should know that individual nodes' ram sizes matter very little for Linpack.
When you take all that together, it's not too surprising that 700 MHz PPC440s with 2 64-bit FPUs each finishing up to 2 FLOPs/cycle (at least 2 of which must be adds) would perform on par with 2.x GHz Opterons finishing a total of 2 FLOPs/cycle (at least one of which has to be an add).
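The peak numbers implied by that comparison can be worked out directly. This is a sketch using only the per-cycle figures given above; the Opteron clock is only given as "2.x", so the 2.0 and 2.4 GHz values below are placeholders, and the function name is made up.

```python
def peak_gflops(clock_ghz, fpus, flops_per_fpu_per_cycle):
    """Theoretical peak: clock rate times floating point results per cycle."""
    return clock_ghz * fpus * flops_per_fpu_per_cycle


# PPC440 as described above: 2 FPUs, each retiring a fused multiply-add (2 FLOPs).
ppc440 = peak_gflops(0.7, fpus=2, flops_per_fpu_per_cycle=2)
print(f"PPC440 @ 0.7 GHz: {ppc440:.1f} GFLOPS peak")

# Opteron as described above: one adder plus one multiplier, 1 FLOP each per cycle.
for clock in (2.0, 2.4):  # placeholder clocks for the unspecified "2.x"
    opteron = peak_gflops(clock, fpus=2, flops_per_fpu_per_cycle=1)
    print(f"Opteron @ {clock} GHz: {opteron:.1f} GFLOPS peak")
```

These are theoretical peaks only. The fused multiply-add figure assumes every instruction is a paired multiply and add, which dense matrix code like Linpack gets close to and most other code does not.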
"The worst tyrannies were the ones where a governance required its own logic on every embedded node." - Vernor Vinge