Slashdot Mirror


BlueGene/L Puts the Hammer Down

OnePragmatist writes "Cyberinfrastructure Technology Watch is reporting that BlueGene/L has nearly doubled its performance to 135.3 Teraflops by doubling its processors. That seems likely to keep it at no. 1 on the Top500 when the next round comes out in June. But it will be interesting to see how it does when they finally get around to testing it against the HPC Challenge benchmark, which has gained adherents as being more indicative of how a HPC system will peform with various different types of applicatoins."

20 of 152 comments

  1. Finally... by Nos. · · Score: 5, Funny

    Maybe this thing can keep the WoW service running.

  2. Avoiding the obvious memes... by yuriismaster · · Score: 3, Funny

    How much processing power does one need for any certain application? I know that projects like World Community Grid need massive amounts of computing power, but seriously, 135 TFlops?

    ...ok I couldn't resist

    Imagine a beowulf cluster of these....

  3. similarities by teh_mykel · · Score: 4, Insightful

    does anyone else find the similarities between the computer hardware world and DragonballZ irritating? right when you think it's finally over, the best is exposed and found worthy, yet another difficulty comes up - along with the standard unfathomed power increases and bizarre advances. then it all happens again :/

    --
    this sig no verb
    1. Re:similarities by TetryonX · · Score: 5, Funny

      If the BlueGene/L can grant me any wish I want for collecting 7 of them, sign me up.

      --
      [!] No, I can't see my comments. They are not worthy of +3 moderation.
  4. Math Error? by mothlos · · Score: 5, Insightful
    Roughly as expected, BlueGene/L can now crank away at 135.3 trillion floating point operations per second (teraflops), up from the 70.72 teraflops it was doing at the end of 2004. BlueGene/L now has half of its planned processors and is more than half way to achieving its design goal of 360 teraflops.



    Is it just me or is 135.3 < 360 / 2?

    1. Re:Math Error? by Anonymous Coward · · Score: 5, Informative

      Let's see if I can get this right. I'm going to talk a little bit out of my ass now, but here goes:

      Every 512 node backplane has a peak of 1.4tflop/s according to their design doc.

      The 32k system benched in at 70. The theoretical peak was 91, so the actual performance is about 77 percent of the peak, which is pretty normal.

      The 64k system benched in at 135. The peak should be around 182, so that is 74 percent of the peak.

      The design goal is for 360 at 64k. I'm guessing that 360 is the peak, because my rough estimate calculations put the peak of 64k nodes at about 364. Let's be nice and assume 70% of peak in the actual machine. That would indicate around 255tflop/s at 64k nodes of actual performance, assuming the thing scales at about the same rate.

      So, they got their math right as long as they are claiming a peak of 360. That's a theoretical max, and so never actually reached. The actual is notably less. My numbers are estimates, and so 364 not equalling 360 doesn't bother me much in the end.
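For anyone who wants to redo the arithmetic above, here is a minimal sketch. It assumes the poster's rounded figure of 1.4 tflop/s per 512-node backplane, so the peaks come out slightly under the 91 and 182 quoted; the function names are made up for illustration.

```python
# Figures taken from the comment above. 1.4 is the poster's rounded
# per-backplane peak, so results land a little under the quoted 91 and 182.
PEAK_PER_BACKPLANE_TFLOPS = 1.4
NODES_PER_BACKPLANE = 512


def peak_tflops(nodes: int) -> float:
    """Theoretical peak for a machine built from 512-node backplanes."""
    return nodes / NODES_PER_BACKPLANE * PEAK_PER_BACKPLANE_TFLOPS


def efficiency(measured_tflops: float, peak: float) -> float:
    """Fraction of the theoretical peak actually achieved on the benchmark."""
    return measured_tflops / peak


def sustained_estimate(peak: float, fraction: float) -> float:
    """Expected benchmark result if the machine sustains `fraction` of peak."""
    return peak * fraction


for label, nodes, measured in [("32k", 32 * 1024, 70.72), ("64k", 64 * 1024, 135.3)]:
    peak = peak_tflops(nodes)
    print(f"{label}: peak ~{peak:.1f}, measured {measured}, "
          f"efficiency {efficiency(measured, peak):.0%}")

# The poster's guess for the finished machine: 70% of a 364 tflop/s peak.
print(f"full system: ~{sustained_estimate(364, 0.70):.0f} tflop/s sustained")
```

The point of the sketch is only that the quoted 360 makes sense as a peak number, not as a benchmark result.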

      Anyone care to correct anything?

      Also, can someone explain to me why Cray's Redstorm won't kick this thing's ass performance-wise? Redstorm should have 10k processors, but they are 64-bit Opteron 2.4GHz processors with 4x the RAM per node. These, I believe, are 700MHz processors.

      I'm confused because Redstorm only has a theoretical peak of about 40tflops off the 10k nodes. IBM's system at 10k should have a peak at around 30tflops. I'm wondering how 64-bit Opterons at more than 3x the clock speed could only be 10tflops faster at 10k nodes than 700MHz 32-bit PPCs with 1/4 the RAM. Can someone please explain?

      Also, anyone know why IBM isn't using HPCC? Cray has been using it for the XD1s. I'm guessing the reason IBM hasn't posted results is because they can't even come close in sustained memory bandwidth, MPI latency, and other tests. That's just a guess though. I'd love to hear from someone that actually knows about this stuff.

  5. Wait another year... by Anonymous Coward · · Score: 5, Interesting
    That's like, what, 527 Cell processors?

    Obviously that number's based on an unrealistic, 100% efficient scaling factor. But still. The 137 TFlop is coming from 64,000 processors.

    It's fun to think about what's just around the corner.
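The 527 figure can be reproduced with a quick division. This is a rough sketch that assumes the poster was using roughly 256 GFlops of single-precision peak per Cell, which is a guess at the input and not something stated in the comment; the 10% double-precision rate is the one quoted elsewhere in this thread.

```python
BLUEGENE_TFLOPS = 135.0   # rounded benchmark figure from the story
CELL_SP_GFLOPS = 256.0    # ASSUMPTION: single-precision peak per Cell
CELL_DP_FRACTION = 0.10   # DP rate relative to SP, as quoted in this thread


def chips_needed(target_tflops: float, chip_gflops: float) -> float:
    """Chips required at perfect (100% efficient) scaling."""
    return target_tflops * 1000.0 / chip_gflops


single = chips_needed(BLUEGENE_TFLOPS, CELL_SP_GFLOPS)
double = chips_needed(BLUEGENE_TFLOPS, CELL_SP_GFLOPS * CELL_DP_FRACTION)
print(f"single precision: ~{int(single)} chips")
print(f"double precision: ~{int(double)} chips")
```

Since Linpack is a double-precision benchmark, the second number is probably the fairer comparison, still under the same unrealistic perfect-scaling assumption.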

  6. Maybe it will be able to... by EvanED · · Score: 4, Funny

    ...host a spell check for Slashdot! ...as being more indicative of how a HPC system will peform with various different types of applicatoins."

    1. Re:Maybe it will be able to... by CoolGopher · · Score: 4, Funny

      I'm sure you mean a spelling checker. It's WoW that needs the spell check... combat resurrection for priests, anyone?!

  7. Windows HPC by Cruithne · · Score: 4, Funny

    Oh man, I *so* wanna put Windows HPC on this thing!

  8. Cell vs HPC by adam31 · · Score: 5, Insightful
    The HPC Challenge benchmark is especially interesting and I think sheds some light on the design goals IBM had in coming up with the Cell.

    1) Solving linear equations. SIMD Matrix math, check.
    2) DP Matrix-Matrix multiplies. IBM added DP support to their VMX set for Cell (though at 10% the execution rate), check.
    3) Processor/Memory bandwidth. XDR interface at 25.6 GB/s, check.
    4) Processor/Processor bandwidth. FlexIO interface at 76.8 GB/s, check.
    5) "measures rate of integer random updates of memory", hmmmm... not sure.
    6) Complex, DP FFT. Again, DP support at a price. check.
    7) Communication latency & bandwidth. 100 GB/s total memory bandwidth, check (though this could be heavily influenced by how IBM handles its SPE threading interface)

    Obviously, I'm not saying they used the HPC Challenge as a design document, but clearly Cell is meant as a supercomputer first and a PS3 second.

    1. Re:Cell vs HPC by shizzle · · Score: 3, Interesting
      2) DP Matrix-Matrix multiplies. IBM added DP support to their VMX set for Cell (though at 10% the execution rate), check.
      [...]
      ...clearly Cell is meant as a supercomputer first and a PS3 second.
      I think you've refuted your own argument there: double precision floating point performance is critical for true supercomputing. (In supercomputing circles DP and SP are often referred to as "full precision" and "half precision", respectively, which should give you a better idea of how they view things.)

      In contrast, SP is plenty of accuracy for things like rendering and game physics, since (very loosely speaking) as long as you're within a fraction of a pixel of the right answer you don't need any more accuracy.

      I'd say the Cell architecture is very well suited for supercomputing as well as gaming, but the announced Cell implementation appears to me to be clearly targeted at the PS3. They'll have to come out with a "Cell HPC Edition" that has much better DP performance before they take over supercomputing. Not that I don't expect that they're working on that as we speak...
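To make the SP versus DP point concrete, here is a small sketch that emulates single precision in plain Python by rounding through a 32-bit float after every operation. The helper names are invented for the example, and the loop is a toy accumulation, not a real physics or Linpack kernel.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total `n` times, optionally in single precision."""
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_single(total)  # emulate a 32-bit accumulator
    return total


exact = 100_000.0
dp = accumulate(0.1, 1_000_000, single=False)
sp = accumulate(0.1, 1_000_000, single=True)
print(f"double precision error: {abs(dp - exact):.3g}")
print(f"single precision error: {abs(sp - exact):.3g}")
```

One rounded value is off by only a few parts in a hundred million, which is plenty for placing a pixel. The trouble shows up when millions of rounded operations feed into each other, which is the normal case in long-running scientific codes.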

  9. Pics by identity0 · · Score: 4, Informative

    I found it odd that there aren't any pics of the machine on those sites, so I looked around... Here are some pics of the prototype at top, and the finished version at bottom. It looks like it's going to be in classic "IBM black", like the 2001 monolith : )

    Some more pics of the prototype.

    For comparison, the Earth simulator and big mac.

    Anyone know what kind of facilities blue gene will be housed at? The one for the earth simulator looks like something out of a movie, IBM better be able to compete on the 'cool factor'. : )

    And does anyone else get the warm and fuzzy feelings from looking at these pics, even though there's nothing you could possibly use that much power for? Ahhh, power...

  10. Re:But What about the Crays? by Hungry Admin · · Score: 4, Informative

    Not all problems are going to be solved faster by parallel computation. Some problems will be better solved on the 6 Tflop machine than with 10,000 slower CPUs.
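One common way to reason about this is Amdahl's law: whatever part of a job cannot be split up runs at the speed of a single CPU, no matter how many CPUs you own. The sketch below uses made-up machines (512 fast CPUs at 12 GFlops each, roughly 6 Tflops total, versus 10,000 CPUs at 2.8 GFlops each) purely for illustration; none of these numbers describe a real system.

```python
def time_to_solution(work_gflop: float, serial_fraction: float,
                     cpu_gflops: float, n_cpus: int) -> float:
    """Seconds to finish a job where `serial_fraction` of the work
    runs on one CPU and the rest is spread perfectly over all CPUs."""
    serial = work_gflop * serial_fraction / cpu_gflops
    parallel = work_gflop * (1.0 - serial_fraction) / (cpu_gflops * n_cpus)
    return serial + parallel


# HYPOTHETICAL machines, chosen only to illustrate the trade-off.
FEW_FAST = {"cpu_gflops": 12.0, "n_cpus": 512}       # ~6 Tflops peak
MANY_SLOW = {"cpu_gflops": 2.8, "n_cpus": 10_000}    # 28 Tflops peak

WORK = 1_000_000.0  # total job size in GFlop, arbitrary

for serial_fraction in (0.0, 0.001, 0.01):
    fast = time_to_solution(WORK, serial_fraction, **FEW_FAST)
    slow = time_to_solution(WORK, serial_fraction, **MANY_SLOW)
    winner = "few fast CPUs" if fast < slow else "many slow CPUs"
    print(f"serial {serial_fraction:.1%}: {fast:8.1f}s vs {slow:8.1f}s -> {winner}")
```

With these invented numbers, a job that is perfectly parallel favors the 10,000 CPUs, but even a 0.1% serial portion is enough to hand the win to the smaller machine.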

    --
    Be who you are and say what you feel, because the people who mind don't matter, and the people who matter don't mind.
  11. Re:Yes but what .. by OneDeeTenTee · · Score: 5, Funny

    and what type of frame rate do you get with Quake?

    It speculatively pre-renders every possible frame for the next 90 seconds.

    --
    Stop the world; I need to get off.
  12. Re:Can anyone please... by kromozone · · Score: 3, Informative

    Think of all the charges in a protein composed of hundreds of amino acids, each composed of dozens of atoms. Now imagine those charges interacting during protein folding, in a solution. Let's say that process takes a few milliseconds. Now imagine modeling this process at the femtosecond resolution. This system is severely underpowered.
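A back-of-envelope sketch of the scale involved: one millisecond simulated in femtosecond steps is on the order of 10^12 timesteps. The cost per timestep below (one billion floating point operations) is a purely hypothetical placeholder, since the real figure depends on the atom count and the force model.

```python
def timesteps(simulated_seconds: float, timestep_seconds: float) -> int:
    """Number of steps needed to cover the simulated time span."""
    return round(simulated_seconds / timestep_seconds)


def wall_clock_days(steps: int, flop_per_step: float,
                    sustained_tflops: float) -> float:
    """Days of machine time at a given sustained rate."""
    seconds = steps * flop_per_step / (sustained_tflops * 1e12)
    return seconds / 86_400


steps = timesteps(1e-3, 1e-15)   # 1 ms at 1 fs resolution
FLOP_PER_STEP = 1e9              # HYPOTHETICAL cost of one timestep
days = wall_clock_days(steps, FLOP_PER_STEP, 135.3)
print(f"{steps:.0e} timesteps, roughly {days:.0f} days at 135.3 TFlops")
```

Under that assumption a single millisecond of folding ties up the whole machine for months, and that assumes the benchmark rate is actually sustained on this kind of workload.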

  13. Top 500 and Auto Racing by ch-chuck · · Score: 4, Insightful

    It could be that the competition for the top slot of the 500 is becoming less of a technological achievement and more of just who has the most $$$ to spend. Just like auto racing used to be about improvements in engines, transmissions, etc., but after a point everybody could make a faster car just by buying more commonly available, well-known technology than the other guys. So they put in limitations for the races: only so big a venturi, displacement, etc.

    Anyway, my point is - it's becoming just "I can afford more processors than you can so I win" instead of the heyday of Seymour Cray when you really had to be talented to capture the #1 spot from IBM.

    --
    try { do() || do_not(); } catch (JediException err) { yoda(err); }
    1. Re:Top 500 and Auto Racing by ShadowFlyP · · Score: 5, Informative

      I think your comparison here is quite unfair to the technological accomplishments of BlueGene/L. This is not simply a case of IBM "throwing more processors" at the problem; BlueGene is a technological leap over other supercomputers. Not only is BlueGene faster than, for instance, the Earth Simulator, but it also consumes FAR LESS power (which in turn minimizes the energy wasted cooling the thing) and takes up much less space. From an article published when BlueGene first overcame the Earth Simulator: "Blue Gene/L's footprint is one per cent that of the Earth Simulator, and its power demands are just 3.6 per cent of the NEC supercomputer." http://www.theregister.co.uk/2004/09/29/supercomputer_ibm/ So, I say to you, NO! The top 500 race is not simply big companies throwing money at a problem (well, it sort of is), but there is quite a lot of technical accomplishment going on here. You could argue that the people involved may not have the brilliance of Seymour, but they sure do have real talent.

  14. Scalar performance -- Unimpressed! by tarpitcod · · Score: 3, Interesting

    What's the scalar performance of one of these beasties?

    Can an Athlon 64 / P4 beat it on scalar code? The whole HPC world has gotten boring since Cray died. Here's why I say that:

    The Cray 1 had the best SCALAR and VECTOR performance in the world.

    The Cray 2 was an ass kicker, the Cray 3 was a real ass kicker (if only they could build them reliably).

    Cray pushed the boundaries, he pushed them too far at some points -- designing and trying to build machines that they couldn't make reliable.

    So it'll be a cold day in hell before I get all fired up over the fact that someone else managed to glue together a bazillion 'killer micros' and win at Linpack...
    Now if someone would bring back the idea of transputers, or we saw some *real* efforts at Dataflow and FP then I'd be excited. I'd love a PC with 8 small, simple, fast, in-order tightly bound cpus. Don't say CELL, all indications are that they will be a *real* PITA to program to get any decent performance out of.

  15. Why BlueGene kicks RedStorm's ass on Linpack by RalphBNumbers · · Score: 4, Informative

    Well, it comes down to a few different things.

    First off, Opterons are pretty mediocre at double precision floating point benchmarks; it just isn't what they were designed for. Opterons effectively have only a single FPU (technically they have two, but one only does addition, while the other handles all multiplies), while most competing chips in the HPC arena have two full FPUs. They tend to get spanked by PPCs and Itanium2s, and even Xeons can do better.

    Also, you should note that the modified PPC440s in BlueGene have a disproportionate amount of floating point resources, making them about equivalent to the 970 in that area MHz for MHz, despite being massively outclassed in integer and vector ops. And the floating point units on those 440s are full 64-bit units (as FPUs are on many other ostensibly 32-bit chips, as the bit width of an FPU has nothing to do with the integer units and MMUs being 32-bit). Plus the PPC has a fused multiply-add instruction, allowing it to theoretically finish 2 FLOPs/unit/cycle, instead of just one.

    And finally, you should know that individual nodes' ram sizes matter very little for Linpack.

    When you take all that together, it's not too surprising that 700MHz PPC440s with 2 64-bit FPUs each finishing up to 2 FLOPs/cycle (at least 2 of which must be adds) would perform on par with 2.xGHz Opterons finishing a total of 2 FLOPs/cycle (at least one of which has to be an add).
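Putting numbers on that comparison: peak is just clock rate times floating point operations per cycle. The sketch below uses the per-cycle figures from this comment and assumes a 2.4GHz Opteron, the clock quoted earlier in the thread for Redstorm; these are theoretical peaks per chip, not benchmark results.

```python
def peak_gflops(clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak of one chip: clock rate times FLOPs per cycle."""
    return clock_ghz * flops_per_cycle


# PPC440 in BlueGene: 2 FPUs x 2 FLOPs each (fused multiply-add) = 4 per cycle.
ppc440 = peak_gflops(0.7, 2 * 2)
# Opteron: 2 FLOPs per cycle in total; 2.4GHz is an ASSUMED clock.
opteron = peak_gflops(2.4, 2)

print(f"PPC440 : {ppc440:.1f} GFlops peak per chip")
print(f"Opteron: {opteron:.1f} GFlops peak per chip")
print(f"clock ratio {2.4 / 0.7:.1f}x, peak ratio {opteron / ppc440:.1f}x")

# Scaled to 10,000 chips, in TFlops:
print(f"10k PPC440s : {ppc440 * 10_000 / 1000:.0f} TFlops")
print(f"10k Opterons: {opteron * 10_000 / 1000:.0f} TFlops")
```

So a clock advantage of more than 3x shrinks to well under 2x in peak floating point throughput, which is roughly the gap the earlier poster was puzzled by.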

    --
    "The worst tyrannies were the ones where a governance required its own logic on every embedded node." - Vernor Vinge