BlueGene/L Puts the Hammer Down



Posted by Hemos on Thursday March 24, 2005 @07:29PM from the welcome-to-the-machine dept.

OnePragmatist writes "Cyberinfrastructure Technology Watch is reporting that BlueGene/L has nearly doubled its performance to 135.3 Teraflops by doubling its processors. That seems likely to keep it at no. 1 on the Top500 when the next round comes out in June. But it will be interesting to see how it does when they finally get around to testing it against the HPC Challenge benchmark, which has gained adherents as being more indicative of how an HPC system will perform with different types of applications."

15 of 152 comments

Re:Math Error? by pacslash · 2005-03-24 19:58 · Score: 1, Informative

135.3 * 2 = 270.6 (about 271). 360 / 2 = 180. It's just you.
Pics by identity0 · 2005-03-24 20:39 · Score: 4, Informative

I found it odd that there aren't any pics of the machine on those sites, so I looked around... Here are some pics of the prototype at top, and the finished version at bottom. It looks like it's going to be in classic "IBM black", like the 2001 monolith :)

Some more pics of the prototype.

For comparison, the Earth Simulator and Big Mac.

Does anyone know what kind of facilities Blue Gene will be housed in? The one for the Earth Simulator looks like something out of a movie; IBM had better be able to compete on the 'cool factor'. :)

And does anyone else get the warm and fuzzy feelings from looking at these pics, even though there's nothing you could possibly use that much power for? Ahhh, power...
1. Re:Pics by Anonymous Coward · 2005-03-25 02:47 · Score: 2, Informative
  
  I work at IBM in Rochester, Minnesota, where the machines are built and housed, and I have seen the machines that are being shipped out and installed at Lawrence Livermore and other places... it is an awesome sight. It is VERY LOUD with the huge fans above it, and the floor in the building had to be dug down 4 feet to allow the cabling and air ducts to run underneath everything. What surprised me most is not how fast it is, but how well they were able to get it to scale using fairly low-power processors. The power is not in the processors themselves, which are low-power to conserve electricity and reduce heat, but in how amazingly well it scales.
Re:Math Error? by Anonymous Coward · 2005-03-24 20:43 · Score: 5, Informative

Let's see if I can get this right. I'm going to talk a little bit out of my ass now, but here goes:

Every 512-node backplane has a peak of 1.4 Tflop/s according to their design doc.

The 32k system benched in at 70. The theoretical peak was 91, so the actual performance is about 77 percent of the peak, which is pretty normal.

The 64k system benched in at 135. The peak should be around 182, so that is 74 percent of the peak.

The design goal is for 360 at 64k. I'm guessing that 360 is the peak, because my rough estimate calculations put the peak of 64k nodes at about 364. Let's be nice and assume 70% of peak in the actual machine. That would indicate around 255 Tflop/s of actual performance at 64k nodes, assuming the thing scales at about the same rate.

So, they got their math right as long as they are claiming a peak of 360. That's a theoretical max, and so never actually reached. The actual is notably less. My numbers are estimates, and so 364 not equalling 360 doesn't bother me much in the end.
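
To make the arithmetic above easy to rerun, here is a small Python sketch. It only restates the commenter's own rough figures (70/91, 135/182, and a 70% efficiency applied to a ~364 Tflop/s peak); the function names are made up for illustration, and none of the inputs are official IBM numbers.

```python
# Linpack efficiency = measured Rmax / theoretical Rpeak.
# All inputs are the rough Tflop/s figures quoted in the comment, not official data.

def efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Fraction of theoretical peak actually achieved."""
    return rmax_tflops / rpeak_tflops


def projected_sustained(rpeak_tflops: float, assumed_efficiency: float) -> float:
    """Sustained Tflop/s if the full machine keeps the assumed fraction of peak."""
    return rpeak_tflops * assumed_efficiency


eff_32k = efficiency(70.0, 91.0)                 # roughly 0.77
eff_64k = efficiency(135.0, 182.0)               # roughly 0.74
full_machine = projected_sustained(364.0, 0.70)  # roughly 255 Tflop/s

print(f"32k: {eff_32k:.0%}  64k: {eff_64k:.0%}  full machine at 70%: {full_machine:.0f} Tflop/s")
```

Running it prints 77%, 74% and 255 Tflop/s, matching the hand estimates.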

Anyone care to correct anything?

Also, can someone explain to me why Cray's Redstorm won't kick this thing's ass performance-wise? Redstorm should have 10k processors, but they are 64-bit 2.4 GHz Opteron processors with 4x the RAM per node. These, I believe, are 700 MHz processors.

I'm confused because Redstorm only has a theoretical peak of about 40 Tflops from its 10k nodes. IBM's system at 10k should have a peak of around 30 Tflops. I'm wondering how 64-bit Opterons at more than 3x the clock speed could be only 10 Tflops faster at 10k nodes than 700 MHz 32-bit PPCs with 1/4 the RAM. Can someone please explain?

Also, does anyone know why IBM isn't using HPCC? Cray has been using it for the XD1s. I'm guessing the reason IBM hasn't posted results is that they can't even come close in sustained memory bandwidth, MPI latency, and other tests. That's just a guess, though. I'd love to hear from someone who actually knows about this stuff.
Re:But What about the Crays? by Hungry+Admin · 2005-03-24 22:23 · Score: 4, Informative

Not all problems are going to be solved faster by parallel computation. Some problems will be better solved on the 6 Tflop machine than with 10,000 slower CPUs.
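
One standard way to see this (the poster doesn't name it) is Amdahl's law: if a fraction s of a job is inherently serial, N processors can speed it up by at most 1 / (s + (1 - s) / N). The sketch below compares 10,000 unit-speed CPUs against 64 CPUs that are each 10x faster; the 5% serial fraction, the CPU counts and the 10x factor are all hypothetical numbers chosen for illustration.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Amdahl's law: speedup over a single processor of the same speed."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


def relative_throughput(per_cpu_speed: float, serial_fraction: float, n_procs: int) -> float:
    """Overall speed relative to one unit-speed CPU running the whole job."""
    return per_cpu_speed * amdahl_speedup(serial_fraction, n_procs)


serial = 0.05  # hypothetical: 5% of the work cannot be parallelized

many_slow = relative_throughput(1.0, serial, 10_000)  # 10,000 slow CPUs
few_fast = relative_throughput(10.0, serial, 64)      # 64 CPUs, each 10x faster

print(f"10,000 slow CPUs: {many_slow:.1f}x   64 fast CPUs: {few_fast:.1f}x")
```

With these numbers the big slow machine tops out just under 20x (it can never beat 1/s), while the small fast one reaches roughly 154x.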

--
Be who you are and say what you feel, because the people who mind don't matter, and the people who matter don't mind.
Re:Can anyone please... by kromozone · 2005-03-24 23:17 · Score: 3, Informative

Think of all the charges in a protein composed of hundreds of amino acids, each composed of dozens of atoms. Now imagine those charges interacting during protein folding, in a solution. Let's say that process takes a few milliseconds. Now imagine modeling this process at femtosecond resolution. This system is severely underpowered.
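
A quick back-of-the-envelope count shows why. The sketch assumes a 1 fs integration step and takes 1 ms as the folding time (both illustrative choices; the comment only says "a few milliseconds"), and counts naive all-pairs interactions for a hypothetical 5,000-atom protein, ignoring the solvent and the cutoff tricks real codes use.

```python
FEMTOSECOND = 1e-15  # seconds
MILLISECOND = 1e-3   # seconds


def timesteps(duration_s: float, step_s: float) -> int:
    """Number of integration steps needed to cover duration_s."""
    return round(duration_s / step_s)


def pair_count(n_atoms: int) -> int:
    """Distinct atom pairs, i.e. naive all-pairs interactions per step."""
    return n_atoms * (n_atoms - 1) // 2


steps = timesteps(1 * MILLISECOND, 1 * FEMTOSECOND)  # 10**12 steps
pairs = pair_count(5_000)                             # hypothetical atom count
print(f"{steps:.1e} steps x {pairs:,} pairs = {steps * pairs:.1e} pair evaluations")
```

A trillion steps, each touching millions of pairs, is the scale being described.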
Re:Math Error? by RaffiRai · 2005-03-25 00:13 · Score: 2, Informative

I would guess that the layout/grid-computing data-flow arrangement might have as much effect as the sheer number of processors, if not more, when you're working on something like that.

It seems to me that since the device isn't complete the data management isn't working under optimal conditions.
Re:Cell vs HPC by Flaming+Death · 2005-03-25 01:02 · Score: 1, Informative

Yeah, it is, until you realise its single-precision (SP) peak is 256 Gflops... so even at a modest 25 Gflops it outperforms most cores quite well. Cells are obviously built for clusters/multiple connected cores, though. Theoretically, then, you only need 5,400-odd cores to get the same 136 Tflop cap (I refer to cores here, since most incarnations are going to have 2, 4, 8 or 16 Cells onboard). Still a fairly decent improvement...
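
The core count is just the target divided by per-chip throughput. A minimal sketch using the commenter's figures (136 Tflop/s target, a "modest" 25 Gflop/s sustained, and the 256 Gflop/s single-precision peak); `cells_needed` is a hypothetical helper, and since Linpack is run in double precision, the SP-peak case is only an optimistic bound.

```python
import math


def cells_needed(target_tflops: float, per_cell_gflops: float) -> int:
    """Whole chips required to reach target_tflops at per_cell_gflops each."""
    return math.ceil(target_tflops * 1000 / per_cell_gflops)


print(cells_needed(136, 25))   # at the commenter's 'modest' 25 Gflop/s
print(cells_needed(136, 256))  # at the single-precision peak
```

The first call gives 5,440 (the "5,400-odd" above); the second, 532, is the best case that the next reply in the thread pushes back on.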
Re:Wait another year... by imsabbel · 2005-03-25 01:11 · Score: 2, Informative

Well, to be fair, Cell uses some fairly stoned tricks to get to that kind of peak power. (The massive memory bandwidth is only to small local memories, which anywhere else you would call cache, and the main memory bandwidth is laughable compared to the computing resources.)
Although Linpack is very easy to parallelize, I don't think it would be possible to get even 10% of the theoretical rate on a Cell.

--
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
Re:Top 500 and Auto Racing by ShadowFlyP · 2005-03-25 02:38 · Score: 5, Informative

I think your comparison here is quite unfair to the technological accomplishments of BlueGene/L. This is not simply a case of IBM "throwing more processors" at the problem; BlueGene is a technological leap over other supercomputers. Not only is BlueGene faster than, for instance, the Earth Simulator, but it also consumes FAR LESS power (which in turn minimizes the energy wasted cooling the thing) and takes up much less space. From an article published when BlueGene first overtook the Earth Simulator: "Blue Gene/L's footprint is one per cent that of the Earth Simulator, and its power demands are just 3.6 per cent of the NEC supercomputer." http://www.theregister.co.uk/2004/09/29/supercomputer_ibm/ So, I say to you, NO! The Top500 race is not simply big companies throwing money at a problem (well, it sort of is); there is quite a lot of technical accomplishment going on here. You could argue that the people involved may not have the brilliance of Seymour, but they sure do have real talent.
Re:One in every home by TheRealFoxFire · 2005-03-25 03:32 · Score: 2, Informative

The definition of a supercomputer is a moving line. At any given time, a supercomputer is usually just a machine with an order of magnitude more CPU throughput than a PC. This neglects things supercomputers have that desktops don't, like massive I/O capabilities, but in terms of CPU performance, today's desktops are usually as fast as the supercomputers of the past.
Why BlueGene kicks RedStorm's ass on Linpack by RalphBNumbers · 2005-03-25 03:36 · Score: 4, Informative

Well, it comes down to a few different things.

First off, Opterons are pretty mediocre at double-precision floating-point benchmarks; it just isn't what they were designed for. Opterons effectively have only a single FPU (technically they have two, but one only does addition, while the other handles all multiplies), while most competing chips in the HPC arena have two full FPUs. They tend to get spanked by PPCs and Itanium2s, and even Xeons can do better.

Also, you should note that the modified PPC440s in BlueGene have a disproportionate amount of floating-point resources, making them about equivalent to the 970 in that area MHz for MHz, despite being massively outclassed in integer and vector ops. And the floating-point units on those 440s are full 64-bit units (as FPUs are on many other ostensibly 32-bit chips, since the bit width of an FPU has nothing to do with the integer units and MMUs being 32-bit). Plus, the PPC has a fused multiply-add instruction, allowing it to theoretically finish 2 FLOPs/unit/cycle instead of just one.

And finally, you should know that individual nodes' ram sizes matter very little for Linpack.

When you take all that together, it's not too surprising that 700 MHz PPC440s with 2 64-bit FPUs, each finishing up to 2 FLOPs/cycle (at least 2 of which must be adds), would perform on par with 2.x GHz Opterons finishing a total of 2 FLOPs/cycle (at least one of which has to be an add).
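
The same arithmetic in code: peak = clock x FPUs x FLOPs per FPU per cycle, with a fused multiply-add counting as 2. The per-chip parameters follow the description above. The 2.0 GHz Opteron clock is an assumption: it is the clock implied by the ~40 Tflop peak for 10k Opterons quoted earlier in the thread, and it sits inside the "2.x GHz" range (at 2.4 GHz the same formula gives 48 Tflops). `peak_gflops` and `system_tflops` are illustrative names.

```python
def peak_gflops(clock_ghz: float, fpus: int, flops_per_fpu_cycle: int) -> float:
    """Theoretical peak of one chip in Gflop/s."""
    return clock_ghz * fpus * flops_per_fpu_cycle


def system_tflops(per_chip_gflops: float, chips: int) -> float:
    """Theoretical peak of many identical chips in Tflop/s."""
    return per_chip_gflops * chips / 1000


# PPC440 in BlueGene/L: 2 FPUs, each doing a fused multiply-add (2 FLOPs) per cycle.
ppc440 = peak_gflops(0.7, 2, 2)
# Opteron, modeled as one add pipe plus one multiply pipe: 2 FLOPs/cycle in total.
opteron = peak_gflops(2.0, 2, 1)

print(f"PPC440: {ppc440:.1f} Gflop/s, Opteron: {opteron:.1f} Gflop/s")
print(f"10k PPC440s: {system_tflops(ppc440, 10_000):.0f} Tflop/s, "
      f"10k Opterons: {system_tflops(opteron, 10_000):.0f} Tflop/s")
```

That reproduces the roughly 30 vs 40 Tflop gap from the question: the Opteron's ~3x clock advantage is mostly cancelled by doing half as many FLOPs per cycle.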

--
"The worst tyrannies were the ones where a governance required its own logic on every embedded node." - Vernor Vinge
Tip of the iceberg... by Anonymous Coward · 2005-03-25 04:45 · Score: 1, Informative

The 70.72 TF BlueGene/L that debuted on the November list is only 16 of the 64 racks of the full machine (25%). BlueGene/L was to be delivered in stages and be a 131,072-CPU system when complete (64 racks * 2048 CPUs per rack). The beasty will be well over 200 TF sustained Linpack when it is completed. Oh, and it is binary compatible with System X at Virginia Tech.
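
The rack arithmetic, plus the simplest possible projection, as a sketch. Perfectly linear scaling from the 16-rack result is an optimistic assumption (the efficiency figures earlier in the thread slip slightly as the machine grows), so treat the result as an upper bound rather than a forecast.

```python
RACKS_TOTAL = 64
CPUS_PER_RACK = 2048
RACKS_ON_NOVEMBER_LIST = 16
NOVEMBER_RMAX_TFLOPS = 70.72

total_cpus = RACKS_TOTAL * CPUS_PER_RACK                # 131,072 CPUs
fraction_listed = RACKS_ON_NOVEMBER_LIST / RACKS_TOTAL  # 0.25

# Upper bound: assume sustained Linpack grows exactly in proportion to rack count.
linear_full_machine = NOVEMBER_RMAX_TFLOPS / fraction_listed

print(f"{total_cpus:,} CPUs, {fraction_listed:.0%} on the list, "
      f"linear projection {linear_full_machine:.1f} TF")
```

The ~283 TF linear figure is consistent with "well over 200 TF", and the ~255 TF estimate made earlier at 70% of peak sits below it, as you'd expect if efficiency drops a little with scale.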
Re:Can anyone please... by overunderunderdone · 2005-03-25 08:28 · Score: 2, Informative

Because the mechanism by which the DNA blueprint is actually used to create proteins (in a mechanism that itself uses proteins) is *spectacularly* complex.

"IBM estimates that the folding model for a 300-residue protein will encompass more than one billion forces acting over one trillion time steps. Even for Blue Gene, modeling such a folding process is expected to take about a year of around-the-clock processing."
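
Multiplying the two quoted figures gives the total amount of work. The sketch assumes every one of the billion forces is evaluated at every one of the trillion steps (the quote doesn't say so explicitly) and takes a year as 365 days:

```python
FORCES = 10**9                      # "more than one billion forces"
TIME_STEPS = 10**12                 # "one trillion time steps"
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

total_evaluations = FORCES * TIME_STEPS             # 10**21 force evaluations
rate_needed = total_evaluations / SECONDS_PER_YEAR  # evaluations per second

print(f"{total_evaluations:.0e} evaluations, {rate_needed:.1e} per second for a one-year run")
```

That works out to roughly 3e13 force evaluations per second, sustained for a whole year.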
Re:Can anyone please... by Obfiscator · 2005-03-25 09:18 · Score: 2, Informative

I prefer Monte Carlo techniques with special folding moves, personally, but I don't think either method will be solving this problem anytime soon.

What kind of force field do you want to use? CHARMM or AMBER? Well, it might work. From an unfolded protein, though, some of your atoms will undergo fairly drastic changes in environment. Better use a polarizable model. Oh, crap, there's another order of magnitude in expense, and you still may not get the right answer unless your force field is parameterized in a clever way that may only work properly for one class of proteins. What's the functional form? Lennard-Jones? Exp-6? It better be able to reproduce hydrogen bonding accurately. Do you have the correct bond stretching, bond bending, and torsional potentials? A lot of interactions in those systems, and each set has a very complicated potential energy surface with a lot of minima. It's gonna be tough to find the correct one.
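
For readers who haven't met the two functional forms mentioned, here is a minimal sketch of each in reduced units. The parameter values are placeholders, not taken from CHARMM, AMBER or any real force field; the point is only the shape: Lennard-Jones uses an r^-12 repulsion, exp-6 (Buckingham) uses an exponential repulsion instead, and both share the r^-6 attraction.

```python
import math


def lennard_jones(r: float, epsilon: float, sigma: float) -> float:
    """12-6 Lennard-Jones pair energy: 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def exp6(r: float, a: float, b: float, c: float) -> float:
    """Buckingham (exp-6) pair energy: A*exp(-B*r) - C/r^6."""
    return a * math.exp(-b * r) - c / r ** 6


# Lennard-Jones crosses zero at r = sigma and bottoms out at r = 2^(1/6)*sigma with energy -epsilon.
r_min = 2 ** (1 / 6)
print(lennard_jones(1.0, 1.0, 1.0), lennard_jones(r_min, 1.0, 1.0))
```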

Let's get rid of parameters and use quantum mechanical energy calculations, you say? That cuts down your system size to 64 molecules and maybe 20 ps of trajectory if you use a method like CPMD and density functional theory with no Hartree-Fock exchange...and there is no protein in the system, just water. How long does it take a protein to fold? Nanoseconds? Microseconds? So you can forget about doing a protein like this, and besides, current density functionals give a worse answer for most physical properties of water than cheap empirical models. So create a new functional to reproduce the properties of water? It's being tried, and it could work very nicely. Now you just have to worry about the protein. DFT is nice in that you won't have to worry about assigning charges or intramolecular parameters to the protein, but if it struggles this much with water...well, let's say I don't have much faith in it to nail a protein on the first try. On top of that, I don't think I've even seen a single point calculation done on a full protein with DFT (QM/MM, yes, but not just DFT), much less the thousands that you'd need for a simulation.

Nope, protein folding isn't going to give us any answers for years. Tell you what, though: you can keep working on proteins, and I'll try and get water correct. Water is a difficult enough molecule. Maybe by the time we finally get water right, we'll have the resources we need to do a full protein.

I'm not trying to dissuade you from continuing your research. I'm just trying to be realistic about what we can expect and when we can expect it. Good luck to you.

--
"Nothing shocks me. I'm a scientist." -Indiana Jones