IBM's Blue Gene Runs Continuously At 1 Petaflop



Posted by Zonk on Tuesday June 26, 2007 @05:02AM from the all-i-can-think-is-the-games-you-could-play dept.

An anonymous reader writes "ZDNet is reporting on IBM's claim that the Blue Gene/P will continuously operate at more than 1 petaflop. It is actually capable of 3 quadrillion operations a second, or 3 petaflops. IBM claims that at 1 petaflop, Blue Gene/P is performing more operations than a 1.5-mile-high stack of laptops! 'Like the vast majority of other modern supercomputers, Blue Gene/P is composed of several racks of servers lashed together in clusters for large computing tasks, such as running programs that can graphically simulate worldwide weather patterns. Technologies designed for these computers trickle down into the mainstream while conventional technologies and components are used to cut the costs of building these systems. The chip inside Blue Gene/P consists of four PowerPC 450 cores running at 850 MHz each. A 2x2-foot circuit board containing 32 of the Blue Gene/P chips can churn out 435 billion operations a second. Thirty-two of these boards can be stuffed into a 6-foot-high rack.'"
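The figures quoted in the summary can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation from the numbers in the quote (peak rates, not sustained ones); the constant and function names are made up for illustration.

```python
import math

# Figures taken from the story summary.
CORES_PER_CHIP = 4
CLOCK_HZ = 850e6          # 850 MHz per core
CHIPS_PER_BOARD = 32
BOARD_FLOPS = 435e9       # 435 billion operations a second per board
BOARDS_PER_RACK = 32

# Peak rate of one fully populated rack.
rack_flops = BOARDS_PER_RACK * BOARD_FLOPS

# Implied floating-point operations per core per clock cycle.
flops_per_core_per_cycle = (
    BOARD_FLOPS / (CHIPS_PER_BOARD * CORES_PER_CHIP) / CLOCK_HZ
)

def racks_needed(target_flops):
    """Smallest whole number of racks whose combined peak reaches target_flops."""
    return math.ceil(target_flops / rack_flops)

if __name__ == "__main__":
    print(f"per rack: {rack_flops / 1e12:.2f} teraflops")
    print(f"per core per cycle: {flops_per_core_per_cycle:.2f}")
    print("racks for 1 petaflops:", racks_needed(1e15))
    print("racks for 3 petaflops:", racks_needed(3e15))
```

On these numbers a rack peaks at just under 14 teraflops, each core retires about 4 operations per cycle, and it takes 72 racks to reach 1 petaflops and 216 racks to reach 3.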


Re:I'm ignorant. by pytheron · 2007-06-26 05:14 · Score: 2, Informative

If you have a large dataset or input domain to work on, split it into X chunks and process each chunk on its own CPU. That is why supercomputers are usually most useful for problems with large datasets or input domains.
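A minimal sketch of that split-and-process idea, assuming a toy workload (sum of squares) and using a thread pool as a stand-in for separate CPUs. The function names are hypothetical, and in CPython you would swap in processes or MPI ranks to get real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def sum_of_squares(chunk):
    """The per-chunk work; it needs nothing from any other chunk."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=4):
    """Process each chunk on its own worker, then combine the partial results."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), n_workers=3))
```

The only communication is the final combination of partial sums, which is why this kind of problem scales so easily.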

--
"I am not bound to please thee with my answers" [William Shakespeare]
For those keeping score at home... by Chysn · 2007-06-26 05:16 · Score: 2, Informative

...the next step (10**18) is the "exaflop."

--
--I'm so big, my sig has its own sig.
-- See?
google calculator by Anonymous Coward · 2007-06-26 05:24 · Score: 1, Informative

I wonder if I will ever be able to read slashdot articles without using the google calculator...

1.5 mile = 2.414016 kilometers
2 "foot" = 0.6096 meters
6 feet = 1.8288 meters
How high? by Anonymous Coward · 2007-06-26 05:32 · Score: 4, Informative

Well, the stack of laptops might be tall, but even the 216 racks, at 6 feet each, would stack up to about 1/4 of a mile high.
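For anyone who wants to check the rack-stack figure, here is the arithmetic as a short sketch. It assumes 216 racks of 6 feet each, as in the story and the comment above; the names are illustrative only.

```python
FEET_PER_MILE = 5280
RACK_HEIGHT_FT = 6   # rack height given in the story summary

def stack_height_miles(racks, rack_height_ft=RACK_HEIGHT_FT):
    """Height in miles of `racks` racks stacked on top of each other."""
    return racks * rack_height_ft / FEET_PER_MILE

if __name__ == "__main__":
    print(f"{stack_height_miles(216):.3f} miles")
```

That is 1,296 feet in total, or a little under a quarter of a mile.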
Depends on what you mean by real world. by jd · 2007-06-26 05:36 · Score: 5, Informative

If you include medical imaging, then computed tomography and computational fluid dynamics are heavily dependent on 3D FFTs, which are in turn heavily parallelizable. In extreme cases (raytracing, for example) where there is next to zero communication between nodes, you get linear scaling with the number of nodes for as many nodes as you like. Well, in the case of raytracing, up to the resolution your "camera" works at. On a modern display, you may be talking one million or so distinct originating points at three colours, typically using "bundles" of rays to eliminate effects, which would normally be 64 rays in size. With something like 250 million cores, you could actually generate an animated feature film from raw data files at the time of showing.
How many of these are "real world"? Well, medical and CFD applications are significant, but hardly what you'd call mainstream, and the raytracing may have been used in Titanic on a smaller scale, but IMAX is under no threat at this time.
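The core count in that estimate presumably comes from multiplying out the independent rays. A rough sketch of that arithmetic, using the figures from the comment and hypothetical names:

```python
# Figures from the comment above.
ORIGIN_POINTS = 1_000_000   # "one million or so distinct originating points"
COLOURS = 3
RAYS_PER_BUNDLE = 64

# Every ray can be traced without talking to any other ray.
independent_rays = ORIGIN_POINTS * COLOURS * RAYS_PER_BUNDLE

def rays_per_core(cores):
    """Average number of rays each core must trace for one frame."""
    return independent_rays / cores

if __name__ == "__main__":
    print(f"independent rays per frame: {independent_rays:,}")
    print(f"rays per core on 250 million cores: {rays_per_core(250_000_000):.2f}")
```

With roughly 192 million independent rays per frame, a machine on the order of 250 million cores would have about one ray per core, which is the point where adding nodes stops helping.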

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
1. Re:Depends on what you mean by real world. by jd · 2007-06-26 08:28 · Score: 5, Informative
  Thank you for the compliment. It's equally nice to know that there are active questioners on Slashdot determined to stretch the quality to the limits. In the spirit of providing information, though, I'll add a few links for the perusal and amusement of all. I'm hard on some of the software, but that's not because I could do better. If anything, it's because I have confidence the authors could.
  Let's start with a Slashdotting of NASA...
  
  Scalable Dynamic Chimera Methods for Unsteady Aerodynamics is one of those packages mere mortals like us will have either no use for or will have to just drool over.
  Fully Unstructured Navier-Stokes 3D is a nice Fortran-based CFD, requires some hefty paperwork to obtain, and may need you to use G95 rather than GCC's GFortran, due to compiler bugs.
  OVERFLOW and related CFD software.
  Three Dimensional Multi-block Advanced Grid Generation System is the component that actually lets you do a lot of the necessary grid work for CFDs.
  Viscous Upwind ALgorithm for Complex Flow ANalysis is the hardest of the CFD codes at NASA to obtain, but if you want to work on anything hypersonic, it's the best place to start. Do Not Use hypersonic airflows for CPU cooling.
  
  Astrophysical Thermonuclear Flash Simulator - well, you never know.
  Geant4, for the subatomic nuclear physicist in your life...
  Open Field Operation and Manipulation is a nice open-source CFD package.
  Parallel Basic Local Alignment Search Tool gives you a parallelized search engine for nucleotides and proteins.
  Stanford Exploration Project provides some nice parallel geophysics applications and tools.
  Tachyon Parallel Raytracer is a nice example of what you can do with parallelism and graphics.
  
  Kerrighed is an up-and-coming clustering system for Linux. I saw it demonstrated at SC|05 - and was less than impressed. It needed a lot of work at that point. However, it looks like it has improved a lot since then, and it would be unreasonable to not mention it.
  MOSIX is the second-oldest clustering technology to gain a fan following to rival Star Trek. It's very good, though hard to get if you're not in academia. Arguably for entirely fair reasons.
  OpenMOSIX was originally a fork from MOSIX but is now essentially its own clustering technology. Development is nowhere near the speed I'd like, it does need far more eyes, but is well-known and highly regarded. Moshe Bar is also one of the coolest developers I've encountered.
  
  DAKOTA is a program for profiling parallel applications and should be useful in telling you where you are gaining and losing.
  HPC Toolkit is another toolkit for profiling HPC applications.
  is yet another profiler for parallel software. Between this and the others I've listed, you should have more information than sequential programmers ever get to work with.
  Performance API is a facility used by most of the profiling software to provide an architecture-independent view of performance counters. I have it on good authority that some (now former)
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
2. Re:Depends on what you mean by real world. by Goalie_Ca · 2007-06-26 08:48 · Score: 2, Informative
  
  One of the problems with working on, say, 3D MRI data is that for various reasons the FFT just can't be broken up into chunks of arbitrary sizes. I think at most I've broken a data set up 24 times, but then padding etc. becomes a worry. Also, you have to pretty much avoid all IPC, or Amdahl's law kicks in fast and hard. Ironically, some of the easiest algorithms to break up across several CPUs are things like convolution. The irony is that these are also computed faster on a single CPU than it takes to load and store the file.
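  To put a number on how hard Amdahl's law bites, here is a small sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run that parallelizes and n is the number of processors. The 95% figure below is an assumed example, not a measurement from any MRI code.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Ideal speedup when only parallel_fraction of the work can be spread out."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)

if __name__ == "__main__":
    # Assumed example: 95% parallel work, split 24 ways as in the comment.
    print(f"24 processors: {amdahl_speedup(0.95, 24):.2f}x")
    # The same job on a million processors still cannot beat 1 / 0.05 = 20x.
    print(f"1,000,000 processors: {amdahl_speedup(0.95, 1_000_000):.2f}x")
```

  Even with only 5% of the time spent on serial work or communication, 24 processors deliver a bit over 11x, and no number of processors gets past 20x.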
  
  --
  
  ----
  Go canucks, habs, and sens!
Re:I'm ignorant. by jellomizer · 2007-06-26 05:42 · Score: 2, Informative

Sure you can sort in O(n^(1/2) log n) time, by using a Shear Sort algorithm on a mesh of n processors.
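A sequential sketch of shear sort, for readers who have not met it. It only simulates the phases on a 2D grid; on a real mesh every row (or column) is sorted at the same time by its own processors, which is where the parallel running time comes from. The function names are hypothetical, and the phase count is deliberately generous.

```python
import math

def shear_sort(grid):
    """Return a copy of grid sorted into snake order by alternating row/column sorts."""
    rows, cols = len(grid), len(grid[0])
    g = [list(row) for row in grid]

    def sort_rows():
        # Even rows ascend left to right, odd rows descend (snake order).
        for i in range(rows):
            g[i].sort(reverse=(i % 2 == 1))

    def sort_columns():
        # Every column ascends from top to bottom.
        for j in range(cols):
            column = sorted(g[i][j] for i in range(rows))
            for i in range(rows):
                g[i][j] = column[i]

    # About log2(rows) row/column phases are enough; extra phases are harmless.
    phases = math.ceil(math.log2(rows)) + 1 if rows > 1 else 1
    for _ in range(phases):
        sort_rows()
        sort_columns()
    sort_rows()
    return g

def snake_order(grid):
    """Read the grid row by row, reversing every odd row."""
    out = []
    for i, row in enumerate(grid):
        out.extend(row if i % 2 == 0 else reversed(row))
    return out

if __name__ == "__main__":
    mesh = [[15, 3, 9, 1], [8, 14, 2, 12], [6, 0, 11, 5], [13, 7, 10, 4]]
    print(snake_order(shear_sort(mesh)))
```

The result is sorted when the grid is read in snake order: left to right on even rows, right to left on odd rows.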

--
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
The Dawn of Petaflop Computing! by i_like_spam · 2007-06-26 05:47 · Score: 4, Informative

This announcement is part of the International Supercomputing Conference, which just kicked off today. The new Top500 list will also be announced shortly.

While the new IBM Blue Gene/P system is impressive, I'm more curious to see what sort of new supercomputer Andreas Bechtolsheim of Sun Microsystems has put together.

Here's an interesting quote about Bechtolsheim from the article:
'He's a perfectionist,' said Eric Schmidt, Google's chief executive, who worked with Mr. Bechtolsheim beginning in 1983 at Sun. 'He works 18 hours a day and he's very disciplined. Every computer he has built has been the fastest of its generation.'
1. Re:The Dawn of Petaflop Computing! by flaming-opus · 2007-06-26 08:35 · Score: 4, Informative
  
  It appears that Sun's design is less revolutionary. It's just a bunch of off-the-shelf blade servers strung together with InfiniBand. They use the same cabinets, power supplies, etc. as the regular blade server offerings for non-technical computing. It also runs a regular Linux OS, clustered, rather than a supercomputer-specific OS, as the Blue Gene does. The big differentiator of the Sun system is the massive 3000-port InfiniBand switch. I'm sure it's not actually a single 3000-port switch, but a bunch of small switches packed together, connected over printed circuit boards rather than cables.
  
  Sun's design is affordable, and probably has a pretty decent max performance, and pretty reasonable power/memory per node. However, it's not as exotic as IBM's design. The IBM design has fantastic flops/watt and flops/square-foot performance. However, each node is really wimpy, which forces you to use a LOT of nodes for any problem, which increases the necessary amount of communication. Some problems work really well, others, not so much.
  
  IBM has limited blue gene to a small number of customers, all with fairly large systems. I suspect that's because it's very difficult to port an application to the system, and get good performance.
Re:I'm waiting for the next generation by shaitand · 2007-06-26 06:19 · Score: 2, Informative

Even with the computing power, weather would be impossible to calculate. It isn't because of a lack of understanding either. In order to calculate weather you don't just need to know how weather works, you need to have precise data on every variable across the globe, and these measurements would need to be taken at a resolution that is simply insane. If you had a fast enough machine, it could even catch up with current weather from that point, but your snapshot would have to be exact and all measurements would have to be taken simultaneously.

THAT is what we can't do. Even if we could mount instrumentation in every square meter of the earth AND its atmosphere to get our current status map and we configured the machine to predict the interactions of those currents we would still be lost. Aside from tracking the output of the sun, the weather system would need to account for ocean currents, tides, bonfires and heating systems, volcanoes, body heat, pig sex, etc.

That is right, my friend, every time you pull out and shoot a load on her stomach the weather system would have to take it into account, because the air disturbed might be the first of a chain of complex interactions that leads to a hurricane that devastates Louisiana... again (because there are actually people so ignorant that they are going to rebuild a city in the same bad location).
Re:But are they availble on the market by Anonymous Coward · 2007-06-26 07:01 · Score: 1, Informative

When the previous generation (BG/L) was released, a rack (1024 nodes, 2048 cores) would cost about US$1.5m. Apparently IBM sells them considerably cheaper now, with BG/P around the corner...
It is petaflops not petaflop. by bommai · 2007-06-26 08:08 · Score: 2, Informative

Contrary to what most people think, the singular unit of floating-point speed is not FLOP. It is FLOPS, because FLOPS is not a plural: it stands for Floating Point Operations Per Second. So I chuckle every time I read 1 PETAFLOP. Guys, just turn off your singular/plural alarm and say with me: 1 and only 1 PETAFLOPS.
Re:What about Memory? by Anonymous Coward · 2007-06-26 08:27 · Score: 3, Informative

BG/P will support 2 GB standard for each compute node. A compute node has 4 processor cores. An option for 4 GB of memory is also available. On BG/L, the initial memory configuration at Livermore was 512 MB per compute node, each of which had 2 processor cores. Since 2007, BG/L has offered 1 GB of memory as the standard configuration.
Re:How far behind are desktops from super-computer by flaming-opus · 2007-06-26 08:47 · Score: 4, Informative

A tricky question, but not all that interesting. A fast server processor is within a factor of 4 of the fastest supercomputer processor in the world. That does not mean that you can do equivalent work with the server processor. Among other things, the processing performance (gigaflops) of a CPU is no longer the interesting part of a supercomputer. (It never really was.) Memory bandwidth, interconnect bandwidth and latency, and I/O performance are the more interesting features of supers. 12-year-old Cray processors still have five times the memory bandwidth of modern PC processors, and twenty times the I/O bandwidth.

You'll notice that 98% of the supercomputers sold in the last 10 years use server processors. (Blue Gene actually uses an embedded systems processor, but it's the same idea.) However, in the late 80's, putting 256 processors in a super was cutting edge. In the 90's, a few thousand. Soon you'll see a quarter million cores. So supers are actually getting faster at a higher rate than desktops are, at least by most measures.
Re:But are they availble on the market by Anonymous Coward · 2007-06-26 10:28 · Score: 1, Informative

Basically, the easiest way to do it (speaking as someone who has done it) is MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/). Anyone familiar with C or C++ can use it in a relatively simple manner (it is, after all, just another header file). You can set up a rather simple Beowulf-style cluster and run the environment in a Linux network without much trouble.

There is also OpenMP (more of an extension to C/C++ than just a header, you need pragmas and stuff to use it); I find it easy to fall into race conditions in that library because you really need to think about what you are doing.

Technically, pthreads ought to be able to provide enough functionality to get up and running if your environment acts as a single machine, or even the System.Threading namespace if you have the ability to run managed code. However, you then don't have control over whether your thread gets its own CPU or not (unless it is guaranteed by the OS). In most cases that isn't actually necessary: your algorithm can be written in such a way that it doesn't matter if it is running on an 8-CPU system or a 2^32-CPU system (except that time to completion will vary); the troubles come in with optimizations.

Recently I have been experimenting with simple web services on a server to farm out pieces of the solution in a distributed fashion, attempting a brute force on a salted SHA-1 hash in a database situation where you know the salt:
on server:
    class infoBlock { string hash; string salt; string prefix; bool finished; }
    infoBlock getWorkItem() {
        if (success) return new infoBlock { finished = true };
        else return new infoBlock { hash, salt, next prefix };
    }
    void finish(string password) {
        success = true;
        store plaintext password with hash and salt;
    }
on clients:
    infoBlock wi = getWorkItem();
    while (!wi.finished) {
        resultBlock results = processWorkItem(wi);
        if (results.success) { finish(results.plaintext); }
        wi = getWorkItem();
    }
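The same work-queue pattern can be sketched as runnable Python. This is a toy, not the poster's web service: threads stand in for remote clients, the search space is limited to three lowercase letters, and the hash is assumed to be SHA-1 over salt + password. Every name below (WorkServer, process_work_item, crack) is hypothetical.

```python
import hashlib
import itertools
import string
import threading

ALPHABET = string.ascii_lowercase

def salted_sha1(salt, password):
    """Assumed scheme for this sketch: hex SHA-1 of salt followed by password."""
    return hashlib.sha1((salt + password).encode()).hexdigest()

class WorkServer:
    """Hands out one-letter prefixes as work items until a client reports success."""

    def __init__(self, target_hash, salt):
        self.target_hash = target_hash
        self.salt = salt
        self.plaintext = None
        self._prefixes = iter(ALPHABET)
        self._lock = threading.Lock()

    def get_work_item(self):
        # None means "finished": either solved or no prefixes are left.
        with self._lock:
            if self.plaintext is not None:
                return None
            return next(self._prefixes, None)

    def finish(self, password):
        with self._lock:
            self.plaintext = password

def process_work_item(server, prefix, suffix_len=2):
    """Try every candidate that starts with prefix; return the match or None."""
    for tail in itertools.product(ALPHABET, repeat=suffix_len):
        candidate = prefix + "".join(tail)
        if salted_sha1(server.salt, candidate) == server.target_hash:
            return candidate
    return None

def client_loop(server):
    """Keep pulling work until the server says there is nothing left to do."""
    while (prefix := server.get_work_item()) is not None:
        found = process_work_item(server, prefix)
        if found is not None:
            server.finish(found)

def crack(target_hash, salt, n_clients=4):
    server = WorkServer(target_hash, salt)
    clients = [threading.Thread(target=client_loop, args=(server,))
               for _ in range(n_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    return server.plaintext

if __name__ == "__main__":
    print(crack(salted_sha1("s4lt", "dog"), "s4lt"))
```

Unlike the pseudocode, the sketch also stops handing out work once the prefixes run out, so the clients terminate even when no candidate matches. The number of clients only changes the time to completion, not the logic.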