On the Supercomputer Technology Crisis
scoobrs writes "Experts claim America has been eating our 'supercomputer feed corn' by developing clusters rather than new supercomputer processors and interconnects. Forbes says America is playing catch-up and that the new federal budget items are too little too late. Cray is laying people off due to decreased federal spending and claims lower margin products have forced them to create products based on commodity parts. Red Storm, one of their new Linux-based products, is being delayed to next year."
It's seed corn. Seed, as in, what you don't eat, but save to plant next year.
Kids these days.
With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
a mesh of nodes on a network will do just as well
In some cases.
Unfortunately, some problems are particularly unsuitable for clusters of commercial computers, and really benefit from specialized architectures such as shared memory or vector processors.
A while ago it was decided by the US government to essentially abandon such specializations, and buy COTS. It is certainly cheaper, but not necessarily effective.
Clusters are not good for very chattery parallel processes; a shared memory supercomputer can still do much better for computational fluid dynamics.
Offtopic, my left testicle. You're SUPPOSED to eat feed corn. You save seed corn to plant.
For things like weather forecasting, maybe big vector machines still have an edge, but I suspect that's changing as the weather guys get more experience in using machines with large numbers of micros. This seems to have already occurred, in fact; NCAR appears to have mostly IBM RS6000 and SGI computers these days, with nary a Cray in sight.
The most common term I used to hear in the early 90's was Killer Micros; I think the term dates back to David Bailey in the 80's sometime. If you want more evidence that the death of the supercomputer has been going on for a long time, check out The Dead Supercomputer Society, which lists dozens of failed companies and projects over the years; this page was apparently last updated 6 years ago!
One, Altivec only supports single-precision floats. Big no-no in quite a few scientific areas. Also, to be blunt: GCC sucks. If you want something decent, at least use IBM's compiler.
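To put a rough number on the precision point, here is a minimal sketch, assuming plain Python and using the standard `struct` module to emulate IEEE 754 single-precision rounding. The helper names `to_f32` and `accumulate` are made up for illustration, and this models only the rounding, not Altivec itself.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, n, single):
    """Add `step` to a running total n times.

    When `single` is true, the step and the total are rounded to single
    precision after every addition, as a float-only unit would do.
    """
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)
    return total

n = 1_000_000
exact = n / 10  # the value that n additions of 0.1 should reach
err_single = abs(accumulate(0.1, n, single=True) - exact)
err_double = abs(accumulate(0.1, n, single=False) - exact)
print(err_single, err_double)
```

The single-precision total drifts far more than the double-precision one, which is the kind of error that piles up over the millions of timesteps a long simulation runs.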
Partial differential equations are NOT necessarily highly parallelisable. Linear ones, maybe. But the interesting ones that simulate Da Bombs are all nonlinear elliptic P.D.E.s.
Shameless copy from a news site....
... 512 CPU's is pretty damn studly," Torvalds said in an email interview. "Putting 20 of them in a cluster and making them be programmable as a single machine is pretty hot."
---snip---
Thursday, 29 July, 2004
NASA to build 10,000-processor Linux computer
Robert McMillan, San Francisco
NASA has given the green light to a project that will build the largest ever supercomputer based on Silicon Graphics' 512-processor Altix computers.
Called Project Columbia, the 10,240-processor system will be used by researchers at the Advanced Supercomputing Facility at NASA's Ames Research Center in Moffett Field, California.
Scientists will use Columbia to design equipment, simulate future space missions and model weather patterns. A portion of the US$160 million system will also be made available to other government agencies and educational facilities, says Bill Thigpen, manager of Project Columbia. "We need to look at working with other agencies to provide them with access to this system because it is a unique system," he says.
What makes Project Columbia unique is the size of the multiprocessor Linux systems, or nodes, that it clusters together. It is common for supercomputers to be built of thousands of two-processor nodes, but the Ames system uses SGI's NUMAlink switching technology and ProPack Linux operating system enhancements to connect 512-processor nodes, each of which will have more than 1,000 gigabytes of memory.
"We use a very large single-system image," says Jeff Greenwald, senior director of server product marketing with SGI. "The other guys come with a very thin node cluster, and try to screw them all together."
The Altix nodes will use Intel's Itanium 2 microprocessors, and the entire 20-node system is expected to be fully assembled by year's end, he says.
SGI has used this large-node technology to build a number of smaller Altix systems with between 3,000 and 6,000 processors, but Project Columbia will be the largest to date, Greenwald says.
Columbia's large-node, shared-memory architecture works well for NASA's "tightly coupled" weather and space simulation applications, where a lot of inter-processor communication is required, Thigpen says. "These codes scale very well on this type of architecture."
The downside to the large-node architecture is that if a single processor fails, the entire 512-processor node goes out of service, he says.
The first node of Project Columbia, named Kalpana after Columbia astronaut Kalpana Chawla, was built by Ames researchers last fall. Since then, two nodes have been added, and NASA and SGI will spend the next five months assembling 17 more nodes.
With the next version of SGI's NUMAlink technology, expected in the fall, Project Columbia will be able to share memory across 2,048 processors, Thigpen says.
Linux creator Linus Torvalds applauded the team's success at using Linux in such large nodes. The operating system typically is used in much smaller nodes of 2 to 8 processors.
"Scaling up to
I've been in this field over 25 years, been in public position at a major lab now for 8.
If this were a simple issue, the HPC community would already have completely moved to clusters and never looked back 3 or 4 years ago. But it's not, kiddies.
Want to run a physics projection for more than 1 microsecond? Takes real horsepower that clusters cannot provide even distributed. Just too much damn data. Chem codes that include REAL data for usable time slices? Too slow for clustered memory. Every auto maker in the world (almost) has been whining about the lack of BIG horsepower for a few years now (crash codes and FEA). I could go on forever. Sure, some problems work awesome on clusters, which is why we have them. But definitely not all of them.
The problem is partly diminishing returns, partly the pathetic amount of usable memory on a cluster and its joke of a memory throughput, partly the growth in power of the low end and clustered networking, partly the ridiculously long development cycles involved in High Performance Computing and the low $ returns.
One of the biggest things congress sees is that this country will more than likely NEVER again lead the world in computing power for defense and research.
And that's something we ought to do as the last real Superpower.
The national labs TRIED clusters; they don't get all the jobs done they wanted. (See testimony before Congress, writings in HPC journals, and the last couple RFPs from US gov. labs; heck, every auto maker in the world.) People in HPC _know_ it now, but having let what little there was of the supercomputer industry die out, there isn't much of an industry left to turn to now. It just may be too darned late. HPC hasn't been a money making industry since the early 80s.
Heck, even Intel abandoned their clustered machine they custom built for the government.
Most folks in HPC will readily admit the Top500 is kind of a joke. The HPC-challenge #s are a little more realistic for the tests, but we really do need something that approximates real world applications, not just a 70s CPU benchmark.
For those that think this is a 'Linux wins' issue,
consider that mostly it was fast interconnect networks that allowed clustering, not the OS. Examine the history of clusters and you'll see this is true. Btw, the last few SC companies are already mostly moving to Linux anyway (NEC, Fujitsu, Cray; IBM dabbles in HPC).
Hopefully the industry will survive long enough to allow for even better mergers of supercomputing power with low end cost, but at this point I doubt it. Cray has been on the ropes since 96, Fujitsu's SC division is a loss leader, and NEC has been trying to get out of it for a while for something with a margin.
Ed -gov labs HPC research punk
-former Cray-on
-former CDC type
Several engineering issues making this Ludicrously LSI operation possible have not been solved. One type of logic found in computers is called CMOS, the other called TTL. TTL has another special implementation that allows it to achieve high clock rates.
This special TTL that is used today has reached its peak at .09 microns (90 nanometers) in width. The semiconductors used today do not have enough "gas" (voltage) to power such a large system. Every transistor has a voltage drop within it, and adding more and more transistors only increases heat, power consumption, and probability of failure. If a single part of a huge new chip were to fail, the entire chip would have to be replaced.
Ever wonder why CPU's run hot? They dissipate voltage (and with it current, which is the True Power). At some point, the input voltage on the processor will be so close to the output that measuring HIGH or LOW (1's and 0's) becomes impossible. Until some scientist discovers a brand new material that can overcome current and voltage barriers, clustering will be the norm. Said scientist would be a billionaire if they could patent a material like that.
Researchers are out looking for these new substances, mind you. Until they find one and build a semiconductor company, single computers with low clock speeds and high bandwidth will not be able to tackle what the human imagination has in store for them. There is also a limit involving physics and microwaves that I could get into, but I'm not a physicist. I'll leave that up to them :)
Take for example Deep Crack (luminaries, remember that one?). Perfect example of specialized hardware for a single job.
Wikipedia has an article: http://en.wikipedia.org/wiki/Deep_Crack
Bruce,
Basically any simulation that involves a large number of fully interacting 3 dimensional components over very many timesteps.
Problems that fall into this category:
weather simulation
aerodynamic studies
molecular modelling
(to a lesser extent other fields in computational physics and chemistry)
While these can be done on clusters (and often are), the longest, highest resolution studies are done on big shared memory systems, where one user runs their whole simulation for weeks or months.
The problem is, only a handful of these are really necessary, and only a fraction of these have a compelling need and the resources to afford them:
the weather service (has), NASA and the aerospace industry (has), auto makers (has), academics in chemistry (no such resources, greater patience, cheap student programmers). There aren't many other potential buyers, so it's not surprising that the commercial groups are trying to get the government groups to pitch in more. No one is asking the academics to even try (they're all off trying to make their shared memory codes run on cheap Linux clusters.. it's not as if they'd have had the money to pay for a supercomputing industry anyway).
-one of those computational chemists
A few extra people? Do these few extra people get their own supercomputers?
The problem with clusters is that they don't scale well in all cases. Programming tricks may help, but profiling and testing on a sub cluster may not reveal bottlenecks in the full cluster.
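A minimal sketch of how a sub cluster can hide the bottleneck, assuming a made-up cost model: a fixed serial part, perfectly divisible work, and a communication term that grows linearly with node count. The function names and the constants `work`, `serial` and `latency` are hypothetical, not measurements from any real machine.

```python
def run_time(p, work=1000.0, serial=1.0, latency=0.05):
    """Hypothetical wall-clock time on p nodes.

    serial      : part of the job that cannot be split
    work / p    : the perfectly parallel part
    latency * p : communication cost that grows with node count
    """
    return serial + work / p + latency * p

def efficiency(p):
    """Speedup over one node divided by node count (1.0 is ideal)."""
    return run_time(1) / (p * run_time(p))

for p in (16, 128, 1024):
    print(p, round(run_time(p), 2), round(efficiency(p), 3))
```

Under these assumptions a 16-node test run looks nearly ideal, while the 1024-node run is both inefficient and slower in absolute terms than the 128-node run. The communication term only becomes visible once the node count is large, which is why profiling on the small slice tells you little.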
Uh, Cray have a backlog of orders. A backlog to the tune of $153 million, if I recall correctly.
That's not the sign of a dying business model. If they are having problems, it's down to the management, not lack of demand.
There are problems that don't work well on clusters, but rocket on a proper supercomputer. These include a lot of interesting areas, so there will always be demand for a few pieces of big iron. At the risk of echoing the ghost of IBM CEOs past, I think there is demand for somewhere around 20-30 serious top end supercomputers in the world [0]. Most of the rest of the jobs will do just fine on high end clusters.
If you read the article, there are no quotes from Cray people. What there are quotes from is the people who used to get to play with special hardware, who now admin those clusters.
It's toys for the boys, not a buggy whip issue.
[0] That's informed by being someone who uses high performance computing, both cluster and supercomputer.
Computational fluid dynamics. The most common manifestation of this would be weather models, although it can also be used to model petroleum reservoirs, airplane wings, and the list goes on and on. The system is broken up into little chunks. What's going on in each little chunk depends on what goes on in all the other chunks. And we are talking about millions and millions of chunks.
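A toy illustration of the chunk dependency, assuming a 1D row of cells and a simple three-point averaging sweep rather than any real CFD scheme. The names `sweep` and `cells_touched` are made up for this sketch.

```python
def sweep(u):
    """One relaxation sweep: each interior cell becomes the mean of itself
    and its two neighbours. The two end cells are held fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]

def cells_touched(n_cells, n_steps):
    """Disturb the middle cell, run n_steps sweeps, and count how many
    cells have been influenced (are no longer exactly zero)."""
    u = [0.0] * n_cells
    u[n_cells // 2] = 1.0
    for _ in range(n_steps):
        u = sweep(u)
    return sum(1 for x in u if x != 0.0)

print(cells_touched(101, 10))
print(cells_touched(101, 60))
```

Each sweep only reads immediate neighbours, but the disturbance spreads one cell outward per sweep in each direction, so after enough timesteps every interior cell depends on the original one. Split those cells across machines and every sweep needs neighbour data to cross the network.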
A lot of posters in this article decry commodity hardware for sucking but they're not considering the long term. The tasks commonly performed by commodity hardware are starting to be more and more suited to scientific applications. We can already see how vector optimizations like Altivec and SIMD have worked their way into every desktop chip around.
As 3D rendering and sophisticated media codecs are becoming the primary reasons for upgrading a home PC, the front side bus and CPU (especially the CPU) have gained a disproportionate speed advantage compared to the rest of the PC. And when you start throwing in high speed network components like blade interconnects and 10Gig E, you can create an extremely powerful, scalable computer. Vendors like IBM love this because it means they can sell one design to everyone, just changing the number and specs of the nodes to fit their customer.
Speaking as someone who worked support on a large academic cluster where thriftiness is king: not only is the hardware cheaper, but its mundane specs mean it's less likely to fail. When it does, the nodes are replaceable like lightbulbs, and you can run down to the local computer store in the event of a faulty IDE drive or memory stick. The support costs and reliability are far superior to typical high performance solutions because support can mostly be done in house by existing staff.
And that ComputerWorld article is mostly bunk; OpenMP supports Fortran. The people who are writing large scale simulations in the first place are not only technical but also extremely bright. While it is a hassle, adding networking to the already heavy code optimization that they do is not that big of a deal.
Though there are problems that require a shared memory system, $2 million can buy so many more networked Xeons than shared memory SGI processors that the scientists in question should really consider their needs. While it may take much longer, their problem set can be much, much larger since they'll have more collective RAM and HD space. The system can also easily be partitioned, rebuilt, and shared in such a way that it's always in use by one or several people. In many cases, the tradeoff is worth it. And even then, shared memory systems will probably still have a lower price/performance ratio compared to clustering a smaller number of high memory database servers. Say for example, a group of Itaniums with 64Gig of ram each vs their SGI Origin equivalent in terms of memory footprint.
I think as time goes on, interconnect speed will increase to the point where clusters are very similar to traditional supercomputers. We can already see baby steps in this direction with blade equipment and various motherboards with gigabit NICs built into their own bus to avoid choking in PCI land. Infiniband and 10G over copper are the first major steps, but switching and motherboards still have to catch up, it seems. Instead of whining, the typical supercomputing vendors should be looking into merging or standardizing their designs with those of clustered systems. Of course, I could see how the entrenched traditional vendors would want to legislate as much as possible to avoid having to compete with dozens of smaller engineering companies with less overhead and better ideas.
Now that I think of it, a lot of the posts I've read here are probably skeptical of massively parallel machines because they don't realize that the performance is all in the network. I've seen two clusters first hand with an identical number of identical nodes. One was being used as a compute farm where several hundred dual Xeons were attached to a ridiculously large and overly expensive IBM foundry switch. The other was being used as a massively parallel simulation machine, with dual gigabit NICs attached to a bunch of 'cheap' 24 port switches in a 2D mesh. The second design cost much less because it wasn't built by IBM while still being about twice as fast.
New supercomputing advances on the way will radically redefine the industry.
I refer to a DARPA funded project created to fill in the performance gap between today's inadequate SC technology and tomorrow's (quantum, bio) still far in the future stages.
The project is called HPCS, which does not stand for High Performance Computing System, but rather High Productivity Computing System. The point is not to increase flops but to increase value. The Earth Simulator, for example, is down for maintenance about 2/3 of the time and can only be reliably run in 8 hour chunks. The ASCI series may have high peak performance but averages only 5-10% of that. DARPA knows that if this is the state of the art, then there is work to do.
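The flops-versus-value arithmetic can be sketched as below, assuming delivered throughput is simply peak times sustained fraction times availability. The function `delivered` and the peak figure of 100 are hypothetical; only the 5-10% and 2/3-downtime figures come from the post, and they describe two different machines.

```python
def delivered(peak, sustained_fraction, availability):
    """Rough delivered throughput: peak rate scaled by the fraction of peak
    that real codes sustain and by the fraction of time the machine is up."""
    return peak * sustained_fraction * availability

peak = 100.0  # arbitrary units, for illustration only

# A machine that sustains 5-10% of peak, assumed always available.
low = delivered(peak, 0.05, 1.0)
high = delivered(peak, 0.10, 1.0)

# A machine that is down about 2/3 of the time, assumed to sustain full peak.
mostly_down = delivered(peak, 1.0, 1 / 3)

print(low, high, mostly_down)
```

Either factor alone takes a large bite out of the headline number, which is the gap a productivity-oriented program is aimed at.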
Three companies are currently doing research on proposals and are DARPA-funded through 2006. The three are Sun, IBM, and Cray. Two companies will continue from 2006. A working product will be delivered in 2010.
Many radical new technologies are in place, but as you can understand most of it is tightly under wraps. But do some reading on the DARPA page and you will find some interesting things.
Here is the link
- employee at one of the three aforementioned companies
No, it really is true. With non-linear PDEs the parallelization falls apart. You say, "OK smarty-pants, when modeling the compressible fluid around a car (i.e., the air), how much can the flow around the bumper depend on the flow around the mirror?" The answer is not much, but when you discretize the problem into a fine mesh, there are a gazillion boundary points between the mirror region and the bumper region, so the communication links between parallel processors bog down. It is faster to solve all the points with the same, really fast, processor.
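The "gazillion boundary points" can be counted in a minimal sketch, assuming a cubic mesh cut into equal cubic subdomains, at least two per axis, with every face of a subdomain treated as shared with a neighbour. The function `boundary_fraction` is made up for illustration and ignores the outer edge of the mesh.

```python
def boundary_fraction(cells_per_axis, parts_per_axis):
    """Fraction of one subdomain's cells that lie on its surface.

    Surface cells are the ones whose values must be exchanged with
    neighbouring processors every timestep.
    """
    side = cells_per_axis // parts_per_axis   # cells per axis in one subdomain
    interior = max(side - 2, 0) ** 3          # cells not on any face
    return 1 - interior / side ** 3

# The same 1000^3 mesh split into 10^3 pieces and into 100^3 pieces.
print(boundary_fraction(1000, 10))
print(boundary_fraction(1000, 100))
```

As the pieces shrink, a growing share of each processor's cells has to be shipped over the interconnect each step, so adding processors adds communication faster than it removes computation.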
Actually you're only partially correct. The Alpha processor, despite popular opinion, was headed towards the scrap heap before Compaq bought them. The reality is that the conventional 'supercomputer' markets were falling away from the Alpha because of the promise of the Itanium and Opterons (which were just poking their heads out of their holes at the time and announcing their existence). In addition, benchmarks of the existing Alphas were only marginally better than the Intel Pentiums of the time (pre-Itanium & AMD Opteron) despite the 64 bit processing. The stall wasn't because of Compaq, but because the core development teams for the Alpha were... well, getting old and retiring without suitable replacements. Plus I heard it rumored that Compaq purchased Alpha not for the processor, but for the architecture which it intended to 'reinvent' in a new image. Dunno what happened with that though...