On the Supercomputer Technology Crisis
scoobrs writes "Experts claim America has been eating our 'supercomputer feed corn' by developing clusters rather than new supercomputer processors and interconnects. Forbes says America is playing catch-up and that the new federal budget items are too little too late. Cray is laying people off due to decreased federal spending and claims lower margin products have forced them to create products based on commodity parts. Red Storm, one of their new Linux-based products, is being delayed to next year."
Random Array of Inexpensive Servers.
If the 'supercomputers' of today are increasing performance, does the design really matter?
Maybe that is a signal that monolithic computer tasks are best handled in a hive mentality - have the Queen issue the big orders, have the warriors performing security, have the workers transporting the goodies (data), and have the requisite extra daughters and suitors to grow the hive and assure its viability (redundancy).
The fact that it is cost-effective is even better.
When you can build a top 5 supercomputer for under 6 million dollars using off-the-shelf parts, why spend hundreds of millions of dollars?
Because instead of fundamentally advancing the science of computing, the industry is simply scaling commodity technology. The American supercomputer industry has gone from innovator to an assembly operation.
We're in need of a paradigm shift. Where's the next Seymour Cray?
Jason.
Cray has been engaging in scare tactics about "America being dominated by overseas competitors" for a while. Because they're terrified of losing the lucrative contracts from government and big business, they'll pull out all the stops. They've come up in the IT press a couple of times recently.
Screw 'em. If there's a need, the market will provide. If it turns out that the important tasks can be parallelized and run on much less expensive clusters, then all that means is that we have a more efficient solution to the problem.
May we never see th
Poor little babies, now where will their executives and boardmembers get free money? Will they actually have to do something useful for a living, and for a change? It seems we are only a Darwinian capitalist economy when the little guy gets fucked. When the professional bullshitters at the top get it, it becomes some sort of strategic crisis that requires immediate injections of billions of dollars. Screw them. May Cray rest in peace.
La Supercomputadora is Dead! Viva la Supercomputadora!
If you really want a vector-processor supercomputer you can program in Fortran, get yourself a G5 and gcc. The PPC64 supports SIMD vector processing. For that matter, any problem which benefits from vector processing is trivial to parallelize with threads.
In the age of IP and patents it seems like it is very hard for companies to make major advances [in any field] without some other company crying foul and taking that company to court over patent/IP rights, especially if the alleged infringer is a smaller company (i.e. fewer lawyers). IBM and MS, among others, are filing dozens if not hundreds of patents a day. What we are seeing as an effect is that innovation is being stifled by litigation.
(pat pending)
"Look Lois, the two symbols of the Republican Party: an elephant, and a fat white guy who is threatened by change."
Technology first developed on the high end slowly works its way down into the low end. What happens when the high end is no longer there?
Not that many people really need a race car, but advances in fuels, materials, and engineering in race cars eventually lead to better passenger cars. And for raw performance, strapping together a bunch of Festivas will not get you the same as an Indy racer.
And there's absolutely no reason why you can't put a bunch of FFT chips on a specialized PCI card, mass produce it, and get a bunch of networked FFT enhanced supercomputers.
SJW: a person who perceives an injustice, and while correcting it, commits a greater injustice.
Forbes has been complaining that federal support of advanced computing is too little? If the government over-stimulates an industry that has too small a market, it will just delay the failure.
Of course the government should continue its current policy of funding a few leading-edge machines that are too costly to sell into the general market, but will test new technology. The government itself is a customer, with energy testing, weather modeling, medicine development, etc.
I've seen a lot of naive comments suggesting that supercomputers are being replaced by clusters. The truth is, anyone who can replace their supercomputer with a cluster didn't need a supercomputer in the first place:
- (compared to a supercomputer):
- The prime advantage of an x86-based server is that it is cheap, and it has a fast processor. It is only fast for applications in which the whole dataset resides in memory - and even then, it is still the slowest of the group.
- Clusters are a little better, but suffer from severe scalability problems when driving IO-bound processes. As with the x86 server, if you can't put the full dataset into memory, you might as well forget using a cluster. The node to node throughput is several orders of magnitude slower than the processor bus in multiple CPU systems. (6.4GB/s vs 17MB/s for regular ethernet, or 170MB/s for Gigabit)
- Multiple CPU servers do better, but still lack the massive storage capacity of the mainframe. They work better than clusters for parallel algorithms requiring frequent synchronization, but still suffer from a lack of overall data storage capacity and throughput.
- Mainframes, OTOH, possess relatively modest processors, but the combined effect of having several of them, and the massive IO capability makes them very good for data processing. However, their processors aren't fast at anything, and often run at 1/2 or 1/3 the speed of their desktop counterparts.
- Supercomputers combine the IO throughput of a mainframe with the fast processors typically associated with RISC architectures (if you can still consider anything RISC or CISC nowadays). They have faster processors, more memory, and much greater IO throughput than any other category.
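To put the throughput numbers in the list above in perspective, here is a rough sketch of how long it takes to move a dataset at each quoted rate. The rates are the ones quoted above, taken at face value; the 1 GB dataset size and the function name are assumptions for illustration only, and the model ignores latency and protocol overhead entirely.

```python
# Rates quoted in the list above, in bytes per second (taken at face value).
RATES = {
    "processor bus": 6.4e9,
    "regular ethernet": 17e6,
    "gigabit ethernet": 170e6,
}

def transfer_seconds(num_bytes, rate_bytes_per_s):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return num_bytes / rate_bytes_per_s

DATASET_BYTES = 1e9  # hypothetical 1 GB working set (assumption)

for name, rate in RATES.items():
    print(f"{name:>17}: {transfer_seconds(DATASET_BYTES, rate):8.2f} s")
```

Even in this best case, every pass over data that lives on another node costs seconds rather than a fraction of a second, which is why the "whole dataset in memory" condition matters so much.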
It used to be that the prime reason for faster computers came from the scientific and business communities. But now that the internet has turned computers into glorified televisions, the challenges have gone from that of crunching numbers to serving content. As our economy has shifted away from a technological base to an entertainment one, the need for supercomputers has begun to evaporate. We outsource innovation overseas so that we can lounge around on the couch watching tv and drinking beer (or surfing the net and drinking beer). The primary purpose of technological innovation has shifted from that of discovering the universe to merely bringing us better entertainment.
The society for a thought-free internet welcomes you.
I suppose the counter-argument would go something like this:
It's true that supercomputers aren't really all that useful or necessary these days. However, it may be that a future computing problem shall arise, which requires a next-generation supercomputer to solve. So we'd be well-served to have a next-generation supercomputer fresh from R&D, to apply to the problem.
We may only encounter one or two more supercomputer-class problems, but they might be important ones. We should be prepared.
On the other hand, we may encounter a problem that can only be solved by horses. But we don't see a lot of buggy-whip subsidies these days...
Any sufficiently well-organized community is indistinguishable from Government.
One technology that I work with is called Artificial Life and is basically large evolutionary software simulations. (This is not exactly the same thing as genetic programming, but it's close.) This is an example of something that just plain doesn't cluster well. Try to cluster one of these, and you will max out a gigabit switched LAN in less than a second (I've done it!). I've even maxed out a gigabit "star configuration" LAN with this stuff. It just doesn't cluster.
The problem is that these simulations involve many cells that must interact with each other in real-time. The cluster spends 90% of its time waiting on other nodes no matter how you build the architecture.
There are lots of other problems like this that just don't cluster well.
Clustering only works for problems like protein folding or SETI that divide up into neat "work units" that can be shipped out and then returned. Millions have been spent, along with massive amounts of time by people like myself, and we're still no closer to being able to really cluster applications like this efficiently.
Clusters (talking about "Beowulf" types and such using stuff like Ethernet, no matter the speed) do OK at coarse grained problems. That is, problems where communication is seldom and, when it happens, is typically a chunk of data like a WU (work unit) that the consuming node can go off and compute on for a while before submitting some result or intermediate result. All of the "Distributed" projects like SETI are of this type and are even a subclass of coarse grained problems called Embarrassingly Parallel.
Fine grained problems typically require much more inter-node communication and are also typically much more sensitive to latency. They typically show lots of communication and (not always, but typically) smaller amounts of data per transfer. For example, take a problem where each node sums up an array of 16 integers passed to it and passes this (partial) sum back to the sender for it to use in its calculations. On 100Mb Ethernet, the cost of transferring those 16 integers (latency, packet overhead, and the slowness of the actual transfer) is *huge* compared to the computational time for adding 16 integers together. A "Beowulf" cluster would be ill suited to this type of problem simply because Ethernet at those scales is extremely inefficient. However, there are architectures (some supercomputers) that could do this algorithm quite well.
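To make the 16-integer example concrete, here is a back-of-the-envelope sketch using a simple "latency plus size over bandwidth" model of one message. Every number in it (100 microseconds of latency, 4-byte integers, roughly 1 ns per add) is an assumption chosen for illustration, not a measurement, and the function names are hypothetical.

```python
def message_time(payload_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple latency + size/bandwidth model of one message."""
    return latency_s + payload_bytes / bandwidth_bytes_per_s

def sum_time(n_values, seconds_per_add):
    """Time to add n_values numbers together on one node."""
    return n_values * seconds_per_add

# All numbers below are illustrative assumptions, not measurements.
LATENCY_S = 100e-6        # assumed Ethernet + software-stack latency per message
BANDWIDTH = 100e6 / 8     # 100 Mb/s expressed in bytes per second
PAYLOAD = 16 * 4          # 16 integers, assuming 4 bytes each
SECONDS_PER_ADD = 1e-9    # assumed ~1 ns per integer add

send = message_time(PAYLOAD, LATENCY_S, BANDWIDTH)
reply = message_time(4, LATENCY_S, BANDWIDTH)  # one partial sum comes back
compute = sum_time(16, SECONDS_PER_ADD)

print(f"communication: {(send + reply) * 1e6:.1f} us")
print(f"computation:   {compute * 1e6:.3f} us")
print(f"ratio:         {(send + reply) / compute:.0f}x")
```

Under these assumptions the fixed latency term swamps everything else, so a faster link with the same latency would barely help.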
Another example of this is to think about a kernel (maybe a Linux one) that is multiprocessor. How well would a shared image Linux kernel perform if it were using Ethernet as its interconnect between nodes as opposed to shared memory on a dual/multi CPU motherboard? This is one reason why SGI, Cray, Sun, IBM, and others have developed NUMA architectures - basically the bus between processors is really a switched network that is extremely high bandwidth (GB/s) and extremely low latency (measured in microseconds at worst) in order to run single system images - in order to scale the number of CPUs up in a box that can work on the same problem in parallel.
So, parallel programs are written with a mind to both the algorithms being used to solve the problem and the hardware on which they will run. For example, there are a number of algorithms for solving systems of equations in parallel. Some are more latency critical than others (those requiring less communication are penalized less by high latency networks such as Ethernet). Not all problems have multiple "good" algorithmic solutions where the programmer can pick based on the hardware available. Some problems have no "good" algorithmic solutions that are latency tolerant at all. Those algorithms/problems need low latency networks, and typically the only solutions are supercomputers and very "special"/exotic (read: expensive) networks that supply very low latency interconnects - Myrinet, Giganet, RACE, etc.
Yes... but it depends on the overhead required for those almost-independent pieces to be transferred from one machine to another. There are lots of fine grained problems out there that all the computers in the world tied together with Ethernet couldn't solve as fast as one good supercomputer.
Basically it boils down to:
Compute the cost for communication of one work unit for your algorithm.
Compute the cost for processing that work unit.
If the cost for communication is significant compared to the computational cost (say, 1%) then you probably will have a performance issue.
This is why SETI and the like do well. The cost of communication of one WU is insignificant/irrelevant compared to the computational cost of one WU.
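The rule of thumb above can be written down as a minimal sketch. The function names and the example costs are hypothetical; the 1% threshold is the "say, 1%" figure from the rule itself, and real cases obviously need measured costs rather than guesses.

```python
def comm_fraction(comm_cost, compute_cost):
    """Communication cost as a fraction of computation cost for one work unit."""
    if compute_cost <= 0:
        raise ValueError("compute_cost must be positive")
    return comm_cost / compute_cost

def likely_bottleneck(comm_cost, compute_cost, threshold=0.01):
    """True when communication exceeds the threshold (1% by default)."""
    return comm_fraction(comm_cost, compute_cost) > threshold

# Hypothetical work units, costs in seconds (assumed values).
seti_like = likely_bottleneck(comm_cost=30.0, compute_cost=10 * 3600.0)
fine_grained = likely_bottleneck(comm_cost=200e-6, compute_cost=20e-9)

print("SETI-like WU bottlenecked on communication:", seti_like)
print("fine grained WU bottlenecked on communication:", fine_grained)
```

A work unit that takes seconds to ship and hours to crunch passes easily; one that takes microseconds to ship and nanoseconds to crunch does not.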
However, there are a good number of problems for which nobody has (or at least has yet) thought out algorithms where the cost of communicating one WU over a high latency interconnect like Ethernet is worthwhile given the amount of computation in that one WU. I assure you, if it were easy to make those problems embarrassingly parallel like SETI, someone would have done it already, and if you can figure even some of them out, you'd be very wealthy. If you could figure out a way to do it all automatically and do it well, you'd be rich beyond the dreams of avarice!
Two...I have two words for you!
Seriously, I don't see the problem, so long as companies like IBM and (dare I say it) Microsoft continue to do research in this area. That is the real value of companies that are committed to *real* research in revolutionary sciences and technology.
Of course, US companies don't have a hammerlock on this research. There is a lot of work being done internationally in the area, by corporations, and by educational/research institutions.
---anactofgod---
"Equal opportunity swindling - *that* is the true test of a sustainable democracy."
Oh, and yes, I'm a Linux fanboy, but I was also reading comp.arch (remember Usenet?) back in the days when the Attack of The Killer Micros was starting to kill the minicomputer and mainframe industry ("careful with that Vax, Eugene!") and RISC vs. CISC was still a design issue, so I do have some perspective on the game.
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
And there's no real reason why you can't do that with a cluster as well - architecting both the hardware and the software for the cheapest possible construction.
Reminds me of my early programming days on the TI99/4A. The brilliant bit that made that computer more powerful than most other micros on the market at the time (except maybe the Commodores and Ataris, but neither of those had this as much as the 99er did) was multiple specialized subprocessors. Most others had *maybe* a video processor and a sound processor, but the TI also had a memory manager and an I/O management chip and a speech synthesis chip as well. A good assembly level programmer had his own cluster supercomputer - even if it was only running at 3.44MHz. Better yet, the PEB allowed you to build your own specialized cards for additional tasks - Disk I/O had its own processor, as did Serial and Parallel communications.
We do this today with off the shelf consumer hardware thanks to USB, AGP graphics cards, and separate sound processors on sound cards. To truly take over the Supercomputer market, that's the paradigm we need to get back to.
Here's the problem. On codes which need lots of data interchange, communication speed becomes the bottleneck. I don't know of anyone running a serious fluid dynamics or weather code, which are this kind of data-interchange-limited application, who gets anything near peak performance on "real-world" problems using ASCI machines. Sure, ASCI White (a 10000-node cluster) was billed as a 10-Teraflops supercomputer. Who cares, when you get 10% of peak performance if you're lucky?
NOAA wanted to buy a supercomputer in the mid-90s, for weather and climate simulations. They did the requirements analysis and decided that a Japanese vector supercomputer was what they needed -- nobody in the U.S. made them anymore. Seymour Cray flipped out -- a government organization buying foreign supercomputers? heresy! -- pulled a bunch of strings, and very soon thereafter Japanese supercomputers faced a stiff tariff because the Japanese were "dumping" their product on the U.S. market. Of course, that meant NOAA couldn't get their NEC. They ended up buying some American-made cluster and getting their piss-poor 5% of peak performance.
Well, two years ago, Japan brought Earth Simulator online. It's a cluster of 5000 vector processors; it boasted 30 Teraflops peak performance, which was 3 times as fast as the then-current number one machine, ASCI White. And a group from NOAA went over to Japan on invitation to check the machine out. They spent on the order of a week adapting some of their current codes to the ES architecture and fired them up. And got 66% of peak performance right off the bat.
How'd that happen? Well, ES cost on the order of $100 million. (By the way, as a rule, if your 'supercomputer' cost less than $10 million, it's not really a supercomputer.) Of that, about $50 million went into developing the processor interconnect -- it's a 5000-way(!) crossbar, for you EE types. With an interconnect that big and fast, the communication bottleneck which dooms the big physics codes suddenly disappears.
So, yeah, the U.S. supercomputer market ate its own seed corn. To see Earth Simulator jump to the top of the Top 500 was something of a slap in the face; to see it get 20 Teraflops on real-world problems was a terrible blow to the prestige of the U.S. supercomputing community. And not one we're going to easily recover from.
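For anyone who wants the arithmetic spelled out, sustained performance is just peak times the fraction of peak you actually get. The sketch below uses only the peak and efficiency figures quoted in the comment above (not independently checked); the function and variable names are made up for illustration.

```python
def sustained_tflops(peak_tflops, fraction_of_peak):
    """Sustained throughput on a real code, given peak and achieved efficiency."""
    return peak_tflops * fraction_of_peak

# (peak Teraflops, fraction of peak) as quoted in the comment above.
machines = {
    "ASCI White": (10.0, 0.10),       # "10% of peak performance if you're lucky"
    "Earth Simulator": (30.0, 0.66),  # "66% of peak performance right off the bat"
}

for name, (peak, fraction) in machines.items():
    print(f"{name:>15}: {sustained_tflops(peak, fraction):5.1f} Tflops sustained")

ratio = sustained_tflops(30.0, 0.66) / sustained_tflops(10.0, 0.10)
print(f"sustained ratio: {ratio:.1f}x, versus 3x on peak numbers alone")
```

That is where the "20 Teraflops on real-world problems" figure comes from, and why comparing peak numbers understates the gap.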
"The urge to fly from modern systems, instead of moving through them to even greater, fairer things is, I think, an indi
Yes, but they work ON supercomputers, they don't build or design them... Do they?
I was talking about a center whose purpose was the creation of ever-more-powerful supercomputers. The rental section would just be there to make use of the tech, and put it through its paces.
Farewell! It's been a fine buncha years!
General Relativity is nonlinear; you cannot use superposition to calculate the field of each object at a point and simply add up the results like you can with Newtonian gravity.
How about high resolution physical simulation (whether that be climate modeling or plasmas)? One great thing about a supercomputer is that you can hold so much data at once in the active set. One node can only hold so much data, so the full simulation has to be distributed across the whole supercomputer. It is definitely not RC5 keys, the opposing end of the spectrum in this data/compute tradeoff.
Start Running Better Polls
How about a compromise like SGI's NUMA systems? Seems like a reasonable compromise between fast memory & lots of CPUs when memory access is important.
my sig's at the bottom of the page.
A long time ago, at Pixar, I got an ARPA grant to work on an image-processing application for the feature film industry. The purpose of the grant was economic and military at the same time. I was to help create a market for multiprocessor computers (not really supercomputers) so that there would be U.S. manufacturers of them when/if the Army needed them for military purposes. This is what often gets called corporate welfare, although I could see the defense purpose was valid. I don't know if ARPA still does this sort of grant. To do one would require an application that is interesting to more than just the folks on your list. And these days visual effects is much more of a solved problem.
Thanks
Bruce
Bruce Perens.
Even if the application can not be parallelized, my experience working in these environments is that most times the same group needs to run the same application many times with different parameters. Putting the application on slower machines, but running different cases simultaneously also favors clusters.
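A minimal sketch of that "same application, many parameter sets" pattern follows. The `simulate` function is a hypothetical stand-in for one serial run, and the thread pool only stands in for cluster nodes; on a real cluster the cases would go through a batch scheduler or similar, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(param):
    """Hypothetical stand-in for one serial run of the application."""
    total = 0.0
    for step in range(1000):
        total += param * step
    return total

def run_sweep(params, workers=4):
    """Run independent cases side by side; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, params))

cases = [0.5, 1.0, 2.0, 4.0]  # assumed parameter values for illustration
results = run_sweep(cases)

for param, result in zip(cases, results):
    print(f"param={param}: result={result}")
```

No case ever talks to another, so the interconnect does not matter at all; each individual run is no faster, but the whole batch finishes sooner.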
It's very rare that high margin items ever maintain or increase their high margins. Anyone writing/using large resource intensive applications needs to plan for these types of things. Crays are wonderful machines, but if there is a better solution people should go with it.
It's also interesting to note that mainframes are still high margin and profitable for IBM. All the competition went away, and there is lots of legacy code out there.
The difficult large-scale problems have "chunky" parallel solutions - each chunk is parallel, but some chunks take longer than others (it's difficult to know how long beforehand) and the overhead of scheduling and balancing all those chunks begins to dominate the actual computation. Combining large, arbitrary, sparse matrices would do it - some multiplications will result in very little work, while some will have lots of collisions and require a lot of work.
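Here is a toy sketch of why uneven chunks hurt. It compares a fixed round-robin assignment against handing each chunk to whichever worker is free first. The chunk costs and worker count are invented, and the dynamic version assumes scheduling is free, which is exactly the overhead the comment above says starts to dominate in practice.

```python
import heapq

def static_makespan(costs, workers):
    """Round-robin assignment decided before any work starts."""
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)

def dynamic_makespan(costs, workers):
    """Each chunk goes to whichever worker becomes free first (zero overhead assumed)."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for cost in costs:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)

# Hypothetical chunk costs: mostly cheap, with a few expensive ones mixed in.
chunk_costs = [1, 1, 1, 50, 1, 1, 1, 40, 1, 1, 1, 30]

print("static :", static_makespan(chunk_costs, workers=4))
print("dynamic:", dynamic_makespan(chunk_costs, workers=4))
print("ideal  :", sum(chunk_costs) / 4)
```

In this made-up case the static split lands all three expensive chunks on one worker. The dynamic version does better but still cannot beat the single longest chunk, and that is before paying anything for the balancing itself.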
A witty [sig] proves nothing. --Voltaire
If you read the papers from the recent OLS (Ottawa Linux Symposium) you'll see that SGI is running (specially tuned) Linux images on 64, 128, 256, and in 2 cases 512 CPUs. Reading the paper gives an interesting view into the problems of running kernels and OSes on such huge NUMA machines.
http://www.finux.org/proceedings/