Cray CTO Says Cray Computers Are Great
Jan Stafford writes "Linux clusters cannot offer the same price-performance as supercomputers, according to Paul Terry, chief technology officer of Burnaby, British Columbia-based Cray Canada. In this interview, Terry explains that assertion and describes Cray's new Linux-based XD1 system, which will be priced competitively with other types of high-end Linux clusters."
You really shouldn't place commentary in a story title, unless it's an "it's funny, laugh" one.
Oh, by the way, everyone who has a Slashdot account should go to their preferences and set the "light" layout. You won't suffer with the bad color schemes anymore, and the results are more printer-friendly too.
There are some limitations to clusters that "supercomputers" don't have. Even if your network were exactly as fast as the internal bus of one of the Cray supercomputers (which I highly doubt it is), you still have a logical layer on top of it (TCP/IP/UDP etc). This slows it down.
For some applications, a cluster of slow PCs is OK. But if you want to do really time-intensive computation, you can't beat a good internal bus.
Moderation: Put your hand inside the puppet head!
However it spawned a popular story about how "Cray designs on Apple and Apple designs on Cray" (see link.)
And now for the REST of the story:
Did you know that Macintoshes are designed on PCs!? That's right--PCs running WINDOWS. You see, nobody makes software that runs on MacOS to burn EPROMs or design printed circuit boards, so the hardware group has a bunch of Windows PCs!
So now you know the *rest* of the story!
Best Buy can have you arrested
No, the inventors of big supercomputers (couple million dollars a pop) are definitely scared of clustering.
If you want a Cray supercomputer, you have to buy it from Cray. If you want a Linux cluster, you can buy it (or build it) from anyone.
I'm sure there are applications for a supercomputer, but I see universities, production studios (Pixar!), and research labs moving toward clusters. The supercomputer companies will do anything it takes to either stop that from happening or gain a share of that market.
I'm not sure you do either.
A NUMA machine is just a cluster where the wire is in the form of a bus rather than copper or fibre cabling. The communications protocol for the bus may be better optimized for "supercomputing". However, you can do the same thing with an MPP-optimized network protocol.
It's all ultimately just wires and protocols.
The total lack of process migration between nodes in a cluster might actually give clusters an edge over some NUMA implementations.
Watching a single process dance around a number of bricks in a Sun 15K can be rather entertaining.
A Pirate and a Puritan look the same on a balance sheet.
And in doing so you are essentially building a supercomputer. However, you'd have to keep in mind that it isn't all about total bandwidth: latency also needs to be extremely low. That said, HP is working on open source Single System Image clustering support for Linux on "normal" hardware.
http://www.sec.gov/Archives/edgar/data/949158/000
Here they discuss the limitations of clusters and vector-based supercomputing.
Basically, they offer three types of supercomputers aimed at different markets: vector, massively parallel, and multithreaded. Not really sure what multithreaded means in this context (a microkernel capable of threading itself across many processors, i.e. UNICOS/mk?), but they do a decent job of explaining the whole thing:
LedgerSMB: Open source Accounting/ERP
From Cray (From XD1 page):
"A 96 GB per second, nonblocking, crossbar switching fabric in each chassis provides four 2 GB per second links to each two-way SMP and twenty-four 2 GB per second interchassis links."
-So for a dual-Opteron XD1 processor unit, there is 8 GB per second of total bandwidth available.
Total aggregate PCI bandwidths (Accepted standards):
PCI32 33MHz = 133MB/s
PCI32 66MHz = 266MB/s
PCI64 33MHz = 266MB/s
PCI64 66MHz = 533MB/s
PCI-X 133MHz = 1066MB/s
PCI Express = 200MB/s (Per slot)
PCI Express x16 = 3000MB/s (Usable bandwidth)
-So for PCI Express x16 we're talking 3GB/second
An SMP Opteron with two PCI Express x16 slots can do 6GB/second aggregate bandwidth. A couple of InfiniBand links can easily saturate that. I'm sure this all costs quite a bit less than Cray's proprietary stuff.
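The arithmetic behind that comparison can be checked in a few lines of Python. This is only a sketch using the figures quoted in this post (four 2 GB/s links per two-way SMP on the XD1, roughly 3 GB/s usable per PCI Express x16 slot); the function name is made up for illustration.

```python
def aggregate_bandwidth_gb_s(links, gb_per_s_per_link):
    """Aggregate bandwidth in GB/s for a number of identical links."""
    return links * gb_per_s_per_link


# Figures taken from the post above, not measured.
xd1_node = aggregate_bandwidth_gb_s(4, 2.0)      # four 2 GB/s links per two-way SMP
opteron_pcie = aggregate_bandwidth_gb_s(2, 3.0)  # two x16 slots, ~3 GB/s usable each

print(f"XD1 two-way SMP:      {xd1_node:.1f} GB/s")
print(f"Opteron, 2x PCIe x16: {opteron_pcie:.1f} GB/s")
print(f"Ratio:                {xd1_node / opteron_pcie:.2f}x")
```

Note that this only compares peak bandwidth. It says nothing about latency, which other posts in this thread argue matters just as much.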
My Other Computer Is A Data General Nova III.
Not quite true. First off, you get much higher bandwidth between processors using proprietary (NUMA) based interconnects than you can with commodity hardware. Why? Because you can optimize for your situation. Second, you can exploit things like cache-coherency between processors (even if they're in different "nodes") and therefore true shared memory. So, a 1024 processor SGI Altix, or a 256 processor Cray is one computer as far as the OS and user-land stuff is concerned.
There's another advantage Cray has on the SV and X series and that's a vector unit on the processor. That allows you to conduct operations on arrays of numbers at once instead of having to cycle through the numbers in a loop. For example, the dot_product between two small arrays might be accomplished with one or two instructions, as opposed to a loop. Apple's AltiVec is also a vector unit.
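To make the loop-versus-vector difference concrete, here is a rough Python sketch. Python itself does not issue vector instructions, so this only mimics the shape of the computation: the second function handles a chunk of elements per step, with each chunk standing in for one vector multiply plus a reduction. The width of 4 and both function names are assumptions for illustration, not anything from Cray or AltiVec documentation.

```python
VECTOR_LENGTH = 4  # hypothetical vector register width, chosen for illustration


def dot_scalar(a, b):
    """One multiply-add per loop iteration, as a scalar CPU would do it."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_strip_mined(a, b, width=VECTOR_LENGTH):
    """Handle `width` elements per step, imitating vector hardware."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    total = 0.0
    for start in range(0, len(a), width):
        va = a[start:start + width]                  # "load" one vector register
        vb = b[start:start + width]
        products = [x * y for x, y in zip(va, vb)]   # stands in for one vector multiply
        total += sum(products)                       # stands in for one reduction
    return total


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 2.0, 2.0, 2.0, 2.0]
print(dot_scalar(a, b), dot_strip_mined(a, b))
```

For a small array that fits in one or two chunks, the second version takes one or two steps instead of one per element, which is the point being made above.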
If you took money out of the picture it would be easier to deal with a big-honkin' super computer like an SGI or Cray rather than a cluster. One computer is easier to manage and you could always use threads and plain old heap memory (which is much faster than message passing over a network).
Add money back in and $500,000 goes a lot farther in raw compute power when you're buying racks of Dells and InfiniBand interconnects. However, depending on the application, you may be faster, slower, or even dog-slow compared to the Cray. If you need the answer today, and the $ is not a factor, go to Cray or SGI with a blank check. If you have to balance cost and time, then a cluster might be better.
Essentially, it boils down to how much communication you do between nodes. Cray does it orders of magnitude faster than off-the-shelf stuff. If you hardly ever pass messages between nodes, clusters are fast. If you have to pass a lot of messages between nodes, one big computer will trounce lots of little ones.
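A toy cost model shows the shape of that tradeoff. Everything here is an assumption for illustration: the function name, the idea that compute divides perfectly across nodes, the assumption that each node's messages are sent one after another, and all of the latency and bandwidth numbers, which are made-up round figures rather than measurements of any real interconnect.

```python
def run_time_s(compute_s, nodes, messages, latency_s, bytes_per_msg, bandwidth_bytes_s):
    """Estimated wall time: evenly divided compute plus per-node messaging cost."""
    compute = compute_s / nodes
    comm = messages * (latency_s + bytes_per_msg / bandwidth_bytes_s)
    return compute + comm


# Hypothetical interconnects (invented numbers, not vendor specs).
COMMODITY = {"latency_s": 50e-6, "bandwidth_bytes_s": 100e6}
FAST = {"latency_s": 2e-6, "bandwidth_bytes_s": 2e9}

JOB = {"compute_s": 1000.0, "nodes": 64, "bytes_per_msg": 1024}

# A job that hardly ever passes messages vs. one that passes a lot of them.
quiet_commodity = run_time_s(messages=1_000, **JOB, **COMMODITY)
quiet_fast = run_time_s(messages=1_000, **JOB, **FAST)
chatty_commodity = run_time_s(messages=10_000_000, **JOB, **COMMODITY)
chatty_fast = run_time_s(messages=10_000_000, **JOB, **FAST)

print(f"quiet job:  commodity {quiet_commodity:.1f}s, fast {quiet_fast:.1f}s")
print(f"chatty job: commodity {chatty_commodity:.1f}s, fast {chatty_fast:.1f}s")
```

With these invented numbers the quiet job runs in about the same time on either interconnect, while the chatty job is dominated by messaging on the slower one. Real applications overlap communication with computation, so treat this as a back-of-the-envelope picture only.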
Leave the gun, take the cannoli -- Clemenza, The Godfather
I work with it a lot, with around 3000 customers, and almost half of them are industry (not academic or government).
You found bugs? Care to share them? Hardware failed? Did you get it replaced?
Can you give me the tech support ticket numbers so I can see if your complaints are reasonable (and have been addressed) or are just plain FUD?
Back in the mid-80s, my department had a huge VAX 780 with 4 MB of RAM (16KB chips, I think), and we were working on a network simulation system that needed 12-14 MB RAM to run. I spent a while playing with different versions of 4.1BSD and Unix System VR2, but fundamentally the machine spent all its time swapping data in and out of disk. The main performance win was helping the physics jocks who wrote the application get better algorithms, better localization, and good checkpointing, because the computer didn't always stay running for the full week it took to finish a simulation run. A year or two later, we got the budget to buy another 4MB of RAM (in 64KB chips, about $50K IIRC), which helped a bit. A year or two after that, we got enough budget to buy another 8MB of RAM (maybe 256KB chips? not sure. Also about $50K), and suddenly the application could complete in under an hour instead of a week. RAM really is a couple orders of magnitude faster than disk drives, with a couple more orders of magnitude less latency, so our problem changed from being disk-bound to being CPU-bound.
That speedup not only improved the utilization of the equipment, it made a qualitative difference in the kinds of problems we could address because of the way we could interact with it. That's why people buy supercomputers if they need them - it really can be orders of magnitude faster for some problems. The first year or so, we really had all the RAM that could fit in the double-refrigerator-sized VAX cabinet. Once the denser RAM chips became available, we probably should have spent a bit more manager time beating up on the accounting department, because an extra $50K for hardware could have more than doubled the efficiency of 3-4 physicists, but of course the accounting droids don't think in terms of efficient use of physicists unless it lets you buy half as many of them, which was _not_ the objective here...
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks