Cray CTO: Linux clusters don't play in HPC
jagger writes "Linux clustering was touted as the next big thing by many vendors last week at ClusterWorld Conference & Expo 2004. But supercomputer vendor Cray Inc. scoffed at the notion of putting Linux clusters in the high-performance computing (HPC) category. "Despite assertions made by Linux vendors, a Linux cluster is not a high performance computer," said Dr. Paul Terry, CTO of Cray Canada."
It's not the vendors who are claiming that Linux clusters are real supercomputers, it's the people who are using them to do real supercomputer work. They sell themselves based on actual price and performance.
Methinks Cray is feeling a little threatened...
While I certainly disagree that you can't build a very high performance computer out of a cluster of computers (Linux or otherwise), there is a lot of merit to the point that clusters just don't scale well for certain classes of applications. Hence the renaissance of the vector supercomputer (à la the Earth Simulator).
Obviously, this guy is plugging the new Cray X1 architecture, which really is quite promising. For instance, check out this paper by some folks at Oak Ridge National Lab that appeared in Supercomputing 2003.
Of course, since this is Slashdot, I expect that there will be a deluge of posts decrying everything about the new Cray machine because it commits the cardinal sin of NOT USING LINUX. Oh, the horror!
OctigaBay does (did) in fact make Linux solutions. However, it is not a cluster. It's a more traditional supercomputer, although it does use low-cost AMD processors.
Cray isn't anti-Linux per se, just anti-cluster.
Somehow I wouldn't be surprised if the next step were Cray-marketed cluster nodes with a proprietary high-speed interconnect. (If you can't beat them, join them.)
The reason that Cray only holds 19th right now is that they have only deployed X1 systems using up to 256 nodes. When the number of nodes is increased, you will certainly see Cray moving up the Top500 list -- the architecture is VERY scalable.
No supercomputer (cluster or traditional) is going to work well if your app can't multi-thread, as none of them derive their power from a small number of super powerful CPUs. For that you want something more like a traditional mainframe (and guess what, many banks still use them). The real difference between the Cray model and the cluster model is shared vs separate memory. The question becomes "can your application be broken down into small chunks which are entirely self-contained?" Rendering a movie works well because each frame, or even portions of frames, can be rendered entirely apart from the others. However, doing analysis over massive data sets (e.g. data mining) will benefit from multiple threads being able to share one huge memory pool. So Pixar use a cluster and the NSA use a Cray. Right tools for the job.
---- One button is the power button, the other is the Bender voice button "Bite My Shiny Metal Ass"
Cray wants all forms of clusters disqualified from that list for not being true supercomputers, since they can only run tasks that support multithreading at their full speed. If given a logical task that must be processed serially, the cluster will end up dropping to the speed of its fastest processor. Sure, the rest of the cluster is available to consider other questions... but the point is, it's going to waste some cycles while a true supercomputer would be able to dedicate its entire resources to the task.
Every task has a maximum number of threads it can be broken into, beyond which adding another parallel thread just won't make it any faster. For some, that number is in the stratosphere and doesn't have to be worried about. However, for others, that number is in the single digits. Those tasks aren't going to be helped much by a cluster that exceeds that number of processors.
The how to from way back in the day:
http://www.ibiblio.org/pub/Linux/docs/HOWTO/other-formats/html_single/Beowulf-HOWTO.html
It has a great explanation using a grocery store analogy that makes it really easy to understand what kind of tasks will work well and what kind will suck. And unlike the cheerleaders that have been showing up since clusters became a big business, it is very balanced about it.
Still worth reading.
Cypherpunks: Civil Liberty Through Complex Mathematics. Those who live by the sword die by the arrow.
There isn't a Cray system that can touch the brute parallel power of a big cluster like Virginia Tech's G5s. But depending on the kind of problem you're working on, there are Cray systems that would walk all over that G5 cluster.
With problems that can be split up into hundreds or thousands of more-or-less independent subtasks, a cluster is the way to go. But for problems that can't be divided up like that, a smaller system with a few very tightly coupled extremely fast vector processors, like what Cray specializes in, is what you need.
There are certainly plenty of HPC problems that aren't well suited for large clusters, but it sounds like the Cray guy might have been significantly overstating his point.
"The worst tyrannies were the ones where a governance required its own logic on every embedded node." - Vernor Vinge
The Cray XD1 System operating system is Linux
Cray could easily be at or close to the top of the Top500 list; their X1 architecture will extend that far. However, for a lot of really important supercomputing codes, it's no contest: the Cray will trounce the clusters (Linux or otherwise). Those #19 Crays are only 256 processors. To get similar performance a stack of Xeons requires thousands of processors. Some tasks just can't be split apart that easily.
A Cray processor has eight floating-point units running at 800MHz. The Big Mac cluster (for example) uses G5 processors which have 2 FPUs at 2000MHz. Thus the Cray has a ~60% advantage in raw FPU throughput (8 x 800 = 6400 vs 2 x 2000 = 4000). However, the G5 processor has ~4GB/s memory bandwidth. The Cray has ~50GB/s memory bandwidth. If you have a problem that needs to do a HUGE amount of math on a tiny amount of data, the G5 will rock. If you have a problem that needs to do a HUGE amount of math on a GINORMOUS amount of data, buy the Cray (for a GINORMOUS amount of money too).
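To put rough numbers on that, here is a minimal Python sketch using only the figures quoted in the comment. It assumes one floating-point operation per FPU per clock and that a job is limited by whichever is slower, the math or the memory traffic. The `Processor` class, `estimated_seconds`, and the workload sizes are made up for illustration and are not vendor specs or benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    fpus: int            # floating-point units per processor
    clock_ghz: float     # clock speed in GHz
    mem_bw_gb_s: float   # memory bandwidth in GB/s

    @property
    def peak_gflops(self) -> float:
        # Assumption: one floating-point op per FPU per cycle.
        return self.fpus * self.clock_ghz

    @property
    def bytes_per_flop(self) -> float:
        # How many bytes memory can deliver for each op at peak rate.
        return self.mem_bw_gb_s / self.peak_gflops


def estimated_seconds(p: Processor, gflop: float, gbytes: float) -> float:
    """Crude bound: the job takes as long as the slower of
    doing the math and streaming the data from memory."""
    return max(gflop / p.peak_gflops, gbytes / p.mem_bw_gb_s)


# Figures as quoted in the comment above.
cray = Processor("Cray X1", fpus=8, clock_ghz=0.8, mem_bw_gb_s=50.0)
g5 = Processor("G5", fpus=2, clock_ghz=2.0, mem_bw_gb_s=4.0)

if __name__ == "__main__":
    for p in (cray, g5):
        print(p.name, p.peak_gflops, "GFLOPS peak,",
              round(p.bytes_per_flop, 2), "bytes/flop")
    # Hypothetical workloads: same math, very different data volume.
    for label, gflop, gbytes in (("small data", 100.0, 1.0),
                                 ("huge data", 100.0, 1000.0)):
        for p in (cray, g5):
            print(label, p.name, estimated_seconds(p, gflop, gbytes), "s")
```

On these assumptions the peak ratio is 1.6x in the Cray's favor, but once the job has to stream a lot of data the gap is set by the bandwidth ratio instead, which is the commenter's point.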
Similarly, InfiniBand (à la the Big Mac) is really hot in the cluster interconnect space because it gives 2.5GB/s per node. The Cray gives you 51GB/s.
You need to move a little data, buy a cluster. You need to move a lot of data, buy the Cray.
There's no one solution for all problems.
True parallel programming on a computer with more than 16 or so CPUs takes a slightly different mindset than the way most people program. In parallel programming you can write a sort routine that runs in O(log(x)) given enough processors, while on a one-processor system a comparison sort can only do it in O(x log(x)). Most programs today that are threaded tend to run a bunch of code on one processor with its own memory. That is much the way that Linux clusters work: programs are written to minimize the amount of communication needed so that they deliver high performance. But Crays and similar supercomputers allow all the processors to communicate with each other and with the shared memory a lot faster, which makes some algorithms run orders of magnitude faster.
An example: when I took a course in parallel processing we used a MasPar system which had 1024 processors in a grid formation. Working on that system I was able to sort a list of a million random numbers way faster than my dual-processor PC could.
But on the flip side, when I ran a program on the MasPar that wasn't designed for parallel processing (emacs), it took upwards of 3 minutes to load due to the age of the computer, while my PC could open up emacs in a split second. So a Cray that may not be the fastest machine could still beat even the fastest cluster in the world on many applications, because of its faster bus communication.
If something is so important that you feel the need to post it on the internet... It probably isn't that important.
Given a problem that doesn't scale well in a cluster environment, throwing more nodes at it will not help significantly. In that case, the cluster will run slower than the Cray at an equal cost. When you're paying several researchers $100k/year each, the Cray is probably the better solution for problems which are not easily parallelizable.
It was announced that VA Tech actually purchased the G5 Xserves before production was in place, but were instead delivered the G5 towers as loaners to have the cluster built in time for ranking.
The cluster remains; they have not shut it down and were swapping out individual racks for the upgrade (something like one rack of Xserves is three racks of towers).
I don't think it's been published whether they have or haven't run any data besides benchmarks.
Post: Sigged, for your pleasure.
For the love of Christ people, it's a simple thing.
Format links like this: <a href="http://somelink">link text</a>
It takes virtually no extra time and we don't have to trim the fucking slashcode spaces.
Oh, and here's the link.
LOAD "SIG",8,1
LOADING...
READY.
RUN
I'm seeing a lot of single-threaded versus multi-threaded arguments.
That's great and all, but for a single-threaded application a Cray isn't even going to smash your modern top-of-the-line home PC by too terribly much.
Crays are massive SMP systems; they need a multi-threaded app to take advantage just as much as a cluster does. The difference is in the bus speed. A Cray has a much faster bus, and with equivalent processing and memory it will excel with a large number of small, quickly terminated threads, whereas a cluster will do as well or better with larger, more processor-consuming threads.
Why would a cluster ever do better? Simple: although a cluster has a drastically slower bus, each processor has its own local memory, so there is much less congestion on the bus. And if you're shelling out for a cluster you will be using switches rather than hubs, so whatever you do will run almost without collisions. Since each node has its own RAM, there isn't much contention for the bus, and memory throughput is much greater.
So like I said, it's all about how fast threads spawn and terminate. If you're rapid-firing threads then you will be doing a lot of communicating between nodes over the slow bus (the network). If you're sending good-sized chunks of data to do something and keeping your nodes busy, they will spend more time working and less time communicating results, and your cluster will tromp all over that Cray.
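The granularity argument can be sketched with a toy cost model in Python. Everything here is an assumption for illustration: `run_time` is a hypothetical helper, every task pays one fixed communication overhead, and the overhead values (1 ms for a network, 1 microsecond for a fast bus) are placeholders and not measurements. Price and memory contention are not modeled at all.

```python
def run_time(total_work_s: float, n_tasks: int, n_nodes: int,
             per_task_overhead_s: float) -> float:
    """Estimated wall time when total_work_s seconds of computation are
    cut into n_tasks equal tasks spread evenly over n_nodes, and each
    task pays a fixed overhead to ship its input and result."""
    tasks_per_node = -(-n_tasks // n_nodes)      # ceiling division
    compute_per_task = total_work_s / n_tasks
    return tasks_per_node * (compute_per_task + per_task_overhead_s)


# Placeholder overheads, assumed for illustration only.
NETWORK_OVERHEAD_S = 1e-3    # slow interconnect (cluster)
FAST_BUS_OVERHEAD_S = 1e-6   # fast shared-memory bus

if __name__ == "__main__":
    work, nodes = 1000.0, 100
    for label, tasks in (("coarse tasks", 100), ("fine tasks", 10_000_000)):
        slow = run_time(work, tasks, nodes, NETWORK_OVERHEAD_S)
        fast = run_time(work, tasks, nodes, FAST_BUS_OVERHEAD_S)
        print(f"{label}: network {slow:.3f} s, fast bus {fast:.3f} s")
```

With a few big tasks the interconnect barely matters, so the cheaper hardware wins on cost. With millions of tiny tasks the per-task overhead dominates on the slow link, which is where the fast bus pays for itself.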
Actually, that crossbar memory bus is just the local bus for each cabinet, and they do have low-latency interconnects that allow globally shared memory and single system imaging. Otherwise they wouldn't be working on a 1024 CPU installation. A clue for you: The technology used in the Origin machines was originally developed by Cray, and it runs 1024 CPU installations as global shared memory and single system image.
As for research, it's more a case of researchers doing the old "Damn, I'll have to make do with this". And Origin and Altix systems are still selling well in the research market.
And don't forget, Cray is backed by US government departments such as the NSA. The X1 received a lot of such support, which Cray even admits themselves: http://www.cray.com/products/systems/x1/
Hello,
actually, if you read the datasheet, the XD1 runs Linux 2.4.21 with some modifications (see xd1_datasheet.pdf).
So does the SGI Itanium machine. What sets these computers apart is that they offer better interconnections between the processors than clusters do.
The Big Mac has 1.2GB/s between two nodes through InfiniBand whereas an SGI machine has 6.4GB/s.
As a summary, in a cluster you use slower links with higher latency and your processors communicate through messages.
In an SGI or Cray machine, you use fast and expensive links (think more wires, more expensive controllers) and your processors can work as though they all shared the same memory.
SGI sells systems with 128 processors where there is only ONE Linux kernel (as opposed to 128 in a Linux cluster).
In fact even rendering a single frame decomposes easily into lots of separate tasks, because it involves raytracing backwards from every single pixel, trying to calculate its colour. Furthermore, each of these tracings is completely separate and just begs to be parallelised.
Raytracing is sometimes referred to as "embarrassingly parallel" because of this.
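A minimal Python sketch of what that independence looks like. The `shade` function is a hypothetical stand-in for tracing one ray (it is not a real raytracer), and the image size is arbitrary. Because each pixel depends only on its own coordinates, rows can be handed to workers in any order and the result matches the serial render.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 64, 48   # arbitrary image size for the sketch


def shade(x: int, y: int) -> int:
    """Hypothetical stand-in for tracing one ray: the value depends
    only on (x, y), never on any other pixel."""
    return (x * x + y * y) % 256


def render_row(y: int) -> list:
    return [shade(x, y) for x in range(WIDTH)]


def render_serial() -> list:
    return [render_row(y) for y in range(HEIGHT)]


def render_parallel(workers: int = 4) -> list:
    # Rows are independent, so they can be farmed out freely;
    # pool.map returns results in row order regardless of finish order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_row, range(HEIGHT)))


if __name__ == "__main__":
    print(render_parallel() == render_serial())
```

Threads are used here only to show the structure; in CPython you would swap in `ProcessPoolExecutor` (or separate cluster nodes) to get an actual speedup on CPU-bound work.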
Mathematical dependencies are the real destroyer of parallelism. Any situation where the next calculation depends on the result of the previous one is a typical serial calculation that would do badly on any supercomputer and might as well be run on a single scalar processor like the Athlon or P4.
You are basically quoting Amdahl's law (you may know that, but it should be pointed out in case anyone who doesn't know wants to look it up). His machines run into the same problem, though: if the program cannot be broken into little concurrent chunks then having 1024 processors isn't going to help either.
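For anyone who wants to play with it, Amdahl's law in a few lines of Python. The function names are my own; the formula is the standard one, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelised and n is the number of processors.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup predicted by Amdahl's law for a program whose
    parallelisable share of run time is parallel_fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def max_speedup(parallel_fraction: float) -> float:
    """Upper limit on speedup as the processor count grows without bound."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        print(p, [round(amdahl_speedup(p, n), 2) for n in (1, 16, 256, 1024)])
```

Even at 90% parallel, 1024 processors cannot get you to a 10x speedup, whether those processors sit in a cluster or in a Cray.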
When I worked at Oak Ridge National Labs there were several applications that people ran on our clusters that were serious computations. Very few of the people there really cared one way or another if it was on the IBM SP-2 or on the Intel clusters; they just ran on the hardware that had the shortest runtime.
We generally got well over 8% utilization; if that was all you were getting then you were not managing the cluster well. Basically both machines had similar problems: if one piece of software only utilized 10% of the machine (and that is possible, even probable, in either world) then you ran more than one person's jobs. They did it and so did we. It was rare that a single person got exclusive use of the machines (they either shared each individual node or the overall machine was split into smaller clusters/supercomputers). The lines between the two are very blurry, but of course Cray wants you to think differently.
This article is just like one of the researchers there who ran the Big Iron stuff. When I was still an intern I overheard him telling the new director about how clusters sucked because they cost so much more in salaries to maintain. While true, he overlooked that their service contract with IBM cost more than triple what it would cost us to replace the whole cluster per year and hire four full-time people to manage it, and they never got any hardware upgrades for it.
Each has its strong points and weaknesses, and never trust someone who is trying to sell you something to give you the full story.
------- Sorry about the spelling, I suffer from two problems. Dyslexia makes it difficult to spell well, lazy makes it