Slashdot Mirror


Cray XT-3 Ships

anzha writes "Cray's XT-3 has shipped. Using AMD's Opteron processor, it scales to a total of 30,580 CPUs. The starting price is $2 million for a 200 processor system. One of its strongest advantages over the std linux cluster is that it has an excellent interconnect built by Cray. Sandia National Labs and Oak Ridge National Labs are among the very first customers. Read more here."

34 of 260 comments

  1. imagine a... by Anonymous Coward · · Score: 5, Funny

    single node of those.

    1. Re:imagine a... by Anonymous Coward · · Score: 5, Insightful
      *rolls eyes*

      When you have a single CPU, designing the system to be pretty fast is easy. There's no major contention to deal with.

      Two CPUs? Slightly harder, but reasonably straightforward. You don't see a 2x improvement in speed over one CPU, but it's around 1.95x, give or take a bit.

      Four CPUs? Now you're starting to see less improvement ... probably around 3.2x, because of all the contention issues.

      Sixty-four CPUs? You'll be lucky to get a 50x speed up over a single CPU.

      When you get to 200 CPUs, the issue of access to shared memory and other shared resources becomes critically important. It's also an issue that most computer buyers don't need to worry about, because they don't have 200 CPUs in their system. This means that you have a lot of highly specialised research going on, and relatively few buyers to spread the cost of that research over.

      Two million for a 200 CPU box which has low latency, low contention, and solid reliability is not a lot at all. You might not buy it. That doesn't mean nobody will.
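      A rough sketch of what those speedup figures mean as parallel efficiency (speedup divided by CPU count). The numbers are just the ballpark figures quoted above, with 50x taken as the "lucky" case for 64 CPUs; the function name is made up for illustration.

```python
# Ballpark speedups quoted in the comment above (CPU count -> speedup).
quoted_speedups = {2: 1.95, 4: 3.2, 64: 50.0}


def parallel_efficiency(cpus, speedup):
    """Fraction of ideal linear speedup that is actually achieved."""
    return speedup / cpus


for cpus, speedup in quoted_speedups.items():
    eff = parallel_efficiency(cpus, speedup)
    print(f"{cpus:>3} CPUs: {speedup:5.2f}x speedup, {eff:.1%} efficiency")
```

      The point of the comment shows up as the efficiency dropping from about 98% at two CPUs to under 80% at 64, which is the loss a better interconnect is meant to fight.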

    2. Re:imagine a... by crimsun · · Score: 4, Informative

      It's not just hardware: the amount of non-parallelizable code in parallel applications impacts scalability most tremendously.

      The upper bound on speedup is generally Amdahl's law. Plainly, the efficiency approaches zero as the number of processes is increased. Generally we consider the major sources of overhead to be communication, idle time, and extra computation. Interprocess communication is considered negligible for serial programs in this context (we consider message passing). Idle time ends up contributing to overhead, because processes idle awaiting information from others. Extra computation is virtually unavoidable at some point; for instance in MPI's Single Program Multiple Data model, each process in tree-structured communication other than the root is eventually idled prior to the completion of computation, and each process determines IPC at some point based on rank.

      There are notable exceptions to Amdahl's law, however; Gustafson, Montry and Benner wrote about such in "Development of parallel methods for a 1024-processor hypercube", SIAM Journal on Scientific and Statistical Computing 9(4):609-638, 1988.
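      For anyone who wants numbers, here is a minimal sketch of the two views. It assumes the "exception" meant here is the scaled-speedup argument usually called Gustafson's law (the problem grows with the machine), and the 5% serial fraction is a made-up value for illustration, not something from the paper.

```python
def amdahl_speedup(serial_fraction, n):
    """Fixed problem size: speedup can never exceed 1 / serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def gustafson_speedup(serial_fraction, n):
    """Problem size scaled with n: speedup stays close to linear."""
    return n - serial_fraction * (n - 1)


serial_fraction = 0.05  # hypothetical: 5% of the run is not parallelizable
for n in (2, 64, 1024):
    print(f"n={n:>4}: Amdahl {amdahl_speedup(serial_fraction, n):7.2f}x, "
          f"Gustafson {gustafson_speedup(serial_fraction, n):7.2f}x")
```

      With a fixed problem the speedup flattens out below 20x no matter how many processes are added, so efficiency heads toward zero; with a scaled problem it does not.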

  2. How big is it? by rooijan · · Score: 3, Interesting

    I read the article (okay, so I kinda read it :-) ) and it has the speed and specs to be a geek's improvement on sliced bread. But how big is it, physically?

    The article doesn't appear to mention its dimensions, and I'm curious to know what kind of space you need to install this baby. Anyone got any idea?

    --
    Daar is nie 'n lepel nie
    1. Re:How big is it? by Anonymous Coward · · Score: 4, Informative

      Dimensions (cabinet): H 80.50 in. (2045 mm) x W 22.50 in. (572 mm) x D 56.75 in. (1441 mm)

      Weight (maximum): 1529 lbs per cabinet (694 kg)

      http://www.cray.com/products/xt3/specifications.html

  3. I'll pass for now. by mrjb · · Score: 3, Funny

    This is only the XT-3. I'll wait for the Pentium-3-4.

    --
    Visit http://ringbreak.dnd.utwente.nl/~mrjb/growingbettersoftware to download your free copy of the book
  4. we're getting closer... by nilbog · · Score: 5, Funny

    A few more years of advances like this and we might have a machine capable of running Longhorn!

    --
    or else!
    1. Re:we're getting closer... by metlin · · Score: 3, Funny


      Ahh, now that's what I call an optimist.

    2. Re:we're getting closer... by provolt · · Score: 4, Insightful

      Ah the joys of youth.

      Back in my day we spelled "enuff" without the 'f' character and it was good enough for us.

  5. $2 million for a computer? by commodoresloat · · Score: 3, Funny
    It better have a lot of good games. How many mouse buttons does it have?

    I can't believe people complain about the price of iMacs....

  6. real FPU operations by Barbarian · · Score: 4, Interesting

    How are the Opterons at standard FPU operations in double precision? SSE2 and friends are nice, unless you have to make compromises in your simulations.

    I ask, because I remember that the Athlons beat the pants off the Pentium 4's in FPU operations, so all the benchmarks were rewritten to use SSE2.

    1. Re:real FPU operations by jmv · · Score: 4, Informative

      Opterons beat the pants off the Pentium 4s in x87 (i.e. old) FPU operations. If you want to get good performance, you need SSE/SSE2, both for AMD and Intel. For pure SSE, the Pentium 4s beat the Opterons mainly because of the clock speed, but for multi-processor systems, HyperTransport and all more than makes up for that.

    2. Re:real FPU operations by jmv · · Score: 3, Insightful

      Both SSE and 3DNow! get you (in theory, at best) two adds and two multiplies per clock cycle, even on an Opteron. So yes, just because of the clock, the P4 beats the Opteron in the case of pure (no memory/cache access, no dependency, nothing else) float operations. Now, in real life, you sometimes spend longer waiting for the data than computing with it and that's how the Opteron quite often comes out on top, especially for multi-processors.

    3. Re:real FPU operations by jmv · · Score: 5, Interesting

      Couple facts about SSE:
      1) You can use it in scalar mode, in which case it's almost like x87, only a bit faster because:
      a) It doesn't use a braindead register model (stack)
      b) On P4, you can do a mul and an add in parallel with SSE, but not with x87
      2) You can use SSE intrinsics. It's not as easy as "normal" programming, but easier than assembly and almost the same speed.
      3) Unaligned access is possible. It's slower than aligned access, but overall better than non-vectorized code.
      4) Trig is so slow that SSE/x87 doesn't matter (unless you write approximations, in which case SSE will also be faster).

  7. Just the name brings back memories by Dancin_Santa · · Score: 3, Informative

    In this day and age of very fast computers and clusters built in our basements, there sometimes comes along a story that whispers of the computing age of days long past. Cray is one of those names that can drop a jaw by its mere utterance.

    The name is synonymous with speed and power and the unwillingness to cut corners in order to shave a few dollars off the final product. When you buy a Cray, you know you are getting top of the line hardware.

    It looks like Sandia wants to build the fastest supercomputer in the world by clustering a few of these monsters, and I have no doubt that they will. Looks like more fun articles about this in the future. :-D

    There are two prominent applications for these machines. The first is nuclear weapons simulation. Personally, I don't see the point to that. The other application is in weather prediction. By feeding current weather variables into a well-written model, a supercomputer is able to predict the future weather with a high degree of accuracy. Such an application will always be welcome.

    I think I'm going to have to fire up the old ][e, the nostalgia is killing me!

    1. Re:Just the name brings back memories by joib · · Score: 4, Informative


      There are two prominent applications for these machines. The first is nuclear weapons simulation. Personally, I don't see the point to that. The other application is in weather prediction.


      Oh, please. Buy a clue, will ya? There's lots and lots and lots of applications that use supercomputers, or could use if they were more affordable. A few examples from the top of my head:

      Materials science, that is ab initio simulations, moldyn, you name it. This alone probably uses > 50 % of all supercomputer cpu time in the world. By comparison, weather prediction and nuke simulations is small potatoes (or shall we say, the simulations as such are big, but the number of people engaged in weather prediction or nuke simulation is really small compared to all the supercomputing materials scientists).

      CFD, the automobile and aerospace sectors are big users.

      Electronic design.

      Seismic surveys, the oil industry uses lots and lots of supercomputers to find oil deposits.

      Biology. Gene sequencing, moldyn simulations of lipid layers and whatever.

      Climate prediction, somewhat related to weather prediction. Official purpose of the Earth Simulator.

      All of the examples above could easily use almost any amount of cpu power you can throw at them. The only thing that stands between a lot of scientists and improved understanding of the world is computing power.

    2. Re:Just the name brings back memories by flaming-opus · · Score: 4, Insightful

      Actually, there is no reason to cluster a few of these. If you have a 2000 node xt3 (or t3e, paragon, blue-gene, cm5, insert mesh-structured mpp here) and a 4000 node xt3, you stick them together and make a 6000 node xt3. But that's just picking nits.

      Curiously the xt3 IS about shaving dollars off the price. If you go read the original whitepapers on the system, they go through EXTENSIVE cost-return analysis. They studied their (then-) current generation of cluster systems, as well as future linux/solaris/aix clusters, and rejected them as (interestingly) FAR TOO EXPENSIVE, once the administrative costs are factored in. They then looked at, and rejected, cray's vector solution, the X1. They then decided that the (amazingly) most cost effective solution was to underwrite cray's product development cycle on a wholly new product. Basically they asked for an update to the system they already had (asci-red, i.e. intel paragon++). Nobody was building such a thing. Since cray had a really strong similar product in the 90s (T3D, T3E), the department of energy asked them to create an update. Some designs never die.

      What I'm most interested in is the reliability. One of the biggest difficulties in the T3D engineering cycle was dealing with memory failure. red-storm is going to have 10,000 processors. Let's assume each has 2 banks times 3 dimms (chip-kill) of memory. That means there are 10,000 x 6 x 18 = 1 million+ memory chips in the system. If 1/100th of a percent of these fail, that's still a lot of memory failures. How well are faults isolated? That's the big question for systems this big.
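      The back-of-the-envelope count above, spelled out as a small sketch. Reading the factor of 18 as chips per dimm is an assumption, and the comment gives no time window for the failure rate, so the result is only "failures per whatever period that rate covers".

```python
def total_memory_chips(processors, dimms_per_processor, chips_per_dimm):
    """Number of memory chips in the whole machine."""
    return processors * dimms_per_processor * chips_per_dimm


def expected_failures(chips, failure_fraction):
    """Expected number of failed chips for a given failure fraction."""
    return chips * failure_fraction


chips = total_memory_chips(
    processors=10_000,
    dimms_per_processor=2 * 3,  # 2 banks times 3 dimms, as assumed above
    chips_per_dimm=18,          # assumption: the "18" is chips per dimm
)
failures = expected_failures(chips, 0.01 / 100)  # 1/100th of a percent

print(f"{chips:,} chips, about {failures:.0f} expected failures")
```

      So even a tiny failure fraction means on the order of a hundred bad chips to detect and isolate.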

      I'm also a little wary of cray's use of lustre. I've used lustre before, as well as other cluster-FSes. While I'm not aware of other filesystems that will scale to 700+ i/o nodes, I'm not confident in lustre. It's an immature product at best. (I don't mean to disparage the people working on it, it's a neat architecture, but it's a hard problem, and I'm not sure it's ready for prime-time.)

  8. You don't have to begin to imagine by commodoresloat · · Score: 3, Informative

    You could just read on the spec page: Power: 14.8 kVA (14.5 kW) per cabinet. Circuit Requirement: 80 AMP at 200/208 VAC (3 Phase & Ground), 63 AMP at 400 VAC (3 Phase, Neutral & Ground) Cooling Requirement: Air Cooled, Air Flow: 3000 cfm (1.41 m3/s) Intake: bottom, Exhaust: top.

    1. Re:You don't have to begin to imagine by fbform · · Score: 5, Interesting


      More interesting is this spec:

      Acoustical Noise Level: 75 dBa at 3.3 ft (1.0 m)

      For comparison, that's roughly the same as an average vacuum cleaner when you're operating it, or maybe a good-sized pickup truck passing you in the next lane.

      And remember, this value is *per cabinet*. You have to do a weighted sum over all the cabinets in an installation to get a true dB level. I wonder whether the maintenance people will have to use noise-level exposure limits for this baby.
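      A rough sketch of how the per-cabinet figure would combine, assuming the cabinets are independent (incoherent) noise sources all heard from the same distance, which a real machine room won't satisfy. The cabinet counts are hypothetical.

```python
import math


def combined_level_db(levels_db):
    """Combine incoherent sources: 10 * log10(sum of 10^(L/10))."""
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


for cabinets in (1, 2, 10, 100):
    total = combined_level_db([75.0] * cabinets)
    print(f"{cabinets:>3} cabinets at 75 dB each: {total:.1f} dB")
```

      Under those assumptions every doubling of cabinets adds about 3 dB and every tenfold increase adds 10 dB, so the sum is logarithmic rather than a plain addition.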

      And here I was, complaining about the quiet whine of my PC's fan.

      --
      Time flies like an arrow. Fruit flies like a banana.
    2. Re:You don't have to begin to imagine by pchan- · · Score: 4, Interesting

      Power: 14.8 kVA (14.5 kW) per cabinet.

      that's amazing. how did the cray guys get a kilovolt-ampere that is not equal to a kilowatt? just goes to show you the power of fast interconnects.

    3. Re:You don't have to begin to imagine by wronskyMan · · Score: 5, Informative

      Disclaimer: IANACEBIATAPEC (I Am Not A Cray Engineer But I Am Taking A Power Engineering Course)
      It's fairly common to get a kVA != kW.
      Overall power used by a load is expressed as S=P+jQ, where P is the "real" power and Q is the reactive power (capacitive/inductive from motors, fluorescent lamp ballasts, etc).

      While the "units" of S, P, and Q are power=voltage*current, S is generally expressed in VA, P in W, and Q in VAR (volt-ampere reactive) to differentiate the variables. Because the magnitude of S=sqrt(P^2+Q^2), S will always be greater than or equal to P (in this case, 14.8kVA=sqrt((14.5kW)^2+(±2.965kVAR)^2)).
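      The same arithmetic as a quick sketch, using only the two figures from the spec page. The power factor line is an extra not mentioned above; it is simply P divided by |S|.

```python
import math


def reactive_power(apparent, real):
    """Magnitude of Q from |S| and P, using |S|^2 = P^2 + Q^2."""
    return math.sqrt(apparent ** 2 - real ** 2)


def power_factor(apparent, real):
    """Ratio of real power to apparent power."""
    return real / apparent


apparent_kva = 14.8  # from the spec: 14.8 kVA per cabinet
real_kw = 14.5       # from the spec: 14.5 kW per cabinet

print(f"Q  = {reactive_power(apparent_kva, real_kw):.3f} kVAR")
print(f"pf = {power_factor(apparent_kva, real_kw):.3f}")
```

      The sign of Q can't be recovered from the magnitudes alone, which is why it shows up as plus-or-minus above.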

      --
      --- You shall know the truth, and the truth shall make you mad- Neal (not Cowboy) Boortz
  9. Opterons and PowerPC together by Henriok · · Score: 5, Interesting

    It seems that the XT-3 not only uses Opteron processors but also uses PowerPC 440 co-processors from IBM to offload inter-processor communication from the main computing CPUs. Quite an interesting setup.

    The XT-3's biggest competitor in this segment must be the BlueGene/L type supercomputer made by IBM. The processor in BlueGene/L is a custom built dual core version of the PowerPC 440 with built in high speed interconnects.

    Just like IBM have a finger in all the future game consoles, they seem to have a finger in several of the next generation supercomputers also. Nice going IBM.

    --

    - Henrik

    - when the Shadows descend -
  10. The first test of the new Cray by teamhasnoi · · Score: 3, Funny
    they simulated a woman who posts to Slashdot and is waiting for her Centris running PearPC on Debian to boot OS X.

    Strangely, it took roughly a week. The second test was a simulation of the moderation results of this post.

    It received a +5 Funny, which puzzled researchers, as it is currently modded -1 Offtopic.

    Damn you Schroedinger!

  11. Re:software by Coryoth · · Score: 5, Informative

    what kind of operation system runs on this beast?

    UNICOS is usually a safe bet. In this case the specs say UNICOS/lc, which is made up of "SUSE(TM) Linux(TM), Cray Catamount Microkernel, CRMS and SMW software"

    I'm not entirely clear how to interpret that, but I think it runs as follows: It runs the Catamount Microkernel as the kernel, and uses SUSE for everything else (so we have SUSE Linux, without the Linux - all of a sudden that GNU/Linux stuff starts to make sense). The CRMS is their interconnect management and monitoring software, and SMW is the System Management Workstation - which I'm guessing is their administration frontend.

    It's worth noting that that's some pretty serious software there (because Cray has a lot of experience dealing with large systems) - you can bet that the management and monitoring software is some very serious stuff.

    This thing is to a beowulf cluster what a dual G5 PowerMac is to a homebuilt PC system running Linux From Scratch. It's going to work flawlessly "out of the box" with a smooth and polished interface that lets you get done everything you want to do simply and easily. You can of course make your home built PC with LFS work just as well, it's just going to take you an awful lot of effort.

    Jedidiah.

  12. Re:So......the cost compared to? by Coryoth · · Score: 4, Informative

    So, how does this compare to running Apple's Xserve? Bang per buck? Heat? Space? Etc etc....

    There's not a lot to compare. We're talking apples and oranges. It's like asking to compare a PowerMac G5 with a bunch of PC parts scattered on the floor as desktop machines. Sure, you can put the PC together, load it with Linux, tinker with it to get everything working, etc. but that's a fair amount of work compared to taking the PowerMac out of the box, plugging it in, turning it on, and having everything work perfectly.

    Read the specs, particularly with regard to the interconnect, system administration, and hardware and software reliability features. This thing is seriously engineered to be a massively parallel system with top of the line hardware and software to support and maintain that, as well as extremely impressive reliability features.

    Jedidiah.

  13. Re:MP performance overhead by Big+Mark · · Score: 3, Informative

    If Crays were built the same way as desktop dual-proc machines, then yes, the multi CPU overhead would cripple it. Fortunately, it's designed completely differently - e.g. they use PowerPC chips to handle almost all of the inter-processor communication.

    You can't really compare something that can hold thousands of CPUs to something powered by Abit that can hold two, anyway. It's like comparing apples and a strange bug thing with tentacles.

  14. Re:My new dream toy by Guppy06 · · Score: 3, Funny

    Maybe if you included promises of free iPods...

  15. Re:cray by Anonymous Coward · · Score: 5, Interesting

    Cray never went "belly up". It was acquired by SGI around 1997 or so, then divested and merged with Tera, who renamed the resultant entity "Cray Inc".

    Although it's true that Cray was not growing strongly before the SGI buy-out, it was not failing either. It could have kept running quite happily for many years, but in the bizarro-world of Wall Street, a company which is not growing is dying. I so love it when economists use biological terminology for corporations. In Wall Street's thinking, the only healthy growth would be a cancerous tumor.

    Anyway....

    The whole SGI-period of Cray is actually quite fascinating, and I suspect the true story will never be fully known. Lots of SGI engineers had their non-Cray technology branded with Cray marketing names, most egregiously LegoNet becoming CrayLink. Lots of Cray folks - aka. Crayons - felt that the core of their company was gutted by an SGI operation which didn't care for the extreme high-ends of HPC.

    One rumor I heard, from a well-placed source, is that the Cray merger with SGI was primarily arranged by the USG. The intelligence services have huge investments in both companies' products, so the merger between them made sense. I was told that as a quid pro quo, the USG had an in-principle agreement to continue purchasing Cray gear to provide enough revenue inside SGI to keep both Cray architectures alive. However, certain parts of SGI felt that the US government didn't live up to their agreement, negotiations to rectify that weren't successful, and so SGI management defunded significant aspects of the Cray engineering work.

    Also, FYI, Cray is one of those companies which will never totally go "belly up" anyway. Given the sensitivity of the work which they did, their support databases alone are full of sensitive and/or classified information. Should the company cease trading, it would be acquired by a shelf company whose sole function is to ensure this data would remain private. That's been the fate of almost all of the now-defunct supercomputer and high-end graphics companies who formerly supplied the defence and intelligence market.

  16. Re:The math for a comparable Xserve system by joib · · Score: 4, Insightful


    What a value!!


    That is, until you throw a tightly coupled problem at it and the Cray is 10 times faster because it has much better internode bandwidth and lower latency.

    And, you forgot to count the cost of the InfiniBand interconnect that the VT cluster used? That's a couple grand per node.

    Bottom line, apples and oranges. If your application is easily parallelizable (i.e. doesn't require much communication between the nodes) you'd be stupid to piss away your money on a "real" supercomputer instead of a cluster. And vice versa.

  17. 700kgs, 75dB and 14kW... by Alkonaut · · Score: 4, Funny

    ...Sadly I think that beats my Volkswagen on all three

  18. No, what stands in the way is price by Moraelin · · Score: 3, Interesting

    The real problem that stands between scientists and them having lots of shiny toys is funding.

    E.g., yeah, having a 30,000 CPU super-computer to simulate your gene model on would be nice. Forking over half a billion for it, well, it's suddenly not that nice any more.

    Having one of those to simulate an electronic circuit, now that would probably rock. Again, paying half a billion for it, suddenly isn't that attractive.

    The real question isn't how nice a toy you'd like to have, it's ROI. (Unless you work for the government, and just have a budget you _have_ to blow on stuff, whether you need that stuff or not.)

    And in that context, you'd be surprised what you _can_ do with a lot less expensive toys.

    Having Cray's custom interconnects sure is impressive, but for a lot of problems they're not even needed any more. _That_ is what killed Cray.

    Most RL problems are not really the kind described as "_one_ huge indivisible data set, that you have to process in _one_ huge batch process." They're more like "we have this process with a small data set that we have to run 100,000,000 times." Most design problems or biology problems are really of that kind: run the same thing 100,000,000 times with different parameters.

    And as Seti@Home or Folding@Home proved, a helluva lot of those don't really need _any_ kind of shared memory or fancy interconnects. The real ticket is noting that instead of accelerating the batch run 200 times, you could just split it into 200 smaller batches run on 200 single-CPU machines.

    The super-computer solution costs 2,000,000 just for the machine alone, while the 200 PCs solution costs 200,000 or so. I.e., 10 times cheaper. Better yet, the 200 PCs solution is also far cheaper to program. (Anyone can program a non-threaded batch app.) _And_ for that kind of a problem the 200 PCs solution would actually finish faster, since it has no contention issues whatsoever.
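    A minimal sketch of that split, assuming every run depends only on its own parameter set. The function names and the toy `simulate` stand-in are made up for illustration; in practice each batch would be shipped to its own PC.

```python
def split_into_batches(parameters, machines):
    """Deal parameter sets round-robin into one independent batch per machine."""
    return [parameters[i::machines] for i in range(machines)]


def run_batch(batch, simulate):
    """Run one batch serially; batches never need to talk to each other."""
    return [simulate(p) for p in batch]


def simulate(p):
    """Toy stand-in for the real (hypothetical) single-run model."""
    return p * p


parameters = list(range(1000))               # 1000 independent runs
batches = split_into_batches(parameters, 200)
results = [run_batch(batch, simulate) for batch in batches]

print(len(batches), "batches of", len(batches[0]), "runs each")
```

    Since no batch reads another batch's data, there is nothing to contend over, which is the whole argument for the pile of cheap PCs.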

    Again, that's what really killed Cray and the super-computers. They're technologically impressive, they're a geek's wet dream, but... for 99.9% of the problems out there they're just not worth the price any more.

    --
    A polar bear is a cartesian bear after a coordinate transform.
  19. ... Back in my day .... young whippersnapper by ebooher · · Score: 4, Interesting

    So come on, ante up. How many remember being awed at the mere sight of old Crays back in the day? Like the Cray-3? I remember the first time I saw a Cray .... thing was in an anti-static environment. To access it, one had to pass through an airlock and be "decharged" or "depolarized" etc. Basically they somehow charged the air to get rid of static electricity. Then you had this system that was running *in* liquid! Take that, "Oh I'm so cool cause I have a l337 haX0r water cooled CPU" overclockers!

    They (Cray) were so proud of this accomplishment that the upper portion of the cabinet was some kind of plexiglass so you could see the fluid as it moved, and moved wiring and what not with it. Very surreal feeling, almost like the thing was breathing.

    And what about the Cray-1? Wasn't that a true testament to 70's *art* and sculpture? The thing looks like some kind of freaky bus station bench with its odd red and white panels and black base. Though, I don't know if they all looked like that, maybe you could get them in other colors?

    Ahh .... those were the days.

    --
    "Genius may shine aloof and alone, like a star, but goodness is social, and it takes two men and God to make a Brother."
  20. hybrid system with multiple kernels by Dink+Paisy · · Score: 3, Informative

    From the documents, it looks like it runs Linux on the management nodes and Catamount on the compute nodes. The idea is you can do what you like with the general purpose nodes, but for the compute nodes, you run a lightweight operating system that has low overhead, minimal services and predictable scheduling. BlueGene/L works the same way; it runs Linux on the management nodes and a custom operating system on the compute nodes. Compute nodes likely provide scheduling for only the number of threads that run on the node, communication through MPI and some proprietary API, and basic debugging facilities. Compute nodes probably lack normal OS services like network, disk, or even a console.

    --

    Whoever corrects a mocker invites insult;
    whoever rebukes a wicked man incurs abuse.
    --Proverbs 9:7
  21. Re:software by flaming-opus · · Score: 3, Informative

    This split microkernel architecture has been in use for a long time on big mpp systems like the paragon and the t3e. The software base (catamount/linux) is new, but the design is old.

    catamount is the kernel that runs on the compute nodes. It's a tiny kernel that packages up the OS service requests, and sends them, over the interconnect, to an OS or I/O node, which does the real work of the operating system. catamount is a descendant of PUMA, which came from Cougar. These are heavily derived from work done at caltech. (I believe CMU, and one of the UTexas schools also played a role, but am not sure). The idea is that the microkernel is small and unobtrusive, and it gets the hell out of the way so the application can use the CPU as much as is possible.

    The OS and I/O nodes run linux, and provide services to the compute nodes. This is probably done in the kernel, but it could just as easily be running as a user-space daemon on the OS node. (Though you might have to do some mem-copys that way, which would lower performance)

    NOTE: Though these nodes take advantage of some of linux's features (like the lustre file system) they do NOT necessarily implement these features for the system as a whole. They probably provide a minimal set of features necessary for the sorts of problems that the xt3 runs. All the scheduling work that has gone into more recent linux kernels is of little use, as the compute nodes have their own scheduler, probably more closely tied to the batch dispatcher than to the linux kernel. To say that the system runs linux is true, but a little misleading. It's a very different linux than what runs on my desktop, and it's used in a very different way.