Slashdot Mirror


Cray CTO Says Cray Computers Are Great

Jan Stafford writes "Linux clusters can not offer the same price-performance as supercomputers, according to Paul Terry, chief technology officer of Burnaby, British Columbia-based Cray Canada. In this interview, Terry explains that assertion and describes Cray's new Linux-based XD1 system, which will be priced competitively with other types of high-end Linux clusters."

8 of 338 comments

  1. The issues are progress and long-term usefulness by Space cowboy · · Score: 5, Informative


    Given the difference in rate of evolution between the two camps, it can't be long before PC clusters, probably running Linux with PVM or BSP (that's bulk-synchronous parallel rather than 3D graphics :-), are perfectly capable of doing what supercomputers do today. Of course, there'll be new really-super computers then, but that's a different story :-)

    It's all very well to mock the I/O of PCI, but that's why we're all imminently moving to PCI Express, at a rather more respectable (current) maximum of 8+GBps rather than 133MBps... Run a few gigabit ethernets in a hypercube formation and you have some rapid data transfer...
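
    A rough sketch of what "hypercube formation" means for the cabling, in Python. The function names are made up for illustration. The only assumption is the standard binary hypercube: nodes are numbered 0 to 2^d - 1, and two nodes are linked when their numbers differ in exactly one bit, so each PC needs d gigabit ports.

```python
def hypercube_neighbors(node, dims):
    """Nodes directly cabled to `node` in a dims-dimensional hypercube."""
    # Flipping one bit of the node number gives one neighbour per dimension.
    return [node ^ (1 << bit) for bit in range(dims)]


def hops(src, dst):
    """Minimum number of links between two nodes (count of differing bits)."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    dims = 4  # 2**4 = 16 nodes, 4 gigabit ports each
    print(hypercube_neighbors(0, dims))  # [1, 2, 4, 8]
    print(hops(0, 2 ** dims - 1))        # 4, the worst case in this cube
```

    The worst-case hop count equals the number of dimensions, so doubling the node count costs one extra port per machine and one extra hop.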

    I notice he hasn't quoted the data-transfer rate on these new super-duper chips. The whole article does rather look like a piece of advertising on the cheap. Speaking of which, the cluster solution is (relatively) CHEAP. Did I mention that IT'S CHEAP...

    Simon.

    --
    Physicists get Hadrons!
  2. Re:*Shock* by Anonymous Coward · · Score: 5, Informative

    No, no, you misunderstand.
    He's saying that Linux-based *supercomputers* are faster than Linux-based *clusters*.
    (although, you can probably cluster those supercomputers...)

  3. Re:*Shock* by ohad_l · · Score: 5, Informative

    Uhh, no, he's not dissing Linux at all. He's saying that one big supercomputer (running Linux, perhaps) will get you more price-performance (bang per buck, I guess) than a Linux cluster.

    --
    If it weren't for fog, the world would run at a really crappy framerate.
  4. Re:The issues are progress and long-term usefulnes by PythonCodr · · Score: 5, Informative

    It's not just the speed of the data transfer, it's also the latency of the interconnect. A lot of scientific codes will pass around a lot of little messages, and GigE is fast for bulk transfer, but it's not so good for that. That's why there are companies like Quadrics, Myricom, etc... Infiniband should fix this, but you'll want a big Infiniband switch.

    His point is that building fast machines is hard, and building the fastest machines is really hard. Too many folks think all you have to do is throw enough PCs and GigE NICs at the problem. You can build a machine that way, but the codes don't scale well. Some scientific code will in fact quickly show negative scaling (where the more processes you add, the *slower* your code will run). MPI codes do that all the time, which is one of the reasons you'll see people running their code at sizes smaller than the whole machine, and at different sizes on different machines.

    Yeah, you can build a Linux-based world-class supercomputer as a cluster, but you'd better be willing to sweat the details is all. Or buy a Cray, I guess. ;-)
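
    The negative scaling described above falls out of even a toy cost model. Here's a minimal Python sketch, not a benchmark: it assumes each timestep splits a fixed amount of compute evenly, then every process sends one small message to every other process. All the numbers (1 s of work, 1 KB messages, 50 us latency, 125 MB/s links) are illustrative guesses, and the function names are made up.

```python
def step_time(procs, work_s=1.0, latency_s=50e-6,
              msg_bytes=1024, bandwidth_Bps=125e6):
    """Seconds per timestep: evenly divided compute plus one small
    message to every other process (an all-to-all exchange)."""
    compute = work_s / procs
    per_message = latency_s + msg_bytes / bandwidth_Bps
    return compute + (procs - 1) * per_message


def best_size(**params):
    """Power-of-two process count (1..4096) with the lowest step time."""
    sizes = [2 ** k for k in range(13)]
    return min(sizes, key=lambda p: step_time(p, **params))


if __name__ == "__main__":
    for p in (16, 128, 1024):
        print(p, "procs:", round(step_time(p) * 1e3, 2), "ms per step")
    print("best size at 50 us latency:", best_size())
    print("best size at  5 us latency:", best_size(latency_s=5e-6))
```

    With these made-up numbers the message cost is dominated by latency, not bandwidth, so past the sweet spot every added process makes the step slower. Cutting the latency tenfold moves the sweet spot out, which is the argument for the low-latency interconnects.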

  5. No ... by gstoddart · · Score: 5, Informative

    There are entire classes of computational problems which are classed as Embarrassingly Parallel.

    It means it is so trivial to parallelize the problem and get gains from it (think SETI@Home) that it's a no-brainer.

    Other computational problems don't just simply fan out to the bazillions of nodes with tiny independent pieces of data.
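
    For the embarrassingly parallel case, a minimal Python sketch looks like this. The names `score_chunk` and `run_parallel` are hypothetical, and the threads are only a stand-in for cluster nodes (in CPython they won't actually speed up CPU-bound work); the point is the shape of the problem: no chunk ever needs data from another chunk.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(chunk):
    """Stand-in for real work on one independent piece of data."""
    return sum(x * x for x in chunk)


def run_parallel(data, workers=4, chunk_size=1000):
    """Split the data, process every chunk independently, combine at the end."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No communication between workers while they run.
        partials = list(pool.map(score_chunk, chunks))
    return sum(partials)


if __name__ == "__main__":
    data = list(range(10_000))
    print(run_parallel(data) == score_chunk(data))  # True
```

    The only communication is handing out chunks and collecting results. A problem where each piece needs its neighbours' values at every step has no such clean split, and that's where the interconnect starts to matter.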

    Your assertion that the Cray CTO is talking FUD when he uses the actual term is just plain wrong and unfair to him. He actually knows what he's talking about.

    --
    Lost at C:>. Found at C.
  6. Not quite so simple really is it? by Anonymous Coward · · Score: 5, Informative

    I don't think the Cray assertion is that crazy.

    For a 12-CPU Opteron unit the academic pricing (admittedly lower than commercial, but that's where most of their sales will go) is about 45K. That's not too shabby. Before you bounce up and down and say you can build four times the cluster for that price, it should be noted that the XD1 gives you a single system image, which simplifies programming and makes shared memory applications (increasingly important for areas such as bioinformatics) easier.

    We have a cluster with Dolphinics Wulfkit, and using distributed shared memory slows us down. It's not an end-of-the-world type of slowdown, but it's a factor. Our cluster is sixteen nodes, dual Xeon 2.2GHz, with Wulfkit 3D torus interconnects. It cost us, at academic prices, $50K. Admittedly that's more CPU power than the 12 Opterons, but we find ourselves using distributed shared memory a lot. Wulfkit is great here, and that would probably be much better on the XD1. Had the XD1 been available a year ago we may have bought one instead.
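
    Taking the two prices quoted above at face value, the raw per-CPU arithmetic is easy to check. This little Python sketch assumes the 45K is in the same currency as the $50K, counts CPUs only, and ignores clock speed, memory and interconnect differences, so treat it as back-of-the-envelope only.

```python
def price_per_cpu(total_price, cpus):
    """Naive cost per processor, ignoring everything except the CPU count."""
    return total_price / cpus


xd1 = price_per_cpu(45_000, 12)          # 12-CPU Opteron XD1 unit
cluster = price_per_cpu(50_000, 16 * 2)  # sixteen dual-Xeon nodes

print(xd1, cluster, xd1 / cluster)  # 3750.0 1562.5 2.4
```

    So on raw CPU count the XD1 costs about 2.4 times as much per processor. The question is whether the shared memory performance wins that back for a given application.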

    It really depends on your application. Are Crays cheaper than clusters in terms of harnessable compute power per dollar? Maybe. Depends on your application. Surely that's the correct answer.

    Also, buying Cray is about getting access to their software technology too.

    R-S

  7. Re:The issues are progress and long-term usefulnes by Wesley Felter · · Score: 5, Informative

    Good clusters don't use IP; they use Infiniband, Myrinet, or Quadrics, which all have OS bypass and transport offload features so that the app can talk directly to the NIC. In fact, Cray's XD1 "supercomputer" uses the same Infiniband interconnect as some "clusters"; Cray just has better NICs.

  8. Re:*Shock* by ranrub · · Score: 5, Informative

    Have you ever worked with supercomputers?

    However, if your supercomputer goes down... well, you're screwed

    Cray supercomputers have built-in redundancies. All the subsystems are separate from the processors and memory, which are actually "clustered" (depends on model). Even the OS has built-in means to survive the harshest hardware catastrophe by regularly checkpointing the running jobs to off-site disks.

    1000 machines are more reliable than 1 big machine

    Wrong again. With 1000 lousy cheap machines, you need an on-site team of technicians to keep them all up. Supercomputers (with built-in redundancy etc.) have equal or lower maintenance requirements.
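
    To put a rough number on the 1000-machine case, here's a minimal Python sketch. It assumes the boxes fail independently, and the 1% weekly failure rate is purely hypothetical, picked only to show the shape of the arithmetic.

```python
def p_any_failure(machines, p_fail):
    """Chance that at least one of `machines` independent boxes fails."""
    return 1 - (1 - p_fail) ** machines


def expected_failures(machines, p_fail):
    """Average number of boxes needing attention in the same period."""
    return machines * p_fail


if __name__ == "__main__":
    p = 0.01  # hypothetical: 1% chance a given cheap box fails in a week
    print(p_any_failure(1000, p))      # very close to 1
    print(expected_failures(1000, p))  # about 10 repairs a week
```

    Under those assumptions something is broken pretty much every week. Whether that hurts depends on whether your job can carry on without the dead node.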