Slashdot Mirror


Supercomputer Built With 8 GPUs

FnH writes "Researchers at the University of Antwerp in Belgium have created a new supercomputer with standard gaming hardware. The system uses four NVIDIA GeForce 9800 GX2 graphics cards, costs less than €4,000 to build, and delivers roughly the same performance as a supercomputer cluster consisting of hundreds of PCs. This new system is used by the ASTRA research group, part of the Vision Lab of the University of Antwerp, to develop new computational methods for tomography. The guys explain the eight NVIDIA GPUs deliver the same performance for their work as more than 300 Intel Core 2 Duo 2.4GHz processors. On a normal desktop PC their tomography tasks would take several weeks but on this NVIDIA-based supercomputer it only takes a couple of hours. The NVIDIA graphics cards do the job very efficiently and consume a lot less power than a supercomputer cluster."

25 of 232 comments

  1. Re:By what benchmark? by Anonymous Coward · · Score: 5, Informative

    By what benchmark is eight of the NVIDIA GPUs in the 9800 GX2 more powerful than 300 2.4 GHz C2Ds?

    Looking at TFS, the benchmark is their own tomography code, which takes a couple of hours instead of weeks.
  2. Tomography by ProfessionalCookie · · Score: 4, Informative
    noun a technique for displaying a representation of a cross section through a human body or other solid object using X-rays or ultrasound.


    In other news Graphics cards are good at . . . graphics.

    1. Re:Tomography by imrehg · · Score: 2, Informative

      In other news Graphics cards are good at . . . graphics.

      It's not the graphics part that makes it so computer-intensive.... All the mathematics behind it, once that's done, the presentation could be done on any ol' computer....

      So, if you mean by "graphics" that they are good at difficult geometrical calculations (like in games, for example), then you are right.... because that's what it is, a truck-load of geometry...

      From Wikipedia:
      Tomography: "[...] Digital geometry processing is used to generate a three-dimensional image of the inside of an object from a large series of two-dimensional X-ray images taken around a single axis of rotation."

    2. Re:Tomography by jergh · · Score: 2, Informative

      Sounds like a simple 3D to 2D projection issue, no?

      Not really simple. The input data is "raw" in the sense that it contains lots of artifacts from the acquisition which have to be removed during reconstruction.
  3. Re:By what benchmark? by 77Punker · · Score: 4, Informative

    By what benchmark is eight of the NVIDIA GPUs in the 9800 GX2 more powerful than 300 2.4 GHz C2Ds?

    By any SIMD problem. For reference, fire up a game that's capable of using a software renderer and do some sort of benchmark, then use the 3D hardware on the same benchmark. That's the difference between SIMD on hardware that is designed to do SIMD and SIMD on hardware that's designed to do everything (or in the case of the Duo, multitasking).
  4. Re:By what benchmark? by hansraj · · Score: 5, Informative

    As far as my understanding goes, comparing a GPU's performance to a CPU's performance is very, very task dependent, and the comparison with 300 CPUs should not be taken to mean that an 8-GPU system is more powerful than a 300 Core 2 Duo system in general.

    If the application requires solving a small task many times over, and all of these tasks can be done in parallel, then using a GPU works great because a GPU has many cores, each of which can handle a simple routine. Also, the GPU is designed to spend very little time on the way code is handled (load, switch etc.) and more time actually running the code (hence the requirement of only very simple functions).

    Such problems frequently arise in tomography, physics, astronomy etc and I hear GPUs are a great success in these areas. But don't hold your breath for running your favorite distro blazingly fast using GPUs.

  5. Re:This is awesome! by livingboy · · Score: 2, Informative

    Instead of dice you could use KCALC: 1 EUR is about 1.55 USD, so instead of 20000 it cost only about 6200 USD.

  6. Re:Why haven't they started releasing GPU CPUs yet by 77Punker · · Score: 4, Informative

    It is possible to solve non-graphics problems on graphics cards nowadays, but the hardware is still very specialized. You don't want the GPU to run your OS or your web browser or any of that; when a SIMD (single instruction, multiple data) problem arises, a decent computer scientist should recognize it and use the tools he has available.
    Also, this stuff isn't as mature as normal C programming, so issues that don't always exist in software that's distributed to the general public will crop up because not everyone's video card will support everything that's going on in the program.

  7. Not a Supercomputer -- Special purpose hardware by gweihir · · Score: 2, Informative

    It is also not difficult to find other tasks where, e.g., FPGAs perform vastly better than general-purpose CPUs. That does not make an FPGA a "Supercomputer". Stop the BS, please.

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  8. Re:By what benchmark? by 77Punker · · Score: 4, Informative

    The GPUs are better at floating point than integer; if I remember correctly it takes 4 cycles on current GPUs to do a float operation, but it takes 16 to do an int. No, I don't understand why.

    Also, the "multiply" and "add" instructions exist in a "madd" opcode which essentially doubles the theoretical floating point performance, even if you don't use "madd" very often.

  9. Re:This is awesome! by krilli · · Score: 2, Informative
    --
    Jag pratar lite svenska.
  10. Wave of the Future? Yes by bockelboy · · Score: 5, Informative

    Wave of the Future? Yes*. Revolution in computing? Not quite.

    The GPGPU scheme is, after all, a re-invention of the vector processing of old. Vector processors died out, however, because there were too few users to support. Now that there's a commercially viable reason to make these processors (PS3 and video games), they are interesting again.

    The researchers took a specialized piece of hardware, rewrote their code for it, and found it was faster than their original code on generic hardware. The problems here are that you have to rewrite your code (High Energy Physics codebases are about a GB, compiled... other sciences are similar) and you have to have a problem which will run well on this scheme. Have a discrete problem? Too bad. Have a gigantic, tightly coupled problem which requires lots of inter-GPU communication? Too bad.

    Have a tomography problem which requires only 1GB of RAM? Here you go...

    The standard supercomputer isn't going away for a long, long time. Now, as before, a one-size-fits-all approach is silly. You'll start to see sites complement their clusters and large-SMP machines with GPU power as scientists start to understand and take advantage of them. Just remember, there are 10-20 years of legacy code which will need to be ported... it's going to be a slow process.

  11. Re:Wave of the Future? Yes by 77Punker · · Score: 4, Informative

    Fortunately, Nvidia provides a CUDA version of the basic linear algebra subprograms, so even if your software is hard to port, you can speed it up considerably if it does some big matrix operations, which can easily take a long time on a CPU.

  12. Re:Re-birth of Amiga? by Quarters · · Score: 5, Informative
    The Amiga design was, essentially, dedicated chips for dedicated tasks. The CPU was a Motorola 68XXX chip. Agnus handled RAM access requests from the CPU and the other custom chips. Denise handled video operations. Paula handled audio. This cpu + coprocessor setup is roughly analogous to a modern X86 PC with a CPU, northbridge chip, GPU, and dedicated audio chip. At the time the Amiga's design was revolutionary because PCs and Macs were using a single CPU to handle all operations. Both Macs and PCs have come a long way since then. 'Modern' PCs have had the "Amiga design" since about the time the AGP bus became prevalent.

    nVidia's CUDA framework for performing general purpose operations on a GPU is something totally different. I don't think the Amiga custom chips could be repurposed in such a fashion.

  13. Re:By what benchmark? by Calinous · · Score: 4, Informative

    Because floating point operations go on a dedicated path, while integer operations do not have a dedicated integer-only path.
    Also, it's possible that loading floating point operands and storing results in actual code can be pipelined, while integer operations are not pipelined.
    (and yes, I don't know what I'm talking about)

  14. Re:I guess... by lukas84 · · Score: 2, Informative

    Look at the IBM x3850 M2.

  15. Re:Limited Application by Calinous · · Score: 4, Informative

    Even more: if you don't optimize the code specifically for the GPU-based supercomputer, your performance goes down the drain. I wouldn't be surprised if they obtained a speedup of an order of magnitude or more from the aggressive code optimisation.
    The idea is: the original code would run faster on an 8 Core2Duo machine than on the 8 GPUs. Further optimising of the code will do little for the Core2Duos, due to limited memory bandwidth, FSB bandwidth, and so on.
    Meanwhile, an optimised pipelined system (load, compute, store) on the GPU benefits greatly from the huge bandwidth (50GB/s on current systems), the huge number of computation units (128 or more) and so on.

  16. Re:By what benchmark? by kipman725 · · Score: 2, Informative

    Or hundreds of FPGAs can do a task that, even with supercomputing resources, was previously considered so time-consuming as to be worthwhile only for groups like the NSA: http://www.copacobana.org/index.html (the EFF had a similar custom-chip device several years before, but that cost >$250K)

  17. Re:By what benchmark? by TheThiefMaster · · Score: 3, Informative

    The 9800 GX2's GPUs have 128 1.5GHz "shader processors". 8 of these is like having 1024 vector-processing-specialised processor cores at your command.

    I could easily believe that it performed comparably to 300 2.4GHz Core 2 Duos (aka 600 "over 1.5x faster but not vector-specialised" cores).

    Theoretical performance is 576 GFLOPS per 9800 GX2 GPU (4.608 TFLOPS total) vs 19.2 GFLOPS per Core 2 CPU (5.760 TFLOPS total). However, in tests the Core 2 gets as low as 6 GFLOPS instead of its 19 theoretical, and the 9800 GPU gets a lot closer to its full power.
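
    The totals quoted above can be checked with a few lines of arithmetic. This is only a sketch: the per-device figures are the ones from the comment, while the "3 flops per shader processor per cycle" factor is an assumption added here to show one way the 576 GFLOPS figure could be reached from 128 processors at 1.5 GHz.

```python
# Figures quoted in the comment above.
GPU_COUNT = 8
GPU_PEAK_GFLOPS = 576.0
CPU_COUNT = 300
CPU_PEAK_GFLOPS = 19.2
CPU_MEASURED_GFLOPS = 6.0  # "as low as 6 GFLOPS" in tests

# Assumption (not stated in the comment): each shader processor
# retires 3 floating point operations per cycle.
SHADER_PROCESSORS = 128
SHADER_CLOCK_GHZ = 1.5
ASSUMED_FLOPS_PER_CYCLE = 3


def total_tflops(device_count, gflops_each):
    """Aggregate throughput of identical devices, in TFLOPS."""
    return device_count * gflops_each / 1000.0


# Per-GPU peak rebuilt from processor count and clock (assumed factor).
gpu_peak_check = SHADER_PROCESSORS * SHADER_CLOCK_GHZ * ASSUMED_FLOPS_PER_CYCLE

gpu_total = total_tflops(GPU_COUNT, GPU_PEAK_GFLOPS)
cpu_total_peak = total_tflops(CPU_COUNT, CPU_PEAK_GFLOPS)
cpu_total_measured = total_tflops(CPU_COUNT, CPU_MEASURED_GFLOPS)

if __name__ == "__main__":
    print(f"8 GPUs, theoretical:        {gpu_total:.3f} TFLOPS")
    print(f"300 CPUs, theoretical:      {cpu_total_peak:.3f} TFLOPS")
    print(f"300 CPUs, at 6 GFLOPS each: {cpu_total_measured:.3f} TFLOPS")
```

    On paper the 300 CPUs come out ahead; it is only when the low measured CPU figure is used that the 8 GPUs pull clearly in front, which is the point of the comparison.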

  18. Re:By what benchmark? by AlecC · · Score: 4, Informative

    It takes 4 cycles to do a floating point operation, and 4 cycles to do an integer add/subtract. It takes 16 cycles to do an integer multiply because it only has a 24-bit hardware multiplier (needed to achieve the 4-cycle flops), so it has to do long multiplies as four madds. This was for the first generation CUDA GPUs; the second generation, which should be out by now, was going to have double length floating point and would be able to do 32 bit multiplies in the same four cycles.
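
    To see why a narrow multiplier turns one long multiply into four multiply-adds, here is a rough sketch of the schoolbook decomposition. Splitting the operands into 16-bit halves is an assumption made for illustration (each half fits comfortably in a 24-bit multiplier); the real hardware may well split the operands differently, and the function name is made up.

```python
MASK16 = 0xFFFF
MAX32 = 0xFFFFFFFF


def mul32_from_narrow_products(a, b):
    """Multiply two unsigned 32-bit integers using only 16x16-bit products.

    Illustrative only: four narrow multiplies, combined with shifts and adds,
    give the full 64-bit product.
    """
    if not (0 <= a <= MAX32 and 0 <= b <= MAX32):
        raise ValueError("operands must be unsigned 32-bit integers")

    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16

    # Four partial products, each of two 16-bit values.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Shift each partial product into place and accumulate.
    return lo_lo + ((lo_hi + hi_lo) << 16) + (hi_hi << 32)


if __name__ == "__main__":
    x, y = 123456789, 987654321
    print(mul32_from_narrow_products(x, y) == x * y)
```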

    While they can do integer, these machines are not very happy with it, and I found it much easier to do everything in floating point, even if you are talking about 8-bit colour data. It goes no slower, and everything is much better adapted to floating point. Then there are special instructions to get back to integer at the output.

    While each operation takes 4 cycles, they are fully pipelined, so that it launches a new instruction per cycle, times 32 pipes per unit, times 8 units per GPU.

    And madd is very useful for the sort of tasks for which supercomputers are traditionally used.

    --
    Consciousness is an illusion caused by an excess of self consciousness.
  19. Re:Re-birth of Amiga? by Anonymous Coward · · Score: 1, Informative

    GPUs essentially act like field programmable gate arrays. On a CPU, to perform a typical mathematical transformation, you would write the mathematical algorithm, wrap it in a loop (such as a for or while loop) and iterate the loop once for each element in a large array. On a GPU, you define the mathematical algorithm, point the GPU at the array, and tell it to go. It applies the algorithm to every element in the array without the code overhead of the loop, and does so on as many elements simultaneously as it has pipelines, given the constraint that the algorithm cannot be recursive (dependent upon the value of another member of the array).
    If your problem fits the non-recursion constraint, GPUs are going to kick ass all over normal CPUs. Most general programming problems do not fit that constraint. Most production scientific mathematical problems do.
    So no, it's nothing like an Amiga.
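
    The difference between the two styles described above can be sketched in plain Python. Nothing here runs on a GPU: `launch` is a made-up stand-in for a GPU launch, and `scale_and_offset` is an arbitrary example transformation. The point is only that the "kernel" sees one element at a time and never reads another output element, so the order in which indices are processed does not matter.

```python
def scale_and_offset(x):
    """Arbitrary per-element transformation used for the illustration."""
    return 2.0 * x + 1.0


def cpu_style(data):
    """CPU style: the algorithm wrapped in an explicit loop over the array."""
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = scale_and_offset(data[i])
    return out


def kernel(i, src, dst):
    """GPU style: the work for exactly one element, identified by index i."""
    dst[i] = scale_and_offset(src[i])


def launch(kernel_fn, indices, *args):
    """Hypothetical stand-in for a GPU launch.

    Real hardware would run many indices at once; here they run one by one,
    in whatever order they are given.
    """
    for i in indices:
        kernel_fn(i, *args)


def gpu_style(data, reverse=False):
    """Apply the kernel to every element, optionally in reverse order."""
    out = [0.0] * len(data)
    indices = range(len(data))
    launch(kernel, reversed(indices) if reverse else indices, data, out)
    return out


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0]
    print(cpu_style(values) == gpu_style(values) == gpu_style(values, reverse=True))
```

    A running total, where each output depends on the previous output, would break this: the result would change with the processing order, which is the kind of dependency the comment rules out.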

  20. CUDA C programming environment is the key by Anonymous Coward · · Score: 1, Informative

    The key to why they were able to do this is the CUDA C programming environment:
    http://www.nvidia.com/cuda

  21. Define: which is better? by mcrbids · · Score: 4, Informative

    Which is faster? A Lamborghini or a 5-ton flatbed truck?

    Depends on what you're after! If you are trying to get yourself from point A to point B, the Lamborghini is the obvious choice. But if you need to move 4.5 tons of stuff from point A to point B, the Lamborghini would suck ass when compared to the flatbed truck.

    It's just a question of what you are trying to accomplish. There is no absolute framework for "power" to solve problems, even if you define it fairly narrowly. For example, let's talk about 'pattern matching': A free database (like PostgreSQL) on cheap hardware can search through millions of records to deliver a query result in a tenth of a second. In that respect, Postgres is WAY faster than, say, the human brain. But the human brain will KICK ASS over just about any other technology out there in deciding whether or not a particular image contains a cat.

    Use the right tool for the job, and you'll be amazed at the results. That 8 GPUs handily outperform 512 CPU cores at a specific task is not surprising - the GPUs are designed from the beginning to solve the kind of problem that's needed!

    Personally, I'm surprised as to why there hasn't been more development behind the FPGA: are they just expensive?

    --
    I have no problem with your religion until you decide it's reason to deprive others of the truth.
  22. Re:By what benchmark? by mikael · · Score: 2, Informative

    Because a floating point number really consists of three parts (a large 23-bit mantissa, a smaller 8-bit exponent, and a single sign bit), performing a single arithmetic operation on two floating point numbers requires:

    1. Aligning the two mantissas so the exponents match
    2. Performing the operation
    3. Renormalizing the mantissa of new value so that it is in the range 1.0 to less than 2.0
    4. Saving the result to the destination register

    Each of these stages would probably take one read/write cycle.
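
    The four stages above can be sketched in software for the simplest case, adding two positive numbers. This is a toy model under stated assumptions, not how any particular GPU does it: the mantissa is held as a 24-bit integer (the 23 stored bits plus the implicit leading one), low bits lost during alignment are simply truncated, and the sign bit, zero, and special values are ignored. The function names are made up.

```python
import math

MANTISSA_BITS = 24  # 23 stored bits plus the implicit leading one


def encode(x):
    """Split a positive float into (integer mantissa, exponent).

    The value represented is mantissa * 2**exponent, with the mantissa
    normalised to exactly MANTISSA_BITS bits.
    """
    if x <= 0.0:
        raise ValueError("this sketch only handles positive numbers")
    fraction, exponent = math.frexp(x)  # x == fraction * 2**exponent
    mantissa = int(fraction * (1 << MANTISSA_BITS))
    return mantissa, exponent - MANTISSA_BITS


def decode(number):
    """Turn (mantissa, exponent) back into a Python float."""
    mantissa, exponent = number
    return math.ldexp(mantissa, exponent)


def add(a, b):
    """Add two encoded positive numbers, following the four stages."""
    (m1, e1), (m2, e2) = a, b
    if e1 < e2:
        (m1, e1), (m2, e2) = (m2, e2), (m1, e1)

    # 1. Align the mantissas so the exponents match (truncating low bits).
    m2 >>= e1 - e2

    # 2. Perform the operation.
    mantissa, exponent = m1 + m2, e1

    # 3. Renormalise so the mantissa fits in MANTISSA_BITS bits again.
    while mantissa >= (1 << MANTISSA_BITS):
        mantissa >>= 1
        exponent += 1

    # 4. "Save" the result.
    return mantissa, exponent


if __name__ == "__main__":
    print(decode(add(encode(1.5), encode(2.25))))
```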

    Performing an integer operation requires a shift-and-add sequence for multiplication, and a shift-compare-and-conditional-subtract for division.

    Previous GPUs just stored the integer as the mantissa of floating point registers. But as integers are now represented separately as 32-bit values, they will be processed by a different hardware unit. Maybe they have two barrel shift registers working in parallel so that only 16 cycles are required.

    --
    Vintage computer adverts: http://www.vintageadbrowser.com/computers-and-software-ads
  23. Re:By what benchmark? by cheier · · Score: 3, Informative

    I haven't done much of the development of the software myself. We have a developer we hired to work with CUDA. From what I've found, the documentation that is available on NVIDIA's site for CUDA is excellent and their developers are active on their forums.

    For VMD, it was necessary to have 1 CPU core per GPU. We tested 6 GPUs with 4 cores and we could only spawn 4 threads for GPU processing. The guys at Evolved Machines told me they can use multiple GPUs off of a single core. If so, I have no idea how. NVIDIA even tells me 1 core per GPU, so that is the gospel I'm following. Acceleware even uses 2 cores per GPU for some of their stuff, but they have their own libraries outside CUDA for GPU stuff, so who knows.

    I haven't come across any books on CUDA other than the support manuals, but since it isn't a very mature API, it is only a matter of time.