Slashdot Mirror


NVIDIA's $10K Tesla GPU-Based Personal Supercomputer

gupg writes "NVIDIA announced a new category of supercomputers — the Tesla Personal Supercomputer — a 4 TeraFLOPS desktop for under $10,000. This desktop machine has 4 of the Tesla C1060 computing processors. These GPUs have no graphics out and are used only for computing. Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision and about 80 GigaFLOPS double-precision floating point performance. The CPU + GPU is programmed using C with added keywords using a parallel programming model called CUDA. The CUDA C compiler/development toolchain is free to download. There are tons of applications ported to CUDA including Mathematica, LabView, ANSYS Mechanical, and tons of scientific codes from molecular dynamics, quantum chemistry, and electromagnetics; they're listed on CUDA Zone."

60 of 236 comments

  1. Graphics by Anonymous Coward · · Score: 5, Funny

    Wow, that's some serious computing power! I wonder if anyone has thought of using these for graphics or rendering? I imagine they could make some killer games, especially with advanced technology like Direct 3D.

    1. Re:Graphics by GigaplexNZ · · Score: 2, Funny

      I wonder if anyone has thought of using these for graphics or rendering?

      These are effectively just NVIDIA GT280 chips with the ports removed. Their heritage is gaming.

      I imagine they could make some killer games

      If you can find some way to get the video out to a monitor... but then you effectively just have Quad SLI GT280.

      especially with advanced technology like Direct 3D

      Uh... what? Direct 3D has been commonly used for years, you make it sound like some new and exotic technology. It is also effectively Windows only, whereas this hardware is more likely to use something like Linux.

    2. Re:Graphics by Gnavpot · · Score: 4, Funny

      "I wonder if anyone has thought of using these for graphics or rendering?"

      These are effectively just NVIDIA GT280 chips with the ports removed. Their heritage is gaming.

      We need a "+1 Whoosh" moderation option.

      No, I do not mean "-1 Whoosh". I want to see those embarrassingly stupid postings. But perhaps this moderation option should subtract karma.

    3. Re:Graphics by GigaplexNZ · · Score: 4, Funny

      I suppose I'm one of those guys now. Hook, line and sinker.

    4. Re:Graphics by evilbessie · · Score: 2, Informative

      In much the same way, the current Quadro FX cards are based on the same chips as the gaming GeForce cards. Yet the most expensive gaming card is ~£400, while you'll pay ~£1500 for the top-of-the-line FX5700.

      It's because workstation graphics cards are configured for accuracy above all else, whereas gaming cards are configured for speed. A few wrong pixels do not affect gaming at all, but getting the numbers wrong in a simulation is going to cause problems.

      Mostly the people who use these cards care about OpenGL support, but some people do use them under Windows and DirectX.

      This type of computing came in with the GeForce 8 range, when CUDA (Compute Unified Device Architecture) brought C programming to the massively parallel graphics chips. That has allowed nVidia to port the Ageia PhysX technology to the GeForce cards, so a separate add-in card is not necessary.

      I believe that ATi are doing something similar with their FireGL cards, which again are based on the same chip as their Radeon cards. This is why they have both moved from Shader/Vertex to Unified Stream processors. This is a really interesting development if you happen to work in a research establishment, otherwise please move along nothing to see here.

    5. Re:Graphics by xonar · · Score: 2, Interesting

      So being naive to the ways of the world is bad karma now? I thought Buddhism stressed being free from the material things of the world.

  2. Heartening... by blind+biker · · Score: 2, Interesting

    ...to see a company established in a certain market, to branch out so aggressively and boldly into something... well, completely new, really.

    Does anyone know if Comsol Multiphysics can be ported to CUDA?

    --
    "The agriculture ministry is not in charge of Gundam" - Japanese ministry official.
    1. Re:Heartening... by mangu · · Score: 4, Interesting

      Can you imagine a Beowulf cluster of these?

      Yes, I can. My first thought when I saw the article was to calculate how many of them one would need to simulate a human brain in real time. The answer is: with 2500 of these machines one could simulate a hundred billion neurons with a thousand synapses each, firing a hundred times per second, which is the approximate capacity of a human brain.

      People have paid $20 million to visit the space station, now who will be the first millionaire hobbyist to pay $25 million to have his own simulated human brain?

    2. Re:Heartening... by swamp_ig · · Score: 3, Interesting

      Would the interconnects be fast enough? There's a lot of non-locality in the synaptic connections, so you're going to need some pretty heavy comms between the cores.

      Also a selection of neurons are far more heavily connected than 1000s of synapses, and they're fairly essential ones. Might these be a critical path?

      Sure would be cool to build such a beast, do some random connections, and see what happens...

    3. Re:Heartening... by smallfries · · Score: 4, Interesting

      Your figures are off by several orders of magnitude. 2500 of these is roughly 10,000 TFLOPS. As a TFLOP is 10^12 operations per second, and we have 10^11 neurons, that leaves 10^5 floating point operations per neuron per second. If each has 1000 synapses to process, then we are down to 100 operations per connection, per second.
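For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The inputs are the figures quoted in this thread (2500 machines, 4 TFLOPS each, 10^11 neurons, 1000 synapses per neuron, 100 firings per second); the variable names are only illustrative.

```python
# Back-of-the-envelope budget using the figures quoted in this thread.
MACHINES = 2500                    # proposed cluster size
FLOPS_PER_MACHINE = 4 * 10**12     # 4 TFLOPS single precision per machine
NEURONS = 10**11                   # a hundred billion neurons
SYNAPSES_PER_NEURON = 10**3        # a thousand synapses each
FIRINGS_PER_SECOND = 100           # a hundred firings per second

total_flops = MACHINES * FLOPS_PER_MACHINE                      # 10**16
flops_per_neuron = total_flops // NEURONS                       # 10**5 per second
flops_per_synapse = flops_per_neuron // SYNAPSES_PER_NEURON     # 100 per second
flops_per_synaptic_event = flops_per_synapse // FIRINGS_PER_SECOND  # 1

print("total FLOPS:            ", total_flops)
print("per neuron per second:  ", flops_per_neuron)
print("per synapse per second: ", flops_per_synapse)
print("per synaptic event:     ", flops_per_synaptic_event)
```

At one floating point operation per synaptic event, there is no headroom for anything beyond the simplest neuron model.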

      At this point it seems obvious that you've assumed a really simplistic model of a neuron that can compute a synaptic value in a single floating point operation. These simple neuron models don't behave like a real brain, and scaling up simulations of them doesn't produce anything interesting. Real neurons are capable of computing much more complex functions than these models. The throughput on the interconnect is going to be a major factor, and simulating each neuron will require from 10s to 1000000s of operations depending on the level of biological realism that is required. The Blue Brain project has a lot of interesting material on different models of the neuron and the tradeoff between performance and realism.

      Their end goal is to dedicate a large IBM Blue Gene to simulating an entire column within the brain (roughly 1,000,000 neurons) using a biologically-realistic model.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
    4. Re:Heartening... by LeDopore · · Score: 5, Informative

      You're right unless there's a computational way to take advantage of the fact that most neurons in cortex pretty much never fire (1), and that a small minority of synapses are responsible for nearly all of the excitation in a slab of cortical tissue (2). If not active == not important == not necessary to simulate with a 100% duty cycle (these are big "ifs"), then we could be literally about 3-5 orders of magnitude closer to being able to simulate whole brains than anyone realizes.

      (1) How silent is the brain: is there a "dark matter" problem in neuroscience? Shy Shoham, Daniel H. O'Connor, Ronen Segev. J Comp Physiol A (2006)

      (2) Highly Nonrandom Features of Synaptic Connectivity in Local Cortical Circuits. Sen Song, Per Jesper Sjöström, Markus Reigl, Sacha Nelson, Dmitri B. Chklovskii. PLoS Biology, March 2005

      --
      Expected time to finish is 1 hour and 60 minutes.
    5. Re:Heartening... by HiThere · · Score: 2, Interesting

      I think your post was intended humorously, but I'm going to pretend otherwise. (Note, I'm not a specialist in computational mentalistics, or whatever the field would be called, but:)

      I'm fairly certain the interconnects are fast enough. The brain is no speed demon on individual connections. It's basically chemical, with only a little electrical stuff on top that's still based on ions floating in liquid.

      The problem is the software. And the sensoria. And the effectors.

      Each of those problems is being addressed separately. What do you want to bet that when they all come to "good enough" solutions, interfacing them is going to be a MASSIVE kludge.

      And even if you could, you can't just copy how people did it. A camera is basically different from a retina. It extracts different information. You can use complex processing to convert one into a simulation of the other, but there's no straightforward mapping. Each conversion involves loss of information...so you need to ensure that the correct information is lost.

      Just as a silly example of the difference, a recent experimental hearing aid uses infra-red lasers to stimulate the nerves in the cochlea. You KNOW that people use electric signals, but artificially generated electrical signals spread too much in the interface, so you can't get decent tone resolution. With infra-red lasers, though, you can stimulate any particular neuron you choose.

      Guaranteed: random connections will give you a crashed program. Secondary chance is an infinite loop.

      Mind you, there are neural nets that are initialized with random initial values, but they have strict boundary conditions. Otherwise you never get better than garbage out of them.

      Also: There are lots of groups of neurons that are more highly connected than average. These are "functional specialists". There often isn't anything special about the neurons, but only about the way that their connections have been reinforced. I'm not sure about the neurons that branch outside of the column, but I suspect the same of them.

      My projection for a human mind equivalent computer remains at around 2020-2030. This announcement drops my estimate of the cost, but that was never an exact number of dollars, so I can't quantify it. Also note that I said equivalent. I'm not going to assert that it would enjoy watching Star Wars, or even 2001. Its emotions are unlikely to be similar in nature to those of a mammal...unless that's necessary in order to understand human language...and only to the extent necessary.

      For that matter, we wouldn't WANT it to have the same emotional structure that we have. That would be very dangerous. If we did that then it might have "take over the world!" as an innate goal, rather than as a tactical move. Even as a tactical move it's rather dangerous, so we would probably want to so design its goals that such a tactical move would appear extremely distasteful, and best accomplished by manipulating willing proxies. (This would ensure that there was room for people where people would be comfortable.)

      OTOH, I don't see a human mind equivalent AI as remaining merely human equivalent. Progress rarely stops. But if its motivational structure is so designed that there's plenty of comfortable room for people, I don't see this as a problem. Entities rarely want to alter their motivational structure unless it's giving them severe problems, and often not then. But don't expect it to be passive or a mere recipient of orders. It would, however, be reasonable to expect it to be a lot more considerate of human needs and desires than the current bureaucracy...in any country. (Note that individual office holders may well be sympathetic, but the system itself isn't.)

      --

      I think we've pushed this "anyone can grow up to be president" thing too far.
  3. Re:Ooooooo! Ahhhh! by Surreal+Puppet · · Score: 2, Interesting

    Port john the ripper/aircrack-ng? Buy a few terabyte drives and start generating hash tables?

  4. 4 TFLOPS? by Anonymous Coward · · Score: 5, Insightful

    A single Radeon 4870x2 is 2.4 TFLOPS. Some supercomputer, that.

    Seriously, why is this even news? nVidia makes a product, which is OK, but nothing revolutionary. The devaluation of the "supercomputer" term is appalling.

    Also, how much of that 4 TFLOPS you can get on actual applications? How's FFT? Or LINPACK?

    1. Re:4 TFLOPS? by GigaplexNZ · · Score: 4, Informative

      A single Radeon 4870x2 is 2.4 TFLOPS.

      A single Radeon 4870x2 uses two chips. This Tesla thing uses 4 chips that are comparable to the Radeon ones. It should be obvious that they would be in a similar ballpark.

      Seriously, why is this even news?

      It isn't. Tesla was released a while ago, this is just a slashvertisement.

    2. Re:4 TFLOPS? by hairyfeet · · Score: 3, Interesting

      The problem is how you actually define a supercomputer. I mean, do only machines released in the past month count? Or do you still count the original bad boys like the Cray? After all, when first built, most Crays were multi-million-dollar number crunching beasts. Does the fact that you can get the same performance in a desktop now mean the Cray no longer counts? The power of computers is still growing at such a pace that a machine that cost millions a decade ago can probably be beaten by a cluster that would cost you less than 25K today, so how exactly would you suggest they define supercomputer?

      --
      ACs don't waste your time replying, your posts are never seen by me.
    3. Re:4 TFLOPS? by X-acious · · Score: 2, Interesting

      A single Radeon 4870x2 uses two chips

      2.4 / 2 = 1.2

      Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision...

      Each Radeon HD 4870 produces 1.2 TFLOPS, about 0.2 more than one Tesla GPU.

      "NVIDIA announced...the Tesla Personal Supercomputer -- a 4 TeraFLOPS desktop...

      Two 4870 X2s equal 4.8 TFLOPS, 0.8 more than four Tesla GPUs.

      I think the parent's point was that even when an HD Radeon 4870 X2 is made up of two cards they're still connected and recognized as one. Thus, with "fewer" cards and fewer slots you could achieve more performance. Or you could use the other two vacant slots for yet another two 4870s: Four of them in crossfire would equal 9.6 TFLOPS, 5.6 more than four Tesla GPUs.

      Furthermore, I would assume two GPUs that are closely interconnected as a "single" card (4870 X2) would be better than a pair of GPUs connected through a combination of the motherboard (x2 Tesla GPU) and custom interconnects.

      I'm not implying that an HD 4870 is a viable alternative to a Tesla GPU but the "performance" is more than just comparable. As it's been mentioned before, the hardware concerned is meant for precision and not speed, otherwise known as performance. Then again, you could compensate for inaccuracy by using all that computing power to make multiple passes rather than making sure your initial calculations were accurate.

      Note: Emphasis by me in all quotes provided.

  5. Re:But.. by itsybitsy · · Score: 2, Funny

    Not yet.... darn NVidia, no Vista Drivers yet...

    Come on NVidia GET WITH IT!!!

  6. What, no coil? by dgun · · Score: 5, Funny

    What a rip.

    --
    FAQs are evil.
    1. Re:What, no coil? by geekmux · · Score: 3, Funny

      What a rip.

      Yeah, no shit. First bastard that tries to put a "Tesla Capable" sticker on the front, I'm gonna sue.

  7. What a disappointment by dleigh · · Score: 2, Interesting

    At first glance I thought these used actual Tesla coils in the processor, or the devices were at least powered or cooled by some apparatus that used Tesla coils.

    Turns out "Tesla" is just the name of the product.

    Drat. I demand a refund.

  8. Binary-only toolchain by Anonymous Coward · · Score: 5, Informative

    The toolchain is binary only and has an EULA that prohibits reverse engineering.

    1. Re:Binary-only toolchain by FireFury03 · · Score: 5, Informative

      has an EULA that prohibits reverse engineering.

      Not really a big deal to those of us in the EU since we have a legally guaranteed right to reverse engineer stuff for interoperability purposes.

  9. And the worst timing ever award goes to... by CryptoJones · · Score: 2, Insightful

    While the inner nerd in me screams to take out a loan against my house to buy one, I can't imagine this being very popular outside academia. Most users don't use the power of their crappy computers, let alone this. And then there is the whole "ECONOMY" thing.

    --
    "Chance favors the prepared mind." ~Me
    1. Re:And the worst timing ever award goes to... by Yetihehe · · Score: 2, Insightful

      It IS marketed for academia. Normal users don't really need to fold proteins or simulate nuclear weapons at home.

      --
      Extreme Programming - Redundant Array of Inexpensive Developers
    2. Re:And the worst timing ever award goes to... by palegray.net · · Score: 2, Informative

      I'm perfectly normal, and I fold proteins all the time.

    3. Re:And the worst timing ever award goes to... by Anonymous Coward · · Score: 2, Interesting

      according to http://folding.stanford.edu/English/Stats about 250.000 "normal" users are folding proteins at home.

      Personally, I would use it as a render farm, but Blender compatibility could take a while if Nvidia keeps the drivers and specification locked up.

      What they don't seem to mention is the amount of memory/core (at 960 cores). I'd guess about 32 MB/core, and 240 cores sharing the same memory bus...

  10. Let me be the first to say... by rdnetto · · Score: 5, Funny

    4 Terraflops should be more than enough for anybody...

    --
    Most human behaviour can be explained in terms of identity.
  11. Scientist speak by jnnnnn · · Score: 2, Interesting

    So many scientists use the word "codes" when they mean "program(s)".

    Why is this?

    1. Re:Scientist speak by Anonymous Coward · · Score: 3, Interesting

      It's cultural.

      You're not even allowed to say that you're "coding", but only that you produce "codes".

      Maybe it's because analytic science is based on equations which become algorithms in computing, and you can't say that you're "equationing" nor "algorithming".

      In practice it's actually dishonest, because the algorithms don't have the conceptual power of the equations that they represent (they would if programmed in LISP, but "codes" are mostly written in Fortran and C), so the computations are often questionable. Even worse, it's almost impossible for one research group to compare the "codes" that yielded their results against those produced by another group when numerical computing is used, whereas equations are universally portable.

      The theoretical half of the scientific method has lost some of the firm foundations upon which it used to build in recent years, as a result of theorizing through numerical simulation. Fortunately it doesn't matter too much in most sciences because experiment soon demolishes any incorrect predictions. However, those sciences which deal with long-term or historic or otherwise untestable areas are suffering, as a fair bit of unsubstantiated nonsense is popping out of poorly approximated simulations and being claimed as "fact", even though reality hasn't agreed yet.

      Things are probably going to get worse in this area before they get better.

  12. Comment removed by account_deleted · · Score: 4, Funny

    Comment removed based on user account deletion

  13. Yes but by Colin+Smith · · Score: 2, Funny

    And then there is the whole "ECONOMY" thing.

    The whole reason the ECONOMY is in the tank is because there are not enough people like you taking loans out against their house to buy random stuff like this.

    Basically... IT'S ALL YOUR FAULT!

     

    --
    Deleted
  14. weak DP performance by Henriok · · Score: 5, Informative

    In supercomputing circles (i.e. Top500.org), double precision floating point operations seem to be what is desired. 4 TFLOPS single precision, while impressive, is overshadowed by the comparatively weak 80 GFLOPS double precision, beaten by a single PowerXCell 8i (successor to the Cell in the PS3) or the latest crop of Xeons. I'm sure Tesla will find its users, but we won't see them on the Top500 list anytime soon.

    --

    - Henrik

    - when the Shadows descend -
  15. FTFL by mangu · · Score: 2, Informative

    now what the heck to do with it...

    All you need to do is follow the fscking link. Plenty of examples there.

    1. Re:FTFL by SmokeyTheBalrog · · Score: 2, Funny

      Once CUDA has deep consumer penetration the 3D CGI furry anime loli porn will come! In droves if not herds.

      Oh crap, I forgot to click Post AC.

  16. boring apps... let's have some realtime raytracing by Lazy+Jones · · Score: 3, Insightful

    there were a lot of early efforts trying to implement realtime raytracing engines for games (e.g. at Intel recently), let's port that stuff and have some fun.

    --
    "I love my job, but I hate talking to people like you" (Freddie Mercury)
  17. Weird options by mangu · · Score: 3, Insightful

    I went to the site and tried to configure one. The disk partition options are: "General Purpose, Internet Server, Developer's Workstation, File Server". I wonder, who needs three Tesla cards in a file server or an internet server?

  18. It also runs Python by mangu · · Score: 3, Informative

    Look, there's Python here. You can do the low-level high-performance core routines in C, and use Python to do all the OO programming. This is how God intended us to program.

    1. Re:It also runs Python by BOFHelsinki · · Score: 2, Funny

      Ah, Parseltongue. So you are of the Slytherin school of programmers?

    2. Re:It also runs Python by OriginalArlen · · Score: 3, Funny

      This is how God intended us to program.

      Then why did he write Perl?

      --

      Everything I needed to know about life, I learnt from Blake's Seven
  19. Erlang by Safiire+Arrowny · · Score: 2, Interesting

    So how do you get an Erlang system to run on this?

    1. Re:Erlang by eggnoglatte · · Score: 2, Insightful

      By writing an Erlang-to-CUDA compiler?

      More seriously though, it is probably not worth even trying, since the GPUs used in the Tesla support a very limited model of parallelism. Shoehorning the flexibility of Erlang into that would at the very least result in a dramatic performance loss, if it is possible at all.

  20. Re:Only in C? Oh dear. by xororand · · Score: 5, Informative

    OO is very good for graphical interfaces, but it isn't particularly well suited for algorithms and other maths oriented stuff.

    The term OO is too general to make a statement about its usefulness for mathematics oriented problems. The powerful templating features of modern C++ are indeed very useful for numerical simulations:

    It's called C++ Expression Templates, an excellent tool for numerical simulations. ETs can get you very close to the performance of hand optimized C code while they're much more comfortable to use than plain C. Parallelization is also relatively easy to achieve with expression templates.

    A research team at my university actually uses expression templates to build some sort of meta compiler which translates C++ ETs into CUDA code. They use it to numerically simulate laser diodes.

    Search for papers by David Vandevoorde & Todd Veldhuizen if you want to know more about this. They both developed the technique independently.

    Vandevoorde also explains ETs to some degree in his excellent book "C++ Templates - The Complete Guide".
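The core idea behind expression templates (operators build a description of the expression, and the work happens in one loop at assignment) can be sketched outside C++. The following is a hedged illustration in Python, not C++ and not the research code mentioned above: the class names `Expr`, `Vec` and `BinOp` are hypothetical, and Python does the deferral at run time, whereas real expression templates resolve it at compile time.

```python
class Expr:
    """Base class: arithmetic operators build a tree instead of computing."""

    def __add__(self, other):
        return BinOp(self, other, lambda a, b: a + b)

    def __mul__(self, other):
        return BinOp(self, other, lambda a, b: a * b)


class Vec(Expr):
    """A leaf of the expression tree that owns actual data."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def at(self, i):
        return self.data[i]

    def assign(self, expr):
        # One fused loop: every element is computed straight from the leaves,
        # so no temporary vector is built for sub-expressions such as b * c.
        self.data = [expr.at(i) for i in range(len(expr))]
        return self


class BinOp(Expr):
    """An unevaluated elementwise operation on two sub-expressions."""

    def __init__(self, left, right, fn):
        if len(left) != len(right):
            raise ValueError("operands must have the same length")
        self.left, self.right, self.fn = left, right, fn

    def __len__(self):
        return len(self.left)

    def at(self, i):
        return self.fn(self.left.at(i), self.right.at(i))


a, b, c = Vec([1, 2, 3]), Vec([4, 5, 6]), Vec([7, 8, 9])
lazy = a + b * c                 # builds a tree, computes nothing yet
out = Vec([0, 0, 0]).assign(lazy)
print(out.data)                  # [29, 42, 57]
```

The point of the design is that `a + b * c` costs nothing until `assign` walks the tree once per element.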

  21. Re:Only in C? Oh dear. by cnettel · · Score: 2, Informative

    OOP with virtual and all, yes. OOP with template magic to allow the compiler to do specializations can beat the heck out of even quite tediously hand-written C or FORTRAN, with much superior readability.

  22. And in other news... by bsDaemon · · Score: 5, Funny

    ... AMD has announced today its new Edison Personal Supercomputer technology.

    The game is on.

  23. cold hard facts about cuda by Gearoid_Murphy · · Score: 2, Interesting

    it's not about how many cores you have but how efficiently they can be used. If your CUDA application is in any way memory intensive, you're going to experience a serious drop in performance. A read from the local cache is 100 times faster than a read from main RAM. This cache is only 16 KB. I spend most of my time figuring out how to minimise data transfers. That said, CUDA is probably the only platform that offers a realistic means for a single machine to tackle problems requiring gargantuan computing resources.

    --
    prepare the survey weasels.
  24. Re:cold hard facts about cuda- unbalanced by anon+mouse-cow-aard · · Score: 4, Insightful

    People are always coming out of the woodwork to claim supercomputer performance with such and such a solution. Go back and look at GRAPE (which is really cool): http://arstechnica.com/news.ars/post/20061212-8408.html or a lot of other supercomputer clusters. When you want something flexible, you look for "balance": a good relationship between memory capacity, latency and bandwidth, as well as compute power. In terms of memory capacity, the number people talk about is 1 byte/flop; that is, 1 Tbyte of memory is about right to keep 1 TFLOP flexibly useful. This thing has 4 G of memory for 4 TF; in other words, 1 byte per 1000 flops. It's going to be hard to use in a general-purpose way.
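The balance figure is easy to reproduce. Here is a minimal sketch in Python using the numbers from this comment; the helper name `bytes_per_flop` is just illustrative. The second case assumes, as another comment in this thread states, that the 4 GiB is per card rather than per machine.

```python
def bytes_per_flop(memory_bytes, flops):
    """Memory capacity available per floating point operation per second."""
    return memory_bytes / flops

# Rule of thumb from the comment: 1 Tbyte of memory per 1 TFLOP.
rule_of_thumb = bytes_per_flop(1e12, 1e12)

# Figures as quoted in the comment: 4 G of memory for 4 TF.
as_quoted = bytes_per_flop(4e9, 4e12)

# Assumption: 4 GiB on each of the four cards, still 4 TFLOPS in total.
per_card_memory = bytes_per_flop(4 * 4 * 2**30, 4e12)

print("rule of thumb:    ", rule_of_thumb, "bytes/flop")
print("as quoted:        ", as_quoted, "bytes/flop")
print("4 GiB on each card:", per_card_memory, "bytes/flop")
```

Either way the machine sits two to three orders of magnitude below the 1 byte/flop rule of thumb.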

  25. Re:Penguins' Got One Liquid Cooled! by BOFHelsinki · · Score: 2, Informative

    BTW, TFS makes a mistake calling this Tesla rig a supercomputer. Nvidia correctly just calls it a cluster replacement. A cluster is not a supercomputer, the interconnect makes all the difference, no matter how much FP crunching power there is. See NEC SX-9 or Cray's Seastar for a real supercomputer interconnect. Can't be arsed to check (this is Slashdot after all) but that Penguin Computing system likely has only InfiniBand or 10GbE for the switch network, making it "only" a cluster. :-)

  26. Nor turbine. by BOFHelsinki · · Score: 2, Interesting

    Shameless exploitation of the good name of one of the greatest inventors of all time. :-)

  27. Development Platform by dreamchaser · · Score: 2, Insightful

    On that note, it would be a good development platform for realtime raytraced game engines. That way the code would be mature when affordable GPUs come out that can match that level of performance.

  28. Re:FLOPS not FLOP! by TeknoHog · · Score: 4, Funny

    What's the plural of FLOPS then? My preciouss FLOPSes?

    --
    Escher was the first MC and Giger invented the HR department.
  29. Patmos International by Danzigism · · Score: 3, Interesting

    ahh yes the idea of personal supercomputing. Back in '99 I worked for Patmos International. We were at the Linux Expo for that year as well if some of you might remember. Our dream was to have a parallel supercomputer in everyone's home. We used mostly Lisp and Daisy for the programming aspect. The idea was wonderful, but eventually came to a screeching halt when nothing was being sold. It was ahead of its time for sure. You can find out a little more about it here. I find the whole idea of symbolic multiprocessing very fascinating though.

    --
    *plays the Apogee theme song music*
  30. You're probably right about the "mad scientist" ... by PolygamousRanchKid+ · · Score: 2, Insightful

    . . . that's probably exactly the person who would buy one of these.

    Folks who are professionally working on mainstream problems that require supercomputers, well, they probably have access to one already. (Maybe one of the supercomputing folks might want to chime in here; do you have enough access/time? Would a baby-supercomputer be useful to you?)

    But there is certainly someone out there who was denied access, because his idea was rejected by peer review. He is considered a loopy nut bag, because he wants to prove that the Higgs boson is made of cottage cheese, or something like that.

    Yep, look for rejected supercomputing program proposals, and you have a list of potential customers.

    --
    Schroedinger's Brexit: The UK is both in and out of the EU at the same time!
  31. Re:Can I have a smaller version? by SpinyNorman · · Score: 3, Informative

    From NVidia's CUDA site, most of their regular display cards support CUDA, just with fewer cores (hence less performance) than the Tesla card. The cores that CUDA uses are what used to be called the vertex shaders on your (NVidia) card. The CUDA API is designed so that your code doesn't know/specify how many cores are going to be used: you just code to the CUDA architecture and at runtime it distributes the workload to the available cores... so you can develop for a low end card (or they even have an emulator) then later pay for the hardware/performance you need.
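The "code doesn't know how many cores" point can be illustrated with a toy emulation. This is a sketch in plain Python, not the CUDA API: `launch` and `saxpy_kernel` are hypothetical names, and the emulated "cores" run one after another, whereas real hardware schedules groups of threads in parallel.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Per-thread body: written once, with no mention of the core count."""
    out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, n_threads, n_cores, *args):
    """Emulated runtime: spread n_threads logical threads over n_cores."""
    for core in range(n_cores):
        # Each emulated core takes every n_cores-th logical thread.
        for thread_id in range(core, n_threads, n_cores):
            kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.0, 10.0, 10.0, 10.0]


def run(n_cores):
    out = [0.0] * len(x)
    launch(saxpy_kernel, len(x), n_cores, 2.0, x, y, out)
    return out


# The same kernel gives the same answer on a small or a large "card".
print(run(1))
print(run(4))
print(run(240))
```

Only the emulated runtime sees `n_cores`, which is why the same kernel can be developed on a low-end card and run unchanged on a bigger one.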

  32. Re: Is that all you got? by neomunk · · Score: 2, Interesting

    Neural nets.

    This setup sounds ideal for a training bed for fann programs. I can't recall if there's a port of fann for CUDA, but I think there might be.

  33. Re:Only in C? Oh dear. by HuguesT · · Score: 2, Informative

    Actually yes it is. For instance, nobody has yet figured out an efficient matrix class in C++ that uses operator overloading. It is basically impossible to write B=A*X*A^t efficiently, which occurs all the time in linear analysis, because in C++ the transpose would require a copy operator, whereas one ought to get the job done simply with a different iterator. C++ is not equipped for this yet.
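The "different iterator" idea, a transpose that swaps indices on access instead of copying, can be sketched as follows. This is an illustration in Python with hypothetical class names (`Matrix`, `TransposedView`, `matmul`); it says nothing either way about how efficiently C++ operator overloading can express the same thing.

```python
class Matrix:
    """Dense row-major matrix with a minimal element accessor."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.shape = (len(self.rows), len(self.rows[0]))

    def get(self, i, j):
        return self.rows[i][j]


class TransposedView:
    """Reads the base matrix with swapped indices; copies no data."""

    def __init__(self, base):
        self.base = base
        self.shape = (base.shape[1], base.shape[0])

    def get(self, i, j):
        return self.base.get(j, i)


def matmul(a, b):
    """Naive product of anything exposing .shape and .get(i, j)."""
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("shape mismatch")
    return Matrix(
        [[sum(a.get(i, p) * b.get(p, j) for p in range(k)) for j in range(m)]
         for i in range(n)]
    )


A = Matrix([[1, 2], [3, 4]])
X = Matrix([[1, 0], [0, 2]])
B = matmul(matmul(A, X), TransposedView(A))   # B = A * X * A^t
print(B.rows)                                  # [[9, 19], [19, 41]]
```

The view holds only a reference to `A`, so forming `A^t` costs no copy; the intermediate `A * X` is still materialised in this naive version.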

  34. 4 Terraflops? by yfkar · · Score: 2, Funny

    As opposed to astroflops?

  35. CUDA memory structure by DrYak · · Score: 2, Informative

    but I don't know enough about it to be able to give useful information on the subject.

    I do write some CUDA code, so I'll try to help.

    I believe that each of the chips has a 512 bit wide bus to 4GiB of memory.

    Indeed each physical package has full access to its own whole chunk of memory, regardless of how many "cores" the package contains (between 2 for the lowest end laptop GPUs and 16 for the highest end 8/9800 cards. Don't know about GT280. But the summary is wrong: 240 is probably the number of ALUs or the width of the SIMD) and regardless of how many "stream processors" there are (each core has 8 ALUs, which are exposed as 32-wide SIMD processing units, which in turn can keep up to 768 threads in flight thanks to some clever hyperthreading-like scheduling).

    So in one single GPU card all the memory is accessible.
    In a dual-GPU SLI card, each GPU has a full access to its own memory.
    So, in our situation, it's 4GiB for each Tesla Card.

    Then each core has a special internal memory which is shared by all the 32-to-768 threads running in parallel on the SIMD. (A couple of KiB, don't have the exact number handy).

    I'm not sure what the memory allocation per stream processor is but I think the other parts of the chip control what goes where.

    There's no actual per-stream-processor control of memory. There is something that looks like a "per-thread memory" but it's actually memory auto-allocated from the global memory.
    (It's all the same global memory, and the compiler just makes sure that each thread uses a different chunk of it to avoid conflicts).

    And you actually do not control the stream-processors themselves.
    You write a kernel (a piece of code which will process a mass of data) and throw a number of threads to one GPU (one physical package : i.e.: either 1 normal graphic card, or half of a SLI dual GPU graphic card).
    The scheduler will dynamically spread all the concurrent threads among the SIMD processors on the GPU.

    There probably are some bottlenecks

    Yes, indeed :
    - These 4GiB aren't cached at all (that's why it's preferable to use them only at the beginning and the end of a calculation and use other types of memory during the calculations), have a big latency (that's why it's better to have more threads running together so the scheduler can switch threads to hide latency) and you have to access them in a special fashion to group together the read-writes for faster access.

    - Then there's the texture access. Using a special set of functions you can access the memory not directly but as if it was textures. It still has a big latency and it is read-only. On the other hand, it has a cache so it has much better bandwidth and the texture units don't require special ordering of the access.

    - The last type of memory is an ultra fast on-chip read-write memory which is shared for all the threads executed at the same time on the same core. But its access pattern is weird because everything is accessed in banks (one bank per thread or all threads on the same bank. Never many-to-many).

    So, in the end writing good CUDA code requires some voodoo magic to correctly organise your stuff into memory in the most efficient way.
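The bank rule in the last bullet can be illustrated with a small counter. This is a sketch under assumptions that are not in the comment above: 16 banks, consecutive 32-bit words assigned to consecutive banks, and threads that read the same word served by a single broadcast. The function name `conflict_degree` is hypothetical and not part of CUDA.

```python
from collections import Counter


def conflict_degree(word_indices, n_banks=16):
    """Largest number of distinct words that land in the same bank.

    1 means conflict-free; k means the access is serialised into k steps.
    Assumes word w lives in bank w % n_banks (an assumption, see above).
    """
    distinct_words = set(word_indices)   # same word: one broadcast read
    if not distinct_words:
        return 0
    per_bank = Counter(w % n_banks for w in distinct_words)
    return max(per_bank.values())


threads = range(16)

# Stride 1: each thread hits its own bank.
print(conflict_degree([t for t in threads]))        # 1

# Stride 2: threads pair up on the even banks.
print(conflict_degree([2 * t for t in threads]))    # 2

# Stride 16: every thread hits bank 0 with a different word.
print(conflict_degree([16 * t for t in threads]))   # 16

# All threads read the very same word: a broadcast, no conflict.
print(conflict_degree([5 for _ in threads]))        # 1
```

Under these assumptions, padding an array so that the stride is odd is one common way to move from the stride-16 case back to the conflict-free case.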

    --
    "Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
  36. It's news because... by raftpeople · · Score: 2, Interesting

    NVIDIA has done a good job of making the processing power accessible to programmers that are not GPU coding experts. In addition, they have made hardware changes to better support the type of scientific computation being done on these devices.

    So, while in theory you could put together some Radeon's, work with their API and achieve the same thing, NVIDIA has significantly reduced the level of effort to make it happen.

  37. Re:Can I have a smaller version? by kramulous · · Score: 2, Informative

    The 10K refers to a rack mount solution containing 4xGPUs. You can still buy a single GPU and try and put it in a standard machine (provided it doesn't melt - I'd read the specs) for about a quarter of the price.

    --
    .