Why 'Gaming' Chips Are Moving Into the Server Room

Posted by timothy on Thursday July 15, 2010 @07:37AM from the expense-report-manipulation-++ dept.

Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high-level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"

137 comments

A whole new level of parallelism by TwiztidK · 2010-07-15 07:42 · Score: 4, Insightful

I've heard that many programmers have issues coding for 2 and 4 core processors. I'd like to see how they'll adapt to having to "run hundreds of threads" in parallel.

--
Sent from my iPhone 5
1. Re:A whole new level of parallelism by morcego · 2010-07-15 07:50 · Score: 3, Insightful
  
  This is just like programming for a computer cluster ... after a fashion.
  Anyone used to doing both should have no problem with this.
  I'm anything but a high end programmer (I mostly only code for myself), and I have written plenty of code that runs with 7-10 threads. Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.
  
  --
  morcego
2. Re:A whole new level of parallelism by Austerity+Empowers · 2010-07-15 07:50 · Score: 2, Insightful
  
  CUDA or OpenCL is how they do it.
3. Re:A whole new level of parallelism by Sax+Maniac · 2010-07-15 07:51 · Score: 3, Insightful
  
  This isn't hundreds of threads that can run arbitrary code paths like a CPU, you have to totally redesign your code, or already have implemented parallel code so that you already run a number of threads that all do the same thing at the same time, just on different data.
  The threads all run in lockstep, as in, all the threads better be at the same PC at the same time. If you run into a branch in the code, then you lose your parallelism, as the divergent threads are frozen until they come back together.
  I'm not a big thread programmer, but I do work on threading tools. Most of the problems with threads seem to come from threads following totally different code paths, and the unpredictable scheduling interactions that arise between them. GPU coding is a lot more tightly controlled.
  
  --
  I can explanate how to administrate your network. You must configurate and segmentate it, so it can computate.
4. Re:A whole new level of parallelism by Monkeedude1212 · 2010-07-15 07:57 · Score: 1
  
  I've heard that many programmers have issues coding for 2 and 4 core processors.
  Or even multiple processors, for that matter.
  That in and of itself is almost an entirely new section of programming - if you were an Ace 15 years ago, your C++ skills might still be sharper than most new graduates, but most post secondaries are now teaching students how to properly thread for parallel programming. If you don't know how to code for 2 or 4 core processors, you really should jump on board. Almost every computer and laptop I can think of being sold brand new today has more than 1 core or processor.
  
  I'd like to see how they'll adapt to having to "run hundreds of threads" in parallel.
  It requires a slightly more abstract design pattern, designed to be flexible. Kind of like moving from older structured program to object oriented - you just have to approach it differently. I haven't had to deal with any of it myself, but I imagine it'll boil down to knowing what calculations in your program can be done simultaneously, and then setting up a way to dump it off onto the next available core. That way, instead of stopping a core to wait and synch with another, you are synching the thread conceptually as it simply waits for the data, not the processors the next step might need.
5. Re:A whole new level of parallelism by Nadaka · 2010-07-15 08:04 · Score: 4, Insightful
  
  No it isn't. That you think so just shows how much you still have left to learn.
  I am not a high end programmer either. But I have two degrees on the subject and have been working professionally in the field for years, including optimization and parallelization.
  Many algorithms just won't have much improvement with multi-threading.
  Many will even perform more poorly due to data contention and the overhead of context switches and creating threads.
  Many algorithms just can not be converted to a format that will work within the restrictions of GPGPU computing at all.
  The stream architecture of modern GPUs works radically differently from a conventional CPU.
  It is not as simple as scaling conventional multi-threading up to thousands of threads.
  Certain things that you are used to doing on a normal processor have an insane cost in GPU hardware.
  For instance, the if statement. Until recently OpenCL and CUDA didn't allow branching. Now they do, but they incur such a huge penalty in cycles that it just isn't worth it.
6. Re:A whole new level of parallelism by Dynetrekk · 2010-07-15 08:08 · Score: 5, Insightful
  
  Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.
  Have you ever read up on Amdahl's law?
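  For readers who have not met it, Amdahl's law can be put in a few lines. This is a minimal sketch, not anything from the article; the 95% parallel fraction in the demo is an assumed, illustrative figure, and the function name is made up for this example.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a fixed-size problem.

    parallel_fraction: share of the single-processor run time that can
    be spread across processors (0.0 to 1.0).
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed workload: 95% of the run time parallelizes perfectly.
    for n in (3, 10, 10000):
        print(n, "processors ->", round(amdahl_speedup(0.95, n), 2), "x")
```

  With 5% of the work stuck in serial code, the speedup can approach 20x but never exceed it, so under this model it matters a great deal whether you have 3 processors or 10000.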
7. Re:A whole new level of parallelism by pushing-robot · 2010-07-15 08:20 · Score: 3, Funny
  
  Microsoft must be doing a bang-up job then, because when I'm in Windows it doesn't matter if I'm using 3 or 10000 processors.
  
  --
  How can I believe you when you tell me what I don't want to hear?
8. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 08:23 · Score: 1, Informative
  
  Uh
  OpenCL and CUDA supported branching from day one (with a performance hit). Before they existed, there was some (very little) usage of GPUs for general purpose computing and they used GLSL/HLSL/Cg, which supported branching poorly or not at all.
  The tools that were recently added to CUDA (for the latest GPUs) are recursion and function pointers.
9. Re:A whole new level of parallelism by jgagnon · 2010-07-15 08:23 · Score: 3, Interesting
  
  The problem with "programming for multiple cores/CPUs/threads" is that it is done in very different ways between languages, operating systems, and APIs. There is no such thing as a "standard for multi-thread programming". All the variants share some concepts in common but their implementations are mostly very different from each other. No amount of schooling can fully prepare you for this diversity.
  
  --
  Remember to maintain your supply of /facepalm oil to prevent chafing.
10. Re:A whole new level of parallelism by Chris+Burke · 2010-07-15 08:33 · Score: 4, Informative
  
  Programmers of Server applications are already used to multithreading, and they've been able to make good use of systems with large numbers of processors on them even before the advent of virtualization.
  But don't pay too much attention to the word "Server". Yes the machines that they're talking about are in the segment of the market referred to as "servers", as distinct from "desktops" or "mobile". But the target of GPU-based computing isn't "Servers" in the sense of the tasks you normally think of -- web servers, database servers, etc.
  The real target is mentioned in the article, and it's HPC, aka scientific computing. Normal server apps are integer code, and depend more on high memory bandwidth and I/O, which GPGPU doesn't really address. HPC wants that stuff too, but they also want floating point performance. As much floating point math performance as you can possibly give them. And GPUs are way beyond what CPUs can provide in that regard. Plus a lot of HPC applications are easier to parallelize than even the traditional server codes, though not all fall in the "embarrassingly parallel" category.
  There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing.
  
  --
  
  The enemies of Democracy are
11. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 08:45 · Score: 1, Insightful
  
  Well, GPGPU actually in a way addresses the memory bandwidth. Mostly due to design limitations, each GPU comes with their own memory, and thus memory bus and bandwidth.
  Of course you can get that for CPUs as well (with new Intels or any non-ancient AMD) by going to multiple sockets, however that is more effort and costlier (6 PCIe slots - unusual but obtainable - and you can have 12 GPUs, each with their own bus, try getting a 12-socket motherboard...).
12. Re:A whole new level of parallelism by Miseph · 2010-07-15 09:13 · Score: 1
  
  Isn't that basically true of everything else in coding too? You wouldn't code something in C++ for Linux the same way that you would code it in Java for Windows, even though a lot of it might be similar.
  Is parallelization supposed to be different?
  
  --
  Try not to take me more seriously than I take myself.
13. Re:A whole new level of parallelism by Hodapp · 2010-07-15 09:24 · Score: 3, Informative
  
  I am one such programmer. Yet I also coded for an Nvidia Tesla C1060 board and found it much more straightforward to handle several thousand threads at once.
  Not all types of threads are created equal. I usually explain CUDA to people as the "Zerg Rush" model of computing - instead of a couple, well-behaved, intelligent threads that try to be polite to each other and clean up their own messes, you throw a horde of a thousand little vicious, stupid threads at the problem all at once, and rely on some overlord to keep them in line.
  Most of the guides explained it as, "Flops are free, bandwidth is expensive." This board had a 384 or 512-bit wide memory bus with a very high latency, and the reason you throw that many threads at it is to let the hardware cover up the latency - it can merge a huge number of memory reads/writes into one operation, and as soon as a thread is waiting on memory I/O it can swap another thread into that same SP and let it compute. If memory serves me, the board was divided into blocks of 8 scalar processors (each block had some scratchpad memory that could be accessed almost as fast as a register) and you wrote groups of 16 threads which ran in lock-step on that processor (no recursion was allowed, and if one branched, the others would just wait around until it reached the same point) in two rounds.
  Sure, that's a bit complex to optimize for, but it beats the hell out of conventional threading while trying to optimize for x86 SIMD. And if you manage to write it so it runs well on CUDA, it generally will scale effortlessly to whatever card you throw it at.
  It's looking like OpenCL won't be much different, but I have yet to try it. I'm kind of eager, since apparently AMD/ATI's current cards, for the money, have a bit more raw power than Nvidia's.
14. Re:A whole new level of parallelism by Twinbee · 2010-07-15 09:31 · Score: 1
  
  Are If branches only slow because of what someone said below:
  "If you run into a branch in the code, then you lose your parallelism, as the divergent threads are frozen until they come back together."
  Because if that's the case, that's fine by me. The worst case length that a thread can run can be defined and even low in some cases I know of.
  
  --
  Why OpalCalc is the best Windows calc
15. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 09:38 · Score: 0
  
  Yes, that is correct. I don't know the details of ATI's hardware very well, but on NVIDIA, branch divergence is only an issue if threads within a warp (32 threads with sequential IDs) diverge, and then it slows the warp down by evaluating the different branches successively.
16. Re:A whole new level of parallelism by emt377 · 2010-07-15 09:40 · Score: 1
  
  There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing.
  All you say makes sense, but I for one don't understand the market for this. Today, if you need a compute server that's good for stream (e.g. SIMD) workloads you get a dozen 1U/2U rackmounts and fill them up with as many GPU boards as they'll take. You put a work scheduler on them that accepts tasks from a dispatch server, and hook them up to a NAS box (or just rsync data sets and results from a storage subsystem). Then you put a transaction server in front of it all that exposes a job manager.
  Throwing a couple of GPU boards in a transaction server will add some computational punch, but not enough to make it a real compute server. It'll be too expensive to rack and stack. It'll cost more than a plain old transaction server. The market clearly is whoever needs a little bit of computational power on their backend - but does it really exist?
17. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 09:58 · Score: 0
  
  There's a re-join of divergent threads. The penalty is that you serialize on one thread for a while (e.g. a 15x issue rate drop).
  But that behaviour is also not going to provide coalesced memory transactions... so some algorithms are really easy and some require some work to get them to scale.
18. Re:A whole new level of parallelism by Chris+Burke · 2010-07-15 10:08 · Score: 1
  
  All you say makes sense, but I for one don't understand the market for this. Today, if you need a compute server that's good for stream (e.g. SIMD) workloads you get a dozen 1U/2U rackmounts and fill them up with as many GPU boards as they'll take.
  Well I was mostly just trying to justify why the transition to the situation "today" is taking place, and why the multi-threading itself isn't that big a deal. "Yesterday" the biggest compute servers were still made from traditional CPUs. Only recently has the potential for GPUs as general purpose (if your "general" purpose is FP math) computational devices really captured significant mindshare, and APIs and methodologies are still being ironed out.
  It seemed like the article was talking about having GPUs in the "server" room in general, not about a specific situation where there's only a few GPUs stuck in an otherwise normal rackmount server. That would be fairly pointless. Though on the other hand there is some research that suggests that having more than a token amount of regular CPUs close to the GPUs is useful.
  
  --
  
  The enemies of Democracy are
19. Re:A whole new level of parallelism by Lord+of+Hyphens · 2010-07-15 10:41 · Score: 2, Interesting
  
  Have you ever read up on Amdahl's law?
  I'll see your Amdahl's Law, and raise you Gustafson's Law.
  
  --
  "I've spent my whole life figuring out crazy ways to do things. It'll work." -- Montgomery Scott, "Relics"
20. Re:A whole new level of parallelism by morcego · 2010-07-15 10:43 · Score: 1
  
  Nadaka, you are just proving my statement there.
  What you are describing are people using the wrong kind of logic and algorithms to do parallelization.
  The only new statement you make is:
  
  Many algorithms just can not be converted to a format that will work within the restrictions of GPGPU computing at all.
  I will take your word for it, since I really don't know GPGPUs at all. Most of my experience with parallelism is with clusters (up to 30 nodes). In that scenario, 99% of the time I've heard someone say something like that, it was because they were using bad algorithms for parallel processing, and even with 2-3 nodes they were not ideal.
  But as I said, I have no experience with GPGPU, so my experience with clusters might not be relevant.
  
  --
  morcego
21. Re:A whole new level of parallelism by Fulcrum+of+Evil · 2010-07-15 10:47 · Score: 2, Insightful
  
  most post secondaries are now teaching students how to properly thread for parallel programming.
  No they aren't. Even grad courses are no substitute for doing it. Never mind that parallel processing is a different animal than SIMD-like models that most GPUs use.
  
  I haven't had to deal with any of it myself, but I imagine it'll boil down to knowing what calculations in your program can be done simultaneously, and then setting up a way to dump it off onto the next available core.
  No, it's not like that. You set up a warp of threads running the same code on different data and structure it for minimal branching. That's the thumbnail sketch - NVIDIA has some good tutorials on the subject and you can use your current GPU.
  
  --
  "We returned the General to El Salvador, or maybe Guatemala, it's difficult to tell from 10,000 feet"
22. Re:A whole new level of parallelism by nxtw · 2010-07-15 11:08 · Score: 1
  
  The problem with "programming for multiple cores/CPUs/threads" is that it is done in very different ways between languages, operating systems, and APIs.
  Really?
  Most modern operating systems implement POSIX threads, or are close enough that POSIX threads can be implemented on top of a native threading mechanism. The concept of independently scheduled threads with a shared memory space can only be implemented in so many ways, and when someone understands these concepts well, everything looks rather similar.
  It seems that claiming things are radically different due to superficial differences is fairly common today in computer science.
  
  No amount of schooling can fully prepare you for this diversity.
  Of course not, if you're the kind of person who can't grasp the concepts.
23. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 11:09 · Score: 2, Interesting
  
  You might find this Google Tech Talk interesting..
24. Re:A whole new level of parallelism by sarkeizen · 2010-07-15 11:52 · Score: 2, Informative
  
  Personally (and I love that someone below mentioned Amdahl's law), I think the problem isn't specific language constructs, as you said, but that there isn't any general solution to parallelism. That is, to use Brooks's illustration, the problems we try to solve with computers aren't like harvesting wheat - they aren't efficiently divisible to an arbitrary degree. We do know of a few problems like this, which we call "embarrassingly parallel", but these are few and far between. So GPUs are great MD5 crackers and protein folders, and I personally *love* writing CUDA code, but I don't suffer from the delusion that this is somehow a revolution in software, or that the usual day-to-day tasks are going to be affected. So the idea that GPUs are moving into the server room seems optimistic, because the majority of stuff in there is pretty mundane.
  
  That said, I wonder if there aren't some architectural limitations on GPUs, e.g. memory protection: if we really wanted to use these for general purpose computing and added such features, would we lose performance? In other words, are we just making some kind of cores-to-features tradeoff?
25. Re:A whole new level of parallelism by Jeremy+Erwin · 2010-07-15 11:57 · Score: 1
  
  Java for Windows? I think you might be missing the point.
26. Re:A whole new level of parallelism by DigiShaman · 2010-07-15 12:30 · Score: 1
  
  The HPC platform comes in two flavors. Server and desktop. I'm of the understanding that the HPC server is mainly used for quick post-processing of data. While real-time interaction with data is usually done on the desktop.
  
  --
  Life is not for the lazy.
27. Re:A whole new level of parallelism by psilambda · 2010-07-15 13:19 · Score: 3, Interesting
  
  The article and everybody else are ignoring one large, valid use of GPUs in the data center--whether you call it business intelligence or OLAP--it needs to be in the data center and it needs some serious number crunching. There is not as much difference between this and scientific number crunching as most people might think. I have been involved in both crunching numbers for financials at a major multinational and had the privilege of being the first to process the first full genome (complete genetic sequence--terabytes of data) for a single individual and actually the genomic analysis was much more integer based than the financials. Based on my experience with both, I created the Kappa library for doing CUDA or OpenMP analysis in a datacenter--whether for business or scientific work.
28. Re:A whole new level of parallelism by BitZtream · 2010-07-15 14:05 · Score: 1
  
  Never heard of posix eh?
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
29. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 16:03 · Score: 0
  
  Yes, pretty much. When you hit a branch the GPU executes all the threads that go down the "if" side in parallel, then all the threads that go down the "else" side in parallel, so execution time doubles if even one thread diverges.
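  The "execution time doubles" rule can be mimicked with a toy counting model. This is only a sketch under loud assumptions: both sides of the branch cost the same, a warp is 32 threads (the NVIDIA figure given earlier in the thread), and real hardware scheduling is ignored. The function name is made up for illustration.

```python
WARP_SIZE = 32  # NVIDIA warp width, as cited earlier in the thread


def total_passes(takes_if_side, warp_size=WARP_SIZE):
    """Count serialized branch passes over all warps at one if/else.

    takes_if_side: one boolean per thread, in thread-id order.
    A warp whose threads all agree needs one pass; a warp with both
    outcomes present needs two (the "if" side, then the "else" side).
    """
    total = 0
    for start in range(0, len(takes_if_side), warp_size):
        warp = takes_if_side[start:start + warp_size]
        total += len(set(warp))  # 1 if uniform, 2 if diverged
    return total


if __name__ == "__main__":
    uniform = [True] * 64                  # nobody diverges
    split_by_warp = [True] * 32 + [False] * 32  # divergence on a warp boundary
    interleaved = [True, False] * 32       # every warp diverges
    for name, preds in [("uniform", uniform),
                        ("split_by_warp", split_by_warp),
                        ("interleaved", interleaved)]:
        print(name, total_passes(preds))
```

  In this model the penalty is paid only by warps that actually contain both outcomes, so a branch that splits cleanly along warp boundaries costs no extra passes.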
30. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 16:19 · Score: 0
  
  Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.
  Believe me, you are completely wrong. The more processors you throw into the mix, the more the costs of managing and controlling threads increase, along with contention for resources and the inherent cost of distributing the workload itself. Not to mention the very different architecture of, and costs around, GPGPU. Coding for 3 processors is night and day compared to programming for even 100.
31. Re:A whole new level of parallelism by jlar · 2010-07-15 16:20 · Score: 1
  
  "There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing."
  I am working on HPC (numerical modelling). At our institute we are seriously considering using GPUs for our next generation development. From my viewpoint the biggest problem is that the abstraction layers on top of the GPUs are not widely implemented and that they are somewhat more complicated than traditional parallelization frameworks. But hopefully this will change in a few years.
32. Re:A whole new level of parallelism by pseudorand · 2010-07-15 16:21 · Score: 1
  
  Parallel programming is a bit different, but so is event-driven (Windows, JS) vs. procedural, and programmers do both of those fine. The problem, unfortunately, isn't that we're all too stupid to pick up multi-threaded programming, but that the hardware isn't yet useful enough to make it worth the trouble. Take CUDA for example. To take advantage of the GPU you first have to copy data from main memory into GPU memory, do your parallel processing, then copy data back to main memory. Even for algorithms that are parallelisable, the time it takes to transfer data to/from GPU memory eats up the gains you get from multiple threads. So the reason we haven't all learned GPU programming (after all, just about every recent Nvidia card supports it) is that most of us simply don't have problems that can actually benefit from it.
  In short, don't buy the Nvidia/Dell/IBM hype. It's lots of work to port your problem to the GPU (if it's even possible) and there's no guarantee of speedup when you do. Don't buy a Tesla until you've done the appropriate algorithm analysis to determine you can actually use it!
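  The "do the analysis first" advice can start as a back-of-envelope model. The sketch below is an assumption-laden illustration, not a benchmark: the bus bandwidth, CPU times and kernel speedup are hypothetical placeholders you would replace with your own measurements, and the function names are invented here.

```python
def gpu_total_time(bytes_moved, bus_bytes_per_s, cpu_time_s, kernel_speedup):
    """Estimated wall time when offloading: copy in/out plus GPU compute.

    bytes_moved:     total bytes copied to and from GPU memory
    bus_bytes_per_s: sustained host<->GPU transfer rate (assumed figure)
    cpu_time_s:      measured time of the same work on the CPU
    kernel_speedup:  how much faster the GPU does the compute alone
    """
    transfer_s = bytes_moved / bus_bytes_per_s
    return transfer_s + cpu_time_s / kernel_speedup


def worth_offloading(bytes_moved, bus_bytes_per_s, cpu_time_s, kernel_speedup):
    """True when the offloaded estimate beats staying on the CPU."""
    return gpu_total_time(bytes_moved, bus_bytes_per_s,
                          cpu_time_s, kernel_speedup) < cpu_time_s


if __name__ == "__main__":
    one_gb = 1e9
    bus = 5e9  # hypothetical 5 GB/s sustained transfer rate
    # Light compute per byte: the copy dominates and the GPU loses.
    print(worth_offloading(one_gb, bus, cpu_time_s=0.1, kernel_speedup=50))
    # Heavy compute per byte: the copy is noise and the GPU wins.
    print(worth_offloading(one_gb, bus, cpu_time_s=10.0, kernel_speedup=50))
```

  The model ignores overlapping transfers with computation and any data that can stay resident on the card, both of which shift the break-even point in the GPU's favour.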
33. Re:A whole new level of parallelism by David+Greene · 2010-07-15 17:09 · Score: 4, Interesting
  
  The stream architecture of modern GPUs works radically differently from a conventional CPU.
  True if the comparison is to a commodity scalar CPU.
  
  It is not as simple as scaling conventional multi-threading up to thousands of threads.
  True. Many algorithms will not map well to the architecture. However, many others will map extremely well. Many scientific codes have been tuned over the decades to exploit high degrees of parallelism. Often the small data sets are the primary bottleneck. Strong scaling is hard, weak scaling is relatively easy.
  
  Certain things that you are used to doing on a normal processor have an insane cost in GPU hardware.
  In a sense. These are not scalar CPUs and traditional scalar optimization, while important, won't utilize the machine well. I can't think of any particular operation that's greatly slower than on a conventional CPU, provided one uses the programming model correctly (and some codes don't map well to that model).
  
  For instance, the if statement.
  No. Branching works perfectly fine if you program the GPU as a vector machine. The reason branches within a warp (using NVIDIA terminology) are expensive is simply because a warp is really a vector. The GPU vendors just don't want to tell you that because either they fear being tied to some perceived historical baggage with that term or they want to convince you they're doing something really new. GPUs are interesting, but they're really just threaded vector processors. Don't misunderstand me, though, it's a quite interesting architecture to work with!
  
  --
34. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 18:33 · Score: 0
  
  "Normal server apps are integer code, and depend more on high memory bandwidth and I/O, which GPGPU doesn't really address."
  http://www.amd.com/us/products/desktop/processors/phenom-ii/Pages/phenom-ii-key-architectural-features.aspx
  AMD X6 1055T/1090T ->Up to 17.1GB/s memory bandwidth for DDR2 and up to 21GB/s memory bandwidth for DDR3
  http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5970/Pages/ati-radeon-hd-5970-specifications.aspx
  AMD 5970 -> Memory bandwidth: 256.0 GB/sec
  You were saying?
  And these PCI-E addon boards come equipped with 512MB/1GB/2GB of dedicated memory ... and their own cooling system.
  I don't doubt you could get more industrial versions, rather than consumer versions.
35. Re:A whole new level of parallelism by inKubus · 2010-07-15 18:42 · Score: 1
  
  I was thinking the same thing. OLAP is all about manipulating big 2d and 3d sets, blending them with other sets, etc. All things GPUs have ops for on the die. Not that there aren't already relational db accelerator chips in the mainframe arena (such as the zIIP). Obviously the drivers and front end need to be remade to make the programming make sense, like OpenDL (data language) instead of OpenGL.
  
  --
  Cool! Amazing Toys.
36. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 18:52 · Score: 2, Interesting
  
  I've heard that many programmers have issues coding for 2 and 4 core processors. I'd like to see how they'll adapt to having to "run hundreds of threads" in parallel.
  
  If that's the paradigm they're operating in, it will probably fail spectacularly. Let me explain why.
  In the end, GPUs are essentially vector processors (yes, I know that's not exactly how they work internally, but bear with me). You feed them one or more input vectors of data and one or two storage vectors for output and they do the same calculation on every element of the input and store the results in the output. Think about what you need for pixel rendering: it's things like "apply a fixed affine transform to every pixel of the input image and store the results as the output image" or "add [alpha blend] these two images together and store the result." These are the kind of tasks vector processors like the old Crays were designed to implement efficiently; compilers implementing OpenMP are also working within this kind of paradigm.
  Threads, in contrast to vector processing, are independent streams of execution. While you can use threads to split a loop into pieces, the normal thread pattern is something more like "wait for an event, and then respond to it appropriately." The real problem here is that because threads are independent tasks, memory sharing is hard (semaphores, spin locks, and all that) because you can't guarantee the behavior of any other thread.
  Clusters, finally, as a few people have mentioned (although perhaps never used), are different yet again. While each node in a cluster runs as an independent machine and thus conceptually resembles a thread, the nodes don't have a pool of shared memory (they may not even have shared disk space!). If I want to get data from node A to node B, I have to copy it over the network. Because the internal bandwidth of a cluster is so much lower than the memory bus of a shared-memory computer, you spend most of your time figuring out how to minimize the amount of data you have to copy between nodes and worrying about things like cluster topology. As a result, algorithms that scale well on a shared-memory machine may or may not scale well at all on a distributed cluster.
  So why bother? Because each design has its own strengths and weaknesses. Vector processors are great if you're doing a vector operation, but things like stream processing (e.g., compressing video data) don't vectorize particularly well. Threads are generic and flexible; so flexible that you can't really optimize the hardware for them. They also require discipline to avoid deadlocks and other related problems. Clusters, finally, are inexpensive and are ideally suited for "batch" tasks like web servers or databases where each thread really is an independent job, but for things like weather simulations (where lots of data has to be exchanged between nodes) they require very careful attention to the algorithms used or the performance can tank as the size of the system gets large.
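  The alpha-blend case mentioned above shows what "the same calculation on every element" means in practice. This is a plain-Python sketch of the data-parallel shape only; it runs on the CPU, the pixel values are made up, and the function name is illustrative.

```python
def alpha_blend(image_a, image_b, alpha):
    """Blend two equal-length pixel vectors element by element.

    Every output element depends only on the matching input elements,
    so no element needs to know about, or wait for, any other. That
    independence is what lets a vector processor or GPU do them all at once.
    """
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of pixels")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(image_a, image_b)]


if __name__ == "__main__":
    # Hypothetical grayscale pixels in the range 0.0 to 1.0.
    print(alpha_blend([0.0, 1.0, 0.5], [1.0, 1.0, 0.5], alpha=0.5))
```

  Contrast this with the thread pattern described above: there is no event to wait for, no lock to take and no shared state to protect, which is why this kind of loop maps so directly onto the hardware.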
37. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 19:45 · Score: 0
  
  Bullshit. No Windows make use of 10k processors.
  Oh! Nevermind.
38. Re:A whole new level of parallelism by Anonymous Coward · 2010-07-15 20:46 · Score: 0
  
  There are those great tools called compilers that are getting more and more complex these days. And some compilers can detect areas in your code that can execute in parallel. Some other compilers (read: most compilers today) can be instructed to make specific areas of code parallel (think OpenMP). Now there are compilers that can be instructed to make portions of the code run on the GPU, like PGI's Fortran compiler, which can generate CUDA-enabled programs with only some compiler directives added to the existing serial code. Of course there are tasks that will benefit and there are tasks that won't, the same way as with OpenMP. A physics simulation, an engineering model, password guessing, database traversal - these all can run hundreds of times faster on a GPGPU.
  Migration to GPGPU computing requires a shift from the current task parallelism to data parallelism.
39. Re:A whole new level of parallelism by David+Greene · 2010-07-16 07:05 · Score: 1
  
  You're talking about solving two completely different problems. Gustafson's Law assumes weak scaling. Amdahl's law assumes strong scaling. It's easy to show that by scaling the problem size to the number of processors available, one can compute more in the same amount of time (weak scaling). It's much harder to keep the problem size the same and get linear speedup as the number of processors increases (strong scaling), ignoring superlinear effects like the problem suddenly fitting into the cache, which are not seen with real problems.
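  The two laws can be put side by side in a few lines. This is a rough sketch with an assumed 5% serial fraction; note that the fraction means slightly different things in each law (share of single-processor time for Amdahl, share of time on the parallel machine for Gustafson), and the function names are made up for this example.

```python
def strong_scaling_speedup(serial_fraction, processors):
    """Amdahl: fixed problem size, more processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def weak_scaling_speedup(serial_fraction, processors):
    """Gustafson: problem size grows with the processor count."""
    return processors - serial_fraction * (processors - 1)


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, for illustration only
    for n in (10, 100, 1000):
        print(n,
              "strong:", round(strong_scaling_speedup(s, n), 2),
              "weak:", round(weak_scaling_speedup(s, n), 2))
```

  With these assumed numbers the strong-scaling curve flattens out below 20x, while the weak-scaling figure keeps growing almost linearly, which is the gap between the two problems described above.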
  
  --
40. Re:A whole new level of parallelism by David+Greene · 2010-07-16 07:07 · Score: 1
  
  The threads all run in lockstep
  The threads in a warp all run in lockstep. Different warps can follow separate control paths with no problem. That's because a warp is really a vector. Think of the GPU as a threaded vector machine and you're golden.
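  A toy cost model of that point, assuming a warp width of 32 (NVIDIA's figure) and assuming that a warp simply needs one serialized pass per distinct control path its threads take. Real hardware is more subtle; passes_needed is a made-up helper, not any vendor API.

```python
WARP_SIZE = 32  # NVIDIA's warp width, taken as a given constant here


def passes_needed(branch_taken, warp_size=WARP_SIZE):
    """Count serialized passes for a list of per-thread branch choices.

    Toy assumption: each warp needs one pass per distinct control
    path taken by the threads inside it.
    """
    total = 0
    for start in range(0, len(branch_taken), warp_size):
        warp = branch_taken[start:start + warp_size]
        total += len(set(warp))
    return total


# 64 threads, i.e. 2 warps.
uniform_per_warp = [True] * 32 + [False] * 32  # warps differ from each other only
alternating = [i % 2 == 0 for i in range(64)]  # every warp diverges internally

print(passes_needed(uniform_per_warp))  # each warp follows a single path
print(passes_needed(alternating))       # each warp has to run both paths
```

  Same number of threads and the same 50/50 split of branch outcomes, but the second layout costs twice as many passes in this model because the divergence falls inside each warp.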
  
41. Re:A whole new level of parallelism by David+Greene · 2010-07-16 08:50 · Score: 1
  
  No. Branching works perfectly fine if you program the GPU as a vector machine. The reason branches within a warp (using NVIDIA terminology) are expensive is simply because a warp is really a vector.
  I should add that with a vector programming model, the way to handle control flow is via masked operations, which GPUs provide. In fact divergence is simply the GPU hardware emulating a masked operation when the programmer/compiler hasn't supplied one.
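  For anyone who hasn't seen masked operations, here is a minimal sketch in plain Python lists standing in for vector lanes. The branch itself (double non-negatives, negate the rest) is an arbitrary example of mine, and nothing here is GPU code.

```python
def branchy(xs):
    """Scalar-style reference: each element picks exactly one branch."""
    return [x * 2 if x >= 0 else -x for x in xs]


def masked(xs):
    """Vector-style: evaluate both sides for every lane, then select by mask."""
    mask = [x >= 0 for x in xs]      # one predicate bit per lane
    then_side = [x * 2 for x in xs]  # computed for all lanes
    else_side = [-x for x in xs]     # computed for all lanes
    return [t if m else e for m, t, e in zip(mask, then_side, else_side)]


lanes = [3, -1, 0, -7]
print(branchy(lanes))
print(masked(lanes))
```

  Both versions give the same answer; the masked one just pays for both sides of the branch on every lane, which is roughly the cost divergence imposes within a warp.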
  
42. Re:A whole new level of parallelism by Chris+Burke · 2010-07-16 12:04 · Score: 1
  
  You were saying?
  I was saying that graphics cards don't address the memory throughput issue, and they don't, because what matters is not how fast you can access the on-board memory, but rather how fast you can stream data to that on-board memory (because even 2GB is only a fraction of overall system memory in these systems, and server apps in particular tend not to be streaming but more random access), and that's fast but still slightly slower than CPU DRAM itself.
  Specs don't tell you everything. You have to interpret them in the context of what you are doing.
  
  --
  
  The enemies of Democracy are
43. Re:A whole new level of parallelism by Tablizer · 2010-07-18 04:46 · Score: 1
  
  640 processors is all anyone would ever need
  
  --
  Table-ized A.I.
Good luck with that by tedgyz · 2010-07-15 07:47 · Score: 3, Insightful

This is a long-standing issue. If your programs don't just "magically" run faster, then count out 90% or more of the programs that will benefit from this.

--
"No matter where you go, there you are." -- Buckaroo Banzai
1. Re:Good luck with that by Anonymous Coward · 2010-07-15 07:57 · Score: 0
  
  That's okay - 99% or more of programs don't need to run faster. It's the remaining 1% that is actually doing something important that we want to run faster.
2. Re:Good luck with that by crafty.munchkin · 2010-07-15 11:23 · Score: 1
  
  Can anyone provide any info on how this is going to work with regard to virtual environments? After all, there has been a rather large push toward virtualizing everything in the datacenter, and about the only physical server we have left in our server room is the Fax/SMS server, the ISDN card and GSM module for which could not be virtualised...
  
  --
  ... wait, what?
3. Re:Good luck with that by tedgyz · 2010-07-15 22:07 · Score: 1
  
  Anyone worried enough about performance to adopt GPGPU computing is probably not going to virtualize.
  We have virtualized a good portion of our servers, but the critical ones, like our db servers are still good old fashioned iron.
  Personally, I hate all this virtualization. The people that run these things think it is the second coming of Christ. If you try to point out flaws in their "amazing" virtual cluster, they always claim nothing is wrong.
  
  --
  "No matter where you go, there you are." -- Buckaroo Banzai
Yes, of course by Anonymous Coward · 2010-07-15 07:47 · Score: 2, Funny

The sysadmins need new machines with powerful GPUs, you know, for business purposes.
Oh and, they sell ERP software on Steam now, too, so we'll have to install that as well.
1. Re:Yes, of course by Yvan256 · 2010-07-15 07:54 · Score: 5, Funny
  
  Portal 2? It's something for our Web server. It adds more portals to access the internet.
2. Re:Yes, of course by Anonymous Coward · 2010-07-15 08:21 · Score: 0
  
  Anyone can requisition cheap hardware, only a SysAdmin can spend 100x for the same thing with more blinkin lights, now with 1000GPU cores per blade, get at it programmers, performance issues are in your court now!
CUDA by Lord+Ender · 2010-07-15 07:48 · Score: 3, Informative

I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.
NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.

--
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
1. Re:CUDA by Rockoon · 2010-07-15 08:01 · Score: 4, Interesting
  
  Indeed. With Cuda, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API.
  
  There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.
  
  --
  "His name was James Damore."
2. Re:CUDA by Austerity+Empowers · 2010-07-15 08:03 · Score: 1
  
  Probably can't happen, the parallel computing model is very different than the model you use in applications today. It's still evolving, but I doubt you will ever be in a position where you can write code as you do now and have it use and benefit from GPU hardware out of the gates.
3. Re:CUDA by cgenman · 2010-07-15 08:11 · Score: 2, Interesting
  
  While I don't disagree that NVIDIA needs to make this simpler, is that really a sizeable market for them? Presuming every college will want a cluster of 100 GPU's, they've still got about 10,000 students per college buying these things to game with.
  I wonder what the size of the server room market for something that can't handle IF statements really would be.
  
  --
  The ______ Agenda
4. Re:CUDA by Anonymous Coward · 2010-07-15 08:15 · Score: 0
  
  I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.
  But it looks awesome!
5. Re:CUDA by bberens · 2010-07-15 08:18 · Score: 1
  
  import java.util.concurrent.*; //???
  
  --
  Check out my lame java blog at www.javachopshop.com
6. Re:CUDA by jpate · 2010-07-15 08:40 · Score: 1
  
  Actors are a really good framework (with a few different implementations) for easy parallelization. Scala has an implementation of Actors as part of the standard library, so they really are first-class citizens.
7. Re:CUDA by tedgyz · 2010-07-15 08:42 · Score: 1
  
  I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.
  NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.
  The simple fact is, parallel programming is very hard. More to the point, most programs don't need this type of parallelism.
  
  --
  "No matter where you go, there you are." -- Buckaroo Banzai
8. Re:CUDA by jgtg32a · 2010-07-15 08:43 · Score: 1
  
  ever?
9. Re:CUDA by 0100010001010011 · 2010-07-15 08:51 · Score: 1
  
  You mean like C/Objective-C and Grand Central Dispatch?
  It's open source and has been ported to work with FreeBSD and Apache.
  Doesn't care if it's a CPU, GPU, 10xGPUs etc.
10. Re:CUDA by Lord+Ender · 2010-07-15 08:52 · Score: 1
  
  Well, since you can crack a password a hundred (or more) times faster with CUDA than with a CPU, they could at least sell a million units to the NSA and the FBI... and the analogous departments of every other country...
  
  --
  A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
11. Re:CUDA by Dekker3D · 2010-07-15 09:03 · Score: 1
  
  Yes. Just like we still doubt that anybody ever should need more than 640K.
12. Re:CUDA by Dekker3D · 2010-07-15 09:05 · Score: 1
  
  Plenty of data processing could be parallelized to GPU style code, I'll bet. As long as you've got enough data that needs enough processing, you can probably get a speedup from that. Just how much, is another question..
13. Re:CUDA by Anonymous Coward · 2010-07-15 09:17 · Score: 0
  
  PyCUDA: http://documen.tician.de/pycuda/
14. Re:CUDA by russotto · 2010-07-15 09:42 · Score: 1
  
  I found just the opposite; not enough low-level access. For instance, no access to the carry bit from integer operations!
15. Re:CUDA by Lord+Ender · 2010-07-15 09:50 · Score: 1
  
  The PyCUDA "hello world" involves inline C code!
  
  --
  A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
16. Re:CUDA by Anonymous Coward · 2010-07-15 10:19 · Score: 0
  
  What a load of old tosh. The current CUDA SDKs and samples couldn't be easier to install and get working with. For god's sake the most recent version even allows you to call printf directly on the bloody card.
  The fact is a GPU is a very differently structured piece of hardware. If you want to use that to execute certain classes of algorithm orders of magnitude faster than on a scalar processor, then great, join in. If you want to unthinkingly write high level code and expect it to go super fast, move along, there's nothing for you here.
  Nvidia has chosen C as their lead language for nvcc because C is the most common HPC language by a country mile. If you want to get heavily in to HPC and you don't want to learn C, either learn to write cross compilers or start getting used to disappointment.
17. Re:CUDA by Rockoon · 2010-07-15 10:23 · Score: 1
  
  No.. that's not the same thing. Even if GCD worked with GPUs (which I see no evidence of) it still wouldn't be the same thing. While GPUs often have many "threads", each thread itself is a very wide SIMD architecture. For GCD in its current form to be useful, the work() function would still have to have the SIMD stuff baked in.
  
  --
  "His name was James Damore."
18. Re:CUDA by Bigjeff5 · 2010-07-15 11:32 · Score: 1
  
  Fun fact:
  That quote is an urban legend, and there has never been any evidence that it was actually uttered by Gates.
  You'd think confirmation would be easy, since it was supposedly said at a 1981 computer trade show.
  It's like the famous quote "Let them eat cake" which is attributed to Marie Antoinette, but which scholars have never been able to find any evidence to suggest she actually uttered it.
  The idea that 640k would be enough forever is idiotic, especially since the industry was so constricted by the 64k limit of 8-bit processors. Microsoft was actually influential in getting the limit to 640k from the 512k originally proposed for the 8088, because they wanted to get as much memory as possible.
  
  --
  Security is mostly a superstition... Avoiding danger is no safer in the long run than outright exposure. - Helen Keller
19. Re:CUDA by Anonymous Coward · 2010-07-15 11:46 · Score: 0
  
  The problem is, we do not need more software built with the programming equivalent of Duplo Lego. Sure, Lego Technics is more complicated to build, but you get better results.
20. Re:CUDA by psilambda · 2010-07-15 13:24 · Score: 2, Informative
  
  Indeed. With Cuda, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API. There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.
  If you use CUDA, OpenCL or DirectCompute it is--try the Kappa library--it has its own scheduling language that makes this much easier. The next version that is about to come out goes much further yet.
21. Re:CUDA by BitZtream · 2010-07-15 14:14 · Score: 2, Informative
  
  GCD combined with OpenCL makes it usable on a GPU, but that would be stupid. GPUs aren't really 'threaded' in any context that someone who hasn't worked with them would think of.
  All the threads run simultaneously, and side by side. They all start at the same time and they all end at the same time in a batch (not entirely true, but it is if you want to actually get any boost out of it).
  GCD is multithreading on a General Processing Unit, like your Intel CoreWhateverThisWeek processor. Code paths are run and scheduled on different cores as needed and don't really run side by side, but they can run at the same time which is practical and useful in A LOT of cases.
  OpenCL is multithreading on a graphics chip. It lets you do the same calculation over and over again or on a very large data set, side by side. You can calculate 128 encryption keys in one pass, but you can't calculate one encryption key, the average of your monthly bills, and draw a circle because the graphics chip doesn't do random processing side by side, it runs a whole bunch of the same instructions side by side and goes to hell in a handbasket the INSTANT you break its ability to run all the 'threads' side by side, executing the same instruction in each at the same time.
  I really don't think you understand either standard GP multithreading or what GPUs are practically capable of doing.
  
  --
  Persistent Volume manager for Kubernetes - https://github.com/dwimsey/openshift-pvmanager
22. Re:CUDA by MostAwesomeDude · 2010-07-15 18:27 · Score: 1
  
  I don't mean to be rude, but graphics processors don't work that way. They are not general-purpose and I would not expect general-purpose toolkits to show up for them anytime soon.
  As a thought experiment, consider Linux. It requires 8MB of RAM and support for the C language on its targets. Larrabee ran BSD and the engineers were trying to get Linux on there when the project was scuttled. Larrabee could have been a chipset where you could use "higher-level languages" to do this stuff, but it would have been the first.
  
  --
  ~ C.
23. Re:CUDA by the_one(2) · 2010-07-15 21:13 · Score: 1
  
  You should take a look at brook (though it seems to be dying in favor of OpenCL). It's really straightforward and almost simpler than programming for the CPU. Of course I did have quite a lot of trouble compiling the programs... but that's probably because I suck.
24. Re:CUDA by Rockoon · 2010-07-16 04:46 · Score: 1
  
  I really don't think that you understand that GPUs actually can, and do, execute more than one unique thread at a time. They could not get the polygon counts they do if they didn't.
  
  They aren't just big SIMDs like you think. They really do execute independent threads and each is wide SIMD (128 bytes wide on most modern GPUs) .. if GCD backed by OpenCL can't do this, then it's selling you short.
  
  --
  "His name was James Damore."
25. Re:CUDA by David+Greene · 2010-07-16 07:16 · Score: 1
  
  The fact is a GPU is a very differently structured piece of hardware. If you want to use that to execute certain classes of algorithm orders of magnitude faster than on a scalar processor, then great, join in. If you want to unthinkingly write high level code and expect it to go super fast, move along, there's nothing for you here.
  That's the wrong attitude. GPUs will fail as general-purpose machines if that's what the vendors are thinking. And thankfully, they aren't. The fact that one has to write all kinds of painful malloc/memcpy/free code, declare variables twice, hand-outline kernels, etc. is an abomination. This should be handled automatically by compilers, either via user directives or with advanced analysis.
  
  Nvidia has chosen C as their lead language for nvcc because C is the most common HPC language by a country mile.
  False. Fortran is by far the most common HPC language. But desktop users who have traditionally purchased GPUs don't usually know Fortran. And there's no reasonable Free Fortran implementation. That's why NVIDIA went with C. Companies like PGI are filling the Fortran gap.
  
26. Re:CUDA by Dekker3D · 2010-07-17 23:27 · Score: 1
  
  True, but it's still funny to quote. And that's probably why it'll stick around forever.
Notice in TFA by blai · 2010-07-15 07:50 · Score: 1

"OpenCL is managed by a standards group, which is a great way to get nothing done"

I don't see the correlation.

--
In soviet Russia, God creates you!
1. Re:Notice in TFA by binarylarry · 2010-07-15 08:29 · Score: 2, Interesting
  
  Not only that, but they posit that Microsoft's solution solves the issue of both Nvidia's proprietary-ness and the OpenCL board's "lack of action."
  Fuck this article, I wish I could unclick on it.
  
  --
  Mod me down, my New Earth Global Warmingist friends!
2. Re:Notice in TFA by Anonymous Coward · 2010-07-15 14:27 · Score: 0
  
  DirectX has been (and still is) ahead of OpenGL in terms of implementing new features for many years now. The gap has been closing with the Khronos group taking over OpenGL, but it's still there and it's still a valid worry. It shouldn't be so difficult to understand why people are skeptical about the Khronos group's ability to compete in the general purpose GPU computing market -- so far their only marked success has been OpenGL ES, an API with virtually no competitors. Further, OpenCL is sort of alone and without a lot of support compared to the competing platforms. CUDA has strong nVidia backing (duh) and DirectCompute has Microsoft strong-arming ATI and nVidia into supporting it if they want to be DirectX 11 compatible. OpenCL has them voluntarily implementing it, if they feel like it. And frankly, as someone who works in GPU computing, OpenCL implementations from both vendors are lacking.
3. Re:Notice in TFA by binarylarry · 2010-07-16 03:10 · Score: 1
  
  Completely wrong. OpenGL has always had an edge feature wise.
  Khronos has been fixing the coherency of OpenGL api, slimming and streamlining it to make it simpler for developers in the same vein as d3d.
  I think it's funny you claim to work "in GPU computing" when it's obvious to anyone who actually does that you don't.
  
  --
  Mod me down, my New Earth Global Warmingist friends!
OpenCL by gbrandt · 2010-07-15 07:50 · Score: 2, Informative

Sounds like a perfect job for OpenCL. When a program is rewritten for OpenCL, you can just drop in CPU's or GPU's and they get used.
1. Re:OpenCL by Anonymous Coward · 2010-07-15 08:25 · Score: 3, Informative
  
  Unfortunately, no. OpenCL does not map equally to different compute devices, and does not enforce uniformity of parallelism approaches. Code written in OpenCL for CPUs is not going to be fast on GPUs. Hell, OpenCL code written for ATI GPUs is not going to work well on nVidia GPUs.
2. Re:OpenCL by quanticle · 2010-07-15 09:10 · Score: 1
  
  Well, true, but that overlooks the fact that porting a program to OpenCL is not exactly a trivial task.
  
  --
  We all know what to do, but we don't know how to get re-elected once we have done it
Of course not! by Yvan256 · 2010-07-15 07:52 · Score: 2, Informative

It's not something that can be accomplished with a few libraries and lines of code.
It doesn't take a few libraries and lines of code... It takes a SHITLOAD of libraries and lines of code! - Lone Starr
Not really news... by Third+Position · 2010-07-15 07:53 · Score: 1

I remember reading that IBM was planning to put Cell in mainframes and other high-end servers several years ago, supposedly to accrue the same benefits. I don't really know whether or not that was ever followed through with, I haven't kept track of the story.

--
American Third Position
Finally, a real choice!
1. Re:Not really news... by Dynetrekk · 2010-07-15 08:04 · Score: 2, Interesting
  
  I'm no expert, but from what I understand, it wouldn't be at all surprising. IBM has been regularly using their Power processors for supercomputers, and the architecture is (largely) the same. The Cell has some extra graphics-friendly floating-point units, but it's not entirely different from the CPUs IBM has been pushing for computation in the past. I'm not even sure if the extra stuff in the Cell is interesting in the supercomputing arena.
2. Re:Not really news... by Anonymous Coward · 2010-07-15 09:29 · Score: 1, Interesting
  
  The Cell is a PowerPC processor, which is intimately related with the Power architecture. Basically, PowerPC was an architecture designed by IBM, Apple, and Motorola, for use in high performance computing. It was based in part on an older (now) version of IBM's POWER architecture. In short, POWER was the "core" architecture, and additional instruction sets could be added at fabrication time -- kind of like Intel with their SSE extensions.
  This same pattern continued for a long time. IBM's POWER architecture basically took the PowerPC instruction set, implemented it in new, faster ways. Any interesting extensions might/could be folded into the newer PowerPC architecture revision. The next generation of PowerPC branded chips would inherit the "core" of the last POWER chip's implementation. Later POWER was renamed to Power, to align it with PowerPC branding.
  The neat thing is that the "core" instruction set is pretty powerful. You can run the same Linux binary on a G3 iMac as a Cell as a Gamecube or Wii (in principle) as a supercomputing POWER7 or whatever IBM is up to now, as long as it doesn't need extensions. And you can do a lot of computation without extensions. The "base" is broad, unlike x86's strict hierarchy of modes. In some respects, this doesn't sound so neat, since the computing world has mostly settled on x86 for general purpose computation, and so any new x86 chips will probably include a big suite of extensions to the architecture too. Intel, AMD, and IBM eventually converged on this same RISC-y CISC idea, though IBM/Apple/Motorola managed to expose less of the implementation through its architecture at first.
3. Re:Not really news... by PrecambrianRabbit · 2010-07-15 09:29 · Score: 1
  
  Yep, IBM produced the PowerXCell for that purpose, and used them to build Roadrunner, which was the world's first petaflop supercomputer. I'm not sure whether Cell is still being pushed forward these days though.
  That's somewhat different than the trend towards GPGPU that the article talks about, although it's related. Both approaches use semi-specialized parallel hardware for compute-intensive tasks.
4. Re:Not really news... by ihuntrocks · 2010-07-15 15:30 · Score: 1
  
  http://www.fixstars.com/en/products/gigaaccel180/features.html I wouldn't mind having a few of those. Also, the QS22 blades that I worked with were also very nice in my opinion. Cell is a fun architecture.
  
  --
  Randimal: AT-CG-CG-AT-CG-AT-AT-CG-CG-AT-AT-CG-AT-CG-CG-AT-CG-AT-AT-CG-AT-CG-CG-AT-AT-CG-CG-AT-CG-AT-AT-CG
5. Re:Not really news... by inKubus · 2010-07-15 18:47 · Score: 1
  
  They have the zIIP and zAAP processors on the z series mainframes, which are specialty procs. zIIP for database and encryption, zAAP is basically a java VM in hardware. IBM is big, and they have specialty fabs to make silicon for specialty mainframes. Yeah, they are expensive but worth it for some applications.
  
  --
  Cool! Amazing Toys.
Libraries by Dynetrekk · 2010-07-15 07:59 · Score: 2, Insightful

I'm really interested in using GPGPU for my physics calculations. But you know - I don't want to learn Nvidia's low-level, proprietary (whateveritis) in order to do an addition or multiplication, which may or may not outperform the CPU version. What would be _really_ great is stuff like porting the standard "low-level numerics" libraries to the GPU: BLAS, LAPACK, FFTs, special functions, and whatnot - the building blocks for most numerical programs. LAPACK+BLAS you already get in multicore versions, and there's no extra work on my part to use all cores on my PC. Please, computer geeks (i.e. more computer geek than myself), let me have the same on the GPU. When that happens, we can all buy Nvidia HotShit gaming cards and get research done. Until then, GPGPU is for the superdupergeeks.
1. Re:Libraries by brian_tanner · 2010-07-15 08:13 · Score: 3, Informative
  
  It's not free, unfortunately. I briefly looked into using it but got distracted by something shiny (maybe trying to finish my thesis...)
  
  CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.
  http://www.culatools.com/
2. Re:Libraries by Anonymous Coward · 2010-07-15 08:40 · Score: 2, Informative
  
  It's not as complete as CULA, but for free there is also MAGMA. Also, nVidia implements a CUDA-accelerated BLAS (CUBLAS) which is free.
  As far as OpenCL goes, I don't think there has been much in terms of a good BLAS made yet. The compilers are still sketchy (especially for ATI GPUs), and the performance is lacking on nVidia GPUs compared to CUDA.
3. Re:Libraries by ihuntrocks · 2010-07-15 15:40 · Score: 1
  
  I know I posted this like a little bit above, but this sounds like something you might be looking for. Any card with the PowerXCell setup. http://www.fixstars.com/en/products/gigaaccel180/features.html If you check under the specs section, you'll see that BLAS, LAPACK, FFT, and several other numeric libraries are supported. Also, GCC can target Cell. All around, not a bad setup for physics modeling.
  
  --
  Randimal: AT-CG-CG-AT-CG-AT-AT-CG-CG-AT-AT-CG-AT-CG-CG-AT-CG-AT-AT-CG-AT-CG-CG-AT-AT-CG-CG-AT-CG-AT-AT-CG
4. Re:Libraries by Anonymous Coward · 2010-07-15 17:34 · Score: 0
  
  Also if you only require BLAS, it's shipped freely with CUDA (cuBLAS)
5. Re:Libraries by guruevi · 2010-07-15 23:15 · Score: 2, Informative
  
  The CUDA dev kit includes libraries and examples for BLAS (CUBLAS) and FFT, several LAPACK routines have been implemented in several commercial packages (Jacket, CULA) and free software (MAGMA).
  The OpenCL implementation in Mac OS X has FFT and there are libraries for BLAS (from sourceforge) and MAGMA gives you some type of LAPACK implementation.
  I work with HPC systems based on nVIDIA GPU's in a research environment - it's still a lot of work (as all research/cluster programs are) but it's certainly doable and can most certainly accelerate some calculations but it depends highly on the application and even more so on the coder.
  
  --
  Custom electronics and digital signage for your business: www.evcircuits.com
IIS 3D by curado · 2010-07-15 08:14 · Score: 2, Interesting

So.. webpages will soon be available in 3D with anti-aliasing and realistic shading?
1. Re:IIS 3D by Enderandrew · 2010-07-15 09:32 · Score: 1
  
  Yes, actually. IE9 uses Direct2D, DirectWrite, and your graphics card to render fonts smoother and faster. Firefox has a similar project in the works.
  
  --
  http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
Wouldn't a DSP do better? by 91degrees · 2010-07-15 08:18 · Score: 2, Interesting

So why a GPU rather than a dedicated DSP? Seems they do pretty much the same thing except a GPU is optimised for graphics. DSPs offer 32 or even 64 bit integers, have had 64 bit floats for a while now, allow more flexible memory write positions, and can use the previous results of adjacent values in calculations.
1. Re:Wouldn't a DSP do better? by pwnies · 2010-07-15 08:34 · Score: 2, Informative
  
  Price. GPUs are being mass produced. Why create a separate market that only has the DSP in it (even if the technology is already present and utilized by GPUs) for the relatively small amount of servers that will be using them?
Crysis 2... by drc003 · 2010-07-15 08:25 · Score: 2, Funny

...coming soon to a server farm near you!
1. Re:Crysis 2... by JorgeM · 2010-07-15 08:35 · Score: 2, Interesting
  
  I'd love this, actually. My geek fantasy is to be able to run my gaming rig in a VM on a server with a high end GPU which is located in the basement. On my desk in the living room would be a silent, tiny thin client. Additionally, I would have a laptop thin client that I could take out onto the patio.
  On a larger scale, think Steam but with the game running on a server in a datacenter somewhere which would eliminate the need for hardware on the user end.
2. Re:Crysis 2... by drc003 · 2010-07-15 08:41 · Score: 1
  
  I like the way you think. In fact now I'm all excited at th.......ahhhhhhhhhooohhhhhhhhhhhh. Oops.
3. Re:Crysis 2... by Dalambertian · 2010-07-15 08:59 · Score: 1
  
  Sacrificing all my mod points to say this, but a friend of mine did this with his PS3 so he could play remotely using a PSP. Also, check out OnLive for a pretty slick implementation of gaming in the cloud.
4. Re:Crysis 2... by SleazyRidr · 2010-07-15 09:07 · Score: 1
  
  +1 overinformative.
  
  --
  Is 1563649 a prime number?
RemoteFX by JorgeM · 2010-07-15 08:26 · Score: 2, Interesting

No mention of Microsoft's RemoteFX coming in Windows Server 2008 R2 SP1? RemoteFX uses the server GPU for compression and to provide 3D capabilities to the desktop VMs.
Any company large enough for a datacenter is looking at VDI, and RemoteFX is going to be supported by all of the VDI providers except VMware. VDI, not relatively niche case massive calculations, will put GPUs in the datacenter.
How much number-crunching is your server doing? by Animats · 2010-07-15 08:40 · Score: 1

If your data center is running stochastic tests, trying scenarios on derivative securities, it's a big win. If it's serving pages with PHP, zero win.
There are many useful ways to use a GPU. Machine learning. Computer vision. Finite element analysis. Audio processing. But those aren't things most people are doing. If your problem can be expressed well in MATLAB, a GPU can probably accelerate it. MATLAB connections to GPUs are becoming popular. They're badly needed; MATLAB is widely used in engineering and scientific work, but it's not as fast as it should be.
1. Re:How much number-crunching is your server doing? by Anonymous Coward · 2010-07-15 08:56 · Score: 0
  
  depends how you're serving up those PHP pages. It might be a big win if you're doing lots of SSL connections.
  Encryption requires calculations that can benefit from faster math processors...
2. Re:How much number-crunching is your server doing? by smallfries · 2010-07-15 09:03 · Score: 1
  
  But it is not the same kind of maths. Most GPUs support very fast use of single-precision floats. The asymmetric crypto that you use to establish your SSL connection uses very large integers, and the AES that encrypts the stream operates in a finite field. Neither can be executed efficiently on a GPU.
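  To make "operates in a finite field" concrete, here is a small sketch of the byte multiplication AES uses, in GF(2^8) with the reduction polynomial from the AES spec. The function name is mine, and this is only an illustration of the arithmetic, not a usable cipher and not a claim about how any GPU implements it.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8) using the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a      # "addition" in GF(2^8) is XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= 0x11B       # reduce modulo the AES polynomial
    return result


# Worked example from the AES specification (FIPS 197): {57} * {83} = {c1}
print(hex(gf256_mul(0x57, 0x83)))
```

  Note there is not a single floating-point operation in there: it is all shifts, XORs and conditionals on integer bytes, which is the mismatch with single-precision float hardware being pointed out.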
  
  --
  Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
3. Re:How much number-crunching is your server doing? by ceoyoyo · 2010-07-15 09:50 · Score: 1
  
  Many people will be doing those things going ahead: all forms of machine learning. The obvious example is natural language processing for your web page.
4. Re:How much number-crunching is your server doing? by Jeremy+Erwin · 2010-07-15 14:06 · Score: 1
  
  CUDA compatible GPU as an efficient hardware accelerator for AES Cryptography It's from 2007, so perhaps the bugs have been ironed out.
5. Re:How much number-crunching is your server doing? by smallfries · 2010-07-15 19:30 · Score: 2, Informative
  
  No, it's the difference between "efficiency" and what is claimed as "efficient" to get a paper published. That's a really bad citation for AES on GPUs as there is a line of prior work going back to Cook and Cryptographics. In fact that paper is a classic example of getting something into the literature that has already been done. The authors have submitted it to an unrelated conference and failed to cite the relevant work.
  If we look at their best figures then throw away the 15x claimed speedup as it doesn't consider memory transfer costs. The 5x speedup is more realistic. The GPU that they use (8800 GTX) has 128 stream processors running at 1.35GHz. The comparison is a PIV running at 3GHz. Roughly speaking we can compare the cycles taken on each platform as a measure of the work done. The graphics card stream processors perform 57x more clock cycles.
  The central workload in AES for high-performance is completely memory bound. The cycles are just used to stage results from memory and perform XOR instructions. So the stream processors only execute the code 5x quicker with 57x more clocks and a huge memory bandwidth advantage that I can't be bothered to look up.
  So no, 10x less output per clock is not "efficient" in my book. But if you publish your paper in a crappy unrelated conference then you will get away with it.
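  The cycle comparison above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the comment (128 stream processors at 1.35 GHz, one 3 GHz CPU, a 5x speedup); the function names are made up for illustration and nothing here is measured.

```python
# Figures quoted in the comment above; nothing here is measured.
GPU_STREAM_PROCESSORS = 128   # 8800 GTX
GPU_CLOCK_GHZ = 1.35
CPU_CLOCK_GHZ = 3.0           # the PIV used as the baseline
REALISTIC_SPEEDUP = 5.0       # the figure the comment treats as realistic


def clock_ratio(units, unit_ghz, cpu_ghz):
    """Aggregate GPU clock cycles issued per CPU clock cycle."""
    return units * unit_ghz / cpu_ghz


def per_clock_efficiency(speedup, ratio):
    """GPU output per clock relative to CPU output per clock."""
    return speedup / ratio


ratio = clock_ratio(GPU_STREAM_PROCESSORS, GPU_CLOCK_GHZ, CPU_CLOCK_GHZ)
efficiency = per_clock_efficiency(REALISTIC_SPEEDUP, ratio)

print(f"GPU issues {ratio:.1f}x the clock cycles")
print(f"output per clock is {1 / efficiency:.1f}x lower on the GPU")
```

  With these inputs the ratio comes out a little under 58, so the "10x less output per clock" figure is, if anything, slightly generous to the GPU.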
  
  --
  Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Parallel Pr0n by tedgyz · 2010-07-15 08:44 · Score: 1

There's always an application for that.

--
"No matter where you go, there you are." -- Buckaroo Banzai
Modern GPUs, for all their hype, are just DSPs by pslam · 2010-07-15 08:52 · Score: 3, Interesting

I could almost EOM that. They're massively parallel, deeply pipelined DSPs. This is why people have trouble with their programming model.
The only difference here is the arrays we're dealing with are 2D and the number of threads is huge (100s-1000s). But each pipe is just a DSP.
OpenCL and the like are basically revealing these chips for what they really are, and the more general purpose they try to make them, the more they resemble a conventional, if massively parallel, array of DSPs.
There's a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?" Well, it always boils down to fundamental complexities in design, and those boil down to the laws of physics. The only way you can get things running this parallel and this fast is to mess with the programming model. People need to learn to deal with it, because all programming is going to end up heading this way.
1. Re:Modern GPUs, for all their hype, are just DSPs by 91degrees · 2010-07-15 10:07 · Score: 1
  
  Well, yes, there's not a lot in it and more and more has been handed over to more general purpose hardware. Still, there can't be a lot of use for depth buffer handling and caching (since for a lot of applications memory will be accessed perfectly linearly), or rasterisation, or texture filtering, and I'd have thought there would be some use for the slight extra flexibility from a DSP. Granted, all I can think of is that you have addressable write operations but I'm sure there's more.
  
  Is the specialised 3d hardware really such a tiny part of a chip these days that it doesn't significantly affect the price?
2. Re:Modern GPUs, for all their hype, are just DSPs by pclminion · 2010-07-15 10:21 · Score: 2, Interesting
  
  There's a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?"
  Why should they? Just because not every programmer on the planet can do it doesn't mean there's nobody who can do it. There are plenty of people who can. Find one of these people and hire them. Problem solved.
  Most programmers can't even write single-threaded assembly code any more. If you need some assembly code written, you hire somebody who knows how to do it. I don't see how this is any different.
  As far as whether all programming will head this direction eventually, I don't think so. Most computational tasks are data-bound, and throughput is enhanced by improving the data backends, which are usually handled by third parties. We already don't know how the hell our own systems work. For the people who really need this kind of thing, you need to go out and learn it or find somebody who knows it. Expecting that the whole world can do it is crazy thinking.
3. Re:Modern GPUs, for all their hype, are just DSPs by CodeBuster · 2010-07-15 18:06 · Score: 1
  
  Most programmers can't even write single-threaded assembly code any more.
  The reason that we don't is that modern optimizing compilers have made doing so almost a complete waste of time except in very highly specialized or niche applications. I would liken it to chess playing AIs: the greatest human grand masters can still defeat them with effort but the rest of us will get our butts handed to us by the AI every time. To quote one fictional AI, "the only winning move is not to play".
  
  As far as whether all programming will head this direction eventually, I don't think so.
  I don't think so either. If anything programming is becoming abstract and virtual to the point where the underlying hardware is a meaningless detail handled by the JITs (just in time compilers) and HALs (hardware abstraction layers). The hardware is so ridiculously cheap now that programmer time is far better spent writing elegant and abstract code that will run on anything that supports the VM rather than hand-optimizing for a particular piece of hardware. In fact, I would argue that unless you are writing a device driver, concerning oneself with hardware directly in software is a good indication of code smell.
  
  Expecting that the whole world can do it is crazy thinking.
  Indeed it is.
4. Re:Modern GPUs, for all their hype, are just DSPs by marcosdumay · 2010-07-16 02:08 · Score: 1
  
  "The reason that we don't is that modern optimizing compilers have made doing so almost a complete waste of time except in very highly specialized or niche applications."
  
  Most programmers don't know assembly because high-level compilers made things simpler, and that increased the number of programmers out there by orders of magnitude. Most of those extra programmers were drawn from a pool of people who (lacking education or just intrinsic capacity/motivation) couldn't learn to use machine code. OK, its not being needed meant that some people who could have learned it simply didn't make the effort, but those are a minority.
  
  "The hardware is so ridiculously cheap now that programmer time is far better spent writing elegant and abstract code that will run on anything that supports the VM rather than hand-optimizing for a particular piece of hardware."
  
  Hardware being cheap doesn't make programmer time better spent not optimizing things. Your argument is true for some applications, and completely false on others.
  
  --
  Rethinking email
5. Re:Modern GPUs, for all their hype, are just DSPs by CodeBuster · 2010-07-18 04:01 · Score: 1
  
  Your argument is true for some applications, and completely false on others.
  Yes, but that does not make it a 50/50 proposition here in the real world. The vast majority of us who are paid to do software development work use languages and write programs where the hardware, especially with all of the virtualization these days, really doesn't matter. Most of us are engaged in writing business applications, not avionics, device drivers, or embedded controllers. The largest segment of the market where hardware performance optimization is still important is probably the games market, which is still a minority of working programmers. So what I said was mostly true for the clear majority of real world software development jobs.
  Now, does this mean that your programs should not concern themselves with efficiency at all? Of course not. One should still try to avoid nested loops, using bad sorts and other well known software faux pas that sap performance no matter what hardware one is using. My point was that most of the efficiency gains that are worth seeking should be sought in the more abstract realm of the software itself, at least at first, before jumping into hardware optimizations. Could bubble sort with optimized hardware be faster than quicksort? I suppose it could, but most programmers would consider it 'ugly' that the more obvious software optimizations, such as a better sorting algorithm, were not pursued first before time was spent optimizing for specialized or specific hardware.
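  To put a rough number on the bubble sort question: the sketch below estimates how much faster the hardware would have to be for bubble sort merely to tie an n log n sort. It counts comparisons only, uses the textbook n(n-1)/2 worst case for bubble sort and a bare n*log2(n) estimate for quicksort, and ignores constant factors and memory effects, so treat the result as an order-of-magnitude guide. The function names are invented for this example.

```python
import math


def bubble_sort_comparisons(n):
    """Worst-case comparisons for a plain bubble sort: n(n-1)/2."""
    return n * (n - 1) // 2


def nlogn_comparisons(n):
    """Rough n*log2(n) estimate for an average-case quicksort."""
    return n * math.log2(n)


def breakeven_hardware_speedup(n):
    """How much faster the hardware must be for bubble sort to tie."""
    return bubble_sort_comparisons(n) / nlogn_comparisons(n)


for n in (1_000, 1_000_000):
    print(n, round(breakeven_hardware_speedup(n)))
```

  The required speedup grows with the input size, which is why the better algorithm is normally the first thing to fix.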
Why call the GPU a gaming chip? by wrightrocket · 2010-07-15 09:09 · Score: 1

It is a Graphics Processing Unit, not a Gaming Processing Unit. Sure, they are great for gaming, but also very useful for other types of 3D and 2D rendering of graphics.
1. Re:Why call the GPU a gaming chip? by Urkki · 2010-07-15 10:48 · Score: 1
  
  It is a Graphics Processing Unit, not a Gaming Processing Unit. Sure, they are great for gaming, but also very useful for other types of 3D and 2D rendering of graphics.
  But the top bang-for-the-buck chips are designed for games. They have architecture (number of pipelines etc) designed to maximize performance in typical game use, at a framerate needed for games. In other words, they're gaming chips, just like eg. PS3 is a game console, no matter if it can be used to build a cluster for number crunching.
Huh... by geemon · 2010-07-15 09:11 · Score: 1

Saw the title of this article and wondered "how will Las Vegas casinos make the move to have all of my gaming chips put onto a server."
Major benefits seem overlooked by nickdwaters · 2010-07-15 10:23 · Score: 1

While it is true that parallelization does not necessarily help a single program run more efficiently or faster, multi-CPU systems do allow more programs to run concurrently. In a major corporate context, there are thousands of jobs running at any given time. The more effective CPUs (and memory) available, the better for keeping costs down.
1. Re:Major benefits seem overlooked by Anonymous Coward · 2010-07-15 16:29 · Score: 0
  
  That is not a major benefit of this type of computing at all. The article is about HPC processing, where you can achieve a high degree of parallelism for certain algorithms; this requires highly specialised programming and analysis of algorithms to achieve benefits. It has little to no benefit whatsoever for general processing tasks, and if anything would be extremely expensive to implement with the discussed tech.
You must be salivating about OnLive, then by rsborg · 2010-07-15 10:54 · Score: 1

From wikipedia:

OnLive is a gaming-on-demand platform, announced in 2009[3] and launched in the United States in June 2010. The service is a gaming equivalent of cloud computing: the game is synchronized, rendered, and stored on a remote server and delivered via the Internet.
Sounds very interesting to me, as I'm pretty sick of upgrade treadmills. OnLive would probably also wipe out hacked-client based cheating (though bots and such might still be doable). It would also allow bleeding-edge games to be enjoyed by those without the best hardware, increasing adoption rates for those types of games.

--
Make sure everyone's vote counts: Verified Voting
There are also easy problems by dbIII · 2010-07-15 13:21 · Score: 1

Many algorithms just won't have much improvement with multi-threading.
Yes, but there are also many that will. I work with geophysicists, and a lot of what they do really involves applying the same filter to 25 million or so audio traces. Such tasks get split arbitrarily over clusters at any point of those millions of traces. One thread per trace is certainly possible because that's how it works normally anyway as independent operations in series. Once you get to output the results some theoretical 25 million CPU machine is going to have bottlenecks elsewhere however and not give much benefit over something a lot smaller - that's where the hard problems come in.
Also, working with images and video brings up a lot of other parallel problems that even those of us that only dabble in parallel processing can get decent results with.
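
The "same filter on every trace" pattern can be sketched as below. The moving-average filter is a hypothetical stand-in, not what geophysicists actually run, and the thread pool is there only to show the decomposition: in standard CPython, threads will not speed up CPU-bound work like this, so a real job would use processes, a cluster scheduler, or a GPU kernel instead.

```python
from concurrent.futures import ThreadPoolExecutor


def moving_average(trace, width=3):
    """Hypothetical stand-in filter: trailing moving average of one trace."""
    out = []
    for i in range(len(trace)):
        window = trace[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def filter_all(traces, workers=4):
    """Apply the same filter to every trace; traces never interact."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the work can be
        # split at any trace boundary without changing the output.
        return list(pool.map(moving_average, traces))


traces = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
filtered = filter_all(traces)
print(filtered)
```

Because no trace depends on another, the compute step needs no coordination at all; the bottleneck mentioned above only appears when the results are gathered and written out.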
Car Analogy by Anonymous Coward · 2010-07-15 14:04 · Score: 0

So a car analogy would be: CPUs are normal cars, and GPUs are drag racers... high speed and no brakes?
GPU apps are pretty specific... by bored · 2010-07-15 15:13 · Score: 2, Insightful

I've done a little CUDA programming, and I've yet to find significant speedups doing it. Every single time, some limitation in the arch keeps it from running well. My last little project ran about 30x faster on the GPU than the CPU; the only problem was that the overhead of getting the data to the GPU, plus the computation, plus the overhead of getting it back was roughly equal to the time it took to just dedicate a CPU.
I was really excited about AES on the GPU too, until it turned out to be about 5% faster than my CPU.
Now if the GPU was designed more as a proper coprocessor (ala early x87, or early Weitek) and integrated into the memory hierarchy better (put the funky texture ram and such off to the side) some of my problems might go away.
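
The transfer-overhead effect described above is easy to model. In this sketch only the 30x kernel speedup comes from the comment; the CPU time and the transfer time are assumed values chosen to reproduce the "roughly equal" outcome, and the function name is made up.

```python
def end_to_end_speedup(cpu_time, kernel_speedup, transfer_time):
    """Speedup once host<->device copies are added to the GPU kernel time."""
    gpu_total = transfer_time + cpu_time / kernel_speedup
    return cpu_time / gpu_total


cpu_time = 30.0        # assumed: seconds on one dedicated CPU core
kernel_speedup = 30.0  # "about 30x faster on the GPU"
transfer_time = 29.0   # assumed: copies cost nearly a full CPU run

print(end_to_end_speedup(cpu_time, kernel_speedup, transfer_time))
```

However fast the kernel gets, the end-to-end speedup in this model can never exceed cpu_time / transfer_time, which is the argument for tighter integration into the memory hierarchy.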
Boilerplate APIs by theunixman · 2010-07-15 17:12 · Score: 1

Even better would be a language that didn't need horrendous amounts of crappy boilerplate code for every API.
FPGAsW by Anonymous Coward · 2010-07-15 17:17 · Score: 0

What about FPGAs ? You could install FPGAs in whole server farms, thus making it cheaper and also upgradable over software.
Don't want a GPU, but want something else? "Install" it without opening the cabinet!
How about other rooms? by dushkin · 2010-07-15 19:30 · Score: 1

I for one wouldn't mind gaming chips moving to the bedroom, if you know what I mean.

--
o hai
Erlang ? by Anonymous Coward · 2010-07-15 20:50 · Score: 0

Erlang has good support for threads and parallelism, so I think it would be a great idea to add GPU support to Erlang. Erlang offers a natural way to write parallel applications in a functional programming style, and it implements the notion of "green threads" very well. Does anyone see the prospects?
Well by Anonymous Coward · 2010-07-15 21:50 · Score: 0

We've already known about what we could have done.
true parallelism by Anonymous Coward · 2010-07-16 04:44 · Score: 0

We've been writing code to use multiple cores for some time already. The trick is to (also) avoid locking, because locking generates serialization. Virtualization has its limits, since any form of communication between the virtual machines (and their processes) becomes expensive.
I suggest using new techniques such as atomic variables (atomic built-ins) and locking for whatever needs to be shared, and then divide to conquer! The easiest thing is to delegate transactions to specific threads, so that the data that pertains to the transaction is kept in one place -- doesn't move around. I know, easier said than done.
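
A minimal sketch of the "delegate transactions to specific threads" idea follows. The account/amount transactions, the modulo sharding rule, and all names are assumptions made for illustration. Each worker is the sole owner of its shard, so the data itself needs no lock; the only synchronized point is the hand-off queue, which does use a lock internally.

```python
import queue
import threading

NUM_WORKERS = 4


def worker(inbox, balances):
    """Sole owner of `balances`; no other thread touches it while running."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel: no more work
            break
        account, amount = item
        balances[account] = balances.get(account, 0) + amount


def run(transactions):
    inboxes = [queue.Queue() for _ in range(NUM_WORKERS)]
    shards = [{} for _ in range(NUM_WORKERS)]
    threads = [
        threading.Thread(target=worker, args=(inboxes[i], shards[i]))
        for i in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    for account, amount in transactions:
        # The same account always goes to the same thread.
        inboxes[account % NUM_WORKERS].put((account, amount))
    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()
    # Shards hold disjoint accounts, so merging cannot overwrite anything.
    merged = {}
    for shard in shards:
        merged.update(shard)
    return merged


totals = run([(1, 10), (2, 5), (1, -3), (6, 7), (2, 5)])
print(totals)
```

Since every account maps to exactly one thread, updates to an account are applied in submission order without any locking around the balances.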
Parallel computation libraries by psilambda · 2010-07-18 05:15 · Score: 1

If you wish for your computations to be parallel at a level higher than algorithm steps (i.e. you can build libraries upon libraries that perform efficient parallel computation throughout the layers of libraries), then neither the CUDA driver nor the CUDA runtime API (nor OpenCL or DirectCompute) is very good. An example of this for CUDA is that even usage of the Fermi concurrent kernel execution feature is not generally possible using all (or even very many) CUDA kernels in a program by just using the CUDA APIs.
MPI (message passing interface) gives parallel computation at the clustering level and the Kappa Library gives you this at the library component level. If somebody knows about something other than MPI or Kappa that does this and is available for general use, I would be interested to hear about it.