Why 'Gaming' Chips Are Moving Into the Server Room

Posted by timothy on Thursday July 15, 2010 @07:37AM from the expense-report-manipulation-++ dept.

Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high-level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"

18 of 137 comments

A whole new level of parallelism by TwiztidK · 2010-07-15 07:42 · Score: 4, Insightful

I've heard that many programmers have issues coding for 2- and 4-core processors. I'd like to see how they'll adapt to running "hundreds of threads" in parallel.

--
Sent from my iPhone 5
1. Re:A whole new level of parallelism by morcego · 2010-07-15 07:50 · Score: 3, Insightful
  
  This is just like programming for a computer cluster ... after a fashion.
  Anyone used to doing both should have no problem with this.
  I'm anything but a high end programmer (I mostly only code for myself), and I have written plenty of code that runs with 7-10 threads. Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.
  
  --
  morcego
2. Re:A whole new level of parallelism by Sax+Maniac · 2010-07-15 07:51 · Score: 3, Insightful
  
  This isn't hundreds of threads that can run arbitrary code paths like on a CPU. You have to totally redesign your code, or already have parallel code in which a number of threads all do the same thing at the same time, just on different data.
  The threads all run in lockstep, as in, all the threads better be at the same PC at the same time. If you run into a branch in the code, then you lose your parallelism, as the divergent threads are frozen until they come back together.
  I'm not a big thread programmer, but I do work on threading tools. Most of the problems with threads seem to come from threads taking totally different code paths, and the unpredictable scheduling interactions that arise between them. GPU coding is a lot more tightly controlled.
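  A rough way to picture the lockstep cost described above is the toy Python model below. It is only a sketch: the group size of 16 and the 10-step branch lengths are made-up numbers, and `lockstep_steps` is a hypothetical helper, not part of any GPU API.

```python
def lockstep_steps(takes_branch, then_cost, else_cost):
    """Count issue steps for one lockstep group that hits an if/else.

    takes_branch: one bool per thread (True = thread takes the 'then' side).
    A side costs its full length if at least one thread needs it; threads
    on the other side sit idle while it runs.
    """
    steps = 0
    if any(takes_branch):
        steps += then_cost
    if not all(takes_branch):
        steps += else_cost
    return steps


uniform = [True] * 16               # every thread agrees on the branch
divergent = [True] * 15 + [False]   # a single thread goes the other way

print(lockstep_steps(uniform, then_cost=10, else_cost=10))    # 10
print(lockstep_steps(divergent, then_cost=10, else_cost=10))  # 20
```

  In this model one disagreeing thread doubles the time for the whole group, which is the "divergent threads are frozen" effect in miniature.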
  
  --
  I can explanate how to administrate your network. You must configurate and segmentate it, so it can computate.
3. Re:A whole new level of parallelism by Nadaka · 2010-07-15 08:04 · Score: 4, Insightful
  
  No it isn't. That you think so just shows how much you still have left to learn.
  I am not a high-end programmer either. But I have two degrees in the subject and have been working professionally in the field for years, including optimization and parallelization.
  Many algorithms just won't have much improvement with multi-threading.
  Many will even perform more poorly due to data contention and the overhead of context switches and creating threads.
  Many algorithms just can not be converted to a format that will work within the restrictions of GPGPU computing at all.
  The stream architecture of modern GPUs works radically differently than a conventional CPU.
  It is not as simple as scaling conventional multi-threading up to thousands of threads.
  Certain things that you are used to doing on a normal processor have an insane cost in GPU hardware.
  For instance, the if statement. Until recently OpenCL and CUDA didn't allow branching. Now they do, but they incur such a huge penalty in cycles that it just isn't worth it.
4. Re:A whole new level of parallelism by Dynetrekk · 2010-07-15 08:08 · Score: 5, Insightful
  
  Believe me, when you change the way you think about how an algorithm works, it doesn't matter if you are using 3 or 10000 processors.
  Have you ever read up on Amdahl's law?
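  For anyone who has not, Amdahl's law says the speedup on N processors is 1 / (s + p/N), where p is the fraction of the work that can run in parallel and s = 1 - p is the serial remainder. A minimal Python sketch, assuming a program that is 95% parallel (an illustrative figure, not one from this thread):

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


for n in (3, 10000):
    print(n, round(amdahl_speedup(0.95, n), 2))
# 3 2.73
# 10000 19.96
```

  Under that assumption the serial 5% caps the speedup below 20x no matter how many processors are added, so the processor count matters a great deal.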
5. Re:A whole new level of parallelism by pushing-robot · 2010-07-15 08:20 · Score: 3, Funny
  
  Microsoft must be doing a bang-up job then, because when I'm in Windows it doesn't matter if I'm using 3 or 10000 processors.
  
  --
  How can I believe you when you tell me what I don't want to hear?
6. Re:A whole new level of parallelism by jgagnon · 2010-07-15 08:23 · Score: 3, Interesting
  
  The problem with "programming for multiple cores/CPUs/threads" is that it is done in very different ways between languages, operating systems, and APIs. There is no such thing as a "standard for multi-thread programming". All the variants share some concepts in common but their implementations are mostly very different from each other. No amount of schooling can fully prepare you for this diversity.
  
  --
  Remember to maintain your supply of /facepalm oil to prevent chafing.
7. Re:A whole new level of parallelism by Chris+Burke · 2010-07-15 08:33 · Score: 4, Informative
  
  Programmers of Server applications are already used to multithreading, and they've been able to make good use of systems with large numbers of processors on them even before the advent of virtualization.
  But don't pay too much attention to the word "Server". Yes the machines that they're talking about are in the segment of the market referred to as "servers", as distinct from "desktops" or "mobile". But the target of GPU-based computing isn't "Servers" in the sense of the tasks you normally think of -- web servers, database servers, etc.
  The real target is mentioned in the article, and it's HPC, aka scientific computing. Normal server apps are integer code, and depend more on high memory bandwidth and I/O, which GPGPU doesn't really address. HPC wants that stuff too, but they also want floating point performance. As much floating point math performance as you can possibly give them. And GPUs are way beyond what CPUs can provide in that regard. Plus a lot of HPC applications are easier to parallelize than even the traditional server codes, though not all fall in the "embarrassingly parallel" category.
  There will be a few growing pains, but once APIs get straightened out and programmers get used to it (which shouldn't take too long for the ones writing HPC code), this is going to be a huge win for scientific computing.
  
  --
  
  The enemies of Democracy are
8. Re:A whole new level of parallelism by Hodapp · 2010-07-15 09:24 · Score: 3, Informative
  
  I am one such programmer. Yet I also coded for an Nvidia Tesla C1060 board and found it much more straightforward to handle several thousand threads at once.
  Not all types of threads are created equal. I usually explain CUDA to people as the "Zerg Rush" model of computing - instead of a couple, well-behaved, intelligent threads that try to be polite to each other and clean up their own messes, you throw a horde of a thousand little vicious, stupid threads at the problem all at once, and rely on some overlord to keep them in line.
  Most of the guides explained it as, "Flops are free, bandwidth is expensive." This board had a 384- or 512-bit-wide memory bus with very high latency, and the reason you throw that many threads at it is to let the hardware cover up the latency: it can merge a huge number of memory reads/writes into one operation, and as soon as a thread is waiting on memory I/O it can swap another thread into that same SP and let it compute. If memory serves me, the board was divided into blocks of 8 scalar processors (each block had some scratchpad memory that could be accessed almost as fast as a register), and you wrote groups of 16 threads that ran in lock-step on that processor in two rounds. No recursion was allowed, and if one thread branched, the others would just wait around until it reached the same point.
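  The latency-hiding argument can be put into numbers with a deliberately simple model. This is a sketch, not a description of the real hardware: the cycle counts are invented, and it assumes every thread alternates a short burst of compute with one long memory wait while the scheduler switches threads at no cost.

```python
import math


def sp_utilization(resident_threads, compute_cycles, memory_latency):
    """Fraction of cycles one scalar processor spends computing (toy model)."""
    per_thread = compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, resident_threads * per_thread)


def threads_to_hide_latency(compute_cycles, memory_latency):
    """Smallest thread count that keeps the processor busy in this model."""
    return math.ceil((compute_cycles + memory_latency) / compute_cycles)


# Hypothetical figures: 4 cycles of math per 400-cycle memory wait.
print(sp_utilization(1, 4, 400))        # about 0.0099
print(threads_to_hide_latency(4, 400))  # 101
```

  With those assumed figures a single thread leaves the processor idle about 99% of the time, and it takes on the order of a hundred resident threads per processor to cover the wait.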
  Sure, that's a bit complex to optimize for, but it beats the hell out of conventional threading while trying to optimize for x86 SIMD. And if you manage to write it so it runs well on CUDA, it generally will scale effortlessly to whatever card you throw it at.
  It's looking like OpenCL won't be much different, but I have yet to try it. I'm kind of eager, since apparently AMD/ATI's current cards, for the money, have a bit more raw power than Nvidia's.
9. Re:A whole new level of parallelism by psilambda · 2010-07-15 13:19 · Score: 3, Interesting
  
  The article and everybody else are ignoring one large, valid use of GPUs in the data center. Whether you call it business intelligence or OLAP, it needs to be in the data center and it needs some serious number crunching. There is not as much difference between this and scientific number crunching as most people might think. I have crunched numbers for financials at a major multinational, and I had the privilege of being the first to process the first full genome (complete genetic sequence, terabytes of data) for a single individual; the genomic analysis was actually much more integer-based than the financials. Based on my experience with both, I created the Kappa library for doing CUDA or OpenMP analysis in a data center, whether for business or scientific work.
10. Re:A whole new level of parallelism by David+Greene · 2010-07-15 17:09 · Score: 4, Interesting
  
  The stream architecture of modern GPUs works radically differently than a conventional CPU.
  True if the comparison is to a commodity scalar CPU.
  
  It is not as simple as scaling conventional multi-threading up to thousands of threads.
  True. Many algorithms will not map well to the architecture. However, many others will map extremely well. Many scientific codes have been tuned over the decades to exploit high degrees of parallelism. Often the small data sets are the primary bottleneck. Strong scaling is hard, weak scaling is relatively easy.
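  The strong/weak distinction can be made concrete with the two textbook formulas: Amdahl's law for a fixed problem size (strong scaling) and Gustafson's law for a problem that grows with the processor count (weak scaling). The sketch below assumes a 95% parallel fraction and 1000 processors, both purely illustrative.

```python
def strong_scaling_speedup(parallel_fraction, processors):
    """Amdahl's law: the problem size stays fixed as processors are added."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def weak_scaling_speedup(parallel_fraction, processors):
    """Gustafson's law: the parallel part of the work grows with processors."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * processors


n = 1000
print(round(strong_scaling_speedup(0.95, n), 2))  # 19.63
print(round(weak_scaling_speedup(0.95, n), 2))    # 950.05
```

  With a small, fixed data set the serial part dominates quickly, while a workload that grows with the machine keeps nearly all processors useful.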
  
  Certain things that you are used to doing on a normal processor have an insane cost in GPU hardware.
  In a sense. These are not scalar CPUs and traditional scalar optimization, while important, won't utilize the machine well. I can't think of any particular operation that's greatly slower than on a conventional CPU, provided one uses the programming model correctly (and some codes don't map well to that model).
  
  For instance, the if statement.
  No. Branching works perfectly fine if you program the GPU as a vector machine. The reason branches within a warp (using NVIDIA terminology) are expensive is simply because a warp is really a vector. The GPU vendors just don't want to tell you that because either they fear being tied to some perceived historical baggage with that term or they want to convince you they're doing something really new. GPUs are interesting, but they're really just threaded vector processors. Don't misunderstand me, though, it's a quite interesting architecture to work with!
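  Treating a branch as a vector operation usually means turning the condition into a per-lane mask and selecting between the two results. The plain-Python sketch below illustrates the idea only; `vector_if_else` is a hypothetical helper, and real hardware masks off inactive lanes rather than literally computing both sides for every lane as this toy version does.

```python
def vector_if_else(xs, predicate, then_fn, else_fn):
    """Apply an if/else across all lanes of a vector using a mask."""
    mask = [predicate(x) for x in xs]     # one flag per lane
    then_vals = [then_fn(x) for x in xs]  # 'then' side, evaluated for all lanes
    else_vals = [else_fn(x) for x in xs]  # 'else' side, evaluated for all lanes
    # Per-lane select: the mask decides which result each lane keeps.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


# Negate negative inputs, square the rest.
result = vector_if_else([-2, -1, 0, 1, 2],
                        lambda x: x < 0,
                        lambda x: -x,
                        lambda x: x * x)
print(result)  # [2, 1, 0, 1, 4]
```

  Seen this way, the cost of a branch is the cost of issuing both sides for the whole vector, which is cheap when the sides are short and expensive when they are long.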
  
  --
Good luck with that by tedgyz · 2010-07-15 07:47 · Score: 3, Insightful

This is a long-standing issue. If programs don't just "magically" run faster, then count out 90% or more of the programs that could benefit from this.

--
"No matter where you go, there you are." -- Buckaroo Banzai
CUDA by Lord+Ender · 2010-07-15 07:48 · Score: 3, Informative

I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.
NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.

--
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
1. Re:CUDA by Rockoon · 2010-07-15 08:01 · Score: 4, Interesting
  
  Indeed. With CUDA, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API.
  
  There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.
  
  --
  "His name was James Damore."
Re:Yes, of course by Yvan256 · 2010-07-15 07:54 · Score: 5, Funny

Portal 2? It's something for our Web server. It adds more portals to access the internet.
Re:Libraries by brian_tanner · 2010-07-15 08:13 · Score: 3, Informative

It's not free, unfortunately. I briefly looked into using it but got distracted by something shiny (maybe trying to finish my thesis...)

CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.
http://www.culatools.com/
Re:OpenCL by Anonymous Coward · 2010-07-15 08:25 · Score: 3, Informative

Unfortunately, no. OpenCL does not map equally to different compute devices, and does not enforce uniformity of parallelism approaches. Code written in OpenCL for CPUs is not going to be fast on GPUs. Hell, OpenCL code written for ATI GPUs is not going to work well on nVidia GPUs.
Modern GPUs, for all their hype, are just DSPs by pslam · 2010-07-15 08:52 · Score: 3, Interesting

I could almost EOM that. They're massively parallel, deeply pipelined DSPs. This is why people have trouble with their programming model.
The only difference here is the arrays we're dealing with are 2D and the number of threads is huge (100s-1000s). But each pipe is just a DSP.
OpenCL and the like are basically revealing these chips for what they really are, and the more general purpose they try to make them, the more they resemble a conventional, if massively parallel, array of DSPs.
There are a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?" Well, it always boils down to fundamental complexities in design, and those boil down to the laws of physics. The only way you can get things running this parallel and this fast is to mess with the programming model. People need to learn to deal with it, because all programming is going to end up heading this way.