Why 'Gaming' Chips Are Moving Into the Server Room
Esther Schindler writes "After several years of trying, graphics processing units (GPUs) are beginning to win over the major server vendors. Dell and IBM are the first tier-one server vendors to adopt GPUs as server processors for high-performance computing (HPC). Here's a high-level view of the hardware change and what it might mean to your data center. (Hint: faster servers.) The article also addresses what it takes to write software for GPUs: 'Adopting GPU computing is not a drop-in task. You can't just add a few boards and let the processors do the rest, as when you add more CPUs. Some programming work has to be done, and it's not something that can be accomplished with a few libraries and lines of code.'"
I've heard that many programmers have issues coding for 2- and 4-core processors. I'd like to see how they'll adapt to running "hundreds of threads" in parallel.
Sent from my iPhone 5
This is a long-standing issue. If programs have to be rewritten instead of just "magically" running faster, then you can count out 90% or more of the programs that could benefit from this.
"No matter where you go, there you are." -- Buckaroo Banzai
The sysadmins need new machines with powerful GPUs, you know, for business purposes.
Oh and, they sell ERP software on Steam now, too, so we'll have to install that as well.
I was interested in CUDA until I learned that even the simplest of "hello world" apps is still quite complex and quite low-level.
NVidia needs to make the APIs and tools for CUDA programming simpler and more accessible, with solid support for higher-level languages. Once that happens, we could see adoption skyrocket.
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
Sounds like a perfect job for OpenCL. When a program is rewritten for OpenCL, you can just drop in CPUs or GPUs and they get used.
It doesn't take a few libraries and lines of code... It takes a SHITLOAD of libraries and lines of code! - Lone Starr
I'm really interested in using GPGPU for my physics calculations. But you know - I don't want to learn Nvidia's low-level, proprietary (whateveritis) in order to do an addition or multiplication, which may or may not outperform the CPU version. What would be _really_ great is stuff like porting the standard "low-level numerics" libraries to the GPU: BLAS, LAPACK, FFTs, special functions, and whatnot - the building blocks for most numerical programs. LAPACK+BLAS you already get in multicore versions, and there's no extra work on my part to use all cores on my PC. Please, computer geeks (i.e. more computer geek than myself), let me have the same on the GPU. When that happens, we can all buy Nvidia HotShit gaming cards and get research done. Until then, GPGPU is for the superdupergeeks.
I'm no expert, but from what I understand, it wouldn't be at all surprising. IBM has been regularly using their Power processors for supercomputers, and the architecture is (largely) the same. The Cell has some extra graphics-friendly floating-point units, but it's not entirely different from the CPUs IBM has been pushing for computation in the past. I'm not even sure if the extra stuff in the Cell is interesting in the supercomputing arena.
So.. webpages will soon be available in 3D with anti-aliasing and realistic shading?
So why a GPU rather than a dedicated DSP? Seems they do pretty much the same thing except a GPU is optimised for graphics. DSPs offer 32- or even 64-bit integers, have had 64-bit floats for a while now, allow more flexible memory write positions, and can use the previous results of adjacent values in calculations.
...coming soon to a server farm near you!
No mention of Microsoft's RemoteFX coming in Windows Server 2008 R2 SP1? RemoteFX uses the server GPU for compression and to provide 3D capabilities to the desktop VMs.
Any company large enough for a datacenter is looking at VDI, and RemoteFX is going to be supported by all of the VDI providers except VMware. VDI, not the relatively niche case of massive calculations, will put GPUs in the datacenter.
Not only that, but they posit that Microsoft's solution solves the issue of both Nvidia's proprietary-ness and the OpenCL board's "lack of action."
Fuck this article, I wish I could unclick on it.
Mod me down, my New Earth Global Warmingist friends!
I could almost EOM that. They're massively parallel, deeply pipelined DSPs. This is why people have trouble with their programming model.
The only difference here is the arrays we're dealing with are 2D and the number of threads is huge (100s-1000s). But each pipe is just a DSP.
OpenCL and the like are basically revealing these chips for what they really are, and the more general purpose they try to make them, the more they resemble a conventional, if massively parallel, array of DSPs.
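To make that programming model a bit more concrete, here is a rough sketch in plain Python (CPU code, not real GPU code) of the one-thread-per-element style being described: you write a small "kernel" for a single (x, y) position, and a launcher runs it over the whole 2D array. The names `brighten_kernel` and `launch_2d` are made up for illustration, and the launcher just loops where a GPU would run the invocations in parallel.

```python
def brighten_kernel(x, y, src, dst, gain):
    """Body of one hypothetical 'thread': handles exactly one element."""
    dst[y][x] = min(255, src[y][x] * gain)


def launch_2d(kernel, width, height, *args):
    """Stand-in for a GPU launch: one kernel invocation per (x, y).

    A real GPU would spread these invocations across its pipelines in
    parallel; this sketch simply loops, so it only shows the model.
    """
    for y in range(height):
        for x in range(width):
            kernel(x, y, *args)


# Tiny 3x2 "image" as a list of rows.
src = [[10, 100, 200],
       [50, 128, 255]]
dst = [[0] * 3 for _ in range(2)]

launch_2d(brighten_kernel, 3, 2, src, dst, 2)
print(dst)
```

The point of the sketch is that the kernel never sees a loop: it only knows its own coordinates, which is exactly the part people coming from ordinary sequential code tend to find awkward.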
There are a lot of comments on this subject along the lines of "Why couldn't they make it easier to program?" Well, it always boils down to fundamental complexities in design, and those boil down to the laws of physics. The only way you can get things running this parallel and this fast is to mess with the programming model. People need to learn to deal with it, because all programming is going to end up heading this way.
I've done a little CUDA programming, and I've yet to find significant speedups doing it. Every single time, some limitation in the arch keeps it from running well. My last little project ran about 30x faster on the GPU than on the CPU; the only problem was that the overhead of getting the data to the GPU, plus the computation, plus the overhead of getting the results back, was roughly equal to the time it took to just dedicate a CPU to it.
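For anyone who wants to see how a 30x kernel can end up as a wash, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simple model (GPU wall time = transfer time + CPU time / kernel speedup), and the function names and the 1.0 "time unit" for the CPU run are made up for illustration; only the 30x figure comes from the post above.

```python
def end_to_end_speedup(cpu_time, kernel_speedup, transfer_time):
    """Overall speedup once host<->device copies are counted.

    Assumed model: GPU wall time = transfers + CPU time / kernel speedup.
    """
    gpu_total = transfer_time + cpu_time / kernel_speedup
    return cpu_time / gpu_total


def break_even_transfer(cpu_time, kernel_speedup):
    """Transfer time at which the GPU run merely ties the CPU run."""
    return cpu_time * (1 - 1 / kernel_speedup)


cpu_time = 1.0        # assumed: CPU-only run time, arbitrary units
kernel_speedup = 30   # from the post: kernel alone is ~30x faster

limit = break_even_transfer(cpu_time, kernel_speedup)
print(f"break-even transfer time: {limit:.3f} of the CPU run")
print(f"speedup at break-even: "
      f"{end_to_end_speedup(cpu_time, kernel_speedup, limit):.2f}x")
print(f"speedup with zero transfer cost: "
      f"{end_to_end_speedup(cpu_time, kernel_speedup, 0.0):.0f}x")
```

Under this model, the copies would have to eat roughly 97% of the original CPU run time for a 30x kernel to come out even, which is consistent with the experience described above when the per-element work is small relative to the data moved.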
I was really excited about AES on the GPU too, until it turned out to be about 5% faster than my CPU.
Now if the GPU were designed more as a proper coprocessor (à la early x87, or early Weitek) and integrated into the memory hierarchy better (put the funky texture RAM and such off to the side), some of my problems might go away.
No, it's the difference between "efficiency" and what is claimed as "efficient" to get a paper published. That's a really bad citation for AES on GPUs as there is a line of prior work going back to Cook and Cryptographics. In fact that paper is a classic example of getting something into the literature that has already been done. The authors have submitted it to an unrelated conference and failed to cite the relevant work.
If we look at their best figures, then throw away the claimed 15x speedup, as it doesn't consider memory transfer costs. The 5x speedup is more realistic. The GPU that they use (8800 GTX) has 128 stream processors running at 1.35 GHz. The comparison is a P4 running at 3 GHz. Roughly speaking, we can compare the cycles taken on each platform as a measure of the work done. The graphics card's stream processors together perform 57x more clock cycles.
The central workload in high-performance AES is completely memory bound. The cycles are just used to stage results from memory and perform XOR instructions. So the stream processors only execute the code 5x quicker with 57x more clocks and a huge memory bandwidth advantage that I can't be bothered to look up.
So no, 10x less output per clock is not "efficient" in my book. But if you publish your paper in a crappy unrelated conference then you will get away with it.
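The arithmetic behind those figures, as a quick Python sketch. The hardware numbers and the 5x speedup are the ones quoted above; treating the CPU as a single 3 GHz lane and counting raw clock cycles as "work" is the same rough assumption the comparison makes, not a measured result.

```python
# Figures quoted in the comment above.
stream_processors = 128
gpu_clock_ghz = 1.35
cpu_clock_ghz = 3.0
realistic_speedup = 5.0   # the speedup that includes transfer costs

# Aggregate GPU clock cycles per second relative to one CPU core
# (assumption: cycles are a fair proxy for available work).
clock_ratio = stream_processors * gpu_clock_ghz / cpu_clock_ghz

# How much less gets done per clock on the GPU than on the CPU.
per_clock_deficit = clock_ratio / realistic_speedup

print(f"GPU clocks per CPU clock: {clock_ratio:.1f}x")
print(f"output per clock is about {per_clock_deficit:.1f}x lower on the GPU")
```

That works out to 57.6x the clocks and a bit over 11x less output per clock, which is where the "57x" and the roughly "10x" in this comment come from.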
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php