Ask Slashdot: Parallel Cluster In a Box?
QuantumMist writes "I'm helping someone with accelerating an embarrassingly parallel application. What's the best way to spend $10K to $15K to receive the maximum number of simultaneous threads of execution? The focus is on threads of execution as memory requirements are decently low e.g. ~512MB in memory at any given time (maybe up to 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four Teslas in a box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC which actually decreases available memory (I recognize ECC's advantages though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. I was thinking that GTX cards can be replaced for a fraction of the cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so can't go cloud via EC2. Any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed? e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via a PCI or other standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"
Why not use AMD and OpenCL?
Just put a bunch of GTX cards in a nice, big server case with enough fans. You are hardly going to find a cheaper alternative.
When choosing cards, look for tests like this one:
http://www.behardware.com/articles/840-13/roundup-a-review-of-the-super-geforce-gtx-580s-from-asus-evga-gainward-gigabyte-msi-and-zotac.html
The IR thermal photos are great for choosing a well-cooled card.
Also use software to control the card fans and keep them running at 100% fan speed.
Noisy? Yes. But who cares, unless you plan on putting it in your bedroom.
You can easily keep these cards at ~70C under full load.
If the off-the-shelf GTX cards work, you'd have 8 * Xeon + 8 * NVidia GPUs in 3U, all entirely parallel (i.e. 8 separate machines) to avoid the main CPUs being any kind of bottleneck. Stock each node with 2GB of RAM on the cheap and some cheaper SATA drives, and you'd likely end up under $10k for the whole thing and have an 8-node cluster you can use for other tasks later.
I've noticed that "embarrassingly parallel" tasks, if you take the low-hanging fruit too far, end up running into some other unforeseen bottleneck. Thus me suggesting something faux-bladeish instead.
PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
Do you or they know how to program a GPU?
If it's really embarrassingly parallel, EC2 spot instances and the GNU program 'parallel' will work quite nicely.
But if coding changes are required then the hardware is the least of your expenses.
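For anyone wondering what "no coding changes" looks like in practice, here is a minimal sketch of the same fan-out idea using only the Python standard library instead of GNU parallel. The `simulate` function is a hypothetical stand-in for one independent unit of work, not anything from the original poster's application; the point is only that each task runs without talking to the others.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def simulate(task_id: int) -> int:
    """Hypothetical stand-in for one independent unit of work."""
    total = 0
    for i in range(10_000):
        total += (task_id * 31 + i) % 7
    return total


def run_serial(task_ids):
    """Reference implementation: run every task one after another."""
    return [simulate(t) for t in task_ids]


def run_parallel(task_ids, workers=None):
    """Fan the tasks out over worker processes; results keep input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, task_ids, chunksize=8))


if __name__ == "__main__":
    tasks = range(64)
    # Tasks never communicate, so the parallel run must match the serial one.
    assert run_parallel(tasks, workers=os.cpu_count()) == run_serial(tasks)
    print("parallel and serial results match")
```

If the real job already runs as a command-line program taking an input file or a seed, the same pattern applies across machines: one process per input, no shared state.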
> Should I just stuff four GTX cards in a server and replace them as they die from heat?
It'd be more cost-efficient to improve the air flow or add liquid cooling. Yay mineral oil baths.
You can easily build a 64-core 1U system with Opterons using the quad-socket setup, or a 128-core system using the quad socket with extension setup, and it will only run you about 5k. These are 128 general-purpose cores at 2 GHz+; you don't have to change the program to run on them, and you don't need to obfuscate things as you would when programming for GPUs. Or you can wait for Knights Corner, or get the Tile64s.
If, for example, it's embarrassingly parallel DSP operations, you might try some dedicated DSP engines, or even some Xilinx FPGAs.
If it's really embarrassingly parallel, just run it on whatever CPUs you have hanging about or can scrounge cheaply. As long as the application is written portably they don't even need to be the same architecture or operating system, although that would help with deployment. The only reason to try to scrunch everything in one box would be if you have space limitations.
You can get 48 real AMD Magny-Cours CPU cores with full DP floating point support and ~64GB ECC memory in a box for under 10K (EUR!) from e.g. Tyan and Supermicro.
I run my embarrassingly parallel stuff on that, and it works great. Depending on your application, 64 Bulldozer cores, which come in the same package for only slightly more money, may or may not perform better. I have not seen many real-world applications in which one GPU is actually faster than 12 to 16 server-class CPU cores.
Of course this depends a lot on whether you have done the GPU porting already or are just planning to, which you unfortunately don't state in your post.
Others have pointed it out, but if you can run this on a GPU, you don't need to look any further than that.
Specifically, check out some of the BitCoin mining rigs people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.
In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]
If your code:
- scales to OpenCL/CUDA easily
- does not require high concurrent memory transfers
- is fault tolerant (i.e. a failed card doesn't hose a whole day/week of runs)
- can use single precision flops
Then you can use commodity hardware like the GTX series cards. I'd go with the GTX 560 Ti (GF114 GPU).
Make nodes with:
- quad core processors (AMD or Intel)
- whatever RAM is needed (8GB minimum)
- 2 x GTX 560 Ti (448) run in SLI (or the 560 Ti dual from EVGA)
Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf
It all depends on your code.
earlier thread ...
How does the app parallelize? Is each process/thread dependent on every other process/thread, or is it 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process or the sum of all processes?
GPUs might not be the right solution for this. GPUs are excellent for parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted for a GPU.
For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.
Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and after testing we discovered that, for the calculations they are doing, the portions that could be offloaded to the GPU only accounted for at most 10% of the execution time, with the remainder being operations that the software packages could only do on the CPU.
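To put a number on why that 10% matters, here is a small sketch using Amdahl's law (my framing, not something the poster named). The function name and the example acceleration factors are assumptions for illustration; the 10% offloadable fraction is the figure from the comment above.

```python
def amdahl_speedup(offload_fraction: float, accel_factor: float) -> float:
    """Overall speedup when only `offload_fraction` of the runtime
    is accelerated by `accel_factor` (Amdahl's law)."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if accel_factor <= 0:
        raise ValueError("accel_factor must be positive")
    serial_part = 1.0 - offload_fraction
    return 1.0 / (serial_part + offload_fraction / accel_factor)


if __name__ == "__main__":
    # 10% of the runtime can move to the GPU; try a few assumed GPU speedups.
    for factor in (10, 50, 1000):
        print(f"GPU {factor}x faster on 10% of the work: "
              f"{amdahl_speedup(0.10, factor):.3f}x overall")
```

However fast the GPU is, the overall speedup with a 10% offloadable fraction can never exceed 1 / 0.9, or about 1.11x, which is why profiling first is worth the time.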
Why not a beowulf clust---
I'm sorry, I just can't. I searched the ~35 posts, browsing at -1, and no reference to a Beowulf cluster anywhere, let alone Natalie Portman or Grits.
Slashdot! You're slipping! I lament the days when even our trolls were amusing and somewhat topical to the discussion at hand! We've fallen so far!
Yes, I haven't seen any references here or anywhere else either lately.
From http://en.wikipedia.org/wiki/Beowulf_cluster: "The name Beowulf originally referred to a specific computer built in 1994 by Thomas Sterling and Donald Becker at NASA. [...] There is no particular piece of software that defines a cluster as a Beowulf. Beowulf clusters normally run a Unix-like operating system, such as BSD, Linux, or Solaris, normally built from free and open source software. Commonly used parallel processing libraries include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI software include OpenMPI or MPICH. There are additional MPI implementations available. Beowulf systems are now deployed worldwide, chiefly in support of scientific computing."
Apparently, Beowulf clusters may still be around; it is just that they don't go by that name any longer. I wonder what the latest buzzword for essentially the same thing would be?
http://www.mini-itx.com/projects/cluster/?p
The example at the URL above is quite old, but a good starting point. Just use a dozen cheap Mini-ITX boards with -- let's say -- Intel Core i5 and voilà! Probably the cheapest way to go, and also much easier to program than using CUDA and nVidia. Hook the whole thing up to a gigabit switch.
I'll let the experts debate the best CPU for that job, but AMD should also have some nice products on offer.
What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).
My research group built this 12-core/8-GPU system last year for about $10k: http://tinyurl.com/7ecqjfj
The system has a theoretical peak ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.
We have several racks full, purchased because "they're cheaper than Teslas".
Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.
The lack of the former causes the compute nodes to crash regularly. What you save on cards, you'll lose in salary for someone to nursemaid them. The latter makes it harder to integrate into a scheduler environment (we're using Torque).
Yes, this is primarily marketing discrimination, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.
I've played this parallel cost analysis game several times, and if you don't need high bandwidth communication between the threads, I usually come up with the Google solution: a big farm of cheap machines. AMD chips start looking good compared to Intel because you're not after a single thread finishing as fast as possible, you're after as many FLOPS per $ as you can get. We even did the analysis for an extreme Apple fanboi: MacPros vs MacMinis back in 2007, and a stack of 25 minis came out way more powerful than the 3 or 4 Pros you could get for the same money.
Speaking as someone who has done some GPU programming (specifically CUDA): be aware that there is more to the GPU parallelism model than just "lots of threads". Many embarrassingly parallel problems translate very poorly to CUDA. The primary things to consider are:
1. GPUs are *data parallel*. This means that you need to have an algorithm in which each and every thread will be executing the same instruction at the same time (just on different data). For a cheap way to evaluate it, if you can't speed up your program by vectorizing it then the GPU won't help. Of course, you can have divergent "threads" on GPUs, but as soon as you do you've lost all benefit to using a GPU, and have essentially turned your GPU into an expensive but slow computer.
2. Moving data onto or off of the GPU is *slow*. So if you can leave all the data on the GPUs and none of the GPUs need to communicate with each other, then this will work well. If the threads need to frequently globally sync up, you're going to be in trouble.
That said, if you have the right kind of data parallel problem, GPUs will blow everything else out of the water at the same price point.
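To illustrate point 1, here is a deliberately simplified cost model of divergence, written in plain Python rather than CUDA. It assumes threads run in lockstep groups of 32 (the warp size on NVIDIA hardware) and that a group pays for every distinct branch any of its threads takes. The branch names and cycle costs are made up, and the model ignores memory latency and scheduling entirely.

```python
WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA GPUs


def warp_cycles(branches_in_warp, cost_of_branch):
    """Cycles for one warp: every distinct branch taken by any thread
    in the warp is executed in turn, with the other threads idle."""
    return sum(cost_of_branch[b] for b in set(branches_in_warp))


def kernel_cycles(branch_per_thread, cost_of_branch):
    """Total cycles over all warps, processed one warp at a time."""
    total = 0
    for start in range(0, len(branch_per_thread), WARP_SIZE):
        warp = branch_per_thread[start:start + WARP_SIZE]
        total += warp_cycles(warp, cost_of_branch)
    return total


if __name__ == "__main__":
    cost = {"a": 100, "b": 100}          # hypothetical cycle costs
    uniform = ["a"] * 64                 # every thread takes the same path
    interleaved = ["a", "b"] * 32        # neighbours disagree: divergence
    grouped = ["a"] * 32 + ["b"] * 32    # same mix, sorted by branch
    for name, layout in (("uniform", uniform),
                         ("interleaved", interleaved),
                         ("grouped", grouped)):
        print(name, kernel_cycles(layout, cost))
```

In this model the interleaved layout costs twice as much as the uniform one, while grouping threads by branch recovers the loss. With many distinct paths per warp the penalty grows accordingly, which is the effect described in point 1.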
Yes I do. He's extending the calculations begun by Lewis Carroll in the imaginary space (through the looking glass), to see the effects as the ultimate limit increases.
What's
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+
1+1+1+1+1+1+1+1+1+1+1+1
As I said, embarrassingly parallel. Get 7 computers working on it in parallel, with 1 for backup:
What's 1+1+1+1+1+1 (after some calculation, 6)
So that all is 42.
the ultimate answer is
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+
1+1+1+1+1+1+1+1+1+1+1+1=42.
I should note that this mathematical calculation was also attempted by Douglas Adams, using genetic algorithms.
There are some high-powered PCI cards filled with TI DSPs that you can get. Here's an article describing some of them. In terms of power efficiency per unit of work, the DSPs blow the doors off the main processor and the GPUs. Each DSP on the chip can do 16 single precision or 4 double precision floating point operations per cycle, at around 1GHz, and they're programmable in C/C++.
Relevant quote:
Buy 5 of these and you're only at 550W, $10,000 and 5 TFLOPs.
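For a rough sense of scale, here is the arithmetic on that quote, next to the 8-GPU box mentioned earlier in the thread (about $10k, ~9.1 TFLOPS peak single precision). All figures are quoted peak numbers taken from the comments, the quote does not say which precision its 5 TFLOPs refers to, and no power figure was given for the GPU box, so treat this as a back-of-the-envelope sketch rather than a benchmark.

```python
def dollars_per_tflops(cost_usd: float, peak_tflops: float) -> float:
    """Purchase cost per TFLOPS of quoted peak throughput."""
    return cost_usd / peak_tflops


def gflops_per_watt(peak_tflops: float, watts: float) -> float:
    """Quoted peak throughput per watt, in GFLOPS/W."""
    return peak_tflops * 1000.0 / watts


if __name__ == "__main__":
    # Figures as quoted in the thread; precision of the DSP number is unstated.
    dsp = {"cards": 5, "cost_usd": 10_000, "peak_tflops": 5.0, "watts": 550}
    gpu_box = {"cost_usd": 10_000, "peak_tflops": 9.1}  # single precision

    print("DSP watts per card:", dsp["watts"] / dsp["cards"])
    print("DSP $/TFLOPS:", dollars_per_tflops(dsp["cost_usd"], dsp["peak_tflops"]))
    print("DSP GFLOPS/W: %.2f" % gflops_per_watt(dsp["peak_tflops"], dsp["watts"]))
    print("GPU box $/TFLOPS: %.0f"
          % dollars_per_tflops(gpu_box["cost_usd"], gpu_box["peak_tflops"]))
```

On these numbers the DSP cards come to 110 W each and $2,000 per peak TFLOPS, so their case rests on power efficiency rather than purchase price.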