Ask Slashdot: Parallel Cluster In a Box?
QuantumMist writes "I'm helping someone with accelerating an embarrassingly parallel application. What's the best way to spend $10K to $15K to receive the maximum number of simultaneous threads of execution? The focus is on threads of execution as memory requirements are decently low e.g. ~512MB in memory at any given time (maybe up to 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four Teslas in a box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC which actually decreases available memory (I recognize ECC's advantages though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. I was thinking that GTX cards can be replaced for a fraction of the cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so can't go cloud via EC2. Any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed? e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via a PCI or other standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"
Do you or they know how to program a GPU?
If it's really embarrassingly parallel, EC2 spot instances and the GNU program 'parallel' will work quite nicely.
But if coding changes are required then the hardware is the least of your expenses.
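For anyone unsure whether the job really is embarrassingly parallel, here is a minimal sketch of the shape such a job has. It uses Python's multiprocessing rather than GNU parallel or EC2, and `simulate` and `run_all` are hypothetical stand-ins for the real per-task work, not anything from the poster's application.

```python
from multiprocessing import Pool


def simulate(task_id):
    # Hypothetical stand-in for one independent unit of work.
    # It reads nothing from and writes nothing to any other task.
    total = 0
    for i in range(1000):
        total += (task_id * i) % 7
    return total


def run_all(task_ids, workers=4):
    # Tasks share no state, so they can be farmed out in any order
    # to as many worker processes as the box has cores.
    with Pool(processes=workers) as pool:
        return pool.map(simulate, task_ids)


if __name__ == "__main__":
    results = run_all(range(16))
    print(len(results), "tasks finished")
```

If the real code fits this pattern without modification, adding CPU cores or machines speeds it up with no rewrite, which is the point about hardware being the smaller expense.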
If I could walk that way I wouldn't need cologne.
You can easily build a 64-core 1U system with Opterons using the quad-socket setup, or a 128-core system using the quad-socket-with-extension setup, that will only run you about $5K. These are 128 general-purpose cores at 2GHz+. You don't have to change the program to run on these, and you don't need to obfuscate things as you would when programming for and dealing with GPUs. Or you can wait for Knights Corner, or get the Tile64s.
Others have pointed it out, but if you can run this on a GPU, you don't need to look any further than that.
Specifically, check out some of the BitCoin mining rigs people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.
In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]
If your code:
- scales to OpenCL/CUDA easily
- does not require high concurrent memory transfers
- is fault tolerant (i.e. a failed card doesn't hose a whole day/week of runs)
- can use single-precision flops
then you can use commodity hardware like the GTX series cards. I'd go with the GTX 560 Ti (GF114 GPU).
Make nodes with:
- quad-core processors (AMD or Intel)
- whatever RAM is needed (8GB minimum)
- 2 x GTX 560 Ti (448) run in SLI (or the 560 Ti dual from EVGA)
Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf
It all depends on your code.
It would have been nice if he'd given us more information about the form factor he needs to put this into. Since the client isn't paying the electric or cooling bill, I have to assume that it's colocated, so there might be some real rack unit restrictions that prevent this from working well. It also would have been nice to know the storage demands, as there are tradeoffs in front-accessible drive arrays for cooling and airflow purposes. Most of the cases with tons of hot-swap drives in front lack good front ventilation. If he only needs a few drives, then that opens him up to a simple 3U or 4U chassis with a mostly open-grille front to make airflow a lot less restrictive.
Do not look into laser with remaining eye.
How does the app parallelize? Is each process/thread dependent on every other process/thread, or is it 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process or the sum of all processes?
GPUs might not be the right solution for this. GPUs are excellent for parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted for a GPU.
For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.
Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and testing showed that for the calculations they are doing, the portions that could be offloaded to the GPU accounted for at most 10% of the execution time, with the remainder being operations that the software packages could only do on the CPU.
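The 10% figure above is worth running through Amdahl's law to see why the GPUs did not pay off. A rough sketch, assuming the offloadable portion is exactly 10% of runtime and ignoring transfer overhead; the function name and the list of acceleration factors are made up for illustration.

```python
def amdahl_speedup(offload_fraction, accel_factor):
    # Amdahl's law: only the offloadable fraction of the runtime gets
    # faster; the rest still runs at the original CPU speed.
    serial = 1.0 - offload_fraction
    return 1.0 / (serial + offload_fraction / accel_factor)


if __name__ == "__main__":
    # How much faster the whole job gets if the GPU speeds up the
    # offloadable 10% by various (illustrative) factors.
    for factor in (2, 10, 50, 1000):
        print(f"GPU {factor}x faster on 10% of work: "
              f"{amdahl_speedup(0.10, factor):.3f}x overall")
```

However fast the card is, the overall speedup can never exceed 1 / 0.9, or about 1.11x, when 90% of the time stays on the CPU.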
Because it's new, and finding someone who's done it to get some pointers is really hard.
CUDA has been around a while; figuring it out isn't such a rough learning curve.
Overall I'm a little suspicious of someone looking to use a GPU just to get more threads on a problem, as going the GPU route is a really committed step and the programming gets a new level of complicated. Using multiple cards has some odd issues in CUDA, e.g. if you exceed the card index it defaults to card 0 rather than crashing. There are more places to screw up with a GPU: transferring memory; getting blocks, threads, and warps organized (done properly it hides all sorts of latency in calculations, done poorly it's worse than a CPU); and avoiding memory contention (the memory scheme isn't bad, but it needs to be understood).
So in most cases I'd first start with this chart http://www.cpubenchmark.net/cpu_value_available.html and tell them to cut their teeth on a GPU with a smaller (cheaper) test case.
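On the card-index pitfall mentioned above: one cheap defense is to validate the index in your own code before handing it to the GPU library. This is a plain-Python sketch of that guard, not a CUDA API; `checked_device_index` is a hypothetical helper, and `device_count` is assumed to come from whatever device query your GPU binding provides.

```python
def checked_device_index(requested, device_count):
    # Hypothetical guard: fail loudly on a bad index instead of letting
    # the job quietly land on card 0 alongside another worker.
    if device_count <= 0:
        raise RuntimeError("no GPUs visible")
    if not 0 <= requested < device_count:
        raise IndexError(
            f"GPU index {requested} out of range (0..{device_count - 1})"
        )
    return requested


if __name__ == "__main__":
    # With four cards installed, indices 0 through 3 are valid.
    print(checked_device_index(3, 4))
```

A multi-card job that crashes on a bad index is much easier to debug than one that silently doubles up on a single card.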
Back-of-the-envelope comparison of PS3 and GTX:
A cluster of three PS3s: 920 GFLOPS. Price: about $800.
A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.
Each of those GTX cards also has significantly more memory than a PS3, and is cheaper to develop for.
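The same envelope, worked out as GFLOPS per dollar. This only uses the two figures quoted above (920 and 2200 GFLOPS, about $800 each); the dictionary and function names are just for illustration.

```python
# Peak GFLOPS and approximate price in USD, as quoted in the comparison.
setups = {
    "3x PS3 cluster": (920, 800),
    "PC with 3x GTX 460": (2200, 800),
}


def gflops_per_dollar(gflops, price_usd):
    # Simple price/performance ratio; higher is better.
    return gflops / price_usd


if __name__ == "__main__":
    for name, (gflops, price) in setups.items():
        print(f"{name}: {gflops_per_dollar(gflops, price):.2f} GFLOPS per dollar")
```

That comes to 1.15 GFLOPS per dollar for the PS3s against 2.75 for the GTX box, so roughly 2.4 times the peak throughput for the same money.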
Why not a beowulf clust---
I'm sorry, I just can't. I searched the ~35 posts, browsing at -1, and no reference to a Beowulf cluster anywhere, let alone Natalie Portman or Grits.
Slashdot! You're slipping! I lament the days when even our trolls were amusing and somewhat topical to the discussion at hand! We've fallen so far!
What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).
My research group built this 12-core/8-GPU system last year for about $10k: http://tinyurl.com/7ecqjfj
The system has a theoretical peak of ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.
We have several racks full of GTX cards, purchased because "they're cheaper than Teslas".
Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.
The lack of ECC and proper thermal management causes the compute nodes to crash regularly. What you save on cards, you'll lose in salary for someone to nursemaid them. The disabled features make it harder to integrate the nodes into a scheduler environment (we're using Torque).
Yes, this is primarily marketing discrimination, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.