Ask Slashdot: Parallel Cluster In a Box?
QuantumMist writes "I'm helping someone with accelerating an embarrassingly parallel application. What's the best way to spend $10K to $15K to receive the maximum number of simultaneous threads of execution? The focus is on threads of execution as memory requirements are decently low e.g. ~512MB in memory at any given time (maybe up to 2 to 3X that at the very high end). I've looked at the latest Tesla card, as well as the four Teslas in a box solutions, and am having trouble justifying the markup for what's essentially 'double precision FP being enabled, some heat improvements, and ECC which actually decreases available memory (I recognize ECC's advantages though).' Spending close to $11K for the four Teslas in a 1U setup seems to be the only solution at this time. I was thinking that GTX cards can be replaced for a fraction of the cost, so should I just stuff four or more of them in a box? Note, they don't have to pay the power/cooling bill. Amazon is too expensive for this level of performance, so can't go cloud via EC2. Any parallel architectures out there at this price point, even for $5K more? Any good manycore offerings that I've missed? e.g. somebody who can stuff a ton of ARM or other CPUs/GPUs in a server (cluster in a box)? It would be great if this could be easily addressed via a PCI or other standard interface. Should I just stuff four GTX cards in a server and replace them as they die from heat? Any creative solutions out there? Thanks for any thoughts!"
Why not use AMD and OpenCL?
Sorry for just the link, but you could build a one-off like this: http://helmer.sfe.se/
Spec out a quad-core AMD grey box with 4 gigs of ram (I saw 4 gigs of DDR3 RAM for $20 the other day). That shouldn't run you more than $400 a pop.
For 10K, you'll get 10,000/400*4=100 threads of execution.
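To make that arithmetic easy to redo with other prices, here is a rough sketch. The function name is made up for illustration, and it ignores switches, cabling, and power, so treat the result as an upper bound rather than a quote.

```python
def threads_for_budget(budget_usd, box_cost_usd, cores_per_box):
    """Return (boxes, hardware threads) for a pile of identical cheap boxes."""
    boxes = budget_usd // box_cost_usd  # whole boxes only
    return boxes, boxes * cores_per_box


# Figures from the comment above: $10K budget, $400 quad-core boxes.
boxes, threads = threads_for_budget(10_000, 400, 4)
print(boxes, threads)  # 25 100
```

At $15K the same boxes would give 37 machines and 148 threads, before networking costs.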
Just put a bunch of GTX cards in a nice, big server case with enough fans. You are hardly going to find a cheaper alternative.
When choosing cards, look for tests like this one:
http://www.behardware.com/articles/840-13/roundup-a-review-of-the-super-geforce-gtx-580s-from-asus-evga-gainward-gigabyte-msi-and-zotac.html
The IR thermal photos are great when choosing a well-cooled card.
Also use software to keep the card fans running at 100% speed.
Noisy? Yes. But who cares, unless you plan on putting it in your bedroom.
You can easily keep these cards at ~70C with full load.
If the off-the-shelf GTX cards work, you'd have 8 * Xeon + 8 * NVidia GPUs in 3U, all entirely parallel (i.e. 8 separate machines) to avoid the main CPUs being any kind of bottleneck. Stock each node with 2GB of RAM on the cheap and some cheaper SATA drives, and you'd likely end up under $10k for the whole thing and have an 8-node cluster you can use for other tasks later.
I've noticed that "embarrassingly parallel" tasks, if you take the low-hanging fruit too far, end up running into some other unforeseen bottleneck. Thus me suggesting something faux-bladeish instead.
PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
https://sites.google.com/site/jimerickso/home/new-build
Do you or they know how to program a GPU?
If it's really embarrassingly parallel, EC2 spot instances and the GNU program 'parallel' will work quite nicely.
But if coding changes are required then the hardware is the least of your expenses.
Amazon EC2.
> Should I just stuff four GTX cards in a server and replace them as they die from heat?
It'd be more cost-efficient to improve the air flow or add liquid cooling. Yay mineral oil baths.
Radeon HD 5800-5900 series Supports FP64
Radeon HD 6900 series supports FP64
You can easily build a 64-core 1U system with Opterons using the quad-socket setup, or a 128-core one using the quad-socket-with-extension setup, and that will only run you about 5k. These are 128 general-purpose cores at 2 GHz+; you don't have to change the program to run on them, and you don't need to obfuscate things as you would when programming and dealing with GPUs... Or you can wait for Knights Corner, or get the Tile64s.
If, for example, it's embarrassingly parallel DSP operations, you might try some dedicated DSP engines, or even some Xilinx FPGAs.
I built a cluster the other day with 8 i7-2600K processors.
CPU - Intel i7-2600k = $300
Motherboard - P8H67-M PRO/CSM = $110
Ram - 4x 4GB corsair = $100
2u case + 400w ps = $90
My total cost was under 5k for 8 nodes, and it runs very, very fast. Although my application likes CPUs more than GPUs. I also use a total of 16U of space, but that is not much of an extra cost.
You really haven't given any details about your requirements.
This is a parallel problem, but will it run well on a GPU? If it's an inherently divergent task, then probably not (correct me if this isn't the case for other cards, I only have CUDA experience). If you want good answers, you'll need to describe your problem in more detail than just being embarrassingly parallel.
The recently-announced HP Moonshot architecture seems to meet most of your operational requirements. http://www.hp.com/hpinfo/newsroom/press/2011/111101xa.html
I haven't seen any pricing, though.
Buying a blade server on ebay would also be a great option. For around 5-6k you could get a nice blade with 10 nodes.
10 dual cpu nodes = 80 cores, if you have hyperthreading you can run 160 threads.
AMD cards are worth a look. Especially for embarrassingly parallel stuff they often deliver higher performance (see e.g. bitcoin).
If it's really embarrassingly parallel, just run it on whatever CPUs you have hanging about or can scrounge cheaply. As long as the application is written portably they don't even need to be the same architecture or operating system, although that would help with deployment. The only reason to try to scrunch everything in one box would be if you have space limitations.
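For what "just run it on whatever CPUs you have" can look like on a single multi-core box, here is a minimal sketch using Python's standard multiprocessing module. The workload (a Monte Carlo estimate of pi) and the names run_job and run_all are placeholders, not the poster's application; the point is only that each job takes an input, returns a result, and shares nothing with the others.

```python
import random
from multiprocessing import Pool


def run_job(seed):
    """One independent job: an input configuration in, a result out."""
    rng = random.Random(seed)  # private RNG, so jobs share no state
    samples = 20_000
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples  # placeholder workload: estimate of pi


def run_all(seeds, workers):
    """Farm the jobs out to a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(run_job, seeds)


if __name__ == "__main__":
    estimates = run_all(range(16), workers=4)
    print(sum(estimates) / len(estimates))
```

Spreading the same jobs over several machines needs a job queue or scheduler on top, but the per-job function does not change.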
You can get 48 real AMD Magny-Cours CPU cores with full DP floating point support and ~64GB ECC memory in a box for under 10K(EUR!) from e.g. Tyan and supermicro.
I run my embarrassingly parallel stuff on that, and it works great. Depending on your application, 64 Bulldozer cores, which come in the same package for only slightly more money, may or may not perform better. I have not seen many real-world applications in which one GPU is actually faster than 12 to 16 server-class CPU cores.
Of course this depends a lot on whether you have done the GPU porting already or are just planning to, which you unfortunately don't state in your post.
Just make a cluster of these little guys.
HDMI output, USB input.
Encode data in to HDMI frames.
Have a decent board for decoding and to perform instructions from the HDMI data, then send more data back through USB.
You could probably even use the audio ports for even more throughput.
And if you get the ethernet version, that too.
I'm not even joking.
Well, partially. Might not be worthy of this case.
Plus, not out or even final.
But it sounds like an interesting idea anyway, so might as well throw it out there since it is pretty related.
Others have pointed it out, but if you can run this on a GPU, you don't need to look any further than that.
Specifically, check out some of the BitCoin mining rigs people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.
If you're not a GPU programmer the alternative is a 48-core AMD server (64-core systems are notoriously slow and have half the floating point units) with MPI. This is the solution that many academics are taking.
Also if you're lucky you might be able to get your hands on Intel's 100-core Atom processor, they're not for sale AFAIK but I believe you can apply to get one for free.
You are mining bitcoins too?
In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]
If your code:
-scales to OpenCL/CUDA easily.
-does not require high concurrent memory transfers
-is fault tolerant (ie a failed card doesn't hose a whole day/week of runs)
-can use single precision flops
Then you can use commodity hardware like the gtx series cards. I'd go with the gtx 560ti (GF114 gpu).
Make nodes with:
quad core processors (amd or intel)
whatever ram is needed (8GB minimum)
2 x gtx560ti (448) run in SLI (or the 560ti dual from EVGA)
Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf
It all depends on your code.
earlier thread ...
How does the app parallelize? Is each process/thread dependent on every other process/thread or is it a 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process or the sum of all processes?
GPUs might not be the right solution for this. GPUs are excellent for parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted for a GPU.
For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.
Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and after testing it was discovered that, for the calculations they are doing, the portions of their code that could be offloaded to the GPU only accounted for at most 10% of the execution time, with the remainder being operations that the software packages could only do on CPU.
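The 10% figure in the comment above puts a hard ceiling on what any GPU can buy, which Amdahl's law makes explicit. A quick sketch, where the function name is illustrative and the 50x accelerator factor is just an example value:

```python
def amdahl_speedup(offload_fraction, accel_factor):
    """Overall speedup when only `offload_fraction` of the run time
    is accelerated by `accel_factor` and the rest stays on the CPU."""
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_factor)


# 10% of the run time offloadable, as in the molecular dynamics case.
print(f"{amdahl_speedup(0.10, 50):.2f}")            # 1.11 with a 50x faster GPU
print(f"{amdahl_speedup(0.10, float('inf')):.2f}")  # 1.11 even if the GPU were free
```

So with 90% of the time stuck on the CPU, the best possible overall gain is about 11%, no matter how many cards are in the box.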
Don't use high end GTX cards; twice as many lower end passively-cooled GPU cards will provide more than the equivalent performance with far less cost and failure rate. If your application really benefits more from additional threads vs single thread execution speed, this is the way to go. Most GPGPU clusters that aren't built using Tegra use this approach.
big FP bandwidth on a tesla doesn't do much for you if you only need integer execution. Maybe you'd be better off with a 4-cpu xeon box, or a bulldozer, or a 64-core arm. Really, you want to find a way to benchmark your particular software on a variety of potential cpu targets, and then do a price comparison.
Why not a beowulf clust---
I'm sorry, I just can't. I searched the ~35 posts, browsing at -1, and no reference to a Beowulf cluster anywhere, let alone Natalie Portman or Grits.
Slashdot! You're slipping! I lament the days when even our trolls were amusing and somewhat topical to the discussion at hand! We've fallen so far!
Yes, I haven't seen any references here or anywhere else either lately.
From http://en.wikipedia.org/wiki/Beowulf_cluster: "The name Beowulf originally referred to a specific computer built in 1994 by Thomas Sterling and Donald Becker at NASA. [...] There is no particular piece of software that defines a cluster as a Beowulf. Beowulf clusters normally run a Unix-like operating system, such as BSD, Linux, or Solaris, normally built from free and open source software. Commonly used parallel processing libraries include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI software include OpenMPI or MPICH. There are additional MPI implementations available. Beowulf systems are now deployed worldwide, chiefly in support of scientific computing."
Apparently, Beowulf clusters may still be around; it is just that they don't go by that name any longer. I wonder what the latest buzzword for essentially the same thing would be?
normally, the phrase means "lots of serial jobs", which have an input configuration and a result, and nothing in between (particularly no inter-job sharing). gp-gpu is suitable for a somewhat different sort of workload, basically single-instruction-multiple-threads. in short, are the threads working in lockstep?
http://www.mini-itx.com/projects/cluster/?p
The example at the URL above is quite old, but a good starting point. Just use a dozen cheap mini-ITX boards with, let's say, an Intel Core i5 each, and voilà! Probably the cheapest way to go, and also much easier to program than using CUDA and nVidia. Hook the whole thing up to a gigabit switch.
I'll let the experts debate the best CPU for that job, but AMD should also have some nice products on offer.
Hi Guys,
I get paid 200/hr by the government to come up with an architecture for parallel processing. Rather than taking time reading through droll literature, I need to go traveling to my second house in the Caymans. I wondered if I could ask slashdot and save myself the trouble.
Ttyl
Tax Evader
What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).
My research group built this 12-core/8-GPU system last year for about $10k: http://tinyurl.com/7ecqjfj
The system has a theoretical peak ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.
We have several racks full, purchased because "they're cheaper than Teslas".
Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.
The former cause the compute nodes to crash regularly. What you save on cards, you'll lose in salary for someone to nursemaid them. The latter makes it harder to integrate into a scheduler environment (we're using Torque).
Yes, this is primarily marketing discrimination, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.
I think the time of the PS3 clusters has passed. The Cell processor was released back in 2006! IBM released a few upgraded processors, mostly improving double-precision performance, but those systems are really cost prohibitive.
Assuming you can deal with PCIe latency, GPUs are the way to go.
... do not require embarrassingly parallel solutions.
They require math and algorithm design to make the solution *nonembarrassing*.
To give you an example: with some easy math, a typical FFT can cut its number of calculations by a factor of four. With a little care, you can halve the number of calculations again.
Start with the math. Then look at the solution.
Last of all, consider cloudware. It's out there. Let's see... on my android, I have "sourceLair". Yeah, that's one.
Once you have the cloudware solution in hand, *then* you can start thinking about spending money on a kind of parallel solution (such as what Google uses).
Even though the application is parallel, your bottleneck can easily be the memory bus. Adding Tesla cores won't solve memory bus issues. For a number of apps, stacking up Intel i5 quad-core machines increases memory bandwidth on the cheap. Ten $500 machines, or five $1,000 machines each with a cheap NVidia GPU, may very well outperform anything that can be put in a single "box", because there is 10x or 5x more memory bandwidth. That means you need to write not just parallel code, but multi-machine parallel code, in which case you should get in bed with a computation fabric like Hadoop or one of a million others (raw OpenMP is another example, if you're a GPU hacker type).
Replying to self: our Citrusleaf database does amazing parallel operations on Sandy Bridge i5 (2400) machines. Single-socket machines have the best interrupt processing and lowest memory latency. Going to Xeon architectures is, in price/performance terms, a HUGE decrease. There was a great post somewhere about $/speed in CPUs, and of course the true consumer-grade stuff (i5 and Phenom II) was 10x better than "datacenter" grade machines. This is especially true for Supermicro. As much as I like them, you can save 4x the money by going Asus and using a physically larger box, if you're not going into a data center. Another cost saving is running the project at home: you'll get more bandwidth for $50/month than you'll ever get from a data center.
Go old school and immerse the entire machine in a tub of mineral oil?
Multiple GTX card servers in a cluster? So 4 GTX GPU's to a box, and several boxes to the cluster.
I've played this parallel cost analysis game several times, and if you don't need high bandwidth communication between the threads, I usually come up with the Google solution: a big farm of cheap machines. AMD chips start looking good compared to Intel because you're not after a single thread finishing as fast as possible, you're after as many FLOPS per $ as you can get. We even did the analysis for an extreme Apple fanboi: MacPros vs MacMinis back in 2007, and a stack of 25 minis came out way more powerful than the 3 or 4 Pros you could get for the same money.
Two 8-core processors with 8 threads per core == 128 simultaneous threads.
You could get a new Sun T3-1 for a little more. It would be roughly the same performance (it only has one physical processor, but it's 16 cores * 8 threads per core, so still 128 total).
Tilera has 100 core chips. If you don't need floating point (you never said what you were doing) they're a great choice.
Very informative, kinda technical: http://www.tweak3d.net/articles/howtolanparty/
Most (but not all) double-precision work can be handled by single precision pairs. Basically you keep numbers in the form a+b, where a is "big" and b is "little" and handle them accordingly. The slowdown is less than you'd think, and often gives better performance than the kind of hardware dp that gpus offer. There's a bunch of libraries and papers out there if you google them.
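As an illustration of the a+b trick described above, here is a sketch of pair ("double-single") addition built on Knuth's error-free two-sum. Single precision is emulated on the CPU by rounding through `struct`, and the helper names f32, two_sum, and ds_add are made up for this example; a real GPU version would do the same steps in native float arithmetic, with care that the compiler does not optimise the error terms away.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Error-free addition: returns (s, e) with a + b == s + e exactly,
    where s is the rounded single-precision sum and e the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def ds_add(x, y):
    """Add two (big, little) pairs and renormalise the result."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


tiny = 2.0 ** -30
print(f32(1.0 + tiny) == 1.0)            # True: plain single precision loses it
print(ds_add((1.0, 0.0), (tiny, 0.0)))   # the pair keeps it in the "little" part
```

Multiplication and division need similar (somewhat longer) routines, which is where the libraries mentioned above come in.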
ECC is nice, but can be avoided in a number of applications by simply doing regular checkpointing and restarting failed computations. Again, this only works for certain applications, but can save a whole heap of $ for the ones that can.
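A bare-bones sketch of the checkpoint-and-restart idea, assuming the work splits into independent items whose results are JSON-serialisable. The function name and file layout are invented for illustration. Note that this only recovers from crashes; catching silent bit flips would additionally need some result validation, such as running an item twice and comparing.

```python
import json
import os


def run_with_checkpoints(work_items, process, checkpoint_path):
    """Process each item once, saving results after every item so that a
    crashed run can be restarted without redoing finished work."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # results from an earlier, interrupted run

    for item in work_items:
        key = str(item)
        if key in done:
            continue  # already computed before the restart
        done[key] = process(item)
        # Write to a temp file and rename, so a crash mid-write never
        # leaves a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done
```

Calling it again with the same checkpoint path after a node falls over picks up where the previous run stopped.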
If your target workload fits in both of these groups, you can assemble a cluster using high-end AMD gamer cards that will thrash any Tesla-based solution on performance/$ by a huge amount.
Speaking anonymously because of my employer, but I have built such systems for these kinds of applications.
The Cell, at the time of release, was mind-blowingly fast. Fastest chip around. But it didn't advance very far, and more conventional processors have now overtaken it.
If you imitate what distributed.net accomplished (and folding@home or others are currently accomplishing) just make a creative website detailing what you're doing and why people should give you their unused GPU/CPU cycles.
Otherwise 4x GTX 580 (as mentioned already) will destroy what you throw at it.
I think the fact that you mentioned Tesla and GTX in this article covers the unsaid statement of "we're using CUDA".
Or browse top500.org for a rental shopping list.
As someone who has done some GPU programming (specifically CUDA), be aware that there is more to the GPU parallelism model than just "lots of threads". Many embarrassingly parallel problems translate very poorly to CUDA. The primary things to consider are:
1. GPUs are *data parallel*. This means that you need to have an algorithm in which each and every thread will be executing the same instruction at the same time (just on different data). For a cheap way to evaluate it, if you can't speed up your program by vectorizing it then the GPU won't help. Of course, you can have divergent "threads" on GPUs, but as soon as you do you've lost all benefit to using a GPU, and have essentially turned your GPU into an expensive but slow computer.
2. Moving data onto or off of the GPU is *slow*. So if you can leave all the data on the GPUs and none of the GPUs need to communicate with each other, then this will work well. If the threads need to frequently globally sync up, you're going to be in trouble.
That said, if you have the right kind of data parallel problem, GPUs will blow everything else out of the water at the same price point.
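To put a number on point 1, here is a toy cost model of one branch in a lockstep group of threads. It is a simplification for illustration, not a description of how any particular GPU schedules work; the only hardware fact used is that NVIDIA warps are 32 threads wide, and the cycle counts are arbitrary.

```python
WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA hardware


def warp_cycles(takes_branch, cost_if, cost_else):
    """Toy cost of an if/else for one warp.

    takes_branch holds one boolean per thread. If every thread agrees, only
    one side runs. If they disagree, the warp runs both sides back to back,
    with the threads not on the current side masked off.
    """
    cycles = 0
    if any(takes_branch):
        cycles += cost_if
    if not all(takes_branch):
        cycles += cost_else
    return cycles


uniform = [True] * WARP_SIZE
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]
print(warp_cycles(uniform, 100, 100))    # 100
print(warp_cycles(divergent, 100, 100))  # 200
```

Nest a few data-dependent branches and the serialisation compounds, which is why "lots of independent serial jobs" is not automatically a good GPU workload.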
www.gumstix.com/store/product_info.php?products_id=247
Overo omap3s rock
They just claimed that it was mind-blowingly fast.
If this is a one-time thing, I assure you that EC2 is more cost effective than any hardware solution you can invent. If it's ongoing, it's still probably cheaper. Teslas work really well, but they're a bitch to code for, and they don't fix your I/O problems. Pair it with a Fusion-io perhaps?
Where you are essentially solving multiple simple equations over and over, we use 6 GPUs in a case. To a large extent these equations can be calculated separately (well, the boundaries are well defined). Vector math, and a lot of it, runs extremely parallel on GPUs.
If you're building a glorified web server, keep in mind that you will still be I/O bound. And just because you're on a gigabit LAN doesn't mean you will be pumping data out that fast.
If you are building a database engine, again, you're I/O limited. Even with the fastest of drives, you still have to manage the drives, and with databases there are all kinds of rules the engine has to enforce, etc.
If you are trying to build a multi-user system, you really want to separate the memory buses per CPU to the extent possible. I'm not sure where Opteron and Xeon technology has gotten to, but I'd be looking at Tyan motherboards to see what they support.
Only a few problems with the Cell: you needed IBM's development libraries (only a dozen or so anal probes to get those); the 'production Cell processors' had 6 'cell engines', while the PS3's had 5 (basically, errors and throwbacks from the fab); and they used one cell processor to lock down the PS3 so that you couldn't use the GPU for games or anything IBM or Sony didn't like. Oh, and the basic speed of the chip was like a typical Power processor, which couldn't deal very well with out-of-order execution (read: slow). As an example: I had a 1.8 GHz Pentium 4, and it was about 7 times as fast as this processor without the 'extra cell' engines running. You would need IBM's libraries to get anything out of the Cell engines. It was also important to restructure your code to deal with out-of-order execution. Best of luck with that.
There are some high-powered PCI cards filled with TI DSPs that you can get. Here's an article describing some of them. In terms of power efficiency per unit of work, the DSPs blow the doors off the main processor and the GPUs. Each DSP on the chip can do 16 single precision or 4 double precision floating point operations per cycle, at around 1GHz, and they're programmable in C/C++.
Relevant quote:
Buy 5 of these and you're only at 550W, $10,000 and 5 TFLOPs.
We've done a lot of testing of different GPUs to look at basic reliability: things like writing data to memory, waiting a while, and reading it back to see if any bits have spontaneously flipped. The conclusion is that on GTX boards, this really does happen. If you're doing production work where consistently getting the right result matters, you should stay away from them. On the other hand, we've never seen any memory errors on Tesla boards, even with ECC disabled. This might just be because Nvidia tends to clock their Teslas a little lower. Or maybe they test chips as they come off the assembly line, and ones with marginal results get sold as GTXs. But one way or another, there really is a difference.
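The write / wait / read-back loop described above is simple enough to sketch. This version only exercises host memory from Python, so it shows the logic of the check rather than an actual GPU test; the function names and the 0xAA pattern are arbitrary choices, and a real version would allocate the buffer on the card (e.g. via CUDA) and copy it back before comparing.

```python
import time


def count_flipped_bits(expected, observed):
    """Number of bit positions that differ between two byte strings."""
    return sum(bin(a ^ b).count("1") for a, b in zip(expected, observed))


def soak_test(buffer_size, wait_seconds, pattern=0xAA):
    """Write a known pattern, wait, read it back, and count flipped bits."""
    expected = bytes([pattern]) * buffer_size
    buf = bytearray(expected)   # write data to memory
    time.sleep(wait_seconds)    # wait a while
    return count_flipped_bits(expected, bytes(buf))  # read it back


print(soak_test(1 << 20, wait_seconds=1))
```

Running it with alternating patterns (0xAA, then 0x55) checks both states of every bit.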
"I'm too busy to research this and form an educated opinion, but I do have time to tell everyone my uninformed opinion."
There just was an actual competition, where various student teams worked out how to build mini-supers for this sort of workload in a restricted power envelope. Go do your homework and look at what they did, hmkay.
Well, yes. If I really wanted to be cool about it, I might consider going to Radio Shack and buying an Arduino. Then use the 4 outputs, plus a couple of shift registers, to make something that could program an 80c51XA. Then design my algorithm to go on those, plugged together such that they'd outperform even an Nvidia.
Or, even cooler, I might program the 80c51XAs in parallel, one being the calculations chip, and one handling all the i/o from one unit to the other. Then I could write a massively parallel program that downloaded to the one, and ran on all.
Or I might just sit here on slashdot and imagine doing something cool.
How about 40 of these? http://www.zotacusa.com/zbox-amd-e-350-apu-all-in-one-zbox-ad10-plus-u.html
This would give you 80x 1.6 GHz CPU cores, 80GB of RAM, and 3200 GPU pipelines @ 500MHz.
For a portable case use a plastic footlocker, the kind with wheels and a hinged lid. This hinged lid was key for me as it allowed me to attach a keyboard, trackball mouse, and small monitor.
On the inside I have two ATX motherboards with dual core Athlon 64s, though I could have used anything had I felt like spending the money. The worker node has two graphics cards and an extra NIC for regular network traffic (the onboard gigabit NIC is used for message passing). The head node has an extra WiFi NIC as well for talking with the outside world. There are then two switches, one for each internal network, and two hard drives off the head node. The worker node boots off a USB stick. I found Ubuntu installed from a live CD provides a nice, small OS.
It's a little cramped (the top sides of the motherboards face each other), but there's enough room for the power supplies to divide the space down the middle, with the switches and hard drives mounted above that and opposite of one another. Everything is held in place with L brackets, plexiglass, screws and spacers. Between Newegg (computer hardware), Amazon (keyboard, mouse, and monitor), and the Home Depot (box and mounting hardware) the whole project only cost about $1,000.
What's really nice is that there's room enough in the box for four ATX systems with expansion cards, or probably eight-ten mini-ITX boards if you wanted to go that route.
If you haven't already, add these sites to your research:
http://www.clustermonkey.net/
http://debianclusters.org/index.php/Main_Page
http://www.calvin.edu/~adams/research/microwulf/
They were extremely valuable to me.
It won't be particularly easy, but it will be fun and rewarding like no other, and it makes a great mobile monster to show off to your friends!
Heck, you'd be surprised how many projects have gotten people to volunteer to run such things. All you have to do is provide good uptime and statistics and people will come running! (Though a good project description helps too.)
Writing code for video cards is much more difficult than most people think. On the other hand, if it's really a lightweight, low-CPU task that's just insanely parallel, check out http://www.tilera.com/ They don't pack a ton of horses, but they do have a pile of cores.
Here's an example build for you with multiple GPUs:
http://fastra.ua.ac.be/en/index.html
You mention GPUs, but can you get the GPU solution up and running as quickly as the CPU solution? Optimised multi-GPU solutions are not that easy, as the programmer has to do all the heavy lifting.
Does the code vectorise? If it does, then I'd be tempted to go with as many dual socket Intel machines as you can. Are you able to use the Intel compiler (leveraging the MKL, IPP and IMF as much as possible)? This assumes that communication is low. You are not going to have the cash for a low latency, high speed interconnect. The Intel compiler is free for Linux and non-commercial use.
If the code doesn't vectorise, then I'd go for the recent quad socket, 16 core AMD setups and just go for blind horsepower.
Do you work for an organisation that already has big compute requirements and a system in place? Can you buy time there? Our installation runs at about a quarter the price of EC2 (and that includes people for installing software, configuring environments and providing compiler licenses). Can you contribute money to this group? $15K with us would get you 4 nodes of dual socket, 6 core Xeons, dual lane 256 bit wide registers and 2.8GHz clock .... 1.075TFlop. And access to the compilers to get very close to that theoretical performance easily.
There are all sorts of things that can steer the direction in any of the above and even others. Good luck.
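For anyone wondering where a figure like 1.075 TFlop comes from, it is just the product of the hardware counts. The sketch below reproduces it under one assumption that the post does not state: that "dual lane 256 bit wide registers" means 8 double-precision flops per core per cycle (4 doubles per 256-bit register, with one add and one multiply issued together).

```python
def peak_flops(nodes, sockets_per_node, cores_per_socket, clock_hz, flops_per_cycle):
    """Theoretical peak: every core retiring its maximum flops every cycle."""
    return nodes * sockets_per_node * cores_per_socket * clock_hz * flops_per_cycle


# 4 nodes, dual socket, 6 cores each, 2.8 GHz.
# ASSUMPTION: 8 double-precision flops per core per cycle (see above).
peak = peak_flops(4, 2, 6, 2.8e9, 8)
print(f"{peak / 1e12:.3f} TFLOPS")  # 1.075 TFLOPS
```

Real code only gets near this number if it vectorises well and stays out of main memory, which is the point of the vectorisation question above.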
http://fastra2.ua.ac.be/
As you mention Teslas, I guess using OpenCL with AMD could be an option?
Since the Fusion chips share memory (for better and worse) with the main CPU, you can apparently get faster (zero?) "transfer time" between CPU and GPU. Maybe with this method it'd also be possible to pump up the GPU cores with larger amounts of usable memory than usual?
And since they're cheap you could buy a bucketload of them.
What sorts of embarrassingly parallel algorithms are you looking at performing?
Would it make sense to instead look at implementing this algorithm in an FPGA or several FPGA's?
some keywords:
Hardware-Software co-design
high-level to hdl
Grid of greenarrays chips from greenarraychips.com
Set them up to pass each "thread" around a stage at a time, in a loop between those chips acting in concert as multipliers / adders, and those acting as memory controllers. Implement register-less (and thus context-less) fine-grained multithreading.
Will therefore handle "n" threads, limited only by memory size.
Organize overall topology to use the "instruction code" as an address to reach the op unit that instruction requires, and keep said operation units working close to peak rate.
Should get close to the 57.6 BIPS per chip, and at extremely low power usage.
Just to be clear, use the on-core instruction code just as "microcode", use multiple cores in parallel to handle wider data. (each core is only 18 bit native width).
Pass the "register" data in parallel with the operands, and switch between threads extremely rapidly.
Total "instruction cycle" is therefore many many "clocks", for one thread, but with very many threads you're effectively retiring one instruction per clock per available op unit. Single thread performance will suck, very-many threaded performance will be untouchable.
Performance per watt will also be orders of magnitude better than anything else.
Total power dissipation will be ~ 1 W per chip. (each chip with an array of 144 microcores, each capable of up to 400 MIPS, with internal registers and a small memory space - just use it for the microcode.).
When the chips go to a state of the art process, you'll be able to pick up another order of magnitude, because the rate will go from 400 to 4000 MHz.
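The per-chip numbers quoted above follow from simple multiplication. A minimal sketch, using only the figures from this post (144 cores, up to 400 MIPS each, roughly 1 W per chip); `chip_mips` is a made-up helper name, and these are peak rates, not measured throughput.

```python
def chip_mips(cores, mips_per_core):
    """Peak instruction rate of one chip if every core runs flat out."""
    return cores * mips_per_core

CORES_PER_CHIP = 144
PEAK_MIPS_PER_CORE = 400
WATTS_PER_CHIP = 1.0  # "~ 1 W per chip" as quoted in the post

per_chip = chip_mips(CORES_PER_CHIP, PEAK_MIPS_PER_CORE)
print(per_chip / 1000, "BIPS per chip")
print(per_chip / 1000 / WATTS_PER_CHIP, "BIPS per watt")

# The hoped-for 10x clock increase scales the peak rate by the same factor.
faster = chip_mips(CORES_PER_CHIP, 10 * PEAK_MIPS_PER_CORE)
print(faster / 1000, "BIPS per chip at 10x the rate")
```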
Good luck and don't try to credit me with anything! (otherwise the company I work for will want to own the patent, and I don't want a patent to choke this idea to death.)
-- Anonymous Coward.
Just because you mentioned ARM, perhaps you should look into Calxeda. I have no idea if their solution is well suited to your problem; it is a whole bunch of 32-bit cores in one box. Someone else already has a similar arrangement using Intel Atom.
Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.
I have built 4 and 8 GPU systems. For 8 GPUs the TYAN FT72B7015 is currently the only solution that I know of. Here are some product offers with this board: http://blog.renderstream.com/2010/11/renderstream-announces-12-tflop-systems/ The GeForce cards are fine, but since they are not built for 24/7 HPC use, most vendors will warn you about warranty issues if something breaks in such a system. But they are cheap: just put 2 additional cards on the shelf next to the system and replace as needed. They get extremely hot, so consider in advance how to cool such a beast. The cards are also considerably faster than the Tesla solutions. If you need raw performance, ECC will slow you down, so see if you can do without.
You may be able to buy hardware more cheaply, but you're not going to beat Amazon on overall cost, once you take even minimal maintenance, power, server room space, etc. into account. You may be able to save money over EC2 by putting in your own labor, just realize that this can be a lot of work.
1. If you don't like it, don't click on the link to read it.
2. This is doing your homework, or at least partially. Think of it as a distributed/remote brainstorming session. Brainstorming is about throwing up a whole bunch of solutions, not evaluating any of them as they are suggested, no matter how silly they may seem initially.
3. The actual competition is a useful brainstorming contribution. A link or even a name that could be searched on would be rather more useful.
My favorite boxes currently are HP-G7 with 2x 640 GB Fusion-IO and some hard disk. The data is in partitioned tables (MySQL 5.5), and I run for example 20 SELECT (partition-optimized) queries in parallel on them.
This is much faster (for me) than, for example, a Hadoop cluster, mainly because the data is already where I query it. For me the copying of big data sets is the main bottleneck.
My aggregations normally finish in 1-5 minutes. One box does 60 GB per minute, 4 boxes do 90 GB per minute, so no big win by adding boxes.
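To put "no big win by adding boxes" into numbers, here is a minimal sketch that turns the two throughput figures above (60 GB per minute on one box, 90 GB per minute on four) into speedup and scaling efficiency. The function name is hypothetical.

```python
def scaling(throughput_one, throughput_n, n):
    """Return (speedup, efficiency) of n boxes relative to a single box."""
    speedup = throughput_n / throughput_one
    efficiency = speedup / n  # 1.0 would be perfect linear scaling
    return speedup, efficiency

speedup, efficiency = scaling(throughput_one=60, throughput_n=90, n=4)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

Four boxes give only 1.5 times the throughput of one, which fits the point that moving the data, not the compute, is the limit here.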
Those would total to about $10000 (USB networking?) or $14000 with ethernet :)
Also need to pay for: :D
- about 40 dumb switches or USB hubs (get a quantity rebate, left-over stock, fire sales or similar)
- some USB harddrives for the data (or send it somewhere on the internet). 2TB USB drives are about $120 where I live and things are usually more expensive here.
- keyboard (dumpster dive or splurge $30-40)
- mouse (dumpster dive or splurge along with the keyboard)
- monitor (dumpster dive or splurge about $110. I just bought a 24" AOC LCD monitor with tons of connection options (VGA, DVI-D/HDMI, USB) with the Raspberry Pi in mind for about $110; it was on a special sale and was the cheapest available, regardless of size, anywhere where I live)
- perhaps one or two cheap shelves (clay bricks and boards should do) although I would be more inclined to just hang them all on string from the ceiling! XD
- maybe a few consumer fans? Could get messy with everything hanging down
- wiring (if you're using ethernet).
2-3 months to delivery and you would have to contact the R-Pi people and pay up front for a special batch.
This might all be impractical but it sure would look impressive and insane XD
http://www.flickr.com/photos/denix0/5702439532/
http://www.gumstix.com/press/Gumstix-strongbox-TI-Tech-Days-2011.pdf
Differences between GeForce and Tesla for compute are at:
http://www.nvidia.com/object/why-choose-tesla.html
Bottom line, GeForce is great for development, but if you want to build a cluster with GPUs,
you are better off using the commercial grade Tesla GPUs.
Sumit
NVIDIA Tesla Group
There is now a new, easier way to program GPUs using directive-based compilers.
The idea is that you add some high-level pragmas to your C or Fortran code that a parallelizing compiler
uses to map it to the GPU accelerator. Of course, you have to expose parallelism in the code for
the compiler to do a decent job. For example, use more data-parallel data structures. But this is a nice
incremental way to take advantage of the GPU.
Check it out at:
http://www.nvidia.com/object/tesla-2x-4weeks-guaranteed.html?cid=dev
Sumit
NVIDIA - Tesla Group
What about Gumstix?
There was a presentation of a cluster of them at some conference on YouTube. Can't help with finding the URL.
https://www.gumstix.com/store/product_info.php?products_id=261
https://www.gumstix.com/store/product_info.php?products_id=247
Tyan looks like a Supermicro competitor, but their stuff is always under-engineered and falls apart. Supermicro FTW!
Social Credit would solve everything...
Given that it is an "embarrassingly parallel application" (it's distributed computing, not parallel computing), things like CUDA, OpenCL or MPI are excluded.
First of all, you must make clear to yourself whether you need threads of execution or cores for processing.
Price-wise, I would recommend multiple dual-socket AMD servers: best price per core, and if your analysis scales linearly with cores, it is the best choice.
For administering such nodes I would recommend rocksclusters.org (the underlying job scheduler can be Torque, Condor or SGE).
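Since the work items are independent, one simple way to spread them over such nodes is to give each scheduler task its own slice of the input list. A minimal sketch, assuming the scheduler hands each task a 0-based task index; `partition` is a hypothetical helper, not part of Rocks, Torque, Condor or SGE.

```python
def partition(n_items, n_tasks, task_id):
    """Return the range of item indices that task `task_id` should process.

    Items are split as evenly as possible; the first (n_items % n_tasks)
    tasks each get one extra item. No task needs to talk to any other.
    """
    if not 0 <= task_id < n_tasks:
        raise ValueError("task_id must be in [0, n_tasks)")
    base, extra = divmod(n_items, n_tasks)
    start = task_id * base + min(task_id, extra)
    size = base + (1 if task_id < extra else 0)
    return range(start, start + size)

# Example: 10 independent work items spread over 4 tasks.
for task in range(4):
    print(task, list(partition(10, 4, task)))
```

Each task would run the same script and only differ in the index it is given, which is why nothing like MPI is needed for this kind of workload.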