Ask Slashdot: Parallel Cluster In a Box?

AMD by Anonymous Coward · 2011-12-03 05:35 · Score: 2, Informative

Why not use AMD and OpenCL?

Re:AMD by speckman · 2011-12-03 06:01 · Score: 1

Yeah. AMDs do more for cheaper, unless you really need the double precision. But you could get up to 10 boxes with 4 AMD cards apiece for that money. Although, are you saying that you need half a gig of RAM for a single thread? If so, GPUs are not the way to go.
Re:AMD by tempest69 · 2011-12-03 06:15 · Score: 4, Insightful

Because it's new, and finding someone who's done it to get some pointers is really hard.
CUDA has been around a while, figuring it out isn't such a rough learning curve.

Overall I'm a little suspicious of someone looking to use a GPU for more threads on a problem, as going the GPU route is a really committed step and the programming gets a new level of complicated. Using multiple cards has some odd issues in CUDA, e.g. if you exceed the card index it defaults to card 0 rather than crashing. There are more places to screw up with a GPU: transferring memory; getting blocks, threads, and warps organized (if done properly it hides all sorts of latency in calculations, done poorly it's worse than a CPU); and avoiding memory contention (the memory scheme isn't bad, but it needs to be understood).
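
One way to guard against that kind of silent fallback is to validate the requested index before handing it to the runtime. The sketch below is plain Python and does not call any real CUDA API: `select_device` and `device_count` are hypothetical names, and the count would have to come from whatever GPU runtime is actually in use.

```python
def select_device(requested: int, device_count: int) -> int:
    """Return the requested device index, or raise if it is out of range.

    Hypothetical helper: device_count is just a plain integer here; in a
    real program it would be queried from the GPU runtime.
    """
    if device_count <= 0:
        raise RuntimeError("no GPU devices available")
    if not 0 <= requested < device_count:
        # Fail loudly instead of quietly running everything on device 0.
        raise IndexError(
            f"device {requested} requested, but only {device_count} present"
        )
    return requested
```

Failing loudly here is a design choice: a job that quietly lands on card 0 looks like it works, but leaves the other cards idle.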

So in most cases I'd first start with this chart http://www.cpubenchmark.net/cpu_value_available.html and tell them to cut their teeth on a GPU with a smaller (cheaper) test case.
Re:AMD by GameboyRMH · 2011-12-03 07:19 · Score: 1

Because it's new, and finding someone who's done it to get some pointers is really hard.
CUDA has been around a while, figuring it out isn't such a rough learning curve.
On the downside, you're stuck with NVidia GPUs forever (or until they decide to drop CUDA, although I'll admit that's unlikely).

--
"When information is power, privacy is freedom" - Jah-Wren Ryel
Re:AMD by Anonymous Coward · 2011-12-03 07:51 · Score: 2, Insightful

Why not use AMD and OpenCL?
Sure, use two AMD 6990s with 3072 stream units each, for a total of 6144 ALUs per box (DP FPU) with OpenCL 1.1.
Cost about $2500 per box! $700 per card plus $1000 for a CPU system with 1000W PSU.
Re:AMD by Anonymous Coward · 2011-12-03 10:20 · Score: 2, Insightful

That's why you would use OpenCL instead. It's a bit newer, and is still a little rough around the edges, but it works on CPUs and GPUs, and in Windows or 'nix.
Re:AMD by Anonymous Coward · 2011-12-03 10:22 · Score: 0

It's not that new. Newer than CUDA, yes, but not so new as to be an immature technology. As for people that have used OpenCL, there are plenty of projects out there that use it. Pyrit is a good example of someone using OpenCL.
I would suggest AMD as well. You will get more cores for your money. Not to mention, OpenCL is hoping to be an open and cross-platform standard as opposed to CUDA being a proprietary Nvidia technology.
Re:AMD by sneakyimp · 2011-12-03 11:01 · Score: 3, Interesting

I wonder if QuantumMist must take into account the cost of development. To say that the application is "embarrassingly parallel" and at the same time that "memory requirements are decently low" suggests that s/he has an existing application that has been run on some box and perhaps belies a bit of ignorance about the nature of parallelism. Last time I checked, more threads required more memory. If the plan is to get the maximum number of threads possible, the amount of memory required could vary enormously. Additionally, the nature of the parallelism is not discussed. What does each thread do? If it's not something a GPU does then GPUs are not going to help. Also, will a GPU even fit in a 1U box that already contains a server? I doubt it.
In my very limited experience in writing multithreaded code, I have found that simply increasing the number of threads spawned doesn't necessarily equate to better performance. On the contrary, spawning too many can bring your application to a halt as an enormous number of threads vie for limited resources (network, disk, memory) and your application gets nothing done because it's too busy context switching between a huge number of resource-starved threads that do nothing while the threads that hold the resources never get scheduled to do valuable work.
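
A minimal sketch of the usual way around this: cap the number of workers at the core count instead of spawning one thread per task. The helper name `run_bounded` is made up for illustration. It uses threads to keep the example short; for CPU-bound work in Python, a process pool would be the more likely choice.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_workers=None):
    """Run zero-argument callables on a pool capped at the core count.

    Results come back in the same order as the tasks were given.
    """
    if not tasks:
        return []
    cap = max_workers or os.cpu_count() or 1
    # Never start more workers than there are tasks or cores.
    workers = max(1, min(cap, len(tasks)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # 1000 small jobs, but only as many live threads as there are cores.
    jobs = [lambda n=n: n * n for n in range(1000)]
    results = run_bounded(jobs)
    print(len(results), results[:5])
```

The queue holds the waiting work, so the scheduler only ever juggles a handful of threads regardless of how many tasks are submitted.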
I'd also like to point out that simply buying GPUs doesn't mean your application will suddenly spawn an ability to take advantage of even one GPU. The software development effort required to add GPU detection and utilization could easily chew up that $10-15k budget in no time.
If QuantumMist already has this application written and it's running but NOT GPU-enabled, then the best approach might be to just get the hottest multi-socket traditional CPU machine s/he can afford built on a dual LGA 1366 mobo or quad G34 mobo. Or, depending on the nature of this parallelism, it might be better to budget for some CUDA software development and a machine with a couple of GPUs.
Re:AMD by paulatz · 2011-12-03 11:26 · Score: 1

Setting up 10 boxes with 4 cpus each is just ridiculous; don't forget you'll have to maintain each of them forever. Actually, for about the same price you can buy an IBM server with 4 sockets and 8 cores in each socket (32 cores total), with a couple of GB of RAM per core. Maybe even two of those servers, if you can get a good deal and are satisfied with a bit less RAM and a slightly slower cpu.
Talking about mixed cpu/gpu, it is very fashionable at the moment, but it is expensive and requires specific programming: in the end a gpu is just a cpu, just optimized to perform better in some very specific cases (and much worse in others). I have the feeling it is just a buzzword used to make you feel your hardware is not adequate, but only time will tell.

--
this post contain no useful information, no need to mod it down

Helmer is a cheap way to get there by Anonymous Coward · 2011-12-03 05:39 · Score: 0, Funny

Sorry for just the link, but you could one off something like this http://helmer.sfe.se/

Grey boxes by Anonymous Coward · 2011-12-03 05:40 · Score: 1

Spec out a quad-core AMD grey box with 4 gigs of ram (I saw 4 gigs of DDR3 RAM for $20 the other day). That shouldn't run you more than $400 a pop.

For 10K, you'll get 10,000/400*4=100 threads of execution.
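
The same back-of-the-envelope sum as a small sketch, using only the figures from this comment ($10,000 budget, $400 per box, 4 cores per box). The function name is made up, and it ignores networking, racks, and anything else a real build would need.

```python
def threads_for_budget(budget_usd, box_cost_usd, cores_per_box):
    """Return (boxes, threads) affordable within the budget."""
    boxes = budget_usd // box_cost_usd  # whole boxes only
    return boxes, boxes * cores_per_box


boxes, threads = threads_for_budget(10_000, 400, 4)
print(f"{boxes} boxes, {threads} threads of execution")
```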

Re:Grey boxes by tomhudson · 2011-12-03 06:22 · Score: 1

Look who's the idiot - the article says they aren't paying for the power.
Re:Grey boxes by tomhudson · 2011-12-03 09:02 · Score: 1

1. Lower failure rates mean less, not more, maintenance expenditures.
2. The more robust general (non-gpu-based) system can handle those failures better because the workload is distributed over a greater number of cores
3. The more robust system can also handle any future workload that doesn't translate easily into a gpu-based solution
4. Electricity was specifically not an issue - it was someone else's cost - which you failed to realize because you always post stupidity. Who's to say that maintenance isn't also someone else's cost? So don't make up an issue where there is none.
5. anyone looking at my profile will see what it stands for, you ignorant clod!
6. you're still smarting from the smackdown clone54321 gave you last year? You really are pathetic.
Re:Grey boxes by tomhudson · 2011-12-03 11:33 · Score: 1

If the failure rate is half, but the mtbf is much higher than the lifetime of the project, nobody is going to care. Most computer equipment has a half-life of 3 years or less - kind of like fluorescent tubes.
For a dim-bulb like you, I'll make it simple: Pretend I have a huge warehouse that I need illuminated for the next year. I can buy a few very expensive LED arrays (and still end up with shadows), or a ton of cheap fluorescents, and get complete coverage.
As per the article, I don't care about electrical costs. Maintenance? If a cheap bulb burns out, leave it be - there are a few thousand more still running (I said it was a HUGE warehouse, right?), nobody will notice. The LED array? One gone, I've got a huge blacked-out area, and I'm going to have to shell out $$$ for new equipment, plus down time. Some guy whacks one with a forklift - big bucks.
At the end of the year, most of the lightbulbs are still working. But now I've decided to partition the warehouse (the equivalent of changing the computing workload) ... and there's no way that the LED arrays can be positioned properly to illuminate every area, so I have to go and buy some more. The lightbulbs? No need for new ones, and any dimwit - maybe even you - can replace a fluorescent tube.
Re:Grey boxes by tomhudson · 2011-12-03 12:28 · Score: 1

Nowhere has it been stated that the person posting the question is responsible for ongoing maintenance. Quit making up "what if" scenarios - just like you wrongly claimed that it would cost them more electricity when the summary itself says that someone else is paying for the juice, so it's not a concern.
Re:Grey boxes by tomhudson · 2011-12-03 15:17 · Score: 1

> "what if scenarios do not require creation."
They most certainly do! Your creating what-if scenarios that have no basis in TFA was just as lame as your attempt to say it wasn't practical because of the higher cost of electrical consumption, when TFA made it clear that electricity use wasn't a consideration. Trying for stupid post of the year award?
You just don't like that a woman caught you on your original mistake (not noticing that the original article specifically said to ignore electrical consumption). Again.
Re:Grey boxes by tomhudson · 2011-12-03 17:32 · Score: 1

Obviously you missed another point in TFA - that the original poster was concerned about the higher rates of failure because of ... wait for it ... overheating due to running multiple GPUs. As other posters have pointed out, they can get a much lower thermal concentration using other solutions that permit better cooling.
So, not only did you miss the part where they said that electricity use wasn't an issue - you also missed the poster's concerns about thermal load. Did you read ANYTHING except the headline?
Because if you did, you need to work on your reading skills.
Re:Grey boxes by tomhudson · 2011-12-03 17:36 · Score: 1

Probability of catastrophic failures is lower with enough discrete boxes (which, as other posters pointed out, works out to be cheaper as well as more flexible - they even give the price breakdown). Try to keep up. You're looking even stupider than you usually do.
Re:Grey boxes by tomhudson · 2011-12-04 09:01 · Score: 1

The only cost factor mentioned was acquisition. And look at how many different accounts you have to post under as each one gets mod-bombed to h*** because everyone knows you've been a jerk for years.
Re:Grey boxes by tomhudson · 2011-12-04 09:15 · Score: 1

Three points:
1. The total amount of heat generated does not directly correlate with the amount of cooling needed. It's the concentration in hot spots that matters.
2. Why would I, or anyone else "cower in your shadow"? This is the Internet - you don't HAVE a shadow. And again, why? Do you "get off" on cyber-stalking or something? Does it "embiggen" you to think you're scaring women with something other than your face? Did some woman dump you and now you're getting back at her vicariously? Stay tuned for the answers to these and other questions as we continue the Internet version of "As The Stomach Turns".
3. Anyone who reads my profile can easily find out what "tom" is shorthand for. It doesn't mean what you think it means - come to think of it, you do seem to have that problem with much of the English language. The alternative is that you read it and are being your usual annoying self. Either way, nobody cares. Except you. Oh wait - you're nobody, nobody cares, so YOU care. Got it!
Re:Grey boxes by tomhudson · 2011-12-04 09:18 · Score: 1

BTW - the 1st one doesn't count, since it was about TFA, and this stopped being about TFA long ago.

Nothing special by Anonymous Coward · 2011-12-03 05:40 · Score: 2, Informative

Just put a bunch of GTX cards in a nice, big server case with enough fans. You are hardly going to find any cheaper alternative.
When choosing cards, look for tests like this one:
http://www.behardware.com/articles/840-13/roundup-a-review-of-the-super-geforce-gtx-580s-from-asus-evga-gainward-gigabyte-msi-and-zotac.html
The IR thermal photos are great when choosing a well-cooled card.
Also use SW to control the card fans and keep them running at 100% fan speed.
Noisy? Yes. But who cares, unless you plan on putting it in your bedroom.
You can easily keep these cards at ~70C with full load.

Re:Nothing special by TWX · 2011-12-03 06:13 · Score: 4, Informative

It would have been nice if he'd given us more information about the form factor he needs to put this into. Since the client isn't paying the electric or cooling bill, I have to assume that it's colocated, so there might be some real rack unit restrictions that prevent this from working well. It also would have been nice to know storage demands, as there are tradeoffs in front-accessible drive arrays for cooling and airflow purposes. Most of the cases with tons of hot-swap drives in front lack good front ventilation. If he only needs a few drives, that opens him up to a simple 3U or 4U chassis with a mostly open-grille front to make airflow a lot less restrictive.

--
Do not look into laser with remaining eye.
Re:Nothing special by Ruie · 2011-12-03 08:01 · Score: 1

Just put a bunch of GTX cards in a nice, big server case with enough fans. You are hardly going to find any cheaper alternative.
That's actually pretty hard to do as you need a motherboard with lots of multiple-lane PCIe connections.
Re:Nothing special by SuricouRaven · 2011-12-03 08:12 · Score: 1

I recall it is possible to fit a 16x card in a 1x slot (Obviously at 1x performance), but this requires the card be hacked. Literally. With a hacksaw. All the power and essential control lanes are at the front, and if 15 of the 16 data lanes are not connected then the card will simply not use them.
Re:Nothing special by RulerOf · 2011-12-03 09:07 · Score: 1

It's a little easier to gouge out the back of the slot ;)

--
Boot Windows, Linux, and ESX over the network for free.
Re:Nothing special by ckaminski · 2011-12-03 09:40 · Score: 1

When building stuff like this you always put the "big" storage separately. Compute units do computing, and cache to SSD then store to the big pappy.
Re:Nothing special by SuricouRaven · 2011-12-03 10:02 · Score: 1

True, but then you'll probably run into other obstructions on the motherboard.
Re:Nothing special by Anonymous Coward · 2011-12-03 10:36 · Score: 0

Easier still to tap these guys:
http://cablesaurus.com/ ...for some 16>1 (or 4 or 8) PCIe slot converter cables.
Is it my imagination, or would the OP best be served by perusing the Mining sub-forum at bitcointalk.org ? Seems like he's reinventing the wheel if he doesn't...
Re:Nothing special by Ruie · 2011-12-03 11:34 · Score: 1

I recall it is possible to fit a 16x card in a 1x slot (Obviously at 1x performance), but this requires the card be hacked. Literally. With a hacksaw. All the power and essential control lanes are at the front, and if 15 of the 16 data lanes are not connected then the card will simply not use them.
Impractical. GPU cards have issues with bandwidth to the host anyway; cut it to 1x and you will be much better off with a plain multicore system.
Re:Nothing special by SuricouRaven · 2011-12-03 13:19 · Score: 1

Depends on the task. In games, yes, they have issues with bandwidth. But in GPGPU? Very task dependent. There are some functions, like cryptographic brute forcing, for which the transfer from host to GPU is negligible.
Re:Nothing special by Anonymous Coward · 2011-12-04 14:10 · Score: 0

If he is not paying power or cooling, I would assume a university.
Re:Nothing special by HappyPsycho · 2011-12-05 01:23 · Score: 1

If it is a decent installation, could liquid cooling be an answer?
From the looks / sounds of it we are looking at around a full rack of equipment.

SuperMicro MicroCloud w/ 8 NVidia GPUs? by Anonymous Coward · 2011-12-03 05:41 · Score: 2, Interesting

If the off-the-shelf GTX cards work, you'd have 8 * Xeon + 8 * NVidia GPUs in 3U, all entirely parallel (i.e. 8 separate machines) to avoid the main CPUs being any kind of bottleneck. Stock each node w/ 2GB of RAM on the cheap and some cheaper SATA drives, you'd likely end up under $10k for the whole thing and have an 8-node cluster you can use for other tasks later.

I've noticed that "embarrassingly parallel" tasks, if you take the low-hanging fruit too far, end up running into some other unforeseen bottleneck. Thus me suggesting something faux-bladeish instead.

Re:SuperMicro MicroCloud w/ 8 NVidia GPUs? by Lord_Naikon · 2011-12-03 06:00 · Score: 1

Good advice, because I've found out in practice that unless the problem set can be divided up in CPU core cache size blocks (i.e. 512k or less), memory bandwidth is going to be the major bottleneck. More machines = more memory bandwidth.
Re:SuperMicro MicroCloud w/ 8 NVidia GPUs? by fuzzyfuzzyfungus · 2011-12-03 06:42 · Score: 1

The one potentially tricky thing with that particular machine might be the graphics cards: PCIe x8, low profile, is not going to help your search for a high end GTX that will fit...

Unless he is heavily space constrained, he should probably take your advice on specs; but in 1 or 2U cases where getting a double-wide, full profile, PCIe x16 card installed will be easier.

PS3 by History's+Coming+To · 2011-12-03 05:41 · Score: 2, Interesting

PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.

--
Please consider this account deleted, I just can't be bothered with the spam anymore.

Re:PS3 by Anonymous Coward · 2011-12-03 06:01 · Score: 2, Informative

I wouldn't give Sony a dollar of my business if they had the cure for cancer and I was a week away from death.
Re:PS3 by Anonymous Coward · 2011-12-03 06:07 · Score: 0

OS problems here. Newer PS3s won't allow a direct install of Linux.
Re:PS3 by Anonymous Coward · 2011-12-03 06:15 · Score: 5, Informative

PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
Back-of-the-envelope comparison of PS3 and GTX:
A cluster of three PS3s: 920 GFLOPS. Price: about $800.
A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.
Each of those GTX cards also has significantly more memory than the PS3, and is cheaper to develop for.
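
For what it's worth, the two ballpark figures above work out like this per dollar. This is only a sketch using the numbers quoted in this comment; they are rough estimates, not benchmarks.

```python
# Ballpark figures quoted in the comment above (not measured values).
configs = {
    "cluster of 3 PS3s": {"gflops": 920, "price_usd": 800},
    "PC with 3 GTX 460s": {"gflops": 2200, "price_usd": 800},
}


def gflops_per_dollar(cfg):
    """Peak GFLOPS divided by purchase price."""
    return cfg["gflops"] / cfg["price_usd"]


for name, cfg in configs.items():
    print(f"{name}: {gflops_per_dollar(cfg):.2f} GFLOPS per dollar")
```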
Re:PS3 by History's+Coming+To · 2011-12-03 06:26 · Score: 1

Each of the GTX cards needs a motherboard, power supply etc, you get those thrown in with the consoles. Yup, there may be significant OS issues, and I've not done the sums on flops per dollar, so it may be a dumb idea...just throwing it into the mix.

--
Please consider this account deleted, I just can't be bothered with the spam anymore.
Re:PS3 by CronoCloud · 2011-12-03 06:28 · Score: 1

A cluster of three PS3s: 920 GFLOPS. Price: about $800.
Less than that, because they would have to buy older used PS3s, CECHA/CECHB/CECHE models, and they'd need to have pre-3.21 firmware. Difficult but probably cheaper.

A PC with 3 GTX 460 cards: 2200 GFLOPS. Price: about $800.
Wouldn't the 460s alone cost about 500, let alone a motherboard that can handle 3 of them, and a good power supply and cooling? I think that PC total is estimating a bit low.
Perhaps the poster could do both, because some calculations might work better on the PS3s and some on the 460s. Even in Folding@home, there are still calculations the PS3s do better than the GPU clients, because they're more versatile, as they say, taking the middle path between the CPU and GPU clients.
Re:PS3 by Anonymous Coward · 2011-12-03 06:37 · Score: 1

Each of the GTX cards needs a motherboard, power supply etc, you get those thrown in with the consoles.

No, not "each". 3 cards can fit in one PC. The ballpark price included the cost of the other parts.
Re:PS3 by bmsleight · 2011-12-03 06:52 · Score: 1

Don't you realise that Sony would make a loss on this?
If you buy lots of subsidised PS3s and then do NOT buy the games, they are worse off.
Re:PS3 by Khyber · 2011-12-03 07:08 · Score: 1

Umm, the PS3 has a theoretical performance of 2TFLOPS, EACH.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:PS3 by Surt · 2011-12-03 08:04 · Score: 1

Cards:
http://www.newegg.com/Product/Product.aspx?Item=N82E16814162058&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Video+Cards-_-Galaxy-_-14162058
x3 = $360.
Motherboard:
http://www.newegg.com/Product/Product.aspx?Item=N82E16813128495
$114
Power supply:
http://www.newegg.com/Product/Product.aspx?Item=N82E16817152044
$144
CPU can be less than $50 if he really doesn't need the cpu to do much of anything.
So far I'm at $668. Probably have to buy a box to put it in for $50.
So now I'm at $718. What shall I buy with my $82?
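
The running total above, as a quick sketch. Prices are the ones listed in this comment; the CPU is entered at its stated $50 upper bound and the $800 budget comes from the estimate being answered.

```python
# Prices as listed in the comment above, in USD.
parts_usd = {
    "3x graphics card": 360,
    "motherboard": 114,
    "power supply": 144,
    "CPU (upper bound)": 50,
    "case": 50,
}
BUDGET_USD = 800

total = sum(parts_usd.values())
remaining = BUDGET_USD - total
print(f"total ${total}, remaining ${remaining}")
```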

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:PS3 by Surt · 2011-12-03 08:17 · Score: 1

But actual performance is apparently drastically lower:
http://en.wikipedia.org/wiki/PlayStation_3_hardware
PlayStation 3's Cell CPU achieves a maximum of 230.4 GFLOPS in single precision floating point operations and 100 GFLOPS double precision.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:PS3 by Anonymous Coward · 2011-12-03 08:29 · Score: 0

RAM.
Re:PS3 by Surt · 2011-12-03 08:47 · Score: 1

Perfect, right on budget!

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking
Re:PS3 by Khyber · 2011-12-03 08:52 · Score: 1

It is only lower because of hypervisor restrictions.
Unfettered single-point leveraging the entire system (including the GPU) you can get around 1.3-1.5TFLOPS practical.
The issue, again, is the hypervisor.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:PS3 by QuantumRiff · 2011-12-03 10:11 · Score: 3, Informative

Actually, you can run up to 16 PCIe slots in an external chassis for heavy processing:
http://www.dell.com/us/business/p/poweredge-c410x/pd

--

What are we going to do tonight Brain?
Re:PS3 by Gr8Apes · 2011-12-03 10:15 · Score: 1

4 years of losses and they're still around. Talk to me when they actually disappear.

--
The cesspool just got a check and balance.
Re:PS3 by uvajed_ekil · 2011-12-03 11:56 · Score: 1

You obviously didn't understand the previous comment, so I'll rephrase it: Sony loses money on PS3 hardware, and makes up the loss (or so they intend to) on game sales. They profit from games and accessories, lose money on hardware. I don't know where the actual numbers fall, and whether the PS3 business as a whole is profitable, but Sony would not be trying if they only sold the game consoles, which are effectively subsidized. Talk to me when you improve your reading comprehension.

--
This is a hacked account, for which the owner can not be held responsible.
Re:PS3 by johanatan · 2011-12-03 12:01 · Score: 1

Actually, that's probably not true this far into the lifetime of the current gen console.
Re:PS3 by Gr8Apes · 2011-12-03 12:37 · Score: 1

Yes, I'm well aware of the concept - we'll buy lots of their hardware to do something other than support them, thus hurting them because they lose $10/unit.
The problem is this activity still adds to their perceived marketshare and boosts their efforts and also reduces stock for items they're building anyways, and it also hurts their competition by reducing their revenue, demand, and perceived marketshare.
Buy someone else's hardware, and support them, rather than reducing Sony's potential losses. After all, if the units sit on the shelves in warehouses, they will have even bigger losses since they have to make 'x' anyways to support their factory(ies)

--
The cesspool just got a check and balance.
Re:PS3 by catmistake · 2011-12-03 15:27 · Score: 1

PlayStation 3s have proved a cost efficient way of setting up large scale parallel processing systems. Of course you'll have to find your way around Sony's blocks on the OtherOS system, and you'll need to keep it off the internet or firewalled in some way, but you essentially get cheap processing subsidised by the games that you don't need to buy.
It does have a conspicuously high price/performance ratio, but if you use it for a cluster, you won't be able to play any games. I'm pretty certain Sony locks PS3 clusters out of their gaming network, for reasons unknown to anyone but themselves.

--
The Admin and the Engineer
Re:PS3 by Anonymous Coward · 2011-12-03 16:09 · Score: 0

Sony has been making money on PS3s since before the slim came out. They make quite a bit on them now. But way to hold to 4 year old pricing.
Re:PS3 by Anonymous Coward · 2011-12-03 17:02 · Score: 0

Whoa there Out Of Touch Boy. Sony has been making money on PS3s since the Slim came out. Everything they are selling now makes a profit.
Re:PS3 by CronoCloud · 2011-12-04 02:26 · Score: 1

Hard drive, you might want an SSD for performance, and DVD drive.
Re:PS3 by c0nner · 2011-12-05 08:58 · Score: 1

That bad boy is just silly... The chassis is expensive, they only support a limited number of cards in it (when purchased from them with the special sleds), and it can only be connected to a limited number of machines. Penguin has a more interesting offering with the Relion 4708a. They can stuff 8 GTX 580s in the case with dual CPUs and a decent amount of memory for 15k or less. You can't even buy the Tesla cards for that much. So unless you need the performance advantages of the Tesla, it is a pretty nice solution, with a crazy amount of air being pushed through the case, but it isn't something you would want to put under your desk.
I have both the C410x (paired with the C6100) and the Penguin solution, and the individual boxes are much easier to deal with and were 1/4 the price, though admittedly with the Dell solution we had to get Tesla cards because that is what they support in the chassis.

here is what i did by Anonymous Coward · 2011-12-03 05:41 · Score: 0

https://sites.google.com/site/jimerickso/home/new-build

can you write GPU code? by zeldor · 2011-12-03 05:42 · Score: 5, Insightful

Do you or they know how to program on a GPU?
If it's really embarrassingly parallel, EC2 spot instances and the GNU program 'parallel' will work quite nicely.
But if coding changes are required then the hardware is the least of your expenses.

--
If I could walk that way I wouldnt need cologne.

Re:can you write GPU code? by woodhouse · 2011-12-03 07:42 · Score: 1

Exactly. Unless the user has some experience in CUDA/Compute shaders/OpenCL, just shoving cards in there doesn't really solve the problem.
Re:can you write GPU code? by human+spam+filter · 2011-12-03 08:26 · Score: 1

Also, whether you will get a significant speedup by using the GPU really depends on the algorithm. Some algorithms may not even be possible to implement for the GPU (due to limitations of CUDA, OpenCL etc.).

I suggest by denshao2 · 2011-12-03 05:43 · Score: 0

Amazon EC2.

Re:I suggest by elsurexiste · 2011-12-03 05:53 · Score: 1

He specifically said it's too expensive. RTFQ :P

--
I rarely respond to comments. Also, don't ask for clarifications: a brain and Google are faster, believe me!
Re:I suggest by Anonymous Coward · 2011-12-03 05:54 · Score: 0

Guess someone didn't read the summary. EC2 was explicitly stated as being too expensive, and looking for a self-hosted instance.
Re:I suggest by Anonymous Coward · 2011-12-03 07:14 · Score: 0

Congrats, retard, you didn't read the question. EC2 is too expensive for hefty computational tasks. Cheaper to buy hardware that they can use as much as they want for free.

Die of heat? by TheSHAD0W · 2011-12-03 05:43 · Score: 2

> Should I just stuff four GTX cards in a server and replace them as they die from heat?

It'd be more cost-efficient to improve the air flow or add liquid cooling. Yay mineral oil baths.

AMD graphics cards by Anonymous Coward · 2011-12-03 05:48 · Score: 0

Radeon HD 5800-5900 series supports FP64
Radeon HD 6900 series supports FP64

3k - 64cores + 54+GB of ram. by Anonymous Coward · 2011-12-03 05:48 · Score: 4, Interesting

You can easily build a 64-core 1U system with Opterons using the quad-socket setup, or a 128-core one using the quad-socket-with-extension setup, which will only run you about 5k. These are 128 general-purpose cores at 2GHz+: you don't have to change the program to run on them, and you don't need to obfuscate things as you would when programming for GPUs... Or you can wait for Knights Corner, or get the Tile64s.

Re:3k - 64cores + 54+GB of ram. by Anonymous Coward · 2011-12-03 06:00 · Score: 0

Where the heck are you shopping?
Re:3k - 64cores + 54+GB of ram. by Anonymous Coward · 2011-12-03 06:13 · Score: 2, Informative

NewEgg. The 4-socket and extension boards are below 1k together. And the low-to-average speed 16-core Opterons are about 300-400, so 350*8 + 700 (board+extension) = 3.5k. The other 1.5k are power, 1333MHz RAM, and the 1U container.
You can of course spend a lot more if you want the fastest Opterons, but the return goes down quickly; the 2.2GHz parts are fast, cheap 16-core CPUs.
Re:3k - 64cores + 54+GB of ram. by dch24 · 2011-12-03 06:28 · Score: 4, Informative

Just took a look. They have 4 choices for a 16-core Opteron listed:
- AMD Opteron 6262 HE Interlagos 1.6GHz Socket G34 85W 16-Core Server Processor OS6262VATGGGU - OEM $539.99
- AMD Opteron 6272 Interlagos 2.1GHz Socket G34 115W 16-Core Server Processor OS6272WKTGGGUWOF $539.99
- AMD Opteron 6274 Interlagos 2.2GHz Socket G34 115W 16-Core Server Processor OS6274WKTGGGUWOF $659.99 out of stock
- AMD Opteron 6274 Interlagos 2.2GHz Socket G34 115W 16-Core Server Processor OS6274WKTGGGU - OEM $659.99 out of stock
I'm going to keep looking, but I don't see any in the 300-400 range.
Re:3k - 64cores + 54+GB of ram. by Anonymous Coward · 2011-12-03 06:36 · Score: 0

all prices from newegg, i'm sure that you could get a better price somewhere else if you look...
tyan quad g34 board=$810
4x2gb ddr3 ram=$45
hard drive=$80
subtotal=$935
add proc...
4x8c 2.6ghz(32c)=1120...tot=$2055
4x12c 2.4ghz (48c)=1560...tot=$2495
4*16c 2.1ghz (64c)=2160...tot=$3095
so for 4k you could add more memory, add an ssd and a nice case and have 64 general purpose cores in a 1u box. buy three and you have a good general purpose mini-supercomputer with 192 cores.
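
To compare the three options on cost per core, here is a small sketch built only from the prices listed in this comment (board, RAM, hard drive, plus the total CPU cost for each option). The names are made up for illustration, and the extra memory, SSD, and case mentioned above are left out.

```python
# Shared parts, USD, as listed in the comment above.
BASE_PARTS_USD = {"Tyan quad G34 board": 810, "4x2GB DDR3 RAM": 45, "hard drive": 80}

# (label, sockets, cores per CPU, total cost of all CPUs in USD)
CPU_OPTIONS = [
    ("8-core 2.6GHz", 4, 8, 1120),
    ("12-core 2.4GHz", 4, 12, 1560),
    ("16-core 2.1GHz", 4, 16, 2160),
]


def build_summary():
    """Return (label, total cores, total USD, USD per core) per option."""
    base = sum(BASE_PARTS_USD.values())
    rows = []
    for label, sockets, cores, cpu_cost in CPU_OPTIONS:
        total_cores = sockets * cores
        total = base + cpu_cost
        rows.append((label, total_cores, total, total / total_cores))
    return rows


for label, cores, total, per_core in build_summary():
    print(f"{label}: {cores} cores, ${total} total, ${per_core:.2f} per core")
```

On these numbers the 64-core build is the cheapest per core, though that says nothing about per-core speed.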
Re:3k - 64cores + 54+GB of ram. by Jah-Wren+Ryel · 2011-12-03 09:53 · Score: 1

Just took a look. They have 4 choices for a 16-core Opteron listed:
AMD Opteron 6262 HE Interlagos ...
It's worse than that. The submitter is talking about doing single-precision floating point. Interlagos only has 1 floating point unit for every 2 integer cores. So, for his purposes, it's only 8 cores per cpu.

--
When information is power, privacy is freedom.
Re:3k - 64cores + 54+GB of ram. by Anonymous Coward · 2011-12-03 13:13 · Score: 0

I make 64-core AMD systems using Supermicro 1U cases all the time. In fact we made a cluster with 24 of these in it recently. They are nice systems if the fans inside are not the old fans. The old fans caused so much vibration that it would affect the hard drive spin speeds and cause read/write errors. Other than that, they are "cheap". Just run the fans on "balanced" speed to ensure there is not a lot of vibration for the drives. "Balanced" speed seems to be sufficient to cool all 4 CPUs while we burn them in, although you have to make sure your wiring is clean. Running the fans at either of the two faster speeds seems to be too much for the 1Us (although for the 2Us it's not an issue).
Re:3k - 64cores + 54+GB of ram. by mcrbids · 2011-12-03 18:48 · Score: 1

It's easy to get an embarrassing amount of processing power if you go with white box equipment. I have 8 8-way 1-U servers with 32 GB of RAM serving a heavy, database driven app. The amount of stuff that gets done with that relatively small value-priced cluster is impressive.

--
I have no problem with your religion until you decide it's reason to deprive others of the truth.
Re:3k - 64cores + 54+GB of ram. by Hitokiri Battousai · 2011-12-03 20:35 · Score: 1

Interlagos only has 1 floating point unit for every 2 integer cores.
Not really true. It has one 256 bit FP unit which can do AVX instructions, or it can be used as two 128 bit FP units.
There seems to be lots of confusion about the bulldozer architecture. Its real limitations are that it shares the L1 instruction cache, L2 cache, and decoder between (essentially) two cores, and that it has not-so-hot branch prediction (compared to Intel) combined with a longer pipeline resulting in lower IPC compared to Phenom. The FP performance has remained pretty much the same, except it can do FMAC and AVX now.
Re:3k - 64cores + 54+GB of ram. by jacquesm · 2011-12-04 03:45 · Score: 1

I'd love to see the full spec of those machines.

--
MP3 Search Engine

Need more information by pem · 2011-12-03 05:49 · Score: 3, Informative

If, for example, it's embarrassingly parallel DSP operations, you might try some dedicated DSP engines, or even some Xilinx FPGAs.

Re:Need more information by Anonymous Coward · 2011-12-03 07:26 · Score: 0

If it's a good fit to a GPU's capability, GPUs provide more FLOPS/$ than DSPs or FPGAs, although both DSPs and FPGAs are much more flexible and are therefore applicable to a wider range of problems without performance-wrecking contortions.
The *other* reason to consider DSPs or FPGAs is energy efficiency: both give substantially better FLOPS/W than the GPUs. However the OP suggested that they're not paying for power, so this is unlikely to be relevant here.
Re:Need more information by gmarsh · 2011-12-03 07:54 · Score: 1

A GPU will spank a dedicated DSP chip at just about everything, even the highest end TI's and TigerSHARCs. Both DSPs and GPUs are designed to haul data out of memory and do vector multiplication on it, but the GPU has a heck of a lot more of both memory bandwidth and processing grunt.
A big FPGA card, or FPGA array system like a Copacobana, might be quicker assuming I/O limitations aren't a problem for the algorithm to be run. But FPGA hardware for HPC isn't really a commodity so it's awfully expensive - you're looking at $5K+ for a big Virtex on a PCIe card. Plus buying FPGA tools and IP blocks, and getting the VHDL/Verilog written, will eat up a budget really quick.
But if this is being done in an academic environment and there's no looming deadline for this project, the FPGA method might be something you can get a grant for and throw a computer engineering grad student or co-op student at.

I built a similar system recently by Anonymous Coward · 2011-12-03 05:50 · Score: 1

I built a cluster the other day with 8 i7-2600K processors.

CPU - Intel i7-2600k = $300
Motherboard - P8H67-M PRO/CSM = $110
Ram - 4x 4GB corsair = $100
2u case + 400w ps = $90

My total cost was under 5k for 8 nodes, and it runs very very fast. Although my application likes CPUs more than GPUs. I also use a total of 16U of space, but that is not much of an extra cost.

Requirements by Anonymous Coward · 2011-12-03 05:53 · Score: 1

You really haven't given any details about your requirements.

This is a parallel problem, but will it run well on a GPU? If it's an inherently divergent task, then probably not (correct me if this isn't the case for other cards, I only have CUDA experience). If you want good answers, you'll need to describe your problem in more detail than just being embarrassingly parallel.

HP Moonshot by Anonymous Coward · 2011-12-03 05:57 · Score: 0

The recently-announced HP Moonshot architecture seems to meet most of your operational requirements. http://www.hp.com/hpinfo/newsroom/press/2011/111101xa.html
I haven't seen any pricing, though.

Re:HP Moonshot by Anonymous Coward · 2011-12-03 06:00 · Score: 0

Moonshot is fucking awesome, but they're not available to purchase yet and they'll probably end up outside the poster's price point by a large margin.

U of I by TheGreatOrangePeel · 2011-12-03 05:58 · Score: 4, Informative

Try getting in touch with the folks doing parallel processing research or the people with NCSA at U of I. I imagine one or both would have a few tips for you assuming they're open to doing that kind of collaboration.

http://parallel.illinois.edu/
http://www.ncsa.illinois.edu/

Re:U of I by sneakyimp · 2011-12-03 11:05 · Score: 1

MOD PARENT UP. Parallel processing is tricky stuff and performance depends on so many things -- not just the cost of a bunch of GPUs.

Blade servers by Anonymous Coward · 2011-12-03 06:00 · Score: 0

Buying a blade server on ebay would also be a great option. For around 5-6k you could get a nice blade with 10 nodes.

10 dual cpu nodes = 80 cores, if you have hyperthreading you can run 160 threads.

AMD by Anonymous Coward · 2011-12-03 06:02 · Score: 0

AMD cards are worth a look. Especially for embarrassingly parallel stuff they often deliver higher performance (see e.g. bitcoin).

Do you need it in a box? by Tom Goodale · 2011-12-03 06:03 · Score: 2

If it's really embarrassingly parallel, just run it on whatever CPUs you have hanging about or can scrounge cheaply. As long as the application is written portably they don't even need to be the same architecture or operating system, although that would help with deployment. The only reason to try to scrunch everything in one box would be if you have space limitations.

many AMD CPUs unless the GPU port is done already by Anonymous Coward · 2011-12-03 06:05 · Score: 2, Interesting

You can get 48 real AMD Magny-Cours CPU cores with full DP floating point support and ~64GB ECC memory in a box for under 10K(EUR!) from e.g. Tyan and supermicro.
I run my embarrassingly parallel stuff on that, and it works great. Depending on your application, 64 Bulldozer cores, which come in the same package for only slightly more money, may or may not perform better. I have not seen many real-world applications in which one GPU is actually faster than 12 to 16 server-class CPU cores.
Of course this depends a lot on whether you have done the GPU porting already or are just planning to, which you unfortunately don't state in your post.

Raspberry Pi by Anonymous Coward · 2011-12-03 06:06 · Score: 1

Just make a cluster of these little guys.
HDMI output, USB input.
Encode data into HDMI frames.
Have a decent board for decoding and to perform instructions from the HDMI data, then send more data back through USB.
You could probably even use the audio ports for even more throughput.
And if you get the ethernet version, that too.

I'm not even joking.
Well, partially. Might not be worthy of this case.
Plus, not out or even final.

But it sounds like an interesting idea anyway, so might as well throw it out there since it is pretty related.

Re:Raspberry Pi by Anonymous Coward · 2011-12-03 07:59 · Score: 0

The HDMI thing seems a weird way to accomplish output... I don't know of any cheap massively-multiport HDMI capture board, anyway. Is there a reason you suppose this to be cheaper than using USB for both input and output, and adding however many % more Pis it takes to make up the difference?
Re:Raspberry Pi by Anonymous Coward · 2011-12-03 09:16 · Score: 0

As cool as the Pi is, I don't think it's a very good match for this kind of requirement. Its processor is slow (even by the standards of ARM processors, which don't scale to the high speeds of desktop processors), and its GPU has no public documentation so cannot be used for anything except OpenGL ES calls (which is to say, it is very hard to use for GPGPU operations).

Definitely GPU. by pla · 2011-12-03 06:06 · Score: 4, Interesting

Others have pointed it out, but if you can run this on a GPU, you don't need to look any further than that.

Specifically, check out some of the BitCoin mining rigs people have built, like 4x Radeon 6990s in a single box. For comparison, a single 6990 easily beats a top-of-the-line modern CPU by a factor of 50 (as in, not 50%, but 5000%). You can build such a box for well under $5k.

MPI+AMD vs GPU by Anonymous Coward · 2011-12-03 06:10 · Score: 1

If you're not a GPU programmer the alternative is a 48-core AMD server (64-core systems are notoriously slow and have half the floating point units) with MPI. This is the solution that many academics are taking.

Also if you're lucky you might be able to get your hands on Intel's 100-core Atom processor, they're not for sale AFAIK but I believe you can apply to get one for free.

So? by M0j0_j0j0 · 2011-12-03 06:11 · Score: 1

You are mining bitcoins too?

commodity HPC depends on your code by Haven · 2011-12-03 06:11 · Score: 5, Informative

In HPC we call it "pleasantly parallel," nothing is embarrassing about it! =]

If your code:
-scales to OpenCL/CUDA easily.
-does not require high concurrent memory transfers
-is fault tolerant (ie a failed card doesn't hose a whole day/week of runs)
-can use single precision flops

Then you can use commodity hardware like the gtx series cards. I'd go with the gtx 560ti (GF114 gpu).

Make nodes with:
quad core processors (amd or intel)
whatever ram is needed (8GB minimum)
2 x gtx560ti (448) run in SLI (or the 560ti dual from EVGA)

Basically a scaled down Cray XK6 node. http://www.cray.com/Assets/PDF/products/xk/CrayXK6Brochure.pdf

It all depends on your code.

rent a botnet by Lazy Jones · 2011-12-03 06:15 · Score: 3, Funny

earlier thread ...

--
"I love my job, but I hate talking to people like you" (Freddie Mercury)

Re:rent a botnet by Rockoon · 2011-12-03 09:13 · Score: 1

..or create one.

Throw up a web site, advertise it on over-clocker forums and what-not, and hold a competition..

A race with $15000 in prize money. The runners are scored on how many "work units" they complete. Work units are distributed randomly and multiple people receive the same units so there is result verification. 1st place gets $5000, 2nd place gets $4000, 3rd place gets $3000, 4th place gets $2000, and 5th place gets $1000.
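The "multiple people receive the same units so there is result verification" part could look roughly like the sketch below. The post gives no details, so the strict-majority rule, the data layout, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def verify_unit(results):
    """Return the majority result for one work unit, or None without a strict majority.

    `results` maps runner name -> the result that runner reported (hypothetical layout).
    """
    if not results:
        return None
    value, votes = Counter(results.values()).most_common(1)[0]
    return value if votes * 2 > len(results) else None


def score_runners(submissions):
    """Count, per runner, the work units whose result matched the verified value.

    `submissions` maps unit id -> {runner name: reported result}.
    """
    scores = Counter()
    for results in submissions.values():
        accepted = verify_unit(results)
        if accepted is None:
            continue  # no agreement: nobody is credited, the unit would be re-issued
        for runner, value in results.items():
            if value == accepted:
                scores[runner] += 1
    return scores
```

Sending each unit to an odd number of runners keeps ties rare; a unit with no majority simply goes back into the pool.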

--
"His name was James Damore."

How does it parallelize? by darkjedi521 · 2011-12-03 06:15 · Score: 4, Informative

How does the app parallelize? Is each process/thread dependent on every other process/thread or is it a 1000 processes flying in close formation that all need to complete at the same time but don't interact with each other? How embarrassingly parallel is embarrassingly parallel? Is that 512MB requirement per process or the sum of all processes?

GPUs might not be the right solution for this. GPUs are excellent for parallelizing some operations but not others. Have you done any benchmarks? Throwing lots of CPU at the problem may be the right solution depending on the algorithms used and how well they can be adapted for a GPU, if they can be adapted for a GPU.

For the $10K-$15K USD range, I'd look at Supermicro's offerings. You have options ranging from dual socket 16 core AMD systems with 2 Teslas to quad socket AMD systems to quad socket Intel solutions to dual socket Intel systems with 4 Tesla cards.

Do some testing of your code in various configurations before blindly throwing hardware at the problem. I support researchers who run molecular dynamics simulations. I've put together some GPU systems, and after testing we discovered that, for the calculations they are doing, the portions of their code that could be offloaded to the GPU accounted for at most 10% of the execution time, with the remainder being operations that the software packages could only do on the CPU.
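A rough way to see why a 10% offloadable portion is a poor fit is Amdahl's law: if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below just evaluates that formula; the GPU speedup factors passed in are hypothetical, only the 10% comes from the post.

```python
def amdahl_speedup(offload_fraction, accel_factor):
    """Overall speedup when `offload_fraction` of the runtime runs `accel_factor` times faster."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if accel_factor <= 0:
        raise ValueError("accel_factor must be positive")
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / accel_factor)


if __name__ == "__main__":
    # Hypothetical GPU speedups for the offloaded 10% of the runtime.
    for factor in (10, 50, float("inf")):
        print(f"GPU {factor}x faster -> overall {amdahl_speedup(0.10, factor):.3f}x")
```

Even an infinitely fast GPU cannot push the overall speedup past 1 / 0.9, or about 1.11x, when 90% of the time stays on the CPU.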

Re:How does it parallelize? by Jimbookis · 2011-12-03 11:01 · Score: 2

Right, do the computations actually need floating point at all or can you do fixed point maths (hence just use the integer units in the CPU) instead? Plenty of DSP oriented stuff certainly doesn't need floats. If you have integer/fixed point maths only then an AMD CPU might be a ripper for the money.
Re:How does it parallelize? by sneakyimp · 2011-12-03 11:08 · Score: 1

Finally someone talking sense. You go darkjedi.
Re:How does it parallelize? by Anonymous Coward · 2011-12-03 17:26 · Score: 0

I was going to point in the same direction of Supermicro.
Not a clue about the price though.

Passive cooled GPU by GoRK · 2011-12-03 06:17 · Score: 1

Don't use high end GTX cards; twice as many lower end passively-cooled GPU cards will provide more than the equivalent performance with far less cost and failure rate. If your application really benefits more from additional threads vs single thread execution speed, this is the way to go. Most GPGPU clusters that aren't built using Tesla use this approach.

need a better characterization of the workload by Surt · 2011-12-03 06:17 · Score: 1

big FP bandwidth on a tesla doesn't do much for you if you only need integer execution. Maybe you'd be better off with a 4-cpu xeon box, or a bulldozer, or a 64-core arm. Really, you want to find a way to benchmark your particular software on a variety of potential cpu targets, and then do a price comparison.

--
"Who is the Journal of Quantum Physics going to believe?" --Stephen Hawking

Beowulf cluster! by TWX · 2011-12-03 06:17 · Score: 5, Funny

Why not a beowulf clust---

I'm sorry, I just can't. I searched the ~35 posts, browsing at -1, and no reference to a Beowulf cluster anywhere, let alone Natalie Portman or Grits.

Slashdot! You're slipping! I lament the days when even our trolls were amusing and somewhat topical to the discussion at hand! We've fallen so far!

--
Do not look into laser with remaining eye.

Re:Beowulf cluster! by mgblst · 2011-12-03 21:24 · Score: 1

Beowolf cluster? Is that some new fangled grid computing system?
So yeah, the guy in the 665546 id number tells us all about the old days. Come on!
Re:Beowulf cluster! by Anonymous Coward · 2011-12-03 22:24 · Score: 0

I for one welcome our new Beowulf Clust overlords?
Re:Beowulf cluster! by nyquist · 2011-12-04 02:43 · Score: 1

Beowolf cluster? Is that some new fangled grid computing system?
So yeah, the guy in the 665546 id number tells us all about the old days. Come on!
indeed
Re:Beowulf cluster! by TWX · 2011-12-04 05:41 · Score: 1

I wanted a better handle.

--
Do not look into laser with remaining eye.

But what next? by Anonymous Coward · 2011-12-03 06:23 · Score: 0

This makes sense as i have a two year old that uses the devices. they are fairly easy to use and in the long run may be cheaper. however, with new updates to the program coming out yearly, most of these devices will be outdated very quickly. So after that, then what?

Orlando Web Design By Elijah Clark

Re:But what next? by sneakyimp · 2011-12-03 11:09 · Score: 1

I just noticed there's no "spam" option for modding posts. /. should add that.

Beowulf clusters by G3ckoG33k · 2011-12-03 06:25 · Score: 3, Informative

Yes, I haven't seen any references here or anywhere else either lately.

From http://en.wikipedia.org/wiki/Beowulf_cluster: "The name Beowulf originally referred to a specific computer built in 1994 by Thomas Sterling and Donald Becker at NASA. [...] There is no particular piece of software that defines a cluster as a Beowulf. Beowulf clusters normally run a Unix-like operating system, such as BSD, Linux, or Solaris, normally built from free and open source software. Commonly used parallel processing libraries include Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). Both of these permit the programmer to divide a task among a group of networked computers, and collect the results of processing. Examples of MPI software include OpenMPI or MPICH. There are additional MPI implementations available. Beowulf systems are now deployed worldwide, chiefly in support of scientific computing."

Apparently, Beowulf clusters may be around, it is just that they don't go by that name any longer. I wonder what would be the latest buzzword for essentially the same thing?

Re:Beowulf clusters by Anonymous Coward · 2011-12-03 06:34 · Score: 0

Somewhere between MapReduce (most literal) and The Cloud (most useless).
Re:Beowulf clusters by westyvw · 2011-12-03 06:41 · Score: 1

Do they just call it nothing nowadays, because it is just expected to be some variant, or because it is so mainstream?

what kind of embarrassingly parallel? by Anonymous Coward · 2011-12-03 06:26 · Score: 0

normally, the phrase means "lots of serial jobs", which have an input configuration and a result, and nothing in between (particularly no inter-job sharing). gp-gpu is suitable for a somewhat different sort of workload, basically single-instruction-multiple-threads. in short, are the threads working in lockstep?
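The "lots of serial jobs" reading can be sketched in a few lines: each job takes an input configuration, returns a result, and shares nothing with the others. The job body below (a sum of squares) is a stand-in, and the names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(config):
    """One serial job: an input configuration in, a result out, nothing shared."""
    start, stop = config
    return sum(i * i for i in range(start, stop))


def run_all(configs, workers=4):
    """Run every job independently and collect the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, configs))


if __name__ == "__main__":
    print(run_all([(0, 1000), (1000, 2000), (2000, 3000)]))
```

CPython threads will not run pure-Python arithmetic in parallel, so this only shows the structure; concurrent.futures.ProcessPoolExecutor offers the same map interface for CPU-bound jobs. Nothing here runs in lockstep, which is the difference from the single-instruction-multiple-threads model.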

Why, mini-cluster, of course! by Noryungi · 2011-12-03 06:34 · Score: 2

http://www.mini-itx.com/projects/cluster/?p

The example at the URL above is quite old, but a good starting point. Just use a dozen cheap mini-itx cards with -- let's say -- Intel Core i5 and voilà! Probably the cheapest way to go, and also much easier to program than using CUDA and nVidia. Hook the whole thing into a gigabit switch.

I'll let the experts debate the best CPU for that job, but AMD should also have some nice products on offer.

--
The right to offend is far more important than the right not to be offended. (Rowan Atkinson)

200hr by Anonymous Coward · 2011-12-03 06:39 · Score: 0

Hi Guys,

I get paid 200/hr by the government to come up with an architecture for parallel processing. Rather than taking time reading through droll literature, I need to go traveling to my second house in the Caymans. I wondered if I could ask slashdot and save myself the trouble.

Ttyl
Tax Evader

We built a ~9.1 TFLOPS system for $10k last year. by Arakageeta · 2011-12-03 06:41 · Score: 4, Interesting

What does SLI give you in CUDA? The newer GeForce cards support direct GPU-to-GPU memory copies, assuming they are on the same PCIe bus (NUMA systems might have multiple PCIe buses).

My research group built this 12-core/8-GPU system last year for about $10k: http://tinyurl.com/7ecqjfj

The system has a theoretical peak ~9.1 TFLOPS, single precision (simultaneously maxing out all CPUs and GPUs). I wish the GPUs had more individual memory (~1.25GB each), but we would have quickly broken our budget had we gone for Tesla-grade cards.

Yes you would by Anonymous Coward · 2011-12-03 06:44 · Score: 0

if it could save you

Don't buy GTX's by MetricT · 2011-12-03 06:56 · Score: 4, Informative

We have several racks full, purchased because "they're cheaper than Teslas".

Except the Teslas have, as pointed out, ECC memory and better thermal management, and the GTXs have several useful features (like the GPU load level in nvidia-smi) disabled.

The former cause the compute nodes to crash regularly. What you save on cards, you'll lose in salary for someone to nursemaid them. The latter makes it harder to integrate into a scheduler environment (we're using Torque).

Yes, this is primarily marketing discrimination, and there probably isn't $10 worth of real difference between the two. I hope the marketing droid who thought that scheme up burns. It's a total aggravation, but paying for Teslas is worthwhile.

Re:Don't buy GTX's by Khyber · 2011-12-03 07:12 · Score: 2

Plenty of hacks to enable GPU load level. Probably several already out there as-is. The ECC memory is a different beast, though.

--
Still waiting on Serviscope_minor to wake up to fucking reality and realize that Jessica Price isn't going to fuck him.
Re:Don't buy GTX's by Anonymous Coward · 2011-12-03 14:21 · Score: 0

Really a whole other person to nursemaid them?
Surely some power strips with some monitoring software to determine if nodes drop off line and re power them. Seems like about half a days work for a slow person. Is that just with nvidia-smi? or is it actually locked out of the hardware and completely inaccessible from the api.
I know some of the forks for grid (sge) are working with gpu now, and I hadn't heard talk of that being a problem.

Re:PS3 -- sure, if you like your CPUs from 2006. by Arakageeta · 2011-12-03 07:03 · Score: 1

I think the time of the PS3 clusters has passed. The Cell processor was released back in 2006! IBM released a few upgraded processors, mostly improving double-precision performance, but those systems are really cost prohibitive.

Assuming you can deal with PCIe latency, GPUs are the way to go.

Embarrassingly parallel problems... by MickLinux · 2011-12-03 07:13 · Score: 1

... do not require embarrassingly parallel solutions.

They require math and algorithm design to make the solution *nonembarrassing*.

To give you an example: a typical FFT can, with easy math, cut its number of calculations by a factor of four. With a little care, you can halve the number of calculations again.

Start with the math. Then look at the solution.

Last of all, consider cloudware. It's out there. Let's see... on my android, I have "sourceLair". Yeah, that's one.

Once you have the cloudware solution in hand, *then* you can start thinking about spending money on a kind of parallel solution (such as what Google uses).

--
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's

Re:Embarrassingly parallel problems... by ceoyoyo · 2011-12-03 07:46 · Score: 1

Ah, generalizations. Of course, you have no idea what he's working on.
Re:Embarrassingly parallel problems... by MickLinux · 2011-12-03 08:44 · Score: 2

Yes I do. He's extending the calculations begun by Lewis Carroll in the imaginary space (through the looking glass), to see the effects as the ultimate limit increases.
What's
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+
1+1+1+1+1+1+1+1+1+1+1+1
As I said, embarrassingly parallel. Get 7 computers working on it in parallel, with 1 for backup:
What's 1+1+1+1+1+1 (after some calculation, 6)
So that all is 42.
the ultimate answer is
1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+ 1+1+1+1+1+1+
1+1+1+1+1+1+1+1+1+1+1+1=42.
I should note that this mathematical calculation was also attempted by Douglas Adams, using genetic algorithms.

--
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's
Re:Embarrassingly parallel problems... by Anonymous Coward · 2011-12-03 09:16 · Score: 0

You're an idiot.
Re:Embarrassingly parallel problems... by ceoyoyo · 2011-12-03 14:37 · Score: 1

And your Fourier transform algorithm to solve it faster is?
Re:Embarrassingly parallel problems... by Anonymous Coward · 2011-12-03 16:10 · Score: 0

If I had mod points, I'd mod this funny. I wish I could mod it funny twice; that "genetic algorithms" joke is worth +1 all by itself.

memory bus bottlenecks: 1 machine? by bbulkow · 2011-12-03 07:19 · Score: 1

Even though the application is parallel, your bottleneck can easily be the memory bus. Adding Tesla cores won't solve memory bus issues. For a number of apps, stacking up Intel i5 quad-core machines increases memory bandwidth on the cheap. Ten $500 machines, or five $1000 machines with a cheap NVidia GPU, may very well outperform anything that can be put in a single "box" - because there is 10x or 5x more memory bandwidth. That means you need to write not just parallel code, but multi-machine parallel code - in which case you should get in bed with a computation fabric like Hadoop or one of a million others (raw OpenMP is another example, if you're a GPU hacker type).

Re:memory bus bottlenecks: 1 machine? by bbulkow · 2011-12-03 07:23 · Score: 1

Replying to self: our Citrusleaf database does amazing parallel operations on Sandy Bridge i5 (2400) machines. Single socket machines have the best interrupt processing and lowest memory latency. Going to Xeon architectures is, in price/performance terms, a HUGE decrease. There was a great post somewhere about $/speed in CPUs, and of course the true consumer grade stuff (i5 and Phenom II) was 10x better than "datacenter" grade machines. This is especially true for Supermicro. As much as I like them, you can save 4x money by going Asus and using a physically larger box - if you're not going into a data center. Another cost savings is running the project at home - you'll get more bandwidth for $50/month than you'll ever get from a data center.

Immerse it by hduff · 2011-12-03 07:38 · Score: 1

Go old school and immerse the entire machine in a tub of mineral oil?

--
"I believe in Karma. That means I can do bad things to people all day long and I assume they deserve it." : Dogbert

Re:Immerse it by catmistake · 2011-12-03 15:55 · Score: 1

Go old school and immerse the entire machine in a tub of mineral oil?
The best stuff to use is synthetic plasma (as in blood plasma). It's rather expensive though. [citation needed]

--
The Admin and the Engineer

How about by koan · 2011-12-03 07:45 · Score: 1

Multiple GTX card servers in a cluster? So 4 GTX GPUs to a box, and several boxes to the cluster.

--
"If any question why we died, Tell them because our fathers lied."

Try looking at the cheap end... by JoeMerchant · 2011-12-03 07:46 · Score: 2

I've played this parallel cost analysis game several times, and if you don't need high bandwidth communication between the threads, I usually come up with the Google solution: a big farm of cheap machines. AMD chips start looking good compared to Intel because you're not after a single thread finishing as fast as possible, you're after as many FLOPS per $ as you can get. We even did the analysis for an extreme Apple fanboi: MacPros vs MacMinis back in 2007, and a stack of 25 minis came out way more powerful than the 3 or 4 Pros you could get for the same money.

Re:Try looking at the cheap end... by chiph · 2011-12-03 11:40 · Score: 1

Mac Mini Server gets you a quad-core Intel i7 (double that number of threads if you enable hyper threading) for $999. Turn them on their side and you can stack 11 of them in the width of a standard 19" rack (will be 6U high or so). That's 44 (or perhaps 88) cores for under $11,000.
Other pluses: 900W power consumption when running at 100% utilization, idle is much much lower. Comes with dual hard drives that can be mirrored for reliability. Gigabit ethernet and 4 USB ports are available. When your work with them is done, you can repurpose them as ordinary desktop machines - just add monitor, keyboard + mouse.
Re:Try looking at the cheap end... by jarlsberg71 · 2011-12-05 04:46 · Score: 1

I wonder if you could use the on-board thunderbolt to get good communication between nodes?

--
E8B8B

pick up a Sun T5140 on ebay by Anonymous Coward · 2011-12-03 08:00 · Score: 1

Two 8-core processors with 8 threads per core == 128 simultaneous threads.

You could get a new Sun T3-1 for a little more. It would be roughly the same performance (it only has one physical processor, but it's 16 cores * 8 threads per core, so still 128 total).

Re:pick up a Sun T5140 on ebay by OrangeTide · 2011-12-03 09:48 · Score: 1

$19k doesn't sound like much of a bargain.

--
“Common sense is not so common.” — Voltaire

100 cores per chip on Tilera by Anonymous Coward · 2011-12-03 08:04 · Score: 0

Tilera has 100 core chips. If you don't need floating point (you never said what you were doing) they're a great choice.

How to setup a Beowulf cluster by cultiv8 · 2011-12-03 08:06 · Score: 1

Very informative, kinda technical: http://www.tweak3d.net/articles/howtolanparty/

--
sysadmins and parents of newborns get the same amount of sleep.

No need for the high-end, little need for doubles by Anonymous Coward · 2011-12-03 08:11 · Score: 0

Most (but not all) double-precision work can be handled by single precision pairs. Basically you keep numbers in the form a+b, where a is "big" and b is "little" and handle them accordingly. The slowdown is less than you'd think, and often gives better performance than the kind of hardware dp that gpus offer. There's a bunch of libraries and papers out there if you google them.
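The a+b idea is usually called "double-single" or "float-float" arithmetic. The post does not name a library, so the following is only an illustration of the addition step, emulating single precision in Python by rounding through the struct module; the function names are made up for this sketch.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's error-free addition of two singles: s is the rounded sum, e the rounding error."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def ds_add(x, y):
    """Add two (big, little) pairs, each standing for the unevaluated sum big + little."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    return two_sum(s, e)  # renormalise so that little stays tiny next to big


if __name__ == "__main__":
    tiny = 2.0 ** -30
    print("plain single:", f32(1.0 + tiny))                      # the small term is lost
    print("single pair :", ds_add((1.0, 0.0), (tiny, 0.0)))      # it survives in 'little'
```

Every intermediate value passes through f32, so only single-precision operations are assumed; on a GPU the same sequence would be written with native float variables.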

ECC is nice, but can be avoided in a number of applications by simply doing regular checkpointing and restarting failed computations. Again, this only works for certain applications, but can save a whole heap of $ for the ones that can.
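A minimal sketch of the checkpoint-and-restart approach, assuming the work splits into independent items whose results can be stored as JSON. The file layout and the function name are hypothetical, not taken from any particular system.

```python
import json
import os
import tempfile


def run_with_checkpoints(work_items, compute, path):
    """Process `work_items` in order, saving all results to `path` after each one.

    If `path` already exists, items recorded there are skipped, so a run that
    died part-way through can simply be started again.
    """
    done = {}
    if os.path.exists(path):
        with open(path) as fh:
            done = json.load(fh)
    for item in work_items:
        key = str(item)  # JSON object keys must be strings
        if key in done:
            continue
        done[key] = compute(item)
        # Write to a temporary file and rename it into place, so a crash
        # during the write cannot leave a half-written checkpoint behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp, path)
    return done
```

Checkpointing after every item is the simplest policy; for cheap items it would make sense to save only every N items and accept redoing a little work after a failure. This recovers from crashes, not from silent bit flips, which is where running a computation twice and comparing results would come in.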

If your target workload fits in both of these groups, you can assemble a cluster using high-end AMD gamer cards that will thrash any Tesla-based solution on performance/$ by a huge amount.

Speaking anonymously because of my employer, but I have built such systems for these kind of applications.

Re:PS3 -- sure, if you like your CPUs from 2006. by SuricouRaven · 2011-12-03 08:14 · Score: 1

The Cell, at the time of release, was mind-blowingly fast. Fastest chip around. But it didn't advance very far, and more conventional processors have now overtaken it.

Distributed.net by Anonymous Coward · 2011-12-03 08:35 · Score: 0

If you imitate what distributed.net accomplished (and folding@home or others are currently accomplishing) just make a creative website detailing what you're doing and why people should give you their unused GPU/CPU cycles.

Otherwise 4x GTX 580 (as mentioned already) will destroy what you throw at it.

I think the fact that you mentioned Tesla and GTX in this article covers the unsaid statement of "we're using CUDA".

Or browse top500.org for a rental shopping list.

Do your homework before going GPU by PatDev · 2011-12-03 08:36 · Score: 3, Informative

As someone who has done some GPU programming (specifically CUDA), be aware that there is more to the GPU parallelism model than just "lots of threads". Many embarrassingly parallel problems translate very poorly to CUDA. The primary things to consider are:

1. GPUs are *data parallel*. This means that you need to have an algorithm in which each and every thread will be executing the same instruction at the same time (just on different data). For a cheap way to evaluate it, if you can't speed up your program by vectorizing it then the GPU won't help. Of course, you can have divergent "threads" on GPUs, but as soon as you do you've lost all benefit to using a GPU, and have essentially turned your GPU into an expensive but slow computer.

2. Moving data onto or off of the GPU is *slow*. So if you can leave all the data on the GPUs and none of the GPUs need to communicate with each other, then this will work well. If the threads need to frequently globally sync up, you're going to be in trouble.
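Point 1 can be illustrated with a toy cost model. It is a deliberate simplification (real hardware scheduling has more detail), and the cycle counts used are hypothetical: a group of lockstep threads pays for every branch path that at least one of its threads takes.

```python
def warp_cost(branch_taken, cost_if, cost_else):
    """Toy cost for one group of lockstep threads (a "warp" in CUDA terms).

    `branch_taken` holds one boolean per thread. Paths run one after another,
    with the threads that did not take a path sitting idle, so the group pays
    for each path that any of its threads needs.
    """
    cost = 0
    if any(branch_taken):
        cost += cost_if
    if not all(branch_taken):
        cost += cost_else
    return cost


if __name__ == "__main__":
    uniform = [True] * 32
    divergent = [i % 2 == 0 for i in range(32)]
    # Hypothetical costs: 10 cycles for the "if" path, 30 for the "else" path.
    print("uniform  :", warp_cost(uniform, 10, 30))
    print("divergent:", warp_cost(divergent, 10, 30))
```

In this model a single thread taking the other path is enough to make the whole group pay for both, which is why sorting or grouping work so that neighbouring threads take the same path tends to matter.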

That said, if you have the right kind of data parallel problem, GPUs will blow everything else out of the water at the same price point.

Re:Do your homework before going GPU by aminorex · 2011-12-04 07:16 · Score: 1

My experience (which is vast) has been that every problem large enough to make it worthwhile to parallelize can be made, by hook or by crook, to be the right kind of data parallel problem. The critical question is whether the cost of making it so is excessive. That cost has two forms: Expert man-hours and time-to-delivery. Given a sufficient budget, I've never seen a problem I could not solve. That may simply mean that the number of available problems is large enough so that I never ran out of solvable ones, and past performance is no guarantee of future performance, but it is the best available factual indicator.

--
-I like my women like I like my tea: green-

Arm cluster by Anonymous Coward · 2011-12-03 08:42 · Score: 0

www.gumstix.com/store/product_info.php?products_id=247

Overo omap3s rock

Re:PS3 -- sure, if you like your CPUs from 2006. by Rockoon · 2011-12-03 08:49 · Score: 1

They just claimed that it was mind-blowingly fast.

In theory there is no difference between theory and practice, but in practice there is.

--
"His name was James Damore."

Amazon WILL be cheaper by Anonymous Coward · 2011-12-03 09:17 · Score: 0

If this is a one time thing, I assure you that EC2 is more cost effective than any hardware solution you can invent. If it's ongoing, it's still probably cheaper. Teslas work really well, but they're a bitch to code for, and they don't fix your I/O problems. Pair it with a Fusion-io perhaps?

It really depends on why you need many threads by Anonymous Coward · 2011-12-03 09:40 · Score: 0

Where you are essentially solving multiple, simple equations over and over, we use 6 GPUs in a case. To a large extent these equations can be calculated separately (well, the boundaries are well defined). Vector math, and a lot of it, runs extremely parallel on GPUs.

If you're building a glorified web server, keep in mind that you will still be I/O bound. And just because you're on a gigabit LAN doesn't mean you will be pumping data out that fast.

If you are building a database engine, again, I/O limited. Even with the fastest of drives, you still have to manage the drives, and with databases there are all kinds of rules the engine has to enforce, etc.

If you are trying to build a multi-user system, you really want to separate the memory buses per CPU to the extent possible. I'm not sure where Opteron and Xeon technology has gotten to, but I'd be looking at Tyan motherboards to see what they support.

Re:PS3 -- sure, if you like your CPUs from 2006. by Anonymous Coward · 2011-12-03 09:43 · Score: 0

Only a few problems with the Cell: you needed IBM's development libraries (only a dozen or so anal probes to get that); the 'production Cell processors' had 6 'cell engines', while the PS3's has 5 (basically, errors and throwbacks from the fab); also, they used one cell processor to lock down the PS3 so that you couldn't use the GPU for games or anything IBM or Sony didn't like. Oh, and the basic speed of the chip was like a typical Power processor, which couldn't deal very well with out-of-order execution (read: slow). As an example: I had a 1.8 GHz Pentium 4, and it was about 7 times as fast as this processor without the 'extra cell' engines running. You would need IBM's libraries to get anything out of the Cell engines. It was also important to restructure your code to deal with out-of-order execution. Best of luck with that.

TI DSP cards? by Mr+Z · 2011-12-03 09:45 · Score: 2

There are some high-powered PCI cards filled with TI DSPs that you can get. Here's an article describing some of them. In terms of power efficiency per unit of work, the DSPs blow the doors off the main processor and the GPUs. Each DSP on the chip can do 16 single precision or 4 double precision floating point operations per cycle, at around 1GHz, and they're programmable in C/C++.

Relevant quote:

Kenneth Nesteroff, business development manager for multicore processors at TI's DSP Systems unit, tells El Reg that in the first quarter, Advantech will come out with a full-length PCI-Express card that will deliver around 1 teraflops of single precision performance at a cost of around $2,000 and within a 110 watt thermal envelope.

Buy 5 of these and you're only at 550W, $10,000 and 5 TFLOPs.
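
Spelled out as a quick calculation, using only the per-card figures from the quote (these are the vendor's projected peak numbers, not benchmarks, and the variable names are just for illustration):

```python
# Per-card figures taken from the quote above; nothing here is measured.
CARD_TFLOPS = 1.0        # single precision peak, per card
CARD_PRICE_USD = 2000
CARD_WATTS = 110
CARDS = 5

total_tflops = CARDS * CARD_TFLOPS
total_price = CARDS * CARD_PRICE_USD
total_watts = CARDS * CARD_WATTS

# Derived ratios, handy for comparing against other options in the thread.
gflops_per_watt = total_tflops * 1000 / total_watts
usd_per_gflops = total_price / (total_tflops * 1000)

print(total_watts, "W,", total_price, "USD,", total_tflops, "TFLOPS")
print(round(gflops_per_watt, 1), "GFLOPS/W,", usd_per_gflops, "USD/GFLOPS")
```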

--
Program Intellivision!

Re:TI DSP cards? by Mr+Z · 2011-12-03 09:58 · Score: 1

I should add also that depending on the nature of your task, it may perform closer or further from the "peak" performance on the DSP vs. on a GPU. So a single Tesla 1 TFLOP card may not perform the same as a single DSP 1 TFLOP card.

--
Program Intellivision!
Re:TI DSP cards? by Anonymous Coward · 2011-12-03 18:53 · Score: 0

He is talking of threads of execution and this might not mean what he thinks. It is possible he doesn't know what he wants because he has not spoken in any detail of the operations that are important to him such as floating point, integer, logical, etc.
The questioner is confused about different levels of parallelism (GPU versus CPU). Both can be parallel (GPU currently more than CPU) but do not help in acceleration of the same applications. For example, a GPU will do little to accelerate a traditional web server or relational database, whereas adding more cores to a CPU will.
There is no definite valid answer until he gives more details about his algorithm and its needs. From his description it sounds as though he does not need any GPU acceleration but something closer to a traditional server. As it stands there is a good chance he'll end up wasting money on a GPU that he doesn't need.
Re:TI DSP cards? by Mr+Z · 2011-12-03 19:30 · Score: 1

Well, like I said in my followup "reply to self", it really does depend on the nature of the task. We don't have enough information to go on. The TI DSP cards do fill an interesting niche, though, and are a nice counterpoint to the Tesla cards in many applications.
Really, you need to just get some demo tools for a couple platforms, do some benchmarks, and see how each platform feels. You'd be silly to drop $10,000 - $15,000 on a server without first running some benchmarks on a smaller version of what you intend to buy, as well as collecting some data on how the results on the small system scale to the proposed larger system.
Proposing a solution and working backwards only truly works for constructing those initial benchmarks.

--
Program Intellivision!

GTX really is less reliable by SoftwareArtist · 2011-12-03 09:48 · Score: 1

We've done a lot of testing of different GPUs to look at basic reliability: things like writing data to memory, waiting a while, and reading it back to see if any bits have spontaneously flipped. The conclusion is that on GTX boards, this really does happen. If you're doing production work where consistently getting the right result matters, you should stay away from them. On the other hand, we've never seen any memory errors on Tesla boards, even with ECC disabled. This might just be because Nvidia tends to clock their Teslas a little lower. Or maybe they test chips as they come off the assembly line, and ones with marginal results get sold as GTXs. But one way or another, there really is a difference.
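
For anyone unfamiliar with this kind of test, the idea is simple enough to sketch. The version below only exercises host RAM from plain Python, so it is an illustration of the write/wait/read-back pattern and not the actual test harness; a real run would allocate the buffer in GPU memory and soak for much longer. The pattern byte, buffer size, and function names are arbitrary choices.

```python
import time

PATTERN = 0xAA  # alternating bits; an arbitrary choice for this sketch

def fill(size_bytes):
    """Write a known pattern into a fresh buffer."""
    return bytearray([PATTERN]) * size_bytes

def count_flipped_bits(buf):
    """Count bits in buf that no longer match the pattern."""
    return sum(bin(byte ^ PATTERN).count("1") for byte in buf)

buf = fill(1 << 20)   # 1 MiB of host RAM (a real test would use GPU memory)
time.sleep(0.1)       # a real soak test would wait far longer
flips = count_flipped_bits(buf)
print("flipped bits:", flips)
```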

--
"I'm too busy to research this and form an educated opinion, but I do have time to tell everyone my uninformed opinion."

Yet another "slashdot do my homework" question by Anonymous Coward · 2011-12-03 09:55 · Score: 0

There just was an actual competition, where various student teams worked out how to build mini-supers for this sort of workload in a restricted power envelope. Go do your homework and look at what they did, hmkay.

Re:memory bus bottlenecks: 1 machine? by MickLinux · 2011-12-03 10:04 · Score: 1

Well, yes. If I really wanted to be cool about it, I might consider going to Radio Shack, and buying an Arduino. Then use the 4 outputs, plus a couple shift registers, to make something that could program an 80c51XA. Then design my algorithm to go on those, plugged together such that they'd outperform even an Nvidia.

Or, even cooler, I might program the 80c51XAs in parallel, one being the calculations chip, and one handling all the i/o from one unit to the other. Then I could write a massively parallel program that downloaded to the one, and ran on all.

Or I might just sit here on slashdot and imagine doing something cool.

--
Correct Horse Battery Staple: 72 bits of entropy. Enter "Correct H" into google. When it generates the phrase, that's

Cluster of Zotac AMD E-350's? by Anonymous Coward · 2011-12-03 10:22 · Score: 0

How about 40 of these? http://www.zotacusa.com/zbox-amd-e-350-apu-all-in-one-zbox-ad10-plus-u.html
This would give you 80x 1.6 CPU cores, 80GB, and 3200 GPU pipelines @ 500MHz.

I did this a little bit ago for fun. by RazorRaiser · 2011-12-03 10:28 · Score: 1

For a portable case use a plastic footlocker, the kind with wheels and a hinged lid. This hinged lid was key for me as it allowed me to attach a keyboard, trackball mouse, and small monitor.

On the inside I have two ATX motherboards with dual core Athlon 64s, though I could have used anything had I felt like spending the money. The worker node has two graphics cards and an extra NIC for regular network traffic (the onboard gigabit NIC is used for message passing). The head node has an extra WiFi NIC as well for talking with the outside world. There are then two switches, one for each internal network, and two hard drives off the head node. The worker node boots off a USB stick. I found Ubuntu installed from a live CD provides a nice, small OS.

It's a little cramped (the top sides of the motherboards face each other), but there's enough room for the power supplies to divide the space down the middle, with the switches and hard drives mounted above that and opposite of one another. Everything is held in place with L brackets, plexiglass, screws and spacers. Between Newegg (computer hardware), Amazon (keyboard, mouse, and monitor), and the Home Depot (box and mounting hardware) the whole project only cost about $1,000.

What's really nice is that there's room enough in the box for four ATX systems with expansion cards, or probably eight-ten mini-ITX boards if you wanted to go that route.

If you haven't already, add these sites to your research:
http://www.clustermonkey.net/
http://debianclusters.org/index.php/Main_Page
http://www.calvin.edu/~adams/research/microwulf/
They were extremely valuable to me.

It won't be particularly easy, but it will be fun and rewarding like no other, and it makes a great mobile monster to show off to your friends!

Get volunteers for a botnet by Ken_g6 · 2011-12-03 10:59 · Score: 1

Heck, you'd be surprised how many projects have gotten people to volunteer to run such things. All you have to do is provide good uptime and statistics and people will come running! (Though a good project description helps too.)

--
(T>t && O(n)--) == sqrt(666)

Have you ever written CUDA code before? by jarich · 2011-12-03 11:17 · Score: 1

Writing code for video cards is much more difficult than most people think. On the other hand, if it's really a lightweight, low CPU task that's just insanely parallel, check out http://www.tilera.com/ They don't pack a ton of horses, but they do have a pile of cores.

--
Agile Artisans

FASTRA by Anonymous Coward · 2011-12-03 12:52 · Score: 0

Here's an example build for you with multiple GPUs:

http://fastra.ua.ac.be/en/index.html

Depends on lots of things by kramulous · 2011-12-03 14:21 · Score: 1

You mention GPU, but can you get the solution up and running as quickly as the CPU solution? Optimised multi-GPU solutions are not that easy, as the programmer has to do all the heavy lifting.

Does the code vectorise? If it does, then I'd be tempted to go with as many dual socket Intel machines as you can. Are you able to use the Intel compiler (leveraging the MKL, IPP and IMF as much as possible)? This assumes that communication is low. You are not going to have the cash for a low latency, high speed interconnect. The Intel compiler is free for Linux and non-commercial use.

If the code doesn't vectorise, then I'd go for the recent quad socket, 16 core AMD setups and just go for blind horsepower.

Do you work for an organisation that already has big compute requirements and a system in place? Can you buy time there? Our installation runs at about a quarter the price of EC2 (and that includes people for installing software, configuring environments and providing compiler licenses). Can you contribute money to this group? $15K with us would get you 4 nodes of dual socket, 6 core Xeons, dual lane 256 bit wide registers and 2.8GHz clock .... 1.075TFlop. And access to the compilers to get very close to that theoretical performance easily.
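
The 1.075TFlop figure can be reproduced as a theoretical peak under one reading of "dual lane 256 bit wide registers": 4 doubles per 256-bit register times 2 floating point pipes (add and multiply), i.e. 8 flops per cycle per core. That interpretation is an assumption on the reader's side, not something stated above; the sketch just shows the multiplication.

```python
# Theoretical peak for the quoted configuration.
NODES = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 6
CLOCK_GHZ = 2.8

# Assumption: 256-bit registers hold 4 doubles, and add + multiply can
# issue together, giving 8 double precision flops per cycle per core.
FLOPS_PER_CYCLE = 4 * 2

cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
peak_gflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE

print(cores, "cores,", round(peak_gflops, 1), "GFLOPS peak")
```

Real code only approaches this number when it vectorises well and keeps both pipes busy, which is the point about compilers and libraries above.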

There are all sorts of things that can steer the direction in any of the above and even others. Good luck.

--
.

Build A Fastra II clone by Anonymous Coward · 2011-12-03 14:43 · Score: 0

http://fastra2.ua.ac.be/

Fusion based cluster? by whizzter · 2011-12-03 14:54 · Score: 1

As you mention Teslas, I guess using OpenCL with AMD could be an option?

Since the Fusion chips share memory (for better and worse) with the main CPU, you can apparently get faster (zero?) "transfer time" between CPU and GPU. Maybe with this method it'd also be possible to pump up the GPU cores with larger amounts of usable memory than usual?

And since they're cheap you could buy a bucketload of them.

FPGA implementation by Anonymous Coward · 2011-12-03 15:26 · Score: 0

What sorts of embarrassingly parallel algorithms are you looking at performing?
Would it make sense to instead look at implementing this algorithm in an FPGA or several FPGA's?
some keywords:
Hardware-Software co-design
high-level to hdl

use a grid of greenarrays chips by Anonymous Coward · 2011-12-03 15:27 · Score: 0

Grid of greenarrays chips from greenarraychips.com
Set them up to pass each "thread" around a stage at a time, in a loop between those chips acting in concert as multipliers / adders, and those acting as memory controllers. Implement register-less (and thus context-less) fine grained multithreading.

Will therefore handle "n" threads, limited only by memory size.

Organize overall topology to use the "instruction code" as an address to reach the op unit that instruction requires, and keep said operation units working close to peak rate.

Should get close to the 57.6 BIPS per chip, and at extremely low power usage.

Just to be clear, use the on-core instruction code just as "microcode", use multiple cores in parallel to handle wider data. (each core is only 18 bit native width).
Pass the "register" data in parallel with the operands, and switch between threads extremely rapidly.

Total "instruction cycle" is therefore many many "clocks", for one thread, but with very many threads you're effectively retiring one instruction per clock per available op unit. Single thread performance will suck, very-many threaded performance will be untouchable.
Performance per watt will also be orders of magnitude better than anything else.
Total power dissipation will be ~ 1 W per chip. (each chip with an array of 144 microcores, each capable of up to 400 MIPS, with internal registers and a small memory space - just use it for the microcode.).

When the chips go to a state of the art process, you'll be able to pick up another order of magnitude, because the rate will go from 400 to 4000 MHz.
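
The per-chip numbers quoted above hang together as simple multiplication. The sketch below uses only figures from this post (144 cores, 400 MIPS each, roughly 1 W, and the hoped-for 4000 MHz); these are peak rates, and the "~1 W" is treated as exactly 1 W purely for illustration.

```python
# Figures as quoted in the post; all are peak/nominal, not measured.
CORES_PER_CHIP = 144
MIPS_PER_CORE = 400
WATTS_PER_CHIP = 1.0        # "~1 W", treated as exact for this sketch
FUTURE_MIPS_PER_CORE = 4000  # the post's projection for a newer process

bips_per_chip = CORES_PER_CHIP * MIPS_PER_CORE / 1000
bips_per_watt = bips_per_chip / WATTS_PER_CHIP
projected_bips = CORES_PER_CHIP * FUTURE_MIPS_PER_CORE / 1000

print(bips_per_chip, "BIPS per chip,", bips_per_watt, "BIPS per watt")
print(projected_bips, "BIPS per chip at the projected clock")
```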

Good luck and don't try to credit me with anything! (otherwise the company I work for will want to own the patent, and I don't want a patent to choke this idea to death.)

-- Anonymous Coward.

Calxeda by bhima · 2011-12-03 18:59 · Score: 1

Just because you mentioned ARM, perhaps you should look into Calxeda. I have no idea if their solution is well suited for your problem, it is a whole bunch of 32bit cores in one box. Someone else already has a similar arrangement using Intel Atom.

--
Nothing in the world is more dangerous than sincere ignorance and conscientious stupidity.

8 GPU System by oneofthose · 2011-12-03 20:44 · Score: 1

I have built 4 and 8 GPU systems. For 8 GPUs the TYAN FT72B7015 is currently the only solution that I know of. Here are some product offers with this board: http://blog.renderstream.com/2010/11/renderstream-announces-12-tflop-systems/ The GeForce cards are fine, but since they are not built for 24/7 HPC use, most vendors will warn you about warranty issues if something breaks in such a system. But they are cheap; just put 2 additional cards on the shelf next to the system and replace if needed. They get extremely hot, so consider how to cool such a beast in advance. The cards are also considerably faster than the Tesla solutions. If you need raw performance, ECC will slow you down, so see if you can do without.

Amazon is not too expensive by khipu · 2011-12-04 01:39 · Score: 1

You may be able to buy hardware more cheaply, but you're not going to beat Amazon on overall cost, once you take even minimal maintenance, power, server room space, etc. into account. You may be able to save money over EC2 by putting in your own labor, just realize that this can be a lot of work.

Stop whining by Anonymous Coward · 2011-12-04 01:46 · Score: 0

1. If you don't like it, don't click on the link to read it.
2. This is doing your homework, or at least partially. Think of it as a distributed/remote brainstorming session. Brainstorming is about throwing up a whole bunch of solutions, not evaluating any of them as they are suggested, no matter how silly they may seem initially.
3. The actual competition is a useful brainstorming contribution. A link or even a name that could be searched on would be rather more useful.

G7 + Fusion-IO by Dr.Ruud · 2011-12-04 02:48 · Score: 1

My favorite boxes currently are HP-G7 with 2x 640 GB Fusion-IO and some hard disk. The data is in partitioned tables (MySQL 5.5), and I run for example 20 SELECT (partition-optimized) queries in parallel on them.

This is much faster (for me) than for example a Hadoop cluster. Mainly because the data is already where I query it. For me the copying of big data sets is the main bottleneck.

My aggregations normally finish in 1-5 minutes. One box does 60 GB per minute, 4 boxes do 90 GB per minute, so no big win by adding boxes.
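
Put as scaling numbers, using only the two throughput figures above: the speedup and efficiency lines are plain ratios. The last line fits Amdahl's law to the two data points, which is an assumed model for illustration only; nothing above says the flattening actually comes from a serial fraction.

```python
def scaling(throughput_one, throughput_n, n):
    """Return (speedup, parallel efficiency) going from 1 box to n boxes."""
    speedup = throughput_n / throughput_one
    return speedup, speedup / n

def amdahl_serial_fraction(speedup, n):
    """Serial fraction s that would give this speedup on n boxes
    under Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    return (n / speedup - 1) / (n - 1)

# 60 GB/min on one box, 90 GB/min on four boxes (figures from the post).
speedup, efficiency = scaling(60, 90, 4)
serial = amdahl_serial_fraction(speedup, 4)

print("speedup:", speedup)
print("efficiency:", efficiency)
print("implied serial fraction:", round(serial, 3))
```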

Micro-mega-cluster of 400 Raspberry Pis by Anonymous Coward · 2011-12-04 04:20 · Score: 0

Those would total to about $10000 (USB networking?) or $14000 with ethernet :)

Also need to pay for:
- about 40 dumb switches or USB hubs (get a quantity rebate, left-over stock, fire sales or similar)
- some USB harddrives for the data (or send it somewhere on the internet). 2TB USB drives are about $120 where I live and things are usually more expensive here.
- keyboard (dump dive or splurge 30-40$)
- mouse (dump dive or splurge with keyboard)
- monitor (dump dive or splurge about 110$ --I just bought a 24" AOC lcd monitor with tons of connection options (vga, dvi-d/hdmi, USB) with Raspberry Pi in mind for about $110, it was on a special sale and the cheapest available no matter size anywhere where I live)
- perhaps one or two cheap shelves (clay bricks and boards should do) although I would be more inclined to just hang them all on string from the ceiling! XD
- maybe a few consumer fans? Could get messy with everything hanging down :D
- wiring (if you're using ethernet).

2-3 months to delivery and you would have to contact the R-Pi people and pay up front for a special batch.

This might all be impractical but it sure would look impressive and insane XD

sandia labs gumstix mini-cluster by Anonymous Coward · 2011-12-04 05:15 · Score: 0

http://www.flickr.com/photos/denix0/5702439532/
http://www.gumstix.com/press/Gumstix-strongbox-TI-Tech-Days-2011.pdf

Differences in Geforce & Tesla by gupg · 2011-12-04 07:06 · Score: 1

Differences between GeForce and Tesla for compute are at:
http://www.nvidia.com/object/why-choose-tesla.html

Bottom line, GeForce is great for development, but if you want to build a cluster with GPUs,
you are better off using the commercial grade Tesla GPUs.

Sumit
NVIDIA Tesla Group

Easy way to program GPUs by gupg · 2011-12-04 07:14 · Score: 1

There is a new easier way to program GPUs now using Directives-based compilers.

Idea is that you add some high-level pragmas to your C or Fortran code that a parallelizing compiler
uses to map to the GPU accelerator. Of course, you have to expose parallelism in the code for
the compiler to do a decent job. Example, use more data-parallel data structures. But this is a nice
incremental way to take advantage of the GPU.

Check it out at:
http://www.nvidia.com/object/tesla-2x-4weeks-guaranteed.html?cid=dev

Sumit
NVIDIA - Tesla Group

Gumstix by Anonymous Coward · 2011-12-04 08:26 · Score: 0

What about Gumstix?
There was a presentation of a cluster of them at some conference, on YouTube. Can't help with finding the URL.

https://www.gumstix.com/store/product_info.php?products_id=261
https://www.gumstix.com/store/product_info.php?products_id=247

Re:many AMD CPUs unless the GPU port is done alrea by randyleepublic · 2011-12-04 10:15 · Score: 1

Tyan looks like a Supermicro competitor, but their stuff is always under-engineered and falls apart. Supermicro FTW!

--
Social Credit would solve everything...

parallel or distributed computing? by atria · 2011-12-04 11:42 · Score: 1

Given that it is an "embarrassingly parallel application" (it's distributed computing, not parallel computing), things like CUDA, OpenCL or MPI are excluded.
First of all, you must make clear to yourself whether you need threads of execution or cores for processing.
Price-wise I would recommend multiple dual socket AMD servers: best price per core, and if your analysis scales linearly with cores, it is the best choice.
For administering such nodes I would recommend rocksclusters.org (the underlying job scheduler can be Torque, Condor or SGE).

205 comments