IBM to use Cell in Blade Servers

Noteworthy Information by gasmonso · 2006-02-09 05:51 · Score: 4, Informative

Take a peek at http://www.research.ibm.com/cell/patents_and_publications.html to see the patents and whitepapers for Cell technology. One interesting point is the Online Game Prototype white paper on there.

http://religiousfreaks.com/

I work in blade development. by Thaidog · 2006-02-09 06:19 · Score: 5, Informative

We've had blades with Cell CPUs on them for quite a while. They're a lot different from any other architecture... resembling the pSeries layout more so than others. One thing I don't like about the prototypes is that the Cell CPUs, along with the BGA memory they use, are fused directly to the logic board. There were a few pictures released to the public about a year ago on The Register, but I cannot find them now. Other than that they are seriously fast and very clusterable.

--

||| I still can't believe Parkay's not butter.

Smaller blade chassis? by killtherat · 2006-02-09 06:19 · Score: 2, Informative

IBM has opened the spec for their blade chassis design. Does anybody know if somebody is trying to make a 'desktop' blade chassis? Rather than buying a huge box that holds 14 blades, something that might only hold two.
This doesn't mean make a desktop out of a blade, because as I understand it, so far the JS20s (IBM's PPC 970 blade) don't even have video cards. You have to set them up over the serial port, and run them over the network.
But does anybody have a development sized unit you don't need a server rack and new power circuits for?

Re:Smaller blade chassis? by ivan256 · 2006-02-09 06:43 · Score: 2, Informative

Portable development units come mounted on their side in a 19" enclosure with a handle on top, semi-attractive looking trim pieces, and appropriate power supplies and cooling on the inside. They cost about three times what you'd pay for a standard rackmount production model.

Re:Sun has 'em beat by ArbitraryConstant · 2006-02-09 06:33 · Score: 3, Informative

"As I understand it, the various pipelines of the Cell chip tend to be more specialized than the Coolthreads technology Sun is using on their new T1 processor."

Yes. A Cell's SPUs are not PowerPC processors, so you can't run the same code on the PowerPC front end as you do on the SPUs. Not only that, but Cell and Niagara are designed for totally different things. Cell is designed for floating-point intensive apps with pretty poor general purpose capabilities, while a Niagara has 1 floating point unit shared between all 8 cores and 32 threads but they're all good at the branchy sort of thing servers usually run.

I think these Cell servers will be more useful for things like render farms. They'll be essentially useless as generic servers for web or database duty.

--
I rarely criticize things I don't care about.

Re:Your organs are specialized, too. by ivan256 · 2006-02-09 06:36 · Score: 2, Informative

You used them as programmable DSPs. The CPU couldn't actually do the hard work fast enough... The chip did the 'hard' work, and they just made the CPU do more work than a full modem.

Re:Sun has 'em beat by Anonymous Coward · 2006-02-09 06:42 · Score: 1, Informative

Most render farms these days spend most of their time crunching GI lighting and ambient occlusion. This is very parallel, but needs access to LOTS of memory. Unless each SPU can independently access the entire address space, rendering will be slower than on the PPC alone.

Cell and T1 not targeting the same space by raftpeople · 2006-02-09 06:43 · Score: 2, Informative

Sun's new processor is designed for many-connection business server applications. Web stuff.

The Cell is designed for image processing and other high-volume number crunching.

The design decisions both companies made were heavily influenced by their target markets for these specific processors, and those target markets are very different.

These are apples and oranges.

Re:Sun to use new chips: DragonBall by Tolookah · 2006-02-09 06:44 · Score: 2, Informative

They can't use that name, freescale (motorola) already has it, and killed the line
http://www.freescale.com/webapp/sps/site/taxonomy.jsp?nodeId=0162468rH3YTLCvL2v

if you knew this, then fwoosh went the joke over my head

Re:Big Difference Between Itanium and Cell by ShadowFlyP · 2006-02-09 06:53 · Score: 5, Informative

Actually, the bigger difference is in how the architecture changed. Cell processor is more along the lines of multi-core DSPs. The instruction set is different than general computing cores and there are many of them. The key is that these cores are disjoint. You can run one application on one core and another application on another core.

The Itanium is different than this in that it required instructions to be passed to the CPU as "bundles". Any of the instructions in a bundle could be executed in any order, but these instructions were all from the same application. Thus, in order to extract speed from the Itanium, the compiler was forced to extract parallelism from within functions. This is very difficult since most programming is fairly sequential. The Cell, on the other hand, allows you to execute different tasks and so puts this control back on the programmer instead of extra work for the compiler.

Itanium was (is) a great idea from compiler theory perspective, but doesn't work out all that well (yet) in the real world.

IBM already has these tools available by raftpeople · 2006-02-09 06:55 · Score: 2, Informative

To the programmer, communicating with the SPU is abstracted to file i/o operations. Go check out IBM developerworks pages for lots of info.

Re:Where have I heard this before? by John Whitley · 2006-02-09 06:58 · Score: 5, Informative

Deja vu?

Nice quip, but the realities of the situation are completely different. My take on EPIC nee IA-64 when it was first publicly announced was surprise at an architecture that actually encouraged ultra-complex processor control logic. This, when prevailing trends tended to find ways to manage or reduce that complexity, or at least provide unambiguous chip-compiler synergy. Put another way, Intel made design choices that made the hardware itself very challenging to build and properly synergize with a compiler to achieve high total performance. Intel had certainly shown their chops at this sort of high-complexity chip controller design in the x86 line, but the move still seemed brazen from an outsider's perspective. History now shows that they certainly had trouble going down that path...

Cell, however, is basically a bog-stock PowerPC with DSP engines at its disposal. Think Altivec/MMX/SSE type units on steroids. This approach provides computing power that isn't applicable to all tasks, but is generally proven to perform well for applications that require high performance mathematical processing. Incidentally, that's precisely the target market that IBM's stated they're after with Cell-based servers. Moreover, Cell's scalability model and hardware complexities are much more manageable.

To really leverage Cell's power from the software side will require some or all of 1) good compiler and toolchain support, 2) good library support, and 3) dedicated development effort for the specific application. IBM has the expertise and motivation to provide 1 and 2, and developers in the supercomputing world tend to get really good at 3. When your *highly optimized* supercomputer app may take on the order of a year to run, big emphasis tends to be put on making it run fast. Months of work to save years of time.

It still remains to be seen how this effort will play out in the marketplace, but variants of Cell's basic approach are working right now in many, many devices.

Re:How about a free optimizing compiler by Anonymous Coward · 2006-02-09 06:59 · Score: 2, Informative

There is a free GCC compiler for Cell. And Linux. And you can get a free simulator to run it on. All at http://www.ibm.com/developerworks/power/cell

Re:Sun has 'em beat by Anonymous Coward · 2006-02-09 07:13 · Score: 1, Informative

As luck has it, each SPE can DMA main memory to its local store independently. The Cell has huge amounts of bandwidth available to it.

Re:hardware abstraction? by 2megs · 2006-02-09 07:15 · Score: 3, Informative

Developers who write code that takes advantage of GPUs in modern gaming PCs are already familiar with this style programming,

But you can probably count on your fingers the number of developers who are using GPUs for anything other than rendering pixels, or at most some simple vectorizable simulations like water or cloth.

Taking an arbitrary program and turning it into something that would run well on a GPU (or a Cell SPU) usually requires a significant redesign of the algorithms and data structures as compared to what you would naively and straightforwardly do in C...or it won't get anywhere near peak performance and may even run slower. It's certainly possible to do, but you won't be re-using any of that originally written code, and it's a different way of thinking from what 95% of programmers are used to. I'm speaking from experience as someone who earns his living by being in the remaining 5%. :)

As the original poster said: you hand optimize (and design) your program for the cell.

Re:How about a free optimizing compiler by Tune · 2006-02-09 07:45 · Score: 4, Informative

First, as others have already commented, a gcc backend is already available and Linux runs on Cell.

Second, optimizing compilers tend to optimize only small parts of linear code. Simply put, this comes down to filtering binaries and replacing inefficient code sequences by more efficient ones. Depending on the quality of the compiler core, this typically gains a few percent, occasionally some 25%, but that's nowhere near what Cell could offer, namely (theoretically) 800%.
The problem is refactoring the problem to run in
- small chunks,
- independently (parallel)
- and on a specialized processor.
A compiler can help only modestly with the last point. In any non-trivial case, this means reanalyzing the problem and reimplementing the solution from the start, making different tradeoffs. That is why people say Cell is difficult.
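To make the three points above concrete, here is a minimal sketch in Python of what "small, independent chunks" looks like for a toy problem. Everything in it is an illustrative assumption: the names `split`, `partial_sum_of_squares` and `parallel_sum_of_squares` are made up, and ordinary threads stand in for the SPEs, so this shows the shape of the refactoring rather than real Cell code.

```python
from concurrent.futures import ThreadPoolExecutor

WORKERS = 8  # stand-in for eight SPEs; plain threads, not real SPE code


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """Work on one chunk only; no shared state with the other chunks."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data):
    """Farm the chunks out to workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        return sum(pool.map(partial_sum_of_squares, split(data, WORKERS)))
```

The point of the sketch is that the splitting and recombining had to be designed by hand; nothing in the original one-line `sum(x * x for x in data)` tells a compiler how to do it.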

IMHO, the benefits of code optimization will be close to irrelevant for almost any successful application on Cell over the coming years. And while Moore's law has provided us with bigger and faster hardware, we programmers are still mostly empty-handed when it comes to program translation for parallel architectures.

We need a paradigm shift, not an optimizing compiler.

compute per silicon-area/watt/$ by Soong · 2006-02-09 08:30 · Score: 3, Informative

PPEs are bigger. Also, a dedicated slave processor doesn't have to worry about interrupts and context switches and OS crap, it can spend all its cycles on number crunching. Cell SPEs are all about moving large amounts of data and doing a whole lot of compute on that data. They're simpler and more efficient at what they're designed for.

--
Start Running Better Polls

Re:Big Difference Between Itanium and Cell by DoofusOfDeath · 2006-02-09 08:39 · Score: 2, Informative

You're close to correct. The Cell processor does have a bunch of cores that are basically DSPs (no virtual memory, etc.) BUT there's also another core that's basically a full-blown Power processor. That core is meant to rule the others.

So while you do still have to program differently for a cell with 8+1 cores than you would for a computer with 9 Power processors, it's still not like being stuck with just 9 DSPs.

"multi-core DSPs" WITH CRIPPLED FPUs!!! by mosel-saar-ruwer · 2006-02-09 08:40 · Score: 4, Informative

Actually, the bigger difference is in how the architecture changed. Cell processor is more along the lines of multi-core DSPs.

Standard computer graphics are RGB color at 24-bits per pixel [2^24 = 16777216], i.e. about 16 million colors.

Standard thinking in the graphics bidness is that: If our triangles will only be displayed in 24-bits worth of color, then why do we need to perform triangle-arithmetic in anything higher than maybe 32-bits worth of floating points?

Hence floating point calculations are 24-bit in the ATi world, and 32-bit in the nVidia and Playstation3/Cell world.

Boy, I hope they're upping that floating point number for these "server" chipsets, cause 32-bit single-precision floats are essentially worthless for even something as trivial as computing interest on a bank statement.
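A small sketch of why single precision runs out of room for money. It uses Python's standard `struct` module to round a value to 32-bit float; the helper name `to_f32` and the balance figure are assumptions chosen for illustration.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# A single-precision float carries a 24-bit significand, so from 2**24
# upward it cannot represent every whole number, let alone every cent.
balance = to_f32(16_777_216.0)         # 2**24 dollars, exactly representable
after_deposit = to_f32(balance + 1.0)  # deposit one dollar, store as float32

# The deposit is lost when the result is rounded back to single precision.
deposit_lost = after_deposit == balance
```

At this magnitude neighbouring float32 values are two dollars apart, so a one-dollar deposit rounds away entirely.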

On the other hand, a "Cell" server CPU with a 128-bit FPU would be something to drool over. The problem, though, is that transistor counts on FPU's tend to increase as n^2, so each time you double the FPU bit-count [to 64-bits, then to 128-bits], your transistor count goes through the roof.

Re:Good point. Unfortunately ... by be-fan · 2006-02-10 07:49 · Score: 2, Informative

Cell's peak theoretical performance is about 25 gigaflops per SPE, derived by taking the product of the clock speed (3.2 GHz) and the number of operations per cycle (8). In reality, this figure is highly optimistic. Each SPE only has a single floating-point pipeline. The 8 operations/cycle figure is derived by counting a 4-element single-precision multiply-accumulate as 8 total operations. Moreover, when doing double-precision operations, it takes an additional 5x speed hit, since they must be performed in multiple clock cycles. That results in Cell's theoretical performance for double-precision code being a total of 10x lower (according to IBM), or around 2.5 gigaflops per SPE. At 10 gigaflops per chip, that's still relatively impressive, compared to the 5 gigaflops per chip a dual-core Opteron (2.4 GHz) can handle, but the actual performance of a Cell chip is going to be a lot less than the actual performance of the Opteron.
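The per-SPE arithmetic above, written out as a quick Python sketch. The constant names are made up for illustration; the numbers are the ones quoted in this post, not independent measurements.

```python
CLOCK_GHZ = 3.2      # SPE clock speed quoted above
OPS_PER_CYCLE = 8    # 4-element multiply-accumulate counted as 8 operations
DP_SLOWDOWN = 10     # "a total of 10x lower" for double precision

# Peak single-precision rate per SPE: about 25.6 gigaflops.
sp_gflops_per_spe = CLOCK_GHZ * OPS_PER_CYCLE

# Peak double-precision rate per SPE: about 2.56 gigaflops.
dp_gflops_per_spe = sp_gflops_per_spe / DP_SLOWDOWN
```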

Understanding why requires a bit of understanding of chip-design, but the basics are simple. The Cell SPE basically has four things working against it:

1) No dynamic branch prediction. This means that when the Cell SPE encounters a branch instruction, it will always assume the backwards branch is taken. This works fine for loops, where it's good to assume that the branch at the end of the loop will jump back to the beginning of the loop, but doesn't work well for anything else. If the guess is wrong, then the CPU pays an 18-cycle penalty while the pipeline is flushed and the correct branch path is followed. The Opteron, on the other hand, keeps track of the history of each branch. It can then make a much better guess about which way the branch will go, and avoid paying a penalty for guessing wrong. Since the Opteron's pipeline is shorter, this also means the penalty for an incorrect guess is much less (around 12 cycles). The net result of all this is that if your code has lots of short loops (static branch prediction always mispredicts the iteration that exits the loop), or a lot of complex control flow, Cell's SPEs are going to lose a lot of their theoretical performance since many cycles will be wasted on mispredicted branches.
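A toy model of the short-loop case, as a Python sketch. It assumes a loop whose only branch is the backward jump at the bottom, and uses the 18-cycle figure quoted above; the function name `loop_penalty` is hypothetical and the model ignores everything else a real pipeline does.

```python
def loop_penalty(iterations: int, penalty: int = 18) -> int:
    """Cycles lost to mispredicting the loop-closing branch in one run of a loop."""
    mispredicts = 0
    for i in range(iterations):
        actually_taken = i < iterations - 1  # the last iteration falls through
        predicted_taken = True               # static rule: backward branch is taken
        if predicted_taken != actually_taken:
            mispredicts += 1
    return mispredicts * penalty
```

The penalty is paid once per run of the loop regardless of length, so a 4-iteration loop loses 18 / 4 = 4.5 cycles per iteration while a 1000-iteration loop loses 0.018. That is why many short loops hurt far more than a few long ones.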

2) Very high latency for instructions and loads. In Cell, the floating-point latency is at least 6 cycles, and the load latency from the local store is at least 6 cycles. For Opteron, it's 4 cycles and 3 cycles, respectively. Basically, the instruction latency tells you by how many clock cycles you must separate dependent operations. E.g.: on an Opteron, you can issue a memory load, and assuming an L1 cache hit, you can issue an instruction that uses the loaded register 3 cycles later. If you have no instructions you can issue until that load is completed, then you just issue nothing that cycle and lose some of your potential throughput. Since the SPE's latencies are much higher, there is a much higher chance that you won't have any non-dependent instructions to issue on a given cycle, and must waste that cycle.
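A rough way to see the effect is a tiny in-order issue model, sketched here in Python. It assumes one instruction issued per cycle and the same latency for every instruction, which is a simplification of any real pipeline; `in_order_cycles` and the register names are made up for the example.

```python
def in_order_cycles(program, latency):
    """Cycle at which the last result is ready on a simple in-order machine.

    program is a list of (dest_register, [source_registers]) tuples.
    """
    ready = {}  # register -> cycle at which its value becomes available
    cycle = 0   # earliest cycle in which the next instruction may issue
    for dest, sources in program:
        # Stall until every source operand has been produced.
        issue = max([cycle] + [ready.get(s, 0) for s in sources])
        ready[dest] = issue + latency
        cycle = issue + 1
    return max(ready.values(), default=0)


# Three operations where each one needs the previous result.
chain = [("r1", ["r0"]), ("r2", ["r1"]), ("r3", ["r2"])]
# Three operations that all read only r0, so none waits on another.
independent = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r0"])]
```

With a latency of 6 the dependent chain takes 18 cycles against 8 for the independent version; with a latency of 4 the chain takes 12. The longer the latency, the more independent work you need on hand to keep the pipeline busy.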

3) A very specialized memory model. Cell's SPEs can only directly address 256KB of local memory. If you have data bigger than that, you have to manually shuffle it in and out of that local memory. The latency for doing this shuffling is extremely high on Cell. This means that in code that accesses big data sets, if you can't effectively partition your data sets, you'll waste a lot of time shuffling things in and out of memory.
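The manual shuffling looks roughly like this Python sketch. A list slice stands in for a DMA transfer into the local store, and reserving half of the 256KB for code and other buffers is an assumption made for the example, as are the names `tiles` and `tiled_sum`.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPE local memory quoted above
BYTES_PER_FLOAT = 4             # single precision


def tiles(data, budget_bytes=LOCAL_STORE_BYTES // 2):
    """Yield consecutive slices of data that fit within budget_bytes."""
    per_tile = budget_bytes // BYTES_PER_FLOAT
    for start in range(0, len(data), per_tile):
        yield data[start:start + per_tile]  # stands in for a DMA transfer


def tiled_sum(data):
    """Sum a large array while only ever touching one tile at a time."""
    total = 0.0
    for tile in tiles(data):
        total += sum(tile)
    return total
```

A sum partitions trivially. The pain described above comes from algorithms whose access pattern does not split into tiles this cleanly.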

4) No out-of-order execution. Modern CPUs like an Opteron will rearrange your instructions to get around the instruction latencies I mentioned earlier. It'll look ahead in the code stream a couple of dozen instructions to find non-dependent ones that can be issued while waiting for other ones to finish. Cell won't do that. If you have an ADD in your code, and then right after you have a MUL that uses the results of the ADD, then Cell will merrily wait 6 cycles waiting for the ADD to finish, even if right after the MUL you have another ADD that doesn't need the results of the first one. This places a lot of burden on the c

--
A deep unwavering belief is a sure sign you're missing something...
