IBM Releases Cell SDK
derek_farn writes "IBM has released an SDK running under Fedora core 4 for the Cell Broadband Engine (CBE) Processor. The software includes many gnu tools, but the underlying compiler does not appear to be gnu based. For those keen to start running programs before they get their hands on actual hardware a full system simulator is available. The minimum system requirement specification has obviously not been written by the marketing department: 'Processor - x86 or x86-64; anything under 2GHz or so will be slow to the point of being unusable.'"
No. In our insanely litigious society, a company has graciously allowed another to create and market a different processor by the same exact name.
My favorite quote from TFA...
...the Cell processor is an upcoming PowerPC variant that will be used in the PlayStation 3. It's great at DSP but terrible at branch prediction, and would not make a very good Mac. If you want to know full tech specs, Hannibal is da man.
Um. That's kind of a weird statement. I think they mean to say that it encompasses much of the multiprocessing capabilities of a modern PC in a single chip. i.e. It's your CPU and GPU rolled into one.
Cell processors aren't really anything all that new per say. The multi-core design makes them superficially similar to GPUs (which are also vector processors) with the difference that GPUs use multiple pipelines for parallel processing whereas each cell is a self-contained pipeline capable of true multi-threaded execution. In theory, the interplay between these chips could accelerate a lot of the work currently done through a combination of software and hardware. e.g. All the work that graphics drivers do to process OpenGL commands into vector instructions could be done on one or two cells, thus allowing those cells to feed the other cells with data.
I guess you could say that the cell processor is the start of a general purpose vector processing design. I'm not really sure if it will take off, but unbroken thoroughput on these things is just incredible.
Javascript + Nintendo DSi = DSiCade
Easy answer - the wiki article on "Cell" isn't that good. Cell isn't a System-On-A-Chip. It's just a stripped-down, in-order power pc core coupled to 8 single-purpose in-order SIMD units, using an unconventional cache/local memory architecture. It can run perfectly optimized code very, very fast, at extremely low power consumption to boot, but optimization will be/is a bitch. For instance, you have to unroll your "for" loops to start, since those SIMD co-processors can't do loops.
I'm sure IBM and Sony have much better documentation on the CPU than I do, but that's it in a nutshell. Everything else you hear about it is just marketing. Oh yeah, almost forgot. Microsoft's "Xenon" processor for the Xbox360 is pretty much just 3 of those stripped down, in-order PPC cores in one cpu die.
------- "From bored to fanboy in 3.8 asian girls" ----------
Thats great news, but as an embedded systems designer and eternal tinkerer, where will I be able to buy a handfull of these processors to experiment with? Without having to dismantle loads of games machines ;o)
Open Source Drum Kit, LPLC deve board - mjhdesigns.com
As the Cell is basically a PPC processor I find it strange that the SDK is for x86 processors. Fedora Core 4 (PowerPC), also known as ppc-fc4-rpms-1.0.0-1.i386.rpm is listed as one of the files you need to download. Maybe it's just because of the large installed base of x86 machines.
It'd be nice if IBM released a PPC SDK for Fedora, it would have the potential to run much faster than an x86 SDK and simulator.
infested with jello like fishes no melotron wishes
The software includes many gnu tools, but the underlying compiler does not appear to be gnu based.
Is this any surprise? My understanding was the Cell's a vector process, and despite the recent upgrades to GCC, it's still fairly awful at autovectorisation.
Can anyone clarify?
Why Fedora is so often considered the default target distribution I don't know. Even the project page states it's an unsupported, experimental OS, and one now comparitvely marginal when tallied.
Must be a case of 'brand leakage' from a distant past, one that held Redhat as the most popular desktop Linux distribution.
Shame, I guess IBM is missing out on where the real action is.
The reason you want to unroll loops is because of various other delays. If it takes 7 cycles to load from the local store to a register, you want to throw a few more operations in there to fill the stall slots. Unrolling can provide those operations, as well as reduce the relative importance of branch overheads.
How does one get a hold of a real CBE-based system now? It is not easy: Cell reference and other systems are not expected to ship in volume until spring 2006 at the earliest. In the meantime, one can contact the right people within IBM to inquire about early access.
By the end of Q1 2006 (or thereabouts), we expect to see shipments of Mercury Computer Systems' Dual Cell-Based Blades; Toshiba's comprehensive Cell Reference Set development platform; and of course the Sony PlayStation 3.
The cell processors can do DMA to and from main memory while computing. As IBM puts it, "The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data." So the cell processors basically have to be used as pipeline elements in a messaging system.
That's a tough design constraint. It's fine for low-interaction problems like cryptanalysis. It's OK for signal processing. It may or may not be good for rendering; the cell processors don't have enough memory to store a whole frame, or even a big chunk of one.
This is actually an old supercomputer design trick. In the supercomputer world, it was not too successful; look up the the nCube and the BBN Butterfly, all of which were a bunch of non-shared-memory machines tied to a control CPU. But the problem was that those machines were intended for heavy number-crunching on big problems, and those problems didn't break up well.
The closest machine architecturally to the "cell" processor is the Sony PS2. The PS2 is basically a rather slow general purpose CPU and two fast vector units. Initial programmer reaction to the PS2 was quite negative, and early games weren't very good. It took about two years before people figured out how to program the beast effectively. It was worth it because there were enough PS2s in the world to justify the programming headaches.
The small memory per cell processor is going to a big hassle for rendering. GPUs today let the pixel processors get at the frame buffer, dealing with the latency problem by having lots of pixel processors. The PS2 has a GS unit which owns the frame buffer and does the per-pixel updates. It looks like the cell architecture must do all frame buffer operations in the main CPU, which will bottleneck the graphics pipeline. For the "cell" scheme to succeed in graphics, there's going to have to be some kind of pixel-level GPU bolted on somewhere.
It's not really clear what the "cell" processors are for. They're fine for audio processing, but seem to be overkill for that alone. The memory limitations make them underpowered for rendering. And they're a pain to program for more general applications. Multicore shared-memory multiprocessors with good cacheing look like a better bet.
Read the cell architecture manual.
You have this backwards. Optimizing compilers will turn tail-recursive style source into "normal" loops.
You can write a loop recursively, so that:becomesRecursion in foo-helper is in the tail position. That is, foo-helper only calls itself as the final operation before returning.
Compiling this naively involves a function call per recursion, which on most architectures results in pushing data onto the stack. However, because we are doing tail-recursion, we can do a tail call elimination optimization.
How this works is that the "return" before the recursion is taken to mean that any automatic variables are dead, any stack space used for the arguments is reusable, and the recursive call is really a jump.
That is, when foo-helper calls itself, it really does an argument rewrite and jump, which in effect "pretends" that foo-helper was called with different arguments in the first place.
In other words, tail call elimination turns recursive loops into iterative loops.
Writing in "tail-recursive style" just means making sure your recursion is done in tail position (i.e., attached to a "return"). Some compilers for a variety of languages can identify recursion which is not done in the tail position, and reorder the recursion into tail position (and then the tail calls are eliminated into iterative loops). However, many compilers can't, and many more don't do tail-call elimination at all.
Once you've optimized recursive loops into iterative ones, you can optimize iterative loops however you like, including partially or fully unrolling them.
In summary, recursion is a way of looping, but function calls are not free. In particular, they usually consume stack space. If you only return the result of your recursion, then you are tail-recursing. Tail recursion can be turned into code which does not incur function-call overhead.