IBM Releases Cell SDK

Well . . . by Yocto+Yotta · 2005-11-10 04:28 · Score: 2, Funny

But does it run Linux?

Oh. Well, okay then.

--
A B A C A B B

Wikipedia article question by goofyheadedpunk · 2005-11-10 04:34 · Score: 2, Insightful

Not knowing too much about the Cell processor, I read the Wikipedia article. I came across this: "In other ways the Cell resembles a modern desktop computer on a single chip."

Why?

--

What if the entire Universe were a chrooted environment with everything symlinked from the host?

Re:Wikipedia article question by AKAImBatman · 2005-11-10 04:43 · Score: 4, Insightful

Um. That's kind of a weird statement. I think they mean to say that it encompasses much of the multiprocessing capabilities of a modern PC in a single chip. i.e. It's your CPU and GPU rolled into one.

Cell processors aren't really anything all that new per se. The multi-core design makes them superficially similar to GPUs (which are also vector processors), with the difference that GPUs use multiple pipelines for parallel processing whereas each cell is a self-contained pipeline capable of true multi-threaded execution. In theory, the interplay between these chips could accelerate a lot of the work currently done through a combination of software and hardware, e.g. all the work that graphics drivers do to process OpenGL commands into vector instructions could be done on one or two cells, thus allowing those cells to feed the other cells with data.

I guess you could say that the cell processor is the start of a general purpose vector processing design. I'm not really sure if it will take off, but unbroken throughput on these things is just incredible.

--
Javascript + Nintendo DSi = DSiCade
Re:Wikipedia article question by l33t-gu3lph1t3 · 2005-11-10 04:45 · Score: 4, Insightful

Easy answer - the wiki article on "Cell" isn't that good. Cell isn't a System-On-A-Chip. It's just a stripped-down, in-order PowerPC core coupled to 8 single-purpose in-order SIMD units, using an unconventional cache/local memory architecture. It can run perfectly optimized code very, very fast, at extremely low power consumption to boot, but optimization will be/is a bitch. For instance, you have to unroll your "for" loops to start, since those SIMD co-processors can't do loops.

I'm sure IBM and Sony have much better documentation on the CPU than I do, but that's it in a nutshell. Everything else you hear about it is just marketing. Oh yeah, almost forgot. Microsoft's "Xenon" processor for the Xbox360 is pretty much just 3 of those stripped down, in-order PPC cores in one cpu die.

--
------- "From bored to fanboy in 3.8 asian girls" ----------
Re:Wikipedia article question by AKAImBatman · 2005-11-10 04:55 · Score: 2, Interesting

> Cell isn't a System-On-A-Chip. It's just a stripped-down, in-order PowerPC core coupled to 8 single-purpose in-order SIMD units, using an unconventional cache/local memory architecture

You know, I'm looking back at all these replies to the poor guy, and I can't help but think that he's sitting in front of his computer wondering, "Can't anyone explain it in ENGLISH?!?" :-P

> For instance, you have to unroll your "for" loops to start, since those SIMD co-processors can't do loops.

Actually, we need a new programming model. Instead of using FOR loops, we need a model under which you can say, "Perform these instructions X number of times." One could probably do a bit of guess-work in the compiler based on loops like "for(i=0;i<COUNT;i++)", but that doesn't help cases where the loop uses a more complex conditional statement (or where the test is affected by the loop itself). Thus the language needs to be changed to force the programmer to pre-compute the loop length for maximum performance. For example:
int i = 0;
do(COUNT) {
    /* code goes here */
    i++;
}
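
Something close to the do(COUNT) idea can be approximated in plain C. This is only a sketch: DO_TIMES is a made-up macro for illustration, not part of any compiler or SDK. All it guarantees is that the trip count is evaluated once, before the first pass, so the body cannot change how many times the loop runs.

```c
#include <stddef.h>

/* Hypothetical helper: runs `body` exactly `count` times.
   `count` is evaluated once, before the loop starts. */
#define DO_TIMES(count, body)                                  \
    do {                                                       \
        for (size_t remaining_ = (count); remaining_ > 0;      \
             remaining_--) {                                   \
            body;                                              \
        }                                                      \
    } while (0)

/* Doubles every element of v; the loop length is fixed on entry. */
void double_all(int *v, size_t n) {
    size_t i = 0;
    DO_TIMES(n, { v[i] *= 2; i++; });
}
```

Whether a compiler unrolls this any better than an ordinary counted for loop is a separate question; the macro only pins down the loop length.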

--
Javascript + Nintendo DSi = DSiCade
Re:Wikipedia article question by plalonde2 · 2005-11-10 04:55 · Score: 4, Informative

You are wrong. These SIMD processors do loops just fine. There's a hefty hit for a mis-predicted branch, but the branch hint instruction works wonders for loops.
The reason you want to unroll loops is because of various other delays. If it takes 7 cycles to load from the local store to a register, you want to throw a few more operations in there to fill the stall slots. Unrolling can provide those operations, as well as reduce the relative importance of branch overheads.
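
To make the stall-filling point concrete, here is a rough C sketch (plain C, not SPE intrinsics; the function name is invented for the example). It keeps four independent running sums so that one add does not have to wait for the previous one, which gives the hardware other work to do while a load is still in flight.

```c
#include <stddef.h>

/* Sums n ints using four independent accumulators, so consecutive
   adds do not depend on each other's results. */
long sum_interleaved(const int *data, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    long total = s0 + s1 + s2 + s3;
    for (; i < n; i++)      /* leftover elements when n is not a multiple of 4 */
        total += data[i];
    return total;
}
```

How much this helps depends on the load latency and on what the compiler already does by itself.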
Re:Wikipedia article question by Jellybob · 2005-11-10 05:05 · Score: 2, Funny

Looks like Ruby to me, although it's a little too verbose ;)

(0..9).each { |i| puts i }
Re:Wikipedia article question by tomstdenis · 2005-11-10 05:31 · Score: 2, Informative

GCC can unroll all loops if you want, including those with variable iteration counts. In those cases it uses a variant of Duff's device [well, on x86 anyway].
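
For anyone who hasn't seen Duff's device, here is a hand-written C illustration of the idea. It is a sketch of the technique only, not what GCC actually emits, and the function name is made up. The switch jumps into the middle of a 4-way unrolled loop so the leftover n % 4 elements are handled on the first pass.

```c
#include <stddef.h>

/* Sums n ints with a 4-way unrolled body entered via Duff's device. */
long sum_duff(const int *data, size_t n) {
    long total = 0;
    if (n == 0) return 0;
    size_t passes = (n + 3) / 4;   /* trips through the do-while */
    switch (n % 4) {
    case 0: do { total += *data++;
    case 3:      total += *data++;
    case 2:      total += *data++;
    case 1:      total += *data++;
            } while (--passes > 0);
    }
    return total;
}
```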

As for the other posters, the real reason you want to unroll loops is basically to avoid the cost of managing the loop, e.g.

a simple loop like

for (a = i = 0; i < b; i++) a += data[i];

In x86 would amount to

mov ecx,b
loop:
add eax,[ebx]
add ebx,4
dec ecx
jnz loop

So you have a 50% efficiency at best. Now if you unroll it to

mov ecx,b
shr ecx,1
loop:
add eax,[ebx]
add eax,[ebx+4]
add ebx,8
dec ecx
jnz loop

You now have 5 instructions for two iterations. That's down from the 8 you would have before, and so on, e.g.

mov ecx,b
shr ecx,2
loop:
add eax,[ebx]
add eax,[ebx+4]
add eax,[ebx+8]
add eax,[ebx+12]
add ebx,16
dec ecx
jnz loop

Does 7 opcodes for 4 iterations [down from the 16 required previously, i.e. more than twice as efficient].
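
One thing the listings gloss over: shr ecx,2 throws away the low bits of b, so they only add up every element when b is a multiple of 4. A C counterpart of the unroll-by-4 version with a cleanup loop might look like the sketch below (the function name is invented for the example).

```c
#include <stddef.h>

/* One counter update and one branch per four additions, plus a
   cleanup loop for the b % 4 elements the shift would discard. */
int sum_unrolled4(const int *data, size_t b) {
    int a = 0;
    size_t passes = b / 4;               /* same as shr ecx,2 */
    while (passes-- > 0) {
        a += data[0];
        a += data[1];
        a += data[2];
        a += data[3];
        data += 4;                       /* same as add ebx,16 */
    }
    for (size_t i = 0; i < b % 4; i++)   /* leftover elements */
        a += data[i];
    return a;
}
```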

Tom

--
Someday, I'll have a real sig.
Re:Wikipedia article question by AKAImBatman · 2005-11-10 05:37 · Score: 2, Interesting

> mov ecx,b
> shr ecx,2
> loop:
> add eax,[ebx]
> add eax,[ebx+4]
> add eax,[ebx+8]
> add eax,[ebx+12]
> add ebx,16
> dec ecx
> jnz loop

With SIMD instructions, you can execute all four of those adds in one instruction. I wish I knew SSE a bit better, then I could rewrite the above. Sadly, I haven't gotten around to learning the precise syntax. :-(
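
Without writing SSE assembly, the same idea can be sketched in C using the GCC/Clang vector extension. This assumes one of those two compilers; vector_size is their extension, not standard C, and the names v4si and sum_simd are made up for the example. Each `acc += chunk` adds four 32-bit integers at once, and on x86 it typically compiles to a single packed add.

```c
#include <stddef.h>
#include <string.h>

/* Four 32-bit ints packed into one 16-byte vector (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Sums n ints, four lanes at a time, then folds the lanes together. */
int sum_simd(const int *data, size_t n) {
    v4si acc = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4si chunk;
        memcpy(&chunk, data + i, sizeof chunk);  /* load that is safe for unaligned data */
        acc += chunk;                            /* four adds in one operation */
    }
    int total = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)                           /* leftover elements */
        total += data[i];
    return total;
}
```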

However, there's a fairly good (if a bit dated) explanation of SIMD here.

--
Javascript + Nintendo DSi = DSiCade
Re:Wikipedia article question by hr+raattgift · 2005-11-10 07:59 · Score: 3, Informative

> Perhaps something like writing in tail recursive style to help out an optimising compiler?...

You have this backwards. Optimizing compilers will turn tail-recursive style source into "normal" loops.

You can write a loop recursively, so that:
foo() { int x=8; int b=1; while(x > 0) { b = b << 1; --x; } return b; }
becomes
foo() { return foo-helper(8, 1); } foo-helper(int x, int b) { if(x <= 0) return b; else return foo-helper(--x, b << 1); }
Recursion in foo-helper is in the tail position. That is, foo-helper only calls itself as the final operation before returning.

Compiling this naively involves a function call per recursion, which on most architectures results in pushing data onto the stack. However, because we are doing tail-recursion, we can do a tail call elimination optimization.

How this works is that the "return" before the recursion is taken to mean that any automatic variables are dead, any stack space used for the arguments is reusable, and the recursive call is really a jump.

That is, when foo-helper calls itself, it really does an argument rewrite and jump, which in effect "pretends" that foo-helper was called with different arguments in the first place.

In other words, tail call elimination turns recursive loops into iterative loops.

Writing in "tail-recursive style" just means making sure your recursion is done in tail position (i.e., attached to a "return"). Some compilers for a variety of languages can identify recursion which is not done in the tail position, and reorder the recursion into tail position (and then the tail calls are eliminated into iterative loops). However, many compilers can't, and many more don't do tail-call elimination at all. :-(

Once you've optimized recursive loops into iterative ones, you can optimize iterative loops however you like, including partially or fully unrolling them.

In summary, recursion is a way of looping, but function calls are not free. In particular, they usually consume stack space. If you only return the result of your recursion, then you are tail-recursing. Tail recursion can be turned into code which does not incur function-call overhead.
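
As a compilable illustration of the above (the helper is spelled foo_helper here, since C identifiers cannot contain hyphens, and the *_loop and wrapper names are invented for the example): the first function is the tail-recursive form, and the second is roughly what tail call elimination turns it into, i.e. rewrite the arguments and jump back to the top. This is a sketch of the transformation, not the output of any particular compiler. GCC and Clang usually perform it when optimizing, but the C language does not guarantee it.

```c
/* Tail-recursive form: the recursive call is the last thing evaluated. */
int foo_helper(int x, int b) {
    if (x <= 0) return b;
    return foo_helper(x - 1, b << 1);
}

/* After tail call elimination: overwrite the arguments in place and
   loop, instead of pushing a new stack frame per step. */
int foo_helper_loop(int x, int b) {
    for (;;) {
        if (x <= 0) return b;
        b = b << 1;
        x = x - 1;
    }
}

int foo_recursive(void) { return foo_helper(8, 1); }
int foo_iterative(void) { return foo_helper_loop(8, 1); }
```

Both wrappers shift 1 left eight times and return 256.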
Re:Wikipedia article question by hr+raattgift · 2005-11-10 08:13 · Score: 2, Funny

(dotimes i (code-goes-here))

Ack, pfft, says the evil Schemer. This is just insipid syntactic sugar for what you really mean:
(let loop ((i number-of-iterations))
  (if (= i 0)
      #f ;; because CommonLisp dotimes returns NIL
      (begin
        (code-goes-here)
        (loop (- i 1)))))
instead of whatever dark magic your buggy
(dotimes (i number-of-iterations) (code-goes-here))
ends up being mangled into by your CommonLisp compiler because it can't do a safe-for-space tail recursion.
Re:Wikipedia article question by hr+raattgift · 2005-11-10 08:59 · Score: 2, Interesting

Ah, OK, I had to think about this a bit... please correct me if I'm still misunderstanding you.

I now think you were using a simile or making an analogy to argue that compilers can benefit from careful construction of loops in the source code.

If so, then of course I agree with you.

Saying this in a much more general way: careful choice of syntax can make the semantics more clear to the compiler.

A high level language with "dotimes (count) { action }" syntax lets the compiler make good choices about loop unrolling and the counter's type.

A language where you have to test and modify your own counter lets the writer make good or incredibly awful choices about loop unrolling and the counter's type.

This version:
foo() { double d = 1.0; int x=1; while(d > 0) { x = x << 1; d -= 0.1; } return x; }
is semantic brain-damage on a system with very slow very IEEE doubles, and loop-unrolling this naively is not going to help.

A compiler which realizes that this is a loop whose length is constant can unroll the loop fully, partially, or simply use a better/faster iterator like an integer. But should we end up with 0x400 or 0x800?

Haha, now throw side effects at your smart compiler by inserting a debugging
printf("d: %G, x: %x\n", d, x);
into the while loop ... how should it optimize that?
...
d: 0.2, x: 100
d: 0.1, x: 200
d: 1.38778E-16, x: 400
d: -0.1, x: 800
Right?
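
A small C sketch of the two counter choices, for comparison (the function names are invented for the example). With a double counter, ten subtractions of 0.1 from 1.0 usually do not land exactly on zero, because 0.1 has no exact binary representation. The trace above shows a tiny positive remainder after the tenth step, so the loop runs once more and the result is 0x800 instead of 0x400. With an integer counter the trip count is exactly ten.

```c
/* Loop controlled by a floating-point counter: the number of passes
   depends on rounding error in d. */
int shift_with_double_counter(void) {
    double d = 1.0;
    int x = 1;
    while (d > 0) {
        x = x << 1;
        d -= 0.1;
    }
    return x;
}

/* Same intent with an integer counter: exactly ten passes. */
int shift_with_int_counter(void) {
    int x = 1;
    for (int i = 0; i < 10; i++)
        x = x << 1;
    return x;
}
```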

Anyway, I think we're not really disagreeing. You can write loops stupidly, whether they're iterative (as above) or whether they're recursive. A compiler probably can't save you if you are particularly stupid. It might even make things worse.

For what it's worth, when I say your sentence to myself, I want to make the "like" bold, I guess to emphasize the simile.

Re:Is this the same Cell processor used in the PS3 by Spazntwich · 2005-11-10 04:39 · Score: 5, Funny

No. In our insanely litigious society, a company has graciously allowed another to create and market a different processor by the same exact name.

Unproductive? by RManning · 2005-11-10 04:40 · Score: 5, Funny

My favorite quote from TFA...

...in addition, the ILAR license states that "You are not authorized to use the Program for productive purposes" -- so make sure that your time spent with these downloads is as unproductive as possible.

Re:Unproductive? by Kayamon · 2005-11-10 05:13 · Score: 2

Sounds like my job. I don't think there'll be any problems there. :-)

--
Kayamon

Since the submitter didn't bother to explain... by frankie · 2005-11-10 04:41 · Score: 4, Informative

...the Cell processor is an upcoming PowerPC variant that will be used in the PlayStation 3. It's great at DSP but terrible at branch prediction, and would not make a very good Mac. If you want to know full tech specs, Hannibal is da man.

Source for actual chips? by mustafap · 2005-11-10 04:47 · Score: 3, Interesting

That's great news, but as an embedded systems designer and eternal tinkerer, where will I be able to buy a handful of these processors to experiment with? Without having to dismantle loads of games machines ;o)

--
Open Source Drum Kit, LPLC deve board - mjhdesigns.com

What about a PPC SDK and simulator? by kuwan · 2005-11-10 04:49 · Score: 4, Interesting

As the Cell is basically a PPC processor, I find it strange that the SDK is for x86 processors. Fedora Core 4 (PowerPC), also known as ppc-fc4-rpms-1.0.0-1.i386.rpm, is listed as one of the files you need to download. Maybe it's just because of the large installed base of x86 machines.

It'd be nice if IBM released a PPC SDK for Fedora, it would have the potential to run much faster than an x86 SDK and simulator.

--
infested with jello like fishes no melotron wishes

Re:What about a PPC SDK and simulator? by pbohrer · 2005-11-10 14:18 · Score: 3, Informative

The simulator is actually maintained on a number of different platforms within IBM. Since the rest of the SDK team (xlc, cross-dev gcc, sample & libs, etc) chose Fedora Core 4 on x86 as a means of enabling the most number of people, we didn't want to confuse too many people by supplying the simulator on a variety of platforms for which the rest of the SDK is not supported. This was somewhat of a big-bang release of quite a bit of software to enable exploration of Cell. Now that we have this released and the open source side of the SDK is available on the web, I am sure people will have no problem adapting that build environment to be hosted on Linux/PPC. In support of that, we will be providing a Linux/PPC version of the Cell simulator soon on alphaWorks.

--
--Pat IBM Austin Research Lab Performance and Tools, Mgr pbohrer@us.ibm.com

GNU toolchain by lisaparratt · 2005-11-10 04:50 · Score: 5, Interesting

> The software includes many gnu tools, but the underlying compiler does not appear to be gnu based.

Is this any surprise? My understanding was the Cell's a vector processor, and despite the recent upgrades to GCC, it's still fairly awful at autovectorisation.

Can anyone clarify?

Re:GNU toolchain by Have+Blue · 2005-11-10 05:24 · Score: 3, Informative

IBM may have run into the same problems with the Cell that they did with the PowerPC 970: the chip breaks some fundamental assumptions GCC makes, and to add the best optimization possible it would be necessary to modify the compiler more drastically than the GCC leads would allow (to keep GCC completely platform-agnostic).
Re:GNU toolchain by Wesley+Felter · 2005-11-10 05:32 · Score: 3, Informative

The SDK includes both GCC and XLC. GCC's autovectorization isn't the greatest, but Apple and IBM have been working on it. I think if you want fast SPE code you'll end up using intrinsics anyway.

Echoes of Redhat by delire · 2005-11-10 04:51 · Score: 3, Insightful

Why Fedora is so often considered the default target distribution I don't know. Even the project page states it's an unsupported, experimental OS, and one now comparatively marginal when tallied.

Must be a case of 'brand leakage' from a distant past, one that held Redhat as the most popular desktop Linux distribution.

Shame, I guess IBM is missing out on where the real action is.

Re:Echoes of Redhat by LnxAddct · 2005-11-10 06:10 · Score: 4, Insightful

Fedora overtook Suse within a year and a half in terms of users. It is now a close 3rd to Debian which is a far second from Red Hat (Red Hat and Fedora together have around 3 times the market share of Debian, check netcraft to confirm those numbers). The numbers on distrowatch are not downloads or users, that number is how many people clicked on the link to read about Ubuntu. Mark Shuttleworth is obscenely good at getting press about Ubuntu so the Ubuntu link gets a lot of click throughs, and now that it is at the top, it is kind of self fulfilling as interested people want to read about the top distro so they click on that more.

When it comes down to it, Fedora is the most advanced linux distribution out there. It comes standard with SELinux and virtualization. It uses LVM by default, integrates exec-shield and other code fortifying techniques into all major services. It has the latest and greatest of everything. Things just work in Fedora because a large portion of that code was coded by Red Hat. Red Hat maintains GCC and glibc, they commit more kernel code than anyone else, they play a large role in everything from Apache and Gnome to creating GCJ to get java to run natively under linux. Whether you like it or not, Fedora is the distro most professionals go with, despite what the slashdot popular opinion is and despite the large amounts of noise that a few ubuntu users create.

Out of the big two, Novell and Red Hat, Novell has never been worse off and Red Hat has never been healthier. Red Hat doesn't officially provide support for Fedora, but it is built and paid for by Red Hat and their engineers (in addition to the community contributions). By targeting Fedora, IBM knows that they are targeting a stable platform with the largest array of hardware support. IBM is in bed with both Novell and Red Hat, they didn't choose Fedora because they were paid to or something... they chose Fedora based on technical merits. Claiming that Fedora is unstable is no different than claiming GMail is in beta, both products are still the best in their respective industries. Why do people go spreading FUD about such a good product when they've never used it themselves? Whether you want to admit it or not, Fedora is the platform to target for most. It is compatible in large part with RHEL, so you're getting the most bang for your buck.

IBM doesn't just shit around, or make decisions for dumb reasons. If Fedora is good enough for IBM it is good enough for anyone. Apparently this is a common opinion as more and more businesses switch to Fedora desktops. Here is one recent story of a major Australian company, Kennards, replacing 400 desktops with Fedora. Don't be so closed-minded or you might be left behind.
Regards,
Steve

Re:Is this the same Cell processor used in the PS3 by Anonymous Coward · 2005-11-10 05:17 · Score: 2, Funny

I not get mine run. Please send exact instruction how downloaded PS3 games play can?

Cell Hardware... by GoatSucker · 2005-11-10 05:35 · Score: 4, Informative

From the article:
How does one get a hold of a real CBE-based system now? It is not easy: Cell reference and other systems are not expected to ship in volume until spring 2006 at the earliest. In the meantime, one can contact the right people within IBM to inquire about early access.

By the end of Q1 2006 (or thereabouts), we expect to see shipments of Mercury Computer Systems' Dual Cell-Based Blades; Toshiba's comprehensive Cell Reference Set development platform; and of course the Sony PlayStation 3.

Re:Linux on PS3? by MaskedSlacker · 2005-11-10 05:37 · Score: 2, Interesting

Almost definitely. A cheap beowulf of PS3s.

Rosetta to the rescue? by Caspian · 2005-11-10 05:45 · Score: 2, Interesting

'Processor - x86 or x86-64; anything under 2GHz or so will be slow to the point of being unusable.'

OK, so what they're saying is "it's slow to emulate a PPC variant on an x86 variant". Duh.

But Apple seems to have cooked up something wonderful (or at least licensed something wonderful) in this vein in the form of Rosetta, the tech that lets Mac OS X for x86 run Mac OS X for PPC binaries very fast.

Sony has several metric fucktons of money. Can't they license the Rosetta technology, or pay for it to be basically "ported" from its current state of PPC-on-x86 to Cell-on-x86? Cell is PPC-based, so it shouldn't be so hard, no?

--
With spending like this, exactly what are "conservatives" conserving?

Re:Rosetta to the rescue? by Hal_Porter · 2005-11-10 07:04 · Score: 2, Interesting

Apple wrote a great 68K emulator for the PowerPC macs. It was non JIT, and worked like a big jump table. So you took a 16bit 68k instruction, shifted it and jumped to the base of the table + the shifted offset. The code there would essentially be a PowerPC version of the 68K code.

http://www.mactech.com/articles/mactech/Vol.10/10.09/Emulation/

So you end up doing four instructions to decode the 68K instruction, and then whatever it takes to actually do the operation, typically 2-4.

JIT emulators would profile the code and check which bits were frequently executed. Then they would essentially copy the table entries into a buffer. So in a loop, you'd just execute the 2-4 native instructions and skip the table dispatch.
There's another benefit too, you can skip things like condition code updates, if you know that they will be overwritten by another instruction before they are checked. Plus you can do peephole optimisations, constant folding and so on.
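
The table-dispatch scheme is easy to sketch in C. Everything here is invented for illustration: a toy guest machine with four made-up opcodes, not real 68K encodings and not Apple's emulator. The opcode indexes straight into a table of native handlers, which is the "jump to table base plus offset" step described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy guest machine: one accumulator and a program counter. */
typedef struct {
    uint32_t acc;
    size_t pc;
    int halted;
} Machine;

typedef void (*Handler)(Machine *m, uint8_t operand);

static void op_halt(Machine *m, uint8_t operand) { (void)operand; m->halted = 1; }
static void op_load(Machine *m, uint8_t operand) { m->acc = operand; }
static void op_add(Machine *m, uint8_t operand)  { m->acc += operand; }
static void op_shl(Machine *m, uint8_t operand)  { m->acc <<= (operand & 31); }

/* Dispatch table: the opcode is the index, the entry is native code. */
static const Handler dispatch[4] = { op_halt, op_load, op_add, op_shl };

/* Each guest instruction is two bytes: opcode, operand.
   Returns the accumulator when the program halts or runs out. */
uint32_t run(const uint8_t *code, size_t len) {
    Machine m = { 0, 0, 0 };
    while (!m.halted && m.pc + 1 < len) {
        uint8_t opcode = code[m.pc] & 3;   /* mask keeps the index in range */
        uint8_t operand = code[m.pc + 1];
        m.pc += 2;
        dispatch[opcode](&m, operand);     /* decode is one table lookup */
    }
    return m.acc;
}
```

A JIT along the lines described would go one step further and paste the handler bodies for a hot loop back to back into a buffer, skipping the lookup entirely.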

There's a wonderful article here -

http://www.gtoal.com/sbt/

I can easily believe that CPU intensive code like image processing can run at a very impressive speed, especially as top of the range x86 chips have better SPECint performance than a top of the range PPC.

Incidentally, I read about Apple's second generation 68K emulator being a "dynamic recompiler", so they've been working on this sort of thing for ages.

--
echo -e 'global _start\n _start:\n mov eax, 2\n int 80h\n jmp _start' > a.asm; nasm a.asm -f elf; ld a.o -o a;

"cell" architecture is all about local memory by Animats · 2005-11-10 06:25 · Score: 4, Informative

The "cell" processors have fast access to local, unshared memory, and slow access to global memory. That's the defining property of the architecture. You have to design your "cell" program around that limitation. Most memory usage must be in local memory. Local memory is fast, but not large, perhaps as little as 128KB per processor.

The cell processors can do DMA to and from main memory while computing. As IBM puts it, "The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE's local store so that the SPE's DMA controller can process the list asynchronously while the SPE operates on previously transferred data." So the cell processors basically have to be used as pipeline elements in a messaging system.
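
The access pattern IBM describes is essentially double buffering. Here is a sketch of the control flow in portable C, with heavy assumptions: fetch_chunk is a stand-in that uses memcpy where real SPE code would start an asynchronous DMA transfer, CHUNK is an arbitrary size, and none of the names come from the Cell SDK. The point is only the ordering: kick off the next transfer, then work on the data you already have.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 1024   /* ints per local buffer; arbitrary for this sketch */

/* Stand-in for an asynchronous DMA from main memory into local store.
   Here it is a plain memcpy; on real hardware the copy would run in
   the background while the caller keeps computing. */
static size_t fetch_chunk(int *local, const int *main_mem,
                          size_t offset, size_t total) {
    size_t count = total - offset < CHUNK ? total - offset : CHUNK;
    memcpy(local, main_mem + offset, count * sizeof *local);
    return count;
}

/* Sums `total` ints from "main memory" using two local buffers. */
long sum_double_buffered(const int *main_mem, size_t total) {
    static int local[2][CHUNK];
    long sum = 0;
    size_t offset = 0;
    int cur = 0;
    size_t have = total > 0 ? fetch_chunk(local[cur], main_mem, 0, total) : 0;
    while (have > 0) {
        size_t next_offset = offset + have;
        size_t next_have = 0;
        if (next_offset < total)            /* start the next transfer first */
            next_have = fetch_chunk(local[1 - cur], main_mem, next_offset, total);
        for (size_t i = 0; i < have; i++)   /* then process the current buffer */
            sum += local[cur][i];
        offset = next_offset;
        have = next_have;
        cur = 1 - cur;                      /* swap buffers */
    }
    return sum;
}
```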

That's a tough design constraint. It's fine for low-interaction problems like cryptanalysis. It's OK for signal processing. It may or may not be good for rendering; the cell processors don't have enough memory to store a whole frame, or even a big chunk of one.

This is actually an old supercomputer design trick. In the supercomputer world, it was not too successful; look up the nCube and the BBN Butterfly, both of which were a bunch of non-shared-memory machines tied to a control CPU. But the problem was that those machines were intended for heavy number-crunching on big problems, and those problems didn't break up well.

The closest machine architecturally to the "cell" processor is the Sony PS2. The PS2 is basically a rather slow general purpose CPU and two fast vector units. Initial programmer reaction to the PS2 was quite negative, and early games weren't very good. It took about two years before people figured out how to program the beast effectively. It was worth it because there were enough PS2s in the world to justify the programming headaches.

The small memory per cell processor is going to be a big hassle for rendering. GPUs today let the pixel processors get at the frame buffer, dealing with the latency problem by having lots of pixel processors. The PS2 has a GS unit which owns the frame buffer and does the per-pixel updates. It looks like the cell architecture must do all frame buffer operations in the main CPU, which will bottleneck the graphics pipeline. For the "cell" scheme to succeed in graphics, there's going to have to be some kind of pixel-level GPU bolted on somewhere.

It's not really clear what the "cell" processors are for. They're fine for audio processing, but seem to be overkill for that alone. The memory limitations make them underpowered for rendering. And they're a pain to program for more general applications. Multicore shared-memory multiprocessors with good caching look like a better bet.

Read the cell architecture manual.

Re:"cell" architecture is all about local memory by taracta · 2005-11-10 08:03 · Score: 2, Informative

I think too much emphasis is being placed on "slow" access to system memory for the CELL processor when it is "slow" only relative to access to the local memory of the SPUs. Please remember that system memory for the CELL is about 8 times faster than the memory in today's high end PCs, with lower latency. XDR is by far the best memory type available; unfortunately nobody likes RAMBUS the company. So please, when you are speaking about access to system memory, keep in mind that the CELL processor has about the same memory bandwidth as top of the line graphics cards and probably lower latency. Don't you wish your PC had the bandwidth of top of the line graphics cards?
Re:"cell" architecture is all about local memory by frostfreek · 2005-11-10 08:19 · Score: 2, Informative

> It's not really clear...

There was a Toshiba demo, showing 8 Cells; 6 used to decode forty-eight HDTV MPEG4 streams, simultaneously, 1 for scaling the results to display, and one left over. A spare, I guess?

This reminds me of the Texas Instruments 320C80 processor; 1 RISC general purpose cpu, plus four DSP-oriented CPUs. Each had an on-chip memory chunk. 4KB. 256KB would be fantastic, after the experience of programming for the C80. 256KB will be plenty of memory to work on a tile of framebuffer.

1. DMA tile -> local RAM
2. render to local...
3. ???
4. Profit!

Whoops, where was I going with that, again?

Not a PPC Processor by MJOverkill · 2005-11-10 08:36 · Score: 2, Informative

Once again, the cell is not a PPC processor. It is not PPC based. The cell going into the PlayStation 3 has a POWER based PPE (power processing element) that is used as a controller, not a main system processor. Releasing an SDK for Macs would not give any advantage over an x86 based SDK because you are still emulating another platform.

Wiki

Re:Not a PPC Processor by Wesley+Felter · 2005-11-10 13:30 · Score: 2, Informative

"Power Architecture" is PowerPC.

What is Power Architecture technology?

"Power Architecture is an umbrella term for the PowerPC® and POWER4(TM) and POWER5(TM) processors produced by IBM, as well as PowerPC processors from other suppliers."

The NVidia GPU in the PS3 by Animats · 2005-11-12 09:49 · Score: 2, Informative

That's not what Sony is saying:

SCEA press release:

SONY COMPUTER ENTERTAINMENT INC. AND NVIDIA ANNOUNCE JOINT GPU DEVELOPMENT FOR SCEI'S NEXT-GENERATION COMPUTER ENTERTAINMENT SYSTEM

TOKYO and SANTA CLARA, CA
DECEMBER 7, 2004
"Sony Computer Entertainment Inc. (SCEI) and NVIDIA Corporation (Nasdaq: NVDA) today announced that the companies have been collaborating on bringing advanced graphics technology and computer entertainment technology to SCEI's highly anticipated next-generation computer entertainment system. Both companies are jointly developing a custom graphics processing unit (GPU) incorporating NVIDIA's next-generation GeForce(TM) and SCEI's system solutions for next-generation computer entertainment systems featuring the Cell* processor".
