IBM to use Cell in Blade Servers

Where have I heard this before? by Jordan+Catalano · 2006-02-09 05:50 · Score: 3, Insightful

That could prove more challenging than usual because Cell's architecture is so different. IBM hopes this summer's release of the Cell-based servers kick-starts work by third-party programmers.

Deja vu?

Re:Where have I heard this before? by John+Whitley · 2006-02-09 06:58 · Score: 5, Informative

Deja vu?

Nice quip, but the realities of the situation are completely different. My take on EPIC nee IA-64 when it was first publicly announced was surprise at an architecture that actually encouraged ultra-complex processor control logic. This, when prevailing trends tended to find ways to manage or reduce that complexity, or at least provide unambiguous chip-compiler synergy. Put another way, Intel made design choices that made the hardware itself very challenging to build and properly synergize with a compiler to achieve high total performance. Intel had certainly shown their chops at this sort of high-complexity chip controller design in the x86 line, but the move still seemed brazen from an outsider's perspective. History now shows that they certainly had trouble going down that path...

Cell, however, is basically a bog-stock PowerPC with DSP engines at its disposal. Think Altivec/MMX/SSE type units on steroids. This approach provides computing power that isn't applicable to all tasks, but is generally proven to perform well for applications that require high performance mathematical processing. Incidentally, that's precisely the target market that IBM's stated they're after with Cell-based servers. Moreover, Cell's scalability model and hardware complexities are much more manageable.

To really leverage Cell's power from the software side will require some or all of 1) good compiler and toolchain support, 2) good library support, and 3) dedicated development effort for the specific application. IBM has the expertise and motivation to provide 1 and 2, and developers in the supercomputing world tend to get really good at 3. When your *highly optimized* supercomputer app may take on the order of a year to run, big emphasis tends to be put on making it run fast. Months of work to save years of time.

It still remains to be seen how this effort will play out in the marketplace, but variants of Cell's basic approach are working right now in many, many devices.
Re:Where have I heard this before? by geekoid · 2006-02-09 07:17 · Score: 1

Does that mean putting cell chips on a video card could enhance dynamic world creation?

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
Re:Where have I heard this before? by Anonymous Coward · 2006-02-09 07:35 · Score: 0

"but is generally proven to perform well for applications that require high performance mathematical processing"

Wow... imagine a beowulf of... oh dammit, I'll get my coat, then.
Re:Where have I heard this before? by Doctor+Memory · 2006-02-09 08:24 · Score: 1

Intel made design choices that made the hardware itself very challenging

You misspelled "HP". Itanium was originally going to be the new PA-RISC chip (originally named 'PA-WideWord'). HP approached Intel when it became apparent that they wouldn't produce the volume of chips to make it profitable to upgrade their fab (which they would have to do to produce a chip of Itanium's complexity). So, enter Intel ca. 1994. Sun produced a version of Solaris for the new chip, IBM and SCO played together nicely (along with Sequent) to produce a version of Unix they called Monterey, Compaq ported Tru64, and SGI actually went so far as to announce they were going to drop MIPS altogether to support it.

And now the wheel has turned, and we're in a brand new x86 world. Itanium was late, SGI forced another couple of generations out of MIPS, and AMD showed Intel how x86 was done. Sun dropped Solaris/Itanium in 2000, IBM and Dell have both dropped their Itanium lines, and Microsoft has announced support for Itanium in Longhorn.

--
Just junk food for thought...
Re:Where have I heard this before? by adisakp · 2006-02-09 10:10 · Score: 1

Cell, however, is basically a bog-stock PowerPC with DSP engines at its disposal.

Actually, the PowerPC Unit (PPU) in a Cell is a highly simplified, streamlined PowerPC and nothing at all like the PowerPCs you'll find in a G5 Mac. While it runs at a higher clock rate, it's missing lots of stuff like out-of-order execution and advanced branch prediction and has a much simpler load-store unit. For example, on Cell there are huge penalties for load-hit-store, but on current-gen PowerPCs there is a unit to forward stores to loads while they pend in the SIQ. If you expect code on a current generation PowerPC to behave exactly the same on a Cell PPU, you're in for a big surprise (and not in a good way).

Noteworthy Information by gasmonso · 2006-02-09 05:51 · Score: 4, Informative

Take a peek at http://www.research.ibm.com/cell/patents_and_publications.html to see the patents and whitepapers for cell technology. One interesting point is the Online Game Prototype white paper on there.

http://religiousfreaks.com/

Sun to use new chips by db32 · 2006-02-09 05:51 · Score: 5, Funny

Sun Microsystems has decided to include the Gohan chip to combat IBM's Cell chips.

--
The only change I can believe in is what I find in my couch cushions.

Re:Sun to use new chips by freshman_a · 2006-02-09 05:59 · Score: 1

But I thought Sun's new chip was named Nia...

Oh wait...

I get it! +1, Funny for you.

--
Slackware
Re:Sun to use new chips by db32 · 2006-02-09 06:10 · Score: 1

I was wondering how I got a +1 interesting on that.

--
The only change I can believe in is what I find in my couch cushions.
Re:Sun to use new chips by Anonymous Coward · 2006-02-09 06:29 · Score: 0

Surely the Dragonball would be a better choice? :)
Re:Sun to use new chips by cant_get_a_good_nick · 2006-02-09 06:44 · Score: 1

Surely the Dragonball would be a better choice? :)
Not sure if you were joking, but Dragonball was already used. Motorola uses Dragonball for its 68k embedded line. I never got a Dragonball Z sticker cool enough that I wanted to stick on my Palm IIIxe.
Re:Sun to use new chips by dascandy · 2006-02-09 07:14 · Score: 1

Intel counters with Majin Boo processors, AMD finishes with Goku processors.

How about a free optimizing compiler by mi · 2006-02-09 05:55 · Score: 5, Insightful

a free optimizing compiler, that takes advantage of the architecture, would do wonders...

It being command-line compatible with (or simply a back-end of) an existing compiler like gcc is even better.

Add a port of a good OS, and your platform is suddenly incredibly attractive to developers.

--
In Soviet Washington the swamp drains you.

Re:How about a free optimizing compiler by RingDev · 2006-02-09 05:58 · Score: 1

I'd even be happy with a CLR or VM (ala .Net or Java) that would take care of the compilation. If they could get the .Net framework to run on it, I can think of a few apps I would be up for redesigning to take advantage of the multi-threading advantages.

-Rick

--
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
Re:How about a free optimizing compiler by MaestroSartori · 2006-02-09 06:08 · Score: 2, Insightful

The main problem, I suspect, is that general purpose code just doesn't run very well on it. You really need to optimise for each application, to tune how you handle your data and what algorithms you use in order to feed the Cell properly if you want to get the most out of it.

--
Game dev and music blog
Re:How about a free optimizing compiler by ajs · 2006-02-09 06:21 · Score: 4, Insightful

And of course, optimizing in that way is probably analogous to the halting problem, but that doesn't mean that a good general-purpose back-end for GCC could not be written. History teaches us one thing about specialized hardware that we should never forget: the average user of your hardware is going to need to have VERY LITTLE of their code hand-tuned for it. For example, let's say that this hardware tends to be very good at encryption. Your average user would likely be running a Web server or some other sort of networking technology, and almost NONE of that code cares about the 10-100 hand-tuned routines in OpenSSL that you wrote for this platform.

Get a good compiler and general-purpose OS up and running fast (which, by the way, I'm sure IBM is doing), and you'll see many more people writing special-purpose code where they need it.
Re:How about a free optimizing compiler by Anonymous Coward · 2006-02-09 06:59 · Score: 2, Informative

There is a free GCC compiler for Cell. And Linux. And you can get a free simulator to run it on. All at http://www.ibm.com/developerworks/power/cell
Re:How about a free optimizing compiler by Anonymous Coward · 2006-02-09 07:17 · Score: 0

There is such a research project at IBM. http://domino.research.ibm.com/comm/research_projects.nsf/pages/cellcompiler.index.html
Re:How about a free optimizing compiler by Tune · 2006-02-09 07:45 · Score: 4, Informative

First, as others have already commented, a gcc backend is already available and Linux runs on Cell.

Second, optimizing compilers tend to optimize only small parts of linear code. Simply put, this comes down to filtering binaries and replacing inefficient code sequences with more efficient ones. Depending on the quality of the compiler core, this typically gains a few percent, occasionally some 25%, but that's nowhere near what Cell could offer, namely (theoretically) 800%.
The problem is refactoring the problem to run in
- small chunks,
- independently (parallel)
- and on a specialized processor.
A compiler can help only modestly with the last point. In any non-trivial case, this means reanalyzing the problem and reimplementing the solution from the start, making different tradeoffs. That is why people say Cell is difficult.
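
To make the three points above concrete, here is a minimal sketch in Python, not Cell code. It assumes a toy workload (a sum of squares) and uses a thread pool as a stand-in for the specialized processors; the names `split_into_chunks`, `process_chunk` and `parallel_sum_of_squares` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Cut the input into small pieces; the last piece may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Work on one chunk using only that chunk's data (toy kernel: sum of squares)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, chunk_size=4, workers=8):
    """Run the kernel on every chunk independently, then combine the partial results."""
    chunks = split_into_chunks(data, chunk_size)
    # The pool is only a stand-in for a bank of worker units (assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

The hard part the post describes is not this scaffolding but finding a decomposition where the chunks really are independent; the toy kernel here was chosen because it trivially is.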

IMHO, the benefits of code optimization will be close to irrelevant for almost any successful application on Cell over the coming years. And while Moore's law has provided us with bigger and faster hardware, we programmers are still mostly empty-handed when it comes to program translation for parallel architectures.

We need a paradigm shift, not an optimizing compiler.
Re:How about a free optimizing compiler by xero314 · 2006-02-09 07:53 · Score: 1

I think that you are correct that having an optimizing compiler of some sort available would increase the rate of adoption of cell processors by developers. On the other hand I find it to be a REALLY bad idea, if taken too far or done incorrectly. If you make it easy for developers to create bad software that is not written specifically for the cell architecture you will lose a lot of the benefits of the cell. People will write a few general purpose apps that are not better performing than their x86 equivalents and then Cell will be washed away (please refer to the perceived performance of the Emotion Engine). It may be better to figure out how to get developers to work with more direct access to the architecture, at say the assembly level. Though finding a modern developer that even knows what assembly is might be difficult (yes I know there are a few of us out there).
Re:How about a free optimizing compiler by linhux · 2006-02-09 08:49 · Score: 1

Also see the Barcelona Supercomputing Centre's Linux-On-Cell project.
Re:How about a free optimizing compiler by fitten · 2006-02-09 10:17 · Score: 1

With the help of a bunch of libraries, sure. Otherwise, you not only have a multiprocessor debugging situation but you have a heterogeneous multiprocessor debugging situation.

GCC will not automagically take your program and break it into parts that will run on the PPC core and the SPEs in parallel. I don't know of any that do a great job on a homogeneous multiprocessor system but there are some that try to do some parallelization (OpenMP enabled compilers for example). I don't know of any that will on any architecture that the Cell copied (the various DSPs from TI and other companies, for example).

Many programmers have never had to debug an application that is multithreaded, much less an application where it is multithreaded and the threads run on heterogeneous cores simultaneously. (I have done so, incidentally.)
Re:How about a free optimizing compiler by Anonymous Coward · 2006-02-09 14:28 · Score: 0

The problem is refactoring the problem to run in - small chunks, - independently (parallel) - and on a specialized processor.
Owww, recursive problem solving.
Re:How about a free optimizing compiler by Anonymous Coward · 2006-02-10 13:24 · Score: 0

http://sonicclang.ringdev.com/Levels.php

quaters -> quarters

Linux on Cell by morgan_greywolf · 2006-02-09 05:58 · Score: 4, Insightful

Considering they've already got Linux on Cell and a proposed model for making userland apps to take advantage of the SPUs, and have had these since last summer, I wouldn't be surprised if some open source code is already in the process of being ported.

Anyone know of any specific server apps?

--
My blog

Re:Linux on Cell by 80+85+83+83+89+33 · 2006-02-09 06:53 · Score: 1

i wonder if good ol' MS is going to make their next OS run on it. i know that is far fetched considering they wouldn't support the itanium. and speaking of compilers, if they don't provide the performance, well, the Cell will go the same route as the itanium.

--
i disable sigs
Re:Linux on Cell by Anonymous Coward · 2006-02-09 07:30 · Score: 0

Microsoft = Xbox 360 = IBM Power CPU = Cell.

Anyone want to speculate on the OS on Xbox360?
Re:Linux on Cell by Anonymous Coward · 2006-02-09 07:50 · Score: 0

Microsoft does support the Itanium. Win64/IA64 is only available as an OEM product, and only in Server variants. But then when did you last see an Itanium laptop or home PC?
Re:Linux on Cell by Anonymous Coward · 2006-02-09 18:38 · Score: 0

I'd speculate that you said rectangle = square.

cell cells itself by digitaldc · 2006-02-09 06:02 · Score: 3, Funny

Juhi Jotwani, IBM's director of Blade Center and xSeries solutions, holds the company's new Cell processor during a presentation yesterday in New York.

She said, "Come on, juh know jouwant it!"

--
He who knows best knows how little he knows. - Thomas Jefferson

Re:cell cells itself by Anonymous Coward · 2006-02-09 18:28 · Score: 0

It loves you long time

Sun has 'em beat by AKAImBatman · 2006-02-09 06:02 · Score: 4, Interesting

As I understand it, the various pipelines of the Cell chip tend to be more specialized than the Coolthreads technology Sun is using on their new T1 processor. However, even with 32 full-blown pipelines, Sun is also concerned about whether their chips will be put to good use or not.

I'm not quite sure what IBM is planning to do, but Sun has started a contest to see who can build the coolest program that takes advantage of their new Coolthreads technology. The prize is a cool $50,000, so Sun seems to be serious about this. The results of the contest may very well prove whether the new parallel technologies have a future or not.

--
Javascript + Nintendo DSi = DSiCade

Re:Sun has 'em beat by Anonymous Coward · 2006-02-09 06:16 · Score: 2, Insightful

The prize is a cool $50,000, so Sun seems to be serious about this.

If Sun were really serious, they'd put a $500,000 team on it to develop something themselves. Paying for 1/3 - 1/2 a man-year of development is not that serious.
Re:Sun has 'em beat by Zantetsuken · 2006-02-09 06:30 · Score: 2, Interesting

Especially when IBM's already laying the groundwork for Cell to be used in supercomputers (for seismic activity, nuclear warhead simulations, etc.) and for rendering 3D MRIs. Reportedly, current image rendering for this is done on Intel Pentium 4s and takes about 4 minutes; when they did the tech demo of it on a Cell platform, it took about 20 seconds.
Re:Sun has 'em beat by ArbitraryConstant · 2006-02-09 06:33 · Score: 3, Informative

"As I understand it, the various pipelines of the Cell chip tend to be more specialized than the Coolthreads technology Sun is using on their new T1 processor."

Yes. A Cell's SPUs are not PowerPC processors, so you can't run the same code on the PowerPC front end as you do on the SPUs. Not only that, but Cell and Niagara are designed for totally different things. Cell is designed for floating-point intensive apps with pretty poor general purpose capabilities, while a Niagara has 1 floating point unit shared between all 8 cores and 32 threads but they're all good at the branchy sort of thing servers usually run.

I think these Cell servers will be more useful for things like render farms. They'll be essentially useless as generic servers for web or database duty.

--
I rarely criticize things I don't care about.
Re:Sun has 'em beat by Anonymous Coward · 2006-02-09 06:42 · Score: 1, Informative

Most render farms these days spend most of their time crunching GI lighting and ambient occlusion. This is very parallel, but needs access to LOTS of memory. Unless each SPU can independently access the entire address space, rendering will be slower than on the PPC alone.
Re:Sun has 'em beat by Amouth · 2006-02-09 06:44 · Score: 1

yea.. that sounds great i would love to take a shot at it..

don't have anything to test it with.. wonder if they give free boxes to people who want to try and take a stab at it.

--
'...if only "Jumping to a Conclusion" was an event in the Olympics.'
Re:Sun has 'em beat by laffer1 · 2006-02-09 07:13 · Score: 1

That's great and all but I think the key here is price and availability. Intel took over the server market with their low end chips. Companies said they wanted cheap servers and lots of servers. IBM and Sun have a vested interest in the return of big iron but I don't know if companies want that. I'm curious to see what happens with the new Sun and IBM moves.

--
MidnightBSD: The BSD for Everyone
Re:Sun has 'em beat by Anonymous Coward · 2006-02-09 07:13 · Score: 1, Informative

As luck has it, each SPE can DMA main memory to its local store independently. The Cell has huge amounts of bandwidth available to it.
Re:Sun has 'em beat by Zantetsuken · 2006-02-09 08:02 · Score: 1

True that chips before designs like the Cell and Niagara chips with 8 or so cores became popular by being low cost before concentrating on horsepower, but the way I figure IBM reasons the Cell and Niagara chips will be relatively low cost with uber performance is the same reason for the advent of the beowulf cluster - you use relatively cheap chips or in the case of Cell, cores, and use a lot of em in parallel. (I say relatively cheap cores in the case of Cell because I'm not sure just how much horsepower/cost each core would have by themselves, but I'll bet what little of an ass I have they are cheaper cores than, say, a high end Opteron or Xeon dual core (especially if you want a dual core, dual chip server with each Opteron costing about 800 bucks a pop))
Re:Sun has 'em beat by Anonymous Coward · 2006-02-09 10:10 · Score: 0

Yeah, and it also requires a bit of manual programming to make it handle all that memory (pipeline it through or double/triple buffer, etc.) and it isn't something a compiler will really do for you. Your best bet is to get a library (that was hand written) that'll do it for you.
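
For anyone who hasn't seen the double-buffering trick mentioned above, here is a rough sketch in Python rather than real SPE code. Everything in it is an assumption made for illustration: `LOCAL_STORE_ITEMS` is a made-up buffer size, `fetch` stands in for a DMA transfer from main memory, and a single worker thread plays the DMA engine. The idea is simply to start fetching the next buffer before computing on the current one.

```python
from concurrent.futures import ThreadPoolExecutor

LOCAL_STORE_ITEMS = 4  # hypothetical capacity of one local buffer


def fetch(main_memory, start):
    """Stand-in for a DMA transfer: copy one buffer's worth out of main memory."""
    return main_memory[start:start + LOCAL_STORE_ITEMS]


def compute(buffer):
    """Toy kernel that only touches data already in the local buffer."""
    return sum(buffer)


def double_buffered_sum(main_memory):
    """Stream main memory through two buffers, overlapping fetch and compute."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, main_memory, 0)
        start = 0
        while start < len(main_memory):
            current = pending.result()  # wait for the transfer in flight
            start += LOCAL_STORE_ITEMS
            if start < len(main_memory):
                # Kick off the next transfer before computing on this buffer.
                pending = dma.submit(fetch, main_memory, start)
            total += compute(current)
    return total
```

Triple buffering follows the same pattern with one more transfer kept in flight; which depth pays off depends on the ratio of transfer time to compute time, which this sketch does not model.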

Who woulda thunk it? by d3ac0n · 2006-02-09 06:02 · Score: 3, Funny

Blades in Cells are usually a Bad Thing. Apparently Cells in Blades are a good thing! Go figure...

--
Official Heretic from the "Church of Global Warming". Proven right thanks to whistle blowers. AGW = Flat Earth Theory

Your organs are specialized, too. by Orrin+Bloquy · 2006-02-09 06:07 · Score: 5, Interesting

It's a hell of a paradigm shift for programmers to go from writing code that targets one CPU to code that deliberately splinters tasks across a bank of specialized processors.

It's fun to bash the Cell as a general purpose CPU when no one has actually suggested it's designed for that.

All of the above being true, it remains to be seen what gains IBM's POWER/Cell system actually offers above present architectures -- RISC was the next big thing, too, until Intel internalized part of it into the x86 architecture.

Flyover landscape graphics demos are a shopworn rabbit pulled out of a threadbare hat: convert fractals into craggy vertical displacements with extremely primitive lighting/mapping. Show me an architecture that can *realtime* render Incredibles-caliber cloth/hair simulations and I'll get a hard-on while ATI and nVidia executives slit their wrists.

--
"Made up/misattributed quote that makes me look smart. I am on /. and I must look smart."

Re:Your organs are specialized, too. by ivan256 · 2006-02-09 06:27 · Score: 1

It's a hell of a paradigm shift for programmers to go from writing code that targets one CPU to code that deliberately splinters tasks across a bank of specialized processors.

You mean specialized processors like FPUs, 3d audio accelerators, 3d video accelerators (and the sub-processing units contained in video accelerators), encryption and TCP offload engines, WinModems, MPEG encoder/decoders, and platform management controllers?

Yeah, they'll have a real hard time adjusting... In 1982.
Re:Your organs are specialized, too. by SirTalon42 · 2006-02-09 06:29 · Score: 1

"WinModems"

WinModems made the processor do the real work, they were the cheap crappy ones that sucked as a modem.
Re:Your organs are specialized, too. by ivan256 · 2006-02-09 06:36 · Score: 2, Informative

You used them as programmable DSPs. The CPU couldn't actually do the hard work fast enough... The chip did the 'hard' work, and they just made the CPU do more work than a full modem.
Re:Your organs are specialized, too. by Doctor+Memory · 2006-02-09 07:31 · Score: 1

It's a hell of a paradigm shift for programmers to go from writing code that targets one CPU to code that deliberately splinters tasks across a bank of specialized processors.

Not really, developers have been using co-processors for years -- numeric (a la Weitek or 8087), DSP, odd-wad AI and "dataflow" boxes. And I imagine the early attempts will follow a similar pattern: present the functionality of the co-pro wrapped neatly in a library, then just call the library routines. Presto, your code is automatically vectored to the other unit, and you can either wait synchronously or specify either a callback routine or an event flag to set when the call completes.

Of course, as the techniques get explored in more depth, and developers start to chafe at the restrictions of the library model, then people will start writing in-line code, or writing code to be loaded directly onto the specialized processors and communicating with the main-line code via specific protocols. Frankly, it's more of the same-old, same-old, hardly a "paradigm shift" -- unless you haven't seen it before.

--
Just junk food for thought...
Re:Your organs are specialized, too. by realbadjuju · 2006-02-09 10:18 · Score: 1

Flyover landscape graphics demos are a shopworn rabbit pulled out of a threadbare hat: convert fractals into craggy vertical displacements with extremely primitive lighting/mapping. Show me an architecture that can *realtime* render Incredibles-caliber cloth/hair simulations and I'll get a hard-on while ATI and nVidia executives slit their wrists.

Ha! Reminds me of when, during the summer of '96, I had a job at NIST Boulder and the "SGI Bus" stopped by. It was a tour bus demoing lots of expensive hardware. They had an 8 processor Onyx doing the whole realtime-rendered-flight thing. I made a comment comparing the demo to Microsoft Flight. Man, did that guy give me a dirty look.
Re:Your organs are specialized, too. by fitten · 2006-02-09 10:37 · Score: 1

I'm still waiting on my PS2 to give me real-time "Toy Story" quality rendered video like Sony promised when the PS2 was released :(
Re:Your organs are specialized, too. by kadathseeker · 2006-02-09 11:46 · Score: 1

I'll get a hard-on while ATI and nVidia executives slit their wrists.

I do NOT want to see your browser history and cache. Pervert. Freak. Sicko. I need to wash myself now...

--
The 'Net is a waste of time, and that's exactly what's right about it. - William Gibson
Re:Your organs are specialized, too. by Orrin+Bloquy · 2006-02-10 10:19 · Score: 1

I stand corrected on the issue of familiarity with the principle of vectorization.

I remain unimpressed with landscape simulations as a demonstration of processing power. Hair and cloth simulations are much, much harder to pull off realtime in single-processor systems, and a demonstration of how Cell processors manage strand-based hair and draping cloth AND pass that data back to a POWER chip managing the animation of the base figure wearing that hair/clothes would be an instant sale to me of a cohesive, distributed system. Those are the effects in CGI which even console game developers won't touch, and they've got locked down systems with full video card APIs.

--
"Made up/misattributed quote that makes me look smart. I am on /. and I must look smart."
Re:Your organs are specialized, too. by ivan256 · 2006-02-10 17:03 · Score: 1

Hair and cloth simulations are much, much harder to pull off realtime in single-processor systems

You don't see them, not so much because they're hard (they *are* hard), but because they're made even harder by the shortcuts taken by the typical 2D accelerator. A lot of a scene you see in "real time" on modern video hardware is pre-rendered in the form of texture maps and bump maps. Rather than shift paradigm, we have iteratively increased the number of transformations per frame that can be done on these pre-rendered surfaces, but that means graphics processors are becoming more and more specialized for a task that is basically incompatible with realistic fabric and hair rendering.
Re:Your organs are specialized, too. by ivan256 · 2006-02-10 17:05 · Score: 1

Err... That should have said "3D accelerator", of course.

hardware abstraction? by slackaddict · 2006-02-09 06:07 · Score: 1

Would it be possible to write some kind of virtualization that would present an easy-to-develop-on layer? Besides, if you already have Linux that runs on this platform and compilers written, how would it be any harder than developing for any other platform? A rose by any other name...

--
ConsultingFair.com

Re:hardware abstraction? by AKAImBatman · 2006-02-09 06:17 · Score: 0, Offtopic

Would it be possible to write some kind of virtualization that would present an easy-to-develop-on layer?

You know, that's a really good idea!

--
Javascript + Nintendo DSi = DSiCade
Re:hardware abstraction? by SirTalon42 · 2006-02-09 06:21 · Score: 1

You can act (pretty much) like it's a Power processor and your apps will run 'fine' on it. But if you want the REAL power (no pun intended) of the Cell, you hand optimize (and design) your program for the cell.
Re:hardware abstraction? by ivan256 · 2006-02-09 06:47 · Score: 2, Insightful

you hand optimize (and design) your program for the cell.

Every parallel architecture I've ever programmed for had nice APIs for offloading and directing tasks to the various available processing units. There shouldn't be much 'hand-optimization' involved in the sense you're implying.

Developers who write code that takes advantage of GPUs in modern gaming PCs are already familiar with this style of programming, and the ones that understand the architecture instead of memorizing the APIs or programming out of a cookbook should have no trouble adapting.
Re:hardware abstraction? by 2megs · 2006-02-09 07:15 · Score: 3, Informative

Developers who write code that takes advantage of GPUs in modern gaming PCs are already familiar with this style of programming,

But you can probably count on your fingers the number of developers who are using GPUs for anything other than rendering pixels, or at most some simple vectorizable simulations like water or cloth.

Taking an arbitrary program and turning it into something that would run well on a GPU (or a Cell SPU) usually requires a significant redesign of the algorithms and data structures as compared to what you would naively and straightforwardly do in C...or it won't get anywhere near peak performance and may even run slower. It's certainly possible to do, but you won't be re-using any of that originally written code, and it's a different way of thinking from what 95% of programmers are used to. I'm speaking from experience as someone who earns his living by being in the remaining 5%. :)

As the original poster said: you hand optimize (and design) your program for the cell.
Re:hardware abstraction? by ivan256 · 2006-02-09 07:59 · Score: 1

But you can probably count on your fingers the number of developers who are using GPUs for anything other than rendering pixels,

And for good reason. GPUs are designed to render pixels, not do other stuff.

Taking an arbitrary program and turning it into something that would run well on a GPU (or a Cell SPU)

I don't understand why you think I'm saying that those two things are equivalent. Taking an arbitrary program and turning it into something that would run well on a GPU would be unusual. You're talking about running code written for a single purpose on a processor designed for a different purpose. Hopefully the specialized processor you're getting your task to run on was actually designed to run that particular type of task you're trying to run.

Programming your Foo SPU to run your Foo task shouldn't be any harder than programming your pixel shader to shade pixels.
Re:hardware abstraction? by the_humeister · 2006-02-09 14:28 · Score: 1

These guys use your computer's GPU to process audio. The GPUs on modern video cards are basically massive SIMD engines. It should be theoretically possible to write an OS that runs on one of these things.

Re:Increasingly Popular?! by taskforce · 2006-02-09 06:08 · Score: 1

Nope, Cell isn't increasingly popular, but the summary referred to them being put "inside its increasingly popular low-power blade servers."

--
My 3D Texturing Skinning work (under construction)

PS3 release date? by nutshell42 · 2006-02-09 06:09 · Score: 4, Insightful

This probably means that the PS3 will either actually make its "spring" release or that it is hampered by problems with the Blu-Ray drives/disks instead of a Cell shortage because otherwise I couldn't imagine that Sony would allow IBM to use even one Cell for something that's not a PS3 for the first 3 months.

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

Re:PS3 release date? by MBCook · 2006-02-09 06:46 · Score: 1

I read a post somewhere (Kotaku, Gizmodo, Joystiq, or somewhere else) that quoted a Sony/IBM official as saying that yields on the Cell chips were doing very well now and that they got the yields up to the level they are now (whatever that is) MUCH faster than with previous new chips. If that's true, then there may not be any shortage problem with the PS3, at least not from the Cell.
There is always the chance that the RAM, GPU, Blu-Ray drive, or something else would end up in short supply.

--
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
Re:PS3 release date? by SetupWeasel · 2006-02-09 07:00 · Score: 2

There is always the chance that the RAM, GPU, Blu-Ray drive, or something else would end up in short supply.

My guess is finished software.
Re:PS3 release date? by fitten · 2006-02-09 10:20 · Score: 1

Don't tell Mercury that!
Re:PS3 release date? by nutshell42 · 2006-02-09 11:18 · Score: 1

I wonder what that means in absolute numbers and whether they're thinking about using Cells with all 8 SPUs activated for the PS3.

--
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage

Re:Increasingly Popular?! by psbrogna · 2006-02-09 06:17 · Score: 4, Funny

Marketing people are quick to describe the 0 to >0 transition as "increasingly popular". Of course the rest of the world considers it statistical noise. : )

Exciting by alta · 2006-02-09 06:18 · Score: 1

So this means I'll be able to take my PS3 and slide it into my IBM Blade chassis when I need more CPU. When I'm done, I pull it out and play.

--
Do not meddle in the affairs of sysadmins, for they are subtle, and quick to anger.

Re:Exciting by Anonymous Coward · 2006-02-09 06:23 · Score: 5, Funny

When I'm done, I pull it out and play.

i didn't need to know that.
Re:Exciting by simpl3x · 2006-02-09 15:06 · Score: 1

The opposite comes to mind... Being able to take advantage of a bunch of Cells to support the mobile device. Imagine a lan party in a server farm.

I work in blade development. by Thaidog · 2006-02-09 06:19 · Score: 5, Informative

We've had blades with Cell CPUs on them for quite a while. They're a lot different than any other architecture... resembling the pSeries layout more so than others. One thing I don't like about the prototypes is that the Cell CPUs along with the BGA memory they use are fused directly to the logic board. There were a few pictures released to the public about a year ago on the Register but I cannot find them now. Other than that they are seriously fast and very clusterable.

--

||| I still can't believe Parkay's not butter.

Re:I work in blade development. by ivan256 · 2006-02-09 06:40 · Score: 1

the Cell CPUs along with the BGA memory they use are fused directly to the logic board

That's not uncommon for Pentium blades either. The socket increases the width, and that space is better used for cooling.
Re:I work in blade development. by Stormwatch · 2006-02-09 06:43 · Score: 1

they are seriously fast and very clusterable.
Just imagine... a beowu-*SHOT*

--
Circumcision is child abuse.
Re:I work in blade development. by fitten · 2006-02-09 10:35 · Score: 2, Interesting

They're a lot different than any other architecture...

Actually, they are similar to a number of DSPs and other discrete solutions from the past. For example:

The TMS 320DM64x series of DSP from TI which has an ARM9 and a number of DSPs on it.

The TMS 320DM54x and 55x series of DSP from TI which has an ARM7 and a number of DSPs on it.

And a discrete version in the CSPI MAP 1310/11 which had a PPC and multiple multi-core DSP chips on it as early as 1997.

Smaller blade chassis? by killtherat · 2006-02-09 06:19 · Score: 2, Informative

IBM has opened the spec for their blade chassis design. Does anybody know if somebody is trying to make a 'desktop' blade chassis? Rather than buying a huge box that holds 14 blades, something that might only hold two.
This doesn't mean make a desktop out of a blade, because as I understand it, so far the JS20s (IBM's PPC 970 blade) don't even have video cards. You have to set them up over the serial port, and run them over the network.
But does anybody have a development sized unit you don't need a server rack and new power circuits for?

Re:Smaller blade chassis? by ivan256 · 2006-02-09 06:43 · Score: 2, Informative

Portable development units come mounted on their side in a 19" enclosure with a handle on top, semi-attractive looking trim pieces, and appropriate power supplies and cooling on the inside. They cost about three times what you'd pay for a standard rackmount production model.
Re:Smaller blade chassis? by TopSpin · 2006-02-09 07:18 · Score: 1

Interesting question. I've been lurking around blade platforms lately and I'm not happy. Why the proprietary chassis? Why the internal storage?

My ideal "blade" system mounts in standard 19" racks. 2 or 3 complete systems in 1U, 48VDC powered from another 1U transformer. Let me stack 1-8 of these in my rack and don't make me pay for the damn IBM/HP/etc chassis.

My system also has no storage inside the blades. Just give me 4 network interfaces per "blade", with at least 2 optionally capable of providing ISCSI TOE/HBA or 2/4 gigabit FC. No SATA/SCSI bus hardware to pay for, cool, power or otherwise. No CD/DVD/Floppy nonsense either. If I briefly need removable storage I'll use USB, thank you very much.

This doesn't exist as far as I can tell. If someone knows better, please chime in.

--
Lurking at the bottom of the gravity well, getting old
Re:Smaller blade chassis? by killtherat · 2006-02-09 08:12 · Score: 1

They cost about three times what you'd pay for a standard rackmount production model.

Well, that's not useful...
I guess it's time to start a blade chassis case mod ;-) Or maybe mod an ATX case to handle a blade.
Re:Smaller blade chassis? by Anonymous Coward · 2006-02-09 08:40 · Score: 1, Insightful

"My system also has no storage inside the blades. Just give me 4 network interfaces per "blade", with at least 2 optionally capable of providing ISCSI TOE/HBA or 2/4 gigabit FC. No SATA/SCSI bus hardware to pay for, cool, power or otherwise. No CD/DVD/Floppy nonsense either. If I briefly need removable storage I'll use USB, thank you very much."

The IBM blades allow you to do this. They have up to two internal drives, but one can be replaced with a daughter card that provides either two more ethernet ports or a card for two FC ports. Their DVD and Floppy drives are USB devices which can be switched to connect to any of the 14 blades in the chassis.
Re:Smaller blade chassis? by ivan256 · 2006-02-09 09:17 · Score: 1

Travel cases for rackmount musical equipment make nice shells for your single piece of otherwise rackmount-only computer equipment.

Re:Increasingly Popular?! by JackL · 2006-02-09 06:21 · Score: 1

Got a little hasty there. Whoops.

Thanks.

Big Difference Between Itanium and Cell by raftpeople · 2006-02-09 06:23 · Score: 1

Itanium may perform better for some number crunching apps, but not enough to outweigh the costs, generally.

The Cell processor, on the other hand, offers such a giant increase in performance (for some applications) that you will see people investing time and money to take advantage of it. In addition, with Toshiba, Sony and IBM all with product plans and thus the related volume and eco-system surrounding development tools, etc., I think the Cell is positioned far better than Itanium to succeed.

Re:Big Difference Between Itanium and Cell by Skowronek · 2006-02-09 06:52 · Score: 4, Funny

Itanium offers such a giant increase in performance (for some applications) compared to rival RISC products that you will see people investing time and money to take advantage of it. In addition, with Intel, SGI and HP all with product plans and thus the related volume and eco-system surrounding development tools, etc., I think the Itanium is positioned far better than Alpha to succeed.

D'oh.
Re:Big Difference Between Itanium and Cell by ShadowFlyP · 2006-02-09 06:53 · Score: 5, Informative

Actually, the bigger difference is in how the architecture changed. Cell processor is more along the lines of multi-core DSPs. The instruction set is different than general computing cores and there are many of them. The key is that these cores are disjoint. You can run one application on one core and another application on another core.

The Itanium is different than this in that it required instructions to be passed to the CPU as "bundles". Any of the instructions in a bundle could be executed in any order, but these instructions were all from the same application. Thus, in order to extract speed from the Itanium, the compiler was forced to extract parallelism from within functions. This is very difficult since most programming is fairly sequential. The Cell, on the other hand, allows you to execute different tasks and so puts this control back on the programmer instead of extra work for the compiler.
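To make the "extract parallelism from within functions" point concrete, here is a minimal sketch of the kind of analysis a compiler has to do: find which instructions do not depend on each other, so they could be grouped and issued together. It is a toy, not a model of real IA-64 bundle formats. The names build_deps and bundle_levels and the four-instruction program are made up for illustration, and it assumes every result name is written only once.

```python
def build_deps(program):
    """For each instruction, find the earlier instructions whose results it reads."""
    last_writer = {}   # result name -> index of the instruction that produces it
    deps = []
    for idx, (dest, srcs) in enumerate(program):
        deps.append({last_writer[r] for r in srcs if r in last_writer})
        last_writer[dest] = idx
    return deps

def bundle_levels(program):
    """Group instructions so that everything in one group is mutually independent."""
    deps = build_deps(program)
    level = []
    for idx in range(len(program)):
        # An instruction sits one level after the latest thing it waits for.
        level.append(1 + max((level[d] for d in deps[idx]), default=-1))
    bundles = {}
    for idx, lv in enumerate(level):
        bundles.setdefault(lv, []).append(program[idx][0])
    return [bundles[lv] for lv in sorted(bundles)]

# Hypothetical straight-line code: (result, operands)
program = [
    ("t1", ("a", "b")),
    ("t2", ("c", "d")),
    ("t3", ("t1", "t2")),
    ("t4", ("t3", "a")),
]

if __name__ == "__main__":
    print(bundle_levels(program))
```

Only t1 and t2 can go together; t3 and t4 each wait on the previous result, which is the "most programming is fairly sequential" problem in miniature.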

Itanium was (is) a great idea from compiler theory perspective, but doesn't work out all that well (yet) in the real world.
Re:Big Difference Between Itanium and Cell by DoofusOfDeath · 2006-02-09 08:39 · Score: 2, Informative

You're close to correct. The Cell processor does have a bunch of cores that are basically DSPs (no virtual memory, etc.) BUT there's also another core that's basically a full-blown Power processor. That core is meant to rule the others.

So while you do still have to program differently for a cell with 8+1 cores than you would for a computer with 9 Power processors, it's still not like being stuck with just 9 DSPs.
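A rough sketch of that "one core rules the others" arrangement, with ordinary Python threads standing in for the eight DSP-like cores. Nothing here is Cell-specific: NUM_WORKERS, worker_kernel and controller are made-up names, and the thread pool is only an analogy for a general-purpose core handing chunks of work to the others.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8  # stands in for the eight DSP-like cores (assumption for illustration)

def worker_kernel(chunk):
    """Number-crunching kernel that a worker runs on its own slice of the data."""
    return sum(x * x for x in chunk)

def controller(data, num_workers=NUM_WORKERS):
    """Plays the general-purpose core: split the work, hand it out, combine results."""
    size = max(1, -(-len(data) // num_workers))  # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_kernel, chunks))
    return sum(partials)

if __name__ == "__main__":
    print(controller(list(range(1000))))
```

The point of the shape is that the splitting and combining logic lives in one place, while the kernel itself stays small and self-contained.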
Re:Big Difference Between Itanium and Cell by networkBoy · 2006-02-09 09:27 · Score: 2, Funny

Parent not funny, it's insightful!
-nB

--
whois gawk date unzip strip find touch finger mount join nice man top fsck grep eject more yes exit umount sleep dump
Re:Big Difference Between Itanium and Cell by ChrisA90278 · 2006-02-09 12:16 · Score: 1

"...Thus, in order to extract speed from the Itanium, the compiler was forced to extract parallelism from within functions. This is very difficult since most programming is fairly sequential."

No, it's not that hard. The CDC 6600 "super computer" built in the 1960s accepted "bundled" instructions too and required a compiler or human programmer to take advantage of the 6600's parallelism. The old FORTRAN compiler could many times beat out an experienced assembly language programmer. It was not really that hard. You'd just fetch "A" from RAM, then increment "B" and add it to "C", and by then the "fetch A" instruction would be done and you could issue the "add A to C" instruction. The compiler used a kind of tree search algorithm to find the best order of instructions and kept a dependency graph. It's really a "classic" problem that's been well studied. What we are seeing now is not really so many new ideas but just lower costs. The 40-year-old tricks are now showing up in low-cost processors.
Re:Big Difference Between Itanium and Cell by rocketpig · 2006-02-09 12:36 · Score: 2, Funny

One Cell to rule them all, One Cell to find them, One Cell to bring them all and in the darkness bind them.
Re:Big Difference Between Itanium and Cell by spuzzzzzzz · 2006-02-09 12:57 · Score: 1

The old FORTRAN compiler could many times beat out an experienced assembly language programmer.
Yes, and I think that FORTRAN code performs quite well on Itanium. The problem is that C code, with its almost unrestricted use of pointers, doesn't lend itself easily to that sort of optimisation. If you have a chunk of code with lots of pointer references, the compiler will need to make some pretty big deductions on where those pointers could be pointing before it can hope to parallelise anything.
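A small illustration of why "where those pointers could be pointing" matters, written in Python with lists standing in for C arrays. The two function names are made up. Both functions do the same thing when the destination and source are different arrays, but they disagree when the two arguments are the same array, so a compiler that cannot rule that case out has to keep the slow sequential order.

```python
def add_prev_sequential(dst, src):
    """Element-by-element loop, executed strictly in program order."""
    for i in range(1, len(dst)):
        dst[i] = src[i - 1] + 1

def add_prev_parallel(dst, src):
    """What a parallelised version effectively does: read all inputs, then write."""
    new_values = [src[i - 1] + 1 for i in range(1, len(dst))]
    dst[1:] = new_values

if __name__ == "__main__":
    # Separate arrays: both versions agree.
    a, b = [0, 0, 0, 0], [0, 0, 0, 0]
    add_prev_sequential(a, [10, 20, 30, 40])
    add_prev_parallel(b, [10, 20, 30, 40])
    print(a, b)

    # Same array passed twice (aliasing): the results differ.
    x, y = [0, 0, 0, 0], [0, 0, 0, 0]
    add_prev_sequential(x, x)
    add_prev_parallel(y, y)
    print(x, y)
```

In the aliased case each iteration of the sequential loop reads what the previous iteration just wrote, which is exactly the dependency the compiler has to prove absent before reordering.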

--

Don't you hate meta-sigs?

Re:Increasingly Popular?! by Anonymous Coward · 2006-02-09 06:23 · Score: 0

No problem, I actually read it like you read it when I re-read the summary before I submitted it and I was like: "Wow did I really say that?"

Bad timing on CoolThreads contest by Anonymous Coward · 2006-02-09 06:34 · Score: 0

Unfortunately I had just gotten rid of my SB100 since there was nothing I couldn't do on my Linux x86 boxes instead and I needed the space. I could have used the $50k though. :)

This does raise a point about the importance of maintaining a presence in the PC market since that's where most of the programmers are. Especially for applications that exploit specific hardware features. I have an open source project with some portable APIs which have completely different algorithms for the implementations on sparc, x86, and ppc. In the future, it's only going to be x86 specific implementations which will *not* be portable to other architectures most likely.

Azureus or another BitTorrent program by jhines · 2006-02-09 06:41 · Score: 1

Then all that is needed is a honking big web connection, and something that can be legally downloaded for a while. Seed a couple of thousand torrents, and let the world at it.

Wow by Nom+du+Keyboard · 2006-02-09 06:42 · Score: 1

Wow. Program for a while, then take a break playing the latest PS3 game -- all without leaving the confines of your own terminal into the system.

--
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."

Cell and T1 not targetting the same space by raftpeople · 2006-02-09 06:43 · Score: 2, Informative

Sun's new processor is designed for many-connection business server applications. Web stuff.

The Cell is designed for image processing and other high-volume number crunching.

The design decisions both companies made were heavily influenced by their target markets for these specific processors, and those target markets are very different.

These are apples and oranges.

Re:Sun to use new chips: DragonBall by Tolookah · 2006-02-09 06:44 · Score: 2, Informative

They can't use that name, freescale (motorola) already has it, and killed the line
http://www.freescale.com/webapp/sps/site/taxonomy.jsp?nodeId=0162468rH3YTLCvL2v

if you knew this, then fwoosh went the joke over my head

Good point. Unfortunately ... by vlad_petric · 2006-02-09 06:46 · Score: 4, Interesting

It's *very* difficult to get a compiler to exploit this kind of parallelism. Unless you're doing scientific Fortran loopy code, where it's much easier to do things like automatic vectorization/parallelization, it's basically almost impossible for the compiler (out of curiosity, try to use the automatic OpenMP parallelization feature within Intel C Compiler on standard C/C++ code; the results will likely underwhelm you). Unfortunately, even if you do have scientific code, the slave processing units only do single precision (IIRC).

In my opinion, this thing will run games well, but that's about it. I've seen so far 2 presentations by IBM about the Cell processor (at (micro-)architecture conferences). Both times, the question on everybody's mind was "How do you program these things?". The answer was pretty much a hand-wavy "oh hmmm, well, blah blah blah manual".

--

The Raven

Re:Good point. Unfortunately ... by Salis · 2006-02-09 12:58 · Score: 1

The SPEs can do double precision, but at half the flops.

--
Favorite /. tagline: "On the eighth day, God created FORTRAN." And it was good.
Re:Good point. Unfortunately ... by be-fan · 2006-02-09 16:13 · Score: 1

Actually, 10th the FLOPS. Which makes Cell very unimpressive compared to a dual core Opteron for real scientific computations.

--
A deep unwavering belief is a sure sign you're missing something...
Re:Good point. Unfortunately ... by Samrobb · 2006-02-09 16:20 · Score: 1

In my opinion, this thing will run games well, but that's about it.

Two words, one algorithm: MapReduce.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
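For anyone who hasn't seen the style, here is a minimal single-machine sketch of the map / group / reduce shape the quote describes, using word counting as the job. This is not Google's API and it does no distribution or fault handling at all; map_phase, shuffle, reduce_phase and word_count are made-up names. It only shows why the style is easy to spread out: each mapper call and each per-key reduction is independent of the others.

```python
from collections import defaultdict
from functools import reduce

def map_phase(records, mapper):
    """Apply the mapper to every record; each call is independent of the others."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Fold each key's values down to one result; keys are independent of each other."""
    return {key: reduce(reducer, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield (word, 1)

def add(a, b):
    return a + b

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), add)

if __name__ == "__main__":
    print(word_count(["cell blade cell", "blade server"]))
```

Whether this maps well onto SPEs with small local memories is a separate question; the sketch only shows the programming model.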

--
"Great men are not always wise: neither do the aged understand judgement." Job 32:9
Re:Good point. Unfortunately ... by Salis · 2006-02-09 16:50 · Score: 1

Hmm. Why is that? I do scientific computing, but my knowledge of chip design is iffy.

--
Favorite /. tagline: "On the eighth day, God created FORTRAN." And it was good.
Re:Good point. Unfortunately ... by be-fan · 2006-02-10 07:49 · Score: 2, Informative

Cell's peak theoretical performance is 25 gigaflops, derived by taking the product of the clockspeed (3.2 GHz), and the number of operations per cycle (8). In reality, this figure is highly optimistic. Each SPE only has a single floating-point pipeline. The 8 operations/cycle figure is derived by counting a 4-element single-precision multiply-accumulate as 8 total operations. Moreover, when doing double-precision operations, it takes an additional 5x speed hit, since they must be performed in multiple clock cycles. That results in Cell's theoretical performance for double-precision code being a total of 10x lower (according to IBM), or around 2.5 gigaflops per SPE. At 10 gigaflops per chip, that's still relatively impressive, compared to the 5 gigaflops per chip a dual-core Opteron (2.4 GHz) can handle, but the actual performance of a Cell chip is going to be a lot less than the actual performance of the Opteron.

Understanding why requires a bit of understanding of chip-design, but the basics are simple. The Cell SPE basically has four things working against it:

1) No dynamic branch prediction. This means that when the Cell SPE encounters a branch instruction, it will always assume the backwards branch is taken. This works fine for loops, where it's good to assume that the branch at the end of the loop will jump back to the beginning of the loop, but doesn't work well for anything else. If the guess is wrong, then the CPU pays an 18 cycle penalty while the pipeline is flushed and the correct branch path is followed. The Opteron, on the other hand, keeps track of the history of each branch. It can then make a much better guess about which way the branch will go, and avoid paying a penalty for guessing wrong. Since the Opteron's pipeline is shorter, this also means the penalty for an incorrect guess is much less (around 12 cycles). The net result of all this is that if your code has lots of short loops (static branch prediction always mispredicts the iteration that exits the loop), or a lot of complex control flow, Cell's SPEs are going to lose a lot of their theoretical performance since many cycles will be wasted on mispredicted branches.

2) Very high latency for instructions and loads. In Cell, the floating-point latency is at least six cycles, and the load latency from the local store is at least 6 cycles. For Opteron, it's 4 cycles and 3 cycles, respectively. Basically, the instruction latency tells you by how many clock cycles you must separate dependent operations. Eg: on an Opteron, you can issue a memory load, and assuming an L1 cache hit, you can issue an instruction that uses the loaded register 3 cycles later. If you have no instructions you can issue until that load is completed, then you just issue nothing that cycle and lose some of your potential throughput. Since the SPE's latencies are much higher, there is a much higher chance that you won't have any non-dependent instructions to issue on a given cycle, and must waste that cycle.

3) A very specialized memory model. Cell's SPEs can only directly address 256KB of local memory. If you have data bigger than that, you have to manually shuffle it in and out of that local memory. The latency for doing this shuffling is extremely high on Cell. This means that in code that accesses big data sets, if you can't effectively partition your data sets, you'll waste a lot of time shuffling things in and out of memory.

4) No out-of-order execution. Modern CPUs like an Opteron will rearrange your instructions to get around the instruction latencies I mentioned earlier. It'll look ahead in the code stream a couple of dozen instructions to find non-dependent ones that can be issued while waiting for other ones to finish. Cell won't do that. If you have an ADD in your code, and then right after you have a MUL that uses the results of the ADD, then Cell will merrily wait 6 cycles waiting for the ADD to finish, even if right after the MUL you have another ADD that doesn't need the results of the first one. This places a lot of burden on the c
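Points 2 and 4 can be put into numbers with a toy cycle counter. This is a deliberately simplified model, not a description of real hardware: it issues at most one instruction per cycle, gives every operation the 6-cycle latency quoted above, and assumes each result name is written once. Instr, in_order_cycles and out_of_order_cycles are made-up names.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "dest srcs latency")

def in_order_cycles(program):
    """In-order issue: a stalled instruction blocks everything behind it."""
    ready_at = {}     # result name -> cycle at which it becomes usable
    next_issue = 0
    finish = 0
    for ins in program:
        start = max([next_issue] + [ready_at.get(r, 0) for r in ins.srcs])
        ready_at[ins.dest] = start + ins.latency
        finish = max(finish, ready_at[ins.dest])
        next_issue = start + 1
    return finish

def out_of_order_cycles(program, window=24):
    """Each cycle, issue the oldest instruction in the window whose inputs are ready."""
    produced = {ins.dest for ins in program}
    ready_at = {}
    pending = list(program)
    cycle = finish = 0
    while pending:
        for ins in pending[:window]:
            inputs_ready = all(
                r not in produced or ready_at.get(r, float("inf")) <= cycle
                for r in ins.srcs
            )
            if inputs_ready:
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                pending.remove(ins)
                break
        cycle += 1
    return finish

LAT = 6  # latency figure taken from the post; applied to every operation here
chain_a = [Instr("a1", ("x", "y"), LAT), Instr("a2", ("a1", "x"), LAT), Instr("a3", ("a2", "y"), LAT)]
chain_b = [Instr("b1", ("x", "y"), LAT), Instr("b2", ("b1", "x"), LAT), Instr("b3", ("b2", "y"), LAT)]
program = chain_a + chain_b                                    # naive program order
interleaved = [i for pair in zip(chain_a, chain_b) for i in pair]  # reordered by hand

if __name__ == "__main__":
    print("in-order, naive order:      ", in_order_cycles(program))
    print("out-of-order, naive order:  ", out_of_order_cycles(program))
    print("in-order, interleaved order:", in_order_cycles(interleaved))
```

In this model the naive order takes 31 cycles in-order and 19 with reordering hardware, and the hand-interleaved order gets the in-order machine down to 19 as well. That is the trade: without the hardware, someone has to do the interleaving ahead of time.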

--
A deep unwavering belief is a sure sign you're missing something...
Re:Good point. Unfortunately ... by Salis · 2006-02-10 08:24 · Score: 1

That's a very good summary of the Cell's SPEs.

For scientific computing, I think the Cell's advantages will heavily depend on how many vectorizable loops are in the existing code. For example, in Molecular Dynamics (MD) code, one frequently calculates the forces between each pair of atoms (basically solving a = F/m over and over). Systems with N atoms have N*(N-1)/2 pairs so when N is large that loop requires significant calculations (this excludes cut off distances and other tricks). MD code is used to (try to) fold proteins and to solve a host of other important problems.

So if you can store all of the force calculation instructions in the SPEs, insert atom positions, and retrieve the force numbers, then you should be able to minimize wasted cycles decently well. Like you said, the amount of instructions has to be small and all of the atom positions will not fit into memory at the same time, requiring shuffling. But I think, for this application, that the Cell will make a big difference.
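A minimal sketch of that shuffling for the all-pairs loop: split the atoms into tiles small enough to sit in a fixed memory budget, and process one pair of tiles at a time. Everything here is illustrative. The 256KB budget is the figure quoted earlier in the thread, the 12 bytes per atom assumes x, y, z as 4-byte floats, the budget ignores the space the code and results would need, and pair_force is a toy inverse-square repulsion rather than a real MD force field.

```python
LOCAL_STORE_BYTES = 256 * 1024   # local memory figure quoted in the thread
BYTES_PER_ATOM = 3 * 4           # assumption: x, y, z stored as 4-byte floats

def atoms_per_tile(budget_bytes):
    """Two tiles must be resident at once, so each gets half the budget."""
    return budget_bytes // (2 * BYTES_PER_ATOM)

def make_tiles(n_atoms, tile_size):
    return [range(s, min(s + tile_size, n_atoms)) for s in range(0, n_atoms, tile_size)]

def pair_force(p, q):
    """Toy inverse-square repulsion pushing p away from q (illustrative only)."""
    dx, dy, dz = p[0] - q[0], p[1] - q[1], p[2] - q[2]
    r2 = dx * dx + dy * dy + dz * dz
    inv_r3 = 1.0 / (r2 * r2 ** 0.5)
    return (dx * inv_r3, dy * inv_r3, dz * inv_r3)

def tiled_forces(pos, tile_size):
    """Visit every unordered pair exactly once, one pair of tiles at a time."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    pairs = 0
    blocks = make_tiles(n, tile_size)
    for a, tile_a in enumerate(blocks):
        for tile_b in blocks[a:]:
            # On a Cell-like design, this is where tile_a and tile_b would be
            # copied into local memory before the inner loops run.
            for i in tile_a:
                for j in tile_b:
                    if j <= i:
                        continue
                    f = pair_force(pos[i], pos[j])
                    for k in range(3):
                        forces[i][k] += f[k]
                        forces[j][k] -= f[k]
                    pairs += 1
    return forces, pairs

if __name__ == "__main__":
    positions = [(float(i), float(i * i % 7), float(i % 3)) for i in range(10)]
    forces, pairs = tiled_forces(positions, tile_size=4)
    print(pairs, atoms_per_tile(LOCAL_STORE_BYTES))
```

The inner loops never touch anything outside the two resident tiles, so the number of tile loads, not the number of pairs, decides how much time goes to moving data.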

For other scientific codes, like Fast Fourier Transforms, the situation is very similar. There are a large number of vectorizable loops which do not require branch prediction and typically use a small number of instructions per loop. While the shuffling of data in/out of memory has to be minimized IBM should come up with some automatic ways to do this. Also, I think IBM is trying to put most of the burden of determining out of order execution on the compiler. I don't know how well this will work, considering that Intel tried to do the same thing with Itanium (or something similar), but maybe the status of compiler research has improved since then.

The other staple of scientific computing is the solution of linear equations (LINPACK, LAPACK, etc). If they can perform well in a LAPACK benchmark for large matrices then I think they will be ok. So many scientific applications, including ODE integrators, PDE integrators, etc, use an Ax=b solver that once they provide a fast port of one, it'll encourage the porting of code that heavily uses it.

And, of course, since graduate students do a lot of the grunt work of scientific computing, the fun of working on some exotic hardware like the Cell might be enough to overcome the barrier-to-port. As long as they give away the SDKs for free and are generous with the hardware.

Either way, it's exciting that a completely new type of hardware will soon be available.

--
Favorite /. tagline: "On the eighth day, God created FORTRAN." And it was good.
Re:Good point. Unfortunately ... by be-fan · 2006-02-10 11:48 · Score: 1

Actually, I made a math error. Cell's DP theoretical performance is the same as a dual-core Opteron's (10 gigaflops), since the Opteron has 2 FPU pipelines and 2 cores.
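For reference, the arithmetic behind those peak figures is just clock rate times flops per cycle times number of units. The sketch below only multiplies out inputs that are stated in this thread; peak_gflops is a made-up helper, and the "one flop per cycle per FPU pipeline" figure for the Opteron is an assumption implied by the numbers above, not something checked against a datasheet.

```python
def peak_gflops(clock_ghz, flops_per_cycle, units=1):
    """Theoretical peak = clock rate x flops per cycle x number of units."""
    return clock_ghz * flops_per_cycle * units

# One SPE, single precision: 3.2 GHz x 8 operations per cycle.
spe_single = peak_gflops(3.2, 8)

# Dual-core 2.4 GHz Opteron, double precision.
# Assumption implied by the thread's figures: one flop per cycle per FPU pipeline.
opteron_one_pipe = peak_gflops(2.4, 1, units=2)    # the earlier "5 gigaflops" figure
opteron_two_pipes = peak_gflops(2.4, 2, units=2)   # the corrected "10 gigaflops" figure

if __name__ == "__main__":
    print(round(spe_single, 1), round(opteron_one_pipe, 1), round(opteron_two_pipes, 1))
```

These are ceilings only; none of the latency or branch effects discussed earlier show up in this kind of number.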

--
A deep unwavering belief is a sure sign you're missing something...

But wait by Pakaran2 · 2006-02-09 06:46 · Score: 3, Funny

Won't the Cell reception be poor inside the metal cabinets?

*looks bright*

MOD PARENT UP! by iroll · 2006-02-09 06:53 · Score: 1

Seriously, first "funny" that's made me laugh in a while. Kudos.

--
Repetition does not transform a lie into the truth. - FDR

IBM already has these tools available by raftpeople · 2006-02-09 06:55 · Score: 2, Informative

To the programmer, communicating with the SPU is abstracted to file i/o operations. Go check out IBM developerworks pages for lots of info.

Cell will live long, but Niagara may not. by reporter · 2006-02-09 07:00 · Score: 1, Interesting

"The Register" has a recent article about building servers based on the IBM Cell.

Since the Cell is now integrated into the best-funded military apparatus in the world, the Cell will live essentially forever. For the same reason, Ada (i.e. the computer language) will live forever even though few people in industry use the language.

By the way, Cell is also IBM's answer to Sun's Niagara. For years, Sun touted Niagara as a new revolution in computing: Niagara is supposedly the first commercially viable processor to use hordes of cores to quickly execute multithreaded applications.

Yet, Cell also uses hordes of cores. Though the Cell is 1 complex general-purpose POWER core plus 8 simple supporting specialized cores, IBM could easily downgrade the 1 complex core to a simple core (thus yielding additional silicon area) and upgrade the 8 simple specialized cores to 8 simple general-purpose cores. The hard part is linking the 9 cores together, but IBM already solved that problem when it created the Cell. (Intel is also working on a processor with hordes of cores.) If Niagara-based servers ever become popular, IBM is already prepared to launch a general-purpose Cell-based server.

The difference between the Cell and the Niagara is that the American military uses Cell, not Niagara. The American military will subsidize research on Cell.

Re:Cell will live long, but Niagara may not. by antifoidulus · 2006-02-09 07:33 · Score: 1

What's interesting though is both chips are not super-scalar like all desktop chips have been for quite a while. It will be interesting to see what types of programs will be able to take advantage of lots of cores without speedups like branch predictors etc.

--
Monstar L
Re:Cell will live long, but Niagara may not. by Anonymous Coward · 2006-02-09 13:10 · Score: 0

>IBM could easily downgrade the 1 complex core to a simple core (thus yielding additional silicon area) and upgrade the 8 simple specialized cores to 8 simple general-purpose cores.
The PPU is just way too big and hot to multiply 8 times. And even then you would need to rethink cache. T1 cores are 4-way SMT, single issue, no speculative execution, no FPU (a single FPU is shared by all 8 cores)... - the PPU is nothing like it, you are looking at a full redesign not a simple "downgrade".

And the SPUs? Basically stream orientated SIMD FPUs with added instructions for program control. Absolutely useless for typical server loads. You can't schedule normal threads on them, and the way they handle memory is foreign to existing server software.

If you remove the PPU and SPUs, what do you have left? External buses for I/O and memory, and an internal stream optimised bus for on-chip communication. Maybe it doesn't suck for cache coherency traffic, but then again maybe it does.

>The hard part is linking the 9 cores together
Right, because the actual processing cores are delivered by a magical fairy that puts a designed-from-scratch-PPC-core* under your pillow when you lose a tooth...

So far Cell and T1 both appear to be fine chips, but they are _not_ the same. The design goals were different, in some cases complete opposites (just look at the FPU power). Despite the high number of cores on both chips, they address very different markets.

This means.. by f8l_0e · 2006-02-09 07:00 · Score: 0, Offtopic

This means exactly jack squat to me until I can buy one. Where, when, and how much. And no, I didn't RTFA.

Re:Sun to use new chips: DragonBall by db32 · 2006-02-09 07:01 · Score: 1

In the DragonBall Z cartoon Gohan is the one who killed Cell.

--
The only change I can believe in is what I find in my couch cushions.

IBM and OSS by db32 · 2006-02-09 07:19 · Score: 1

I am curious what will come from this in the OSS world. IBM seems to be pretty willing and able to play nice with the OSS world. IBM works with a lot of *nix things and needs developers to build for this processor... I think throwing this thing, some good tools, and some documentation into the OSS crowd of geeks could definitely help jumpstart this thing. It would also be really interesting to see what kind of things come out of this. If this thing really turns out to be hot for high end graphical stuff it could definitely lead the way in new innovative desktop environments in the *nix environment. Efficient 3D interfaces maybe?

--
The only change I can believe in is what I find in my couch cushions.

Re:IBM and OSS by SirTalon42 · 2006-02-09 08:51 · Score: 1

I guess that's why IBM released the Cell simulator a while back, eh?

Re:Increasingly Popular?! by geekoid · 2006-02-09 07:20 · Score: 1

All statistics are noise until you look for a pattern.

--
The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

Take the onus away from the programmer by Anonymous Coward · 2006-02-09 07:21 · Score: 1, Insightful

The compiler is exactly where the solution should be. Using DSPs as an example, it is virtually impossible to optimize DSP code by hand. The compiler will almost always do a better job. Same thing for Cell. If you put the onus on the programmers, this chip won't get widespread acceptance. If IBM wants people to use this chip then they better get busy writing some decent tools.

Re:Take the onus away from the programmer by fitten · 2006-02-09 07:25 · Score: 1

When you are talking about one thread of code on an individual Cell unit, sure. There haven't been very good auto-parallelizing compilers yet (to run code on multiple SPEs, for example) for homogeneous multi-core architectures, much less heterogeneous multi-core architectures.
Re:Take the onus away from the programmer by spudgun · 2006-02-09 15:35 · Score: 1

Or give the GCC guys and Linus a cell workstation to play with

--
Type unto others as you would have them type unto you.
Re:Take the onus away from the programmer by inter+alias · 2006-02-10 13:10 · Score: 1

Do you really have any doubt they'll get some?

Re:Sun to use new chips: DragonBall by Tolookah · 2006-02-09 07:24 · Score: 1

yeah, I know that much, I just don't know if he was actively making fun of the dragonball processor

Why SPEs? by Guspaz · 2006-02-09 07:30 · Score: 3, Interesting

Why go with SPEs anyhow? The whole problem with coding for the Cell involves the differences between the PPE and the SPE. The SPE doesn't have branch predictors, making it virtually useless for any sort of flow control.

Why didn't IBM just pack in a lesser number of PPEs? The PPE already seems to be a very lightweight general purpose processing core, unless I'm missing something. It is about the same size as an SPE. So why not just put 9 PPEs on a Cell chip instead of 1 PPE and 8 SPEs?

If you had 9 PPEs on the chip, any multithreaded code (servers for example) would see massive benefits without having to rewrite it to try to find aspects of the program that could run on what is effectively a DSP. While everybody else was fooling around with 2-core processors, they'd have a 9-core processor on the market. Sure, slower per-core, but 9 of them, with that number going up in the future.

Or am I missing something here?

Re:Why SPEs? by Anonymous Coward · 2006-02-09 07:45 · Score: 0

The SPE's each have some memory on the CPU with very high bandwidth, the code is loaded into this memory for execution (as a 256k block).
Re:Why SPEs? by LWATCDR · 2006-02-09 08:00 · Score: 1

Simple: floating point.
These are not for general purpose computing; that is what the Power5 and the Power6 will be for. Think DSP, render farms, or simulation and not web or database servers.
You could create a system with a Power5 blade to do database and general purpose type stuff and have that feed multiple Cell blades to do rendering and/or DSP.
A render farm jumps to mind but I could see it being used for military functions like Sigint, Radar, and Sonar or any number of scientific simulations.
Not every computer has to be good at running OpenOffice.

--
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
Re:Why SPEs? by Anonymous Coward · 2006-02-09 08:01 · Score: 1, Funny

Because the PPEs are not as synergistic as the SPEs. That's why they call them Synergistic Processing Units instead of DSPs or APUs. Sony/IBM needed more synergy so they made one PPE with 7 "yes men" coprocessors.
Re:Why SPEs? by Anonymous Coward · 2006-02-09 08:10 · Score: 1, Interesting

You're not missing anything here.

If you want a general purpose system go with the 4-6 Gigahertz Power6 processors they are developing. This will provide the very fast multiple 'PPE's you're looking for.

Ok, so the SPEs don't have 'branch prediction'.. So what? They are so freaking fast at what they do that it probably won't matter.

You're looking at one cell. A Blade isn't going to have one cell. It's going to have 2-4.

A rackmount of these guys will provide, conservatively, 10 of these blades.

Maxed out 10 blades would be 40 cells, theoretically.

That is 40 PPEs. That is 320 SPEs.

That will give you supercomputer-level (by today's standards) number-crunching ability (remember, SPEs are NOT vector; they can do more than just floating point) in roughly the same space, and probably the same electrical usage, as a common 24-inch CRT television.

Think about that for a second. Two full racks of this crap side by side would provide enough number crunching power to real-time render a virtual Holodeck.

Look forward to the return of 'software rendering'. Remember that the current MIPS-powered Playstation 2 is fully software rendered...

Early models are already shipping:
http://linuxdevices.com/news/NS3591350722.html

It's an evaluation system. 1-2 or dual core Cell blades.

Oh, and of course it runs Linux. Terra Soft (makers of Yellow Dog Linux) will be selling them. They will also ship with Fedora Core installed.
Re:Why SPEs? by Anonymous Coward · 2006-02-09 15:04 · Score: 0

Or am I missing something here?

I think you missed just about everything. The point of the Cell processor is not to be a multi-core Power chip. They already have those. The point is to have a single controller chip (your PPE) with several super-high-performance special purpose number crunchers. This idea is not novel either, as there have been multi-core RISC+DSP chips for quite a while. However, there have been no such chips that let you arrange the DSP cores in a reconfigurable pipelined workflow. In this way, the SPEs act a lot like the ASIC GPU components in your graphics pipelines, but are far more flexible. It's the memory architecture of the Cell processor that makes it interesting.

These are not for general purpose code, and require quite a lot of hand-coding and algorithm design to get the performance out of them. If you just want quad-core Power processors, you can buy those from IBM today.
Re:Why SPEs? by Guspaz · 2006-02-10 17:05 · Score: 1

Yeah, it runs Linux, but only the PPE is going to be able to do anything without custom code.

That's my point. While I do not deny that the raw number-crunching power of the SPEs is useful for certain tasks, I think that their uses are fairly limited.

Take the PS3 for example. From what developers are saying, the vast majority of what they have to do is limited to the single PPE. They have managed to find a few things that can run on the SPEs, but not many. Physics engines are one thing. It turns out that while physics engines are highly branch oriented, the branches are so non-deterministic that a branch predictor wouldn't help anyhow.

I also think you're understating how slow branching would make the SPE. It'd be essentially useless on any code with branching compared to the PPE. I understand that compiler optimizations and software branch predictors can make that bearable for the occasional loop, but you still can't really do general purpose tasks on the thing.

So for rendering, or scientific work, sure... But what about in servers?

I think the Cell has a strong future as a DSP. I mean, as I said, it is very good at what it does... Decoding 8 simultaneous HD streams when a normal PC can just barely handle one, that is impressive. Imagine how fast a Cell processor could ENCODE video... Either eight or nine separate streams encoded per CPU, or one stream encoded eight times faster. But what happens when you want a database server?

Joystiq by Anonymous Coward · 2006-02-09 08:11 · Score: 0

Share and enjoy http://news.google.com/news?q=cell+yield+learning

Re:Sun to use new chips: DragonBall by db32 · 2006-02-09 08:15 · Score: 1

He was probably just trying to capitalize on my very clever and original joke using Dragonball characters. Too bad funny doesn't count for karma so the joke is on him.

--
The only change I can believe in is what I find in my couch cushions.

Re:Sun to use new chips: DragonBall by Soong · 2006-02-09 08:19 · Score: 1

> and killed the line

But, with dragonballs, you can resurrect things, right?

--
Start Running Better Polls

compute per silicon-area/watt/$ by Soong · 2006-02-09 08:30 · Score: 3, Informative

PPEs are bigger. Also, a dedicated slave processor doesn't have to worry about interrupts and context switches and OS crap, it can spend all its cycles on number crunching. Cell SPEs are all about moving large amounts of data and doing a whole lot of compute on that data. They're simpler and more efficient at what they're designed for.

--
Start Running Better Polls

Re:compute per silicon-area/watt/$ by Guspaz · 2006-02-10 16:59 · Score: 1

From the images of the core that I've seen, PPEs are virtually the same size as the SPEs.

I refer you to this image:

http://images.anandtech.com/reviews/cpu/cell/ppehighlight.jpg

Perhaps you mean the PPE and its supporting hardware, such as the cache? That'd ideally be shared among multiple PPEs.

If you look closely at the PPEs, a huge amount of their real estate seems to go to what looks like their 256KB of cache. Cache takes up a lot of space. Since the PPEs wouldn't each have dedicated cache, they're still about the same size.

Not untill by SnarfQuest · 2006-02-09 08:38 · Score: 1

This chip won't be popular until you can package it into a small box, sell it for less than the cost to build it, with high end audio/video built in, and someone develops games for it.

Nobody would be crazy enough to do that!

--
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.

Re:Not untill by SnarfQuest · 2006-02-09 09:13 · Score: 1

And after that, all it will be used for is beowulf clusters...

--
Who would win this election: Andrew Weiner vs Andrew Weiner's weiner.

"multi-core DSPs" WITH CRIPPLED FPUs!!! by mosel-saar-ruwer · 2006-02-09 08:40 · Score: 4, Informative

Actually, the bigger difference is in how the architecture changed. Cell processor is more along the lines of multi-core DSPs.

Standard computer graphics are RGB color at 24-bits per pixel [2^24 = 16777216], i.e. about 16 million colors.

Standard thinking in the graphics bidness is that: If our triangles will only be displayed in 24-bits worth of color, then why do we need to perform triangle-arithmetic in anything higher than maybe 32-bits worth of floating points?

Hence floating point calculations are 24-bit in the ATi world, and 32-bit in the nVidia and Playstation3/Cell world.

Boy, I hope they're upping that floating point number for these "server" chipsets, cause 32-bit single-precision floats are essentially worthless for even something as trivial as computing interest on a bank statement.
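
The precision worry can be made concrete with a small sketch. It is only an illustration, assuming Python's standard `struct` module as a way to round a value to IEEE 754 single precision; the helper name `to_float32` and the variable `balance_cents` are made up for the example and have nothing to do with any Cell toolchain.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# A 32-bit float carries a 24-bit significand, so integers above 2**24
# can no longer all be represented exactly.
balance_cents = 2**24 + 1                  # 16,777,217 cents, about $167,772
print(to_float32(float(balance_cents)))    # 16777216.0: one cent is lost

# 0.01 is not exact in binary at any width; single precision is just coarser.
print(to_float32(0.01) == 0.01)            # False
```

So a balance of a bit over $167,000, held in cents, already cannot be stored to the cent in a single-precision float.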

On the other hand, a "Cell" server CPU with a 128-bit FPU would be something to drool over. The problem, though, is that transistor counts on FPU's tend to increase as n^2, so each time you double the FPU bit-count [to 64-bits, then to 128-bits], your transistor count goes through the roof.

Re:"multi-core DSPs" WITH CRIPPLED FPUs!!! by joebebel · 2006-02-09 13:10 · Score: 1

It doesn't make any sense to use floating point arithmetic in most financial calculations, anyway. Using integer arithmetic it's really easy to guarantee that money just doesn't disappear or appear due to rounding errors.

And if it has to handle amounts of money greater than $40 million (~2^32 cents) then you can just use 64-bit ints.
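
A minimal sketch of what that looks like, assuming amounts are held as whole cents and rates as basis points. The function name, the parameters, and the round-half-up rule are illustrative choices, not anything from a real banking system.

```python
def monthly_interest_cents(balance_cents: int, annual_rate_bp: int) -> int:
    """Interest for one month in whole cents, rounded half up.

    annual_rate_bp is the annual rate in basis points (1 bp = 0.01%),
    so 5% is 500. Everything stays in integers, so no cent can appear
    or vanish through floating-point rounding.
    """
    denominator = 10_000 * 12  # basis points per unit, months per year
    return (balance_cents * annual_rate_bp + denominator // 2) // denominator

print(2**32)                                  # 4294967296 cents, about $42.9 million
print(monthly_interest_cents(100_000, 500))   # $1,000.00 at 5% a year -> 417
```

Python integers are arbitrary precision, so the 32-bit versus 64-bit limit only shows up in languages with fixed-width ints.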
Re:"multi-core DSPs" WITH CRIPPLED FPUs!!! by Anonymous Coward · 2006-02-09 15:55 · Score: 0

Agggh! Damned Cobol programmers and accountants think the world ends at two decimal places. There's more to financial applications than multiplication, addition, and subtraction.
Re:"multi-core DSPs" WITH CRIPPLED FPUs!!! by woolio · 2006-02-09 16:18 · Score: 2, Insightful

Hate to break some painful news to you, but "24-bit" RGB refers to each color getting 8 bits -- an UNSIGNED INTEGER value.

No floating point involved -- at all...

Now for 3D Graphics, coordinates may be represented in floating point. But during rendering, the values are converted to 8-bit integer values for Red, Green, and Blue components of each pixel.

And financial calculations are computed using INTEGER arithmetic....

A lot of things that might appear to require floating point can often be implemented using "fixed-point" integer arithmetic (VERY accurately). The advantage of the latter is reduced hardware cost, increased speed, and lower power consumption.
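
For anyone who hasn't seen fixed-point before, here is a rough sketch. It assumes a Q16.16 layout (16 integer bits, 16 fractional bits), which is just one common choice; the helper names are invented for the example.

```python
FRAC_BITS = 16            # Q16.16: 16 integer bits, 16 fractional bits
ONE = 1 << FRAC_BITS      # the fixed-point representation of 1.0

def to_fixed(x: float) -> int:
    """Convert a real number to its nearest Q16.16 integer encoding."""
    return round(x * ONE)

def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q16.16 values using only integer operations."""
    # The raw product has 32 fractional bits; shift back down to 16.
    return (a * b) >> FRAC_BITS

def to_float(a: int) -> float:
    """Decode a Q16.16 value, for display only."""
    return a / ONE

gain = to_fixed(0.75)
sample = to_fixed(2.5)
print(to_float(fixed_mul(gain, sample)))  # 1.875
```

Addition and subtraction are plain integer adds; only multiplication and division need the extra shift.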

Do you think a cellular phone performs the voice compression in floating point? Nope!
Re:"multi-core DSPs" WITH CRIPPLED FPUs!!! by DAldredge · 2006-02-10 13:59 · Score: 1

That has to be one of the most ill informed comments that I have seen on this site in the past 6 months.

Two Tutorials by GrEp · 2006-02-09 08:41 · Score: 2, Interesting

IBM needs to release two SIMPLE tutorials if they want programmers to bother porting code specifically to the Cell:

1. A Cell program that solves linear equations Ax=b efficiently using SPEs. This would help those with data-intensive problems.

2. A Cell program that speeds up depth-first search (as used for SAT, GRAPH COLORING, MAX-CLIQUE) by using the SPEs. This would help those programming CPU-intensive problems.

--

bash-2.04$
bash-2.04$yes "Don't you hate dialup connections?"| write USERNAME

This means...Cell-ular Service. by Anonymous Coward · 2006-02-09 09:07 · Score: 0

Maybe a media server that uses a Cell would work better? Not as many demands placed on it as a game console, and cheaper. Kind of like what happened to the Linksys.

Give me an ATX board by mnmn · 2006-02-09 09:45 · Score: 3, Insightful

They could come up with ATX or miniATX boards at really cheap prices, able to take your average DDR DIMMs, power supplies, IDE, etc. Give it maybe 3 PCI slots... or 1 if it's miniATX.

Sold for under $100, and they're making money off it while spreading the love that will increase the developer market for the Cell architecture.

It goes like this. Make a new architecture. Release a good compiler for free, with awesome documentation and sample programs and libraries. Allow people to buy evaluation boards for low prices. Once you get people hooked enough, sell the chips themselves at high prices. It's the Microchip (tm) model. Their chips don't really do much for the high costs (compared to Atmel, TI, etc.) but since everyone knows how to work them, they sell sell sell. Rabbit Semiconductor, however, is trying hard to get into the market, and their dev tools are cheap. It'll take time.

IBM can't release a couple of PDFs and one tough software suite and expect the world to jump on it. There's a reason why there's so much momentum behind the Power architecture, and the Cell is different.

--
"Give orange me give eat orange me eat orange give me eat orange give me you." -Nim Chimpsky

Nice names by null+etc. · 2006-02-09 10:28 · Score: 1

Blades, cells, it's getting to be like prison around here.

If you can do threads, then you can do Cell by tepples · 2006-02-09 10:44 · Score: 2, Insightful

If you put the onus on the programmers, this chip won't get widespread acceptance.

If you can write a PC program that uses 10 threads, then you can write a program that uses the Cell processor's PPC and 7 DSPs. Trouble is that most computer science education in universities doesn't cover practical use of threads.
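
The shape of such a program might look like the sketch below: the main thread plays the PPE and hands blocks of data to 7 workers standing in for the SPEs. This is only the structure, written with Python's standard `threading` and `queue` modules. It is not Cell code, and Python threads will not actually run the arithmetic in parallel. The names `worker`, `sum_of_squares`, and `NUM_WORKERS` are invented for the example.

```python
import queue
import threading

NUM_WORKERS = 7  # stand-ins for the SPEs; the main thread plays the PPE

def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull blocks of data until a None sentinel arrives."""
    while True:
        block = tasks.get()
        if block is None:          # sentinel: no more work
            break
        index, data = block
        results.put((index, sum(x * x for x in data)))  # pure number crunching

def sum_of_squares(values: list, block_size: int = 4) -> int:
    """Split the input into blocks, farm them out, and combine the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    for item in enumerate(blocks):
        tasks.put(item)
    for _ in threads:
        tasks.put(None)            # one sentinel per worker
    for t in threads:
        t.join()
    return sum(results.get()[1] for _ in blocks)

print(sum_of_squares(list(range(100))))  # 328350
```

The hard part on real hardware is the piece this sketch hides: each block has to be moved explicitly into the worker's small local memory instead of being shared.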

UltraSparc T1 Cell by Dopeskills · 2006-02-09 11:16 · Score: 1

8 general-purpose cores, each capable of executing 4 threads, beats Cell hands down. I'd take a Niagara server over one of these any day.

Tutorials 3 and 4 by dch24 · 2006-02-09 12:13 · Score: 2, Interesting

Having been a long-time reader over at the IBM forums, there are a lot of similar questions and answers going on over there.

There were a couple that would be really helpful:
1. An implementation of zlib for the SPE architecture, with a speed comparison to the PPE. (Hopefully, the SPE is very fast...)
2. Examples of direct SPE-to-SPE streaming.

Simple: Just develop a Java VM for Cell platform by Anonymous Coward · 2006-02-09 13:47 · Score: 0

... suddenly it'll have millions of applications which can run on it.

Re:UltraSparc T1 Cell by NutscrapeSucks · 2006-02-09 17:17 · Score: 1

Niagara is focused on integer, corporate workloads. Cell is designed for FP, scientific workloads. There won't be much crossover.

--
Whenever I hear the word 'Innovation', I reach for my pistol.

Here you go. by Kadin2048 · 2006-02-10 06:59 · Score: 1

Linux:
Yellow Dog Linux runs on Cell. (Link; this is the same military product that is linked to in a Register article further up in the thread.) It's being marketed for semi-embedded uses, like in medical imaging systems, sonar and radar, etc., apparently.

Free Optimizing Compiler:
I have no idea whether there are any compiler optimizations for it in GCC; I suspect not, though. However, there is a version of the IBM XL C compiler for it, available here (no idea if registration is required, I didn't attempt to download). I wonder how the IBM compiler is implemented, and whether you could use it in a Linux-based Cell system as a drop-in replacement for GCC. It says "GNU C extensions are welcome."

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."

wide acceptance by marafa · 2006-02-11 19:26 · Score: 1

If IBM wants the Cell to gain wide acceptance, they should give programmers free access, not just sell the blades. That's expensive.
So...
1) donate to the sf.net server farm
2) have their own test drive program
3) donate servers to universities
4) donate a server to me
5) support Linux on Cell

--
_ In Egypt Networks: Network Solutions with a Twist

Well inform it then. by mosel-saar-ruwer · 2006-02-12 06:33 · Score: 1

That has to be one of the most ill informed comments that I have seen on this site in the past 6 months.

The bulk of the post is approximately three sentences; there's a further addendum which asserts that transistor counts on FPUs do not double as the bit count on the floats doubles [rather, the transistor count increases at a much larger rate].

If anything asserted here is factually false, then please take the time to correct it:

1) Computer colors are [8 bits for Red] X [8 bits for Green] X [8 bits for Blue] = 24 bits total.
2) Graphics chipset makers believe that high bit-counts in FPU calculations are a waste of transistors.
3) ATi has 24-bit FPUs; nVidia has 32-bit FPUs; Cell-Playstation has 32-bit FPUs.
4) When you double the bit count on an FPU, the transistor count does not double, but increases at a much higher rate [hence the relative paucity of true 128-bit hardware FPUs; by comparison, Sun's "quad precision" 128-bit double is a software fiction that is essentially useless for real-world calculations].

Again, if any of this is false, please correct it.

Slashdot Mirror

159 comments