Prospects For the CELL Microprocessor Beyond Games
News for nerds writes "The ISSCC 2005, the "Chip Olympics", is over and David T. Wang at Real World Technologies has put up a very objective review of the CELL processor (the slides for the briefing are also available), covering all the aspects disclosed at the conference. Besides the much-touted 256 GFlops single-precision floating-point performance, the CELL processor has 25-30 GFlops in double precision, which is useful enough for scientific computation. Linus seems interested in CELL, too."
This is a very positive review for the cell processor. It does seem like a really exciting new piece of technology. It promises a lot, and if it will do everything people say it will do, it really has the possibility to give the entire industry a big leap forward.
That being said, I think it's important not to get too excited about it... it's hard to say if it will live up to everything that people have written about it. I'm a bit skeptical. Until I see some production units doing amazing things, I'm cautiously optimistic.
I store my recipes online (the way nature intended)
Why should Linus be interested in the cell when he has the Transmeta Crusoe?
...playing The Game of Life.
Sony so badly wants its next-generation game console to offer a super-realistic "virtual reality" experience, the company will design and build its own advanced 128-bit processor to realize this goal.
...
Processors inside game consoles usually toil away in anonymity, derided as poor cousins to desktop chips such as Intel's Pentium line. But with Sony Computer Entertainment's ambitious plan, its chips could outclass the offerings of the world's largest chipmaker--if all goes well.
The system is so advanced, MicroDesign Resources analyst Keith Diefendorff wrote in a report that the system "has the potential to swipe a chunk of the low-end market from under the noses of PC vendors." He wrote that the platform may "signal the company's intention to move upscale from current game consoles, cutting a wider swath through the living room," with its abilities to function like a stand-alone DVD player and Internet set-top box.
Sony puts on game face with new chip
Published: May 5, 1999, 1:25 PM PDT
By Jim Davis
Staff Writer, CNET News.com
There are places where the networks are not touching, and there are places where they are. -Boeing's Lori Gunter
But isn't the point of the cell processor a distributed model?
From the reviews I've seen, they are touting it as if the cell communicates with other cells to handle all the processor-intensive stuff.
So where one cell would not be as powerful as an x86 CPU, two cells would be. And the way they have designed the things is as a separate computer on a chip, so you can basically upgrade your ?? just the same way you upgrade your memory.
Or have I gotten the wrong end of the stick and they are designing these things for pointless fun?
Some time ago Chuck Moore proposed the 25x , a single chip holding a 5x5 array of simple processors. That's what this reminded me of when I first read about it. As Mr. Moore said in that Slashdot interview, "[...] the 25x is a solution looking for a problem." Cell theoretically has a lot of performance, and we're talking FLOPS not MIPS. It will certainly be useful or even revolutionary in televisions and game computers, as well as for scientific calculations. I don't see it making your desktop or server much faster though. Those don't need more FLOPS, they need more I/O bandwidth and faster peripherals, and perhaps more MIPS. I can see Cell workstations, but in the same way as you have SPARC workstations and laptops now: as development tools for the "real" hardware.
Sheesh, /. might as well make a Cell image & category, they post so many articles about it!
From what I've seen, it will be rather low horsepower compared to the current G5s, since it will be lacking deep pipelines, caches and other bits that give the G5 much of its speed. That's not to say that it's not really a G5; it sounds like it will support the full G5 instruction set (including Altivec) and be a true 64-bit processor core, just not a particularly fast one.
The role of the G5 cores seems to be to handle higher order logic that prepares and parses out tasks to the very fast vector units (SPEs).
So it probably does make more sense to have it as a coprocessor in a Mac, at least until compilers and software writers routinely target the cell's SPEs -- if that day ever comes. More likely specialized code will need to be written, and particular subtasks pulled out.
I suspect things like physics libraries, sound & video processing libraries, plus apps like SETI@home would be quickly written to use the SPEs, but most other software wouldn't be.
I like the fact that the presenters didn't remember/know what all the acronyms were in the cell diagram. I like the interview technique too. Get em drunk and watch em talk.
I was wondering why the article was so in depth.
Quoth
"
After some discussion (and more wine), it was determined that the ATO unit is most likely the Atomic (memory) unit responsible for coherency observation/interaction with dataflow on the EIB. Then, after the injection of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most likely stood for some sort of Register Translation Block whose precise functionality was unknown to those outside of the SPE. However, this theory would turn out to be incorrect.
"
...might be used to run the PS3 (assuming this is true). Outside of a weighty OS (assuming you use Windows, Mac or a Linux GUI with that nVidia) they should do better.
Besides, 256 GFlops in single-prec. can't be too bad either...can it?
You can hold down the "B" button for continuous firing.
Here you go, I found the source for you :)
liqbase
I use a particle accelerator with Windows all the time.
;)
I can't stand LCD monitors, CRT all the way
liqbase
What good is a new chip, no matter how fast it is, if you can't run anything on it?
There is this really neat group of operating systems called Unix/Linuxes. They have a major advantage in that you only need a small amount of assembler to get going on a new chip, then the rest can be ported over in C/C++. This has been the situation for decades - Unix (and now Linux) has been the initial OS for almost all new chips.
How fast will this chip be at general purpose stuff? Who cares if it can do 100GFLOPS on a couple operations.
Reasonable point, but FLOPs are a good general measure of the speed, as they are pretty complex operations. We all used to measure speed in MIPS (Million Instructions Per Second), but as chips got so diverse, one chip's instruction could not be easily compared with another's (particularly if RISC chips were involved, where the instructions could be very minimal). FLOPs are a better measure, as a divide is a divide and a multiply is a multiply no matter what chip architecture you use.
It would be compatible with PowerPC software.
Which means that the vast majority of the software I use every day would work just fine on it.
Although it would be slow... Cell isn't optimized for general-purpose work, and the extra SPEs add another 128 registers on top of the PowerPC and VMX ISAs, which wouldn't get used by normally compiled PowerPC code.
You would have to have GCC worked over to produce 'vectorized' code that uses as much of these SPEs as possible for single-threaded applications, and even then you wouldn't get much more performance out of it than a normal G4-class PowerPC processor.
Then you have memory management problems to work out, probably through an extensive firmware-based controller, which would add to execution time and slow things down a little bit more.
The advantage would be if I was doing extensive multimedia or 3D work or special types of scientific research; then I could use a familiar environment (Linux) as a platform to run special applications that themselves would benefit from the tremendous performance capabilities of a few of these cells.
It would make a great chip for an embedded multimedia player (at lower clock rates) and would be great for something like a non-linear video editor, but a Wintel killer it definitely wouldn't be.
It would probably be somewhat useful for normal desktop usage as more and more applications are multimedia in nature, but it's not going to be substantially faster than an Intel or AMD processor to the end user.
But what it can do is provide backup horsepower as a math co-processor.
I see great potential for the STI Cell Processor as a SETI@Home accelerator.
Seriously though, there may be good scientific uses for these exactly as you envisioned - in a coprocessor role. From folding proteins and weather simulations to cryptanalysis, these could provide a great entry for distributed scientific computing.
I've been reading about the Cell processor for a few weeks now, and there is never any discussion about the operating system architecture necessary to get this thing to perform.
As I see it, it's a PowerPC of OK quality with 8 subsidiary processors optimised for running a relatively simple task on a relatively small amount of memory.
So - port Linux to it? But how? Relatively easily, to make use of the main processor, but what sort of subsystem do you build so that the subsidiary processors get used to their full potential? Perhaps part of X could be configured to run on these processors - but that would be a very manual tweak to make use of the architecture. And with the best will in the world, these processors would then sit around unused for most of the time.
What you need is a more general concept, probably at the programming language level, in which algorithms can be expressed in such a way that the operating system can detect that they can be loaded into these subsidiary processors to be executed.
But there doesn't seem to be anything about that in the news out there. Presumably Sony are going to do something for the PS/3 - what? And is it going to be general purpose, since much of the benefit for their purposes will be a super motion graphics processor for games?
Until we understand what the software infrastructure to make use of the architecture of this new chip will be, I can't see how we can make predictions of its success in the more general processor market. Before then it's just marketing hype.
Unless you are computing digital orreries, whether it has 256GFlops or 256TFlops makes little difference if the memory bandwidth isn't substantially increased, and people don't increase the memory bandwidth because that has expensive consequences all over the system.
On the whole, my impression is that current mainstream CPUs have a pretty reasonable balance between CPU power and all the other system components. Changing just the CPU without making substantial (and expensive) changes to the rest of the system will not magically give you more performance.
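A rough way to put numbers on the bandwidth point is a roofline-style estimate: the attainable rate is the smaller of the compute peak and what memory can feed. The sketch below uses the 256 GFlops peak from the story; the memory bandwidth figure and the function name are my own assumptions for illustration, not numbers from the review.

```python
# Roofline-style estimate: throughput is capped either by the compute peak
# or by how fast memory can deliver operands, whichever is lower.
PEAK_GFLOPS = 256.0        # single-precision peak quoted in the story
MEM_BANDWIDTH_GBS = 25.6   # ASSUMED memory bandwidth in GB/s, illustrative only


def attainable_gflops(flops_per_byte, peak=PEAK_GFLOPS, bandwidth=MEM_BANDWIDTH_GBS):
    """Return min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak, bandwidth * flops_per_byte)


if __name__ == "__main__":
    # Low-intensity code (few flops per byte fetched) never gets near the peak.
    for intensity in (0.25, 1.0, 4.0, 16.0):
        rate = attainable_gflops(intensity)
        print(f"{intensity:5.2f} flops/byte -> {rate:6.1f} GFLOPS")
```

Under those assumed numbers, code that does one flop per byte fetched is limited by memory, and raising the compute peak further would not change its result.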
it will be lacking deep pipelines, caches and other bits
And that is the whole point of this processor. The G5 NEEDS those pipelines and caches in order to feed the multiple execution units, reorder instructions and avoid reading slow host memory.
The CELL, on the other hand, will have the instruction ordering done in software. All those 'bits' you describe are replaced with software: a much smarter compiler.
Yes this processor will perform poorly with today's code. With appropriately written code it will scream.
This chip is not going to compete with other general purpose CPUs. It's going to compete with custom ASICs and FPGAs.
-S
folks need to keep in mind these are max figures assuming software is perfectly written to take care of parallelization (does that word exist?). this means that most computer programs will hit nowhere near these rates, but super optimized versions of things like SETI@home and an mpeg encoder/decoder could take advantage of it.
just remember how many developers complained about the Emotion Engine from the PS2 and how it was such a bitch to program for; this will be worse. it's first gonna require a special compiler or at least a tool to distribute the code across all the independent mini-procs and reorder all the instructions to take advantage of its little quirks. they seem to be a bit different from pipelines, but some of the same concepts with regards to stalls will apply. so if you're working heavily on one set of data, it's quite possible only one of these mini procs will be used, and the rest will stand there and do nothing.
i think this is something that'll work much better on a video card and maybe a soundcard than as a main processor, except in the cases where mostly only media processing is required: settop boxes, game consoles, tvs, stereo systems, etc.
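To make the "farm the work out to the mini-procs" idea concrete, here is a minimal sketch of splitting one data set into chunks and handing each chunk to a worker. Python threads stand in for the 8 SPEs purely to show the structure; the names `split`, `kernel` and `run_on_workers` are hypothetical, and this is not how a real Cell toolchain would schedule code.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def kernel(chunk):
    """The small, self-contained job each worker runs on its own chunk."""
    return sum(x * x for x in chunk)


def run_on_workers(data, workers=8):
    """Scatter chunks to the workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(kernel, split(data, workers)))


if __name__ == "__main__":
    print(run_on_workers(list(range(1000))))
```

The catch described above shows up in `split`: if the job cannot be cut into independent chunks, only one worker ever has anything to do.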
The real promise of these Cells is Internet MPP. IBM (and Sony) claim that Cell PCs will be able to cluster "natively" across Internet-latency TCP/IP networks, like broadband. If they deliver on that, then performance questions will revolve around interoperable network apps, not just the raw CPU HW.
Intel's Pentium architecture was built to accommodate 6-way direct CPU interconnects. The idea was to build "cubic" structures for MPP computers. It took until the P4 to really deliver any of those, almost 10 years after the architecture was released. And the software is still bleeding-edge, and hand-rolled for each install. MPP SW techniques have evolved a lot since then, so perhaps the Cell will actually deliver on these "distributed supercomputer" promises.
--
make install -not war
Besides, 256 GFlops in single-prec. [realworldtech.com] can't be too bad either...can it?
Unfortunately, the single-precision math ignores certain rounding conventions in order to boost the speed. You'll get super fast single-precision results, but they won't be as accurate as on other systems. That probably won't matter for physics or rendering in a video game (Sony's Emotion Engine did the same thing), but it could make a big difference when applied to general-purpose situations.
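For a feel of how much a rounding shortcut can cost, here is a small sketch that emulates float32 in plain Python and compares round-to-nearest with truncation (round toward zero) over a long running sum. That the shortcut is truncation is my assumption for the sake of the example; the helper names are made up.

```python
import struct


def to_f32_nearest(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_f32_truncate(x):
    """Round a Python float toward zero onto the float32 grid."""
    nearest = to_f32_nearest(x)
    if abs(nearest) <= abs(x):
        return nearest
    # Nearest rounding overshot in magnitude: step one float32 toward zero.
    bits = struct.unpack("<I", struct.pack("<f", nearest))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]


def accumulate(step, count, rounder):
    """Add `step` to a running total `count` times, rounding after every add."""
    step32 = rounder(step)
    total = 0.0
    for _ in range(count):
        total = rounder(total + step32)
    return total


if __name__ == "__main__":
    print("nearest :", accumulate(0.1, 10000, to_f32_nearest))
    print("truncate:", accumulate(0.1, 10000, to_f32_truncate))
```

With truncation every single add errs in the same direction, so the error can only pile up, whereas round-to-nearest errors tend to partly cancel.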
5 years ago the "Emotion Engine" from Sony was supposed to "steal a chunk" of the PC processing market. Didn't happen. Won't happen.
IF they write/pick the OS/software for the Cell appliances correctly I could see it making some headway as a desktop replacement.
Which is the key, exactly. As Linus wrote in one of his linked forum posts (from the blurb), it's gonna be a pain to program general purpose for those vector units (SPEs).
However, judging from the main review, it doesn't look like the PowerPC Element was castrated too much. It looks like it'll suffer from Pentium 4 syndrome (boosting the frequency doesn't do as much as it used to) so it might not be as good as an equally clocked Power5 based processor, but I think you're looking too much at the SPEs when considering whether or not it'll compete with the x86 and Power5.
Right now, there aren't x86 and Power5 chips at 4+GHz, and looking at Intel and AMD's roadmaps, there probably won't be for quite a while. Even if this thing is horribly inefficient for general tasks, it'll be great for Graphical/Video work, great for Physics/Scientific work, and probably at least as fast for everything else as a single core P4 3.8GHz (which does a better job melting candles than it does holding them, most of the time).
You may not like Michael Kanellos usually, but I think he's hit the nail on the head here.
This is a bigger, hotter, less stable chip with an exotic and hard to write-for architecture. That's fine for a gaming system with a dedicated revenue stream and no competition. It's not gonna make it outside that domain.
Substantial changes, maybe. Expensive? Perhaps not. This all depends on the base assumptions from which you operate. One of the fundamental assumptions in today's existing systems is that any and all work should be done to maximize the utilization of the CPU. However, when considering how to design other types of systems, such may not be true (it may make sense to minimize the memory footprint, for example).
If you've ever done some detailed algorithm work, you will quickly realize that there are many algorithms where you can make tradeoffs between memory and CPU time. The 'simplest' of these are the algorithms that are breadth first vs. depth first, which can trade off exponential in memory vs. exponential in time. [For a 'trivial' example, try forming the list of all operational assignments containing 6 variables and which use %, +, -, *, /, ^, &, ~, and ()... fewer than 50 lines of perl and you'll quickly blow through the 32-bit memory limit if written breadth first, or take overnight to run depth first]
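A toy version of that breadth-first versus depth-first tradeoff, sketched in Python rather than perl: both functions enumerate every string of a given length over a small alphabet, but the breadth-first one keeps a whole level in memory while the depth-first one only ever holds the current prefix. The function names and the tiny alphabet are my own choices for illustration.

```python
def breadth_first(alphabet, length):
    """Build every string level by level; each whole level lives in memory."""
    level = [""]
    peak = 1  # largest number of strings held at once
    for _ in range(length):
        level = [prefix + symbol for prefix in level for symbol in alphabet]
        peak = max(peak, len(level))
    return level, peak


def depth_first(alphabet, length, prefix=""):
    """Yield strings one at a time; only the current prefix is held."""
    if len(prefix) == length:
        yield prefix
        return
    for symbol in alphabet:
        yield from depth_first(alphabet, length, prefix + symbol)


if __name__ == "__main__":
    ops = "+-*/"
    strings, peak = breadth_first(ops, 6)
    print("breadth first held", peak, "strings at once")
    print("depth first produced", sum(1 for _ in depth_first(ops, 6)), "strings")
```

The peak held by the breadth-first version grows as the alphabet size raised to the length, which is the memory blow-up being described; the depth-first version pays in time instead whenever earlier results have to be regenerated rather than looked up.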
The significant question which has been brought up - and which remains unanswered - is what software development tools will be made available. Once this is better answered, we will all be in a better position to determine what fundamental assumptions have been changed, and therefore how we can follow the new assumptions through to conclusions about the net performance of the processor and machine in which it is contained.
Since IBM is now involved, should it be called the PS/3 instead of the PS3?
My view of the Cell chip is that it's actually 2 different kinds of chips put together. It has a general processor (the POWER5 core), and essentially co-processors that are optimized for a totally different class of programs. The POWER5 chip would let it run your normal office applications, but the SPEs allow the chip to do things like graphics processing, audio processing, simulations, etc.: all those problems that lend themselves naturally to a vectorized solution. Together, the 2 kinds of cores on a single chip have the potential to do a lot. But there have to be tools to allow developers to make use of the potential. Especially as vectorized programs are not easy to write and optimize, the quality of the development tools will be very important in deciding the success of the chip.
There are 10 kinds of people in the world - those that know binary, and those that don't.
(1) fetching and prefetching (multiple P4 stages) because the extra processors on Cell can directly address their local 256KB of memory.
(2) decoding x86 instructions into microops - since the extra processors are running code directly rather than running kludgy x86 code on a non-x86 microcore
(3) branch prediction (since the load penalty is a lot lower due to local 256KB of memory and shallower pipeline, these stages are unnecessary)
(4) scheduling the microops isn't necessary as Cell will require that to be done in software during compilation (ala VLIW)
(5) retirement (since Cell isn't doing out-of-order execution, no reordering and retranslation from the microop to the x86 world is necessary)
So given that potentially half of the 20 P4 stages (later P4s have 31) are unnecessary, that saves a lot of logic and allows the same clock speed with fewer stages. There has (apparently) been a lot of architecture work here to think through what adds the extra hardware and how to avoid that... the result is the ability to use higher clock speeds without having the same types of penalties the IA-32 processors encounter.
People seem to think this is leaps and bounds above everything else, but they're missing the details. In order to obtain that much performance, you'll need a task which parallelizes well so it can be broken up into chunks for the 8 SPEs. Graphics rendering falls into this set of tasks, but a lot of general applications just don't gain that much from parallel processors. Even when you have a task that does parallelize, writing parallel code is quite a bit harder than writing code for just a single thread of execution.
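The usual back-of-the-envelope for this is Amdahl's law: if only a fraction of the run time can be spread over the 8 SPEs, the serial remainder caps the speedup. A minimal sketch, with the parallel fractions picked arbitrarily for illustration:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only `parallel_fraction` of the work is spread
    over `workers` units and the rest stays serial (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


if __name__ == "__main__":
    # 8 workers, matching the 8 SPEs; the fractions are made-up examples.
    for fraction in (0.5, 0.9, 0.99, 1.0):
        print(f"{fraction:.2f} parallel -> {amdahl_speedup(fraction, 8):.2f}x")
```

A task that is half serial gets less than a 2x gain from 8 units, which is why the headline number only applies to work that parallelizes almost completely.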
I've seen a lot of hype about having the Cell in your laptop talk to the Cells in your desktop, microwave, and TiVo, but you have to consider real-world limitations. When you set up a network like that (presumably wireless), you're going to be limited to around 100Mbps. In computer clusters and supercomputers, one of the main limitations of performance is the communication bandwidth available between processors, and the latency of the network. To build a "home supercomputer", you not only need a task that parallelizes well, but one that doesn't require so much inter-node communication that it's held back by a slow network. You can't work around this problem with hardware magic - if the task you're working on requires lots of communication bandwidth, you're going to be held back.
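A quick sketch of when shipping work to another box is worth it, using the 100Mbps link figure from above. The compute rates, data sizes, and function names are placeholders I made up; the point is only that transfer time is added on top of the remote compute time.

```python
def transfer_seconds(n_bytes, link_mbps=100.0):
    """Time to push n_bytes over a link of link_mbps megabits per second."""
    return n_bytes * 8 / (link_mbps * 1e6)


def offload_seconds(work_gflop, remote_gflops, n_bytes, link_mbps=100.0, latency_s=0.0):
    """Total time to send the data and run the job on the remote node."""
    return latency_s + transfer_seconds(n_bytes, link_mbps) + work_gflop / remote_gflops


def local_seconds(work_gflop, local_gflops):
    """Time to just run the job where the data already is."""
    return work_gflop / local_gflops


if __name__ == "__main__":
    # ASSUMED workload: 8 GFLOP of work on 125 MB of data.
    print("local  :", local_seconds(8.0, 8.0), "s")
    print("offload:", offload_seconds(8.0, 256.0, 125_000_000), "s")
```

In that made-up case the remote node is far faster at the arithmetic, yet the job still finishes sooner locally because moving the data dominates.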
So how much beyond a modern PC is 250GFLOPS anyway? Not much! A GeForce FX at 500MHz does 200 gigaflops. An AMD Athlon's peak performance is 2.4 GFLOPS at 600 MHz... if we scale this up to 2.2 GHz (high-end Athlon), that's 8.8GFLOPS (note: As we're talking about theoretical performance, nonlinear factors like bus speeds can be ignored). Basically, if the Cell dedicates most of its power to graphics rendering, you'll have computation power in the same range as a fast PC of today. Given that we're not going to see any products based on the Cell for a while, this isn't going to be the end of the world for Intel and nVidia (let alone the fact that Cell isn't x86).
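The clock scaling in that estimate is simple proportion, spelled out below. It assumes peak FLOPS scale linearly with clock and nothing else changes, which is the same simplification the paragraph makes; the function name is hypothetical.

```python
def scale_peak(gflops, from_mhz, to_mhz):
    """Scale a theoretical peak linearly with clock frequency."""
    return gflops * to_mhz / from_mhz


if __name__ == "__main__":
    athlon = scale_peak(2.4, 600, 2200)   # 2.4 GFLOPS at 600 MHz -> 2.2 GHz
    print(f"Athlon at 2.2 GHz: about {athlon:.1f} GFLOPS peak")
    print(f"250 GFLOPS is about {250 / athlon:.0f}x that")
```

On those figures the CPU alone is well over an order of magnitude behind, so the "same range as a fast PC" comparison rests on counting the graphics card's peak as well.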
Consoles using the Cell will have the advantage of only having to render for TV resolutions - at most 1080 lines, while PCs will be rendering at up to 1600x1200, but if you look at recent history, you can compare the xbox to a then-good PC with a GeForce3 (which came out at around the same time) - the xbox looked better, but PCs did catch up and surpass its performance and it didn't take all that long. Consoles have to be very high-end when they're released, because the platform doesn't change for 2-3 years, and they still need to be "good enough" after a couple years, before the next generation is released.
My server
POWER5 is not the same as PowerPC 970 (G5). POWER5 is a really really expensive high performance mainframe chip. G5 is a server/desktop chip.
If you're going to rip the links out of one of my Ars news posts and submit them to slashdot (in the same order in which I linked them, no less), then at least credit your source.
Senior CPU Editor | Ars Technica | http://arstechnica.com/
Most of you are thinking of today's applications...but what about things like eye/head tracking, voice recognition, face recognition, telepresence, real-time cinema-quality CGI, etc...those are tasks requiring large-scale numerical computation, and they all might appear on your desktop in the not-too-distant future thanks to chips like CELL and its successors.
All is Number -Pythagoras.
Well, considering that there's going to be a dedicated graphics chip from nVidia in the PS3 too, I'd imagine that the SPUs are designed specifically with all that stuff in mind...
Advanced users are users too!