Prospects For the CELL Microprocessor Beyond Games
News for nerds writes "The ISSCC 2005, the "Chip Olympics", is over and David T. Wang at Real World Technologies has put up a very objective review of the CELL processor (the slides for the briefing are also available), covering all the aspects disclosed at the conference. Besides the much touted 256 GFlops single-precision floating point performance the CELL processor has 25-30 GFlops in double-precision, which is useful enough for scientific computation. Linus seems interested in CELL, too."
Transmeta isn't doing the low heat processors anymore. Quoted from http://arstechnica.com/news.ars/post/20050105-4501.html.
CPU manufacturer Transmeta, known for their low-power processors, is evaluating an exit from the CPU market. Instead of manufacturing chips themselves, their business focus would shift towards buzzwords: licensing their intellectual property and the formation of strategic alliances to utilize their processor design as well as their research and development skills.
It isn't a POWER5. It is more like a 64-bit variant of the 750VX with SMT, a chip that never appeared but otherwise looks rather similar to what has been described as the PowerPC portion of Cell.
From what I've seen, it will be rather low horsepower compared to the current G5s, since it will be lacking deep pipelines, caches and other bits that give the G5 much of its speed. That's not to say that it's not really a G5, it sounds like it will support the full G5 instruction set (including Altivec) and be a true 64-bit processor core, just not a particularly fast one.
The role of the G5 cores seems to be to handle higher order logic that prepares and parses out tasks to the very fast vector units (SPEs).
So it probably does make more sense to have it as a coprocessor in a Mac, at least until compilers and software writers routinely target the cell's SPEs -- if that day ever comes. More likely specialized code will need to be written, and particular subtasks pulled out.
I suspect things like physics libraries, sound & video processing libraries, plus apps like SETI@home would be quickly written to use the SPEs, but most other software wouldn't be.
Here you go, I found the source for you :)
liqbase
What good is a new chip, no matter how fast it is, if you can't run anything on it?
There is this really neat group of operating systems called Unix/Linuxes. They have a major advantage in that you only need a small amount of assembler to get going on a new chip, then the rest can be ported over in C/C++. This has been the situation for decades - Unix (and now Linux) has been the initial OS for almost all new chips.
How fast will this chip be at general purpose stuff? Who cares if it can do 100GFLOPS on a couple operations.
Reasonable point, but FLOPs are a good general measure of the speed, as they are pretty complex operations. We all used to measure speed in MIPS (Million Instructions Per Second), but as chips got so diverse, one chip's instruction could not be easily compared with another's (particularly if RISC chips were involved, where the instructions could be very minimal). FLOPs are a better measure, as a divide is a divide and a multiply is a multiply no matter what chip architecture you use.
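For what it's worth, here is roughly where a headline peak-FLOPS number comes from. This is only a sketch: the function name is made up, and the inputs (8 SPEs, 4 single-precision lanes each, a fused multiply-add counted as 2 flops per lane per cycle, a 4 GHz clock) are assumptions chosen to reproduce the 256 GFlops figure from the blurb, not numbers taken from this thread.

```python
def peak_gflops(units, lanes, flops_per_lane_per_cycle, clock_ghz):
    """Theoretical peak in GFLOPS: every lane of every unit busy on every cycle."""
    return units * lanes * flops_per_lane_per_cycle * clock_ghz

# Assumed figures (see the note above): 8 SPEs, 4 single-precision lanes each,
# 2 flops per lane per cycle (one fused multiply-add), 4 GHz clock.
cell_sp = peak_gflops(units=8, lanes=4, flops_per_lane_per_cycle=2, clock_ghz=4.0)
print(cell_sp)
```

A peak like this assumes nothing ever stalls, which is exactly why it says little about general-purpose speed.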
It would be compatible with PowerPC software.
Which means that the vast majority of the software I use every day would work just fine on it.
Although it would be slow... Cell isn't optimized for general-purpose work, and the extra SPEs add another 128 registers to the PowerPC and VMX ISAs, which wouldn't get used by normally compiled PowerPC code.
You would have to have GCC worked over to produce 'vectorized' code that uses as many of these SPEs as possible for single-threaded applications, and even then you wouldn't get much more performance out of it than out of a normal G4-class PowerPC processor.
Then you have memory management problems to work out, probably through an extensive firmware-based controller, which would add to execution time and slow things down a little bit more.
The advantage would be that if I were doing extensive multimedia or 3D work, or special types of scientific research, I could use a familiar environment (Linux) as a platform to run special applications that would themselves benefit from the tremendous performance capabilities of a few of these Cells.
It would make a great chip for an embedded multimedia player (at lower clock rates) and would be great for something like a non-linear video editor, but a Wintel killer it definitely wouldn't be.
It would probably be somewhat useful for normal desktop usage, as more and more applications are multimedia in nature, but it's not going to be substantially faster than an Intel or AMD processor to the end user.
Well, one reason the PS2 sold like hot cakes was that it was one of the cheapest DVD players at that time (at least in Japan). There is media player software available and it's quite popular; the reason it isn't an internet set-top box is that no one wants internet set-top boxes. They died a painful death. Now, there's no EE desktop PC because it's too slow, but the differences between Cell and PS2 in this regard are:
(a) Cell was co-designed by IBM, which has an interest in selling workstations etc. with that chip; Sony didn't, as it's not their business
(b) Cell is designed for multiprocessor environments, so if it becomes too slow for a task you can simply throw more processors at it
(c) In 2000 clock speeds still doubled every 18 months; that has stopped. x86 is going the way of multiple cores too, so programmers will have to get used to parallel design anyway
That doesn't mean it will replace x86 or even make a dent but it means that contrary to the EE it's designed for such stuff and one of the companies behind it sells specialized workstations so it's at least a possibility.
And this time you can find more credible sources than CNET (CNET's part of the yellow press of computer news sites. Almost as bad as yahoo news) who'll tell you that.
Don't think of it as a flame---it's more like an argument that does 3d6 fire damage
They licensed technology from Rambus.
It seems like Cell will have more memory bandwidth than the processors commonly used today. From this article:
" The memory and processor bus interfaces designed by Rambus account for 90% of the Cell processor signal pins, providing an unprecedented aggregate processor I/O bandwidth of approximately 100 gigabytes-per-second. "
SIGFAULT
It seemed there was a lot of misinformation/confusion going around because some people heard it supported DP floats and some people heard it used Altivec (which doesn't support DP). So half the people extrapolated that IBM had ditched Altivec (i.e. VMX), and the other half assumed there was no DP support... both of which angered people. The truth (according to this article) is that it uses BOTH: A version of VMX that supports DP. whew!
The article also points out that the SP floats aren't truly 754-compliant, as they round-toward-zero on cast to int. This makes it compatible with that horrible C/C++ truncation cast (If anyone knows why C opts to round-toward-zero, please let me know!). However, rest assured, DPs are 754-compliant.
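To make the rounding modes concrete, here is a small sketch comparing round-toward-zero (what a C-style cast to int does) with rounding toward minus infinity and rounding to nearest. The helper name c_style_cast is made up for illustration, and this only shows the arithmetic, not anything about how the SPEs implement it.

```python
import math

def c_style_cast(x):
    """Mimic a C (int) cast: drop the fractional part, i.e. round toward zero."""
    return math.trunc(x)

samples = [2.7, -2.7, 0.5, -0.5]
truncated = [c_style_cast(x) for x in samples]  # round toward zero
floored = [math.floor(x) for x in samples]      # round toward minus infinity
nearest = [round(x) for x in samples]           # round to nearest, ties to even

print(truncated, floored, nearest)
```

The modes only disagree on negative values and on ties, which is where truncation surprises people.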
Also, the article suggests that there is a memory limit (at least initially) of 256MB:
The maximum of 4 DRAM devices means that the CELL processor is limited to 256 MB of memory, given that the highest capacity XDR DRAM device is currently 512 Mbits. Fortunately, XDR DRAM devices could in theory be reconfigured in such a way so that more than 36 XDR devices can be connected to the same 36 bit wide channel and provide 1 bit wide data bus each to the 36 bit wide point-to-point interconnect. In such a configuration, a two channel XDR memory can support upwards of 16 GB of ECC protected memory with 256 Mbit DRAM devices or 32 GB of ECC protected memory with 512 Mbit DRAM devices.
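The 256 MB figure in that quote is just devices times density, converted from bits to bytes. A quick sketch (the function name is made up; it does not try to reproduce the 16 GB and 32 GB ECC configurations, since the quote doesn't give enough detail for that):

```python
MBITS_PER_MBYTE = 8

def capacity_mbytes(devices, mbits_per_device):
    """Total capacity in megabytes for a number of equally sized DRAM devices."""
    return devices * mbits_per_device // MBITS_PER_MBYTE

# 4 XDR devices at 512 Mbit each, as in the quoted passage.
base_config = capacity_mbytes(devices=4, mbits_per_device=512)
print(base_config)
```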
IF they write/pick the OS/software for the Cell appliances correctly I could see it making some headway as a desktop replacement.
Which is the key, exactly. As Linus wrote in one of his linked forum posts (from the blurb) it's gonna be a pain to program general purpose for those vector units (SPEs).
However, judging from the main review, it doesn't look like the PowerPC Element was castrated too much. It looks like it'll suffer from Pentium4 syndrome (boosting the frequency doesn't do as much as it used to) so it might not be as good as an equally clocked Power5 based processor, but I think you're looking too much at the SPEs when considering whether or not it'll compete with the x86 and Power5.
Right now, there aren't x86 and Power5 chips at 4+ GHz, and looking at Intel and AMD's roadmaps, there probably won't be for quite a while. Even if this thing is horribly inefficient for general tasks, it'll be great for Graphical/Video work, great for Physics/Scientific work, and probably at least as fast for everything else as a single core P4 3.8 GHz (which does a better job melting candles than it does holding them, most of the time).
You may not like Michael Kanellos usually, but I think he's hit the nail on the head here.
This is a bigger, hotter, less stable chip with an exotic and hard to write-for architecture. That's fine for a gaming system with a dedicated revenue stream and no competition. It's not gonna make it outside that domain.
Why do you think they licensed the XDR interface from RAMBUS?
There are 2 dual XDR interfaces. Each interface is running at 6.4 GB/s. So 4*6.4 = 25.6 GBytes/sec.
So the CELL memory design is at least 4 times faster than current DDR2 memory systems.
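Spelled out, the arithmetic looks like this. The XDR numbers are the ones given above; the comparison baseline is my assumption (a single 64-bit DDR2-800 channel, 800 MT/s times 8 bytes), so the ratio will differ if you compare against a different DDR2 speed grade or a dual-channel setup.

```python
def aggregate_gb_per_s(interfaces, gb_per_s_each):
    """Total bandwidth when every interface transfers at full rate."""
    return interfaces * gb_per_s_each

# From the post: 2 dual XDR interfaces = 4 interfaces at 6.4 GB/s each.
xdr_total = aggregate_gb_per_s(interfaces=4, gb_per_s_each=6.4)

# Assumed baseline, not from the post: one 64-bit DDR2-800 channel.
ddr2_channel = 800e6 * 8 / 1e9   # transfers per second * bytes per transfer

ratio = xdr_total / ddr2_channel
print(xdr_total, ddr2_channel, ratio)
```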
Substantial changes, maybe. Expensive? Perhaps not. This all depends on the base assumptions from which you operate. One of the fundamental assumptions in today's existing systems is that any and all work should be done to maximize the utilization of the CPU. However, when considering how to design other types of systems, such may not be true (it may make sense to minimize the memory footprint, for example).
If you've ever done some detailed algorithm work, you will quickly realize that there are many algorithms where you can make tradeoffs between memory and CPU time. The 'simplest' of these are the algorithms that are breadth first vs. depth first, which can trade off exponential in memory vs. exponential in time. [For a 'trivial' example, try forming the list of all operational assignments containing 6 variables and which use %, +, -, *, /, ^, &, ~, and ()... less than 50 lines of Perl and you'll quickly blow through the 32-bit memory limit if written breadth first, or take overnight to run depth first]
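A toy version of that tradeoff, as a sketch only: it enumerates operator strings rather than real expressions, and the function names are made up. The depth-first generator holds only the current prefix at any moment, while the breadth-first version materializes each whole level, so its peak memory grows exponentially with depth. It only shows the memory side of the tradeoff; both versions do the same amount of work here.

```python
def depth_first(alphabet, depth, prefix=""):
    """Yield every string of length `depth`, holding only the current prefix."""
    if len(prefix) == depth:
        yield prefix
        return
    for symbol in alphabet:
        yield from depth_first(alphabet, depth, prefix + symbol)

def breadth_first(alphabet, depth):
    """Build every level in full; return the last level and the peak level size."""
    level = [""]
    peak = 1
    for _ in range(depth):
        level = [prefix + symbol for prefix in level for symbol in alphabet]
        peak = max(peak, len(level))
    return level, peak

ops = "+-*/"
dfs_count = sum(1 for _ in depth_first(ops, 6))   # nothing is stored
bfs_level, bfs_peak = breadth_first(ops, 6)       # whole level kept in memory
print(dfs_count, bfs_peak)
```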
The significant question which has been brought up - and which remains unanswered - is what software development tools will be made available. Once this is better answered, we will all be in a better position to determine what fundamental assumptions have been changed, and therefore how we can follow the new assumptions through to conclusions about the net performance of the processor and machine in which it is contained.
POWER5 is not the same as PowerPC 970 (G5). POWER5 is a really really expensive high performance mainframe chip. G5 is a server/desktop chip.
If you're going to rip the links out of one of my Ars news posts and submit them to slashdot (in the same order in which I linked them, no less), then at least credit your source.
Senior CPU Editor | Ars Technica | http://arstechnica.com/
This is essentially what happened with the PS2. 1st gen game teams thought the compiler would handle more of the task of keeping the vector units and GS busy. Didn't happen. However, PS2 teams have learned a lot of valuable lessons in the past 5 years to prepare for this jump. PC developers are going to have a horrible time trying to get performance out of the Cell.
Most notably, devs learned that the PS2 is bus-bound. With only 16 KB caches, memory layout is paramount to avoid requisitioning the bus every 100 cycles to refill both caches just for a vtable look-up and jump.
So the Cell forces programmers to think in this paradigm. No caches, just 256 KB of local storage for threads. So your performance will only suffer if you fail to learn the new principles... no more cache-agnostic coding. No more memory accesses to any random place in memory.
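A rough sketch of what that style of coding looks like: the work is cut into tiles that fit a small local buffer, and the next tile is requested before the current one is processed (double buffering). Everything here is a stand-in. dma_get is a hypothetical function that just copies a slice, LOCAL_STORE_ITEMS is a made-up budget, and in plain Python the "transfer" does not actually overlap with the compute the way it would on real hardware.

```python
LOCAL_STORE_ITEMS = 1024   # hypothetical budget, not a real Cell figure

def dma_get(main_memory, start, count):
    """Stand-in for a DMA transfer: copy a slice into a local buffer."""
    return main_memory[start:start + count]

def sum_of_squares_tiled(main_memory, tile=LOCAL_STORE_ITEMS // 2):
    """Walk main memory tile by tile using two buffers (double buffering)."""
    total = 0
    next_buf = dma_get(main_memory, 0, tile)   # prefetch the first tile
    for start in range(0, len(main_memory), tile):
        current = next_buf
        # Request the following tile before computing on the current one;
        # on real hardware this transfer would overlap with the compute.
        next_buf = dma_get(main_memory, start + tile, tile)
        total += sum(x * x for x in current)
    return total

data = list(range(10_000))
print(sum_of_squares_tiled(data))
```

The tile is half the budget because two buffers have to be resident at once.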
Then programmers figured out that the VU0 was basically sitting dormant the entire time. That's a third of the total processing power wasted. Little by little, they started moving tasks to the VU0-- skeletal animation, particle dynamics. The problem was that the VU0 only had 4 KB of local memory, so between loading a microprogram, double-buffering memory for DMA in/out, and running the damn thing, the EE couldn't do anything useful besides babysit.
The VU1, OTOH, was a totally different beast. It took over responsibility for the T&L stage in rendering. It had 16 KB of memory and could consume a chained DMA stream of microprograms to run and the associated memory (basically a series of display lists). Once you wrote your display-list chain to a buffer and began DMA, it required no babysitting at all. Without a doubt, you can see where the design decisions behind the Cell are coming from. The PS2 was basically just a prototype Cell to see what worked.
And here are the results. They ditched the VU0, multiplied the number of VU1s by 8, and gave each one 16x the memory, jacked up memory bandwidth. The EE (PPC) is now officially the arbiter of threads... Except now the SPEs are capable of generating execution chains-- i.e. one produces, another consumes, so the PPC doesn't need to have all the brains.
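The "one produces, another consumes" idea can be sketched with ordinary threads and queues. This is only an analogy for the execution chains described above: the names stage, run_chain and DONE are made up, and nothing here reflects how the SPEs actually pass work to each other.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(work, inbox, outbox):
    """Consume items from inbox, apply work, hand results to the next stage."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)   # pass the end marker down the chain
            return
        outbox.put(work(item))

def run_chain(items, stages):
    """Wire the stages together with queues and collect the final output."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

chained = run_chain(range(5), [lambda x: x + 1, lambda x: x * 2])
print(chained)
```

Once the chain is wired up, the caller only feeds the first queue and drains the last one, which is the "doesn't need to have all the brains" part.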
Another interesting thing is that while certain large portions of the render loop need to be executed serially (Game Logic, Animation, Collision/Dynamics, then Render), many operations within those categories can be parallelized. For instance, devs have resorted to huge hacks to make AIs look like they're running in threads when they're really not. It's simply been the case that games were definitely only running on one processor, and context switches are expensive. It's actually much more convenient to have multiple cores and turn these hacks into actual LWPs. The real question is how Sony/IBM is going to handle concurrency. Are they going to write a special pthreads for the PPC threadmaster? Are they going to use chainable microprograms, where the PPC is just a glorified VIF? That is the big question eating at me right now.
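For anyone who hasn't seen the "looks like threads but isn't" trick, one common form is cooperative scheduling: each AI routine does a small step and then yields, and a loop on a single processor takes turns between them. This is a generic sketch with made-up names (agent, round_robin), not a description of what any particular game does.

```python
from collections import deque

def agent(name, steps):
    """A hypothetical AI routine that yields control after each step."""
    for i in range(steps):
        yield f"{name}:{i}"

def round_robin(tasks):
    """Run generators one step at a time on a single thread, in turn."""
    ready = deque(tasks)
    log = []
    while ready:
        task = ready.popleft()
        try:
            log.append(next(task))
        except StopIteration:
            continue           # task finished, drop it from the rotation
        ready.append(task)
    return log

trace = round_robin([agent("guard", 2), agent("scout", 3)])
print(trace)
```

With real cores available, each of these routines could become an actual lightweight process instead of taking turns.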
But if you look at the latest games-- Jak3, GT4... they're hitting pretty much near the theoretical limit of the PS2, no matter how unlikely that seemed just 5 years ago. The first half of the Cell learning curve has already been traversed by those brave PS2 freaks, it's up to the rest of us to learn from where they've been.