Ars Technica's Hannibal on IBM's Cell
endersdouble writes "Ars Technica's Jon "Hannibal" Stokes, known for his many articles on CPU technology, has posted a new article on IBM's new Cell processor. This one is the first part of a series and covers the processor's approach to caching and control logic. Good read."
Why do I have the sneaking suspicion that, if successful, this processor will eclipse the PowerPC on the Mac in the next few years?
I want 2 of them, yesterday.
... on cell... likely?
Aside from my own (competent) review of the Cell processor, the article is possibly one of the most insightful and technically well-balanced articles posted on Slashdot in a long while!
I'll cover more of the Cell's basic architecture, including the mysterious 64-bit POWERPC core that forms the "brains" of this design.
Looking forward to that... I think that many people will be moving to Mac
Is that the 386 instruction set and architecture is so non-proprietary. What made it so popular certainly wasn't that it was better. If I had the dough, I could literally make one, and my own fab, without asking a single soul. A lot of times it seems companies try to gather into consortiums to mimic the same effect and gather market momentum, but these are doomed to failure because the more valuable the technology becomes, the greater the pressure to differentiate and fence off some "territory" for themselves. We saw this happen firsthand with UNIX, where all the flavors would constantly try to group under these unified standards, and they made little progress until Linux came along. The CPU world needs something similar to protect people from patent harassment over design, cores, and fabrication.
Who would conceivably have enough money to build microchip fabrication facilities but not enough money to license the PowerPC architecture?
"Reverse engineered implementations exist" is not really much of a meaningful strength if you don't own one such reverse engineered implementation already. You say you can potentially build a 386 chip fab, but the thing is you aren't going to build a 386 chip fab, you're going to just keep on buying Intel and AMD chips, the only noteworthy people currently making x86 chips, because if you built a 386 what would you do with it? It's a 386. The ISA has moved on.
No subsidies required. PS3 will sell enough to write its own ticket. No need to hope others pick up the slack.
Sony may be able to do that with the 65nm final design, when it arrives some time in 2006. Then we'll see.
Even then, there are other considerations that may make it a less-than-ideal fit for a general purpose computer - all those vector units are great for number crunching, but how much of that do you do each day? And when you're not, that's 3/4 of the cost of your chip sitting around idle. There are more cost-effective alternatives.
The 64-bit PPC on it has VMX. That's AltiVec, baby. Sure, the SPEs don't have the full functionality of VMX, but so what?
Read Part II of the article - it's not a full implementation of VMX (the SPEs don't have VMX at all - they have a different instruction set altogether). Hannibal believes the weak VMX implementation will be a major downside for Apple. Then there's the lack of out-of-order execution etc.
The biggest issue I see is that the Cell's design requires the programmer to have full control of the machine.
Not so. That's what operating systems are for. SPEs would be treated as a shared resource - you ask the OS to loan you one, and if you get it, you run your code on it. Or, you ask the OS to run your code, and it schedules it onto an available SPE when it can.
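To make the "ask the OS to loan you one" idea concrete, here's a rough sketch of the bookkeeping in Python. Nothing here is a real Cell or OS API: `SpePool`, `acquire`, and `release` are names made up for illustration, and the number of SPEs is just a parameter.

```python
import threading


class SpePool:
    """Hypothetical OS-side bookkeeping for a fixed set of SPEs."""

    def __init__(self, count):
        self._free = list(range(count))   # IDs of SPEs nobody is using
        self._lock = threading.Lock()     # many processes may ask at once

    def acquire(self):
        """Loan out one SPE; return its ID, or None if all are busy."""
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, spe_id):
        """Hand a loaned SPE back so someone else can use it."""
        with self._lock:
            self._free.append(spe_id)
```

A program that gets `None` back would either wait or fall back to running the code on the main core. The second model in the comment (the OS queues your code and runs it when an SPE frees up) would sit on top of the same bookkeeping.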
If that turns out to be the case, then PS2 programming is a hint towards how it'll work. On the PS2, you generally configured the DMA controller to upload mini programs to the vector units, then DMA-chained data as streams from RAM through the just-uploaded program and onto the destination (usually the GS which rasterised the display).
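The upload-then-stream pattern described above looks roughly like this toy model in Python. It is only an illustration of the data flow, not PS2 code: `VectorUnit`, `dma_chain`, and the `display_list` standing in for the GS are all hypothetical names.

```python
class VectorUnit:
    """Toy stand-in for a vector unit: holds one small program at a time."""

    def __init__(self):
        self.program = None

    def upload(self, program):
        """The 'DMA the mini program in' step."""
        self.program = program

    def stream(self, chunks):
        """Run the uploaded program over each chunk as it arrives."""
        for chunk in chunks:
            yield self.program(chunk)


def dma_chain(memory, chunk_size):
    """Yield fixed-size chunks from 'RAM', like a chain of DMA transfers."""
    for start in range(0, len(memory), chunk_size):
        yield memory[start:start + chunk_size]


vu = VectorUnit()
vu.upload(lambda chunk: [2 * x for x in chunk])   # e.g. a scale transform
display_list = []                                  # stand-in for the GS
for out in vu.stream(dma_chain(list(range(8)), 4)):
    display_list.extend(out)
```

The point is that the main CPU only sets things up; the data never passes through it on the way from RAM to the destination.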
Sounds a lot like pixel/vertex shaders. Is this how we're going to get around all our bandwidth problems now? Slice up our programs into little independent fragments and upload them to the CPU to run concurrently?
The difference is that instead of the compiler taking up the slack (as in RISC), a combination of the compiler, the programmer, some very smart scheduling software
Requiring programmers to learn how to write parallel code that makes good use of this processor seems pretty dicey to me. Few programmers have been trained to write parallel code (most struggle with threading). The fact that no popular programming language has a good parallel model is also a big stumbling block.
This problem seems to be looming for all the dual-core processors, but I haven't seen a big effort to teach programmers how to adapt.
The target market is not home users but rather scientists, animators, engineers, and others who need raw power and aren't concerned with the fact that Word won't work on it; many customers will probably have a cheap PC sitting next to it for office tasks, freeing up the workstation to do nothing but grind through computations. In this world, the various Unices are the only serious choice; SGIs run IRIX or Linux, Suns run Solaris or Linux, and IBMs run AIX or Linux.
Take into account IBM's commitment to Linux, and the fact that many of their customers already use it, and it's almost certain that Linux will be a major OS choice for Cell workstation customers, particularly those working in a mixed-architecture environment. While it's likely to run AIX and a Windows port is possible, it's almost certain that a majority of Cell workstations will be running Linux.
That's it. I'm no longer part of Team Sanity.
The problem is that a multiplier's size is roughly proportional to the square of the width of the things being multiplied. Assuming a 64-bit float's mantissa is twice the size of a 32-bit one, it's going to take 4 times the area (or twice the area of a pair of them), and of course it will eat into your cycle time (both in gates and in wire delay).
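The arithmetic behind that estimate, as a quick Python calculation. The only model here is the comment's own "area grows with the square of the width"; the 24-bit and 53-bit figures are the IEEE 754 single and double significand widths (counting the implicit leading bit), which are my addition, and real multiplier layouts will not follow the square law exactly.

```python
def relative_multiplier_area(bits, baseline_bits):
    """Area of a bits-wide multiplier relative to a baseline,
    assuming area scales with the square of the operand width."""
    return (bits / baseline_bits) ** 2


# The comment's simplification: the double mantissa is 2x the single one.
simplified = relative_multiplier_area(2, 1)

# IEEE 754 significand widths, counting the implicit leading bit.
SINGLE_BITS, DOUBLE_BITS = 24, 53
ieee = relative_multiplier_area(DOUBLE_BITS, SINGLE_BITS)

# Area vs. one single-precision multiplier, and vs. a pair of them.
print(simplified, round(ieee, 2), round(ieee / 2, 2))   # 4.0 4.88 2.44
```

So under this model the "4 times" figure is, if anything, a little generous to double precision.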
Apple at the moment is two companies. One is primarily a computer hardware company that makes software to drive hardware sales and sells the entire package as user experience. The other is a consumer electronics company. Last year, the profits made by both companies were about the same. Whether they wish to transition to being a software and consumer electronics company that also makes some niche hardware is a decision they will have to make.
I am TheRaven on Soylent News
I am not convinced by this argument. A lot of OS X code uses AltiVec, but very little actually uses it directly. Apple has spent a lot of effort producing libraries that people can use which wrap AltiVec into something higher level (e.g. QuickTime, vDSP). Most of these could potentially be ported to the SPEs. Things like CoreVideo could also make use of the SPEs.
all those vector units are great for number crunching, but how much of that do you do each day? And when you're not, that's 3/4 of the cost of your chip sitting around idle.
90% of the time, my 1.5GHz G4 is sitting at 20% utilisation or less. You could argue that 80% of the power of the chip is wasted. However, when I am doing things that tax it they are almost always things that would support a large degree of parallelism.
First, you will use a language that supports a vector type. The languages used for GPU programming do, and there is a vector extension to C supported by GCC. You will write code that manipulates vectors instead of scalars. And that's about it. You try to keep your working set small, and your compiler will try to fit it into the local memory.
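For a feel of what "manipulate vectors instead of scalars" means, here's a small Python sketch. Python has no built-in SIMD type, so `Vec4` is a hypothetical stand-in for what GCC's `vector_size` attribute or a shader language's four-wide float gives you; it shows the programming style only, not the performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec4:
    """Hypothetical 4-wide float vector; each operator works on all lanes."""
    x: float
    y: float
    z: float
    w: float

    def __add__(self, other):
        return Vec4(self.x + other.x, self.y + other.y,
                    self.z + other.z, self.w + other.w)

    def __mul__(self, scalar):
        return Vec4(self.x * scalar, self.y * scalar,
                    self.z * scalar, self.w * scalar)


def saxpy(a, xs, ys):
    """Compute a*x + y one Vec4 at a time instead of one float at a time."""
    return [x * a + y for x, y in zip(xs, ys)]
```

The loop body never touches an individual lane, which is what lets a compiler for real vector hardware map each operation to a single instruction.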
Reading the article, it reminds me of the typical mainframe architecture, where you have a central supervisory CPU, but most of the specialized work is done by the channel processors.
In the Cell, the main PPC CPU appears to identify a piece of work that needs to be done, schedule it to run on an SPE, upload the code snippet to the SPE's LS via DMA transfer, and then go off and do something else worthwhile while the SPE munches on it. I presume there's an interrupt mechanism to let the PPC know that an SPE has some results to return.
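That supervisor/worker flow can be mimicked with ordinary threads and queues. This is a sketch of the control flow only, with assumptions labeled: `spe_worker` and `dispatch` are invented names, the `done` queue stands in for the interrupt I'm presuming exists, and Python threads won't give real parallel speedup.

```python
import queue
import threading


def spe_worker(jobs, done):
    """Stand-in for one SPE: take a job, run it on a private copy, report back."""
    while True:
        job = jobs.get()
        if job is None:                 # shutdown signal from the supervisor
            break
        job_id, code, data = job
        local_store = list(data)        # "DMA" the data into local storage
        done.put((job_id, code(local_store)))   # stands in for the interrupt


def dispatch(tasks, spe_count=2):
    """Stand-in for the PPC core: hand out (code, data) pairs, collect results."""
    jobs, done = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=spe_worker, args=(jobs, done))
               for _ in range(spe_count)]
    for w in workers:
        w.start()
    for job_id, (code, data) in enumerate(tasks):
        jobs.put((job_id, code, data))
    # The supervisor would be free to do other work here.
    results = dict(done.get() for _ in tasks)
    for _ in workers:
        jobs.put(None)
    for w in workers:
        w.join()
    return [results[i] for i in range(len(tasks))]   # back in submission order
```

For example, `dispatch([(sum, [1, 2, 3]), (max, [4, 9, 2])])` hands two jobs to the workers and returns the results in submission order. A job that raises would leave the supervisor waiting forever, so a real design needs an error path as well.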
Compiler writers ought to be able to handle this new architecture well enough -- it's sort of like the current CPU/GPU split, where you've got the main program running on the system CPU, and specialized graphical transform programlets running on the GPU. There may need to be macros or code section identifiers in the source to let the compiler know which to target for that bit of code.
Obviously, this is just the first iteration of the Cell processor. I can see them widening the SPE from single precision to double precision (for the scientific market -- the game market probably doesn't need it), and going to a multi-core design to reduce the die size.
Chip H.
Comparing it with trying to work with threads definitely brings up nightmare conditions. But I don't think it has to be a nightmare. We use mammoth parallelization all the time and with great success. We hand off all the rendering chores to the GPU when we give it a pointer to data and say "hey, display this", or more modernly a bunch of vectors n' stuff to send down the hardware accelerated pipeline.
The Cell hardware has the capability to get a developer in trouble, especially if you're trying to write data concurrently, and because you started from a design not specifically made for this chip. But if you focus on pipelines, with a design to avoid simultaneous writes, a lot of problems should vanish, and I believe this is the path people will take, if only because everyone seems to be viewing it as a glorified vector processor from a GPU.
That last myth is a good one. I had no idea!