Ars Technica's Hannibal on IBM's Cell
endersdouble writes "Ars Technica's Jon "Hannibal" Stokes, known for his many articles on CPU technology, has posted a new article on IBM's new Cell processor. This one is the first part of a series, and covers the processor's approach to caching and control logic. Good read."
Why do I have the sneaking suspicion that, if successful, this processor will eclipse the PowerPC on the Mac in the next few years?
I want 2 of them, yesterday.
... on cell... likely?
Aside from my own (competent) review of the Cell processor, the article is possibly one of the most insightful and technically well-balanced articles posted on Slashdot in a long while!
I'll cover more of the Cell's basic architecture, including the mysterious 64-bit POWERPC core that forms the "brains" of this design.
Looking forward to that... I think that many people will be moving to Mac
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
Part II is up as well.
.. made of risc components.
" Last fall, IBM and Sony said they were developing a workstation based on Cell chips, which is the first product IBM will ship based on Cell."
Regardless of whether this is the first product shipped or not, a workstation is coming. I can't see it running anything but Linux. Given the mass-market targeting of the Cell, I hope Sony makes a strong go at grabbing the market with cheap hardware, rather than trying to milk the high-end content creation market first.
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
e.g. 234 M transistors (!) That's why I don't think this will be replacing the G5 any time soon. The die size (at the current prototype's 90nm) is over 200 mm².
It'll have to get a fair bit smaller/cheaper before the PS3 can use it without major subsidies, and I don't know why they think general consumer devices will want it. God knows how much power it dissipates with all 8 SPEs clocking over at 4 GHz...
Why would anyone engrave "Elbereth"?
Thank god. I've enjoyed his articles in the past, and if experience is any indication, I will have the false impression that I understand this stuff in a nontrivial way for up to three hours. This is not meant to rag on Hannibal, BTW.
With a name like that, I expect to see pictures of him eating those Cell processors, and describing how they taste.
// file: mice.h
#include "frickin_lasers.h"
...clicking on this link also attempts to install a trojan (SARC's name: ByteVerify). I agree: this link should be removed and the poster's IP should be reported to the relevant authorities.
Although the article (which is quite clear) indicates that the AltiVec architecture is closer to G4 than G5, won't the speed increase of having 8 fully-parallel processors (9 if you count the main CPU) more than make up for the issues associated with the loss of the G5's advanced features? It seems to me that this is a natural for Apple - it will give them a 5x - 10x performance boost over anything that's on the drawing boards over at Intel.
Even so, I doubt we'd see Cell-based Macs until at least 2007 - but wouldn't it be great to run PS3 games on your Mac? (As if that'll ever happen.) But then again, given the Cell architecture, your PS3 could use your Mac to make its games run faster! A whole new reason to have an XServe-based supercomputer...
The one thing I don't understand is how I would code for this thing. As best as I understand it, I now have some instructions for controlling the cache (or LAM, whatever) which sounds cool, but are there any details yet of how I'd write code for this? I'm also disappointed that the article didn't explain how one would use their SIMD instructions if they aren't using any of the existing standards. So I load my vectors with the cache control and ask the processors to ever so kindly add them?
Can anybody out there with experience on this architecture, or who even attended the presentation itself, give us mere coders details? Preferably a website.
The bitter lessons of a veteran coder: http://bitterprogrammer.blogspot.com
Is that the 386 instruction set and architecture is so non-proprietary. What made it so popular certainly wasn't that it was better. If I had the dough, I could literally make one, and my own fab, without asking a single soul. A lot of times it seems companies try to gather into consortiums to mimic the same effect and gather market momentum, but these are doomed to failure because the more valuable the technology becomes, the greater the pressure to differentiate and fence off some "territory" for themselves. We saw this happen first hand with UNIX, where all the flavors would constantly try to group under these unified standards, and they made little progress until Linux came along. The CPU world needs something similar to protect people from patent harassment for design, cores, and fabrication.
What I find interesting is that the vector processors are restricted to single-precision floating-point calculations.
This isn't terribly useful for scientific computations (there is the same problem with GPUs): currently the IEEE is working on a standard for 128-bit precision floating-point calculations!
Of course for 3D, video and sound, 32-bit precision is good enough, and *if* programmers (a big if) manage to overcome the pain of 'parallel programming' then it could be a big success.
Cradle Semiconductor has been working for a while on a similar technology.
Of course, it's all a matter of scale - TI had a 4 DSP, 1 CPU processor a while ago, but it only made 100 MFLOPS. Cradle's first product has 8 DSPs and 6 CPUs - depending on whether you can get your data to pipeline properly through the processors, you can achieve up to 3.6 GFLOPS peak with only a 230 MHz clock.
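As a quick check of what that peak figure implies, the arithmetic is simply peak FLOPS divided by clock rate. The sketch below uses only the two numbers quoted above; the function name `flops_per_cycle` is made up for illustration, and dividing the result by the 8 DSPs assumes (without any confirmation from Cradle) that the DSPs do all the floating-point work.

```c
/* Operations per clock cycle implied by a peak-FLOPS figure.
   Both arguments are in base units: FLOPS and Hz. */
double flops_per_cycle(double peak_flops, double clock_hz)
{
    return peak_flops / clock_hz;
}
```

Called with 3.6e9 and 230e6, this comes to roughly 15.65 operations per cycle for the whole chip, or a little under 2 per DSP under the assumption above.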
HIV Crosses Species Barrier... into Muppets
Who would conceivably have enough money to build microchip fabrication facilities but not enough money to license the powerpc architecture?
"Reverse engineered implementations exist" is not really much of a meaningful strength if you don't own one such reverse engineered implementation already. You say you can potentially build a 386 chip fab, but the thing is you aren't going to build a 386 chip fab, you're going to just keep on buying Intel and AMD chips, the only noteworthy people currently making x86 chips, because if you built a 386 what would you do with it? It's a 386. The ISA has moved on.
Hurd might be an interesting candidate for running on Cell because of the highly threaded design. Hurd servers might be able to swap in and out of cells as they require cycles. It seems a good match; i.e. L4 runs in the main core, and various translators and other processes run on the cells. If a cell could be programmed to run the filesystem, for instance, it would totally free up the core for other business.
Because the PS3 will have a highly fixed hardware set, implementing a minimal driver set might be feasible given enough reverse-engineering effort.
I'm not saying that L4/Hurd will kick the nuts off of Linux on an Opteron, I'm just noting that it might be pretty cool to experiment with Hurd on Cell technology. The L4/Hurd team is real close to getting the last pieces in place to compile Mach-based Hurd under L4, and if you ever tried Debian GNU/Hurd, you know it's pretty near feature-complete and a pretty neat system to run. The next task for L4/Hurd is a driver infrastructure, and it might be wise to look at what Cell is bringing to the table before it gets too far along. Know what I mean?
Clickety Click
Linux on Intel: Think Dead Man Walking and Grid vs. SMP: The Empire Tries Again and Fast, Faster and IBM's PlayStation 3 Processor.
Another article on the Cell design at http://www.theregister.co.uk/2005/02/03/cell_analysis_part_two/ seems to indicate that there is some sort of DRM built in.
Hannibal doesn't say anything about this (that I noticed) - anyone have more info?
Don't save Windows XP! http://www.petitiononline.com/jjw1xp/petition.html
This RAM functions in the role of the L1 cache, but the fact that it is under the explicit control of the programmer means that it can be simpler than an L1 cache. The burden of managing the cache has been moved into software, with the result that the cache design has been greatly simplified. There is no tag RAM to search on each access, no prefetch, and none of the other overhead that accompanies a normal L1 cache. The SPEs also move the burden of branch prediction and code scheduling into software, much like a VLIW design.
Why? The reason for the instruction window was to simplify software development.
Of course, I like to play devil's advocate with myself, so I'll answer that question.
The purpose of the Cell processor is to enhance home appliances, which have a greater reliance upon low latency than they do on precision, accuracy, and performance bandwidth. Thus, one can very safely say that the Cell processor will likely have little purpose in scientific calculations.
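To make the software-managed local store from the quoted passage concrete, here is a minimal C sketch. It assumes nothing about the real SPE instruction set: `local_store`, `dma_get`, `sum_streamed` and the buffer size are all hypothetical, and `memcpy` merely stands in for a DMA transfer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an SPE local store: a small, fixed buffer
   that the program fills explicitly. No tags, no hardware prefetch. */
#define LS_FLOATS 1024
static float local_store[LS_FLOATS];

/* Hypothetical "DMA get": copy n floats from main memory into the
   local store. Real hardware would issue a DMA command; memcpy is
   only an illustration. The caller must keep n <= LS_FLOATS. */
static void dma_get(const float *main_mem, size_t n)
{
    memcpy(local_store, main_mem, n * sizeof(float));
}

/* Sum a large array by staging it through the local store one chunk
   at a time. The program, not the hardware, decides what is resident. */
float sum_streamed(const float *main_mem, size_t count)
{
    float total = 0.0f;
    size_t done = 0;
    while (done < count) {
        size_t chunk = count - done;
        if (chunk > LS_FLOATS)
            chunk = LS_FLOATS;
        dma_get(main_mem + done, chunk);
        for (size_t i = 0; i < chunk; i++)
            total += local_store[i];
        done += chunk;
    }
    return total;
}
```

The point of the sketch is the division of labor: every byte that reaches the compute loop got there because the code asked for it, which is the burden the article says has moved into software.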
Sony may be able to do that with the 65nm final design, when it arrives some time in 2006. Then we'll see.
Even then, there are other considerations that may make it a less-than-ideal fit for a general purpose computer - all those vector units are great for number crunching, but how much of that do you do each day? And when you're not, that's 3/4 of the cost of your chip sitting around idle. There are more cost-effective alternatives.
64-bit PPC on it has VMX. That's Altivec, baby. Sure, the SPEs don't have the full functionality of VMX but so what.
Read Part II of the article - it's not a full implementation of VMX (the SPEs don't have VMX at all - they have a different instruction set altogether). Hannibal believes the weak VMX implementation will be a major downside for Apple. Then there's the lack of out-of-order execution etc.
The biggest issue I see is that the Cell's design requires the programmer to have full control of the machine.
Not so. That's what operating systems are for. SPEs would be treated as a shared resource - you ask the OS to loan you one, and if you get it, you run your code on it. Or, you ask the OS to run your code, and it schedules it onto an available SPE when it can.
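A minimal sketch of the "loan me an SPE" idea, assuming the simplest possible bookkeeping: one busy flag per SPE. The names `spe_acquire` and `spe_release` are hypothetical, not any real OS interface, and a real kernel would also need locking and scheduling policy.

```c
#include <stdbool.h>

#define NUM_SPES 8  /* the design discussed in this thread has 8 SPEs */

/* Hypothetical bookkeeping an OS might keep: one busy flag per SPE. */
static bool spe_busy[NUM_SPES];

/* Ask for an SPE. Returns its index, or -1 if all are busy. */
int spe_acquire(void)
{
    for (int i = 0; i < NUM_SPES; i++) {
        if (!spe_busy[i]) {
            spe_busy[i] = true;
            return i;
        }
    }
    return -1;
}

/* Hand an SPE back. Out-of-range ids are ignored. */
void spe_release(int id)
{
    if (id >= 0 && id < NUM_SPES)
        spe_busy[id] = false;
}
```

A caller that gets -1 back would either wait or fall back to running the work on the main core.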
gcc autovectorization page.
A proposal for Apple
I don't have an account, but this is an honest idea.
Why doesn't Apple include a PlayStation 2 support card in their Macintosh line?
Problem: The OSX platform has almost no games. I own several macs, I love my macs, and I sincerely enjoy OSX. But it has no games, and that will never get better, especially as simpler games migrate to the web and the complex ones bail for the console market. The PC gaming market has essentially peaked.
Solution: Embed (or include as a BTO option) a PS2 chipset in a Macintosh. Run the generated display straight through to the graphical overlay plane. Done.
Everything works. The controllers are trivially converted to use USB. The DVD drive is already there. The display is already there. The USB and FireWire are already there. The hard drive is already there. The "memory cards" are already there.
Reason: The Macintosh game library explodes instantly to encompass something like 3,000 PS1 and PS2 games. With no need for emulation, the games are guaranteed to work out of the box and provide the Apple ease of use everyone loves. Sony increases their marketshare, Apple gets a viable expanding game library, and users get a vastly better gaming experience on OSX for maybe $40 of parts and engineering.
Why won't this work?
The difference is that instead of the compiler taking up the slack (as in RISC), a combination of the compiler, the programmer, some very smart scheduling software
Requiring programmers to learn how to write parallel code that makes good use of this processor seems pretty dicey to me. Few programmers have been trained to write parallel code (most struggle with threading). The fact that no popular programming language has a good parallel model is also a big stumbling block.
This problem seems to be looming for all the dual-core processors, but I haven't seen a big effort to teach programmers how to adapt.
New consoles are sold at a loss, but there's a limit to how much of a loss companies can take. If the CPU itself ends up costing Sony $300+, they'd be looking at a massive loss on the consoles, probably larger than they are willing to take. That was actually a noted problem with the Xbox: the loss per unit was large, so they had to sell quite a few games per unit to make it up. I'm not even sure if they made any money on it.
Well, in MS's case, they can pull shit like that. Microsoft makes loads of cash off their software division, and has loads already in the bank. They can afford to operate a new division at a loss, even a pretty substantial loss (if the Xbox division did lose money, it wasn't a large amount).
Sony, not so much. Their PlayStation division is their biggest money maker these days. So they can afford to take a loss on console hardware, but only so much that they know they'll make it back on games. They can't risk operating the division at a loss because it'd spell serious trouble for the company. They also aren't flush with cash. They've about $10 billion, but have $12 billion or so in debt (Microsoft has $34 billion and no debt to speak of). They have to keep the money rolling in or things get ugly.
Also, we know from history that having the fastest processor or shiniest graphics isn't what wins a given round of the console wars. It's all about games, and perception.
Now who knows on pricing at this point, but the grandparent has a good point. That is a massive god damn die, like P4EE sized or so. Hot and expensive. As die size goes up, so do failure rates and thus cost, especially at high clock speeds. Hence why the EEs cost so damn much. I'd say it's a safe bet that this Cell processor isn't going to be cheap.
From the sounds of it, it's not going to need to be. Sounds like it's a high end calculation chip for badass number crunchers. Given that Power4/5s and Itanium 2s are popular for that sort of thing, people in those apps won't bat an eye at a $1000+ price tag.
A budget-class PC laptop of that time might have been about 900 MHz to 1.1 GHz. I wouldn't consider such a laptop anything near useable. They tended to have poor quality sound systems that bottlenecked the processor and atrociously short battery times. The ibook was legendary for its excellent battery performance
Get off what you 'assume', assumption is just intuition for idiots.
We have 200 MHz test laptops with 80 MB of RAM and 5 GB hard drives, released in 1997, all running Windows XP Professional (yes, even with the themes turned on), and they benchmark faster than they did when they shipped with Windows 95.
Secondly, they can do full 30fps video as long as it is uncompressed AVI or even WMA 9. QuickTime (MPEG4), MPEG2, and Real stutter horribly on video playback, unfortunately.
As for battery, I don't know; these laptops hold for 3 hrs on a single charge, and yes, techs are REQUIRED to use them daily in test scenarios and have no problems doing so.
Now if you really want to compare laptops to laptops, why don't I show you our 900 MHz AMD Compaq laptops? They have JBL sound systems in them, and there isn't a single feature they cannot perform with the exception of running a T&L-based video game, as the integrated video doesn't handle it. Oh wait, the 900 MHz PowerBook video didn't support such features either. (BTW, this is not to say that there are not several 900-1000 MHz class laptops that have upper-end video features; I am just using what we have in our test labs for comparison.)
The 900 MHz laptop has a DVD/CDRW and came out in late 2000 or early 2001 (trying to remember if we got them before the holidays or not). They do full software DVD decoding with less than 20% CPU utilization and do pretty much anything we throw at them fairly fast. We even have a beta version of Windows 2003 Server running on one with 256 MB of RAM. (Yes, we are always pushing the limits, but it works as fast as the Windows XP Pro version of the machine sitting next to it.)
Now off my rant... Macs truly are great, and the PowerBooks of the time were great, but that DOES NOT MEAN they were the BEST, WILL ALWAYS BE THE BEST, or you should be complacent listening to Apple tell you what you are getting is the best when it might not be. It is time for us as MAC users to stand up and DEMAND that technology becomes as much a part of what a MAC is as the EASE of USE in the Interface.
The time is now, we need to STOP accepting what they tell us and give us and force them to truly give us the LATEST technological concepts, not just the above average concepts when compared to the PC world. These are Macs, they SHOULD BE BETTER. It shouldn't even be subject to debate; they should be so far advanced that a debate should not be possible. PERIOD.
Sadly, it just isn't true now, and has not been for many years. OSX has given the Mac world some credibility in backing OS technology, but now Apple needs to take Macs to the next level.
Even if my comment inspires one Mac user to say hey Apple, we want better, then maybe we all can be the symbolic person with the hammer from their 1984 video and WAKE THEM UP this time.
Not quite. The Cell is 9 complete yet simple CPUs in one. Each handles its own tasks with its own memory. Imagine 9 computers each with a really fast network connection to the other 8. You could probably treat them as extra vector processors, but you'd then miss out on a lot of potential applications. For instance, the small processors can talk to each other rather than work with the PowerPC at all.
Hardly. Sony is following the same game plan as they did with their Emotion Engine in the PS2. Everyone thought that they were losing 1-200 bucks per machine at launch, but financial records have shown that besides the initial R&D (the cost of which is hard to figure out), they were only selling the PS2 at a small loss initially, and were breaking even by the end of the first year. By fabbing their own units, they took a huge risk, but they reaped huge benefits. Their risk and reward is roughly the same now as it was then.
Doubtful. The problem is that though the main CPU is PowerPC-based like current Apple chips, it is stripped down, and the Altivec support will be much weaker than in current G5s. Unoptimized, Apple code would run like a G4 on this hardware. They would have to commit to a lot of R&D for their OS to use the additional 8 processors on the chip, and redesign all their tweaked Altivec code. It would not be a simple port. A couple of years to complete, at least.
This is half-true. While it will be hard, most game logic will be performed on the traditional PowerPC part of the Cell, and thus normal to program. The difficult part will be concentrated in specific algorithms, like a physics engine, or certain AI. The modular nature of this code will mean that you could buy a physics engine already designed to fit into the 128k limitation of the subprocessor, and add the hooks into your code. Easy as pie.
Bwahahaha! No way. This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game. Letting an OS predictively determine what code needs to get sent to what processor to run is insane in this case. The cost of switching out instructions is going to be very high, so any switch will need to be carefully considered by the designer, or the frame-rate will hit rock-bottom.
This is one myth that could be correct. The Cell is huge (relatively), and given IBM's problems in the recent past with making large, fast PowerPC chips, it's a huge gamble on the part of all parties involved that they can fab enough of these things.
I am not convinced by this argument. A lot of OS X code uses AltiVec, but very little actually uses it directly. Apple has spent a lot of effort producing libraries that people can use which wrap AltiVec into something higher level (e.g. QuickTime, vDSP). Most of these could potentially be ported to the SPEs. Things like CoreVideo could also make use of the SPEs.
all those vector units are great for number crunching, but how much of that do you do each day? And when you're not, that's 3/4 of the cost of your chip sitting around idle.
90% of the time, my 1.5GHz G4 is sitting at 20% utilisation or less. You could argue that 80% of the power of the chip is wasted. However, when I am doing things that tax it they are almost always things that would support a large degree of parallelism.
I am TheRaven on Soylent News
Well, it certainly might seem that he is being a hypocrite. See:
"In another part of the article, Blachford claims that the cell processing units have no "cache." Instead, they each have a "local memory" that fetches data from main memory in 1024-bit blocks. Well, that's sort of like saying that an iMac doesn't have a "monitor," but it does have a surface on which visual output is displayed. In other words, the Cell "local memories," which are roughly analogous to the vector units' "scratchpad RAM" on the PS2's Emotion Engine, function as caches for the PUs. What has thrown the author for a loop is that they're small, and the fact that they're tied to each cellular processing unit means that they don't function in the memory hierarchy in the exact same way that an L1 does in a traditional processor design. They do, however, cache things. But maybe I'm being nitpicky with this."
and
"Finally, to address something more specific to the Cell architecture itself, on page 1 we find this claim:
It has been speculated that the vector units are the same as the AltiVec units found in the PowerPC G4 and G5 processors. I consider this highly unlikely as there are several differences. Firstly the number of registers is 128 instead of AltiVec's 32, secondly the APUs use a local memory whereas AltiVec does not, thirdly Altivec is an add-on to the existing PowerPC instruction set and operates as part of a PowerPC processor, the APUs are completely independent processors.
The author appears to be confusing an instruction set with an implementation. The 128-register detail is a problem, because, as the author correctly points out, conventional Altivec has only 32 vector registers. So obviously it's a given that Cell won't be using straight-up Altivec. But it's entirely possible that it'll use some kind of 128-register derivative of the Altivec instruction set. The fact that the individual processing units have a local cache has little to do with whether or not the PUs themselves implement some hypothetical Altivec derivative. Finally, the statement, "Altivec is an add-on to the existing PowerPC instruction set," is correct, but the rest of that sentence--"and operates as part of a PowerPC processor"--doesn't make a whole lot of sense to me in this context. Altivec is an ISA extension that is implemented in different ways on different PowerPC processors. The Cell processor's PUs could very well implement a hypothetical 128-register Altivec2 ISA extension, or they could implement some other SIMD ISA extension. The fact that SIMD code, written to whatever ISA, is farmed out to individual PUs has nothing to do with it. (If what I just said confuses you, you might check out this article.) "
compared to
"The main differences between an individual SPE and an early RISC machine are twofold. First, and most obvious, is the fact that the Cell SPE is geared for single-precision SIMD computation. Most of its arithmetic instructions operate on 128-bit vectors of four 32-bit elements. So the execution core is packed with vector ALUs, instead of the traditional fixed-point ALUs. The second difference, and this is perhaps the most important, is that the L1 cache has been replaced by 256K of locally addressable memory. The SPE's ISA, which is not VMX/Altivec-derivative (more on this below), includes instructions for using the DMA controller to move data between main memory and local storage. The end result is that each SPE is like a very small vector computer, with its own "CPU" and RAM."
But if you read closely you will see that Blachford, to generalize, was "right" (e.g. local memory and no AltiVec on SPE) for the wrong reasons, and even then some of the info was factually incorrect (e.g. SPE fetches blocks of 1024 bits). I do think that Hannibal was too hard on the guy (probably because of his completely unsubstantiated claims about performance) and I think Hannibal should've cut Blachford some slack based on the source material that Blachford had available to him (although Blachford's
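For readers who want a picture of "128-bit vectors of four 32-bit elements" from the last quoted passage, here is a minimal C model. It is not SPE code: the type `vec4f` and the function `vec4f_add` are hypothetical, and the loop spells out in scalar form what a single SIMD add would do to all four lanes at once.

```c
/* Hypothetical model of one 128-bit register holding four 32-bit floats. */
typedef struct {
    float lane[4];
} vec4f;

/* Element-wise add: the work of one SIMD add instruction, written out
   as a scalar loop purely for illustration. */
vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```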
Reading the article, it reminds me of the typical mainframe architecture, where you have a central supervisory CPU, but most of the specialized work is done by the channel processors.
In the Cell, the main PPC CPU appears to identify a piece of work that needs to be done, schedules it to run on an SPE, uploads the code snippet to the SPE's LS via DMA transfer, and then goes off and does something else worthwhile while the SPE munches on it. I presume there's an interrupt mechanism to let the PPC know that an SPE has some results to return.
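That flow can be sketched in plain C. Everything here is an assumption for illustration: `spe_job`, `spe_run` and `scale_by_two` are hypothetical names, a function pointer stands in for the uploaded code snippet, and a `done` flag stands in for the presumed interrupt. The sketch also runs synchronously, whereas the real thing would overlap with the PPC's other work.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job descriptor: a kernel to run plus its data. */
typedef struct {
    void (*kernel)(float *data, size_t n); /* code "uploaded" to the SPE */
    float *data;
    size_t n;
    bool done;                             /* stand-in for a completion interrupt */
} spe_job;

/* Example kernel: double every element in place. */
void scale_by_two(float *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.0f;
}

/* Simulated SPE: runs the job and raises the completion flag. */
void spe_run(spe_job *job)
{
    job->kernel(job->data, job->n);
    job->done = true;
}
```

The main core's side of the bargain is then just: fill in a descriptor, hand it off, and check `done` (or take the interrupt) later.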
Compiler writers ought to be able to handle this new architecture well enough -- it's sort of like the current CPU/GPU split, where you've got the main program running on the system CPU, and specialized graphical transform programlets running on the GPU. There may need to be macros or code section identifiers in the source to let the compiler know which to target for that bit of code.
Obviously, this is just the first iteration of the Cell processor. I can see them widening the SPE from single precision to double precision (for the scientific market -- the game market probably doesn't need it), and going to a multi-core design to reduce the die size.
Chip H.
A fair question, but no. Consider for example an iterative factorial algorithm:
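The listing that followed this sentence seems to have been lost. A minimal sketch of an iterative factorial in C, with the name `factorial` chosen here for illustration, would be:

```c
/* Iterative factorial. As written, each multiplication consumes the
   result of the previous iteration. */
unsigned long long factorial(unsigned int n)
{
    unsigned long long result = 1;
    for (unsigned int i = 2; i <= n; i++)
        result *= i;   /* depends on result from the prior iteration */
    return result;
}
```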
Totally unparallelizable.
This is a case where to execute the next step, you absolutely need the results of the previous step to be completed. There can be other kinds of reasons for this: In this case you don't even know how many times the loop is going to execute in advance. Now, maybe if you're clever you can figure it out, but what if f() is return (rand() * i);? Ick.
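The loop being described did not survive in the text above. One plausible shape, assuming a step function `f` and a wrapper `count_steps` (both names hypothetical, and the body of `f` an arbitrary stand-in), is a loop whose trip count depends on values computed inside it:

```c
/* Hypothetical step function. The argument is that a compiler cannot,
   in general, see through it. Here it doubles and adds one; the sketch
   assumes inputs small enough that this does not wrap around. */
static unsigned int f(unsigned int i)
{
    return 2 * i + 1;
}

/* Count how many applications of f it takes for i to reach n.
   The trip count is only known by actually running the loop. */
unsigned int count_steps(unsigned int i, unsigned int n)
{
    unsigned int steps = 0;
    while (i < n) {
        i = f(i);
        steps++;
    }
    return steps;
}
```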
To make matters worse, C lets you use pointers and do whatever you want. So given some set of instructions, there could be side effects on i (or n) that are totally unpredictable without executing the program.
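A small illustration of the pointer problem, with a hypothetical function `add_into`. Because `dst` and `src` are ordinary pointers, nothing stops a caller from passing overlapping regions, and then a store in one iteration changes what a later iteration reads:

```c
#include <stddef.h>

/* Adds src into dst element by element. dst and src may overlap, so a
   compiler has to assume that writing dst[i] can change a later src[j]. */
void add_into(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```

With separate arrays the iterations are independent. Called as `add_into(buf + 1, buf, 3)` on a buffer of four 1.0 values, each step reads the value the previous step just wrote, and the buffer ends up as 1, 2, 3, 4. C99's `restrict` qualifier is how a programmer can promise the compiler that no such overlap occurs.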
What you're looking for - the problem I'm describing - is not a problem with gcc. It's a problem with the C language. If you want to get rid of side-effects and make parallelization easy, try using a pure functional language. But people don't like programming in pure functional languages (well, I don't), they like programming in C (or other procedural-style language).
I'm not a smorgasbord.
Xcode 2.0 is actually supposed to automatically "vectorize" programs for better optimization with AltiVec (check the Tiger page for it).
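For contrast with the loops discussed earlier in the thread, this is the kind of loop auto-vectorizers are generally built to handle: independent iterations, unit stride, and no aliasing. It is only a sketch; the function name `saxpy` is a conventional one picked for illustration, and whether a given compiler actually vectorizes it depends on the compiler and its flags.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Iterations are independent, and `restrict`
   promises that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```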