Ars Technica's Hannibal on IBM's Cell
endersdouble writes "Ars Technica's Jon "Hannibal" Stokes, known for
his many articles on CPU technology, has posted a new article on IBM's new Cell processor. This one is the first part of a series, and covers the processor's approach to caching and control logic. Good read."
Why do I have the sneaking suspicion that, if successful, this processor will eclipse the PowerPC on the Mac in the next few years?
Part II is up as well.
.. made of risc components.
" Last fall, IBM and Sony said they were developing a workstation based on Cell chips, which is the first product IBM will ship based on Cell."
Regardless if this is the first product shipped or not, a workstation is coming. I can't see it running anything but linux. Given the mass market targeting of the cell, I hope Sony makes a strong go at grabbing the market with cheap hardware, rather than trying to milk the high-end content creation market first.
"A language that doesn't affect the way you think about programming, is not worth knowing" - Alan Perlis
e.g. 234 M transistors (!) That's why I don't think this will be replacing the G5 any time soon. The die size (at the current prototype's 90nm) is over 200 mm2.
It'll have to get a fair bit smaller/cheaper before the PS3 can use it without major subsidies, and I don't know why they think general consumer devices will want it. God knows how much power it dissipates with all 8 SPEs clocking over at 4 GHz...
Why would anyone engrave "Elbereth"?
Thank god. I've enjoyed his articles in the past, and if experience is any indication, I will have the false impression that I understand this stuff in a nontrivial way for up to three hours. This is not meant to rag on Hannibal, BTW.
WIth a name like that, I expect to see pictures of him eating those Cell processors, and describing how they taste.
// file: mice.h
#include "frickin_lasers.h"
The one thing I don't understand is how I would code for this thing. As best as I understand it, I now have some instructions for controlling the cache (or LAM, whatever) which sounds cool, but are there any details yet of how I'd write code for this? I'm also disappointed that the article didn't explain how one would use their SIMD instructions if they aren't using any of the existing standards. So I load my vectors with the cache control and ask the processors to ever so kindly add them?
Anybody out there with experience on this architecture or even attended the presentation itself can give us mere coders details? Preferably a website.
The bitter lessons of a veteran coder: http://bitterprogrammer.blogspot.com
Is that the 386 instruction set and arcitecture is so non proprietary. What made it so popular certainly wasn't that it was better. If I had the dough, I can literally make one and my own fab without asking a single soul. Alot of times it seems companies try to gather into consortiums to mimic the same effect and gather market momentum, but these are doomed to failure because the more valuable the technology becomes - the greater the pressure to diferentiate and fence off some "teritory" for themselves. We saw this happen first hand with UNIX, where all the flavors would constantly try to group under these unified standards - and they made little progress until Linux came along. The CPU world needs somthing similar to protect people from patent harassment. for design, cores, and fabrication.
What I find interesting is that the vector processor are restricted to single precision floating point calculations.
This isn't terribly useful for scientific computations (there is the same problem with the GPU): currently the IEEE is working on a standard for 128bit precision floating point calculations!
Of course for 3D, video and sound, 32bit precision is good enough and *if* programmers (a big if) manage to overcome the pain of 'parallel programming' then it could be a big success.
Cradle Semiconductor has been working for a while on a similar technology.
Of course, it's all a matter of scale - TI had a 4 DSP, 1 CPU processor a while ago, but it only made 100 MFLOPS. Cradle's first product has 8 DSPs and 6 CPUs - depending on if you can get your data to properly pipeline through the processors, you can achieve up to 3.6 GFLOPs peak with only a 230 MHz clock.
HIV Crosses Species Barrier... into Muppets
Another article on the Cell design at http://www.theregister.co.uk/2005/02/03/cell_analy sis_part_two/ seems to indicate that there is some sort of DRM built in.
Hannibal doesn't say anything about this (that I noticed) - anyone have more info?
Don't save Windows XP! http://www.petitiononline.com/jjw1xp/petition.html
A proposal for Apple
I don't have an account, but this is an honest idea.
Why doesn't Apple include a Playstation 2 support card into their Macintosh line?
Problem: The OSX platform has almost no games. I own several macs, I love my macs, and I sincerely enjoy OSX. But it has no games, and that will never get better, especially as simpler games migrate to the web and the complex ones bail for the console market. The PC gaming market has essentially peaked.
Solution: Embed (or include as a BTO option) a PS2 chipset to a Macintosh. Run the generated display straight through to the graphical overlay plane. Done.
Everything works. The controllers are trivially converted to use USB. The DVD drive is already there. The display is already there. The USB and Firewire is already there. The harddrive is already there. The "memory cards" are already there.
Reason: The Macintosh game library explodes instantly to encompass something like 3,000 PS1 and PS2 games. With no need for emulation, the games are guaranteed to work out of the box and provide the Apple ease of use everyone loves. Sony increases their marketshare, Apple gets a viable expanding game library, and users get a vastly better gaming experience on OSX for maybe $40 of parts and engineering.
Why won't this work?
The difference is that instead of the compiler taking up the slack (as in RISC), a combination of the compiler, the programmer, some very smart scheduling software
Requiring programmers to learn how to write parallel code that makes good use of this processor seems pretty dicey to me. Few programmers have been trained to write parallel code (most struggle with threading). The fact that no popular programming language has a good parallel model is also a big stumbling block.
This problem seems to be looming for all the dual core processors, but I havent seen a big effort to teach programmers how to adapt.
A budget-class PC laptop of that time might have been about 900 MHz to 1.1 GHz. I wouldn't consider such a laptop anything near useable. They tended to have poor quality sound systems that bottlenecked the processor and atrociously short battery times. The ibook was legendary for its excellent battery performance
Get off what you 'assume', assumption is just intuition for idiots.
We have test 200mhz laptops with 80mb of ram 5gb hard drives, released 1997 all running WindowsXP Professional (yes even the themes turned on) and they benchmark faster than they did when they shipped with Windows 95.
Secondly, they can do full 30fps video as long as it is uncompressed AVI or even WMA 9. QuickTime (MPEG4), MPEG2, and real stutter horribly on video playback unfortunately.
As for battery, don't know, these laptops hold for 3hrs with a single charge, and yes techs are REQUIRED and have no problems using them daily in test scenarios.
Now if you really want to compare laptops to laptops, why don't I show you our 900mhz AMD Compaq laptops, they have JBL sound systems in them, and there isn't a single feature the cannot perform with the exception of running a T&L based video game, as the integrated video doesn't handle it, oh wait, the 900mhz PowerBook video didn't support such features either. (BTW, This is not to say that there are not several 900-1000mhz class laptops that have upper end video features), I am just using what we have in our test labs for comparison.
The 900mhz laptop has a DVD/CDRW, came out late 2000 early 2001 (trying to remember if we got them before holidays or not). They do full software DVD decoding with less than 20% CPU utilization and pretty much do anything fairly fast that we through at them. We even have a beta version of Windows 2003 server running on one with 256mb of RAM. (Yes we are always pushing the limits, but it works as fast as the WindowsXP pro version of the machine sitting next to it.)
Now off my rant... Macs truly are great, and the PowerBooks of the time were great, but that DOES NOT MEAN they were the BEST, WILL ALWAYS BE THE BEST, or you should be complacent listening to Apple tell you what you are getting is the best when it might not be. It is time for us as MAC users to stand up and DEMAND that technology becomes as much a part of what a MAC is as the EASE of USE in the Interface.
The time is now, we need to STOP accepting what they tell us and give us and force them to truly give us the LATEST technological concepts, not just the above average concepts when compared to the PC world. These are Macs, they SHOULD BE BETTER. IT shouldn't even be subjected to a debate they should be so far advanced a debate should not be possible. PERIOD.
Sadly, it just isn't true now, and has not been for many years. OSX has giving the Mac world some credibility backing OS technology, but not Apple needs to take Macs to the next level.
Even if my comment inspires one Mac user to say hey Apple, we want better, then maybe we all can be the symbolic person with the hammer from their 1984 video and WAKE THEM UP this time.
This isn't even a general purpose processor (no MMUs on the cells either in the traditional sense) nor have they gone superscalar - they have enough registers to keep the thing busy, software can figure that out - this isn't even that new an idea, a cell looks a lot like one of the media processors that was being sold 5-6 years ago
You're right it's not designed to be a scientific processor - but then high precision scientific processing is a tiny market these days - way more people want to pay for fast gaming platforms than want to do fluid dynamics or what have you
Not quite. The Cell is 9 complete yet simple CPU's in one. Each handles its own tasks with its own memory. Imagine 9 computers each with a really fast network connection to the other 8. You could problably treat them as extra vector processors, but you'd then miss out on a lot of potential applications. For instance, the small processors can talk to each other rather than work with the PowerPC at all.
Hardly. Sony is following the same game plan as they did with their Emotion Engine in the PS2. Everyone thought that they were losing 1-200 bucks per machine at launch, but financial records have shown that besides the initial R&D (the cost of which is hard to figure out), they were only selling the PS2 at a small loss initially, and were breaking even by the end of the first year. By fabbing their own units, they took a huge risk, but they reaped huge benefits. Their risk and reward is roughly the same now as it was then.
Doubtful. The problem is that though the main CPU is PowerPC-based like current Apple chips, it is stripped down, and the Altivec support will be much lower than in current G5s. Unoptomized, Apple code would run like a G4 on this hardware. They would have to commit to a lot of R&D for their OS to use the additional 8 processors on the chip, and redesign all their tweaked Altivec code. It would not be a simple port. A couple of years to complete, at least.
This is half-true. While it will be hard, most game logic will be performed on the traditional PowerPC part of the Cell, and thus normal to program. The difficult part will be concentrated in specific algorithms, like a physics engine, or certain AI. The modular nature of this code will mean that you could buy a physics engine already designed to fit into the 128k limitation of the subprocessor, and add the hooks into your code. Easy as pie.
Bwahahaha! No way. This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game. Letting on OS predictively determine what code needs to get sent to what processor to run is insane in this case. The cost of switching out instructions is going to be very high, so any switch will need to be carefully considered by the designer, or the frame-rate will hit rock-bottom.
This is one myth that could be correct. The Cell is huge (relatively), and given IBM's problems in the recent past with making large, fast PowerPC chips, it's a huge gamble on the part of all parties involved that they can fab enough of these things.
I am not convinced by this argument. A lot of OS X code uses AltiVec, but very little actually uses it directly. Apple has spent a lot of effort producing libraries that people can use which wrap AltiVec into something higher level (e.g. QuickTime, vDSP). Most of these could potentially be ported to the SPEs. Things like CoreVideo could also make use of the SPEs.
all those vector units are great for number crunching, but how much of that do you do each day? And when you're not, that's 3/4 of the cost of your chip sitting around idle.
90% of the time, my 1.5GHz G4 is sitting at 20% utilisation or less. You could argue that 80% of the power of the chip is wasted. However, when I am doing things that tax it they are almost always things that would support a large degree of parallelism.
I am TheRaven on Soylent News
The L4/Hurd guys are talking about "Deva" which is their vaporous specification for a driver interface. Since Hurd's drivers are all userland, this specification which nobody is working on is probably one of the most important things in the development of computer science right now. Hell, I should go back to university and take some classes so I could work on it. Talk about making history.
Slashdotters constantly bitch and moan about how slow Hurd's progress has been, but all they have to do is send in a patch or write a doc or something. I personally ported GNU Pth to Hurd some years back making me (in my mind) one of the first people to ever compile and run a pthread app on Hurd (slooooowww). Hehe, but I did make pseudo-history in the world of computer science because of that stupid couple days I spend fiddling around with autoconf.
L4/Hurd development is total anarchy. Work on whatever you feel like and send in patches. You don't have to "join GNU" or any such nonsense. In fact I have never ever seen RMS post to any Hurd developer list ever. He's more likely to post here.
Slashdotters seem to think that Hurd is RMS's little empire, but in fact he has about nothing to to with it. Marcus Brinkman right now is probably the unofficial leader of Hurd just because he has personally written most of the really hardcore stuff.
Clickety Click
Reading the article, it reminds me of the typical mainframe architecture, where you have a central supervisory CPU, but most of the specialized work is done by the channel processors.
In the Cell, the main PPC CPU appears to identify a piece of work that needs to be done, schedules it to run on a SPE, uploads the code snippet to the SPE's LS via DMA transfer, and then goes off and does something else worthwhile while the SPE munches on it. I presume there's an interrupt mechanism to let the PPC know that a SPE has some results to return.
Compiler writers ought to be able to handle this new architecture well enough -- it's sort of like the current CPU/GPU split, where you've got the main program running on the system CPU, and specialized graphical transform programlets running on the GPU. There may need to be macros or code section identifiers in the source to let the compiler know which to target for that bit of code.
Obviously, this is just the first iteration of the Cell processor. I can see them widening the SPE from single precision to double precision (for the scientific market -- the game market probably doesn't need it), and going to a multi-core design to reduce the die size.
Chip H.
A fair question, but no. Consider for example an iterative factorial agorithm:
Totally unparallelizable.
This is a case where to execute the next step, you absolutely need the results of the previous step to be completed. There can be other kinds of reasons for this:In this case you don't even know how many times the loop is going to execute in advance. Now, maybe if you're clever you can figure it out, but what if f() is return (rand() * i);? Ick.
To make matters worse, C lets you use pointers and do whatever you want. So given some set of instructions, there could be side affects on i (or n) that are totally unpredictable without executing the program.
What you're looking for - the problem I'm describing - is not a problem with gcc. It's a problem with the C language. If you want to get rid of side-effects and make parallelization easy, try using a pure functional language. But people don't like programming in pure functional languages (well, I don't), they like programming in C (or other procedural-style language).
I'm not a smorgasbord.