Ars Technica's Hannibal on IBM's Cell
endersdouble writes "Ars Technica's Jon "Hannibal" Stokes, known for his many articles on CPU technology, has posted a new article on IBM's new Cell processor. This one is the first part of a series, and covers the processor's approach to caching and control logic. Good read."
Part II is up as well.
My 7 year old PC (300mhz PII) runs everything I need on a daily basis pretty well.
Firefox, wily, gcc, python, perl, MS office, gimp and so on.
...clicking on this link also attempts to install a trojan (SARC's name: ByteVerify). I agree: this link should be removed and the poster's IP should be reported to the relevant authorities.
From the article:
The Cell and Apple
Finally, before signing off, I should clarify my earlier remarks to the effect that I don't think that Apple will use this CPU. I originally based this assessment on the fact that I knew that the SPUs would not use VMX/Altivec. However, the PPC core does have a VMX unit. Nonetheless, I expect this VMX to be very simple, and roughly comparable to the Altivec unit of the first G4. Everything on this processor is stripped down to the bare minimum, so don't expect a ton of VMX performance out of it, and definitely not anything comparable to the G5. Furthermore, any Altivec code written for the new G4 or G5 would have to be completely reoptimized due to the in-order nature of the PPC core's issue.
So the short answer is, Apple's use of this chip is within the realm of conceivability, but it's extremely unlikely in the short- and medium-term. Apple is just too heavily invested in Altivec, and this processor is going to be a relative weakling in that department. Sure, it'll pack a major SIMD punch, but that will not be a double-precision Altivec-type punch.
A larger chip is more expensive to produce: fewer chips fit on a single wafer.
For browsing simple websites or writing emails it works acceptably. For anything even remotely multimedia related, it is rendered useless.
Meanwhile a 400Mhz PII running Windows 2K can play flash, mp3s, and Divx files just fine.
The architecture of the Cell looks like a much-improved PS2 system, with the PS2's vu0 and vu1 (vector units 0 and 1) replaced by 8 SPEs. Also, the programmable DMA (with chaining ability, allowing it to sequence multiple DMA events one after the other etc.) looks very similar to the PS2's.
If that turns out to be the case, then PS2 programming is a hint towards how it'll work. On the PS2, you generally configured the DMA controller to upload mini programs to the vector units, then DMA-chained data as streams from RAM through the just-uploaded program and onto the destination (usually the GS which rasterised the display).
On the Cell, it looks as though you can DMA-chain code & data through multiple SPEs and ultimately back to RAM/the PPC core/whatever is memory mapped. This is cool - it's software pipelining.
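To make the streaming idea concrete, here is a minimal sketch in plain C. It only mimics the data flow: `memcpy` stands in for the DMA transfers, a small stack buffer stands in for a unit's local memory, and the names (`stream_through`, `stage_scale`, `stage_offset`, `CHUNK`) are all made up for illustration, not taken from any Cell or PS2 toolkit. On real hardware the stages would presumably run on different units at the same time; here they run one after the other.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size: floats moved per simulated DMA transfer. */
#define CHUNK 64

/* A "mini program" that processes one chunk in local memory. */
typedef void (*stage_fn)(float *buf, size_t n);

/* Two example stages (illustrative only). */
static void stage_scale(float *buf, size_t n) {
    for (size_t i = 0; i < n; i++) buf[i] *= 2.0f;
}

static void stage_offset(float *buf, size_t n) {
    for (size_t i = 0; i < n; i++) buf[i] += 1.0f;
}

/* Stream `count` floats from src to dst, one chunk at a time, passing
   each chunk through every stage in order. memcpy stands in for DMA. */
static void stream_through(const float *src, float *dst, size_t count,
                           const stage_fn *stages, size_t nstages) {
    float local[CHUNK]; /* stands in for a vector unit's local memory */
    for (size_t off = 0; off < count; off += CHUNK) {
        size_t n = (count - off < CHUNK) ? (count - off) : CHUNK;
        memcpy(local, src + off, n * sizeof(float));   /* "DMA in"  */
        for (size_t s = 0; s < nstages; s++) {
            stages[s](local, n);                       /* run stage */
        }
        memcpy(dst + off, local, n * sizeof(float));   /* "DMA out" */
    }
}
```

Calling `stream_through(in, out, count, stages, 2)` with `stages = { stage_scale, stage_offset }` pushes every element through both stages without the caller ever touching the local buffer.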
So, my guess is that the PPC acts as a (DMA, IO, etc.) controller (much like the mips chip did in the PS2), and the heavy lifting goes on in the vector units, with code and data being streamed in on demand.
It's a different model to normal programming, and as far as I can see it encourages you to be closer to the metal (i.e. it's harder, I normally expect my L1 cache to take care of itself...), but assuming they release/port gcc for the SPEs, it might not be too hard if you're used to event-driven highly-threaded programming. Let's just hope they release a Linux port and 'vcl' so we can do something useful with the vector units...
Oh, and if the Xbox was a target for a self-hosting Linux solution, I think the Cell will be irresistible.
Simon
Physicists get Hadrons!
Cradle Semiconductor has been working for a while on a similar technology.
Of course, it's all a matter of scale - TI had a 4 DSP, 1 CPU processor a while ago, but it only made 100 MFLOPS. Cradle's first product has 8 DSPs and 6 CPUs - depending on if you can get your data to properly pipeline through the processors, you can achieve up to 3.6 GFLOPs peak with only a 230 MHz clock.
HIV Crosses Species Barrier... into Muppets
My old 600mhz g3 ibook runs panther, safari, quicktime, iphoto, itunes and everything else I need on a daily basis pretty well. Try saying that about a five year old PC.
5 year old? Your 600mhz g3 ibook came out October 2001. That machine is just a few months older than 3 years old.
In October of 2001, the P4 was at 2.0ghz, and the Athlon 2000+ was just coming out. Are you going to tell me that a 2ghz P4 isn't adequate for browsing the web, listening to mp3s and importing digital photos?!
We'd do our skeletal animation skinning with this. DMA a bunch of verts to scratchpad, transform and weight them on the VU, DMA back to a display list. The thing is, there's really no high-level language support for this... the onus is on the programmer to schedule and memory map everything, mostly in assembly.
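For anyone who hasn't written this kind of code, the "transform and weight" step looks roughly like the sketch below. It is plain scalar C, not VU assembly, and everything in it is an assumption for illustration: the struct layouts, the two-bones-per-vertex limit, and the names `skin_vert`, `mat34` and `skin_batch`. The DMA in and out of scratchpad is left out; `in` and `out` just stand for the batch that would be sitting in local memory.

```c
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

/* 3x4 bone matrix: 3x3 rotation/scale plus a translation column. */
typedef struct { float m[3][4]; } mat34;

/* Hypothetical vertex layout: position plus two bone influences. */
typedef struct {
    vec3  pos;
    int   bone[2];
    float weight[2]; /* expected to sum to 1 */
} skin_vert;

/* Apply one bone matrix to a point. */
static vec3 transform(const mat34 *b, vec3 v) {
    vec3 r;
    r.x = b->m[0][0] * v.x + b->m[0][1] * v.y + b->m[0][2] * v.z + b->m[0][3];
    r.y = b->m[1][0] * v.x + b->m[1][1] * v.y + b->m[1][2] * v.z + b->m[1][3];
    r.z = b->m[2][0] * v.x + b->m[2][1] * v.y + b->m[2][2] * v.z + b->m[2][3];
    return r;
}

/* Skin one batch: transform each vertex by each of its bones and
   blend the results by weight. */
static void skin_batch(const skin_vert *in, vec3 *out, size_t n,
                       const mat34 *bones) {
    for (size_t i = 0; i < n; i++) {
        vec3 acc = { 0.0f, 0.0f, 0.0f };
        for (int k = 0; k < 2; k++) {
            vec3 t = transform(&bones[in[i].bone[k]], in[i].pos);
            acc.x += in[i].weight[k] * t.x;
            acc.y += in[i].weight[k] * t.y;
            acc.z += in[i].weight[k] * t.z;
        }
        out[i] = acc;
    }
}
```

The inner loop is exactly the kind of thing a vector unit is good at, which is why it got hand-scheduled instead of left to a compiler.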
The design of the Cell-- it's incredible. It's every game programmer's wet dream. I just don't see how it's going to be as useful in other areas though. It's going to be a compiler-writer's nightmare, and to get real performance from the SPEs is going to take a lot of assembly or a high-level language construct that I haven't seen yet.
Linux on Intel: Think Dead Man Walking and Grid vs. SMP: The Empire Tries Again and Fast, Faster and IBM's PlayStation 3 Processor.
Zen, your Google-fu is weak: http://en.wikipedia.org/wiki/Michel_Lotito :)
New consoles are sold at a loss, but there's a limit to how much of a loss companies can take. If the CPU itself ends up costing Sony $300+, they'd be looking at a massive loss on the consoles, probably larger than they are willing to take. That was actually a noted problem with the X-box, the loss per unit was large so they had to sell quite a few games per unit to make it up. I'm not even sure if they made any money on it.
Well, in MS's case, they can pull shit like that. Microsoft makes loads of cash off their software division, and has loads already in the bank. They can afford to operate a new division at a loss, even a pretty substantial loss (if the X-box division did lose money, it wasn't a large amount).
Sony, not so much. Their Playstation division is their biggest money maker these days. So they can afford to take a loss on console hardware, but only so much that they know they'll make it back on games. They can't risk operating the division at a loss because it'd spell serious trouble for the company. They also aren't flush with cash. They've about $10 Billion, but have $12 Billion or so in debt (Microsoft has $34 Billion and no debt to speak of). They have to keep the money rolling in or things get ugly.
Also we know from history that having the fastest processor or shiniest graphics isn't what wins a given round of the console wars. It's all about games, and perception.
Now who knows on pricing at this point, but the grandparent has a good point. That is a massive god damn die, like P4EE sized or so. Hot and expensive. As die size goes up, so do failure rates and thus cost, especially at high clock speeds. Hence why the EEs cost so damn much. I'd say it's a safe bet that this Cell processor isn't going to be cheap.
From the sounds of it, it's not going to need to be. Sounds like it's a high end calculation chip for badass number crunchers. Given that Power4/5s and Itanium 2s are popular for that sort of thing, people in those apps won't bat an eye at a $1000+ price tag.
In CPU sizes, 200mm² is pretty big. IIRC, newer Athlons bump around 100mm² depending on the cache size. P4s are somewhat larger than the Athlons. Bigger chips use more material and fab space, plus, the defect rate rises (it only takes a single error in a critical part of the chip to ruin it).
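A back-of-the-envelope sketch of why die area hurts twice: fewer dies fit on the wafer, and a smaller fraction of them work. The code below uses a deliberately crude gross-die count (wafer area divided by die area, ignoring edge loss) and the Seeds yield model, Y = 1 / (1 + A * D0). The wafer diameter and defect density you feed it are your own guesses; nothing here reflects actual IBM or Sony fab numbers.

```c
/* Gross dies per wafer: wafer area / die area.
   Ignores edge loss and scribe lines, so it overestimates. */
static double gross_dies(double wafer_diameter_mm, double die_area_mm2) {
    const double pi = 3.14159265358979323846;
    double r = wafer_diameter_mm / 2.0;
    return pi * r * r / die_area_mm2;
}

/* Seeds yield model: fraction of dies with no killer defect,
   given die area A and defect density D0 (defects per mm^2). */
static double seeds_yield(double die_area_mm2, double defects_per_mm2) {
    return 1.0 / (1.0 + die_area_mm2 * defects_per_mm2);
}

/* Estimated working dies per wafer under the two assumptions above. */
static double good_dies(double wafer_diameter_mm, double die_area_mm2,
                        double defects_per_mm2) {
    return gross_dies(wafer_diameter_mm, die_area_mm2)
         * seeds_yield(die_area_mm2, defects_per_mm2);
}
```

With an assumed 0.005 defects per mm², doubling the die from 100 to 200 mm² drops the modelled yield from about 67% to 50%, so the number of good dies per wafer falls by more than half.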
This isn't even a general purpose processor (no MMUs on the cells either in the traditional sense) nor have they gone superscalar - they have enough registers to keep the thing busy, software can figure that out - this isn't even that new an idea, a cell looks a lot like one of the media processors that was being sold 5-6 years ago
You're right it's not designed to be a scientific processor - but then high precision scientific processing is a tiny market these days - way more people want to pay for fast gaming platforms than want to do fluid dynamics or what have you
Not quite. The Cell is 9 complete yet simple CPUs in one. Each handles its own tasks with its own memory. Imagine 9 computers each with a really fast network connection to the other 8. You could probably treat them as extra vector processors, but you'd then miss out on a lot of potential applications. For instance, the small processors can talk to each other rather than work with the PowerPC at all.
Hardly. Sony is following the same game plan as they did with their Emotion Engine in the PS2. Everyone thought that they were losing 1-200 bucks per machine at launch, but financial records have shown that besides the initial R&D (the cost of which is hard to figure out), they were only selling the PS2 at a small loss initially, and were breaking even by the end of the first year. By fabbing their own units, they took a huge risk, but they reaped huge benefits. Their risk and reward is roughly the same now as it was then.
Doubtful. The problem is that though the main CPU is PowerPC-based like current Apple chips, it is stripped down, and the Altivec support will be much lower than in current G5s. Unoptimized, Apple code would run like a G4 on this hardware. They would have to commit to a lot of R&D for their OS to use the additional 8 processors on the chip, and redesign all their tweaked Altivec code. It would not be a simple port. A couple of years to complete, at least.
This is half-true. While it will be hard, most game logic will be performed on the traditional PowerPC part of the Cell, and thus normal to program. The difficult part will be concentrated in specific algorithms, like a physics engine, or certain AI. The modular nature of this code will mean that you could buy a physics engine already designed to fit into the 128k limitation of the subprocessor, and add the hooks into your code. Easy as pie.
Bwahahaha! No way. This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game. Letting an OS predictively determine what code needs to get sent to what processor to run is insane in this case. The cost of switching out instructions is going to be very high, so any switch will need to be carefully considered by the designer, or the frame-rate will hit rock-bottom.
This is one myth that could be correct. The Cell is huge (relatively), and given IBM's problems in the recent past with making large, fast PowerPC chips, it's a huge gamble on the part of all parties involved that they can fab enough of these things.
The L4/Hurd guys are talking about "Deva" which is their vaporous specification for a driver interface. Since Hurd's drivers are all userland, this specification which nobody is working on is probably one of the most important things in the development of computer science right now. Hell, I should go back to university and take some classes so I could work on it. Talk about making history.
Slashdotters constantly bitch and moan about how slow Hurd's progress has been, but all they have to do is send in a patch or write a doc or something. I personally ported GNU Pth to Hurd some years back making me (in my mind) one of the first people to ever compile and run a pthread app on Hurd (slooooowww). Hehe, but I did make pseudo-history in the world of computer science because of that stupid couple days I spent fiddling around with autoconf.
L4/Hurd development is total anarchy. Work on whatever you feel like and send in patches. You don't have to "join GNU" or any such nonsense. In fact I have never ever seen RMS post to any Hurd developer list ever. He's more likely to post here.
Slashdotters seem to think that Hurd is RMS's little empire, but in fact he has about nothing to do with it. Marcus Brinkmann right now is probably the unofficial leader of Hurd just because he has personally written most of the really hardcore stuff.
Clickety Click
A fair question, but no. Consider for example an iterative factorial algorithm:
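Something along these lines (a sketch of a typical iterative factorial, not necessarily the exact code originally posted):

```c
/* Iterative factorial. Overflows unsigned long long for n > 20. */
unsigned long long factorial(unsigned int n) {
    unsigned long long result = 1;
    for (unsigned int i = 2; i <= n; i++) {
        /* Each iteration reads the value the previous one wrote. */
        result *= i;
    }
    return result;
}
```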
Totally unparallelizable.
This is a case where to execute the next step, you absolutely need the results of the previous step to be completed. There can be other kinds of reasons for this: In this case you don't even know how many times the loop is going to execute in advance. Now, maybe if you're clever you can figure it out, but what if f() is return (rand() * i);? Ick.
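A loop of the kind being described might look like the sketch below, where the loop variable is advanced by calling a function. The shape of the loop and the names `count_steps` and `step_double` are guesses for illustration; only the idea of an opaque f() deciding the next value of i comes from the post.

```c
/* Example step function (illustrative): doubles i each time. */
static int step_double(int i) {
    return i * 2;
}

/* Count how many times the loop body runs. The trip count depends
   entirely on what step() returns, so it cannot be known up front
   without running (or fully analysing) step(). */
static int count_steps(int n, int (*step)(int)) {
    int steps = 0;
    for (int i = 1; i < n; i = step(i)) {
        steps++;
    }
    return steps;
}
```

With `step_double` a clever compiler could work out the trip count, but pass in something built on rand() and there is nothing to analyse: the only way to find the next i is to run the current step.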
To make matters worse, C lets you use pointers and do whatever you want. So given some set of instructions, there could be side effects on i (or n) that are totally unpredictable without executing the program.
What you're looking for - the problem I'm describing - is not a problem with gcc. It's a problem with the C language. If you want to get rid of side-effects and make parallelization easy, try using a pure functional language. But people don't like programming in pure functional languages (well, I don't), they like programming in C (or other procedural-style language).
I'm not a smorgasbord.
SPEs (CELL SIMD processors..) have double precision units! IBM will discuss DP units for CELL today or tomorrow at ISSCC.
The reason it has so many transistors is because of the amount of onboard memory. Memory uses a lot more transistors than the logic circuits do.
A complicated CPU may have tens or hundreds of millions of transistors, but a single memory chip has billions.
So when you bump up the cache size on a CPU, the transistor count goes up greatly.
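To put a rough number on it: on-chip cache is normally SRAM, and the classic SRAM cell uses 6 transistors per bit. The sketch below multiplies that out. It assumes the 6T cell and ignores tags, ECC, decoders and sense amps, so treat the result as a floor, not a spec; the function name is made up for this example.

```c
/* Rough transistor count for an SRAM array of the given size,
   assuming a 6-transistor cell per bit and nothing else. */
static unsigned long long sram_transistors(unsigned long long bytes) {
    return bytes * 8ULL * 6ULL; /* 8 bits per byte, 6 transistors per bit */
}
```

Under that assumption a 512 KB cache alone comes to about 25 million transistors, and every extra 256 KB adds roughly 12.6 million more.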
XCode 2.0 is actually supposed to automatically "vectorize" programs for better optimization with altivec (check the Tiger page for it).