Slashdot Mirror


Ars Technica's Hannibal on IBM's Cell

endersdouble writes "Ars Technica's Jon "Hannibal" Stokes, known for his many articles on CPU technology, has posted a new article on IBM's new Cell processor. This one is the first part of a series, and covers the processor's approach to caching and control logic. Good read."

13 of 449 comments

  1. Part II is up now by Anonymous Coward · · Score: 5, Informative

    Part II is up as well.

  2. Re:Apple? by Tropaios · · Score: 5, Informative

    From the article:

    The Cell and Apple

    Finally, before signing off, I should clarify my earlier remarks to the effect that I don't think that Apple will use this CPU. I originally based this assessment on the fact that I knew that the SPUs would not use VMX/Altivec. However, the PPC core does have a VMX unit. Nonetheless, I expect this VMX to be very simple, and roughly comparable to the Altivec unit of the first G4. Everything on this processor is stripped down to the bare minimum, so don't expect a ton of VMX performance out of it, and definitely not anything comparable to the G5. Furthermore, any Altivec code written for the new G4 or G5 would have to be completely reoptimized due to the in-order nature of the PPC core's issue.

    So the short answer is, Apple's use of this chip is within the realm of conceivability, but it's extremely unlikely in the short- and medium-term. Apple is just too heavily invested in Altivec, and this processor is going to be a relative weakling in that department. Sure, it'll pack a major SIMD punch, but that will not be a double-precision Altivec-type punch.

  3. Re:How do I code this thing?? by Space+cowboy · · Score: 4, Informative


    The architecture of the Cell looks like a much-improved PS2 system, with the PS2's vu0 and vu1 (vector units 0 and 1) replaced by 8 SPE's. Also, the programmable DMA (with chaining ability, allowing it to sequence multiple DMA events one after the other etc.) looks very similar to the PS2's.

    If that turns out to be the case, then PS2 programming is a hint towards how it'll work. On the PS2, you generally configured the DMA controller to upload mini programs to the vector units, then DMA-chained data as streams from RAM through the just-uploaded program and onto the destination (usually the GS which rasterised the display).

    On the Cell, it looks as though you can DMA-chain code & data through multiple SPE's and ultimately back to RAM/the PPC core/whatever is memory mapped. This is cool - it's software pipelining :-)

    So, my guess is that the PPC acts as a (DMA, IO, etc.) controller (much like the mips chip did in the PS2), and the heavy lifting goes on in the vector units, with code and data being streamed in on demand.
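
    A minimal C sketch of that streaming model, not real Cell or PS2 code: `memcpy` stands in for a DMA transfer, a small stack buffer stands in for a vector unit's local memory, and the names `kernel_fn`, `kernel_scale`, `kernel_offset`, `stream_through` and `LOCAL_WORDS` are made up for illustration. Real hardware would run the stages on separate units and overlap transfers with compute; this only shows the data flow.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a vector unit's local memory: a small private buffer. */
#define LOCAL_WORDS 64

/* A "mini program" that operates in place on whatever is in local memory. */
typedef void (*kernel_fn)(float *local, size_t n);

/* Two example kernels (assumed for illustration). */
void kernel_scale(float *local, size_t n) {
    for (size_t i = 0; i < n; i++) local[i] *= 2.0f;
}

void kernel_offset(float *local, size_t n) {
    for (size_t i = 0; i < n; i++) local[i] += 1.0f;
}

/* Stream src through every stage in order, one chunk at a time.
 * This version is sequential: it models the chaining, not the concurrency. */
void stream_through(const kernel_fn *stages, size_t n_stages,
                    const float *src, float *dst, size_t n) {
    float local[LOCAL_WORDS];
    for (size_t done = 0; done < n; done += LOCAL_WORDS) {
        size_t chunk = n - done < LOCAL_WORDS ? n - done : LOCAL_WORDS;
        memcpy(local, src + done, chunk * sizeof *local);  /* "DMA in" from RAM */
        for (size_t s = 0; s < n_stages; s++)
            stages[s](local, chunk);                       /* chained stages */
        memcpy(dst + done, local, chunk * sizeof *local);  /* "DMA out" to RAM */
    }
}
```

    Calling `stream_through` with the stage list `{ kernel_scale, kernel_offset }` computes `2*x + 1` on every element, one buffer-sized chunk at a time.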

    It's a different model to normal programming, and as far as I can see it encourages you to be closer to the metal (ie: it's harder, I normally expect my L1 cache to take care of itself...), but assuming they release/port gcc for the SPE's, it might not be too hard if you're used to event-driven highly-threaded programming. Let's just hope they release a Linux port and 'vcl' so we can do something useful with the vector units...

    Oh, and if the xbox was a target for a self-hosting linux solution, I think the Cell will be irresistible :-)

    Simon

    --
    Physicists get Hadrons!
  4. similar technology... by morcheeba · · Score: 3, Informative

    Cradle Semiconductor has been working for a while on a similar technology.

    Of course, it's all a matter of scale - TI had a 4 DSP, 1 CPU processor a while ago, but it only made 100 MFLOPS. Cradle's first product has 8 DSPs and 6 CPUs - depending on if you can get your data to properly pipeline through the processors, you can achieve up to 3.6 GFLOPs peak with only a 230 MHz clock.

  5. Re:Apple? by prockcore · · Score: 4, Informative

    My old 600mhz g3 ibook runs panther, safari, quicktime, iphoto, itunes and everything else I need on a daily basis pretty well. Try saying that about a five year old PC.

    5 year old? Your 600mhz g3 ibook came out October 2001. That machine is just a few months older than 3 years old.

    In October of 2001, the P4 was at 2.0ghz, and the Athlon 2000+ was just coming out. Are you going to tell me that a 2ghz P4 isn't adequate for browsing the web, listening to mp3s and importing digital photos?!

  6. Re:How do I code this thing?? by adam31 · · Score: 4, Informative
    This is similar to the 'scratchpad' RAM that Sony used in the PS2 and PS1. It's 16kb of on-chip (super-fast) memory that can be loaded and manipulated by the programmer, completely separate from the jurisdiction of the cache (which can cause big headaches-- think cache writeback with stale data).

    We'd do our skeletal animation skinning with this. DMA a bunch of verts to scratchpad, transform and weight them on the VU, DMA back to a display list. The thing is, there's really no high-level language support for this... the onus is on the programmer to schedule and memory map everything, mostly in assembly.
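
    A rough C sketch of that scratchpad workflow, assuming a two-bone weighted blend per vertex. None of this is PS2 SDK code: `Vert`, `Bone`, `transform`, `skin` and `SCRATCH_VERTS` are hypothetical names, `memcpy` stands in for the DMA transfers, and a stack array stands in for the scratchpad.

```c
#include <stddef.h>
#include <string.h>

typedef struct { float x, y, z, w0; } Vert;  /* w0: weight of bone 0; bone 1 gets 1 - w0 */
typedef struct { float m[3][4]; } Bone;      /* 3x4 affine transform */

/* 256 verts * 16 bytes = 4 KB, comfortably inside a 16 KB scratchpad. */
#define SCRATCH_VERTS 256

/* Apply one bone's transform to a vertex position. */
void transform(const Bone *b, const Vert *v, float out[3]) {
    for (int r = 0; r < 3; r++)
        out[r] = b->m[r][0] * v->x + b->m[r][1] * v->y
               + b->m[r][2] * v->z + b->m[r][3];
}

/* Copy a chunk of verts in, blend the two bone transforms, copy back out. */
void skin(const Bone bones[2], const Vert *src, Vert *dst, size_t n) {
    Vert scratch[SCRATCH_VERTS];
    for (size_t done = 0; done < n; done += SCRATCH_VERTS) {
        size_t chunk = n - done < SCRATCH_VERTS ? n - done : SCRATCH_VERTS;
        memcpy(scratch, src + done, chunk * sizeof *scratch);  /* "DMA" verts in */
        for (size_t i = 0; i < chunk; i++) {
            float a[3], b[3];
            float w = scratch[i].w0;
            transform(&bones[0], &scratch[i], a);
            transform(&bones[1], &scratch[i], b);
            scratch[i].x = w * a[0] + (1.0f - w) * b[0];
            scratch[i].y = w * a[1] + (1.0f - w) * b[1];
            scratch[i].z = w * a[2] + (1.0f - w) * b[2];
        }
        memcpy(dst + done, scratch, chunk * sizeof *scratch);  /* "DMA" results out */
    }
}
```

    In C the chunking is a handful of lines; the pain described above comes from doing the same scheduling by hand in assembly, with real DMA completion to wait on.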

    The design of the cell-- it's incredible. It's every game programmer's wet dream. I just don't see how it's going to be as useful in other areas though. It's going to be a compiler-writer's nightmare, and to get real performance from the SPEs is going to take a lot of assembly or a high-level language construct that I haven't seen yet.

  7. Re:More info in these slides by WoTG · · Score: 3, Informative

    In CPU sizes, 200mm² is pretty big. IIRC, newer Athlons bump around 100mm² depending on the cache size. P4's are somewhat larger than the Athlons. Bigger chips use more material and fab space, plus, the defect rate rises (it only takes a single error in a critical part of the chip to ruin it).
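
    To put rough numbers on the defect point, here is the simple Poisson yield model often used for back-of-envelope estimates, where A is die area and D_0 is defect density. The value D_0 = 0.5 defects/cm² is an assumed figure for illustration only, not a number from IBM or AMD.

```latex
Y = e^{-A D_0}, \qquad
Y(1\,\mathrm{cm}^2) = e^{-0.5} \approx 0.61, \qquad
Y(2\,\mathrm{cm}^2) = e^{-1.0} \approx 0.37
```

    Under that assumption, going from a 100mm² die to a 200mm² die drops the fraction of good dies from about 61% to about 37%, on top of getting fewer candidate dies per wafer.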

  8. Re:Eliminating Instruction Window by taniwha · · Score: 3, Informative
    read it more carefully - they don't eliminate the instruction window - they set it to 2. They can decode exactly 2 instructions/clock (provided they meet some simple dependency rules between the instructions), which makes for easy decode trees and fast cycle times.
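
    A toy C illustration of what a "simple dependency rule" for a 2-wide in-order front end could look like. The rule here (the second instruction must not read or overwrite the first one's destination register) is an assumption for illustration, not the actual Cell issue logic, and `Insn` and `can_dual_issue` are made-up names.

```c
#include <stdbool.h>

/* Hypothetical three-register instruction: dst = op(src1, src2). */
typedef struct { int dst, src1, src2; } Insn;

/* Assumed pairing rule: the pair may issue together only if the second
 * instruction neither reads (RAW) nor overwrites (WAW) the first one's result. */
bool can_dual_issue(Insn first, Insn second) {
    bool raw = second.src1 == first.dst || second.src2 == first.dst;
    bool waw = second.dst == first.dst;
    return !raw && !waw;
}
```

    A check this small is a few comparators in hardware, which is the point: with no deep window to search, the decoder stays simple and the compiler is left to arrange independent pairs.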

    This isn't even a general purpose processor (no MMUs on the cells either in the traditional sense) nor have they gone superscalar - they have enough registers to keep the thing busy, software can figure that out - this isn't even that new an idea, a cell looks a lot like one of the media processors that was being sold 5-6 years ago

    You're right it's not designed to be a scientific processor - but then high precision scientific processing is a tiny market these days - way more people want to pay for fast gaming platforms than want to do fluid dynamics or what have you

  9. Top 7 Myths of the New Cell Processor: by Modab · · Score: 5, Informative
    There are so many people saying dumb things about the Cell and the upcoming PS3, I have to set some things straight. Here goes:
    1. The Cell is just a PowerPC with some extra vector processing.
      Not quite. The Cell is 9 complete yet simple CPU's in one. Each handles its own tasks with its own memory. Imagine 9 computers each with a really fast network connection to the other 8. You could probably treat them as extra vector processors, but you'd then miss out on a lot of potential applications. For instance, the small processors can talk to each other rather than work with the PowerPC at all.
    2. Sony will have to sell the PS3 at an incredible loss to make it competitive.
      Hardly. Sony is following the same game plan as they did with their Emotion Engine in the PS2. Everyone thought that they were losing 1-200 bucks per machine at launch, but financial records have shown that besides the initial R&D (the cost of which is hard to figure out), they were only selling the PS2 at a small loss initially, and were breaking even by the end of the first year. By fabbing their own units, they took a huge risk, but they reaped huge benefits. Their risk and reward is roughly the same now as it was then.
    3. Apple is going to use this processor in their new machine.
      Doubtful. The problem is that though the main CPU is PowerPC-based like current Apple chips, it is stripped down, and the Altivec support will be much lower than in current G5s. Unoptimized, Apple code would run like a G4 on this hardware. They would have to commit to a lot of R&D for their OS to use the additional 8 processors on the chip, and redesign all their tweaked Altivec code. It would not be a simple port. A couple of years to complete, at least.
    4. The parallel nature will make it impossible to program.
      This is half-true. While it will be hard, most game logic will be performed on the traditional PowerPC part of the Cell, and thus normal to program. The difficult part will be concentrated in specific algorithms, like a physics engine, or certain AI. The modular nature of this code will mean that you could buy a physics engine already designed to fit into the 128k limitation of the subprocessor, and add the hooks into your code. Easy as pie.
    5. The Cell will do the graphics processing, leaving only rasterization to the video card.
      Most likely false. The high-end video cards coming out now can process the rendering chain as fast as the Cell can, looking at the raw specs of 256Gflops from the Cell, as opposed to about 200GFlops from video cards. In two years, video cards will be capable of much more, and they are already optimized for this, where the Cell is not, so video cards will perform closer to the theoretical limits.
    6. The OS will handle the 8 additional vector processors so the programmer doesn't need to.
      Bwahahaha! No way. This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game. Letting an OS predictively determine what code needs to get sent to what processor to run is insane in this case. The cost of switching out instructions is going to be very high, so any switch will need to be carefully considered by the designer, or the frame-rate will hit rock-bottom.
    7. The Cell chip is too large to fab efficiently.
      This is one myth that could be correct. The Cell is huge (relatively), and given IBM's problems in the recent past with making large, fast PowerPC chips, it's a huge gamble on the part of all parties involved that they can fab enough of these things.
    1. Re:Top 7 Myths of the New Cell Processor: by fitten · · Score: 3, Informative

      Your points #4 and #6 almost conflict...

      "Easy as pie."

      and

      "This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game."

      I know that you are talking, sort of, about two different things, but they are related. While it may be "easy as pie" to add the hooks into your code to call what is essentially a library, making sure that library is scheduled, running, running in the right place and on the right data, and synchronized with everything else in the right ways, is the hard part (which you kind of glossed over in #4).

      Another myth:

      X. This architecture is "brand new".
        Personally, I worked on a system that was very similar to this but a little more discrete. The board had a single PPC microcontroller type CPU (integer only 32-bit) that was the 'boss' and also a single chip package of eight DSPs, all with their own local share of memory (not cache, but memory just like here) and each had some high speed DMA engines that connected each DSP to other DSPs in the package in a certain configuration. The 'boss PPC' would farm out tasks to the DSPs, which could work either singularly or in parallel with other DSPs (given the code as written) to crunch numbers. Other than advances in processes that have made the cores in the Cell have more features and functionality and the fact that the PPC was on a separate chip from the DSPs, the architecture is very, very similar and, I will bet, the programming will be similar (it wasn't easy).

  10. Re:Golden oppourtunity for L4/Hurd by The_Dougster · · Score: 3, Informative
    Like everything else with the Hurd, it'll come in time. I'd do something with it, but I don't have a clue as how I'd write a device driver, much less an interface for one.
    Likewise. I'm in kind of a strange position as I am keenly interested in stuff like this, yet this really isn't my personal genre.

    The L4/Hurd guys are talking about "Deva" which is their vaporous specification for a driver interface. Since Hurd's drivers are all userland, this specification which nobody is working on is probably one of the most important things in the development of computer science right now. Hell, I should go back to university and take some classes so I could work on it. Talk about making history.

    Slashdotters constantly bitch and moan about how slow Hurd's progress has been, but all they have to do is send in a patch or write a doc or something. I personally ported GNU Pth to Hurd some years back making me (in my mind) one of the first people to ever compile and run a pthread app on Hurd (slooooowww). Hehe, but I did make pseudo-history in the world of computer science because of that stupid couple days I spent fiddling around with autoconf.

    L4/Hurd development is total anarchy. Work on whatever you feel like and send in patches. You don't have to "join GNU" or any such nonsense. In fact I have never ever seen RMS post to any Hurd developer list ever. He's more likely to post here.

    Slashdotters seem to think that Hurd is RMS's little empire, but in fact he has about nothing to do with it. Marcus Brinkmann right now is probably the unofficial leader of Hurd just because he has personally written most of the really hardcore stuff.

    --
    Clickety Click ...
  11. Re:As a total Cell/PS2-coding n00b... by Herbmaster · · Score: 3, Informative
    [Re: any given for loop being parallelizable]

    A fair question, but no. Consider for example an iterative factorial algorithm:
    for (i=1;i<=n;i++) {
        m = m * i;
    }
    Totally unparallelizable.
    This is a case where to execute the next step, you absolutely need the results of the previous step to be completed. There can be other kinds of reasons for this:
    for (i=0;i<n;i++) {
        i = f(i);
    }
    In this case you don't even know how many times the loop is going to execute in advance. Now, maybe if you're clever you can figure it out, but what if f() is return (rand() * i);? Ick.
    To make matters worse, C lets you use pointers and do whatever you want. So given some set of instructions, there could be side effects on i (or n) that are totally unpredictable without executing the program.
    What you're looking for - the problem I'm describing - is not a problem with gcc. It's a problem with the C language. If you want to get rid of side-effects and make parallelization easy, try using a pure functional language. But people don't like programming in pure functional languages (well, I don't), they like programming in C (or other procedural-style language).
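
    A small C sketch of the pointer problem, with a made-up function name (`add_one`). The loop body looks like every iteration is independent, but whether it really is depends on what the caller passes in, which the compiler generally cannot know when it compiles the function.

```c
#include <stddef.h>

/* Looks trivially parallel: each dst[i] depends only on src[i].
 * But nothing stops dst and src from overlapping. */
void add_one(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}
```

    With separate arrays, `add_one(b, a, 4)` on an all-zero `a` fills `b` with 1s, and the iterations could run in any order. With overlapping pointers, `add_one(buf + 1, buf, 4)` on an all-zero `buf` of five ints produces 0, 1, 2, 3, 4: each iteration reads what the previous one just wrote, so splitting the loop across processors would change the answer.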
    --
    I'm not a smorgasbord.
  12. Re:Not useful for scientific computing by marcoz76 · · Score: 3, Informative

    SPEs (CELL SIMD processors..) have double precision units! IBM will discuss DP units for CELL today or tomorrow at ISSCC.