Slashdot Mirror


Ars Technica's Hannibal on IBM's Cell

endersdouble writes "Ars Technica's Jon "Hannibal" Stokes, known for his many articles on CPU technology, has posted a new article on IBM's new Cell processor. This one is the first part of a series, and covers the processor's approach to caching and control logic. Good read."

24 of 449 comments

  1. Part II is up now by Anonymous Coward · · Score: 5, Informative

    Part II is up as well.

  2. Re:Apple? by sholden · · Score: 2, Informative

    My 7 year old PC (300mhz PII) runs everything I need on a daily basis pretty well.

    Firefox, wily, gcc, python, perl, MS office, gimp and so on.

  3. In addition to the pornography... by Anonymous Coward · · Score: 2, Informative

    ...clicking on this link also attempts to install a trojan (SARC's name: ByteVerify). I agree: this link should be removed and the poster's IP should be reported to the relevant authorities.

    1. Re:In addition to the pornography... by mfreed · · Score: 2, Informative

      nyud.net refers to a semi-open, peer-to-peer content distribution network called CoralCDN that is essentially a distributed web cache. We serve > 10 M requests daily for 100,000s of clients. For more information about this research project, please see:

      http://www.coralcdn.org/

      Basically, when you see a URL like you reported, it means that the content is actually from (stripping out the .nyud.net:8090):

      http://minigirls.biz/

      Thus, if you think you've seen evidence of child abuse, you should get in touch with the operators of minigirls.biz.

      > whois minigirls.biz
      Domain Name: MINIGIRLS.BIZ
      Domain ID: D8278609-BIZ
      Sponsoring Registrar: DIRECT INFORMATION PVT. LTD.,
      Sponsoring Registrar IANA ID: 303
      Registrant ID: DI_356733
      Registrant Name: Michael Pirson
      Registrant Organization: Megaaliance Inc
      Registrant Address1: 386 West Side St.
      Registrant City: Chicago
      Registrant State/Province: Il
      Registrant Postal Code: 26549
      Registrant Country: United States
      Registrant Phone Number: +91.226370256
      Registrant Email: mr.b_m@rambler.ru

      Note that CoralCDN does not provide archival storage of content, unlike google.com's cache or archive.org. Much like a web cache or "content accelerator" at ISPs, CoralCDN only keeps data temporarily in its file caches, either until the data expires or until it is evicted (as may occur for unpopular data).

      If the origin site is no longer online or the particular content returns some HTTP error message, CoralCDN will only serve the old data for at most a short time (24 hours). Thus, if you believe that a website is making infringing/illegal content available, please direct any notices to that particular website. When that origin site complies with the notice, the content in question will naturally be removed from CoralCDN's caches through purely automated technical means in at most 24 hours.
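      The URL rewriting described above can be sketched in C. This is only an illustration, not CoralCDN code: the helper name coral_to_origin is made up for the example and www.example.com is a placeholder host. The only thing taken from this comment is the .nyud.net:8090 suffix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Suffix appended to the origin hostname, as described in the comment. */
#define CORAL_SUFFIX ".nyud.net:8090"

/*
 * Copy `url` into `out` with the first occurrence of CORAL_SUFFIX removed.
 * Returns 1 if the suffix was found and stripped, 0 if the URL was copied
 * unchanged. coral_to_origin is a hypothetical helper, not a CoralCDN API.
 */
int coral_to_origin(const char *url, char *out, size_t out_size)
{
    assert(url != NULL && out != NULL);
    assert(strlen(url) < out_size);   /* the result is never longer than the input */

    const char *hit = strstr(url, CORAL_SUFFIX);
    if (hit == NULL) {
        strcpy(out, url);             /* not a Coral-ized URL: pass it through */
        return 0;
    }

    size_t prefix_len = (size_t)(hit - url);
    memcpy(out, url, prefix_len);                          /* scheme and origin host */
    strcpy(out + prefix_len, hit + strlen(CORAL_SUFFIX));  /* path and anything after it */
    return 1;
}
```

      Called on "http://www.example.com.nyud.net:8090/index.html", this yields "http://www.example.com/index.html". A URL without the suffix is copied through unchanged.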

  4. Re:Apple? by Tropaios · · Score: 5, Informative

    From the article:

    The Cell and Apple

    Finally, before signing off, I should clarify my earlier remarks to the effect that I don't think that Apple will use this CPU. I originally based this assessment on the fact that I knew that the SPUs would not use VMX/Altivec. However, the PPC core does have a VMX unit. Nonetheless, I expect this VMX to be very simple, and roughly comparable to the Altivec unit of the first G4. Everything on this processor is stripped down to the bare minimum, so don't expect a ton of VMX performance out of it, and definitely not anything comparable to the G5. Furthermore, any Altivec code written for the new G4 or G5 would have to be completely reoptimized due to the in-order nature of the PPC core's issue.

    So the short answer is, Apple's use of this chip is within the realm of conceivability, but it's extremely unlikely in the short- and medium-term. Apple is just too heavily invested in Altivec, and this processor is going to be a relative weakling in that department. Sure, it'll pack a major SIMD punch, but that will not be a double-precision Altivec-type punch.

  5. Re:More info in these slides by Anonymous Coward · · Score: 1, Informative

    A larger chip is more expensive to produce: fewer chips fit on a single wafer.

  6. Re:Apple? by Chemical · · Score: 2, Informative
    I've got a 400Mhz iMac that a friend gave me, and while it does run Panther, Safari, Quicktime, and iTunes, it struggles with all of them. Flash animations stutter, iTunes skips if you try and do anything else while using it. It is incapable of decoding a 640x480 Divx file fast enough to actually play it.

    For browsing simple websites or writing emails it works acceptably. For anything even remotely multimedia related, it is rendered useless.

    Meanwhile a 400Mhz PII running Windows 2K can play flash, mp3s, and Divx files just fine.

  7. Re:How do I code this thing?? by Space+cowboy · · Score: 4, Informative


    The architecture of the Cell looks like a much-improved PS2 system, with the PS2's vu0 and vu1 (vector units 0 and 1) replaced by 8 SPEs. Also, the programmable DMA (with chaining ability, allowing it to sequence multiple DMA events one after the other etc.) looks very similar to the PS2's.

    If that turns out to be the case, then PS2 programming is a hint towards how it'll work. On the PS2, you generally configured the DMA controller to upload mini programs to the vector units, then DMA-chained data as streams from RAM through the just-uploaded program and onto the destination (usually the GS which rasterised the display).

    On the Cell, it looks as though you can DMA-chain code & data through multiple SPE's and ultimately back to RAM/the PPC core/whatever is memory mapped. This is cool - it's software pipelining :-)

    So, my guess is that the PPC acts as a (DMA, IO, etc.) controller (much like the mips chip did in the PS2), and the heavy lifting goes on in the vector units, with code and data being streamed in on demand.
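    The streaming idea in the paragraphs above can be sketched in plain C. This is a guess at the shape of the model, not Cell or PS2 code: stage_fn, run_chain and the two toy stages are hypothetical names, CHUNK is an arbitrary size, and everything runs sequentially on one CPU. On real hardware each stage would sit on its own vector unit and the chunks would be moved by DMA.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 4   /* elements per chunk; arbitrary, chosen for illustration */

/* A stage transforms one chunk in place. It stands in for a mini program
 * uploaded to one vector unit. */
typedef void (*stage_fn)(float *chunk, size_t n);

void scale_by_two(float *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        chunk[i] *= 2.0f;
}

void add_one(float *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        chunk[i] += 1.0f;
}

/*
 * The "controller": walk the data in chunks and hand each chunk through
 * every stage in order. Here the stages run one after another; in a real
 * software pipeline, stage 1 would work on chunk k while stage 0 already
 * works on chunk k+1.
 */
void run_chain(float *data, size_t n, const stage_fn *stages, size_t n_stages)
{
    assert(data != NULL && stages != NULL);
    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        for (size_t s = 0; s < n_stages; s++)
            stages[s](data + off, len);
    }
}
```

    With the chain {scale_by_two, add_one}, an input of 0, 1, 2, 3, 4, 5 comes out as 1, 3, 5, 7, 9, 11.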

    It's a different model to normal programming, and as far as I can see it encourages you to be closer to the metal (ie: it's harder, I normally expect my L1 cache to take care of itself...), but assuming they release/port gcc for the SPE's, it might not be too hard if you're used to event-driven highly-threaded programming. Let's just hope they release a Linux port and 'vcl' so we can do something useful with the vector units...

    Oh, and if the xbox was a target for a self-hosting linux solution, I think the Cell will be irresistible :-)

    Simon

    --
    Physicists get Hadrons!
  8. similar technology... by morcheeba · · Score: 3, Informative

    Cradle Semiconductor has been working for a while on a similar technology.

    Of course, it's all a matter of scale - TI had a 4 DSP, 1 CPU processor a while ago, but it only made 100 MFLOPS. Cradle's first product has 8 DSPs and 6 CPUs - depending on if you can get your data to properly pipeline through the processors, you can achieve up to 3.6 GFLOPs peak with only a 230 MHz clock.

  9. Re:Apple? by prockcore · · Score: 4, Informative

    My old 600mhz g3 ibook runs panther, safari, quicktime, iphoto, itunes and everything else I need on a daily basis pretty well. Try saying that about a five year old PC.

    5 year old? Your 600mhz g3 ibook came out October 2001. That machine is just a few months over 3 years old.

    In October of 2001, the P4 was at 2.0ghz, and the Athlon 2000+ was just coming out. Are you going to tell me that a 2ghz P4 isn't adequate for browsing the web, listening to mp3s and importing digital photos?!

  10. Re:How do I code this thing?? by adam31 · · Score: 4, Informative
    This is similar to the 'scratchpad' RAM that Sony used in the PS2 and PS1. It's 16kb of on-chip (super-fast) memory that can be loaded and manipulated by the programmer, completely separate from the jurisdiction of the cache (which can cause big headaches-- think cache writeback with stale data).

    We'd do our skeletal animation skinning with this. DMA a bunch of verts to scratchpad, transform and weight them on the VU, DMA back to a display list. The thing is, there's really no high-level language support for this... the onus is on the programmer to schedule and memory map everything, mostly in assembly.
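    The copy-in, transform, copy-out pattern described above looks roughly like this in portable C. It is a sketch under loud assumptions: memcpy stands in for DMA (a real DMA engine runs asynchronously), a static array stands in for the 16 KB scratchpad, and vec3, dma_copy and transform_verts are hypothetical names. The scale-and-offset "transform" is a placeholder for real skinning math.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { float x, y, z; } vec3;

#define SCRATCH_BYTES (16 * 1024)                      /* 16 KB, as in the comment */
#define SCRATCH_VERTS (SCRATCH_BYTES / sizeof(vec3))   /* vertices that fit at once */

/* Stand-in for a DMA transfer. */
void dma_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Process `count` vertices in scratchpad-sized batches. */
void transform_verts(vec3 *out, const vec3 *in, size_t count, float scale, vec3 offset)
{
    static vec3 scratch[SCRATCH_VERTS];   /* stand-in for on-chip scratchpad RAM */

    assert(out != NULL && in != NULL);
    for (size_t done = 0; done < count; done += SCRATCH_VERTS) {
        size_t n = count - done;
        if (n > SCRATCH_VERTS)
            n = SCRATCH_VERTS;

        dma_copy(scratch, in + done, n * sizeof(vec3));    /* main RAM -> scratchpad */
        for (size_t i = 0; i < n; i++) {                   /* work on fast local memory */
            scratch[i].x = scratch[i].x * scale + offset.x;
            scratch[i].y = scratch[i].y * scale + offset.y;
            scratch[i].z = scratch[i].z * scale + offset.z;
        }
        dma_copy(out + done, scratch, n * sizeof(vec3));   /* scratchpad -> output list */
    }
}
```

    The point of the pattern is that the programmer, not a cache, decides what sits in fast memory and when. That is also why the batch size and the transfers have to be scheduled by hand.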

    The design of the cell-- it's incredible. It's every game programmer's wet dream. I just don't see how it's going to be as useful in other areas though. It's going to be a compiler-writer's nightmare, and to get real performance from the SPEs is going to take a lot of assembly or a high-level language construct that I haven't seen yet.

  11. Re:Hannibal by cooldev · · Score: 2, Informative
    What are you talking about? Obviously you can't eat a jet aircraft.

    Zen, your Google-fu is weak: http://en.wikipedia.org/wiki/Michel_Lotito :)

    Lotito's performances are the consumption of metal, glass, rubber and so on in items such as bicycles, televisions, a Cessna 150, and smaller items which are disassembled, cut-up and swallowed. The aircraft took roughly two years to be 'eaten' from 1978 to 1980. He began eating unusual material while a child and has been performing publicly since 1966.
  12. Not if the CPU is too expensive. by Sycraft-fu · · Score: 2, Informative

    New consoles are sold at a loss, but there's a limit to how much of a loss companies can take. If the CPU itself ends up costing Sony $300+, they'd be looking at a massive loss on the consoles, probably larger than they are willing to take. That was actually a noted problem with the X-box, the loss per unit was large so they had to sell quite a few games per unit to make it up. I'm not even sure if they made any money on it.

    Well, in MS's case, they can pull shit like that. Microsoft makes loads of cash off their software division, and has loads already in the bank. They can afford to operate a new division at a loss, even a pretty substantial loss (if the X-box division did lose money, it wasn't a large amount).

    Sony, not so much. Their Playstation division is their biggest money maker these days. So they can afford to take a loss on console hardware, but only so much that they know they'll make it back on games. They can't risk operating the division at a loss because it'd spell serious trouble for the company. They also aren't flush with cash. They've about $10 Billion, but have $12 Billion or so in debt (Microsoft has $34 Billion and no debt to speak of). They have to keep the money rolling in or things get ugly.

    Also we know from history that having the fastest processor or shiniest graphics isn't what wins a given round of the console wars. It's all about games, and perception.

    Now who knows on pricing at this point, but the grandparent has a good point. That is a massive god damn die, like P4EE sized or so. Hot and expensive. As die size goes up, so do failure rates and thus cost, especially at high clock speeds. Hence why the EEs cost so damn much. I'd say it's a safe bet that this cell processor isn't going to be cheap.

    From the sounds of it, it's not going to need to be. Sounds like it's a high end calculation chip for badass number crunchers. Given that Power4/5s and Itanium 2s are popular for that sort of thing, people in those apps won't bat an eye at a $1000+ price tag.

  13. Re:More info in these slides by WoTG · · Score: 3, Informative

    In CPU die sizes, 200 mm² is pretty big. IIRC, newer Athlons bump around 100 mm² depending on the cache size. P4s are somewhat larger than the Athlons. Bigger chips use more material and fab space; plus, the per-chip defect rate rises (it only takes a single error in a critical part of the chip to ruin it).
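    The two effects (fewer dies per wafer, and more dies lost to defects) can be put into rough numbers. This is a back-of-the-envelope sketch only: the 300 mm wafer and the 50 defects per wafer are assumed figures, the function names are made up, and the defect model is deliberately crude (each defect is assumed to land on a different die and kill it).

```c
#include <assert.h>

#define PI 3.14159265358979323846

/* Upper bound on dies per wafer: wafer area / die area, rounded down.
 * Ignores edge loss and the spacing between dies. */
long gross_dies(double wafer_diameter_mm, double die_area_mm2)
{
    assert(wafer_diameter_mm > 0.0 && die_area_mm2 > 0.0);
    double r = wafer_diameter_mm / 2.0;
    return (long)(PI * r * r / die_area_mm2);
}

/* Crude worst case: every defect ruins a different die. */
long good_dies_lower_bound(long gross, long defects)
{
    return (defects >= gross) ? 0 : gross - defects;
}
```

    Under those assumptions a 300 mm wafer holds at most 706 dies of 100 mm² or 353 dies of 200 mm². With 50 defects per wafer that leaves at least 656 versus 303 good dies, so doubling the die area slightly more than doubles the wafer cost carried by each good chip.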

  14. Re:Eliminating Instruction Window by taniwha · · Score: 3, Informative
    Read it more carefully: they don't eliminate the instruction window, they set it to 2. They can decode exactly 2 instructions/clock (provided they meet some simple dependency rules between the instructions), which makes for easy decode trees and fast cycle times.

    This isn't even a general purpose processor (no MMUs on the cells either in the traditional sense) nor have they gone superscalar - they have enough registers to keep the thing busy, software can figure that out - this isn't even that new an idea, a cell looks a lot like one of the media processors that was being sold 5-6 years ago

    You're right it's not designed to be a scientific processor - but then high precision scientific processing is a tiny market these days - way more people want to pay for fast gaming platforms than want to do fluid dynamics or what have you

  15. Top 7 Myths of the New Cell Processor: by Modab · · Score: 5, Informative
    There are so many people saying dumb things about the Cell and the upcoming PS3, I have to set some things straight. Here goes:
    1. The Cell is just a PowerPC with some extra vector processing.
      Not quite. The Cell is 9 complete yet simple CPUs in one. Each handles its own tasks with its own memory. Imagine 9 computers each with a really fast network connection to the other 8. You could probably treat them as extra vector processors, but you'd then miss out on a lot of potential applications. For instance, the small processors can talk to each other rather than work with the PowerPC at all.
    2. Sony will have to sell the PS3 at an incredible loss to make it competitive.
      Hardly. Sony is following the same game plan as they did with their Emotion Engine in the PS2. Everyone thought that they were losing 1-200 bucks per machine at launch, but financial records have shown that besides the initial R&D (the cost of which is hard to figure out), they were only selling the PS2 at a small loss initially, and were breaking even by the end of the first year. By fabbing their own units, they took a huge risk, but they reaped huge benefits. Their risk and reward is roughly the same now as it was then.
    3. Apple is going to use this processor in their new machine.
      Doubtful. The problem is that though the main CPU is PowerPC-based like current Apple chips, it is stripped down, and the Altivec support will be much lower than in current G5s. Unoptimized, Apple code would run like a G4 on this hardware. They would have to commit to a lot of R&D for their OS to use the additional 8 processors on the chip, and redesign all their tweaked Altivec code. It would not be a simple port. A couple of years to complete, at least.
    4. The parallel nature will make it impossible to program.
      This is half-true. While it will be hard, most game logic will be performed on the traditional PowerPC part of the Cell, and thus normal to program. The difficult part will be concentrated in specific algorithms, like a physics engine, or certain AI. The modular nature of this code will mean that you could buy a physics engine already designed to fit into the 128k limitation of the subprocessor, and add the hooks into your code. Easy as pie.
    5. The Cell will do the graphics processing, leaving only rasterization to the video card.
      Most likely false. The high-end video cards coming out now can process the rendering chain as fast as the Cell can, looking at the raw specs of 256 GFLOPS from the Cell, as opposed to about 200 GFLOPS from video cards. In two years, video cards will be capable of much more, and they are already optimized for this, where the Cell is not, so video cards will perform closer to the theoretical limits.
    6. The OS will handle the 8 additional vector processors so the programmer doesn't need to.
      Bwahahaha! No way. This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game. Letting an OS predictively determine what code needs to get sent to what processor to run is insane in this case. The cost of switching out instructions is going to be very high, so any switch will need to be carefully considered by the designer, or the frame-rate will hit rock-bottom.
    7. The Cell chip is too large to fab efficiently.
      This is one myth that could be correct. The Cell is huge (relatively), and given IBM's problems in the recent past with making large, fast PowerPC chips, it's a huge gamble on the part of all parties involved that they can fab enough of these things.
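    For context on the raw-spec comparison in myth 5: a peak figure like 256 GFLOPS is usually just a product of unit count, SIMD width, operations per lane per cycle, and clock speed. The inputs used below (8 vector units, 4 single-precision lanes each, a fused multiply-add counted as 2 operations, and a 4 GHz clock) are assumptions made for the sake of the arithmetic and are not stated in this thread. peak_gflops is a hypothetical helper.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS: every lane of every unit busy on every cycle. */
double peak_gflops(int units, int simd_lanes, int flops_per_lane_per_cycle, double clock_ghz)
{
    assert(units > 0 && simd_lanes > 0 && flops_per_lane_per_cycle > 0 && clock_ghz > 0.0);
    return units * simd_lanes * flops_per_lane_per_cycle * clock_ghz;
}
```

    peak_gflops(8, 4, 2, 4.0) works out to 256. A number built this way says nothing about how close real code gets to it, which is the distinction the myth 5 answer draws when it talks about theoretical limits.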
    1. Re:Top 7 Myths of the New Cell Processor: by Anonymous Coward · · Score: 2, Informative

      Another myth: The SPUs have 128k of local ram. It's actually 256k ;)

    2. Re:Top 7 Myths of the New Cell Processor: by fitten · · Score: 3, Informative

      Your points #4 and #6 almost conflict...

      "Easy as pie."

      and

      "This is a delicate bit of coding that is going to need to be tweaked by highly-paid coders for every single game."

      I know that you are talking, sort of, about two different things, but they are related. While it may be "easy as pie" to add the hooks into your code to call what is essentially a library, making sure that library is scheduled, running, running in the right place and on the right data, and synchronized with everything else in the right ways, is the hard part (which you kind of glossed over in #4).

      Another myth:

      X. This architecture is "brand new".
      Personally, I worked on a system that was very similar to this but a little more discrete. The board had a single PPC microcontroller type CPU (integer only 32-bit) that was the 'boss' and also a single chip package of eight DSPs, all with their own local share of memory (not cache, but memory just like here) and each had some high speed DMA engines that connected each DSP to other DSPs in the package in a certain configuration. The 'boss PPC' would farm out tasks to the DSPs, which could work either singularly or in parallel with other DSPs (given the code as written) to crunch numbers. Other than advances in processes that have made the cores in the Cell have more features and functionality and the fact that the PPC was on a separate chip from the DSPs, the architecture is very, very similar and, I will bet, the programming will be similar (it wasn't easy).

  16. Re:Golden oppourtunity for L4/Hurd by The_Dougster · · Score: 3, Informative
    Like everything else with the Hurd, it'll come in time. I'd do something with it, but I don't have a clue as how I'd write a device driver, much less an interface for one.
    Likewise. I'm in kind of a strange position as I am keenly interested in stuff like this, yet this really isn't my personal genre.

    The L4/Hurd guys are talking about "Deva" which is their vaporous specification for a driver interface. Since Hurd's drivers are all userland, this specification which nobody is working on is probably one of the most important things in the development of computer science right now. Hell, I should go back to university and take some classes so I could work on it. Talk about making history.

    Slashdotters constantly bitch and moan about how slow Hurd's progress has been, but all they have to do is send in a patch or write a doc or something. I personally ported GNU Pth to Hurd some years back making me (in my mind) one of the first people to ever compile and run a pthread app on Hurd (slooooowww). Hehe, but I did make pseudo-history in the world of computer science because of that stupid couple days I spent fiddling around with autoconf.

    L4/Hurd development is total anarchy. Work on whatever you feel like and send in patches. You don't have to "join GNU" or any such nonsense. In fact I have never ever seen RMS post to any Hurd developer list ever. He's more likely to post here.

    Slashdotters seem to think that Hurd is RMS's little empire, but in fact he has about nothing to do with it. Marcus Brinkmann right now is probably the unofficial leader of Hurd just because he has personally written most of the really hardcore stuff.

    --
    Clickety Click ...
  17. Re:As a total Cell/PS2-coding n00b... by Herbmaster · · Score: 3, Informative
    [Re: any given for loop being parallelizable]

    A fair question, but no. Consider for example an iterative factorial algorithm:
        m = 1;
        for (i = 1; i <= n; i++) {
            m = m * i;   /* needs the m from the previous iteration */
        }
    Totally unparallelizable.
    This is a case where to execute the next step, you absolutely need the results of the previous step to be completed. There can be other kinds of reasons for this:
        for (i = 0; i < n; i++) {
            i = f(i);    /* the loop counter itself depends on f() */
        }
    In this case you don't even know how many times the loop is going to execute in advance. Now, maybe if you're clever you can figure it out, but what if f() is return (rand() * i);? Ick.
    To make matters worse, C lets you use pointers and do whatever you want. So given some set of instructions, there could be side effects on i (or n) that are totally unpredictable without executing the program.
    What you're looking for - the problem I'm describing - is not a problem with gcc. It's a problem with the C language. If you want to get rid of side-effects and make parallelization easy, try using a pure functional language. But people don't like programming in pure functional languages (well, I don't), they like programming in C (or other procedural-style language).
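    The distinctions drawn above can be made concrete with a small C sketch. All function names here are made up for illustration and the recurrence constants are arbitrary. The first loop has independent iterations, the second has a loop-carried dependency, and the last pair shows how pointer aliasing can quietly turn an independent-looking loop into a dependent one.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so the range can be
 * split up and the pieces run in any order (or on different processors). */
void square_range(int *out, const int *in, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        out[i] = in[i] * in[i];
}

/* Loop-carried dependency: step k needs the x produced by step k-1. */
unsigned iterate(unsigned x, size_t steps)
{
    for (size_t k = 0; k < steps; k++)
        x = x * 1664525u + 1013904223u;   /* arbitrary recurrence */
    return x;
}

/* Looks independent, but if dst and src overlap (say dst == src + 1),
 * writing dst[i] changes the src[i + 1] that a later iteration reads.
 * The compiler cannot rule that out just by looking at this function. */
void add_one_forward(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* Same body, opposite iteration order. */
void add_one_backward(int *dst, const int *src, size_t n)
{
    for (size_t i = n; i-- > 0; )
        dst[i] = src[i] + 1;
}
```

    With int f[5] = {0}, calling add_one_forward(f + 1, f, 4) leaves 0, 1, 2, 3, 4, while add_one_backward on the same setup leaves 0, 1, 1, 1, 1. If iteration order changes the answer, the iterations are not independent, and a compiler that cannot prove the pointers don't overlap has to assume the worst.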
    --
    I'm not a smorgasbord.
  18. Re:Not useful for scientific computing by marcoz76 · · Score: 3, Informative

    SPEs (CELL SIMD processors..) have double precision units! IBM will discuss DP units for CELL today or tomorrow at ISSCC.

  19. Re:More info in these slides by i41Overlord · · Score: 2, Informative

    The reason it has so many transistors is because of the amount of onboard memory. Memory uses a lot more transistors than the logic circuits do.

    A complicated CPU may have tens or hundreds of millions of transistors, but a single memory chip has billions.

    So when you bump up the cache size on a CPU, the transistor count goes up greatly.
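    A rough count shows how quickly on-chip memory eats transistors. The sketch assumes a conventional 6-transistor SRAM cell per bit and counts storage cells only (no decoders, sense amps or tags). The 8 x 256k local stores are the figure quoted elsewhere in this thread, and sram_transistors is a hypothetical helper.

```c
#include <assert.h>

/* Transistors in the storage cells of `kib` KiB of SRAM. */
unsigned long long sram_transistors(unsigned long long kib, unsigned transistors_per_bit)
{
    assert(transistors_per_bit > 0);
    return kib * 1024ULL * 8ULL * transistors_per_bit;
}
```

    sram_transistors(8 * 256, 6) comes to 100,663,296, so under these assumptions the local stores alone account for roughly 100 million transistors before any cache or logic is counted.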

  20. Re:If Sony can, Apple can by mrseigen · · Score: 2, Informative

    XCode 2.0 is actually supposed to automatically "vectorize" programs for better optimization with altivec (check the Tiger page for it).