Slashdot Mirror


IBM Full-System Simulator Team Speaks Out

Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM as codeword Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."

18 of 115 comments

  1. PS3? by raingrove · · Score: 3, Funny

    Does this mean we can emulate PS3? lol

    1. Re:PS3? by garrett714 · · Score: 5, Informative

      Yes and no.

      While this "simulator" is basically an emulation of the Cell hardware, it won't allow people to run games at full speed. It's more of a developer tool that allows programmers to start coding for the PS3 when they don't actually have the hardware yet. Still, it is reasonable to believe that emulation of the PS3 will be viable in the future (although not for a long time).

  2. Re:mambo? by donour · · Score: 2, Informative

    I'm a moron. I should have read the link closer.

  3. Re:You WANT A Cell System... by smashr · · Score: 3, Insightful

    Running Linux on one of these things is simply INSANE.

    I have been through a lot of chip transitions over the years and been impressed with the leaps each new generation has made.

    But Cell is something entirely different. It is such a HUGE leap in performance beyond x86 systems that to go back to using a x86 machine is unthinkable now for me. I almost feel drunk from the power I have at my hands...

    Read up all the Cell info you can at IBM's site and read the various patents IBM, Toshiba, and Sony have out there. And find some way to get your hands on one of these...

    I can now see why the PS3 stuff we are seeing is so amazing...


    Sure, the cell is amazing, IF you are doing the right things. You say that you simply want to leave the old x86 architecture behind but the truth of the matter is that the two do not even begin to compare.

    It is not simply a matter of saying "OMG my cell has 8 cores at 4 GHz". The main Power Processing Element is crippled at best for simple single threaded applications -- roughly equivalent to a PowerPC of the G3 era, but with strictly in-order execution. The SPEs (the other 8 cores) are essentially mini vector computers. They can perform a massive amount of floating point calculations in parallel; however, they do not have an innate ability to deal well with all sorts of code, as a standard x86 CPU does.

    The Cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications. -- Don't throw out your x86 just yet.

  4. True, however by geekoid · · Score: 3, Insightful

    when the speed is fast enough that the single threaded applications run fast enough, even if technically crippled, will it matter?

    If Cell is what it claims to be, developers will write new applications to be multi-threaded. Compared to 15 years ago, multi-threading is a snap.

    --
    The Kruger Dunning explains most post on /. http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
  5. Re:You WANT A Cell System... by adisakp · · Score: 5, Informative

    Running Linux on one of these things is simply INSANE.

    I almost feel drunk from the power I have at my hands

    Here's some advice from someone who has access to a REAL CELL chip. I hate to disappoint you but aside from custom libraries specifically optimized for CELL, Linux ain't going to run fast on this machine. All the generic open source code targeted towards the general CPU is going to run faster on a Dual-Core Intel or Dual-Proc/Dual-Core Mac. The actual CPUs in this machine are simply pipelined (think Pentium I level of optimizations) vs current gen CPUs (P4 has out-of-order execution, speculative execution, register renaming, branch prediction, etc). While simple C code runs at roughly the same speed, complicated C++ constructs are running 2-10X slower on CELL's simplified PowerPC core versus the G5s you'll find in a Mac.

    Code needs to be rewritten specifically to take advantage of the actual SPEs/SPUs (Synergistic Processing Engines/Units - I prefer SPE since Sony calls their PS1/PS2 sound chip the SPU). Until those Linux libraries appear, CELL isn't going to run anything faster. Not to mention that it will have to be custom code libraries that DON'T run on the MAIN CPU since the SPEs execute different machine code.

  6. Re:Mambo - LOL by FooAtWFU · · Score: 2, Funny
    It's called a 'codename'. The real name is apparently 'IBM Full-System Simulator for the Cell Broadband Engine processor'.

    Yes, all of IBM's products are named like that. I mean, every now and again they try to go for something neat and spiffy sounding like "WebSphere", but then they have to munge it all up with "WebSphere Application Server" (WAS) and "WebSphere Client Technologies Mobile Edition" (WCTME) and so on and so forth. This is normal for IBM, and this is why they really need code-names.

    A related story out of IBM from a distinguished engineer I once made the acquaintance of... He's walking along one day and runs into one of his boss's boss's bosses or something like that. So he says, "I know how we can win the war on drugs." He explains: "We make all drugs legal... and assign exclusive marketing rights to the OS/2 marketing team." Boss-dude tells him he's an asshole; he shrugs: "But you got my point."

    --
    The World Wide Web is dying. Soon, we shall have only the Internet.
  7. Re:Where is my workstation! by garrett714 · · Score: 2, Informative

    I stand corrected. Here is a link to info about the cell based blade servers. One interesting thing to note is at the bottom of the page: "The OS used was Linux 2.6.11" So I guess that kinda disproves all the people saying Linux won't run well on the Cell.

  8. Praise for Cell by acidblood · · Score: 5, Informative

    I've been running the simulator here, and managed to port the distributed.net client to it. The performance of current cores in the PPE is so-so (worse than the G4 in my Mac Mini), although I'm sure it would improve by proper optimization. The SPE is a completely different matter though. I wrote an RC5-72 core for it that should achieve ~190 Mkeys/s on 8 SPEs at 3.2 GHz, which is by itself almost ten times faster than the current fastest processor (G5 at 2.7 GHz, which clocks at 20 Mkeys/s, IIRC). For embarrassingly parallel applications like key cracking, this thing is a dream.
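
    A back-of-the-envelope check of those figures, as a rough sketch. It takes the numbers quoted above at face value (190 Mkeys/s across 8 SPEs at 3.2 GHz, 20 Mkeys/s for a 2.7 GHz G5) and assumes the G5 figure is for a single core, which the post does not state. The helper name `cycles_per_key` is made up for illustration.

```python
def cycles_per_key(total_keys_per_s, cores, clock_hz):
    """Clock cycles one core spends per key, given aggregate throughput."""
    keys_per_core_per_s = total_keys_per_s / cores
    return clock_hz / keys_per_core_per_s

# Figures quoted in the post above (the poster's estimates, not new measurements).
CELL_KEYS_PER_S = 190e6
G5_KEYS_PER_S = 20e6

speedup = CELL_KEYS_PER_S / G5_KEYS_PER_S
spe_cycles = cycles_per_key(CELL_KEYS_PER_S, cores=8, clock_hz=3.2e9)
# Assumption: the G5 number is for one core.
g5_cycles = cycles_per_key(G5_KEYS_PER_S, cores=1, clock_hz=2.7e9)

print(f"speedup: {speedup:.1f}x")
print(f"SPE cycles per key: {spe_cycles:.1f}")
print(f"G5 cycles per key: {g5_cycles:.1f}")
```

    If the single-core assumption holds, each SPE spends about as many cycles per key as the G5 does (roughly 135), so most of the gain would come from having eight of them at a higher clock.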

    Some technical details: the SPE's instruction set could be thought of as `Altivec plus'. It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec. Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful to implement techniques like software pipelining which help mask these latencies. The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about it, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
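
    To make the latency-masking point concrete, here is a toy model, not the SPE's actual pipeline. It assumes a single-issue, in-order machine (the SPE is dual issue) and uses a latency of 4 cycles to match the shift/rotate figure above. It shows the effect that unrolling and software pipelining aim for: interleaving independent dependency chains so that the machine always has something ready to issue. The function name `finish_cycle` is hypothetical.

```python
def finish_cycle(num_chains, ops_per_chain, latency):
    """Cycle at which the last result is ready on a toy in-order,
    single-issue machine.

    Each chain is a sequence of operations where every operation needs the
    result of the previous one in the same chain. Operations are issued
    round-robin across chains, in program order, at most one per cycle.
    """
    ready = [0] * num_chains   # cycle when each chain's latest result is ready
    next_issue = 0             # earliest cycle the next instruction may issue
    for _ in range(ops_per_chain):
        for chain in range(num_chains):
            # In-order issue: stall until this chain's input is available.
            issue = max(next_issue, ready[chain])
            ready[chain] = issue + latency
            next_issue = issue + 1
    return max(ready)

print(finish_cycle(1, 4, 4))   # one chain of 4 ops: 16 cycles
print(finish_cycle(4, 4, 4))   # four interleaved chains, 16 ops: 19 cycles
```

    In this model, four times the work finishes in only three extra cycles. Keeping four chains live needs roughly four times the registers, which is where a 128-entry register file helps.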

    Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.

    --

    Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/

    1. Re:Praise for Cell by acidblood · · Score: 2, Informative
      Could you speak more to performance issues when dealing with code/data that exceeds the 256K SPU local store?

      I'll try, but take my opinion with a grain of salt as I didn't do anything beyond coding an RC5-72 core, which doesn't involve external memory accesses.

      It looks to me like fetches from RAM are a real bottleneck, so if you want performance you need to keep code/data within each SPU. If you can chain a series of algorithms and move data down the chain this is a win. But if you need to manipulate a huge data block you're SOL.

      Sure it'd be impossible to keep this thing completely fed, but I hear the RAM specs are pretty impressive, using some new-fangled XDR DRAM technology from Rambus. Still, the computational power of the SPEs is huge and it's sure to be RAM-starved unless the programmers take a lot of care.

      Do realize though that this thing has a monster 100 GB/s interconnect. I would gather sending reasonable amounts of data back and forth between the SPUs is feasible, so perhaps operating on 8*256 KB = 2 MB datasets might be possible.

      Beyond this, I think programmers would look at the Cell like they do at a NUMA box or clusters -- assume fetching remote data is costly and program to that paradigm. Not as costly as it is for clusters, even those with fancy interconnects; more like NUMA boxes. Hence, lots of blocking algorithms and stuff like 4-step FFTs. IBM is suggesting techniques using double-buffering which seem to be working well.
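
      A minimal sketch of the double-buffering idea mentioned above, assuming nothing about the real Cell SDK: a one-worker thread pool stands in for an asynchronous DMA engine, and the names `fetch_block`, `process_block` and `double_buffered_sum_of_squares` are made up for illustration. The point is only the ordering: request block N+1, then compute on block N while the request is in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(data, start, size):
    """Stand-in for an asynchronous transfer from main memory to local store."""
    return data[start:start + size]

def process_block(block):
    """Stand-in for the compute kernel that runs on data already in local store."""
    return sum(x * x for x in block)

def double_buffered_sum_of_squares(data, block_size=4):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prime the pipeline: request the first block before the loop starts.
        pending = dma.submit(fetch_block, data, 0, block_size)
        for next_start in range(block_size, len(data) + block_size, block_size):
            current = pending.result()            # wait for the earlier request
            if next_start < len(data):
                # Start fetching the next block before computing on this one.
                pending = dma.submit(fetch_block, data, next_start, block_size)
            total += process_block(current)       # overlaps with the fetch
    return total

print(double_buffered_sum_of_squares(list(range(10))))  # 285
```

      On real hardware the two buffers would be fixed regions of the local store that swap roles each iteration; here Python's list slices play that part.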

      I can see the Cell being a huge win for say a series of Monte Carlo sims running in each SPU, but it looks like a loss once you exceed local store.

      That depends on your workloads, in particular your access patterns. Sequential and blocking access patterns should do just fine.

      What makes me pretty hopeful about the potential performance of Cell is that we're currently getting by pretty well with our CPUs with fast L2 cache of similar size (256 KB was pretty common 3 or 4 years ago) and slow memory accesses. The situation is pretty similar with Cell, save that the local store is directly addressable as opposed to transparent like caches are, and I see that as a big win actually -- being able to manage the local store and only make explicit memory accesses should help spot and fix bottlenecks, without the need to worry whether the target CPU will have 512 KB or 1 MB or 2 MB of cache. Of course, having 8 high-clocked SPEs processing 128-bit vectors will impose a much higher burden on memory than your run-of-the-mill Pentium 4 currently does, but I'm hoping that XDR DRAM will be up to the challenge.

      But you seem to be saying that idle fetch cycles aren't so bad.

      You may be mixing things up. What I said was that local store accesses had a fixed latency of 6 cycles.

      you seem to be one of the few with real world experience posting here

      I don't think a couple of afternoons writing code qualifies as real-world experience, but there you go.
      --

      Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/

  9. Amazing Cell Demo by doctor_no · · Score: 5, Interesting

    Here is an impressive "virtual mirror" demo using the Cell processor put on by Toshiba. Basically, using a video camera, it can make a 3D model of the person in front of the camera on the fly. Then it can manipulate the 3D model to change make-up, hair-styles, etc, basically a virtual magic mirror. Really demonstrates the truly unique features these more powerful processors will offer.

    http://techon.nikkeibp.co.jp/lsi/images/toshiba_cell.mpg

    http://techon.nikkeibp.co.jp/english/NEWS_EN/20051013/109623/

    1. Re:Amazing Cell Demo by DigiShaman · · Score: 2, Funny

      Damn!! Was that really real-time? I'm almost wanting to call its bluff and say it was all choreographed. With the right AI program optimized for multi-threading, we could have HAL if enough CELL chips were thrown at it. It may be crude, but it's worth a shot! Imagine the real-world application.

      "HAL: how much unread e-mail do I have?"

      "HAL: please set my alarm for 7:30am"

      "HAL: using google maps, please tell me how many miles and ETA it will be going from X to Y"

      and my favorite...

      "HAL: based on historical trends in the stock market, what do you calculate as being the best investment for quick returns"

      --
      Life is not for the lazy.
  10. Re:You WANT A Cell System... by epine · · Score: 4, Insightful

    The Cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications.

    This analysis is incorrect, because it fails to recognize the fixed point. By sacrificing the out-of-order (OOO) mechanisms (which are brutal for heat production) they gained enough thermal headroom to effectively double the clock rate. In the same thermal envelope, you either get an OOO processor running at 2GHz with three or four issue pathways (three has been the rule under x86) and a very deep pipeline, or you get a processor running at 4GHz with two issue pathways and a relatively short pipeline.

    A deep pipeline grants (partial) immunity from stalls and bubbles. A short pipeline grants (partial) immunity from branch misprediction effects. To make the deep pipelines work well, huge investments are required in the branch-prediction unit, which is also infamous for throwing off a lot of heat.
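
    One way to put rough numbers on this trade-off is the textbook estimate: effective CPI = base CPI + (fraction of branches) x (misprediction rate) x (penalty in cycles). The sketch below uses that formula with entirely hypothetical inputs, picked only to show the shape of the argument. None of them are measurements of Cell, the Pentium 4 or any other processor, and the function names are made up for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Textbook estimate: base CPI plus the average misprediction stall."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

def ns_per_instruction(cpi, clock_ghz):
    """Average time per instruction in nanoseconds."""
    return cpi / clock_ghz

# Hypothetical numbers, chosen only to illustrate the trade-off.
deep = effective_cpi(base_cpi=0.5, branch_fraction=0.2,
                     mispredict_rate=0.05, penalty_cycles=20)
short = effective_cpi(base_cpi=1.0, branch_fraction=0.2,
                      mispredict_rate=0.05, penalty_cycles=6)

print(f"deep pipeline at 2 GHz:  {ns_per_instruction(deep, 2.0):.3f} ns/instruction")
print(f"short pipeline at 4 GHz: {ns_per_instruction(short, 4.0):.3f} ns/instruction")
```

    With these made-up inputs the short pipeline comes out ahead (0.265 ns against 0.350 ns). Raise its base CPI to 1.5 to model more stalls on unscheduled code and the result flips, which is really what the disagreement in this thread is about.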

    The main Power Processing Element is crippled at best for simple single threaded applications ...

    Fortunately for Cell, this is also the wrong denominator for use in this discussion. Applications might be single threaded, but systems are hardly ever single threaded. While the SPU processors handle audio, video, encryption, block I/O and other compute/bandwidth intensive primitives that most systems engage, they also offload cache pollution from the main Cell processor threads, both in the data space and in the task scheduling space.

    Nothing will ever best the Pentium IV for single thread peak performance with no calorie spared. News flash: Intel has already given up on this flawed approach. The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots.

    OOO served its purpose in the era where cycle time was paramount and the processor to cache cycle time ratios were in closer balance. Now that heat has become the limiting factor, we'll be seeing a lot less of that from all parties.

    The reality in silicon is that we need to start rethinking those portions of the code base which only perform well under an OOO execution regime.

    This can be accomplished at so many different levels. The entire OpenSSL library can be recoded for SPU coprocessors with massive speed gains. Existing code can be recompiled with modern compilers which exploit large register sets to offset lack of hardware-level OOO. Key algorithms in system libraries can be recoded using better algorithms or memory access patterns.

    Those of you who insist on putting all your eggs into one 100W single threaded basket, it's time to step off the Moore's law express train. Hope you enjoy the milk run.

  11. Re:x86, x86_64, or PPC best for mambo simulator? by jjd1_dement · · Score: 2, Informative

    Actually the 2GHz requirement is overstated. We (ich bin ein IBMer) have run the simulator on laptops in the 1GHz range without any problems. But don't let me ruin your excuse to get a nice new computer!

  12. Re:You WANT A Cell System... by RzUpAnmsCwrds · · Score: 2, Interesting


    "The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots."

    Not the case. Among other things, modern code is highly dependent on memory latency. P4 as of late hasn't even been getting 60% of clock; Opteron gets nearly 95%.

    Your whole argument is why Intel developed the Itanium. The idea of producing a simpler CPU that is thermally more efficient is a novel one, but time and again we find that you can't erase the last 15 years of CPU innovation. We're still driving gasoline cars, we're still using paper money, and the Opteron still remains highly competitive with the Itanium at a fraction of the transistor count.

  13. OOO isn't going away... by YesIAmAScript · · Score: 2, Insightful

    I do agree with your assessments of the value of non-OOO processors.

    But there's one thing OOO does that these processors will never do. That is efficiently run code that was not properly scheduled.

    Now, why would you generate code with the wrong scheduling? Well, you wouldn't do so on purpose. But in the field PCs frequently encounter it. This code is code that was scheduled for a different processor. As instruction latencies, CPU clocks and memory latencies change, the optimal instruction order changes.

    So on any system which has to run legacy code, OOO is necessary to have good performance.

    And that means PCs are unlikely to go to non-OOO processors soon. No company wants to have to be afraid to release a new processor because it won't run existing versions of Windows (or Mac OS X) as well as older machines because it hasn't been recompiled with a new scheduling. Remember what happened to Pentium Pro? It didn't run legacy code well, and unfortunately the popular OS at the time (Windows 95) was all legacy code.

    On the other hand, it makes total sense for a system like PS3 or Xbox 360 where there are a large number of examples of a system which are exactly the same, down to the RAM timings, and the code run on it was compiled specifically for it.

    Additionally, to mix in other arguments, I agree P IV could generate significant performance if it didn't run out of thermal headroom. You would need good caches and such but despite what the other poster says both Intel and AMD are affected similarly with memory latency and bandwidth issues. Perhaps AMD fares somewhat better. But not so much better that if the P4 were running at double its current clock rate that it wouldn't mop the floor with the AMD.

    --
    http://lkml.org/lkml/2005/8/20/95
  14. Re:You WANT A Cell System... by TheRaven64 · · Score: 2, Informative
    All of those features were introduced with the Pentium Pro, which was savaged at the time relative to the Pentium

    The Pentium Pro ran Windows NT much faster than an equivalent speed Pentium. A lot of the old 16-bit instructions, however, were microcoded rather than being natively executed, and took a few clocks longer. Since much legacy code at the time (games, anything with win16 roots including Windows 95) made use of 16-bit instructions, it ran slower. Comparing Windows NT 4 on a 200MHz Pentium Pro and a 200MHz Pentium (which wasn't available for a few years), the Pentium Pro won hands down. By the time the Pentium II (i.e. Pentium Pro MMX) was released, everyone was running 32-bit apps - the only 16-bit apps left were so old that people didn't mind that they were slower than native ones, since they were still much faster than they had been on any CPU designed to run them.

    The only differences between the Pentium Pro and the Pentium II were the addition of MMX, and the move of the cache from a separate die in the same package to a separate package on the same board, which allowed cache and CPU cores to be tested independently, improving yields.

    --
    I am TheRaven on Soylent News
  15. Any chance of seeing Cell on a PCI-X card? by CTachyon · · Score: 2

    As everyone seems to agree that running general-purpose code (e.g. Linux) on a Cell is going to be unpleasant thanks to the dumbing down of the PowerPC at the core, I was wondering what the odds are of seeing this as an add-on for doing vector-friendly operations. While I don't see people rushing out to install a Cell just for the hell of it, what are the chances that e.g. future crypto-offload accelerators or even 3D video cards might use one of these puppies?

    --
    Range Voting: preference intensity matters