IBM Full-System Simulator Team Speaks Out

Posted by ScuttleMonkey on Tuesday November 29, 2005 @11:04AM from the from-the-horses-mouth dept.

Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM by the code name Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."

Re:PS3? by garrett714 · 2005-11-29 11:16 · Score: 5, Informative

Yes and no.

While this "simulator" is basically an emulation of the Cell hardware, it won't allow people to run games at full speed. It's more of a developer tool that lets programmers start coding for the PS3 before they actually have the hardware. Still, it is reasonable to believe that emulation of the PS3 will be viable in the future (although not for a long time).
Re:mambo? by donour · 2005-11-29 11:27 · Score: 2, Informative

I'm a moron. I should have read the link closer.
Re:PS3? by Anonymous Coward · 2005-11-29 11:31 · Score: 0, Informative

There is virtually zero chance that any x86 system will ever be able to emulate even the first-generation Cell chip that is in the PS3 and in the server products from IBM and other companies that are starting to show up now.

First, neither Intel nor AMD will be shipping anything that even comes close to the ~256 Gflops (and whatever the integer performance number is) of the latest version of the Broadband Engine.

Second, x86 chips will never be able to emulate the internal ring bus in Cell chips. The killer ring bus inside the chip is really the key to the crazy performance people are getting out of Cell systems.

Intel and AMD have pretty much nothing on their roadmaps for the next decade but slapping additional cores together. And even if they could finally manage to get enough of their x86 cores onto one chip with the same amount of computational performance years from now, they will have nothing like the internal ring bus.

In other words, don't hold your breath waiting to emulate PS3 games on any x86 system...ever.
Re:You WANT A Cell System... by adisakp · 2005-11-29 11:37 · Score: 5, Informative

Running Linux on one of these things is simply INSANE.

I almost feel drunk from the power I have at my hands

Here's some advice from someone who has access to a REAL CELL chip. I hate to disappoint you, but aside from custom libraries specifically optimized for CELL, Linux ain't going to run fast on this machine. All the generic open source code targeted towards the general CPU is going to run faster on a Dual-Core Intel or Dual-Proc/Dual-Core Mac. The actual CPUs in this machine are simply pipelined (think Pentium I level of optimizations) vs current gen CPUs (P4 has out-of-order execution, speculative execution, register renaming, branch prediction, etc). While simple C code runs at roughly the same speed, complicated C++ constructs are running 2-10X slower on CELL's simplified PowerPC core versus the G5s you'll find in a Mac.

Code needs to be rewritten specifically to take advantage of the actual SPEs/SPUs (Synergistic Processing Engines/Units - I prefer SPE since Sony calls their PS1/PS2 sound chip the SPU). Until those Linux libraries appear, CELL isn't going to run anything faster. Not to mention that it will have to be custom code libraries that DON'T run on the MAIN CPU, since the SPEs execute different machine code.
Re:Where is my workstation! by garrett714 · 2005-11-29 12:24 · Score: 2, Informative

I stand corrected. Here is a link to info about the Cell-based blade servers. One interesting thing to note is at the bottom of the page: "The OS used was Linux 2.6.11". So I guess that kinda disproves all the people saying Linux won't run well on the Cell.
Praise for Cell by acidblood · 2005-11-29 12:34 · Score: 5, Informative

I've been running the simulator here, and managed to port the distributed.net client to it. The performance of current cores in the PPE is so-so (worse than the G4 in my Mac Mini), although I'm sure it would improve with proper optimization. The SPE is a completely different matter though. I wrote an RC5-72 core for it that should achieve ~190 Mkeys/s on 8 SPEs at 3.2 GHz, which is by itself almost ten times faster than the current fastest processor (G5 at 2.7 GHz, which clocks at 20 Mkeys/s, IIRC). For embarrassingly parallel applications like key cracking, this thing is a dream.
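
For anyone who wants to check the arithmetic behind the "almost ten times" figure, here is a minimal sketch. The two rates are the ones quoted above (the G5 number is an "IIRC" estimate, and the Cell number is a projection from the simulator); the function and constant names are made up for illustration.

```python
# Rates quoted in the comment above, in Mkeys/s. Both are the poster's
# estimates, not measured results.
CELL_MKEYS_PER_S = 190.0   # projected, 8 SPEs at 3.2 GHz
G5_MKEYS_PER_S = 20.0      # G5 at 2.7 GHz, from memory ("IIRC")
SPE_COUNT = 8


def per_unit_rate(total_rate, units):
    """Average throughput contributed by each of `units` identical units."""
    return total_rate / units


def speedup(new_rate, old_rate):
    """How many times faster `new_rate` is than `old_rate`."""
    return new_rate / old_rate


if __name__ == "__main__":
    print("per-SPE rate (Mkeys/s):", per_unit_rate(CELL_MKEYS_PER_S, SPE_COUNT))
    print("speedup over G5:", speedup(CELL_MKEYS_PER_S, G5_MKEYS_PER_S))
```

On these numbers a single SPE would be in the same ballpark as the whole G5, which is where the roughly tenfold gap for eight of them comes from.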

Some technical details: the SPE's instruction set could be thought of as "Altivec plus". It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec.

Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful to implement techniques like software pipelining which help mask these latencies.

The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about it, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
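
To make the latency-masking point concrete, here is a toy cycle-count model. It is a sketch under loud assumptions: a single-issue, in-order pipeline (the real SPE is dual issue), with a stall whenever an instruction's source register is not ready yet. The 4-cycle latency is the rotate latency quoted above; everything else, including the function names, is hypothetical. It shows plain interleaving of independent dependency chains, which is the effect that software pipelining and unrolling aim for, not a real software-pipelined loop.

```python
def finish_cycle(instrs):
    """Cycle at which the last result is ready on a single-issue, in-order pipe.

    `instrs` is a list of (dest, srcs, latency) tuples. An instruction issues
    no earlier than one cycle after the previous one, and no earlier than the
    cycle at which all of its source registers are ready.
    """
    ready = {}      # register name -> cycle its value becomes available
    next_issue = 0  # earliest cycle the next instruction may issue
    last = 0
    for dest, srcs, latency in instrs:
        start = max([next_issue] + [ready.get(s, 0) for s in srcs])
        ready[dest] = start + latency
        last = max(last, ready[dest])
        next_issue = start + 1
    return last


def chain(reg, length, latency):
    """`length` operations that each read and rewrite `reg` (fully dependent)."""
    return [(reg, (reg,), latency) for _ in range(length)]


def interleave(chains):
    """Round-robin the chains: one op from each chain, then the next op, etc."""
    return [op for group in zip(*chains) for op in group]


ROTATE_LATENCY = 4  # cycles, as quoted above for shifts and rotates

# Four independent chains of four rotates, each chain in its own register.
chains = [chain(r, 4, ROTATE_LATENCY) for r in ("a", "b", "c", "d")]
back_to_back = finish_cycle([op for c in chains for op in c])
interleaved = finish_cycle(interleave(chains))
```

In this model the interleaved schedule finishes far sooner than running the chains back to back, because each chain's stall cycles are filled with work from the others. Each extra chain in flight needs its own registers, which is where a 128-entry register file helps.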

Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.

--
Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/
1. Re:Praise for Cell by acidblood · 2005-11-29 14:28 · Score: 2, Informative
  
  Could you speak more to performance issues when dealing with code/data that exceeds the 256K SPU local store?
  
  I'll try, but take my opinion with a grain of salt as I didn't do anything beyond coding an RC5-72 core, which doesn't involve external memory accesses.
  
  It looks to me like fetches from RAM are a real bottleneck, so if you want performance you need to keep code/data within each SPU. If you can chain a series of algorithms and move data down the chain this is a win. But if you need to manipulate a huge data block you're SOL.
  
  Sure it'd be impossible to keep this thing completely fed, but I hear the RAM specs are pretty impressive, using some new-fangled XD-RAM technology from Rambus. Still, the computational power of the SPEs is huge and it's sure to be RAM-starved unless the programmers take a lot of care.
  
  Do realize though that this thing has a monster 100 GB/s interconnect. I would gather sending reasonable amounts of data back and forth between the SPUs is feasible, so perhaps operating on 8*256 KB = 2 MB datasets might be possible.
  
  Beyond this, I think programmers would look at the Cell like they do at a NUMA box or clusters -- assume fetching remote data is costly and program to that paradigm. Not as costly as it is for clusters, even those with fancy interconnects; more like NUMA boxes. Hence, lots of blocking algorithms and stuff like 4-step FFTs. IBM is suggesting techniques using double-buffering which seem to be working well.
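
A rough sketch of the double-buffering idea mentioned above, in plain Python rather than real SPE code. The `fetch` helper is a hypothetical stand-in for an asynchronous transfer from main memory into local store; here it is just a slice copy, so the overlap is only notional. The two timing functions are a back-of-the-envelope model that assumes a fixed fetch time and compute time per block.

```python
def double_buffered_sum(data, block):
    """Sum `data` block by block using two alternating buffers.

    The transfer for block i+1 is started before block i is processed, so on
    hardware with asynchronous transfers the two could overlap.
    """
    nblocks = (len(data) + block - 1) // block
    if nblocks == 0:
        return 0

    def fetch(i):
        # Hypothetical stand-in for an async main-memory -> local-store copy.
        return data[i * block:(i + 1) * block]

    buffers = [fetch(0), None]
    total = 0
    for i in range(nblocks):
        cur = i % 2
        if i + 1 < nblocks:
            buffers[1 - cur] = fetch(i + 1)  # start the next transfer first
        total += sum(buffers[cur])           # then work on the current block
    return total


def serial_time(n, fetch, compute):
    """Time for n blocks when each fetch must finish before compute starts."""
    return n * (fetch + compute)


def overlapped_time(n, fetch, compute):
    """Time for n blocks when fetching block i+1 overlaps computing block i."""
    if n == 0:
        return 0
    return fetch + (n - 1) * max(fetch, compute) + compute
```

The model suggests the best case is when compute time per block is at least the fetch time, since the transfers then hide almost entirely behind the computation. When fetch dominates, the SPE ends up waiting on memory no matter how the buffers are juggled.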
  
  I can see the Cell being a huge win for say a series of Monte Carlo sims running in each SPU, but it looks like a lose once you exceed local store.
  
  That depends on your workloads, in particular your access patterns. Sequential and blocking access patterns should do just fine.
  
  What makes me pretty hopeful about the potential performance of Cell is that we're currently getting by pretty well with our CPUs with fast L2 cache of similar size (256 KB was pretty common 3 or 4 years ago) and slow memory accesses. The situation is pretty similar with Cell, save that the local store is directly addressable as opposed to transparent like caches are, and I see that as a big win actually -- being able to manage the local store and only make explicit memory accesses should help spot and fix bottlenecks, without the need to worry whether the target CPU will have 512 KB or 1 MB or 2 MB of cache. Of course, having 8 high-clocked SPEs processing 128-bit vectors will impose a much higher burden on memory than your run-of-the-mill Pentium 4 currently does, but I'm hoping that XD-RAM will be up to the challenge.
  
  But you seem to be saying that idle fetch cycles aren't so bad.
  
  You may be mixing things up. What I said was that local store accesses had a fixed latency of 6 cycles.
  
  you seem to be one of the few with real world experience posting here
  
  I don't think a couple of afternoons writing code qualifies as real-world experience, but there you go.
  
  --
  Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/
Re:x86, x86_64, or PPC best for mambo simulator? by jjd1_dement · 2005-11-29 16:44 · Score: 2, Informative

Actually the 2GHz requirement is overstated. We (ich bin ein IBMer) have run the simulator on laptops in the 1GHz range without any problems. But don't let me ruin your excuse to get a nice new computer!
Re:Amazing Cell Demo by Anonymous Coward · 2005-11-29 21:01 · Score: 1, Informative

Apologies for A/C. This is probably a little less than a full 3D model construction. Having seen a real-time demo of a "morphable model", I think they almost certainly use priors on face shape.

"First, the applications capture a user's face with a camera and detect the position of key features of the face, including the eyes, nose and mouth, using image recognition technology."

this can be done real time quite effectively right now:

http://citeseer.ist.psu.edu/rd/95418640%2C476373%2C1%2C0.25%2CDownload/http%3AqSqqSqwww.merl.comqSqpeopleqSqviolaqSqresearchqSqpublicationsqSqICCV01-Viola-Jones.pdf

"By matching the 2D positions of these key features to a computer graphic image using a 3D face model, the applications estimate what direction the user is facing and the 3D positions of the face's 500 features."

Having seen a real-time morphable model demo from Toshiba at ICCV2003, this is probably a similar approach to this:

http://gravis.cs.unibas.ch/Sigg99.html

(My PhD thesis includes this area. It's not on my site yet, but I have a paper on MM fitting at the following link.)
http://www.robots.ox.ac.uk/~jamie/paterson03.html

Cheers.
Re:You WANT A Cell System... by TheRaven64 · 2005-11-30 00:22 · Score: 2, Informative

All of those features were introduced with the Pentium Pro, which was savaged at the time relative to the Pentium
The Pentium Pro ran Windows NT much faster than an equivalent speed Pentium. A lot of the old 16-bit instructions, however, were microcoded rather than being natively executed, and took a few clocks longer. Since much legacy code at the time (games, anything with win16 roots including Windows 95) made use of 16-bit instructions, it ran slower. Comparing Windows NT 4 on a 200MHz Pentium Pro and a 200MHz Pentium (which wasn't available for a few years), the Pentium Pro won hands down. By the time the Pentium II (i.e. Pentium Pro MMX) was released, everyone was running 32-bit apps - the only 16-bit apps left were so old that people didn't mind that they were slower than native ones, since they were still much faster than they had been on any CPU designed to run them.
The only differences between the Pentium Pro and the Pentium II were the addition of MMX, and the move of the cache from a separate die in the same package to a separate package on the same board, which allowed cache and CPU cores to be tested independently, improving yields.

--
I am TheRaven on Soylent News