IBM Full-System Simulator Team Speaks Out
Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM as codeword Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."
Yes and no.
While this "simulator" is basically an emulation of the Cell hardware, it won't allow people to run games at full speed. It's more of a developer tool that allows programmers to start coding for the PS3 before they actually have the hardware. Still, it is reasonable to believe that emulation of the PS3 will be viable in the future (although not for a long time).
I'm a moron. I should have read the link closer.
There is virtually zero chance that any x86 system will ever be able to emulate even the first-generation Cell chip that is in the PS3 and in the server products from IBM and other companies that are starting to show up now.
First, neither Intel nor AMD will be shipping anything that even comes close to the ~256 Gflops (and whatever the integer performance number is) of the latest version of the Broadband Engine.
Second, x86 chips will never be able to emulate the internal ring bus in Cell chips. The killer ring bus inside the chip is really the key to the crazy performance people are getting out of Cell systems.
Intel and AMD pretty much have nothing but slapping additional cores together for the next decade on their roadmaps. And even if they could finally manage to get enough of their x86 cores onto one chip with the same amount of computational performance years from now, they will have nothing like the internal ring bus.
In other words, don't hold your breath waiting to emulate PS3 games on any x86 system...ever.
Running Linux on one of these things is simply INSANE.
I almost feel drunk from the power I have at my hands.
Here's some advice from someone who has access to a REAL CELL chip. I hate to disappoint you, but aside from custom libraries specifically optimized for CELL, Linux ain't going to run fast on this machine. All the generic open source code targeted towards the general CPU is going to run faster on a Dual-Core Intel or Dual-Proc/Dual-Core Mac. The actual CPUs in this machine are simply pipelined (think Pentium I level of optimizations) vs current-gen CPUs (the P4 has out-of-order execution, speculative execution, register renaming, branch prediction, etc). While simple C code runs at roughly the same speed, complicated C++ constructs are running 2-10X slower on CELL's simplified PowerPC core versus the G5s you'll find in a Mac.
Code needs to be rewritten specifically to take advantage of the actual SPEs/SPUs (Synergistic Processor Elements/Units - I prefer SPE since Sony calls their PS1/PS2 sound chip the SPU). Until those Linux libraries appear, CELL isn't going to run anything faster. Not to mention that it will have to be custom code libraries that DON'T run on the MAIN CPU, since the SPEs execute different machine code.
I stand corrected. Here is a link to info about the Cell-based blade servers. One interesting thing to note is at the bottom of the page: "The OS used was Linux 2.6.11". So I guess that kinda disproves all the people saying Linux won't run well on the Cell.
I've been running the simulator here, and managed to port the distributed.net client to it. The performance of current cores in the PPE is so-so (worse than the G4 in my Mac Mini), although I'm sure it would improve with proper optimization. The SPE is a completely different matter though. I wrote an RC5-72 core for it that should achieve ~190 Mkeys/s on 8 SPEs at 3.2 GHz, which is by itself almost ten times faster than the current fastest processor (G5 at 2.7 GHz, which clocks at 20 Mkeys/s, IIRC). For embarrassingly parallel applications like key cracking, this thing is a dream.
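For anyone who wants to sanity-check those numbers, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted in the comment above (which are projections from the simulator and a G5 rate quoted from memory, not measurements); the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the RC5-72 figures quoted above.
# All inputs are the poster's estimates, not measurements.

CELL_TOTAL_MKEYS = 190.0   # projected Mkeys/s across all SPEs
NUM_SPES = 8
SPE_CLOCK_HZ = 3.2e9       # 3.2 GHz
G5_MKEYS = 20.0            # G5 at 2.7 GHz, quoted from memory ("IIRC")


def per_spe_mkeys(total_mkeys: float, num_spes: int) -> float:
    """Key rate of a single SPE, assuming the work splits evenly."""
    return total_mkeys / num_spes


def cycles_per_key(clock_hz: float, mkeys_per_sec: float) -> float:
    """Average clock cycles one core spends per key tested."""
    return clock_hz / (mkeys_per_sec * 1e6)


def speedup(new_mkeys: float, old_mkeys: float) -> float:
    """Ratio of two key rates."""
    return new_mkeys / old_mkeys


if __name__ == "__main__":
    rate = per_spe_mkeys(CELL_TOTAL_MKEYS, NUM_SPES)
    print(f"per-SPE rate:   {rate:.2f} Mkeys/s")
    print(f"cycles per key: {cycles_per_key(SPE_CLOCK_HZ, rate):.1f}")
    print(f"speedup vs G5:  {speedup(CELL_TOTAL_MKEYS, G5_MKEYS):.1f}x")
```

With those inputs it works out to 23.75 Mkeys/s per SPE, roughly 135 cycles per key, and a 9.5x advantage over the quoted G5 rate, which matches the "almost ten times" claim.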
Some technical details: the SPE's instruction set could be thought of as "Altivec plus". It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec. Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful for implementing techniques like software pipelining, which help mask these latencies. The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about that, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to a traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
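To illustrate why a big register file helps mask those latencies, here is a toy model in Python. It is a deliberately simplified sketch, not a model of the real SPE: it assumes a single-issue, in-order machine (the real thing is dual issue with pairing rules), every instruction has the same latency, and each instruction depends only on the previous one in its own chain. The function name is hypothetical. Interleaving several independent chains is the basic idea behind software pipelining and unrolling, and each extra chain needs its own registers.

```python
def cycles_in_order(num_chains: int, ops_per_chain: int, latency: int) -> int:
    """Cycles to finish `num_chains` independent dependency chains,
    interleaved round-robin, on a toy in-order machine that issues
    at most one instruction per cycle."""
    ready = [0] * num_chains   # cycle when each chain's last result is available
    cycle = 0                  # next free issue slot
    for _ in range(ops_per_chain):
        for c in range(num_chains):
            issue = max(cycle, ready[c])   # stall until the operand is ready
            ready[c] = issue + latency
            cycle = issue + 1
    return max(ready)


if __name__ == "__main__":
    LATENCY = 4   # the rotate/shift latency quoted above
    OPS = 100
    for chains in (1, 2, 4):
        total = cycles_in_order(chains, OPS, LATENCY)
        print(f"{chains} chain(s): {chains * OPS} ops in {total} cycles "
              f"({chains * OPS / total:.2f} ops/cycle)")
```

In this model a single chain of 4-cycle operations manages only 0.25 ops/cycle, while four interleaved chains keep the issue slot busy almost every cycle. For a rotate-heavy kernel like RC5 that presumably means working on several keys at once, which is where 128 registers come in handy.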
Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.
Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/
Actually the 2GHz requirement is overstated. We (ich bin ein IBMer) have run the simulator on laptops in the 1GHz range without any problems. But don't let me ruin your excuse to get a nice new computer!
Apologies for A/C. This is probably a little less than a full 3D model construction. Having seen a real-time demo of a "morphable model", they almost certainly use priors on face shape.
"First, the applications capture a user's face with a camera and detect the position of key features of the face, including the eyes, nose and mouth, using image recognition technology."
this can be done real time quite effectively right now:
http://citeseer.ist.psu.edu/rd/95418640%2C476373%2C1%2C0.25%2CDownload/http%3AqSqqSqwww.merl.comqSqpeopleqSqviolaqSqresearchqSqpublicationsqSqICCV01-Viola-Jones.pdf
"By matching the 2D positions of these key features to a computer graphic image using a 3D face model, the applications estimate what direction the user is facing and the 3D positions of the face's 500 features."
Having seen a real-time morphable model demo from Toshiba at ICCV 2003, I'd say this is probably a similar approach to this:
http://gravis.cs.unibas.ch/Sigg99.html
(My PhD thesis includes this area. It's not on my site yet, but I have a paper on MM fitting at the link below.)
http://www.robots.ox.ac.uk/~jamie/paterson03.html
Cheers.
The Pentium Pro ran Windows NT much faster than an equivalent speed Pentium. A lot of the old 16-bit instructions, however, were microcoded rather than being natively executed, and took a few clocks longer. Since much legacy code at the time (games, anything with win16 roots including Windows 95) made use of 16-bit instructions, it ran slower. Comparing Windows NT 4 on a 200MHz Pentium Pro and a 200MHz Pentium (which wasn't available for a few years), the Pentium Pro won hands down. By the time the Pentium II (i.e. Pentium Pro MMX) was released, everyone was running 32-bit apps - the only 16-bit apps left were so old that people didn't mind that they were slower than native ones, since they were still much faster than they had been on any CPU designed to run them.
The only differences between the Pentium Pro and the Pentium II were the addition of MMX, and the move of the cache from a separate die in the same package to a separate package on the same board, which allowed cache and CPU cores to be tested independently, improving yields.
I am TheRaven on Soylent News