IBM Full-System Simulator Team Speaks Out
Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM as codeword Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."
Does this mean we can emulate PS3? lol
I'm a moron. I should have read the link closer.
Running Linux on one of these things is simply INSANE.
I have been through a lot of chip transitions over the years and been impressed with the leaps each new generation has made.
But Cell is something entirely different. It is such a HUGE leap in performance beyond x86 systems that going back to an x86 machine is unthinkable for me now. I almost feel drunk from the power I have at my hands...
Read up all the Cell info you can at IBM's site and read the various patents IBM, Toshiba, and Sony have out there. And find some way to get your hands on one of these...
I can now see why the PS3 stuff we are seeing is so amazing...
Sure, the Cell is amazing, IF you are doing the right things. You say that you simply want to leave the old x86 architecture behind, but the truth of the matter is that the two do not even begin to compare.
It is not simply a matter of saying "OMG my cell has 8 cores at 4ghz". The main Power Processing Element is crippled at best for simple single threaded applications -- roughly equivalent to a PowerPC of the G3 era, but specifically with in-order execution. The SPEs (the other 8 cores) are essentially mini vector computers. They can perform a massive number of floating point calculations in parallel; however, they do not have an innate ability to deal well with all sorts of code, as a standard x86 CPU does.
The Cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications. -- Don't throw out your x86 just yet.
When the chip is fast enough that single threaded applications run fast enough, even if technically crippled, will it matter?
If Cell is what it claims to be, developers will write new applications to be multi-threaded. Compared to 15 years ago, multi-threading is a snap.
The Kruger Dunning explains most post on
Running Linux on one of these things is simply INSANE.
I almost feel drunk from the power I have at my hands
Here's some advice from someone who has access to a REAL CELL chip. I hate to disappoint you, but aside from custom libraries specifically optimized for CELL, Linux ain't going to run fast on this machine. All the generic open source code targeted towards the general CPU is going to run faster on a Dual-Core Intel or Dual-Proc/Dual-Core Mac. The actual CPUs in this machine are simply pipelined (think Pentium I level of optimizations) vs current gen CPUs (P4 has out-of-order execution, speculative execution, register renaming, branch prediction, etc). While simple C code runs at roughly the same speed, complicated C++ constructs are running 2-10X slower on CELL's simplified PowerPC core versus the G5s you'll find in a Mac.
Code needs to be rewritten specifically to take advantage of the actual SPEs/SPUs (Synergistic Processing Engines/Units - I prefer SPE since Sony calls their PS1/PS2 sound chip the SPU). Until those Linux libraries appear, CELL isn't going to run anything faster. Not to mention that it will have to be custom code libraries that DON'T run on the MAIN CPU, since the SPEs execute different machine code.
Yes, all of IBM's products are named like that. I mean, every now and again they try to go for something neat and spiffy sounding like "WebSphere", but then they have to munge it all up with "WebSphere Application Server" (WAS) and "WebSphere Client Technologies Mobile Edition" (WCTME) and so on and so forth. This is normal for IBM, and this is why they really need code-names.
A related story out of IBM from a distinguished engineer I once made the acquaintance of... He's walking along one day and runs into one of his boss's boss's bosses or something like that. So he says, "I know how we can win the war on drugs." He explains: "We make all drugs legal... and assign exclusive marketing rights to the OS/2 marketing team." Boss-dude tells him he's an asshole; he shrugs: "But you got my point."
The World Wide Web is dying. Soon, we shall have only the Internet.
I stand corrected. Here is a link to info about the cell based blade servers. One interesting thing to note is at the bottom of the page: "The OS used was Linux 2.6.11" So I guess that kinda disproves all the people saying Linux won't run well on the Cell.
I've been running the simulator here, and managed to port the distributed.net client to it. The performance of current cores in the PPE is so-so (worse than the G4 in my Mac Mini), although I'm sure it would improve with proper optimization. The SPE is a completely different matter though. I wrote an RC5-72 core for it that should achieve ~190 Mkeys/s on 8 SPEs at 3.2 GHz, which is by itself almost ten times faster than the current fastest processor (G5 at 2.7 GHz, which clocks at 20 Mkeys/s, IIRC). For embarrassingly parallel applications like key cracking, this thing is a dream.
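As a back-of-the-envelope check on those figures, here is a small Python sketch. The rates and clock speed are the ones quoted in this post; the variable names are made up for illustration, and the cycles-per-key number is just derived arithmetic, not a measurement.

```python
# Figures quoted in the post above (not independently measured).
cell_rate = 190e6      # keys/s estimated for 8 SPEs
num_spes = 8
spe_clock_hz = 3.2e9   # SPE clock assumed in the estimate
g5_rate = 20e6         # keys/s recalled for a 2.7 GHz G5

per_spe_rate = cell_rate / num_spes            # keys/s per SPE
cycles_per_key = spe_clock_hz / per_spe_rate   # clock cycles spent per key
speedup = cell_rate / g5_rate                  # Cell estimate vs. G5 figure

print(f"{per_spe_rate / 1e6:.2f} Mkeys/s per SPE")
print(f"{cycles_per_key:.1f} cycles per key")
print(f"{speedup:.1f}x the G5 figure")
```

So the estimate works out to roughly 135 clock cycles per key on each SPE, which is where the "almost ten times" comes from once eight of them run side by side.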
Some technical details: the SPE's instruction set could be thought of as "Altivec plus". It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec.
Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful for implementing techniques like software pipelining, which help mask these latencies.
The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about that, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to a traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
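To make the latency-masking point concrete, here is a toy Python model. It is a sketch under loud assumptions: a single-issue, in-order core (the real SPE is dual issue), one instruction type with the 4-cycle rotate latency quoted above, and hypothetical register names r0..r3. It is not software pipelining proper, just the simplest form of the same idea: interleave independent work, each stream in its own register, so that no instruction waits on a result still in flight.

```python
def finish_cycle(program, latency):
    """Cycle at which a single-issue, in-order core completes `program`.

    `program` is a list of (dest, src) register names; every instruction
    takes `latency` cycles and cannot issue before its source is ready.
    """
    ready = {}       # register -> cycle at which its value becomes available
    next_issue = 0   # earliest cycle the next instruction may issue
    finish = 0
    for dest, src in program:
        issue = max(next_issue, ready.get(src, 0))
        ready[dest] = issue + latency
        finish = max(finish, ready[dest])
        next_issue = issue + 1
    return finish

ROTATE_LATENCY = 4   # rotate latency quoted in the post
STEPS = 8            # dependent rotates per key (arbitrary choice)
KEYS = 4             # independent keys, each kept in its own register

# One key after another: each rotate waits for the previous one.
serial = [(f"r{k}", f"r{k}") for k in range(KEYS) for _ in range(STEPS)]
# Round-robin over the keys: independent work fills the latency slots.
interleaved = [(f"r{k}", f"r{k}") for _ in range(STEPS) for k in range(KEYS)]

serial_cycles = finish_cycle(serial, ROTATE_LATENCY)
interleaved_cycles = finish_cycle(interleaved, ROTATE_LATENCY)
print(serial_cycles, interleaved_cycles)
```

The same 32 rotates finish far sooner in the interleaved order because an instruction issues every cycle. Keeping several independent streams live at once costs registers, which is why a 128-entry register file matters here.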
Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.
Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/
Here is an impressive "virtual mirror" demo using the Cell processor, put on by Toshiba. Basically, using a video camera, it can make a 3D model of the person in front of the camera on the fly. Then it can manipulate the 3D model to change make-up, hair-styles, etc.; basically a virtual magic mirror. It really demonstrates the truly unique features these more powerful processors will offer.
http://techon.nikkeibp.co.jp/lsi/images/toshiba_cell.mpg
http://techon.nikkeibp.co.jp/english/NEWS_EN/20051013/109623/
The cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications.
...
This analysis is incorrect, because it fails to recognize the fixed point. By sacrificing the out-of-order (OOO) mechanisms (which are brutal for heat production) they gained enough thermal headroom to effectively double the clock rate. In the same thermal envelope, you either get an OOO processor running at 2GHz with three or four issue pathways (three has been the rule under x86) and a very deep pipeline, or you get a processor running at 4GHz with two issue pathways and a relatively short pipeline.
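For what it's worth, the peak numbers implied by those two design points are easy to tabulate. The Python sketch below uses only the clock rates and issue widths given above; the labels are made up, and peak issue rate (clock times width) says nothing about what either design sustains on real code.

```python
# Design points as described in the post above (illustrative, not measured).
designs = {
    "OOO, 2 GHz, 3-issue": (2.0, 3),
    "OOO, 2 GHz, 4-issue": (2.0, 4),
    "in-order, 4 GHz, 2-issue": (4.0, 2),
}

# Peak issue rate in billions of instructions per second: clock * width.
peak = {name: ghz * width for name, (ghz, width) in designs.items()}

for name, rate in peak.items():
    print(f"{name}: {rate:.0f} G instr/s peak")
```

On paper the narrow, fast design matches or beats the wide, slow one at peak; the real argument is over how close each gets to its peak.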
A deep pipeline grants (partial) immunity from stalls and bubbles. A short pipeline grants (partial) immunity from branch misprediction effects. To make the deep pipelines work well, huge investments are required in the branch-prediction unit, which is also infamous for throwing off a lot of heat.
The main Power Processing Element is crippled at best for simple single threaded applications
Fortunately for Cell, this is also the wrong denominator for use in this discussion. Applications might be single threaded, but systems are hardly ever single threaded. While the SPU processors handle audio, video, encryption, block I/O and other compute/bandwidth intensive primitives that most systems engage in, they also offload cache pollution from the main Cell processor threads, both in the data space and in the task scheduling space.
Nothing will ever best the Pentium IV for single thread peak performance with no calorie spared. News flash: Intel has already given up on this flawed approach. The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots.
OOO served its purpose in the era where cycle time was paramount and the processor to cache cycle time ratios were in closer balance. Now that heat has become the limiting factor, we'll be seeing a lot less of that from all parties.
The reality in silicon is that we need to start rethinking those portions of the code base which only perform well under an OOO execution regime.
This can be accomplished at so many different levels. The entire OpenSSL library can be recoded for SPU coprocessors with massive speed gains. Existing code can be recompiled with modern compilers which exploit large register sets to offset lack of hardware-level OOO. Key algorithms in system libraries can be recoded using better algorithms or memory access patterns.
Those of you who insist on putting all your eggs into one 100W single threaded basket, it's time to step off the Moore's law express train. Hope you enjoy the milk run.
Actually the 2GHz requirement is overstated. We (ich bin ein IBMer) have run the simulator on laptops in the 1GHz range without any problems. But don't let me ruin your excuse to get a nice new computer!
"The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots."
Not the case. Among other things, modern code is highly dependent on memory latency. P4 as of late hasn't even been getting 60% of clock; Opteron gets nearly 95%.
Your whole argument is why Intel developed the Itanium. The idea of producing a simpler CPU that is thermally more efficient is a novel one, but time and again we find that you can't erase the last 15 years of CPU innovation. We're still driving gasoline cars, we're still using paper money, and the Opteron still remains highly competitive with the Itanium at a fraction of the transistor count.
I do agree with your assessments of the value of non-OOO processors.
But there's one thing OOO does that these processors will never do: efficiently run code that was not properly scheduled.
Now, why would you generate code with the wrong scheduling? Well, you wouldn't do so on purpose. But in the field, PCs frequently encounter it: code that was scheduled for a different processor. As instruction latencies, CPU clocks and memory latencies change, the optimal instruction order changes.
So on any system which has to run legacy code, OOO is necessary to have good performance.
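A toy Python model shows the effect. The assumptions are mine: a single-issue, in-order core, one instruction type, hypothetical registers r0 and r1, and made-up latencies of 2 and 4 cycles for the "old" and "new" processor. The schedule alternates two dependency chains, which is exactly enough to hide a 2-cycle latency and not enough to hide a 4-cycle one.

```python
def stall_cycles(schedule, latency):
    """Count cycles an in-order, single-issue core spends stalled.

    `schedule` is the instruction order as a list of (dest, src) registers;
    an instruction stalls until its source register has been produced.
    """
    ready = {}       # register -> cycle at which its value becomes available
    next_issue = 0   # earliest cycle the next instruction may issue
    stalls = 0
    for dest, src in schedule:
        issue = max(next_issue, ready.get(src, 0))
        stalls += issue - next_issue
        ready[dest] = issue + latency
        next_issue = issue + 1
    return stalls

# A schedule tuned for 2-cycle latency: two dependency chains, alternated.
legacy_schedule = [(f"r{k}", f"r{k}") for _ in range(8) for k in range(2)]

stalls_old = stall_cycles(legacy_schedule, latency=2)  # the core it was tuned for
stalls_new = stall_cycles(legacy_schedule, latency=4)  # a core with longer latency
print(stalls_old, stalls_new)
```

The binary is identical in both runs; only the hardware latency changed. An OOO core is not modeled here, but its whole job is to find other ready instructions to fill those stall cycles at run time, without a recompile.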
And that means PCs are unlikely to go to non-OOO processors soon. No company wants to have to be afraid to release a new processor because it won't run existing versions of Windows (or Mac OS X) as well as older machines because it hasn't been recompiled with a new scheduling. Remember what happened to Pentium Pro? It didn't run legacy code well, and unfortunately the popular OS at the time (Windows 95) was all legacy code.
On the other hand, it makes total sense for a system like the PS3 or Xbox 360, where a large number of units are exactly the same, down to the RAM timings, and the code run on them was compiled specifically for them.
Additionally, to mix in other arguments, I agree the P IV could generate significant performance if it didn't run out of thermal headroom. You would need good caches and such, but despite what the other poster says, both Intel and AMD are affected similarly by memory latency and bandwidth issues. Perhaps AMD fares somewhat better, but not so much better that a P4 running at double its current clock rate wouldn't mop the floor with the AMD.
http://lkml.org/lkml/2005/8/20/95
The Pentium Pro ran Windows NT much faster than an equivalent speed Pentium. A lot of the old 16-bit instructions, however, were microcoded rather than being natively executed, and took a few clocks longer. Since much legacy code at the time (games, anything with win16 roots including Windows 95) made use of 16-bit instructions, it ran slower. Comparing Windows NT 4 on a 200MHz Pentium Pro and a 200MHz Pentium (which wasn't available for a few years), the Pentium Pro won hands down. By the time the Pentium II (i.e. Pentium Pro MMX) was released, everyone was running 32-bit apps - the only 16-bit apps left were so old that people didn't mind that they were slower than native ones, since they were still much faster than they had been on any CPU designed to run them.
The only differences between the Pentium Pro and the Pentium II were the addition of MMX, and the move of the cache from a separate die in the same package to a separate package on the same board, which allowed cache and CPU cores to be tested independently, improving yields.
I am TheRaven on Soylent News
As everyone seems to agree that running general-purpose code (e.g. Linux) on a Cell is going to be unpleasant thanks to the dumbing down of the PowerPC at the core, I was wondering what the odds are of seeing this as an add-on for doing vector-friendly operations. While I don't see people rushing out to install a Cell just for the hell of it, what are the chances that e.g. future crypto-offload accelerators or even 3D video cards might use one of these puppies?
Range Voting: preference intensity matters