IBM Full-System Simulator Team Speaks Out
Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM as codeword Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."
Running Linux on one of these things is simply INSANE.
I have been through a lot of chip transitions over the years and been impressed with the leaps each new generation has made.
But Cell is something entirely different. It is such a HUGE leap in performance beyond x86 systems that going back to an x86 machine is unthinkable for me now. I almost feel drunk from the power I have at my hands...
Read up all the Cell info you can at IBM's site and read the various patents IBM, Toshiba, and Sony have out there. And find some way to get your hands on one of these...
I can now see why the PS3 stuff we are seeing is so amazing...
Sure, the Cell is amazing, IF you are doing the right things. You say that you simply want to leave the old x86 architecture behind, but the truth of the matter is that the two do not even begin to compare.
It is not simply a matter of saying "OMG my Cell has 8 cores at 4 GHz". The main Power Processing Element is crippled at best for simple single threaded applications -- roughly equivalent to a PowerPC of the G3 era, but specifically with in-order execution. The SPEs (the other 8 cores) are essentially mini vector computers. They can perform a massive number of floating point calculations in parallel; however, they do not have the innate ability to deal well with all sorts of code that a standard x86 CPU has.
The Cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications. -- Don't throw out your x86 just yet.
When the chip is fast enough that single threaded applications run fast enough, even if technically crippled, will it matter?
If Cell is what it claims to be, developers will write new applications as multi-threaded applications. Compared to 15 years ago, multi-threading is a snap.
The Cell designers have completely sacrificed instruction level parallelism in exchange for thread level parallelism. It is certainly a valid and interesting way to achieve speed, but not for single threaded applications.
...
This analysis is incorrect, because it fails to recognize the fixed point. By sacrificing the out-of-order (OOO) mechanisms (which are brutal for heat production) they gained enough thermal headroom to effectively double the clock rate. In the same thermal envelope, you either get an OOO processor running at 2GHz with three or four issue pathways (three has been the rule under x86) and a very deep pipeline, or you get a processor running at 4GHz with two issue pathways and a relatively short pipeline.
A deep pipeline grants (partial) immunity from stalls and bubbles. A short pipeline grants (partial) immunity from branch misprediction effects. To make the deep pipelines work well, huge investments are required in the branch-prediction unit, which is also infamous for throwing off a lot of heat.
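To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The clock rates and issue widths are the ones quoted above; the pipeline depths, branch fraction, and misprediction rate are made-up illustrative values, and the flush-penalty model is a textbook simplification that ignores the stalls an in-order core takes on cache misses and dependencies. Treat it as arithmetic, not a benchmark.

```python
def peak_issue_rate(clock_ghz, issue_width):
    """Upper bound on throughput, in billions of instructions per second."""
    return clock_ghz * issue_width


def effective_rate(clock_ghz, issue_width, pipeline_depth,
                   branch_fraction, mispredict_rate):
    """Throughput under a simple model where every mispredicted branch
    costs one full pipeline refill (pipeline_depth cycles)."""
    base_cpi = 1.0 / issue_width
    penalty_cpi = branch_fraction * mispredict_rate * pipeline_depth
    return clock_ghz / (base_cpi + penalty_cpi)


# Clock and width come from the post; depths and branch statistics
# are hypothetical values chosen only for illustration.
wide_deep = effective_rate(2.0, 3, pipeline_depth=30,
                           branch_fraction=0.2, mispredict_rate=0.05)
narrow_short = effective_rate(4.0, 2, pipeline_depth=20,
                              branch_fraction=0.2, mispredict_rate=0.05)

print("peak, 2GHz x 3-issue:", peak_issue_rate(2.0, 3))
print("peak, 2GHz x 4-issue:", peak_issue_rate(2.0, 4))
print("peak, 4GHz x 2-issue:", peak_issue_rate(4.0, 2))
print("with flush penalty, wide/deep:   ", round(wide_deep, 2))
print("with flush penalty, narrow/short:", round(narrow_short, 2))
```

On peak numbers alone the 4GHz two-issue design matches a 2GHz four-issue one and beats a three-issue one; how the flush penalty changes the picture depends entirely on the assumed depths and branch behavior.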
The main Power Processing Element is crippled at best for simple single threaded applications
Fortunately for Cell, this is also the wrong denominator for use in this discussion. Applications might be single threaded, but systems are hardly ever single threaded. While the SPU processors handle audio, video, encryption, block I/O and other compute/bandwidth intensive primitives that most systems engage in, they also offload cache pollution from the main Cell processor threads, both in the data space and in the task scheduling space.
Nothing will ever best the Pentium IV for single thread peak performance with no calorie spared. News flash: Intel has already given up on this flawed approach. The Pentium IV could easily beat the Opteron by cranking itself up to 6GHz if there was any practical way to extract 200W from a small core with no hot spots.
OOO served its purpose in the era where cycle time was paramount and the processor to cache cycle time ratios were in closer balance. Now that heat has become the limiting factor, we'll be seeing a lot less of that from all parties.
The reality in silicon is that we need to start rethinking those portions of the code base which only perform well under an OOO execution regime.
This can be accomplished at so many different levels. The entire OpenSSL library can be recoded for SPU coprocessors with massive speed gains. Existing code can be recompiled with modern compilers which exploit large register sets to offset lack of hardware-level OOO. Key algorithms in system libraries can be recoded using better algorithms or memory access patterns.
Those of you who insist on putting all your eggs into one 100W single threaded basket, it's time to step off the Moore's law express train. Hope you enjoy the milk run.
I do agree with your assessments of the value of non-OOO processors.
But there's one thing OOO does that these processors will never do: efficiently run code that was not properly scheduled.
Now, why would you generate code with the wrong scheduling? Well, you wouldn't do so on purpose. But in the field, PCs frequently encounter it: code that was scheduled for a different processor. As instruction latencies, CPU clocks and memory latencies change, the optimal instruction order changes.
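A toy illustration of that point, assuming a hypothetical single-issue in-order core where an instruction cannot issue until its source registers are ready. The instruction names, register names, and latencies are invented for the example and do not describe any real processor.

```python
def in_order_cycles(program, latency):
    """Cycle at which the last result is ready on a single-issue,
    in-order core that stalls until every source operand is available."""
    ready = {}    # register -> cycle at which its value becomes available
    cycle = 0     # earliest cycle at which the next instruction may issue
    finish = 0
    for op, dest, srcs in program:
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        done = issue + latency[op]
        ready[dest] = done
        cycle = issue + 1       # in-order: later instructions wait behind us
        finish = max(finish, done)
    return finish


# The same four instructions in two different orders.
naive = [("load", "a", []), ("add", "b", ["a"]),
         ("load", "c", []), ("add", "d", ["c"])]
scheduled = [("load", "a", []), ("load", "c", []),
             ("add", "b", ["a"]), ("add", "d", ["c"])]

slow_loads = {"load": 3, "add": 1}   # hypothetical processor A
fast_loads = {"load": 1, "add": 1}   # hypothetical processor B

print(in_order_cycles(naive, slow_loads))      # 8
print(in_order_cycles(scheduled, slow_loads))  # 5
print(in_order_cycles(naive, fast_loads))      # 4
print(in_order_cycles(scheduled, fast_loads))  # 4
```

With one-cycle loads the two orders are equally good, so a compiler targeting processor B has no reason to prefer either. Run the naive order on processor A and the in-order core eats the stalls, which is exactly the reordering an OOO core would do in hardware.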
So on any system which has to run legacy code, OOO is necessary to have good performance.
And that means PCs are unlikely to go to non-OOO processors soon. No company wants to have to be afraid to release a new processor because it won't run existing versions of Windows (or Mac OS X) as well as older machines because it hasn't been recompiled with a new scheduling. Remember what happened to Pentium Pro? It didn't run legacy code well, and unfortunately the popular OS at the time (Windows 95) was all legacy code.
On the other hand, it makes total sense for a system like the PS3 or Xbox 360, where there are a large number of units that are exactly the same, down to the RAM timings, and the code run on them was compiled specifically for that hardware.
Additionally, to mix in other arguments, I agree the P IV could generate significant performance if it didn't run out of thermal headroom. You would need good caches and such, but despite what the other poster says, both Intel and AMD are affected similarly by memory latency and bandwidth issues. Perhaps AMD fares somewhat better. But not so much better that the P4, running at double its current clock rate, wouldn't mop the floor with the AMD.
http://lkml.org/lkml/2005/8/20/95