IBM Full-System Simulator Team Speaks Out
Shell writes "The IBM Full-System Simulator for the Cell Broadband Engine (Cell BE) processor, known inside IBM by the code name Mambo, is a key component of the newly posted offerings on alphaWorks. Meet some of the members of the team that pulled it together, and hear about the simulator in their own words."
Does this mean we can emulate PS3? lol
Running Linux on one of these things is simply INSANE.
I have been through a lot of chip transitions over the years and been impressed with the leaps each new generation has made.
But Cell is something entirely different. It is such a HUGE leap in performance beyond x86 systems that going back to an x86 machine is unthinkable for me now. I almost feel drunk from the power I have at my hands...
Read up all the Cell info you can at IBM's site and read the various patents IBM, Toshiba, and Sony have out there. And find some way to get your hands on one of these...
I can now see why the PS3 stuff we are seeing is so amazing...
I thought mambo was just a generic powerpc machine emulator. Not the cell...
When the hardware is fast enough that single-threaded applications run acceptably, even if technically crippled, will it matter?
If Cell is what it claims to be, developers will write new applications as multi-threaded ones. Compared to 15 years ago, multi-threading is a snap.
The Kruger Dunning explains most post on
It's great that we keep hearing about these things and we know they're out there with some great PS3 demos but all of this comes down to the point that I'm tired of hearing about them until I can turn on my cell workstation. The news I want is the workstation release!
Mambo is the name of an open-source CMS: http://www.mamboserver.com./ You would think these guys would get out on the net and do a little research before naming a product.
...that 256KB local store for each SPU looks like a pretty severe bottleneck. You'll have to limit your executing code and data to this window, otherwise you'll take a severe penalty on every fetch from main memory. The PPU isn't much to brag about in comparison to a modern G4 or G5, so your task damn well better make use of those SPUs or performance will seriously suck in comparison to a modern CPU. So, it looks to me like this thing will be amazing for lots of small jobs, like several tiny Monte Carlo sims each running in an SPU. But for real data analysis, it's going to depend on the project requirements, which could easily demand more than 256KB of local store. Then you're SOL....
Would love to read some folks post on how they plan to use the broadband interconnect to chain code and data for solving larger problems, and what limitations they see in this arch. --M
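One common way around the 256KB limit raised above is to stream a larger data set through the local store in fixed-size pieces. The following is a minimal Python sketch of that idea only, not Cell code: `process_in_chunks`, `CHUNK_BYTES`, and the byte-sum kernel are hypothetical names chosen for illustration, and the slice copy merely stands in for a DMA transfer from main memory.

```python
LOCAL_STORE_BYTES = 256 * 1024  # per-SPU local store size quoted above
CHUNK_BYTES = 64 * 1024         # assumed working-buffer size; leaves room for code and other buffers


def process_in_chunks(data, kernel, chunk_bytes=CHUNK_BYTES):
    """Run `kernel` over `data` one chunk at a time and add up the partial results."""
    assert chunk_bytes <= LOCAL_STORE_BYTES, "a chunk must fit in the local store"
    total = 0
    for offset in range(0, len(data), chunk_bytes):
        # The slice copies the bytes, standing in for a DMA transfer into local store.
        chunk = data[offset:offset + chunk_bytes]
        total += kernel(chunk)
    return total


# 1 MiB of input, four times the size of the local store.
data = bytes(range(256)) * 4096
checksum = process_in_chunks(data, sum)  # byte-sum as a stand-in kernel
print(checksum == sum(data))
```

This only works directly for jobs whose result can be combined from per-chunk partial results. A real SPE program would typically also overlap the next transfer with the current computation (double buffering), which this sketch does not model.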
I've been running the simulator here, and managed to port the distributed.net client to it. The performance of current cores on the PPE is so-so (worse than the G4 in my Mac Mini), although I'm sure it would improve with proper optimization. The SPE is a completely different matter though. I wrote an RC5-72 core for it that should achieve ~190 Mkeys/s on 8 SPEs at 3.2 GHz, which is by itself almost ten times faster than the current fastest processor (G5 at 2.7 GHz, which clocks at 20 Mkeys/s, IIRC). For embarrassingly parallel applications like key cracking, this thing is a dream.
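The figures quoted above can be turned into a rough per-key cycle budget. This is just arithmetic on the numbers in the post (the G5 figure is the poster's from-memory value), and the variable names are made up for the example.

```python
spes = 8                  # SPEs used, from the post
clock_hz = 3.2e9          # SPE clock, from the post
cell_keys_per_s = 190e6   # estimated RC5-72 rate on 8 SPEs, from the post
g5_keys_per_s = 20e6      # G5 at 2.7 GHz, quoted "IIRC" in the post

per_spe_keys_per_s = cell_keys_per_s / spes
cycles_per_key = clock_hz / per_spe_keys_per_s   # cycles one SPE spends per key
speedup = cell_keys_per_s / g5_keys_per_s

print(f"{per_spe_keys_per_s / 1e6:.2f} Mkeys/s per SPE")
print(f"{cycles_per_key:.1f} cycles per key on each SPE")
print(f"{speedup:.1f}x the quoted G5 rate")
```

So the estimate works out to a little under 24 Mkeys/s per SPE, roughly 135 cycles per key, and 9.5 times the quoted G5 rate.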
Some technical details: the SPE's instruction set could be thought of as `Altivec plus'. It has most of the functionality of Altivec (so far I've only missed a byte addition instruction), but quite a few improvements, like immediate operands for many instructions, immediate loads with much better range than Altivec's splat instruction, the addition of double precision floating point operations, etc. I'm sure there are more improvements, but these are the ones I noticed from my limited experience with Altivec.
Instruction scheduling for this processor is remarkably similar to that of the first Pentium: it's dual issue with static scheduling, there are some conditions on pairable instructions and their ordering to ensure dual issue, and so on. The high latencies for instructions (2 for most integer arithmetic, 4 for shifts and rotates) are problematic, but the huge register file of 128 entries is very helpful for implementing techniques like software pipelining, which help mask these latencies.
The local store is a mixed bag -- dealing with arrays larger than the local store should be challenging, but if you don't have to worry about that, it's great to have a fixed latency of 6 cycles for loads and stores, no need to worry about cache effects and so on. Actually, the local store behaves a lot like a programmer-addressable cache, which has some benefits compared to a traditional cache: specifically, less control overhead per memory cell (so more logic can be packed in the same space) and, as a consequence, the potential for higher speeds and/or smaller latencies.
Overall, I'm very impressed with Cell, but for now I've only programmed toy examples and I'm sure to hit some limits of the architecture once I start looking at real-world code.
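To make the latency-masking point concrete, here is a toy model in Python. It assumes a single-issue, in-order machine (the real SPE is dual-issue, so this is a simplification) with the latencies quoted above: 2 cycles for adds, 4 for rotates. The function and variable names are invented for the sketch. It compares running four dependent rotate/add chains one after another against interleaving them, which is the kind of reordering a large register file makes room for.

```python
LATENCY = {"add": 2, "rot": 4}  # cycle counts quoted in the post


def run_in_order(program, latency=LATENCY):
    """Cycle at which the last result is ready on a single-issue, in-order machine."""
    ready = {}    # register -> cycle at which its newest value is available
    cycle = 0     # earliest cycle in which the next instruction may issue
    finish = 0
    for dest, op, srcs in program:
        # Stall until every source operand has been produced.
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = issue + latency[op]
        finish = max(finish, ready[dest])
        cycle = issue + 1
    return finish


def stream(k, steps):
    """A dependent rot/add chain that keeps updating register x<k>."""
    reg = f"x{k}"
    return [(reg, op, [reg]) for _ in range(steps) for op in ("rot", "add")]


STREAMS, STEPS = 4, 3
chains = [stream(k, STEPS) for k in range(STREAMS)]

# Same instructions, two orderings.
naive = [ins for chain in chains for ins in chain]              # finish one stream, then the next
interleaved = [ins for group in zip(*chains) for ins in group]  # round-robin across streams

print("one stream at a time:   ", run_in_order(naive), "cycles")
print("four streams interleaved:", run_in_order(interleaved), "cycles")
```

In this model the interleaved order never stalls, because by the time a stream's next instruction comes around its previous result is already available. Each extra independent stream needs its own registers, which is where the 128-entry register file pays off.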
Join the NFSNET. Our prime goal is making little numbers out of big ones. http://www.nfsnet.org/
Mambo is the name of a clothing brand as well. The company is run by Reg Mombassa who was in an Australian band called Mental As Anything. In retrospect it sounds apt. The whole six degrees of separation thing works! You'd have to be "mental as anything" to think that cell is going to make x86 obsolete so easily.
Here is an impressive "virtual mirror" demo using the Cell processor, put on by Toshiba. Basically, using a video camera, it can make a 3D model of the person in front of the camera on the fly. Then it can manipulate the 3D model to change make-up, hair-styles, etc.: basically a virtual magic mirror. It really demonstrates the kind of unique features these more powerful processors will offer.
http://techon.nikkeibp.co.jp/lsi/images/toshiba_cell.mpg
http://techon.nikkeibp.co.jp/english/NEWS_EN/20051013/109623/
I don't own a machine that meets the simulator's minimum system requirement (namely, 2.0GHz or higher), but I'm so curious about it, that I'm willing to buy a new box just to try Mambo with CBE sim. So, what hardware platform is best for the simulator software?
I put my balls on your face. WHILE YOU SLEEP!
The highest-clocked K8 is probably your best bet; a 3.8GHz Pentium 4 probably wouldn't be bad either.
The CBE from IBM is based on SIM.
It has SLBs and TLBs for the PPEs, and SPE for modelling on the EIB.
STI uses an API and TCL for creating SPE or SPU RTEs on AIX and PS3.
. ..
Take the reefer out of your mouth and put it down, turn off the Footloose DVD, step away from the google, and put both of your hands where the men in little white coats can see them.
The term "speaks out" has connotations, like revealing a dirty secret, which doesn't seem to be the case here. I think it would be prudent to choose one's headlines a little more carefully.
/sorry
A-Bomb
I do agree with your assessments of the value of non-OOO processors.
But there's one thing OOO does that these processors will never do. That is efficiently run code that was not properly scheduled.
Now, why would you generate code with the wrong scheduling? Well, you wouldn't do so on purpose. But in the field PCs frequently encounter it. This code is code that was scheduled for a different processor. As instruction latencies, CPU clocks and memory latencies change the optimal instruction order changes.
So on any system which has to run legacy code, OOO is necessary to have good performance.
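The argument above can be illustrated with a deliberately simplified Python model. Everything here is an assumption made for the sketch: a single-issue machine, one generic instruction type, a 16-entry reorder window, and code written in single-assignment form so that no register renaming is needed. The program is scheduled for an "old" latency of 2 cycles and then run unchanged on a "new" machine where the same instruction takes 4.

```python
def in_order_cycles(program, latency):
    """Issue strictly in program order, stalling on operands that are not ready."""
    ready, cycle, finish = {}, 0, 0
    for dest, op, srcs in program:
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = issue + latency[op]
        finish = max(finish, ready[dest])
        cycle = issue + 1
    return finish


def out_of_order_cycles(program, latency, window=16):
    """Each cycle, issue the oldest instruction in the window whose operands are ready."""
    dests = {dest for dest, _, _ in program}   # values produced inside the program
    ready, cycle, finish = {}, 0, 0
    pending = list(program)
    while pending:
        for i, (dest, op, srcs) in enumerate(pending[:window]):
            # Program inputs are always ready; produced values must have arrived.
            if all(s not in dests or ready.get(s, float("inf")) <= cycle for s in srcs):
                ready[dest] = cycle + latency[op]
                finish = max(finish, ready[dest])
                del pending[i]
                break
        cycle += 1
    return finish


def chain(name, length):
    """Dependent chain name0 -> name1 -> ... in single-assignment form."""
    srcs = [f"{name}_in"] + [f"{name}{i}" for i in range(length - 1)]
    return [(f"{name}{i}", "op", [srcs[i]]) for i in range(length)]


def interleave(x, y):
    return [ins for pair in zip(x, y) for ins in pair]


# Scheduled for latency 2: interleaving two chains at a time is enough to avoid stalls.
program = (interleave(chain("a", 4), chain("b", 4))
           + interleave(chain("c", 4), chain("d", 4)))
OLD, NEW = {"op": 2}, {"op": 4}

print("in-order, old latency:    ", in_order_cycles(program, OLD))
print("in-order, new latency:    ", in_order_cycles(program, NEW))
print("out-of-order, new latency:", out_of_order_cycles(program, NEW))
```

In this model the stale schedule stalls badly on the in-order machine once the latency doubles, while the out-of-order machine pulls work from the later chains and recovers most of the loss without a recompile.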
And that means PCs are unlikely to go to non-OOO processors soon. No company wants to have to be afraid to release a new processor because it won't run existing versions of Windows (or Mac OS X) as well as older machines because it hasn't been recompiled with a new scheduling. Remember what happened to Pentium Pro? It didn't run legacy code well, and unfortunately the popular OS at the time (Windows 95) was all legacy code.
On the other hand, it makes total sense for a system like PS3 or Xbox 360, where a large number of units are exactly the same, down to the RAM timings, and the code run on them was compiled specifically for that hardware.
Additionally, to mix in other arguments, I agree P IV could generate significant performance if it didn't run out of thermal headroom. You would need good caches and such, but despite what the other poster says, both Intel and AMD are affected similarly by memory latency and bandwidth issues. Perhaps AMD fares somewhat better. But not so much better that if the P4 were running at double its current clock rate it wouldn't mop the floor with the AMD.
http://lkml.org/lkml/2005/8/20/95
Leave it up to the Japanese to come up with this one.
So what, I can't buy a Cell processor anyway. Let them first sort out all the details like main system/board architecture, compilers, and software, and then we will see.
I think IBM still has a long way to go. What's their timeline on this one, 1 light year?
And as for the PS3, it still isn't out, or is it?
To me this is still a lot of shouting (about nothing), but no real substance that is usable to me, like a complete computer based on Cell (no PS3 doesn't count).
My 2 cents,
M
FP
it still takes 15 seconds to open Adobe Acro... Oh, nevermind. -M
"This story has nothing to do with mactel"
When an IBM story, or a SORT OF PowerPC story (read: zealots), like this comes along, you have to concentrate hard not to think about Mactel.
That is my personal point of view, and I am kind of embarrassed that the whole Mac community became Intel zealots overnight.
As everyone seems to agree that running general-purpose code (e.g. Linux) on a Cell is going to be unpleasant thanks to the dumbing down of the PowerPC at the core, I was wondering what the odds are of seeing this as an add-on for doing vector-friendly operations. While I don't see people rushing out to install a Cell just for the hell of it, what are the chances that e.g. future crypto-offload accelerators or even 3D video cards might use one of these puppies?
Range Voting: preference intensity matters
Except unlike P IV, AMD's chips were designed properly.
P IV was designed to run at 6GHz or something. And gate-delay wise, they could probably do it with minimal changes. Except then it produces so much heat from transistor switching that it can't be cooled properly.
AMD's chips however, were designed to run at the speeds they are running at. To make them go 4.4GHz would require redesigning them. But yes, they would also be much faster at those speeds.
So, the argument could be made for AMD, but it's not as valid.
Now, despite all this, AMD's design is the better one: the chip can reach its potential. P IV really cannot.
I'm sure AMD's new DDR2 chips will be very fast, as their current DDR ones are.
AMDs are definitely the price performance leader in single-core right now. In double core, Intel is faster per dollar in the low-end config. But despite this, my current machine is an AMD A64 X2 4200+. I love it, works great, real fast, not too much heat. My previous machine was a 3.0GHz/800FSB (Northwood) P4, and it was fast too (though significantly less so), and ran a heck of a lot less hot than my previous machine, an Athlon XP 1700+, despite being a lot faster.
http://lkml.org/lkml/2005/8/20/95