The Potential of Science With the Cell Processor
prostoalex writes "High Performance Computing Newswire is running an article on a paper by computer scientists at the U.S. Department of Energy's Lawrence Berkeley National Laboratory. They have evaluated the Cell processor's performance in running several scientific application kernels, then compared this performance against other processor architectures. The full paper is available from the Computer Science department at Berkeley."
OS X is closed source. This means that it is the work of the devil - its purpose is to make the end users eat babies.
Linux is the only free OS. Yes, the BSD licenses may appear more free, but as they have no restrictions, they are actually less free than the GPL. You see, restricting the end user more actually makes them more free than not putting restrictions on them. You must be a dumb luser for not understanding this.
And you obviously don't have a real job. A real job involves being a student or professional academic. You see, academics are the ones who know all about productivity - if you work for a commercial organisation you obviously do not know anything about computers. Usability is stupid. What's wrong with the command line? If you can't use the command line then you shouldn't be using a computer. vi should be the standard word processor - you are such a luser if you want to use Word. Installing software should have to involve recompiling the kernel of the OS. If you don't know how to do this, you are a stupid luser who should RTFM. Or go to a Linux irc channel or newsgroup. After all, they are soooo friendly. If you don't know how the latest 2.6 kernel scheduling algorithm works then they will tell you to stop wasting their time, but they really are quite supportive.
Oh, and M$ is just as evil as Apple. Take LookOUT for instance. You could just as easily use Eudora. Who needs groupware anyway, a simple email client should be all we use (that's all we use as academics, why should businesses be any different).
And trend setters - Linux is the trend setter. It may appear KDE is a ripoff from XP, but that's because M$ stole the KDE code. We all know they have GPL'ed code hidden in there somewhere (but not the things that don't work, only the things that work could possibly have GPL'ed code in them).
And Apple is the suxor because they charge people for their product. We all know that it's a much better business model to give all your products away for free. If you charge for anything, then you are allied with M$ and will burn in hell.
The paper did a lot of hand-optimization, which is irrelevant to most programmers. What gcc -O3 does is way more important than what an assembly wizard can do for most projects.
Inventions have long since reached their limit, and I see no hope for further development.-- Frontinus, 1st cent. AD
"The paper did a lot of hand-optimization, which is irrelevant to most programmers."
But not to programmers who do science.
"What gcc -O3 does is way more important than what an assembly wizard can do for most projects."
Not an insurmountable problem.
I think you misunderstand what HPC actually is.
High performance computing is that which you'd want to throw a huge Beowulf cluster at, or possibly a supercomputer or twenty. Not three small pathetic cores.
Doesn't the Cell's design mean that it can very easily scale up, without requiring any changes in the software? Just add more computing CPUs (SPEs they are called, I think?) and the Cell runs faster without changing your software.
I'm not entirely sure of this, can someone corroborate/disprove?
Send email from the afterlife! Write your e-will at Dead Man's Switch.
An interesting point is that most consoles sell their hardware at a loss. At least the XBox does. This means that there is no guarantee that IBM is willing to sell their CPUs at the same price that one would believe they cost for the PS3.
Moreover, the scientific community is very likely to push their Cell+ architecture and I'm sure IBM would be more than happy to help... For a massive price.
So, when building an HPC system, you're likely to work around the best architecture (the more expensive cell+), and purchasers of the HPC will then have a cray-like proprietary system at enormous cost.
Not that this is a bad thing, I just don't believe this "low cost" "high volume" statement.
-Michael
Isn't Cell similar to things like QCDOC? From what my LQCD colleagues tell me, QCDOC is based on PowerPC, but are there similarities in the wider architecture, interconnects, etc.? Are there any plans to use Cell here?
Except neither of those links points to anything that proves the Cell is good for High Performance Computing, which is the point of the article. This isn't anything to do with 360 vs PS3. If MS wanted to design a CPU that could be scaled up for HPC they would have done so; instead they just got IBM to customise a PPC chip for their games console because their goal is dominance in the living room, not to become the next Intel.
To be honest I question the validity of this study anyway, I seem to recall lots of papers proclaiming the PS2's so-called "Emotion Engine" as the future of super computing and that never happened either. This is probably more hype paid for by Sony to make people believe the PS3 will be the second coming.
Plus if you actually watch that whole interview with Carmack you linked to, he says the only advantage of the PS3 hardware is peak performance, which if it's anything like the PS2 will be limited by memory bandwidth. And everything I've seen of the PS3's RSX suggests it's just an nVidia 7800 GTX, which means the 360 should have the advantage graphically. With the PS3 having more CPU power but the 360 having more polygon power I suspect we'll end up with fairly similar looking games.
IF IBM was the maker of the chip they would most certainly not sell them at a loss. Why should they? Sony might sell the console at a loss to recoup the loss from game sales but IBM has no way to recoup any losses.
Then again IBM is in a partnership with Sony and Toshiba so the chip is probably owned by this partnership and Sony will just be making the chips it needs itself.
So any idea that IBM is selling Cells at a loss is insane.
Then the cost of the PS3 is mostly claimed to be in the Blu-ray drive tech. Not going to be of much interest to a science setup, is it? Even if they want to use a Blu-ray drive they need just 1 in a 1000 Cell rig. Not going to break the bank.
No, the Cell will be cheap because when you run an order of millions of identical CPUs prices drop rapidly. There might even be a very real market for cheap Cells. Regular CPUs always have lesser quality versions. Not a problem for an Intel or AMD who just badge them Celeron or whatever, but you can't do that with a console processor. All Cell processors destined for the PS3 must be of similar spec.
So what to do with a Cell chip that has one of the cores defective? Throw it away OR rebadge it and sell it for blade servers? That is where Celerons come from (defective cache).
We already know that the Cell processor is going to be sold for purposes other than the PS3. IBM has a line of blade servers coming up that will use the Cell.
No, I am afraid that it will be perfectly possible to buy Cells and they will be sold at a profit just like any other CPU. Nothing special about it. They will however benefit greatly from the fact that they already have a large customer lined up. Regular CPUs need to recover their costs as quickly as possible because their success is uncertain. This is why regular top end CPUs are so fucking expensive. But the Cell already has an order for millions, meaning the costs can be spread out in advance over all those units.
MMO Quests are like orgasms:
You may solo them, I prefer them in a group.
"We also conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multi-core processors."
Whether or not HPC is something you'd want to throw 20 or more supercomputers at in a Beowulf cluster, at least you know that the PS3 is really the only next-generation video game system because nobody concerned with raw performance and power efficiency would want to use the Xbox 2 in a HPC environment.
Yes, but the Cell is designed to process data in independent packages which are scheduled and sent to processors by the central unit, it's not a traditional multiprocessor system. Hmm, I guess that from the specs the processors could be communicating via the network instead of just buses as well, which would make what you say correct. I guess we should wait and see.
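The "independent packages handed out by a central unit" idea can be sketched roughly as follows. This is only an illustration of the program structure, not Cell code: `process_package`, `run`, the package size and the worker counts are all made-up names and values, and ordinary Python threads stand in for SPEs (so no real speed-up should be expected from it). What it shows is that when the packages share no state, the same program gives the same answer whether 1, 2 or 8 workers are available.

```python
from concurrent.futures import ThreadPoolExecutor


def process_package(package):
    """Stand-in compute kernel: sum of squares over one independent package."""
    return sum(x * x for x in package)


def run(data, package_size, workers):
    # Split the input into independent packages that share no state.
    packages = [data[i:i + package_size]
                for i in range(0, len(data), package_size)]
    # The pool plays the role of the central unit: it hands each package
    # to whichever worker is free next.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_package, packages))
    return sum(partials)


data = list(range(10_000))
# Same code, different numbers of workers (all values are arbitrary).
results = {workers: run(data, 256, workers) for workers in (1, 2, 8)}
print(results)
```

Whether real Cell software scales this transparently depends on how the work is partitioned and how much the packages need to communicate, which this sketch deliberately leaves out.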
Sounds like this cpu would end up having great folding performance. I so hope the PS3 ends up being hackable and we get to throw Linux on it ;-)
at least you know that the PS3 is really the only next-generation video game system because nobody concerned with raw performance and power efficiency would want to use the Xbox 2 in a HPC environment.
Not quite. What they're saying is that the Cell is better suited to parallel applications, like physics simulations, and that it is more scalable - i.e., easier to build supercomputers or distributed computing nodes from.
However, that has no bearing upon what 'generation' the host console is - largely because a console has a pre-determined number of chips installed, and cannot be scaled without breaking its own specification. Remember, the fact that there are exactly n cores in a console is what makes that console a stable development platform (as opposed to the PC, where performance is different on each unit).
You *could* argue that the console is using more modular technology, but that on its own doesn't tell you anything about overall performance, ease of development, stability, robustness, or any of the other metrics that you can really apply to a console. If 'older' technology can be used to provide those same metrics in a home console, then which is better simply becomes an issue of cost. If the older gear does the same job, but is cheaper to produce, then it is the better alternative from everything but a marketing standpoint. Expandability of the hardware in other platforms does not affect the quality of the platform in question.
The fact is that most scientists use high-level software (MATLAB, Femlab, ...) to do their simulations. Although these scientists may be interested in any potential speed-up to their workflow, they are not willing to invest any of their time in translating their whole codebase to asm-optimized C. Thus, the ball is in the hands of software developers, not scientists.
I'm jack's useless sig
FTA: While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors,
The Cell processor may be faster, but how easy is it to implement an optimizing development system that eliminates the need to hand-optimize the code? Is programming productivity not just as important as performance? I suspect that the Cell's design is not as elegant (from a programmer's POV) as it could have been, only because it was not designed with an elegant software model in mind. I don't think it is a good idea to design a software model around a CPU. It is much wiser to design the CPU around an established model. In this vein, I don't see the Cell as a truly revolutionary processor because, like every other processor in existence, it is optimized for the algorithmic software model. A truly innovative design would have embraced a non-algorithmic, reactive, synchronous model, thereby killing two birds with one stone: solving the current software reliability crisis while leaving other processors in the dust in terms of performance. One man's opinion.
All MP machines have: communication channels, and processors. If the designers envisioned it being used a certain way and optimized it for that, well, what of it? Maybe that's how the standard game API does things but, it's still processors and communication channels. It's more than likely you can get better performance out of it by adapting your problem for it specifically, minimizing communication and keeping processors busy as much as the problem allows, same as for all other MP systems.
Someday we'll all be negroes
x86, the commodity, has registers from the days when RAM was faster than the CPU (ie 8-bit days)
The tacked on FPU, MMX, SSE SIMD stuff whilst welcome still leaves few registers for program use
The PowerPC on the other hand has a nice collection of regs, and as good if not better SIMD. The Cell goes a big step further
More regs = more variables in the CPU = higher bandwidth of calculation
be they regular regs or SIMD regs.
That plus the way it handles cache
Could be a pig to program without the right kind of compiler optimizing
Would that mean game developers using FORTRAN 95?
Over the last several decades, there have been lots of parallel architectures, many significantly more innovative and powerful than Cell. If Cell succeeds, it's not because of any innovation, but because it contains fairly little innovation and therefore doesn't require people to change their code too much.
One thing that Cell has that previous processors didn't is that the PS3 tie-in and IBM's backing may convince people that it's going to be around for a while; most previous efforts suffered from the problem that nobody wanted to invest time in adapting their code to an architecture that was not going to be around in a few years anyway.
I thought the Cell's performance was mediocre if you only had a single task going on at a time. Given that scientific simulations aren't real time, the work doesn't need to be hugely multithreaded, as it's better for each tick/frame/etc. of the simulation to be done one after the other.
Did Sony pay you, or did Mr. Kutaragi come over to your house and type it for you?
Have you seriously never seen anything like this before? As a professional ps2/360/ps3 developer I have to say that I was seriously underwhelmed by this demo. Every one of the effects has been used before. The original xbox has every effect he mentioned. And HL2 has a significantly more complex lighting system and postprocessing effects.
The demo appears to be a single high-poly character in a texture mapped box. The demoer admits that this is a cut-scene quality model. I believe this scene could be rendered on an original xbox with similar 'visual' quality. Why not use some of those polys to make a realistic background? Black on PS2 looked better. And they couldn't even show a solid second of actual gameplay.
I think it will be an amazing game, but the demo was no technical achievement. It was a hurried render test for an obviously incomplete engine. Bragging about poly count when your competition can push 1.5x-3x as many is not going to win them any points either.
Regards,
----- 70% of all statistics are completely made up.
(a) lend themselves extremely well to optimisation
(b) lend themselves extremely well to incorporation in subroutine libraries
(c) tend to isolate the most compute-intensive low-level operations used in scientific computation
SGEMM
If you read the article, you will find (among others) a reference to an operation called "SGEMM". This stands for Single precision General Matrix Multiplication. This is the sort of routine that makes up the BLAS library (Basic Linear Algebra Subprograms) (see e.g. http://www.netlib.org/blas/). High performance computation typically starts with creating optimised implementations of the BLAS routines (if necessary handcoded at assembler level), sparse-matrix equivalents of them, Fast Fourier routines, and the LAPACK library.
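For readers who have not met it, the operation SGEMM computes is C := alpha*A*B + beta*C. A naive reference version, written here only to show what is being computed, might look like the sketch below. The function name `gemm` and the list-of-rows storage are choices made for this example; the real BLAS routine also takes transpose flags and leading dimensions, works in single precision, and is of course nothing like this slow (Python floats are double precision).

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    a is m x k, b is k x n, c is m x n. This is the textbook triple loop,
    i.e. the unoptimised baseline that tuned BLAS libraries improve on.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Start from the scaled copy of c, then accumulate the product into it.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            aip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += aip * b[p][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 0.5, c)
print(result)  # [[38.5, 44.5], [86.5, 100.5]]
```

Nearly all the time in such a routine goes into the innermost multiply-add, which is why it is such an attractive target for hand tuning.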
ATLAS
There is a general movement away from optimised assembly language coding for the BLAS, as embodied in the ATLAS software package (Automatically Tuned Linear Algebra Software; see e.g. http://math-atlas.sourceforge.net/). The ATLAS package provides the BLAS routines but produces fairly optimal code on any machine using nothing but ordinary compilers. How? If you run the makefile for the ATLAS package, it may take about 12 hours or so to compile (depending on your computer of course; this is a typical number for a PC). In this time the makefile will simply run through multiple switches for the BLAS routines and run test suites for all its routines for varying problem sizes. Then it picks the best combination of switches for each routine and each problem size for the machine architecture on which it's being run. In particular it takes account of the size of the caches. That's why it produces much faster subroutine libraries than those produced by simply compiling e.g. the BLAS routines with an -O3 optimisation switch thrown in.
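The "try the variants, time them, keep the winner" idea can be illustrated with a toy example. To be clear about the assumptions: this is not how ATLAS itself is implemented, the names `blocked_matmul` and `autotune` are invented here, and the only tuning knob is the tile size of a blocked matrix multiply. In pure Python the interpreter overhead hides most cache effects, so the winning block size means little; the point is only the shape of the empirical search.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of rows) using tiles of size `block`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_c, row_a = c[i], a[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=2):
    """Time each candidate block size on this machine and return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


# Problem size and candidate block sizes are arbitrary illustration values.
best_block, timings = autotune(48, [4, 8, 16, 48])
print(best_block, timings)
```

A real autotuner searches a far larger space (unrolling, instruction scheduling variants, per-routine and per-size choices), which is where the hours of build time go.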
Specially tuned versus automatic?: MATLAB
The question is of course: who wins? Specially tuned code or automatic optimisation? This can be illustrated with the example of the well-known MATLAB package. Perhaps you have used MATLAB on PCs, and wondered why its matrix and vector operations are so fast? That's because for Intel and AMD processors it uses a special vendor-optimised subroutine library (see http://www.mathworks.com/access/helpdesk/help/techdoc/rn/r14sp1_v7_0_1_math.html). For SUN machines, it uses SUN's optimised subroutine library. For other processors (for which there are no optimised libraries) MATLAB uses the ATLAS routines. Despite the great progress and portability that the ATLAS library provides, carefully optimised libraries can still beat it (see the Intel Math Kernel Library at http://www.intel.com/cd/software/products/asmo-na/eng/266858.htm).
Summary
In summary:
-large tracts of Scientific computation depend on optimised subroutine libraries
-hand-crafted assembly-language optimisation can still outperform machine-optimised code.
Therefore the objections that the hand-crafted routines described in the article distort the comparison or are not representative of real-world performance are invalid.
However ... it's so expensive and difficult that you only ever want to do it if you absolutely must. For scientific computation this typically means that you only consider handcrafting "inner loop primitives" such as the BLAS routines, FFT's, SPARSEPACK routines etc. for this treatment, and that you just don't attempt to do that yourself.
By the end of the article, I was looking for their idea of a hypothetical best-case pony.
So, that means that the Cell in its current design is 14/8 = 1.75 times slower for double precision than an Opteron/Itanium is for single precision. I searched around but couldn't find a good answer on the ratio between an Opteron/Itanium's single and double precision performance. If it's actually just 50% slower (as I think it is) then the Cell is still slower (currently 75%).
So, does anyone know for sure what the ratio is between an Opteron/Itanium's single and double precision performance?
"I don't mind God, it's his fan club I can't stand!" E8
Cell has 8 vector processors and something like a PPC to "control" all of them; it's built specifically for FP operations. It's like comparing a GPU with a CPU, which doesn't make much sense.
Check if this was sponsored by the same marketing team that was running ads that kept peddling the lackluster g4 as a supercomputer on the national watchlist.
Did you mean Fermilab, or am I not keeping up with scientific progress? :-)
I love how they manage to completely ignore all the other vector-type architectures already in the market, and just compare it to Intel/AMD which are not even designed for floating point performance.
;)
Scream "my computer beats your abacus" all you want.
But then it is from Berkeley, so that's normal.
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
The authors discuss hand tuning and assembler coding for Cell, but not necessarily for the other processors. Their 2D FFT results, for example, are a factor of 10 slower than others I have seen. Also, for the IA64 and Opteron, the performance of many of these numerical kernels is highly dependent on the compiler used. The IA64 especially is very sensitive to compiler optimization to keep the 6 pipeline slots busy and also generate memory prefetch instructions at the right time to prevent stalling. As often seems to occur in these sorts of HPC comparisons, they spend a lot of time hand optimizing for a particular platform, and compare it to other platforms that have not necessarily received the equivalent effort. As has been noted above, how much time you have to spend developing, debugging, and tuning a code matters a lot. This is particularly true for research codes. Finally, who uses single precision for scientific computing anymore? Any field that I am aware of that would use large FFTs, large linear algebra solvers, etc. requires at least double precision to get anything meaningful.
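The precision point is easy to see on a small scale. The sketch below is an illustration only: the helper names `to_single` and `accumulate` and the choice of adding 0.1 a hundred thousand times are made up for this example, and single precision is emulated by rounding through the standard `struct` module after every addition. It compares the accumulated error of the same running sum in single and in double precision.

```python
import struct


def to_single(x):
    """Round a Python float (double precision) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single):
    """Add `step` to a running total `count` times, optionally in single precision."""
    total = 0.0
    if single:
        step = to_single(step)
    for _ in range(count):
        total += step
        if single:
            # Emulate a single-precision accumulator by rounding every result.
            total = to_single(total)
    return total


count = 100_000
exact = 10000.0  # the mathematically exact value of 0.1 * 100000
err_single = abs(accumulate(0.1, count, single=True) - exact)
err_double = abs(accumulate(0.1, count, single=False) - exact)
print("single precision error:", err_single)
print("double precision error:", err_double)
```

Long accumulations of this kind sit inside large FFTs and linear solvers, which is one reason single-precision benchmark numbers do not automatically carry over to those workloads.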
What are you running your renderer on now? Or is this power lust? You'll pay a heavy price, especially in your time. I regret doing this myself.
I will try to clear up a little of your confusion.
> You assumed that the MGS4 trailer was pre-rendered cutscene,
> that obviously shows that you have little knowledge of the PlayStation
> and MGS. MGS has NEVER used pre-rendered cutscenes.. blah blah blah
I never said it was prerendered. You simply misunderstood the way these things work. In-game cut scenes use different models than the regular game. That is because the artists need more detailed control of the animations. The models can be much more complex because artists can focus on the elements used in that specific cut-scene.
Therefore, even when rendered in-game, cutscenes are a bad estimate of actual gameplay experience. This is why you so often see xbox cutscenes in commercials rather than actual gameplay. Sure, it is rendered in real time, but it will always show the best possible quality.
> The original Xbox CANNOT produce similar quality as MGS4. Snake's
> hair alone would cause the original Xbox to be at its limitation.
60,000 polys for hair alone! My GOD, call the Nobel prize committee! Why would you even want to waste this many polys on something that could be done with similar quality in 5k polys? What is so spectacular about 60k? The XBOX could do this at its native resolution without too much difficulty. It's not an impressive number; even the ps2 could do it, although you would only be able to render hair and nothing else.
> Finally, where did you hear that the Xbox360 can push 3x more polygons
> than the PS3? Your ass? You are NOT a developer, and it is obvious from
> your lack of knowledge in the subject.
Well I didn't say 3x. It depends on what you are rendering. But the simplest limitation is the clock speed and the number of pipelines. I am not saying PS3 is worse, since it can do a lot more shader ops per second (3x as many). But it can only do them on half as many polys at a lower clock speed. This is all academic anyway since total performance is a combination of many things. But I deal with 400k poly models every day, and I just wasn't impressed by the demo.
The Emotion Engine was the future of HPC, the Cell is simply an extension of ideas and concepts tested out with the EE.
If the chip runs fast with some hand-optimization, then it will get done. Just follow the money. Sheesh!
I don't trust atoms -- they make up stuff.
However, the stencil code and SpMV kernels were actually coded up and simulated for the paper. They were then run (exact same code) on real hardware (a 2.1GHz prototype machine) and those results were presented at the EDGE workshop last week. The hardware performance was pretty close to the simulator (the more computationally bound the kernel, the more accurate the simulator).
The X1E MSP is certainly a vector processor, and we ran the same kernels on it and presented them in the paper. It would certainly not be considered a commodity processor though. We wanted a nice sample set of architectures: superscalar, VLIW, and vector.
So if somebody writes a couple of dozen standard routines that crank the number-crunching part of the Cell processor well, and there's a halfway-adequate compiler for the conventional-processing side, you can still get a big win from a small budget.
I did a lot of scientific-style programming on VAXes in the early-mid 80s, and my iPod Shuffle has more CPU, more disk-equivalent, faster I/O bus, and probably more RAM (? not sure, but all the non-shuffle versions do.) Our applications sped up by 2 orders of magnitude once we could get enough RAM :-)
Bill Stewart