The Potential of Science With the Cell Processor

Posted by ryuzaki0 on Saturday May 27, 2006 @11:35PM from the making-a-station-play dept.

prostoalex writes "High Performance Computing Newswire is running an article on a paper by computer scientists at the U.S. Department of Energy's Lawrence Berkeley National Laboratory. They have evaluated the Cell processor's performance in running several scientific application kernels, then compared this performance against other processor architectures. The full paper is available from the Computer Science department at Berkeley."

What about the programmer? by Anonymous Coward · 2006-05-27 23:50 · Score: 5, Insightful

"The paper did a lot of hand-optimization, which is irrelevant to most programmers. "

But not to programmers who do science.

"What gcc -O3 does is way more important than what an assembly wizard can do for most projects."

Not an unsurmountable problem.

Re:What about the compiler? by Anonymous Coward · 2006-05-27 23:55 · Score: 5, Insightful

Hand optimization _is_ relevant to scientific programmers.

Re:What about the compiler? by TommyBear · 2006-05-28 00:07 · Score: 5, Insightful

Hand optimizing code is what I do as a game developer and I can assure you that it is very relevant to my job.

Re:What about the compiler? by samkass · 2006-05-28 02:08 · Score: 5, Insightful

What seems to be more important than that is:

"According to the authors, the current implementation of Cell is most often noted for its extremely high performance single-precision (32-bit) floating performance, but the majority of scientific applications require double precision (64-bit). Although Cell's peak double precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double precision performance."

So the Cell is great because millions of them are going to be sold in PS3s, so they'll be cheap. But it's only really great if a new custom variant is built. Sounds kind of contradictory.
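
The 14.6 Gflop/s figure in the quote can be reproduced with a little arithmetic. The sketch below assumes, from published descriptions of the Cell rather than from anything in this thread, that each SPE can issue one 2-wide double-precision fused multiply-add every 7 cycles, and one 4-wide single-precision fused multiply-add every cycle. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope peak throughput for the Cell's SPEs.
# ASSUMPTIONS (not stated in the thread): the SIMD widths and issue
# intervals below come from published Cell descriptions.

CLOCK_HZ = 3.2e9          # from the quote: 3.2GHz
NUM_SPES = 8              # from the quote: eight SPEs
FLOPS_PER_FMA = 2         # a fused multiply-add counts as two flops


def peak_gflops(clock_hz, units, simd_width, issue_interval_cycles):
    """Peak Gflop/s if every unit issues one SIMD FMA per issue interval."""
    flops_per_cycle = simd_width * FLOPS_PER_FMA / issue_interval_cycles
    return clock_hz * units * flops_per_cycle / 1e9


# Double precision: 2-wide SIMD, assumed one instruction every 7 cycles.
dp_peak = peak_gflops(CLOCK_HZ, NUM_SPES, simd_width=2, issue_interval_cycles=7)

# Single precision: 4-wide SIMD, assumed one instruction every cycle.
sp_peak = peak_gflops(CLOCK_HZ, NUM_SPES, simd_width=4, issue_interval_cycles=1)

print(f"double precision peak: {dp_peak:.1f} Gflop/s")
print(f"single precision peak: {sp_peak:.1f} Gflop/s")
```

Under those assumptions the double-precision peak works out to about one fourteenth of the single-precision peak, which is the gap the proposed Cell+ changes are aimed at.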

--
E pluribus unum

No, this is why we have subroutine libraries by golodh · 2006-05-28 03:26 · Score: 5, Interesting

Although I agree with your point that crafting optimised assembly language routines is way beyond most users (and indeed a waste of time for all but an expert) there are certain "standard operations" that
(a) lend themselves extremely well to optimisation
(b) lend themselves extremely well to incorporation in subroutine libraries
(c) tend to isolate the most compute-intensive low-level operations used in scientific computation
SGEMM
If you read the article, you will find (among others) a reference to an operation called "SGEMM". This stands for Single precision General Matrix Multiplication. This is the sort of routine that makes up the BLAS library (Basic Linear Algebra Subprograms) (see e.g. http://www.netlib.org/blas/). High performance computation typically starts with creating optimised implementations of the BLAS routines (if necessary handcoded at assembler level), sparse-matrix equivalents of them, Fast Fourier Transform routines, and the LAPACK library.
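
For readers who have not met it, the operation that the BLAS GEMM routines compute is C := alpha*A*B + beta*C. The following is a minimal reference sketch in plain Python, not the BLAS interface itself: the name gemm_reference and the list-of-lists layout are assumptions for illustration, and Python floats are double precision, so this shows the operation rather than true 32-bit arithmetic.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists.

    a is m x k, b is k x n, c is m x n. This is an unoptimised
    triple loop; tuned libraries compute the same result much faster.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must be m x n")

    # Start from beta * c, then accumulate alpha * a * b into it.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            scaled = alpha * a[i][p]
            for j in range(n):
                out[i][j] += scaled * b[p][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm_reference(1.0, a, b, 2.0, c)
print(result)
```

The innermost loop is the part that an optimised library restructures for caches and vector units, which is why this one routine gets so much attention.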
ATLAS
There is a general movement away from optimised assembly language coding for the BLAS, as embodied in the ATLAS software package (Automatically Tuned Linear Algebra Software; see e.g. http://math-atlas.sourceforge.net/). The ATLAS package provides the BLAS routines but produces fairly optimal code on any machine using nothing but ordinary compilers. How? If you run the makefile for the ATLAS package, it may take about 12 hours to build (depending on your computer of course; this is a typical number for a PC). In this time the makefile simply runs through multiple compiler switches for the BLAS routines and runs test suites for all its routines at varying problem sizes. It then picks the best combination of switches for each routine and each problem size on the machine architecture where it is being run. In particular it takes account of the size of the caches. That's why it produces much faster subroutine libraries than those produced by simply compiling e.g. the BLAS routines with an -O3 optimisation switch thrown in.
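
The empirical search described above can be sketched in a few lines. This is only a toy illustration of the idea (time several variants on the actual machine and keep the fastest), not how ATLAS itself is implemented: the real package searches over generated C code and compiler flags, whereas this sketch tunes a single block size for a blocked matrix multiply. The names matmul_blocked and autotune are hypothetical.

```python
import time


def matmul_blocked(a, b, block):
    """Multiply square matrices a and b, visiting them in block x block tiles."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time each candidate block size on this machine and return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(a, b, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


candidates = [4, 8, 16, 48]
best_block, timings = autotune(n=48, candidates=candidates)
print("fastest block size on this machine:", best_block)
```

Which block size wins depends on the machine, which is the whole point: the choice is measured rather than predicted. In interpreted Python the differences are small and noisy; in compiled code they track the cache sizes much more closely.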
Specially tuned versus automatic?: MATLAB
The question is of course: who wins? Specially tuned code or automatic optimisation? This can be illustrated with the example of the well-known MATLAB package. Perhaps you have used MATLAB on PCs and wondered why its matrix and vector operations are so fast? That's because for Intel and AMD processors it uses a special vendor-optimised subroutine library (see http://www.mathworks.com/access/helpdesk/help/techdoc/rn/r14sp1_v7_0_1_math.html). For SUN machines, it uses SUN's optimised subroutine library. For other processors (for which there are no optimised libraries) MATLAB uses the ATLAS routines. Despite the great progress and portability that the ATLAS library provides, carefully optimised libraries can still beat it (see the Intel Math Kernel Library at http://www.intel.com/cd/software/products/asmo-na/eng/266858.htm).
Summary
In summary:
- large tracts of scientific computation depend on optimised subroutine libraries
- hand-crafted assembly-language optimisation can still outperform machine-optimised code.
Therefore the objections that the hand-crafted routines described in the article distort the comparison or are not representative of real-world performance are invalid.
However ... hand-crafting is so expensive and difficult that you only ever want to do it if you absolutely must. For scientific computation this typically means that you only consider "inner loop primitives" such as the BLAS routines, FFTs, SPARSEPACK routines etc. for this treatment, and that you just don't attempt to do it yourself.

Ran simulations, not code by jmichaelg · 2006-05-28 03:41 · Score: 5, Insightful

Lest anyone think they actually ran "several scientific application kernels" on the Cell/AMD/Intel chips, what they did was run simulations of several different tasks such as FFT and matrix multiplication. Since they didn't run the code, they had to guess at some parameters like DMA overhead. They also came up with a couple of hypothetical Cell processors that dispatch double precision instructions differently from how the Cell actually does it, and presented those results as well. They also said that IBM ran some prototype hardware that came within 2% of their simulation results, though they didn't say which hypothetical Cell the prototype hardware was implementing.
By the end of the article, I was looking for their idea of a hypothetical best-case pony.