A Look Into The Cell Architecture

Posted by ryuzaki0 on Saturday January 22, 2005 @03:40PM from the between-the-lines dept.

ball-lightning writes "This article attempts to decipher the patent filed by the STI group (IBM, Sony, and Toshiba) on their upcoming Cell technology (most notably going to be used in the PS3). If it's as good as this article claims, the Cell chip could eventually take over the PC market."

Merrimac streaming processor is like CELL by zymano · 2005-01-22 16:31 · Score: 2, Informative

Dally's Merrimac processor.

It's so similar that you wonder if they lifted it from him. The only difference is that Prof. Dally's chip has a big cache.
Consider a different approach by Space+cowboy · 2005-01-22 16:31 · Score: 4, Informative

All the programs that run on PC architectures expect certain things to be in place: they expect a single fast central CPU. They expect that good cache usage is important for performance. They expect to have access to gobs of RAM. Etc. Etc. The PS2 (and by extension the Cell) is completely different.

Consider a different architecture. You have a job that consists of multiple things to do. Some of these can be easily parallelised, others are mainly sequential. Divide it up so the parallel ones are coded separately, maybe with some IPC to synchronise to some clock.

For a sequential part (say rendering the object list of a scene back to front to gain occlusion) the approach that worked for me on the PS2 (which is logically similar, if significantly less powerful) was to divide the job into tasks. Each task (say, one per object in the above) gets its own bit of code and knows about the data that it needs to perform its task.

The key thing is that the Harvard separation of code and data just doesn't exist on a PS2. You set up a DMA chain that loads the program into the processor, then streams the data through the program on the processor; lather, rinse, repeat. Make the chain self-submitting and you can effectively forget about that chunk of code: it'll just happen.
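
To make the load-the-program-then-stream-the-data idea concrete, here is a rough Python sketch. It is not PS2 code: `run_chain`, the `(program, data)` task tuples and the `frames` count are all made-up names for illustration, and a plain queue stands in for the DMA controller.

```python
from collections import deque

def run_chain(chain, frames):
    """Run a self-resubmitting chain of (program, data) tasks.

    chain  -- list of (program, data) pairs; `program` is any callable
    frames -- how many full passes over the chain to simulate (hypothetical)
    """
    queue = deque(chain)
    results = []
    remaining = frames * len(chain)
    while remaining:
        program, data = queue.popleft()
        # "Upload" the program, then stream every data item through it.
        results.append([program(item) for item in data])
        # Self-submitting: the task puts itself back on the end of the chain.
        queue.append((program, data))
        remaining -= 1
    return results

chain = [
    (lambda x: x * 2, [1, 2, 3]),   # task 1: its own code plus its own data
    (lambda x: x + 1, [10]),        # task 2
]
print(run_chain(chain, frames=2))   # [[2, 4, 6], [11], [2, 4, 6], [11]]
```

Each task carries both its code and the data it needs, which is the only point the sketch is trying to make; real DMA chains obviously don't look like this.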

This is still doing things sequentially (but we've agreed that this is a sequential task, right?) - the point is that it's being done highly efficiently within the architectural constraints. You have a dataflow architecture and even sequential code can hit the performance limits if you code to the architecture.

The Cell looks even more powerful, in that you can chain execution modules together, so you can load code into APUs 1, 2, 3, 4 and stream the data through 1, 2, 3, 4 automatically before it's considered 'done'. This was possible on the PS2, but ... awkward. It'll keep the effective instructions/clock down because you're effectively pipelining your software... Nice idea.
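
A minimal sketch of that chaining, assuming nothing about the real hardware: Python generators stand in for the APUs, and `stage` and `pipeline` are hypothetical helper names. It only models the dataflow (each item passes through stage 1, 2, 3, 4 in order), not any actual parallel execution.

```python
def stage(fn, upstream):
    """One 'APU': apply its loaded code `fn` to every item that streams in."""
    for item in upstream:
        yield fn(item)

def pipeline(stages, data):
    """Chain the stages so each item flows through all of them in order."""
    stream = iter(data)
    for fn in stages:
        stream = stage(fn, stream)
    return stream

# Four stand-in programs, one per stage.
stages = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
]
print(list(pipeline(stages, [1, 2, 3])))   # [1, 9, 25]
```

An item only counts as 'done' once it has come out of the last stage, which is the behaviour described above.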

Simon

--
Physicists get Hadrons!
Re:Well, this could use some more reiteration... by hattig · 2005-01-22 16:37 · Score: 4, Informative
We will find out a whole lot more within the next fortnight: Cell is being described in a lot of detail at ISSCC 2005 in early February.
Paper Details:
- The Design and Implementation of a First-Generation CELL Processor (10.2)
- A Streaming Processing Unit for a CELL Processor (7.4)
- A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor (26.7)
- A Double-Precision Multiplier with Fine-Grained Clock-Gating Support for a First-Generation CELL Processor (20.3)
- Clocking and Circuit Design for a Parallel I/O on a First-Generation CELL Processor (28.9)
Re:Some Thoughts by David+Greene · 2005-01-22 17:11 · Score: 2, Informative

First of all I want to say I think it is completely possible to make a processor with 8 APUs and so forth.
Check.
For starters PowerPC chips already have several separate execution units on them, and I think they use fewer transistors than Intel chips.
Multiple function units on a chip is not the same thing as the 8 APUs of the Cell. First off, there's no indication whatsoever that this is a single-chip architecture. Even if it is a single-chip solution, the coupling of a superscalar's function units to the rest of the architecture is extremely strong.
The Cell architecture is much more loosely coupled, which could be both its greatest strength and biggest weakness. It's a very different kind of programming model. If the Cell designers really expect developers to code to the metal, they are in for a surprise. Even the most advanced HPC shops today (i.e. government labs, NSA, etc.) are sick of hand-optimizing code. That's why we have programs like DARPA's HPCS. The software component (compilers, debuggers, performance analyzers, etc.) is at least as important as the underlying speed of the hardware. Usability is king these days. For the Cell to compete in the HPC market, it must have parallelizing and multithreading compilers.
I find the claim that the Cell will work optimally in all configurations from the PDA to a networked cluster to be dubious at best and patently false at worst. The differences in network latency alone will require radically different software solutions in these two environments. Comparing a cluster of Cell computers using Ethernet (even 10G) to a Cray network is ludicrous. GFLOPS ain't the whole story. The Cell may dominate the Top 500 but that list is almost universally recognized by HPC experts as next to useless. It's great for marketing and gloating point numbers but I'd like to see how it does on HPC Challenge.
To get a feel for this, look at the HPC Challenge results and compare the Cray Alpha (T3E) to the Dalco Opteron (cluster). Then compare the Dalco to the Cray Opteron (XD1). Then compare the T3E to the Cray X1 and NEC SX-6. Then look at the clock speeds of all the machines.
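The latency point can be shown with a back-of-the-envelope model, time = latency + bytes / bandwidth. The sketch below is only that: `transfer_time` is a made-up helper, and the 50 microsecond and 2 microsecond latencies are hypothetical round numbers picked for illustration, not measurements of any Ethernet or Cray network.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Simple model: fixed per-message latency plus time on the wire."""
    return latency_s + num_bytes / bandwidth_bytes_per_s

BANDWIDTH = 1.25e9        # 10 Gbit/s expressed in bytes per second
HIGH_LATENCY = 50e-6      # hypothetical commodity network, seconds
LOW_LATENCY = 2e-6        # hypothetical low-latency interconnect, seconds

for size in (8, 100_000_000):   # one 8-byte word vs. a 100 MB block
    slow = transfer_time(size, HIGH_LATENCY, BANDWIDTH)
    fast = transfer_time(size, LOW_LATENCY, BANDWIDTH)
    print(f"{size} bytes: ratio slow/fast = {slow / fast:.2f}")
```

With identical bandwidth, the big block takes about the same time on both networks under these assumptions, while the tiny message is dominated by latency. Codes that exchange many small messages see that difference no matter what the peak GFLOPS number says.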
Moreover, a huge chunk of the transistor budget goes to doing things like cache consistency or complicated instruction prediction which is probably not used on the much simpler APUs.
Not true. By far the biggest chunk of the transistor budget goes to the cache itself. Predication is relatively cheap compared to full-out dynamic branch prediction. Cell has apparently eliminated the cache which would make room for lots of processing bits. However, I'll note that contrary to what is implied in the article, the newest Cray systems, including the vector machines, all have multiple levels of cache on them. Latency and locality do matter, even in large-scale vector codes.
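For anyone unsure of the distinction: predication means computing under a condition flag and selecting the result, instead of jumping and having hardware guess which way the jump goes. A toy Python illustration follows; `abs_branch` and `abs_predicated` are made-up names, and Python obviously says nothing about transistor cost, it only shows the shape of the two approaches.

```python
def abs_branch(x):
    """Branching version: control flow depends on the data."""
    if x < 0:
        return -x
    return x

def abs_predicated(x):
    """Predicated version: no branch, both candidates are computed
    and the predicate (0 or 1) selects which one survives."""
    neg = int(x < 0)
    return neg * (-x) + (1 - neg) * x

print([abs_predicated(v) for v in (-3, 0, 4)])   # [3, 0, 4]
```

The predicated form does a little redundant work but has nothing to predict, which is the trade being described.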
And finally, just because something is vector doesn't mean it's a vector supercomputer. There's a reason NEC and Cray blow SSE/Altivec out of the water and it's not just vector length. It's the whole package of vector ISAs designed for high performance codes (not just pushing polygons), enormous memory and network bandwidth and compilers that know how to make use of it.

Re:Steve Jobs, Vectors and OS X by Ohreally_factor · 2005-01-22 20:46 · Score: 2, Informative

And, lest we forget, Steve Jobs produced the Mach operating system for his NeXT Cubes.

Wrong. Jobs hired the guy who produced the Mach operating system at Carnegie Mellon, Avie Tevanian.

Tevanian started his professional career at Carnegie Mellon University, where he was a principal designer and engineer of the Mach operating system upon which NEXTSTEP is based.

Mach is the spiritual godfather of OS X

Not only that, it's the kernel!

I'm not sure what this has to do with anything, though. Are MKs especially well suited to this Cell architecture? Or are you just trying to play connect-the-dots? Hmmmm. Vectors. . . . Connect-the-dots. . .

--
It's not offtopic, dumbass. It's orthogonal.