A Look Into The Cell Architecture
ball-lightning writes "This article attempts to decipher the patent filed by the STI group (IBM, Sony, and Toshiba) on their upcoming Cell technology (most notably going to be used in the PS3). If it's as good as this article claims, the Cell chip could eventually take over the PC market."
Dally's Merrimac processor.
It's so similar that you wonder if they lifted it from him. The only difference is that Prof. Dally's chip has a big cache.
All the programs that run on PC architectures expect certain things to be in place - they expect a single fast central CPU. They expect that good cache usage is important for performance. They expect to have access to gobs of RAM. Etc. Etc. The PS2 (and by extension the Cell) is completely different.
Consider a different architecture. You have a job that consists of multiple things to do. Some of these can be easily parallelised, others are mainly sequential. Divide it up so the parallel ones are coded separately, maybe with some IPC to synchronise to some clock.
For a sequential part (say rendering the object list of a scene back to front to gain occlusion) the approach that worked for me on the PS2 (which is logically similar, if significantly less powerful) was to divide the job into tasks. Each task (say, one per object in the above) gets its own bit of code and knows about the data that it needs to perform its task.
The key thing is that the Harvard separation of code and data just isn't, on a PS2. You set up a DMA chain that loads the program into the processor, then streams the data through the program on the processor, lather, rinse, repeat. Make the chain self-submitting and you can effectively forget about that chunk of code now, it'll just happen.
This is still doing things sequentially (but we've agreed that this is a sequential task, right?) - the point is that it's being done highly efficiently within the architectural constraints. You have a dataflow architecture and even sequential code can hit the performance limits if you code to the architecture.
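To make the load-program-then-stream-data idea concrete, here's a minimal sketch in Python. It is not PS2 or Cell code: Task, Processor and run_chain are made-up names, a plain function stands in for the uploaded program, and a list stands in for the DMA'd data. The passes argument is only a crude stand-in for a self-submitting chain.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    """One link in the chain (hypothetical): a program plus the data it needs."""
    program: Callable[[int], int]  # stands in for code uploaded to the processor
    data: List[int]                # stands in for the data streamed through it


@dataclass
class Processor:
    """Toy processor that holds exactly one program at a time."""
    loaded: Callable[[int], int] = None
    output: List[int] = field(default_factory=list)

    def load(self, program: Callable[[int], int]) -> None:
        # Replace whatever program was resident before.
        self.loaded = program

    def stream(self, data: List[int]) -> None:
        # Push every item through the currently loaded program.
        for item in data:
            self.output.append(self.loaded(item))


def run_chain(chain: List[Task], processor: Processor, passes: int = 1) -> List[int]:
    """Walk the chain in order: load program, stream data, move on.

    passes > 1 re-queues the same chain, imitating a chain that submits itself.
    """
    for _ in range(passes):
        for task in chain:
            processor.load(task.program)
            processor.stream(task.data)
    return processor.output


chain = [
    Task(program=lambda x: x * 2, data=[1, 2, 3]),
    Task(program=lambda x: x + 10, data=[4, 5]),
]
result = run_chain(chain, Processor())
print(result)
```

The tasks still run one after another, which is the point: the sequencing lives in the chain, not in a central loop that has to babysit each step.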
The Cell looks even more powerful, in that you can chain execution modules together, so you can load code into APUs 1, 2, 3 and 4 and stream the data through 1, 2, 3 and 4 automatically before it's considered 'done'. This was possible on the PS2, but
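A rough sketch of that chaining, again in Python and again with made-up names (apu, chain_apus): each stage is a generator holding one program, and the output of one stage is the input stream of the next. The four stage functions are arbitrary placeholders, and nothing here models real APU local stores or DMA.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[int], int]


def apu(program: Stage, upstream: Iterable[int]) -> Iterator[int]:
    """One hypothetical execution module: apply its program to each item as it arrives."""
    for item in upstream:
        yield program(item)


def chain_apus(programs: List[Stage], data: Iterable[int]) -> Iterator[int]:
    """Wire the stages back to back so data flows through 1, 2, 3, 4 in order."""
    stream: Iterator[int] = iter(data)
    for program in programs:
        stream = apu(program, stream)
    return stream


# Four placeholder programs, one per stage.
stages = [
    lambda v: v + 1,   # stage 1
    lambda v: v * 2,   # stage 2
    lambda v: v - 3,   # stage 3
    lambda v: v * v,   # stage 4
]

# An item only counts as 'done' once it has left the last stage.
done = list(chain_apus(stages, [0, 1, 2]))
print(done)
```

Because the stages are lazy generators, each item is pulled through all four programs before the next one enters, which is about as close as plain Python gets to streaming without threads.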
Simon
Physicists get Hadrons!
Paper Details:
The Cell architecture is much more loosely coupled, which could be both its greatest strength and biggest weakness. It's a very different kind of programming model. If the Cell designers really expect developers to code to the metal, they are in for a surprise. Even the most advanced HPC shops today (e.g. government labs, NSA, etc.) are sick of hand-optimizing code. That's why we have programs like DARPA's HPCS. The software component (compilers, debuggers, performance analyzers, etc.) is at least as important as the underlying speed of the hardware. Usability is king these days. For the Cell to compete in the HPC market, it must have parallelizing and multithreading compilers.
I find the claim that the Cell will work optimally in all configurations from the PDA to a networked cluster to be dubious at best and patently false at worst. The differences in network latency alone will require radically different software solutions in these two environments. Comparing a cluster of Cell computers using Ethernet (even 10G) to a Cray network is ludicrous. GFLOPS ain't the whole story. The Cell may dominate the Top 500, but that list is almost universally recognized by HPC experts as next to useless. It's great for marketing and gloating point numbers but I'd like to see how it does on HPC Challenge.
To get a feel for this, look at the HPC Challenge results and compare the Cray Alpha (T3E) to the Dalco Opteron (cluster). Then compare the Dalco to the Cray Opteron (XD1). Then compare the T3E to the Cray X1 and NEC SX-6. Then look at the clock speeds of all the machines.
Not true. By far the biggest chunk of the transistor budget goes to the cache itself. Predication is relatively cheap compared to full-out dynamic branch prediction. Cell has apparently eliminated the cache, which would make room for lots of processing bits. However, I'll note that contrary to what is implied in the article, the newest Cray systems, including the vector machines, all have multiple levels of cache on them. Latency and locality do matter, even in large-scale vector codes.
And finally, just because something is vector doesn't mean it's a vector supercomputer. There's a reason NEC and Cray blow SSE/Altivec out of the water and it's not just vector length. It's the whole package: vector ISAs designed for high performance codes (not just pushing polygons), enormous memory and network bandwidth, and compilers that know how to make use of it.
Wrong. Jobs hired the guy who produced the Mach operating system at Carnegie Mellon, Avie Tevanian.
It's not offtopic, dumbass. It's orthogonal.