Cray SV1 Named Best Supercomputer for 2001

Posted by Hemos on Saturday August 11, 2001 @07:36PM from the and-the-winnnaaahhh dept.

zoombat writes "The BBC reported that the Cray SV1 product line won the Readers' Choice Award for Best Supercomputer for 2001 by the readers of Scientific Computing & Instrumentation magazine. These beasts have some pretty remarkable stats, including a 300 MHz CPU clock, up to 192 4.8 GFLOPS CPUs or 1229 1.2 GFLOPS CPUs, and up to a terabyte of memory. And they sure know how to paint 'em real nice. Of course, we all know how "scientific" the Readers' Choice Awards are..."

Re:I know nothing of such high end hardware, but.. by bmajik · 2001-08-11 21:00 · Score: 5, Informative

If it's anything like the older Crays (SV1 stands for "scalable vector"; iirc it's sort of a mix of vector and traditional CPUs), then it gets its speed from the vectorized nature of the cpu and, more importantly, of the problem at hand.

i was told in a CS course that the arch of the cray vector units is basically the same as the cray 1... the speeds have changed, the process has changed, the external pieces have gotten much faster.. but at the core, the cray vector machines are very fast at the following type of thing:

given a vector of a given length

do foo to every element in that vector

_very_ efficiently

to see how this operates a bit better, consider how a normal cpu might do the following

for i = 1 to 64

begin

blah[i] = blah[i] + 1

end

that would end up getting compiled perhaps into something like this on a traditional cpu:

loop:

load blah[i]

increment blah[i]

save blah[i]

increment i

if i <= 64, goto loop

what we're seeing is that for 1 element, we do a load, an ALU op, a store, an ALU op, and a conditional branch.
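to make that per-element cost concrete, here is a minimal Python sketch that walks the same loop and tallies each kind of operation. The function name `run_scalar_loop` and the `counts` dictionary are made up for illustration; this models the pseudo-assembly above, not what any real compiler emits.

```python
def run_scalar_loop(blah):
    """Add 1 to every element, tallying each operation the scalar loop issues.

    Like the pseudo-assembly, the test comes at the bottom of the loop,
    so this assumes blah has at least one element.
    """
    counts = {"load": 0, "alu": 0, "store": 0, "branch": 0}
    i = 0
    while True:
        value = blah[i]        # load blah[i]
        counts["load"] += 1
        value += 1             # increment blah[i]
        counts["alu"] += 1
        blah[i] = value        # save blah[i]
        counts["store"] += 1
        i += 1                 # increment i
        counts["alu"] += 1
        counts["branch"] += 1  # conditional branch back to "loop"
        if i >= len(blah):
            break
    return counts


blah = list(range(64))
counts = run_scalar_loop(blah)
total_ops = sum(counts.values())  # five operations for each of the 64 elements
```

so only one of the five operations per element is the "real" work; the rest is memory traffic and loop bookkeeping.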

conditional branches fuck cpus. badly. having loads and stores inside inner loops fucks cpus badly.

to see why, you need to understand pipelining, but basically i'll make it short and easy: the instruction fetch logic of a cpu is always stuffing the pipeline with its "guess" of what the next instructions should be... and it's not until several of those 1.4ghz clock cycles later that you even know if you've got the right instruction... if you do, great.. if you don't, you're fucked and you flush the pipeline and start over.

conditional branches fuck this all to hell because without optimization, you've got a 50% chance of filling your pipeline with the wrong instructions.. so on a p4 with a 20+ stage pipeline you're talking about throwing away some sizable portion of those instructions... and then refilling them... now, branch prediction really helps this a lot, but conditional branches are just one problem... the load/store units of cpus also typically introduce huge pipeline delays... i.e. you need to load blah[i] but that takes 2 or 3 cycles (even from cache!! don't even think about it if you need to go to main memory) so any instructions which use blah[i] must be scheduled at least 2-3 clock cycles afterwards...

so without keen optimization and ideal software loads, suddenly your 1.4ghz chip is stalling 2-3 instructions all the time.. and it's only running like a 400mhz proc :)
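the back-of-envelope arithmetic behind those numbers can be sketched like this. It is a deliberately crude model with assumptions that are mine, not the hardware's: one instruction per cycle when nothing stalls, every instruction eating the full stall, and a mispredicted branch flushing the whole pipeline. The function names and the 5% mispredict rate are hypothetical.

```python
def effective_clock_mhz(clock_mhz, stall_cycles_per_instruction):
    """Clock a stall-free chip would need to match the stalled one.

    Assumes an ideal rate of one instruction per cycle, with every
    instruction paying the given number of extra stall cycles.
    """
    cycles_per_instruction = 1 + stall_cycles_per_instruction
    return clock_mhz / cycles_per_instruction


def expected_flush_cost(mispredict_rate, pipeline_depth):
    """Average cycles thrown away per branch, assuming a miss flushes every stage."""
    return mispredict_rate * pipeline_depth


# A 1.4ghz chip stalling 2 to 3 cycles on every instruction.
worst = effective_clock_mhz(1400, 3)
typical = effective_clock_mhz(1400, 2.5)
best = effective_clock_mhz(1400, 2)

# A 20-stage pipeline with a coin-flip guess vs. an assumed 5% mispredict rate.
coin_flip_cost = expected_flush_cost(0.5, 20)
predicted_cost = expected_flush_cost(0.05, 20)
```

with 2.5 stall cycles per instruction the model lands on the 400mhz figure; real chips overlap stalls with other work, so treat it as an illustration of the trend only.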

so, to make traditional cpus fast, pipelining and multiple EUs have been added. these have drawbacks (and i've listed some of pipelining's above).

the "vector" approach is totally different. you actually have "vector" registers, and "vector instructions". the machine actually sets up "virtual" pipelines for you. so on a vector machine, the scenario above would be more like:

vectorsize=64

xv = xv + 1

(assuming xv is the vector register with your 64 elements in it)
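from the programmer's side, that vector instruction can be modelled roughly as below. This is a sketch only: `vector_add_scalar` and `VECTOR_LENGTH` are hypothetical names, and the Python list comprehension obviously still loops under the hood. The point is that the caller issues one operation over the whole register, with no explicit index, compare, or branch.

```python
VECTOR_LENGTH = 64  # mirrors "vectorsize=64" above


def vector_add_scalar(xv, scalar, vector_length=VECTOR_LENGTH):
    """Model one vector instruction: add scalar to the first vector_length elements.

    Elements past vector_length are passed through unchanged, the way a
    vector length setting limits how much of a register an instruction touches.
    """
    if vector_length > len(xv):
        raise ValueError("vector_length exceeds the register size")
    return [x + scalar for x in xv[:vector_length]] + list(xv[vector_length:])


xv = list(range(64))
xv = vector_add_scalar(xv, 1)                                       # xv = xv + 1
partial = vector_add_scalar([0, 0, 0, 0], 1, vector_length=2)
```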

what the cray hardware does is hook up the pieces of its cpu in a virtual pipeline that does something like this:

foreach element of xv

load

inc

save

notice that the foreach construct looks like a loop, but it's not really, it's pipelined, so what actually gets sent through looks like this

load i

inc i, load i + 1

save i, inc i + 1, load i + 2

save i + 1, inc i + 2, load i + 3

save i + 2, inc i + 3, load i + 4

save i + 3, inc i + 4, load i + 5

etc etc etc

except for fill and drain, the load, inc, and save hardware units are always perfectly utilized. there is no branching or conditional logic involved.
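the overlapped schedule, and how close to "perfectly utilized" it gets, can be reproduced with a small simulation. This is a sketch under a big simplifying assumption: three stages (load, inc, save) that each take exactly one cycle per element. `pipeline_schedule` and `utilization` are hypothetical helper names.

```python
STAGES = ["load", "inc", "save"]


def pipeline_schedule(num_elements, stages=STAGES):
    """Return, for each cycle, the list of 'stage element' pairs in flight."""
    schedule = []
    total_cycles = num_elements + len(stages) - 1  # includes fill and drain
    for cycle in range(total_cycles):
        active = []
        # Later stages hold older elements, so list them first as in the text.
        for stage_index in reversed(range(len(stages))):
            element = cycle - stage_index
            if 0 <= element < num_elements:
                active.append(f"{stages[stage_index]} {element}")
        schedule.append(active)
    return schedule


def utilization(schedule, num_stages=len(STAGES)):
    """Fraction of stage slots that did useful work across the whole run."""
    busy_slots = sum(len(active) for active in schedule)
    return busy_slots / (len(schedule) * num_stages)


small = pipeline_schedule(4)
for cycle, active in enumerate(small):
    print(cycle, ", ".join(active))

full_vector_utilization = utilization(pipeline_schedule(64))
single_element_utilization = utilization(pipeline_schedule(1))
```

under this model a 64-element vector takes 66 cycles, so the units sit idle only during the two fill and two drain cycles; a one-element "vector" gets no overlap at all, which is why short vectors don't pay off.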

the example i've chosen is very trivial, and may be subject to huge factual or conceptual mistakes :) the cray's amazing speed only works in situations where the problem can be expressed in vector instructions, i.e. do the same thing to a fuckload of data in such a way that the cray's hardware can pipeline it efficiently..

there are lots of interesting problems that the cray did _not_ handle well.. but for what it's worth, the vector processors in the cray 1 aren't significantly different in operation and instruction set from those in the SV1 of today.. by many measures, cray "got it right" originally. the SV1 of today might use normal BGA packaging on a CMOS-based process (the cray 1 used discrete ECL logic and point-to-point wiring, all strung together by little old minnesotan women).

also, the original cray 1 ran at 80mhz and could take 32mb of ram.... i.e. a machine from the 1970s was faster than any desktop workstation until the mid 90s...

note that the top500 list crays are usually the T3Es.. which are a totally different beast than the vector processor.. a T3E is just a bunch of alpha CPUs on a very fast interconnect.. sort of like a "custom cluster in a box".

--
My opinions are my own, and do not necessarily represent those of my employer.