Seymour Cray and the Development of Supercomputers (linuxvoice.com)
An anonymous reader writes: Linux Voice has a nice retrospective on the development of the Cray supercomputer. Quoting: "Firstly, within the CPU, there were multiple functional units (execution units forming discrete parts of the CPU) which could operate in parallel; so it could begin the next instruction while still computing the current one, as long as the current one wasn't required by the next. It also had an instruction cache of sorts to reduce the time the CPU spent waiting for the next instruction fetch result. Secondly, the CPU itself contained 10 parallel functional units (parallel processors, or PPs), so it could operate on ten different instructions simultaneously. This was unique for the time." They also discuss modern efforts to emulate the old Crays: "...what Chris wanted was real Cray-1 software: specifically, COS. Turns out, no one has it. He managed to track down a couple of disk packs (vast 10lb ones), but then had to get something to read them in the end he used an impressive home-brew robot solution to map the information, but that still left deciphering it. A Norwegian coder, Yngve Ådlandsvik, managed to play with the data set enough to figure out the data format and other bits and pieces, and wrote a data recovery script."
Nowhere near that. From Cray's (admittedly only distantly related to the original Cray...) web site, http://www.cray.com/company/history
The first Cray®-1 system was installed at Los Alamos National Laboratory in 1976
and cost $8.8 million. It boasted a world-record speed of 160 million floating-point
operations per second (160 megaflops) and an 8 MB (1 million word) main memory.
If I remember right, the first that got around 1 Gflop was the Cray-2. Even the X-MP only did around 2/3 of a Gflop.
The largest X-MP had 4 CPUs, each with a floating-point adder and multiplier and a clock speed of ~105MHz. So, the peak performance of these machines was 840MFlops. Achieving and sustaining that though was tricky and was only possible in large vector operations.
The impressive part of the architecture was its memory: at peak, the memory subsystem could complete 16 memory references per clock cycle (each delivering 64 bits of data), so the peak memory bandwidth was 13GBps, or roughly what a 64-bit DDR3 solution offers.
"it could begin the next instruction while still computing the current one, as long as the current one wasn't required by the next " Doesn't this go without saying?
Back in the day, pipelining - issuing, say, a new multiply instruction every clock, even though several earlier multiplies were still working their way thru the pipeline - was too expensive for most architectures. An instruction might take multiple clock cycles to execute, but in most architectures the multi-clock instruction would tie up the functional unit until the computation was done - you might be able to issue a new multiply every 10 clocks or something. Pipelining takes more gates and more design because you don't want one slow stage to determine the clock rate of the whole design.
Which leads us to the early RISC computers, I can recall an early Sun SPARC architecture that lacked a hardware integer multiply instruction. The idea at the time was every instruction should take one clock, and any instruction that demanded too long a clock should be unrolled in software. So this version of SPARC used shifts and adds to do a multiply. At the time, that was a pure RISC design. One of the key insights in RISC, still useful today, is to separate main memory access from other computations.
The CPU design course I took in the late 80's said Seymour Cray invented that idea of separating loads and stores from computation, because even then, even with static RAM as main memory, accessing main memory was slower than accessing registers. So by separating loading from RAM into registers and storing from registers into RAM, the compiler could pre-schedule loads and stores such that they would not stall functional units. Cray also invented pipelining, another key feature in most modern CPUs (I'm not sure when ARM adopted pipelining, but i'm pretty sure it's in some ARM architectures have it now). Of course Cray had vector registers and the consequent vector SIMD hardware.
I don't think Cray invented out of order execution, but I don't think he needed it; in Cray architectures, it would be the compiler's job to order instructions to prevent stalls. In CISC architectures, OOO is mostly a trick for preventing stalls without the compiler needing to worry about it (also, with many models and versions of the Intel instruction architecture out there, it would be painful to have to compile differently for each and every model). So, for example, the load part of an instruction could be scheduled early enough that the data would be in a register by the time the rest of the instruction needed it.
Anyway, the upshot is modern CPU designs have a bigger debt to Cray than to any other single design.
--- Often in error; never in doubt!
Cray also invented pipelining
That honour goes to Konrad Zuse some 25 years earlier. His purely mechanical Z1 calculator machine (not Turing complete) had a short pipeline.
SJW n. One who posts facts.