Homebrew Cray-1
egil writes "Chris Fenton built his own fully functional 1/10 scale Cray-1 supercomputer. True to the original, it includes the couch-seat, but is also binary compatible with the original. Instead of the power-hungry ECL technology, however, the scale model is built around a Xilinx Spartan-3E 1600 development board. All software is available if you want to build one for your own living room. The largest obstacle in the project is to find original software."
I built a PVP11 "supercluster" and started with Xilinx. The hardware is great, but their software toolset is horrendous.
After months of free time development, I switched over to surplus Altera Stratix II video decoder hardware, got a copy of Quartus II, and was moving within weeks. Altera would be my suggestion for any geek who wants to try something similar!
Its instructions execute accurately clock-for-clock, but it runs at 33 MHz instead of 80.
well, this is a mid 70s computer, so it must have run CP/M 8D
if he really does want to run real Cray software, he'll have to implement the interrupts and context switching for the Cray Operating System (COS) or the Unix-based UNICOS
Why, cray tell, does it run so slowly?
I have one of those, the Spartan board, not a Cray-1. I did not remember, but checked online and the Spartan board has a 66 MHz canned oscillator. So, his design probably uses two clock cycles per instruction cycle.
Probably also limited by the memory speed of whatever he's using for memory. 33 MHz equals what, a 30 ns access cycle?
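To put numbers on the clock question above, here is a minimal Python sketch of the period arithmetic. The 66 MHz oscillator figure and the two-cycles-per-instruction guess come from the comments above; the helper name `period_ns` is made up for illustration.

```python
def period_ns(freq_mhz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


# Figure quoted in the thread, not verified against the board's datasheet.
board_osc_mhz = 66.0
# Assumption from the thread: two oscillator cycles per instruction cycle.
core_mhz = board_osc_mhz / 2

print(f"{core_mhz:.0f} MHz -> {period_ns(core_mhz):.1f} ns per cycle")
print(f"80 MHz -> {period_ns(80.0):.1f} ns per cycle")
```

So 33 MHz is roughly a 30 ns cycle, against 12.5 ns for the original 80 MHz machine.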
"Science flies us to the moon. Religion flies us into buildings." - Victor Stenger
S3E's have DCMs (Digital Clock Managers) making them very flexible in terms of what the internal clock frequencies are, even with a fixed input frequency.
Chances are (I can't get to the site) it just runs at 33MHz as its best-supported clock frequency. An S3E is a pretty cheap and slow FPGA - I remember writing a 32-bit CPU for one, and until I started optimising the logic-placement in the FPGA, it was only running at ~30MHz. I got it up to ~50MHz after tweaking and pipelining, but his design may do more than my simple CPU.
Simon
Physicists get Hadrons!
"Why," you may ask, "was the internal wiring so insanely packed?" The length of each point-to-point wire was individually calibrated, such that all the signals to each gate arrived at the same moment, so you didn't need flip-flops to latch values in the flow of the circuits. Kind of a "just-in-time delivery" of electrons; and each layer of buffering avoided saved you delay along the pipeline. I don't think this sort of scheme was used on any other mainframe.
Send an email to the folks at the CISL division of NCAR.
They know a thing or two about Crays.
From TFS:
"All software is available if you want to build one for your own living room. The largest obstacle in the project is to find original software."
Um... why not just click on the little link provided there?
There is no "I disagree" mod for a reason. Flamebait, Troll, and Overrated are not substitutes.
I think there were (are?) four Supercomputer Centers that had Cray-1 and later Cray X-MP machines. The Pittsburgh center did a lot of work with Carnegie Mellon, esp. the Robotics Institute.
I personally did one bit of work - porting a photometrically correct ray-tracer by Dr. Robert Thibadeau in the Image Understanding Laboratory from an Apollo workstation to the Cray at PSC - this would have been in 1989, I think. The one complication we had was that the Cray floating point format was different, so our first runs were all zeros. Other than that the code compiled and ran fine on the Cray. Of course, a run that took two weeks on the Apollo ran in about 40 seconds on the Cray.
A lot, maybe all, of the work done on these machines was non-spooky research, so perhaps you can track down some of the professors at the associated universities, such as CMU, Northern Illinois, UCSD, Berkeley, etc. Also check out the weather folks - they have been among the biggest CPU cycle-burners for a long time. I worked briefly with one weather guy at a weather research facility in Wyoming but I don't recall any details - was it U Wy?
The SCs I recall are:
I'm sure that if you dig around in the universities you'll find folks who have stuff piled on a back shelf somewhere (probably in a tape format you can't read). Also look up in the old annals of the ACM SIG on supercomputing - that will give a line on researchers who were working on the Cray.
It's easier to be a result of the past, but more fun to be a cause of the future! http://www.spacefinancegroup.com/
As part two (see previous attempt) of my ongoing series in ‘computational necromancy,’ I’ve spent the last year and a half or so constructing my own 1/10-scale, binary-compatible, cycle-accurate Cray-1. This project falls purely into the “because I can!” category - I was poking around the internet one day looking for a Cray emulator and came up dry, so I decided to do something about it. Luckily, the Cray-1 hardware reference manual turned out to be useful enough that implementing most of this was pretty straightforward. The Cray-1 is one of those iconic machines that just makes you say “Now that’s a super computer!” Sure, your iPhone is 10X faster, and it’s completely useless to own one, but admit it . . you really want one, don’t you?
The Cray-1A Architecture
Now, let’s get down to specs - What is this bad boy running? The original machine ran at a blistering 80 MHz, and could use from 256-4096 kilowords (32 megabytes!) of memory. It has 12 independent, fully-pipelined execution units, and with the help of clever programming, can peak at 3 floating-point operations per cycle. Here’s a diagram of the overall architecture:
[Figure: Cray-1 architecture diagram]
It’s a fairly RISC-y design, with 8 64-bit scalar (S) registers, 8 64-bit/64-word vector (V) registers, and 8 24-bit address (A) registers. Rather than a traditional cache, it uses a ’software-managed’ cache with an additional 64 64-bit words (T registers) and 64 24-bit words (B registers). There are instructions to transfer data between memory and registers, and then register-to-register ‘compute’ instructions.
One of the coolest aspects of this machine is that everything is fully pipelined. This machine was designed to be fast, so if you’re careful, you can actually get one (or more) instruction every cycle. This has some interesting implications - there’s no ‘divide’ instruction, for instance, because it can take a variable amount of time to finish. To perform a divide, you need to first compute the ‘reciprocal approximation’ (something we *can* do in exactly 13 cycles, it turns out) of the denominator value, and then perform a separate multiply of that result with the numerator.
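The divide-by-reciprocal idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the Cray's actual circuit: the linear first guess and the Newton-Raphson refinement steps are assumptions chosen for the example, and the names `reciprocal_approx` and `divide` are hypothetical.

```python
import math


def reciprocal_approx(x: float, steps: int = 4) -> float:
    """Approximate 1/x using only multiplies, adds and exponent shifts."""
    if x == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if x < 0 else 1.0
    # abs(x) == m * 2**e with 0.5 <= m < 1
    m, e = math.frexp(abs(x))
    # Linear first guess for 1/m on [0.5, 1) (an assumption for this sketch).
    r = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(steps):
        # Newton-Raphson step: the relative error roughly squares each time.
        r = r * (2.0 - m * r)
    # Undo the exponent scaling and restore the sign.
    return sign * math.ldexp(r, -e)


def divide(numerator: float, denominator: float) -> float:
    """Divide as described above: reciprocal first, then a separate multiply."""
    return numerator * reciprocal_approx(denominator)


print(divide(355.0, 113.0))
```

Every step here is a fixed sequence of multiplies and adds, which is what lets a pipelined machine give the operation a fixed latency.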
The vector instructions are particularly cool. A vector Add operation might take only 5 cycles to start producing results (remember, each vector can hold 64 values, so it takes 5 + 64 cycles to finish adding). Why wait for it to finish though? We can take the result output from the adder, and “chain” it straight into another vector unit (say a multiplier). And *that* only takes another 10 cycles or so, so we can chain that result into yet another unit (say, reciprocal approximation). Now, rather than waiting for the first operation to finish, we’re computing up to 3 floating point calculations per cycle. Clever programmers could sustain about 2 floating point operations per cycle, or 160 million floating point operations per second at 80 MHz.
[Figure: Vector Chaining in Action!]
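A rough cycle count shows why chaining pays off. The Python sketch below uses the simplified "startup latency + 64 elements" model from the paragraph above and ignores issue overhead and memory traffic; the function names are invented for the example, and the 5- and 10-cycle latencies are the approximate figures quoted in the text.

```python
VECTOR_LENGTH = 64  # elements per vector register


def unchained_cycles(latencies, n=VECTOR_LENGTH):
    """Each unit waits for the previous one to finish all n elements."""
    return sum(latency + n for latency in latencies)


def chained_cycles(latencies, n=VECTOR_LENGTH):
    """Each unit starts as soon as the previous unit's first result appears."""
    return sum(latencies) + n


def mflops(flops_per_cycle, clock_mhz):
    """Millions of floating point operations per second."""
    return flops_per_cycle * clock_mhz


add_then_multiply = [5, 10]  # approximate latencies quoted in the text
print(unchained_cycles(add_then_multiply))  # add finishes, then multiply runs
print(chained_cycles(add_then_multiply))    # multiply overlaps with the add
print(mflops(2, 80))                        # sustained rate at 80 MHz
```

Under this model, an add chained into a multiply finishes in 79 cycles instead of 143, and two floating point operations per cycle at 80 MHz works out to 160 MFLOPS.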
The Hardware
The actual design was implemented on a Xilinx Spartan-3E 1600 development board. This is basically the biggest FPGA you can buy that doesn’t cost thousands of dollars for a devkit. The Cray occupies about 75% of the logic resources, and all of the block RAM.
[Figure: Spartan-3E 1600 development board]
This gives us a spiffy Cray-1A running at about 33 MHz, with about 4 kilowords of RAM. The only features currently missing are:
-Interrupts
-Exchange Packages (this is how the Cray does ‘context-switching’ - it was intended as a batch-processing machine)
-I/O Channels (I just memory-mapped the UART I added to it).
If I ever find some software for this thing (or just get bored), I’ll probably go ahead and add the missing features. For now, though, everything else works sufficiently well to execute small test programs and such.
The Software
When I started building this, I thought “Oh, I
I'm perfect in every way, except for my humility.
Keep in mind that this was before the IEEE 754 floating point specification. Many, if not all of the trig functions were approximations, to which Cray quipped "Do you want fast or accurate?"
Ages ago, I heard this story. Can anyone confirm if this is true or not?
Seems Steve Jobs, upon the success of the first Macs, was getting ready for the next step and he went to Cray Computer to buy one (probably to help design the PowerPC?).
Anyway, Cray Computers were not just sitting on the shelf waiting to be sold, so it seems Jobs created an altercation and demanded to see the manager about getting one, so they called Seymour down to the lobby. Steve introduced himself and said words to the effect of “I’d like to use a Cray to design the next Apple Computer”. Seymour replied “That’s great. I used an Apple Computer to design my Cray”.
The S3E itself can be clocked internally at 300+ MHz. However, the maximum speed achievable depends on the architecture and layout of the circuit implemented: the maximum clock is set by the longest logic and routing delay through the circuit. Since the design is apparently a register-for-register copy of the original Cray architecture, the original ECL logic still has a speed advantage over the CMOS S3E.
Not having seen the design, I don't know how it's been implemented, but it's possible to have a compatible design that implements all the original specifications without designing it the same way... It's also possible for an FPGA design to run faster than the original part - see the multiple-tens-of-MHz variants of the equally venerable 6502 (which maxed out at ~2MHz at the time) for example.
Clearly, the logic path dictates the final speed. That's why placement is so important, and why hand-placement is far better than the pathetic job the automatic tools produce. Perhaps you were intending that for the parent post to mine, but anyone doing any FPGA work knows about the critical path...
Simon
Physicists get Hadrons!
Very interesting. Thanks for that. Sounds like you worked there.
How did the direct to film imaging work? CRT through optics to film?
Yes, I still do work there. The film output did indeed work like you suspect. The film recorders were made by Dicomed, and we ran them so hard that our in-house electronics maintenance personnel were well acquainted with them. The Dicomeds supported a 4096x4096 vector graphic display, and the usable portion depended on the output format and frame size. Raster output could be performed by displaying each point for the required amount of time to produce the desired exposure. The color versions had a color wheel and required three passes.
Initially the Dicomeds were driven from PDP-11 systems running RSX via DR-11 interfaces. Then we wrote new software (called TAGS - Text and Graphics Server) that ran on Sun 3s with DR-11 equivalents. I wrote a simple X Window-based Dicomed simulator so we could test the software drivers without needing to wait for film to be developed, although there were several developing runs per day, depending on demand.
Later on we also attached video tape recorders to a crude DVR type box (I forget the manufacturer's name) that could record up to 30 seconds at a time to hard disk. The software would then start up the VCR to record that portion, then stop it. Needless to say, that was very hard on the VCR mechanisms!
The users could send their graphics (NCAR Graphics, text, raster graphics) to any of the devices by just passing a different destination device on the MASnet command.
The appearance of table top video projectors that attached to computers and cheaper laser printers was the downfall of TAGS and all of the associated output hardware.