Cray Supercomputers to be Based on AMD Opterons
PsychicX writes "AMD and Cray have announced an agreement to base Cray supercomputers on AMD's Opteron line until the end of the decade, and to collaborate on Cray's 2006 proposal for Phase 3 of the federal government's DARPA HPCS (High Productivity Computing Systems) program. Cray already offers the XT3 and XD1 supercomputers based on Opteron."
Tera bought far more than a name when they bought us. They also bought a bunch of software and hardware people, many of whom (myself not included) had been with Cray Research (the original Cray) for many years. So, while it's certainly not the Cray of the mid-1980s, the tradition still goes back there, especially with the vector machines like the Cray X1/X1E and its impending follow-on.
Go Badgers! -- #include "std/disclaimer.h"
K6 technology was acquired and modified by AMD. The K7 and K8 were designed by AMD. True, many of the engineers on the K7 and K8 teams were probably ex-NexGen since AMD acquired that company, but so what? They are truly AMD innovations. At least they didn't sink all of their research into the Itanic!
I've never completely understood this argument (yes, I admit, I'm heavily biased). If I want to build a skyscraper, I'm not going to use the "mass market" crane that puts up the roof of a residential house. I'm going to use a specialized crane that's meant for building skyscrapers.
That doesn't mean that there isn't a place for commodity hardware in supercomputing, but to say that there's no room for custom hardware misses the point. The only thing "off the shelf" about an AMD-based Cray is the AMD processor. The logic board and, most importantly, the network that interconnects the processors are entirely custom. Not to mention the fact that Cray will still build some entirely custom processors...
By the way, this is hardly the first Cray based on a commodity processor. The T3D and T3E were both built around Alpha processors, yet nobody calls those machines "commodity".
Go Badgers! -- #include "std/disclaimer.h"
Check out the article here, dated June 16, 2005:
http://www.hypertransport.org/consortium/cons_pressrelease.cfm?RecordID=79/
Have you tried the PathScale compilers (http://www.pathscale.com/ekopath.html)? They seem to give quite decent performance on AMD64 chips.
I know you're trying hard to make it look like this is work-related, but for goodness' sake, don't make major purchasing decisions based on what you read on Slashdot!
NUMALink
Have you got any links to some page describing some configuration similar perhaps to your colleague's system? I haven't heard about 10GiB/s CrayLink and find it very intriguing.
I speak England very best
Close.
CrayLink was designed at SGI, and only given the CrayLink name after they bought Cray. They introduced CrayLink in the Origin 2000, which they started selling half a year after buying Cray, so I'm sure they couldn't have integrated any Cray designs into their product in that span.
After they sold Cray to Tera, SGI started calling the technology NUMAlink, and they currently use it in their origin3, altix3, and altix4 product lines. They are on the 4th generation of the technology, which is 3.2GB/s per direction. The Cray that was sold to Tera included the half-finished X1 system, which also uses NUMAlink. It uses the older 1.6GB/s per direction links, but runs 32 networks in parallel for a total of ~50GB/s per direction per node.
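As a quick sanity check on the "~50GB/s" figure, here is a minimal sketch of the arithmetic. It uses only the two numbers quoted in the comment above and assumes every link is fully used; the function and constant names are made up for illustration.

```python
# Figures quoted in the comment above, in GB/s per direction.
LINK_GB_PER_S = 1.6      # older NUMAlink generation used by the X1
PARALLEL_NETWORKS = 32   # networks used in parallel per node


def aggregate_bandwidth(link_gb_per_s, links):
    """Total per-direction bandwidth, assuming every link is fully used."""
    return link_gb_per_s * links


total = aggregate_bandwidth(LINK_GB_PER_S, PARALLEL_NETWORKS)
print(f"{total:.1f} GB/s per direction per node")
```

This works out to 51.2GB/s, which the comment rounds to ~50GB/s.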
The Cray XT3 uses a newer network interconnect called SeaStar, which offers 3.8GB/s per direction. This is probably what will be used in the X1's successor.
The Cray XD1, which your colleague bought, is a product Cray acquired when they bought OctigaBay. It uses an interconnect called the RapidArray switch, which provides 4GB/s per direction.
All of these interconnects are high-bandwidth and low-latency. The XD1 is also very inexpensive for a Cray, which is always nice.
Because they don't do floating-point in hardware, or at least not to any useful level of performance.
The 8-core Niagara (T1) has 1 floating-point execution unit on only 1 of the 8 cores. Buy a 6- or 4-core Niagara, and do you get a floating-point execution unit at all?
On Niagara (aka UltraSPARC T1) floating-point will mostly be accomplished with software emulation of the SPARC V9 FP instructions.
That's why you wouldn't use Niagara for supercomputing. Web serving, yes, computational fluid dynamics or numerical general relativity, no.
Stick Men
The parent poster posts well for one ignorant of the simplest precepts of marketing. The first thing a marketer learns is that he must segment a market and compete only in the segments or niches in which competition is profitable. Cray isn't competing directly against clusters, because clusters don't have the bandwidth necessary for the sorts of problems Crays are aimed at, and Crays tend to be overkill for the problems clusters are aimed at. Cray doesn't seek out customers under $.5M for that reason. Anyone who actually uses supercomputers to solve problems knows that a 50% difference in interconnect speed per single link could mean a 90+% slowdown on a large system running a large program with high overhead. Plain old clusters aren't targeted against Crays, except by some communities that don't buy supercomputers for supercomputer problems anyway, like most Slashdot users.
In the supercomputer world, MTTI (mean time to interrupt) is everything! A bad memory module or a blown CPU fan might hit your single CPU every 3 years on average, but multiply these sorts of problems by 10,000 CPUs on a supercomputer and your cluster will never get any useful work done before something goes out and it crashes.
Disclaimer: I worked on the X1/X1E, which is still faster than any other chip on select problems which vectorize well. I agree that the AMD partnership was and continues to be an excellent decision, but it only says that AMD does SCALAR performance better, not everything!
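The MTTI point can be put in rough numbers. The sketch below takes the comment's figures (one failure per CPU every 3 years, 10,000 CPUs) and adds two simplifying assumptions that are not in the original: failures are independent with a constant rate, and any single failure interrupts the whole job. The function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def system_mtti_hours(component_mtbf_years, n_components):
    """Mean time to interrupt for the whole machine, in hours.

    Assumes independent failures at a constant rate, and that any one
    component failure interrupts the job (both are simplifications).
    """
    return component_mtbf_years * HOURS_PER_YEAR / n_components


mtti = system_mtti_hours(component_mtbf_years=3, n_components=10_000)
print(f"{mtti:.2f} hours between interrupts")
```

Under those assumptions the machine is interrupted roughly every 2.6 hours, which is why a job that needs a day of uninterrupted running would rarely finish without checkpointing or more reliable parts.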
-Those who would give up essential liberty to purchase temporary safety deserve neither. -Ben Franklin
No they won't! They have no reason to. The vector units that a cray uses aren't like altivec, sse, or other "bolt-on" vector units. The vector unit on a cray (or NEC) is a latency hiding mechanism. It's a method for forcing the programmer/compiler to structure the code such that the data loaded from memory is used a significant period of time after the load is initiated. This works pretty well on the HPC code that is used on crays, but not at all for the everyday server/workstation code that opterons run. Furthermore, to support that sort of vector unit, you need to have about eight times as much memory bandwidth as an opteron, which means many more pins on the socket, which are very expensive.
I think you're much more likely to see the cray vector processor retooled with lots of hypertransport connections, so it can use an opteron as its scalar unit, and use the same seastar routers that the xt3 uses. On the X1, the scalar unit already runs ahead of the vector unit, so I bet it's not all that important for the scalar unit to be on-die.
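To illustrate the latency-hiding idea, here is a toy cost model, not a description of any real Cray pipeline. It assumes a memory latency of 300 cycles, a vector length of 64 elements, and one element delivered per cycle once a vector load is streaming; all three numbers and both function names are illustrative assumptions. It ignores caches, prefetching, and out-of-order execution.

```python
def scalar_cycles_per_element(latency):
    """Worst case: every load waits out the full memory latency."""
    return latency


def vector_cycles_per_element(latency, vector_length):
    """One vector load pays the latency once, then streams
    vector_length elements at one element per cycle."""
    return (latency + vector_length) / vector_length


LATENCY = 300        # assumed memory latency in cycles
VECTOR_LENGTH = 64   # assumed elements per vector register

print(scalar_cycles_per_element(LATENCY))
print(vector_cycles_per_element(LATENCY, VECTOR_LENGTH))
```

In this model the latency is paid once per vector instead of once per element, so the cost per element drops from 300 cycles to under 6. The real gain depends on how well the code vectorizes and on having the memory bandwidth to keep the stream fed.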
No they won't! They have no reason to.
Yes, you're probably right that it doesn't make sense for AMD economically. But I want to run numerical codes at more than 5% of peak performance on my cheap Opterons, so I want to believe.
The vector units that a cray uses aren't like altivec, sse, or other "bolt-on" vector units. The vector unit on a cray (or NEC) is a latency hiding mechanism. It's a method for forcing the programmer/compiler to structure the code such that the data loaded from memory is used a significant period of time after the load is initiated.
Yes, I know. And that's precisely the reason why I'd like to see real vectors instead of the sse/altivec toy ones. Main memory latency is hundreds of cycles, and it's getting worse all the time.
Additionally, from a microarchitecture perspective, vectors have quite a few advantages there too.
This works pretty well on the HPC code that is used on crays, but not at all for the everyday server/workstation code that opterons run.
I'm not sure about that. I guess technical apps vectorize just as well as HPC codes (well perhaps not the UI, but the code that runs the actual simulation or whatever). Heck, even some database code vectorizes nicely (sorting and hash joins).
Furthermore, to support that sort of vector unit, you need to have about eight times as much memory bandwidth as an opteron, which means many more pins on the socket, which are very expensive.
Yes, as I said, some Alpha Tarantula-like design is probably overkill for the vast majority of the market. My point was that a vector ISA extension with modest execution resources wouldn't need that much die area, and could help make better use of the available bandwidth, whatever that bandwidth is. As you said yourself, the expensive thing is IO. Transistors are cheap by comparison. So not having instructions that allow one to effectively use the available IO resources is a real shame.
I think you're much more likely to see the cray vector processor retooled with lots of hypertransport connections, so it can use an opteron as its scalar unit, and use the same seastar routers that the xt3 uses. On the X1, the scalar unit already runs ahead of the vector unit, so I bet it's not all that important for the scalar unit to be on-die.
Yes, that sounds feasible. IIRC it is something like this that Cray has cooked up for the Cascade project, i.e. a node consists of 8 (or was it 4?) scalar processors connected to memory (I guess these could be Opterons or, further in the future, some kind of processor-in-memory (PIM) stuff), and a vector unit with its own cache and fast access to the main memory via the scalar CPUs.
As for the seastar thing, I think you're right that that's what they'll use for inter-node communication. Currently X1(E) uses Numalink licenced from SGI, so they're certainly looking at replacing that with existing in-house tech. BTW, 2H2006 will see the XT4, with the new Opteron sockets with DDR2 memory and the Seastar2 router that provides twice the BW compared to the existing Seastar.