flaming-opus · Slashdot Mirror

Re:Opterons and PowerPC together on Cray XT-3 Ships · 2004-10-26 02:28 · Score: 1

Basically this is the same off-load as blue-gene. One processor with an MPI off-load engine. In the case of blue gene, the main cpu is another 440, while xt3 uses a much stronger opteron. (of course the IBM solution is less expensive, and much denser).

The real difference in this system is the high bandwidth shared memory. Blue Gene has hardware support for shared memory, but the software appears to be strictly MPI based. (at least in the first revision, and according to what I've read, this may be preliminary data only).

Re:65 TFlop is only an estimate on NEC Strikes Back With SX-8 Supercomputer · 2004-10-21 02:39 · Score: 1

Not usually. I've actually done a little bit of work on NSA systems. (you get stack traces by fed-ex after they have been reviewed by security pros) I don't know about ALL of their systems, but the ones I've dealt with are limited more by data throughput than by pure number crunching.

It's a good point that NSA, CIA, etc use big vector boxes, and don't report to top500. They basically bank-rolled the r/d phase of X1 development.

Re:65 TFlop is only an estimate on NEC Strikes Back With SX-8 Supercomputer · 2004-10-20 07:51 · Score: 4, Insightful

It's probably a pretty good estimate, as this is just a clock speed bump and packaging update to the sx-6 (earth simulator).

An equally important criticism is that they've only announced the POSSIBILITY of building a 65TF system. No one has actually ordered one. The cray X1 can scale up to 50TF if fully populated. The X1E scales up to 150TF. This is of no great consequence, as the largest one in production is only 10TF. Yes they could build a really big sx-8, but it cost $200M to build the earth simulator, probably something similar to build this thing.

There are a lot of computers that are really cool - on paper.

Re:Runs Linux? on NEC Strikes Back With SX-8 Supercomputer · 2004-10-20 07:42 · Score: 1

Nope. The Global File System in super-UX is much less general purpose than redhat's GFS. It relies on global shared memory afforded by the full crossbar interconnect of the sx-series computers.

They are similar from a 1000 foot view, very different in implementation, design, and target user.

Re:specialized boot drives on Itty Bitty SCSI Hard Drive Arrives · 2004-10-15 09:53 · Score: 1

But who the hell needs a fast boot drive? That's the sort of usage pattern where it just won't matter. True, the boot drive doesn't need to be very big, and a 1" drive could serve as a very good boot drive for a server. (2 of them mirrored = better)

The ludicrous speed is where you want your database, or mail spool, or whatever the server is there for. The speed of the reboot is probably not limited by the disk speed, unless you're booting off floppy drives.

Re:Misconceptions on What's The Linux Kernel Worth? · 2004-10-13 05:22 · Score: 1

True, this isn't a comparison to a single license of windows. However it would be interesting to see what other similar deals have cost.

Windows NT is (arguably) a derivative of DEC's VMS operating system. Back in the early 90's microsoft settled a case with digital for copyright problems. (Anyone know the value?)

When cray research developed the cs6400 server, they licensed solaris from sun. I wonder what they paid for that.

Microsoft's SQL server is a derivative of sybase. Anyone know what they paid for those rights?

I don't have any numbers here, but they are all 10's or 100's of millions of dollars. $50k is ridiculous for a linux license.

Re:The end of custom CPUs on Cray XD1 Now Available · 2004-10-05 14:11 · Score: 1

uh, close.
The scalar unit of the X1 has a 400mhz clock and is dual-issue. (It's a MIPS derivative) It's a late-90s era risc chip. There are 4 scalar units on each MSP.

The vector unit is clocked at 800mhz, and has 8 execution pipes per MSP. The vector unit does not compute the entire vector all at once, in fact you wouldn't want it to, as the pipelining of the vector unit is what helps mask memory latency in a vector-processor. The CPU dumps the contents of the vector registers into the 30+ stage pipeline 8 at a time, and can use the output of one operation as the input of another operation without committing the value to a register (this is called chaining). Since the latency of this process is really large, it masks the latency of loading values from memory. (most vector machines are cacheless. The X1 has one, but earth simulator does not) However, the memory bandwidth and computational throughput are very high. This is why vector machines are GREAT for big memory problems, and TERRIBLE for code with lots of branching and short loops. (object oriented database code for example)
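A rough way to see what chaining buys is to count cycles in a toy pipeline model. This is only a sketch under simplifying assumptions, not the X1's actual timing: one group of `lanes` elements enters the pipe per clock, every operation has the same `depth`, and memory never stalls. The function names and the vector length of 64 are made up for illustration.

```python
from math import ceil

def single_op_cycles(n, lanes, depth):
    """Cycles for one vector operation over n elements in the toy model."""
    groups = ceil(n / lanes)      # element groups issued, one group per clock
    return depth + groups - 1     # first result after `depth` clocks, then one group per clock

def two_ops_unchained(n, lanes, depth):
    """The second op waits until the first has written back its whole result."""
    return 2 * single_op_cycles(n, lanes, depth)

def two_ops_chained(n, lanes, depth):
    """The second op starts consuming results as soon as the first group emerges."""
    groups = ceil(n / lanes)
    return 2 * depth + groups - 1

# 8 pipes and a 30-stage pipeline as quoted above; 64 elements is an assumption.
n, lanes, depth = 64, 8, 30
unchained = two_ops_unchained(n, lanes, depth)
chained = two_ops_chained(n, lanes, depth)
print(unchained, chained)
```

In this model the saving from chaining is the number of element groups minus one, so it grows with vector length.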

Thus the theoretical peak performance of the scalar units is half that of the vector unit. However, most codes only use one of the scalar units, and that is used to direct the vector unit. Vector processors often realise 50% or greater of peak performance, though this requires a lot of application tuning. Most scalar machines are lucky to hit 10%, even with heavy optimisations. This has much more to do with the vector instruction set, and the HUGE memory bandwidth (16 channels of rdram) than with the theoretical throughput of the ALUs.

Vector processors are a lot less versatile than a scalar processor. They force the programmer to arrange the problem in a way that exposes a lot of parallelism, and to explicitly tell the processor about that parallelism. If the programmer is able to do that, and goes through the work, the processor exploits that organization, basically until it is limited by the memory bandwidth. A superscalar design will try to guess about parallelism, but it's a difficult thing to do since the machine is so generic and versatile.
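The "half" figure can be checked from the clocks and widths quoted in this comment. A minimal sketch, assuming one operation per issue slot or pipe per clock (it ignores things like fused multiply-add, so these are issue rates rather than official flop ratings); `peak_issue_rate` is just an illustrative helper.

```python
def peak_issue_rate(units, clock_hz, per_clock):
    """Peak operations per second: units x clock x operations per clock."""
    return units * clock_hz * per_clock

# Figures from the comment: 4 dual-issue scalar units at 400 MHz per MSP,
# and a vector unit with 8 pipes at 800 MHz per MSP.
scalar_peak = peak_issue_rate(units=4, clock_hz=400e6, per_clock=2)
vector_peak = peak_issue_rate(units=1, clock_hz=800e6, per_clock=8)
ratio = scalar_peak / vector_peak

print(f"scalar {scalar_peak / 1e9:.1f} G/s, vector {vector_peak / 1e9:.1f} G/s, ratio {ratio:.2f}")
```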

Re:Hmmm... on Cray XD1 Now Available · 2004-10-05 08:59 · Score: 1

Most itanium systems use a shared bus.

One should note, however, that the altix does not use the shared bus features of the itanium. Or at least, that bus is shared only by one cpu, the memory, and the bridge chip. The interconnect architecture of the altix is identical to the interconnect used on the old SGI origin systems, which were based on MIPS processors. From an architecture point of view, the Altix and the XD1 are very very similar. One uses itanium, one uses opteron.

Altix tries to run a single OS image across the entire machine, while XD1 relies on MPI to do data sharing. However, even SGI doesn't spread a single linux image across their biggest machines. They also create a cluster-in-a-box on large enough configurations.

Re:What happened to RedStorm? on Cray XD1 Now Available · 2004-10-05 06:20 · Score: 2, Informative

It's still on-going. Red Storm is the focus of cray's effort over the next couple of years. Red Storm is the real-deal MPP-style system with a micro-kernel OS. xd1 is a low-end mini-super they acquired to expand down-market. (like mercedes buying chrysler)

2 complementary product lines. You could run the same application on both, though red storm provides real shared memory, which might allow better optimizations.

Re:Not the top end on Cray XD1 Now Available · 2004-10-05 06:15 · Score: 2, Interesting

ISR (Isothermal Systems Research, Inc.) and cray have cross licensed patents on this technology. I don't know if ISR plans on productizing this or not.

I imagine that this is extremely expensive stuff to do. Since a cray can charge $40,000 per processor for the X1, they can get away with a $700 cooler. Not so easy on a PC.

Re:Does the XD1 give the illusion of shared memory on Cray XD1 Now Available · 2004-10-05 04:13 · Score: 1

NO.
For that you need to buy cray's mpp system "strider", which is a productized version of red-storm.

The xd1 hardware is probably capable of shared memory, the software is not. The nodes (each 2-cpu blade) run off-the-shelf linux, and use MPI to share data.

Re:Hmmm... on Cray XD1 Now Available · 2004-10-05 04:08 · Score: 4, Informative

Cray now has three product lines to address 3 different market segments.

They have the X1, which is a massively parallel vector system for the very high-end. (For those who need 30+Gbytes/second of memory bandwidth for EACH cpu) These things are huge, expensive, and used by a limited number of users, mostly governments.

They are getting ready to productize red storm, which is also a bunch of opterons, but strung together in a shared-memory system like the T3E. also a high-end solution.

This system, the Xd1, is a low end system designed to be a half-step better than a cluster of off-the-shelf opterons. It's a multi-kernel cluster using MPI for all the data sharing. However the interconnect basically sits where the south-bridge sits on most opteron boxes.

So Cray still has the absolute cutting edge systems, but has now expanded down-market. (Rather, they acquired OctigaBay, who did the early design work).

This is also not the first time this has happened. In the early 90s, Cray purchased a small start-up that was developing a NUMA-style mini-super based on sparc processors. They turned it into a product and sold a few, though not as many as they would have liked. During the SGI acquisition they sold the product to SUN, who branded it the E10000, and made about a billion dollars off of it. It's now the foundation for all of Sun's high-end Unix servers.

Cray also bought a small company (I forget the name) that made a cmos implementation of the YMP. This became the ymp-el, the J90, which pioneered technology for the SV1.

Cray has often built mid-range systems. Nothing new.

Re:Off the shelf configuration on IBM Sets Supercomputer Speed Record · 2004-09-29 09:23 · Score: 1

You might have this exactly reversed. Blue Gene does not have ram embedded in the CPU, but rather soldered to the node-card.

However, a lot of embedded applications don't need any more than a few hundred k of ram, and could theoretically have the entire memory space on-die. I haven't done that with an IBM chip, but I've used motorola microcontrollers with 256K of SRAM on-die. I believe the entire executable fit in 8K, and the collected data in 32K, which left us with 6 32K buffers to use for debugging.

If the asic is a RAID controller or a DSP, then yes, you need lots of external RAM.

Re:way to catch up guys. on IBM Sets Supercomputer Speed Record · 2004-09-29 03:29 · Score: 2, Interesting

That's only sort-of true. Blue Gene, like asci-red, cri t3e, paragon, etc., uses a microkernel OS to control the compute nodes. This is basically a couple of network stacks that allow the application to use the interconnect network, and some hooks for the larger OS which runs on dedicated OS-nodes. The microkernel mostly just gets out of the way, and lets the application run balls-out on the compute nodes. Blue Gene was even designed so cleverly that MPI barriers and all-reduce are implemented as part of the interconnect network.

But then the application does something like write(file, offset, &buffer). That can't be handled by the microkernel, and must be handled by an OS node. The system call might even be handled by a different node from the I/O node connected to the disk drive. The system call is performed by a "server" on the OS node that may be part of that node's operating system, or might be a user-space daemon. Since there is only 1 thread on the compute node, it blocks until the i/o request is serviced.
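The forwarding pattern can be sketched with two threads standing in for a compute node and an OS node. This is a toy illustration, not Blue Gene's actual protocol: `os_node_server`, `forwarded_write`, the in-memory `storage` dict, and the file name are all hypothetical, and a queue stands in for the interconnect.

```python
import queue
import threading

def os_node_server(requests, storage):
    """Hypothetical I/O daemon on an OS node: services forwarded write requests."""
    while True:
        req = requests.get()
        if req is None:                      # shutdown sentinel
            break
        name, offset, data, reply = req
        buf = storage.setdefault(name, bytearray())
        end = offset + len(data)
        if len(buf) < end:                   # grow the "file", zero-filling any gap
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data
        reply.put(len(data))                 # report bytes written back to the caller

def forwarded_write(requests, name, offset, data):
    """Compute-node stub: ship the call to the OS node and block until it is serviced."""
    reply = queue.Queue(maxsize=1)
    requests.put((name, offset, data, reply))
    return reply.get()                       # the single compute thread waits here

requests = queue.Queue()                     # stands in for the interconnect
storage = {}                                 # stands in for the file system
server = threading.Thread(target=os_node_server, args=(requests, storage))
server.start()

n1 = forwarded_write(requests, "out.dat", 0, b"hello")
n2 = forwarded_write(requests, "out.dat", 8, b"world")

requests.put(None)
server.join()
print(n1, n2, bytes(storage["out.dat"]))
```

With one server and one client this is trivial; the hard part is what happens when the single `requests` queue is replaced by thousands of clients and a pool of servers that must agree on file state.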

This is not a hard thing to do if there are 60 compute nodes, 2 OS nodes and 2 i/o nodes. But with 100,000 compute nodes, there would have to be hundreds or thousands of OS nodes. Far too many to run with a monolithic kernel. Scalability within this pool of OS nodes is a tricky problem. Previous MPP designs have demonstrated that it's really easy to get the common case working, but much much harder for a few corner cases, like concurrent un-structured writes to the same file. (which tends to happen at the beginning and end of many big MPI programs. - Remember that you don't solve a problem any faster if the machine runs 30Tflops for 2 days, and then spends 25 days putting the output data together)
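The 2-days-versus-25-days remark is easy to put a number on. A small sketch, assuming the output phase does no useful floating-point work at all; `effective_tflops` is an illustrative helper, and the inputs are the figures from the example above.

```python
def effective_tflops(peak_tflops, compute_days, io_days):
    """Average rate over the whole job when only the compute phase does useful work."""
    return peak_tflops * compute_days / (compute_days + io_days)

# 30 Tflops for 2 days, then 25 days assembling output.
rate = effective_tflops(peak_tflops=30.0, compute_days=2.0, io_days=25.0)
print(f"{rate:.2f} Tflops averaged over the whole run")
```

Under that assumption the machine delivers a little over 2 Tflops averaged over the job, which is why I/O scalability matters as much as the inner loops.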

On a machine that large, check-point / restart is a big deal. Node failures are going to be common when that many components are involved. You end up with huge amounts of data, all of which needs to be written quickly, while the machine sits idle.

These problems are well understood. MPP designers have been wrestling with them for almost 20 years now. But any new system will have some kinks and bugs to deal with. I'm sure IBM is working hard to get them solved. They may have it all working already, for all I know.

You're right though, that the performance of the inner loops depends a lot on the application developers.

way to catch up guys. on IBM Sets Supercomputer Speed Record · 2004-09-29 01:35 · Score: 4, Informative

I'll be very interested in seeing how well this thing performs on benchmarks other than linpack.

Blue Gene is a very interesting design in so much as it uses IBM's 32-bit powerpc cores, normally used for embedded applications. They put 2 cores on a die, and integrated a memory controller, as well as the 4 different interconnect networks. The cores are only clocked at about 800mhz, and are thus pretty wimpy individually. However, that can be good. Since the processor cores are quite modest, the ratio of memory bandwidth to CPU flops is quite high. Similarly the ratio of interconnect bandwidth to CPU flops is also very high. Thus the CPUs should run very efficiently on problems that will parallelize to thousands of cpus. Some problems, on the other hand, will perform terribly. I expect a lot of this system's performance depends on the scalability of the system software, and the compilers / libraries.

That said, the earth simulator is also really good at some applications, and not so good at others. Instead of 16,000 small CPUs, it uses 5000 massive vector CPUs. Each is clocked at only 500mhz, but has 8 parallel execution pipes, and about 50GBytes/sec of memory bandwidth. Problems that don't vectorize run through the very modest 500mhz scalar unit.

Earth simulator has realized a large percent of its theoretical peak performance on real world simulations (often up to 50%) while most large systems approach 10%. I'm looking forward to seeing how well utilized Blue Gene is. Earth simulator was a direct descendant from NEC's sx-series supercomputers, which have a 20 year lineage. Blue Gene is a radical departure from IBM's regular HPC product offerings, and uses a new microkernel OS rather than clustered AIX nodes. I imagine there will be some stutter-steps in the early days of this new product, which will undoubtedly work themselves out over time.
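To make the 50% versus 10% comparison concrete, here is a small sketch of sustained versus peak performance. The 40 Tflops peak is a purely illustrative number, not a figure from this comment, and both helper functions are hypothetical.

```python
def sustained_tflops(peak_tflops, efficiency):
    """Performance actually delivered, given the fraction of peak realized."""
    return peak_tflops * efficiency

def peak_needed(target_sustained_tflops, efficiency):
    """Peak rating a machine needs to deliver a target sustained rate."""
    return target_sustained_tflops / efficiency

# Illustrative only: a vector machine with a 40 Tflops peak running at 50% of peak,
# and the peak a 10%-efficient machine would need to match it.
vector_sustained = sustained_tflops(peak_tflops=40.0, efficiency=0.50)
scalar_peak_required = peak_needed(vector_sustained, efficiency=0.10)
print(vector_sustained, scalar_peak_required)
```

At those efficiencies the second machine needs five times the peak rating to do the same work, which is why linpack alone says little.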

Great work IBM.

The low hanging fruit was picked decades ago. on HP Terminates Itanium Workstations · 2004-09-24 08:14 · Score: 2, Insightful

ia-64 is the most dissimilar, but only because everyone else is doing exactly the same stuff. Does the really include any design features not present in some form in ?

x = a->b->c also stumps hardware pre-loading.
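A toy model shows why. The sketch below feeds two address streams to a very simplified next-line prefetcher: a sequential array walk, and a pointer chase whose nodes sit in scattered cache lines. Everything here is an assumption for illustration (64-byte lines, a prefetcher that only ever holds the current line and the one after it, the node layout), and it leaves out the other half of the problem, which is that the address of c is not even known until b has been loaded.

```python
LINE = 64  # assumed cache line size in bytes

def prefetch_hits(addresses):
    """Count accesses that land in the previous line or the line prefetched after it."""
    hits = 0
    last_line = None
    for addr in addresses:
        line = addr // LINE
        if last_line is not None and line in (last_line, last_line + 1):
            hits += 1
        last_line = line
    return hits

N = 256

# Sequential walk over an array of 8-byte elements.
sequential = [i * 8 for i in range(N)]

# Pointer chase: node i lives in slot (i * 37) % N, one node per cache line,
# so consecutive nodes are never in the same or the adjacent line.
chase = [((i * 37) % N) * LINE for i in range(N)]

seq_hits = prefetch_hits(sequential)
chase_hits = prefetch_hits(chase)
print(seq_hits, chase_hits)
```

In this model the array walk misses only on its very first access, while the pointer chase never benefits from the prefetcher at all.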

itanium 2 doesn't do next-line prefetching, but it does read 2 bundles of instructions per cycle. This, depending on the density of those bundles, does everything that a prefetch might do, and more given available execution units.

Your contention is correct that itanium doesn't solve all the problems that face a modern risc architecture. Does that mean that no one should bother trying? Should processor makers churn out the same stuff and wait for moore's law to make things faster? Hope that multi-core cpus will somehow be better utilized than smps?

The simple fact of the matter is that there is a finite speed at which one can execute a serial sequence of instructions. One can try to execute pieces of code in parallel, but there is finite parallelism in most codes. Processors have been fighting for ways to minimize the percent of that parallel code that is mistakenly executed serially, but one is bounded by the actual structure of the code.
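This bound is usually stated as Amdahl's law (my label, the comment above doesn't name it): if a fraction p of the work can run in parallel on n units, the speedup is 1 / ((1 - p) + p / n). A minimal sketch, with the 90% figure chosen only as an example:

```python
def amdahl_speedup(parallel_fraction, units):
    """Speedup when only `parallel_fraction` of the work can use `units` in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)

def speedup_limit(parallel_fraction):
    """Upper bound on speedup as the number of units grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)

p = 0.90  # example value: 90% of the code is parallelizable
for units in (2, 10, 100, 1000):
    print(units, round(amdahl_speedup(p, units), 2))
print("limit", round(speedup_limit(p), 2))
```

However many execution units are added, the serial 10% caps the speedup at about 10x, which is the "bounded by the actual structure of the code" point.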

Loading data and instructions from memory remains an extremely expensive thing to do, and it's only getting worse. Really solving the problem would require some radical design that completely undermines current methods of programming. I applaud intel for being daring, and the end result is not a disaster, it simply fails to live up to the hype. As a replacement to pa-risc, alpha, and mips, I think itanium is a pretty reasonable choice. As a replacement for x86, not so much.

what market on HP Terminates Itanium Workstations · 2004-09-24 06:33 · Score: 1

hp withdrew itanium from their workstation offerings. Does anyone still use workstations? Of course they do, but not very many. SGI, Sun, and IBM still make risc workstations, but they are expensive, and not much better than a high-end PC. It's a market that's more-or-less gone already. Failure in that market doesn't really mean very much.

Re:TFA? on HP Terminates Itanium Workstations · 2004-09-24 06:15 · Score: 1

Uhhhh? Sorta.

amd64 is a very small modification of the x86 architecture. EM64T cloned those modifications, but that's pretty trivial compared to the fact that ATHLON IS A PENTIUM CLONE!

Second point. This is all on the instruction set level. At the implementation level, EM64T prescotts are almost identical to non-EM64T prescotts. Similarly athlon64 is a moderate change from athlon.

basically - who cares? They borrowed a good idea. One they had the legal right to borrow.

Re:Yeah, Itanium tanked... So what? on HP Terminates Itanium Workstations · 2004-09-24 06:05 · Score: 1

the ia64 has a huge die because it has a gigantic cache. It's made to compete with sparc and power5, not with opteron/xeon/powerpc. Sparc, alpha, and mips machines have supported 8MB of cache for years now. The only refinement intel made was bringing the cache onto the cpu die.

Intel hasn't produced a smaller, cheaper itanium for marketing reasons, not because it's impossible.

That said, the low-voltage 1.1ghz itaniums are under a grand. That's xeon-type pricing. The cost is coming down. No you can't have one in your bedroom, but they're doing well relative to IBM pricing.

Re:Yeah, Itanium tanked... So what? on HP Terminates Itanium Workstations · 2004-09-24 05:58 · Score: 1

Absolutely not true. The itanium may be an expensive commercial flop, but it is not a slow processor. For all you kiddies who have read each other's rumors, let me tell you: I've run real world code on itaniums, and they are FAST. I wouldn't recommend buying one, and I'm very concerned about the 3rd party software situation, but call it for what it is.

First of all, both intel and HP have ia64 compilers that put the itanium to very good use. Gcc still doesn't use it very well, but gcc has never worked very well on non-x86 processors.

As for the superiority of technology, the ia64 takes all of the ideas processor designers have been using in risc chips, and takes them one step further. Speculative loads - risc chips try to read ahead in the instruction stream to load data/instructions into L1 or registers, the itanium does it as part of the instruction set. It's used on every load. Predicate registers and speculative execution are the logical continuation of branch prediction. The ia64 instruction set is just five or six small tweaks to the basic risc architecture that everyone else has been using for a decade or more. They are just tweaks at the instruction-set level, rather than the implementation level. The only reason anyone has called it revolutionary, is that it followed x86. If hp had released it, they could have called it pa-risc rev 2.
On the other hand, opteron is just another risc chip (jarc) with an x86 translator on the front end. It does everything one might expect a modern risc chip to do (super-scalar, branch prediction, speculative execution, yadda yadda) and does it all pretty well, and for a great cost. But it's not a revolution. It's just a good product with wide market appeal. How, by the way, is opteron "open"? You couldn't manufacture your own, even if you had the tools to do so.

In my lab we've got opterons, we've got xeons, we've got G5s, and power, mips, alpha and pa-risc.
Itanium is definitely the fastest processor for most tasks. That said, the dual-itanium boxes really rock and cost $15,000. The dual-G5 boxes rock almost as much, and only cost about $4000.

Re:If I may flaunt my ignorance... on Analyst Doubts Intel's Dual-Core Demo · 2004-09-16 06:55 · Score: 3, Informative

We can all shout and scream about how the netburst architecture doesn't work, but that doesn't make it true. 3.6 ghz p4's are FAST. Yes they run hot. Yes they don't get the same ops/mhz that short pipelines do, but they're doing okay. Intel still sells a butt-load of chips, and that's what they're in the business for.

Incidentally, all that "sharing bandwidth" stuff is what the 2 cores on a dual-core opteron will do. It's also what ALL the 2-cpu xeons in the world are doing. Again, not the greatest plan ever, but it works well enough today. The shared bus on a dual-core prescott is no different from the shared bus on a dual-chip xeon today.

Intel is in trouble in that they might go from 93% market share to 85%. Look at the market today. Ultrasparc 4's are slower than Itanium, yet ia64 still isn't making real money. The G5 is a really fast processor, but apple has about 2% of the desktop market. Being the fastest processor THIS MONTH doesn't mean the world is going to come knocking. Being close and having a good marketing campaign is more important.

Re:Fortran? on Supercomputers Race to Predict Storms · 2004-09-16 04:24 · Score: 2, Interesting

Much fortran code is written by scientists rather than programmers. HOWEVER, big hpc labs like this employ dozens of programmers (most with master's degrees or PhDs) to write the software for the scientists. I don't know about the navy labs in particular, but big DOD and DOE labs have optimization teams with several dozen programmers. These are the sort of people who present at the SC conference.

Even academic hpc facilities employ teams of experts to optimize code for the scientists.

You're right that there is a lot of inertia keeping people in the fortran fold. But the software vendors are also helping this by having really strong fortran libraries.

Re:Fortran? Eyew. on Supercomputers Race to Predict Storms · 2004-09-16 04:02 · Score: 3, Informative

Fortran is still the dominant language for programming high performance code. I'd still rather use C, but it's not really that different. When you're trying to optimize a piece of software for a machine architecture you need to use a language that is pretty low-level. The closer to assembly you are, the greater chance you have to best exploit the functionality of the hardware. C++/Java are right out.

Re:Using Fortran, eh? on Supercomputers Race to Predict Storms · 2004-09-16 03:52 · Score: 1

Not bad. Anyone who knows 'C' can learn fortran in about a day. It's a pure imperative language, plain and simple.

What's more difficult is continually optimizing for the various machine architectures. The processor clocks are generally improving faster than the memory latency or network latency. So mitigating those is becoming a much bigger part of the puzzle.

Re:Why use Linux then? on Solaris 10 to be Open Source · 2004-09-14 04:30 · Score: 1

choice of

3 video cards, 4 network cards, 2 scsi cards, 1 sound card. Or something like that.

Comments · 368