North America's Fastest Linux Cluster Constructed
SeanAhern writes "LinuxWorld reports that 'A Linux cluster deployed at Lawrence Livermore National Laboratory and codenamed 'Thunder' yesterday delivered 19.94 teraflops of sustained performance, making it the most powerful computer in North America - and the second fastest on Earth.'" Thunder sports 4,096 Itanium 2 processors in 1,024 nodes, some big iron by any standard.
Pay taxes? Then you are the guy who gets the bill.
That said, I think our national labs are pretty great when they aren't designing nukes.
If Google's cluster is interconnected via Ethernet, there is a whole range of computational problems it can't tackle. If you want to simulate a spatial phenomenon with a lot of things going back and forth in a volume, you're bound to have a _lot_ of communication. The cost of the interconnect in those simulation systems is often a substantial proportion of the total cost of the installation.
Depending on budget, price (I wouldn't be surprised if Intel cut them a sweet deal to get this cluster publicized to help out their product's sales), and other factors, the Itanium could have been a good choice.
Especially if they were using software that had been designed for the Itanium (say they were replacing an older cluster), they wouldn't have to port the software, which would have saved real money.
I'm not a fan of Intel lately, but the Itanium isn't overpriced garbage no matter what. That smacks of fanboyism. Interesting you didn't add G5s to your list, BTW.
ALSO: Don't forget that the Itanium 2 was DESIGNED FOR big iron, while the Opteron was designed for servers and small iron. They can be used in other ways (you could run a web site off an Itanium 2), but the Itanium was designed for these kinds of applications.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
If I calculate right, they are claiming an Rmax of 19.94 teraflops with 4096 processors.
The Virginia Tech cluster for Apple had an Rmax of 10.28 teraflops with 2200 processors.
So, the Itanium 2 delivered about 4.87 gigaflops per processor, and the G5 delivered about 4.67 gigaflops per processor.
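For anyone who wants to check the arithmetic, here is a quick sketch that just divides the quoted Rmax figures by the processor counts. The function name is mine, and it assumes the Rmax and processor numbers quoted above are the right ones to compare.

```python
# Per-processor Linpack throughput from the Rmax figures quoted in this comment.

def gflops_per_processor(rmax_teraflops, processors):
    """Convert a cluster-wide Rmax (in teraflops) to gigaflops per processor."""
    return rmax_teraflops * 1000.0 / processors

thunder = gflops_per_processor(19.94, 4096)  # Itanium 2 cluster at LLNL
vt_g5 = gflops_per_processor(10.28, 2200)    # Virginia Tech G5 cluster

print(f"Thunder: {thunder:.2f} GF/proc")
print(f"VT G5:   {vt_g5:.2f} GF/proc")
print(f"Ratio:   {thunder / vt_g5:.3f}")
```

The per-chip gap comes out to only a few percent, which is the point being made here.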
This seems like a pretty poor showing for Itanium 2, overall. It's a much hotter chip than the Opteron or the G5, so cooling and power costs are likely much higher than for a comparable Apple cluster. The Xserve G5 is also likely cheaper than a similarly equipped Itanium 2 server, given that the Itanium 2 is $1,398 per chip on Pricewatch, and a dual-processor Xserve G5 cluster node is $2,999 list. Even with 4 CPUs in a single box, I think the Itanium 2 server would easily top $6,000.
But anyway, good game to Lawrence Livermore. I'll be curious to see if Apple has another volley to fire before the top500 list closes for this round.
Powerful = fastest computation, not biggest. A roomful of Chevettes does not make a Corvette.
If I was going to move my stuff to the other side of the country, how many trips would that take me in a Corvette?
I grew up in Livermore; the lab was some 500 yards from my bedroom window. They work on a lot more than nuke simulations, including alternative fuels (my brother-in-law was driving a hydrogen fuel car from the lab 10 years ago as a test), laser technology and about a million other things. Why is it that people like you hear "Nuke" and rant on and on like biased little children and post inflammatory things like this?
The lab is a GOOD thing, dammit. Do you even know what nukes are? What nuclear research has done for us? Grow up, man.
Yes, they're hot as hell and eat power the way Oprah eats Twinkies, and yes, Intel has handled the Itanium line poorly, but the Itanium architecture is very interesting, and is actually very appropriate for an HPC environment. Not the part of the HPC market that clusters dominate, but the segment that Cray, SGI, HP AlphaServers, etc. have traditionally dominated. The segment that doesn't give a shit about cooling, power consumption, or price-performance, but just needs to get the job done as quickly as possible.
Some of the coolest features of the Itanium are also some of the reasons why a lot of people don't want to use it. The EPIC ISA, for example. It was designed ( along w/ the physical hardware ) to expose a lot of the internal workings of the processor to the user. But rather than recompile and re-optimize their code, people would rather bitch about migration. That's fine for workstations and servers, but in an HPC environment, you want the nifty features, you want to occasionally hand-tune code segments in assembler, etc.
Anyways, I'm not a fanboy ( well, maybe an AMD and MIPS fanboy ), just wanted to get in a few honest points before everyone started shooting holes in the Itanic.
The NSA, on the other hand... I would guess that they have the most powerful cluster of machines in the world for breaking encryption. Though perhaps not as powerful as the article's supercomputer for other tasks.
Plus there are undoubtedly several other highly classified supercomputers designed to chew on other problems.
So it would seem that you'd have to caveat any claim regarding the "fastest computer" by saying it's the fastest known, non-secret computer. But then the headline loses some of its appeal.
"Big Iron" is a very vague term - server benchmarks behave very differently than scientific computation as far as performance is concerned; if you don't believe me I can easily point you to a couple of research papers analyzing them.
The humongous on-die caches make the Itanium perform well on servers, and definitely not the instruction-set architecture. So "WAS DESIGNED FOR" is only 50% true.
Intel provides excellent Linux support for Itanium. Also, if you use the Intel compiler, which Lawrence Livermore does, you get a considerable speed boost on Intel CPUs.
See: http://www.llnl.gov/linux/linux_basics.html#compilers
Intel can afford to provide little niceties like this. Can AMD? I doubt it.
Ed Note: Unless the author wishes to narrow his/her audience to a small subset of Slashdot users, standard formatting and non-cutesy sentence case is always appropriate.
There are basically three types of clusters:
Shared Nothing: Each computer is connected to the others only via a simple IP network; no disks are shared, and each machine serves part of the data. These clusters don't work reliably when you have to do aggregations. For example, if the data is spread across machines, one of the machines fails, and you try to run "avg()", the query fails because that machine's share of the data is not available. Most enterprise apps cannot work in this config without degradation. For example, an IBM study showed that a 2-node cluster is slower and less reliable than a 1-node system when running SAP. IBM (on Windows and Unix) and MS use this type of clustering (also called the federated database approach or shared-nothing approach).
Shared Disk Between Two Computers: There are multiple machines and multiple disks, and each disk is connected to at least two computers. If one of the computers fails, the other takes over. No mainstream database uses this mode, but it is used by HP NonStop. Still, each machine serves up part of the data, and hence standard enterprise apps like SAP cannot take advantage of clustering without a lot of modification.
Shared Everything: Each disk is connected to all the machines in the cluster. Any number of machines can fail and the system keeps running as long as at least one machine is up. This is used by Oracle. All the machines see all the data. Standard apps like SAP can run in this kind of config with minor modification or none at all. This method is also used by IBM in their mainframe database (which outsells their Windows and Unix database by a huge margin).
Most enterprise apps are deployed in this last type of cluster configuration. Approach 1 is simpler from a hardware point of view. Also, for database kernel writers, it is the easiest to implement. However, the user needs to break up the data judiciously and spread it across machines. Also, adding or removing a node requires re-partitioning the data. Mostly, only custom apps that are fully aware of your partitioning will be able to take advantage of it.
It is also easy to make approach 1 scale for a simple custom app, so most TPC-C benchmarks are published in this configuration. Approach 3 requires a special shared disk system. The database implementation is very complex: the kernel writers have to worry about two computers simultaneously accessing disks or overwriting each other's data. This is what Oracle is pushing across all platforms and IBM is pushing for its mainframes. Approach 2 is similar to approach 1 except that it adds redundancy and hence is more reliable.
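To make the shared-nothing "avg()" problem concrete, here is a toy sketch. It is not how any real database does it; the partition layout, the `NodeUnavailable` exception, and `cluster_avg` are all made up for illustration. Each node owns one slice of the rows, and a node that is down is represented as `None`.

```python
# Toy model of an aggregate over a shared-nothing cluster.
# Each entry in `partitions` is the list of values owned by one node,
# or None if that node is down (hypothetical representation).

class NodeUnavailable(Exception):
    """Raised when a partition's owning node cannot be reached."""

def cluster_avg(partitions):
    """Average over all rows; fails if any node's slice is missing."""
    total = 0.0
    count = 0
    for node_id, rows in enumerate(partitions):
        if rows is None:
            # No other node holds a copy of this slice, so the
            # aggregate cannot be computed correctly.
            raise NodeUnavailable(f"node {node_id} is down")
        total += sum(rows)
        count += len(rows)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count

healthy = [[10, 20], [30, 40], [50]]
print(cluster_avg(healthy))

degraded = [[10, 20], None, [50]]
try:
    cluster_avg(degraded)
except NodeUnavailable as err:
    print("query failed:", err)
```

In a shared-everything setup the surviving machines could still read every disk, so the same query would go through as long as one node is up.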
So what type are we talking about here?
Yes, and even before the Earth Simulator was built the US was falling behind on environment and climate research, to the point of being behind even some smaller European nations.
Even so, the US preaches to the rest of the world how they should do and think. I think you have reached the state of "being so dumb you don't even KNOW that you are dumb".
You make good points throughout your reply, but if you're clustering--the idea of buying the fastest available just doesn't make sense, unless underlying it really is that much faster in even a cluster environment?
/shrug
That sentence doesn't even parse, but anyhow: single-thread performance still matters to clusters. There is a limit to how much you can effectively parallelize many problems. If that limit is 1, then you need a Cray or something. If the limit is extremely high, you can use distributed.net, or a cluster of recycled C64s.
In the middle, you might be able to parallelize the task to a limited extent. If you can only split your work into 500 parallel tasks, then you want 500 of the fastest processors you can get. For many applications, that means 500 Itaniums. Even if you could buy 800 Opterons for the money, they might not be as fast.
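A toy model of that argument, for what it's worth. The relative speeds below are invented for illustration and are not benchmark results; the only assumption that matters is that the job splits into at most 500 parallel tasks and that extra processors beyond that sit idle.

```python
# Toy model: a job that can be split into at most `max_parallel_tasks`
# pieces. Processors beyond that limit are simply unused.

def run_time(work_units, processors, max_parallel_tasks, speed_per_processor):
    """Wall-clock time for perfectly divisible work up to the task limit."""
    usable = min(processors, max_parallel_tasks)
    return work_units / (usable * speed_per_processor)

WORK = 1000.0      # arbitrary units of work
MAX_TASKS = 500    # parallelism limit from the comment above

# Hypothetical relative speeds: fast chip = 1.0, cheaper chip = 0.8.
fast = run_time(WORK, 500, MAX_TASKS, 1.0)
slow = run_time(WORK, 800, MAX_TASKS, 0.8)

print(f"500 fast processors:    {fast:.2f}")
print(f"800 cheaper processors: {slow:.2f}")
```

With the 500-task limit, 300 of the cheaper processors never get any work, so the smaller cluster of faster chips finishes first. Lift the limit and the answer flips, which is why it depends on the application.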
Only other option would be they thought Intel would hold up better/be more stable.
Itanium has slightly better manageability; you can find out when a memory module or CPU is likely to fail, for example. There is a heap of error detection/correction in the CPU, far beyond Xeon or Opteron afaik. If you have hundreds of machines, being able to easily detect failures is worth something.
(Or you can just take the Google route and let it fail and replace the whole box. But that really requires your whole application to be written to accommodate it.)
I don't know if I would call NUMALink a true shared memory system. It is NUMA after all! I was thinking of 32-64 way machines with a true shared memory system, or large vector machines based on SX-6 processors for example.
But look at NUMALink4, it's got 6.4 GB/sec per link bandwidth and 240ns latency.
QsNetII is just under 1 GB/sec bandwidth, the limit of PCI-X, with a latency of 3us.
So, NUMALink4 has 6.4 times the bandwidth and 12.5 times lower latency than QsNetII. That's a much larger performance difference than Opteron vs Itanium!
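The ratios, spelled out. This sketch only uses the figures quoted in this comment and rounds QsNetII's "just under 1 GB/sec" up to 1.0, so the bandwidth ratio is if anything slightly understated.

```python
# Interconnect comparison using only the figures quoted above.

NUMALINK4_BW_GBS = 6.4     # GB/sec per link
NUMALINK4_LAT_NS = 240.0   # nanoseconds

QSNET2_BW_GBS = 1.0        # "just under 1 GB/sec", rounded up (assumption)
QSNET2_LAT_NS = 3000.0     # 3 microseconds, in nanoseconds

bandwidth_ratio = NUMALINK4_BW_GBS / QSNET2_BW_GBS
latency_ratio = QSNET2_LAT_NS / NUMALINK4_LAT_NS

print(f"bandwidth: {bandwidth_ratio:.1f}x")
print(f"latency:   {latency_ratio:.1f}x lower")
```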
This is the official Top500 list of supercomputers (not updated yet, although Thunder is mentioned as '*possibly* the second-most powerful computing machine on the planet'). Linux moving up to second place (from fifth a bit ago, iirc), woohoo! Only one left to beat!
the problem is not that you couldn't get the processors, the problem is scale.
A system like this will use a high-speed interconnect, not GigE. The popular choice right now is InfiniBand, and that stuff isn't cheap, and also has limits on the number of ports per IB switch. The system at LLNL has 4 procs per node, which reduces the number of IB switches involved. 5000 procs in dual-proc machines (you suggest the 248) would require 2500 IB ports, instead of 1024.
now if you considered the Opteron 848 ($1300), in 8-proc nodes, that would be something to think about: cut the number of IB ports in half, and be able to double the processors.
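The port counts are easy to sanity-check. This sketch assumes one interconnect port per node, which is my simplification, and the function name is made up.

```python
import math

# Ports needed if every node gets exactly one interconnect port
# (an assumption; real fabrics may use more than one per node).

def interconnect_ports(processors, procs_per_node):
    """Number of nodes, and therefore ports, for a given processor count."""
    return math.ceil(processors / procs_per_node)

print(interconnect_ports(4096, 4))    # LLNL layout: 4 procs per node
print(interconnect_ports(5000, 2))    # dual-proc machines
print(interconnect_ports(10000, 8))   # 8-proc nodes, twice the processors
```

With 8-proc nodes, twice as many processors as the dual-proc proposal still needs only half its ports.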
the other consideration is also processor scale. the 27% per CPU is significant, because even with dual-proc SMP, you lose some % of the CPU time. There was a posting on an article about how processors scale this way. I forget how the principle works.