More on Virginia Tech G5 Cluster: 17.6 Tflops

Posted by CmdrTaco on Sunday October 12, 2003 @03:03AM from the thats-a-lotta-flops dept.

daveschroeder writes "BBC World's Click Online has a video report (with text transcript) on Virginia Tech's new 1100-node dual 2.0 GHz G5 Terascale Cluster. The report quotes the performance as 17.6 Tflops. As a point of reference, the cluster would be number 2 on the most recent June Top 500 list, behind only Japan's Earth Simulator, and considerably more than doubling the performance of the current number 3 1152-node dual 2.4 GHz Xeon MCR Linux cluster. Assuming the performance figure accurately reflects the LINPACK score (which it should; since the deadline for submissions for the upcoming list of Oct 1 has already passed, one would imagine VT would quote that figure), and depending on new entries for November's upcoming list, the cluster should almost certainly rank in the top 5 - all for only US$5.2 million. The video report is available in Windows Media 9 and Real formats; the relevant portion starts at 13:00."

24 of 390 comments

Better links for Windows Media by Mwongozi · 2003-10-12 03:07 · Score: 4, Informative

You can watch just the report itself, no skipping required, by following the links on this page:
http://www.bbcworld.com/content/template_clickonline.asp?pageid=666&co_pageid=3
Heist by DarkHazard · 2003-10-12 03:08 · Score: 2, Funny

Surely they only need 1099 G5s.
Yes, but, by Anonymous Coward · 2003-10-12 03:11 · Score: 2, Funny

can it defeat an iMac in Apple's Photoshop benchmarks ;-)?
Re:Can the results be trusted? by sakusha · 2003-10-12 03:13 · Score: 5, Informative

They have previously discussed this: they use error correction algorithms, so no ECC RAM is necessary.
Twice as fast...? by suwain_2 · 2003-10-12 03:14 · Score: 2, Interesting

...considerably more than doubling the performance of the current number 3 1152-node dual 2.4 GHz Xeon MCR Linux cluster.

If I understand this correctly, it's saying that a G5 is more than twice as fast as a dual 2.4 GHz Xeon? (1152 dual 2.4 GHz Xeons vs 1100 dual 2.0 GHz G5s -- there are fewer G5s and they run at a slower clock speed.)

This is a pretty staggering statistic. I hadn't really believed the hype about how fast the new G5s were.

--
________________________________________________
suwain_2 :: quality slashdot p
1. Re:Twice as fast...? by adam872 · 2003-10-12 03:36 · Score: 2, Informative
  
  The two clusters are different enough that making accurate comparisons is difficult. The new G5s have a more recent PCI architecture, they use InfiniBand as the interconnect, and it's possible that they made use of AltiVec (though I hear that this may not be the case because of 32-bit limitations). I believe none of these apply to the Xeons. In high-speed computing, the interconnect is vital, so that alone may push this cluster ahead for the time being. I don't doubt that the individual G5 processors are bloody quick (and as a Mac user and fan, I'm kinda glad) though.
2. Re:Twice as fast...? by TheRaven64 · 2003-10-12 04:24 · Score: 2, Funny
  
  I'd still take the G4 over the P4...
  I'm not so sure. It's almost winter, and this house can get pretty cold...
  
  --
  I am TheRaven on Soylent News
3. Re:Twice as fast...? by mangu · 2003-10-12 05:13 · Score: 2, Informative
  
  17.6 Tflops across 2200 processors works out to 8 Gflops/processor. I don't know about the Xeon, but I have benchmarked my own 2.4 GHz Pentium 4 at 6 Gflops, multiplying two 1000x1000 random matrices using LAPACK. So, yes, 8 Gflops at 2.0 GHz is faster than 6 Gflops at 2.4 GHz, but only slightly. Also, there is the overhead in the matrix multiplication. The peak performance of the 2.4 GHz P4 would be 9.6 Gflops, so one can say there's no magic other than Apple marketing in the G5. The difference in performance between those two clusters probably comes from factors other than processor power.
4. Re:Twice as fast...? by Blondie-Wan · 2003-10-12 06:15 · Score: 2, Insightful
  
  You obviously haven't been paying attention. The Mac world stopped crowing about the G4's superiority over other systems once it became clear it no longer was superior (if it had been to begin with), i.e., as the MHz gap got wider and wider. Prior to the G5 intro at WWDC, Steve hadn't run one of his keynote Photoshop bake-offs in a while; even he wasn't claiming dual ~1 GHz G4 systems were faster than what the Wintel world could offer. Many people pointed out (correctly) that there was still a MHz myth and that the Macs weren't as much slower than the Wintel systems as the numbers might imply, but on the whole, most Mac people were certainly conceding at the beginning of this year that the fastest PCs were faster than the fastest Macs.
5. Re:Twice as fast...? by Halo- · 2003-10-12 06:17 · Score: 2, Informative
  
  Yeah, it might be the same argument, but either way it is fairly pointless. This is why benchmarking is such a controversial subject. Do you measure pure operations per second? If so, which ops? Or do you measure the actual wall-clock time it takes for real-world programs to execute a set of common tasks? Again, which programs and which tasks? (Which doesn't even begin to get into the all-too-real problem of vendors adding hacks to screw with the benchmark, à la the video card arms race....)
  
  I don't know jack about Macs, but it wouldn't surprise me if some marketing drone had claimed that. They might have even had a few examples to back it up.
  
  Ultimately, "speed" isn't really a singularly quantifiable entity. I can think of at least three different ways to measure speed:
  
  1) Pure CPU operations per second. Good for hard math, but only part of the equation
  
  2) Hardware speed across the entire system. If the CPU screams, but the memory subsystem drags ass, the usage speed is slower. I doubt VT's cluster would be very cool if it was using 9600 baud modems for interconnects. :)
  
  3) Perceived speed. How fast does it feel like it runs? For example the Transmeta chips which "learn" and optimize will feel slower or faster depending on if the code has been run recently, but the hardware speed hasn't changed.
New top-500 list will be announced around Nov 18 by slyfox · 2003-10-12 03:46 · Score: 3, Informative

The new "top 500" list will be announced right before SC2003 and discussed in detail at a session of SC2003 on November 18.

Look for another (less speculative) story on Slashdot around then.
Re:It runs MacOS X !!! by paulthomas · 2003-10-12 03:56 · Score: 3, Informative

Ugh.

It has been said thousands of times by now I'm sure.

Running Mac OS X does not mean running FreeBSD. Mac OS X is a system of frameworks running on top of a Mach kernel. The only thing that relates Mac OS X to FreeBSD is the userland. In addition to the userland you have: Cocoa, Carbon, Aqua, Java, etc. The FreeBSD portion is minimal.

And yes, if you want you can run this lower level unix without the rest of Mac OS X. It is called Darwin. It runs on Intel and PPC if you're wondering. No, this doesn't mean that Mac OS X runs on both or ever will.

Here is a short description of the BSD families.
Re:Can the results be trusted? by Usquebaugh · 2003-10-12 04:11 · Score: 4, Insightful

Yep, and they're going to be top 5. Between you and them I wonder who has the best knowledge of how to build a cost efficient cluster?
Project leader speaking at conference Oct 28 by daveschroeder · 2003-10-12 04:20 · Score: 4, Informative

The project leader, Dr. Srinidhi Varadarajan, will be speaking at a session entitled Building Virginia Tech's G5 Supercluster on Oct 28 at the upcoming O'Reilly Mac OS X conference.

He'll probably reveal some of the technical details, such as the version of Mac OS X used, at that session.

Also, according to a blog at O'Reilly:

Next year, all the little known details [about the cluster] will be revealed in a new book. By that time we'll know what the project means for supercomputing and for Apple.
Re:But it doesnt add up...? by Aardpig · 2003-10-12 04:37 · Score: 2, Informative

Would anyone care to shed some light onto this?

I can shed light to this extent: a linear scaling between processors and processing power is only realized in the most idealized of situations (those known as 'embarrassingly parallel'), where each job is small and completely independent of other jobs. The funny thing about embarrassingly parallel tasks is that they do not need a fancy parallel computer; they can just as easily be accomplished on N separate 486 machines, if N is sufficiently large.

The upshot? If they claim a purely-linear scaling, they are either lying, or only considering those jobs for which one can get by on a (large) Beowulf cluster of shit machines. My head is not turned by this news...
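One common way to put numbers on the scaling point is Amdahl's law. The comment does not name it, so treat this as an illustrative sketch only: the function name `amdahl_speedup` and the serial fractions below are assumptions, not measurements from the VT cluster.

```python
def amdahl_speedup(n_procs: int, serial_fraction: float) -> float:
    """Ideal speedup on n_procs processors under Amdahl's law.

    serial_fraction is the share of the work that cannot be parallelized.
    A value of 0.0 is the 'embarrassingly parallel' case: linear scaling.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


if __name__ == "__main__":
    # Hypothetical serial fractions, chosen only to show the trend.
    for s in (0.0, 0.001, 0.01):
        speedup = amdahl_speedup(2200, s)
        print(f"serial fraction {s}: speedup on 2200 CPUs = {speedup:.1f}")
```

Even a 1% serial share caps the speedup below 100 no matter how many processors are added, which is why a linear claim only holds for fully independent jobs.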

--
Tubal-Cain smokes the white owl.
Re:Can the results be trusted? by Waffle+Iron · 2003-10-12 04:38 · Score: 4, Interesting

So they bog down the software doing something that could be done in hardware?
Just because it's in hardware doesn't mean it's free. The ECC logic is going to add a small delay to each of trillions of memory accesses. Plain memory can most likely be tuned to run faster than ECC memory.
If you're running a constrained problem and can verify the results at the end, a single error check in software could consume far less overall time than the continuous ECC hardware checks. The software check would probably catch other types of errors as well (including many errors caused by software bugs).
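As a rough illustration of "a single error check in software" at the end of a run, here is a sketch using Freivalds' algorithm to verify a matrix product. This is a swapped-in textbook technique, not something VT is known to use; the helper names are hypothetical, and integer matrices are assumed so the comparison is exact (floating-point data would need a tolerance instead).

```python
import random


def matmul(a, b):
    """Naive O(n^3) product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(m, v):
    """Matrix-vector product, O(n^2)."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def freivalds_check(a, b, c, rounds=20, rng=None):
    """Probabilistically verify that a @ b == c (Freivalds' algorithm).

    Each round compares a @ (b @ r) with c @ r for a random 0/1 vector r,
    which costs O(n^2) rather than recomputing the O(n^3) product.
    A wrong c passes a single round with probability at most 1/2.
    """
    rng = rng or random.Random()
    n_cols = len(c[0])
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False  # definitely wrong
    return True  # correct with high probability


if __name__ == "__main__":
    rng = random.Random(2003)
    a = [[rng.randint(-9, 9) for _ in range(6)] for _ in range(6)]
    b = [[rng.randint(-9, 9) for _ in range(6)] for _ in range(6)]
    c = matmul(a, b)
    print("clean result passes:", freivalds_check(a, b, c))
    c[2][4] += 1  # simulate a single flipped value
    print("corrupted result passes:", freivalds_check(a, b, c))
```

The check is cheap relative to the computation it guards, and it flags wrong answers regardless of whether a memory error or a software bug caused them.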
How does it not add up? by daveschroeder · 2003-10-12 04:40 · Score: 3, Interesting

Easy: you yourself point out that 1100 * 15.7 = 17.27 ... not 17.6.

Since the call for papers for the new Top 500 list was Oct 1, and the BBC show aired on Oct 9 with a companion BBC News story dated Oct 12, you'd hope that VT was simply regurgitating the figure that has already been sent to the Top 500 organization.

And why are you trolling around with one of those super-old benchmarking stories? We've already established that every manufacturer does what they can to show their products in the best possible light. At least Apple documented their test results and methods in full.

So actually, your logic doesn't make any sense: you jump to the conclusion that these aren't real results - even though real results already exist and have been submitted, and the entire story is pretty much about that process, making performance figures a critical piece to get accurate - and that they must have just multiplied some benchmark number by 1100. Then, even though the subject of your own post indicates your recognition that "it doesn't add up", you still apparently assume that the results are somehow doctored, this time for the worse, and you manage to weave in one of the stories that tries to make it look like Apple lied with its benchmarks - which it didn't - which is unrelated to the current issue! How does it "assume" the original scores were accurate?? YOU are assuming that they're just multiplying. You might have been onto something if the multiplication actually came out, but it doesn't, meaning that is NOT what they did.

Bravo, +1 Troll.
BJs for Geeks by bstadil · 2003-10-12 05:11 · Score: 4, Interesting

My favorite comment:
In fact the heat is so intense that ordinary air conditioning units would have resulted in 60 mph winds

--
Help fight continental drift.
Interesting math by lexcyber · 2003-10-12 05:11 · Score: 2, Interesting

The VT cluster cost about $5.2 million and gets approx. 17 TFlops. The NEC Earth Simulator gets 35 TFlops and cost one billion dollars. That makes it 192 times more expensive. So you could build 192 VT clusters and then, in theory, get 3.2 PFlops for the same amount of money. If you subtract performance for cable length etc., you will most definitely get around 1 PFlops.

So, you supercomputer users out there - build a 1 PFlops cluster NOW!
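The back-of-the-envelope figures above can be reproduced with a few lines. This is only a sketch of the comment's own arithmetic: the costs and TFlops values are the round numbers quoted in the post, the function names are made up for illustration, and perfect scaling across 192 clusters is assumed (which the post itself doubts).

```python
# Round figures quoted in the comment, not audited numbers.
VT_COST_USD = 5.2e6
VT_TFLOPS = 17.0
ES_COST_USD = 1.0e9
ES_TFLOPS = 35.0


def clusters_for_budget(budget_usd: float, cluster_cost_usd: float) -> int:
    """How many whole clusters a budget buys."""
    return int(budget_usd // cluster_cost_usd)


def usd_per_gflops(cost_usd: float, tflops: float) -> float:
    """Cost per Gflops of quoted performance."""
    return cost_usd / (tflops * 1000.0)


if __name__ == "__main__":
    n = clusters_for_budget(ES_COST_USD, VT_COST_USD)
    total_pflops = n * VT_TFLOPS / 1000.0  # assumes ideal linear scaling
    print(f"VT clusters per Earth Simulator budget: {n}")
    print(f"Theoretical aggregate: {total_pflops:.2f} PFlops")
    print(f"VT: ${usd_per_gflops(VT_COST_USD, VT_TFLOPS):,.0f} per Gflops")
    print(f"ES: ${usd_per_gflops(ES_COST_USD, ES_TFLOPS):,.0f} per Gflops")
```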

--
- To understand recursion, we must first understand recursion -
numbers ok....reading is wrong... by djupedal · 2003-10-12 05:15 · Score: 4, Informative

The 'project' uses the same amount of electricity as 3,000 average sized homes. There are many more devices deployed than just the 1100 G5s. The cooling system alone is a major power eater. Read the articles :)
17.6 TFLOPS is Rpeak, not Rmax! by DeeKay · 2003-10-12 06:15 · Score: 3, Insightful

Simple equation: 4 FPops/cycle (IBM-PPCs) * 2GHz * 1100 G5s * 2 CPUs/G5 = 17.6 TFLOPS!

No *real* Rmax Linpack scores are known yet, and from what I figured the submissions on Oct 1st are just for *inclusion* in the list; real Linpack scores can be submitted until shortly before (or even on!) the conference in mid-November.

This article is BS and should be removed...

P.S.: 4 FPops per clock cycle with 2 FPUs, I hear you scream. Impossible! That's due to the multiply/add (FMAC) thing that counts as 2 FPops!
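The Rpeak equation in this comment is easy to check. The sketch below just multiplies out the quoted factors; the function and parameter names are illustrative, and the 2 FPUs x 2 FPops per fused multiply-add breakdown is taken from the P.S. above.

```python
def rpeak_tflops(nodes: int, cpus_per_node: int,
                 clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak (Rpeak) in Tflops: every FPU busy on every cycle."""
    gflops = nodes * cpus_per_node * clock_ghz * flops_per_cycle
    return gflops / 1000.0


if __name__ == "__main__":
    fpus_per_cpu = 2
    flops_per_fma = 2  # a fused multiply-add counts as two operations
    flops_per_cycle = fpus_per_cpu * flops_per_fma

    peak = rpeak_tflops(nodes=1100, cpus_per_node=2,
                        clock_ghz=2.0, flops_per_cycle=flops_per_cycle)
    per_cpu_gflops = 2.0 * flops_per_cycle
    print(f"Rpeak: {peak:.1f} Tflops")
    print(f"Per-CPU peak: {per_cpu_gflops:.1f} Gflops")
```

The product matches the 17.6 Tflops quoted in the story exactly, and the per-CPU peak is the same 8 Gflops derived elsewhere in the thread, which supports the reading that the figure is a theoretical peak rather than a measured Linpack Rmax.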
MOD PARENT UP by Mark+of+THE+CITY · 2003-10-12 06:37 · Score: 2, Interesting

I'm glad someone out there thinks these things through...

BTW, an acquaintance told me of her ILLIAC IV days. With 64 independent processors it was the fastest pre-Cray machine, but sometimes did produce wrong values. Standard practice was to have 3 processors run the same problem and compare the results at the end, deciding that the 3x performance hit was worth it, if the results actually meant something...

--
The clearance system sounds logical. It is not. It is completely arbitrary. -- John Bolton
Re:MOD PARENT UP by sakusha · 2003-10-12 07:11 · Score: 2, Interesting

Yep, that's an accepted practice in mission-critical real-time systems. I recall reading about the IBM computers used in the Space Shuttle: they have triple redundancy, all 3 computers operate in parallel, and they "vote" on all results. If one computer doesn't agree with the other two, it is outvoted. Of course this is an extreme oversimplification of the software design, but you get the idea.
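The voting idea can be sketched in a few lines. This is a toy majority vote under stated assumptions, not the Shuttle's or ILLIAC IV's actual design: the function name `vote` is hypothetical, results are assumed to be exactly comparable (hashable) values, and a run with no strict majority is treated as an error.

```python
from collections import Counter


def vote(results):
    """Majority vote over redundant results.

    Returns (winner, dissenters), where dissenters lists the indices of
    the units that were outvoted. Raises ValueError if no value is
    reported by more than half of the units.
    """
    if not results:
        raise ValueError("no results to vote on")
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among results")
    dissenters = [i for i, r in enumerate(results) if r != winner]
    return winner, dissenters


if __name__ == "__main__":
    # Three redundant units ran the same job; unit 2 returned a bad value.
    value, outvoted = vote([42, 42, 41])
    print("accepted value:", value)
    print("outvoted units:", outvoted)
```

With three units this tolerates one faulty result per vote, at the 3x compute cost mentioned in the parent comment.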
Re:Can the results be trusted? by afidel · 2003-10-12 09:52 · Score: 4, Informative

No, ECC RAM typically is just made with faster internals. As an example, most commodity ECC RAM is CAS2 latency whereas most generic RAM is CAS3, so the ECC RAM will perform exactly the same as the non-ECC RAM. You can buy CAS2 non-ECC RAM but it's nearly as expensive as the ECC RAM. If you have a simple idiot check at the end of a complex calculation then saving the cost of going with ECC may be worth it, but most clusters this large will be used on too many different projects to assume that all of them will have such checks. For an idea of how important ECC is, read this IBM whitepaper on their chipkill ECC scheme (http://www.ibm.com/servers/eserver/pseries/campaigns/chipkill.pdf). Even normal SEC ECC RAM (what most ECC RAM is today) will have approximately 900 failures per 10TB per three years. I think that IBM is right and that eventually all RAM will be RAID-M, that is, a RAID5-style array of redundant memory banks that are composed of ECC banks. At future densities this will be necessary because a single high-energy particle will have the ability to scramble an entire memory word including its ECC checking bits.

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.