Big Mac Benchmark Drops to 7.4 TFlops
coolmacdude writes "Well it seems that the early estimates were a bit overzealous. According to preliminary test results (in postscript format) on the full range of CPUs at Virginia Tech, the Rmax score on Linpack comes in at around 7.4 TFlops. This puts it at number four on the Top 500 List. It also represents an efficiency of about 44 percent, down from the previous result of 80 achieved on a subset of the computers. Perhaps in light of this, apparently VT is now planning to devote an additional two months to improve the stability and efficiency of the system before any research can begin. While these numbers will no doubt come as a disappointment for Mac zealots who wanted to blow away all the Intel machines, it should still be noted that this is the best price/performance ratio ever achieved on a supercomputer. In addition, the project was successful at meeting VT's goal of developing an inexpensive top 5 machine. The results have also been posted at Ars Technica's openforum."
I've always been sort of intrigued by
Quod scripsi, scripsi.
Way to go /. -- updated the logo from G4 to G5 just in time.
--
First, from an Oct 22 New York Times story:
Officials at the school said that they were still finalizing their results and that the final speed number might be significantly higher.
This will likely be the case.
Second, they're only 0.224 Tflops away from the only Intel-based cluster above it. So saying "all the Intel machines" in the story is kind of inaccurate, as if there are all kinds of Intel-based clusters that will still be faster; there is only one Intel-based cluster above it, and with only preliminary numbers for the Virginia Tech cluster at that.
Third, this figure is with around 2112 processors, not the full 2200 processors. With all 1100 nodes, even with no efficiency gain, it will be number 3, as-is.
Finally, this is a cluster of several firsts:
First major cluster with PowerPC 970
First major cluster with Apple hardware
First major cluster with Infiniband
First major cluster with Mac OS X (Yes, it is running Mac OS X 10.2.7, NOT Linux or Panther [yet])
Linux on Intel has been at this for years. This cluster was assembled in 3 months. There is no reason for the Virginia Tech cluster to remain at ~40% efficiency. It is more than reasonable to expect higher than 50%.
It's still destined for number 3, and its performance will likely even climb for the next Top 500 list as the cluster is optimized. The final results will not be officially announced until a session on November 18 at Supercomputing 2003.
What they're not telling you is that the real reason they are building a supercomputer is because the only copy of the router passwords is GPG-encrypted, and they lost the key.
That 80% efficiency simply sounded too good to be true, and it was.
Now it's at 44%. That's not a small drop, that's a MASSIVE drop.
They didn't predict any loss in going from a small subset to the whole system? Or was it a publicity stunt (we can outperform everyone! our names are __________!)
[I can picture a world without war, without hate. I can picture us attacking that world, because they'd never expect it]
That's nothing, last time I benchmarked my Big Mac Cluster (100 Big Macs) it came to almost 57.6 megacalories. Those Apples will never be able to match that!
I read the internet for the articles.
Not terribly surprising. Much like estimated death tolls for disasters, never believe the first set of benchmarks for a computer. Wait until thorough testing can be done before you start believing the numbers.
;)
Y'all should know this by now.
~D
This sig has been enciphered with a one-time pad. It could say almost anything.
"best price performance" and "Apple" in their minds?
"Virginia Tech: Home of the Poor Man's Supercomputer and Michael Vick."
Apparently there are a lot of cases where a MULTIPLY and an ADD do come together like that, but I'm not surprised if LINPACK doesn't consist entirely of those pairs. ;)
The 17.6 TFLOP theoretical peak assumed a perfect case consisting entirely of MULTIPLY-ADD pairs. In a case assuming no MULTIPLY-ADD pairs, the theoretical peak is 8.8 TFLOPs.
7.4 TFLOPs is only 42% of 17.6 TFLOPs, but it's 84% of 8.8 TFLOPs. I suspect the actual "efficiency" of the machine lies somewhere in the middle.
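To make the two percentages easy to check, here is a minimal sketch that just redoes the arithmetic. The only inputs are the three figures quoted in this comment; the function name `efficiency` is made up for illustration and is not part of any benchmark tool.

```python
# Figures quoted in the comment above (all in TFLOPS).
RMAX = 7.4          # measured Linpack result
PEAK_FMA = 17.6     # theoretical peak if everything is MULTIPLY-ADD pairs
PEAK_NO_FMA = 8.8   # theoretical peak if there are no MULTIPLY-ADD pairs


def efficiency(rmax: float, rpeak: float) -> float:
    """Return the measured result as a fraction of a theoretical peak."""
    return rmax / rpeak


if __name__ == "__main__":
    print(f"vs. all-pairs peak: {efficiency(RMAX, PEAK_FMA):.0%}")
    print(f"vs. no-pairs peak:  {efficiency(RMAX, PEAK_NO_FMA):.0%}")
```

Whatever the real instruction mix is, the "true" figure would sit somewhere between those two bounds, which is the point being made above.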
(As for me, I'm happy with just ONE dualie...)
Thanks for asking!!
While these numbers will no doubt come as a disappointment for Mac zealots who wanted to blow away all the Intel machines, it should still be noted that this is the best price/performance ratio ever achieved on a supercomputer.
It still bests all other Intel hardware with only the Alpha hardware on top. And given the CPU count, even the Alpha hardware does not match it. Look at the numbers... The Linux-based 2.4 GHz cluster has almost 200 more CPUs on board with a 217 Gflop/sec difference. The Alpha clusters are running anywhere from 1,984 to 6,048 more CPUs.
Visit Jonesblog and say hello.
See http://www.netlib.org/benchmark/performance.pdf page 53.
Since yesterday's release at 7.41 Tflop, the G5 cluster has already increased almost a Tflop, and is now ahead of the current #3 MCR Linux cluster, and about 0.5 Tflop behind a new Itanium 2 cluster.
/Watched WarGames too many times as a kid.
First you have the iTunes store which doesn't do anything but give the average user basically anything he or she might have wanted to have in an online music store. Despite its being free, we're all cheesed off that it doesn't support OGG, or it's meant partly to push iPods (duh), or whatever.
Now this -- a supercomputer that has, to quote that again, the "best price/performance ratio ever achieved on a supercomputer." But dang it all, it doesn't completely blow away every established precedent -- it's just in the top five on the usual list of comparisons. One more crushing disappointment.
From Microsoft, we just want products that don't completely ream us. From Apple, we want the entire world to seem a little friendlier and cooler with every product release, every dot-increment OS update. They both disappoint us, but the expectations seem a little different...
"Fundamentalism" isn't about divine morality. It's about human authority.
So, yes, these numbers are preliminary, and yes, they WILL increase - they already are. See http://www.netlib.org/benchmark/performance.pdf (the official source of preliminary numbers), page 53.
The preliminary performance report at http://www.netlib.org/benchmark/performance.pdf contains the new entries for the upcoming list as well (see page 53).
of all of these so-called "benchmark" discussions. Everyone really knows, in their heart of hearts, that the only valid benchmark is to be found in real-world applications such as Quake III. I want to know how many fps this alleged "supercomputer" gets.
144l. ph34r my 133t l3g4l 5k1lz!
Anyone know how much merit there is to using Nmax (or N1/2) to compare different systems?
"There are a dozen opinions on a matter until you know the truth. Then there is only one." - CS Lewis (paraphrase)
Yes, but don't Moore's Law and the commodification of computer hardware suggest that each new generation supercomputer will have the best price/performance ratio?
Efficiency is strongly dependent on the interconnect. Does anyone know if the 128 node benchmark (that supposedly showed ~80% efficiency) was run with only one Infiniband switch -- i.e. all nodes connected through only one switch?
BTW, the performance never was stated to be 17 TF, so it did not drop to 7.4 (or whatever it ends up to be).
While I am amazed at the initial price vs. performance that this cluster of Macs has obtained, I am worried about what all the electricity and cooling for the cluster will eventually cost. I remember reading in some random article that the electricity used to cool and power the computer was estimated at around that of 3,000 midrange homes. Just from a quick calculation of homes x $100 x 12 months we get the horrible figure of 3.6mil per year. So over a 10 year lifespan the cluster will cost 36mil more than the current price.
While it is still cheaper than the original cost of Intel or IBM supercomputers, I personally would rather spend more and waste a lot less electricity, since if I remember correctly the cost of energy for comparable supercomputers was in the range of 0.5 mil-1 mil. Although they are stationed in other countries, so the cost of electricity could be dramatically less in Japan than in America, but I doubt it. Someone should really get the kWh used by the top 5 supercomputers and then calculate the price per year based on that.
Never could figure out why my girl liked my bitch tits, then I found out she was a lesbian.
I installed a button on the front of my cluster
to manually clock the CPUs.
So far I've managed ONE whole flop.
My record is for the slowest supercomputer
on the planet.
siggy played guitar
Yes, the G5 should be capable of more than a little better performance than "a Xeon", but what I find interesting is that it is a Xeon which was initially released well over a year ago by Intel. What I am curious about is if someone could build an equally "cost-efficient" supercomputer based on more recent Intel hardware. The differences in speed, cache, front side bus, etc. that Intel has made in the past year would no doubt lead to higher numbers. If I were comparing a Xeon cluster to a G4 cluster, people would scream that it's apples and oranges - why does the same not hold true for Intel CPUs?
Jack Dongarra says that a "supercomputer" is simply a computer that, by today's standards, is REALLY fast. I saw one presentation from him, and he said he ran the Linpack benchmark on his notebook (2.4 GHz Pentium 4) and it would have made the bottom of the Top500 list in 1992. So, this supercomputer definition is very fluid.
Because the Power4 is hotter and uses more current than the G5. To use 2200 Power4 CPUs they would have to about triple the cooling capacity of the room. For all the heat and power, the Power4 lacks the AltiVec units that allow the G4/G5 to process vector operations so quickly.
The G5 is also significantly lower cost than the Power4
Article X: The powers not delegated... by the Constitution...are reserved...to the people
The degree of loss is interesting, and suggests that their algorithm for distributing work needs tightening up on the high-end. Nonetheless, none of these are bad figures. When this story first broke, you'll recall the quote from the top500 list maintainer who pointed out that very few machines had high performance ratings, when they got into the large numbers of nodes.
I'd say these are extremely credible results, well worth the project team congratulating themselves. If the team could open-source the distribution algorithms, it would be interesting to take a look. I'm sure plenty of Mosix and BProc fans would love to know how to ramp the scaling up.
(The problem of scaling is why jokes about making a Beowulf cluster of these would be just dumb. At the rate at which performance is lost, two Big Macs linked in a cluster would run slower than a single Big Mac. A large cluster would run slower than any of the nodes within it. Such is the Curse that Amdahl inflicted upon the superscalar world.)
The problem of producing superscalar architectures is non-trivial. It's also NP-complete, which means there isn't a single solution which will fit all situations, or even a way to trivially derive a solution for any given situation. You've got to make an educated guess, see what happens, and then make a better informed educated guess. Repeat until bored, funding is cut, the world ends, or you reach a result you like.
This is why it's so valuable to know how this team managed such a good performance in their first test. Knowing how to build high-performing clusters is extremely valuable. I think it not unreasonable to say that 99% of the money in supercomputing goes into researching how to squeeze a bit more speed out of reconfiguring. It's cheaper to do a bit of rewiring than to build a complete machine, so it's a lot more attractive.
On the flip-side, if superscaling ever becomes something mere mortals can actively make use of, understand, and refine, we can expect to see vastly superior - and cheaper - SMP technology, vastly more powerful PCs, and a continuation of the erosion of the differences between micros, minis, mainframes and supercomputers.
It will also make packing the car easier. (* This is actually a related NP-complete problem. If you can "solve" one, you can solve the other.)
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I think that magazine article must be wrong. If 1100 Macs use as much power as 3000 homes, then each Mac is using about 3 houses' worth of power. That seems excessive unless the home is in a 3rd world country or those 9 fans are really really running full blast. More likely, each G5 (with networking and cooling equipment) uses a few hundred watts. Even at 500 W/Mac, 1100 Macs, $0.15/kWh, 24 hr/day, 365 days/year, the cluster costs about $722,700/year. More likely, each Mac consumes an average of 300 W at most and is not running full tilt 24x7, so the cost is maybe around $300-$400k/year.
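For anyone who wants to plug in their own numbers, here is a rough sketch of both estimates from this sub-thread: the "3,000 homes at $100 a month" figure and the per-node wattage figure. The function names and the `duty_cycle` parameter are invented for illustration; the wattages and the $0.15 rate are the guesses from the comments, not measured values.

```python
HOURS_PER_YEAR = 24 * 365


def household_estimate(homes: int, monthly_bill: float) -> float:
    """Yearly cost if the cluster draws as much as `homes` households."""
    return homes * monthly_bill * 12


def annual_power_cost(nodes: int, watts_per_node: float,
                      dollars_per_kwh: float, duty_cycle: float = 1.0) -> float:
    """Yearly electricity cost for `nodes` machines at a given average draw."""
    kilowatts = nodes * watts_per_node / 1000.0
    kwh_per_year = kilowatts * HOURS_PER_YEAR * duty_cycle
    return kwh_per_year * dollars_per_kwh


if __name__ == "__main__":
    print(f"3000 homes estimate: ${household_estimate(3000, 100):,.0f}/year")
    print(f"500 W per node:      ${annual_power_cost(1100, 500, 0.15):,.0f}/year")
    print(f"300 W per node:      ${annual_power_cost(1100, 300, 0.15):,.0f}/year")
```

Lowering `duty_cycle` below 1.0 is one way to model a cluster that is not running full tilt around the clock.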
But your point is a good one. I often wonder about the environmental economics of people running SETI, Folding@Home, etc. on older machines. Most of those older "spare" CPU-cycles are quite costly in terms of electricity relative to newer faster machines that do an order of magnitude more computing with the same amount of electricity.
Two wrongs don't make a right, but three lefts do.
My feeling is that the ~40% efficiency seen on the larger scale run is an indication that either VA Tech spent very little time tuning the problem size or they didn't design their InfiniBand fabric to really handle 1100 nodes hammering away at Parallel Linpack. (Given that they've been extremely vague about how their IB network is structured, I fear it may be the latter.)
I doubt that's true, especially if they're using the IBM PPC compilers. The G4 has both significantly less memory bandwidth and a single double-precision-capable FPU, whereas the G5 is basically a single-core Power4 with an AltiVec unit in place of some cache. IBM's compilers (despite being a little wonky as far as naming and argument syntax) generally produce pretty fast code.
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
(hides/ducks - I ain't an anonymous coward for nothing!)
The efficiency of a parallel computer is considered to be
E=Ts/(n*Tp)
where Ts is the time to perform the computations serially, Tp is the total time to perform the computations on the parallel machine and n is the number of parallel processing units.
It wouldn't take much to get a drastic improvement in efficiency simply by improving the time slightly for each parallel processor, especially for 1100 nodes.
I don't know how the benchmark program runs, but improving the communication time would improve the efficiency as well.
It shouldn't take much to boost this by a few million flops.
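A small sketch of that formula, to show how sensitive E is to the parallel run time. The timings below are entirely made up (picked so the first case comes out at 40%), and `parallel_efficiency` is a hypothetical helper, not anything from the Linpack code.

```python
def parallel_efficiency(t_serial: float, t_parallel: float, n_units: int) -> float:
    """E = Ts / (n * Tp), using the definition given in the comment above."""
    if n_units <= 0 or t_parallel <= 0:
        raise ValueError("n_units and t_parallel must be positive")
    return t_serial / (n_units * t_parallel)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    t_serial = 1100.0   # seconds for the whole job on one unit (made up)
    n_units = 1100      # parallel processing units (made up)
    for t_parallel in (2.5, 2.0, 1.25):
        e = parallel_efficiency(t_serial, t_parallel, n_units)
        print(f"Tp = {t_parallel:5.2f} s -> E = {e:.0%}")
```

With these invented values, shaving half a second off Tp moves E from 40% to 50%, which is the kind of effect the comment is describing.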
The 21st version of this list does not
show the SETI@Home project. The top entry
is NEC at 35 teraflops. Today's SETI@Home
average for the last 24 hours is 61 teraflops.
It may be a virtual supercomputer, but it
is producing real results.
-- Stephen.
Grumble... Go take a look at Apple's description of the G5 architecture before spouting. Here are the relevant lines:
- Each PowerPC G5 processor has its own dedicated 1GHz bidirectional interface to the system controller for a mind-boggling 16GB per second of total bandwidth -- more than twice the 6.4-GBps maximum bandwidth of Pentium 4-based systems using the latest PC architecture
- 800MHz HyperTransport interconnects for a maximum throughput of 3.2GB per second.
Apple uses the same basic memory set-up as the AMD Opteron.
Thanks for the pointer. Now, about that "most cost effective" bit? Compared to what? At retail prices?
I have been sitting here by my 1100 node G5 cluster trying to copy a 17.6 MB file for the last 20 minutes. It is so freaking slow now that I only get 44% efficiency. On my 1.5 GHz P3 I would be able to do this in under 20 seconds. .....
I'm about as far from a Mac basher as you can get, but you are completely off-base.
ECC can *detect* two-bit errors, and can *repair* single-bit errors. ECC memory is *not* the same as parity memory!
And ECC is not designed to catch "bad" memory, it is designed to handle bit errors that occur naturally with "good" memory.
All memory has a bit-error rate, which is incredibly low. However, given a system with gigabytes of RAM, you can expect a bit error every couple of days. Hopefully this error will be in an area that is non-critical, but multiply this by a thousand or so processors, and there is a real risk.
And, since your message was so inflammatory, how about you do some f*cking research before you spout off next time...
Is the things you can find out by looking at the whole list.
Like...
The highest rated "classified" computer in the US is only at #44, a Cray with 1900 processors that clocks in at "only" 1166 GFlops. One can assume that it resides at NSA. Does anyone really believe that NSA would be using such a relatively "slow" supercomputer? Piffle. The faster ones are probably so classified that no one without a very high security clearance even knows they were built.
Avon Products apparently has a supercomputer that can do 277 GFlops (#456 on the list). Just what on God's Green Earth does Avon need with a supercomputer that makes the Top 500? Studying flow patterns in cosmetics? Data mining the Avon Ladies? Kinda makes you wonder, doesn't it?
BMW apparently spends a whole lot of money on HP super computers, with 12 on the list (unless I missed any--#'s 225, 243, 244, 322, 323, 324, 331, 342, 417, 418, 429, and 485), with a combined processing power of 4188.6 GFlops, and that was all installed in the past three years. With all that power, they still couldn't figure out that an embedded Windows OS for their flagship car was a bad idea...maybe they need to kick the F1 team off the supercomputers for a while and let the production car guys in...
Err, Apple's G5 and the AMD Opteron don't have an even remotely related memory setup. The G5 looks a lot more like the AthlonXP and AthlonMP setups. The Opteron has an integrated 128-bit wide DDR memory controller, connects multiple CPUs directly through cache-coherent Hypertransport links, and uses additional 32-bit, 1600MT/s HT links (3.2GB/s in each direction) to connect the CPU directly to the I/O chips.
The Powermac G5 uses an up to 1GT/s, 64-bit wide version of IBM's Elastic I/O bus to connect each processor to a memory controller chip, which in turn has a pair of 64-bit wide DDR memory controllers. These buses are also shared for the processors' I/O needs, which are passed over an 800MT/s, 16-bit wide hypertransport link to the PCI-X controller.
As for the width and speed of the Hypertransport links, Apple is very confusing on this front. In the document you linked they say "two bidirectional 16-bit, 800MHz HyperTransport interconnects for a maximum throughput of 3.2GB per second." In their PowerMac G5 Tech Specs PDF they say "two bidirectional 800MHz HyperTransport interconnects for a maximum throughput of 1.6 GBps." So which is it? And just what bandwidth are they measuring?
The PowerMac does indeed have two separate bi-directional Hypertransport links, the first connects the memory and processor controller chip to the PCI-X controller, and the second goes from the PCI-X controller to the extra I/O chips. It seems to me like the page you quoted is ADDING the bandwidth of the two daisy-chained hypertransport links, which would be TOTALLY incorrect.
My numbers came from the fact that a 16-bit (8-bits per direction) 800MT/s hypertransport link gets you only 800MB/s in each direction. Of course, it could really indeed be a "800MHz" hypertransport link, ie a 1600MT/s link since Hypertransport is a DDR protocol, but I highly doubt that since every other specification they mention just doubles the "MHz" number anytime they encounter a DDR bus (not that Apple is the only one to do this, Intel's "800MHz" bus runs at either 200MHz or 400MHz, depending on which clock you look at).
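The arithmetic behind those link numbers can be sketched as below. This assumes the simple model used in the comment (bits per direction times transfers per second, with 1 MB taken as 10^6 bytes); the function name is made up, and protocol overhead is ignored.

```python
def ht_bandwidth_mb_s(bits_per_direction: int, megatransfers_per_s: float) -> float:
    """Raw MB/s in ONE direction: (bits per transfer / 8) * million transfers/s."""
    return bits_per_direction / 8 * megatransfers_per_s


if __name__ == "__main__":
    # 16-bit link = 8 bits each way, at 800 MT/s (the reading argued for above).
    print(ht_bandwidth_mb_s(8, 800), "MB/s per direction")
    # Same link if "800MHz" really means 1600 MT/s.
    print(ht_bandwidth_mb_s(8, 1600), "MB/s per direction")
    # 32-bit Opteron-style link = 16 bits each way, at 1600 MT/s.
    print(ht_bandwidth_mb_s(16, 1600), "MB/s per direction")
```

Under this model, getting to a quoted "3.2GB per second" for a 16-bit link needs both the 1600 MT/s reading and adding the two directions together, so which clock Apple means changes the answer by a factor of two.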
KASY0 achieved 187.3 GFLOPS on the 64-bit floating point version of HPL, the same benchmark used on "Big Mac". While "Big Mac" is about 40 times faster on that benchmark, it is about 130 times the cost of KASY0 (~$40K vs ~$5200K). Considering the size difference, "Big Mac" is VERY impressive, but it can't claim to be the best price/performance supercomputer on the HPL benchmark.
Note: KASY0 gets 482.6 GFLOPS (0.48 TFLOPS) on a 32-bit precision version of Linpack, satisfying our under $100 per GFLOPS claim.
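The price/performance comparison above works out as in the sketch below. It uses only the approximate dollar and GFLOPS figures quoted in this comment plus the 7.4 TFLOPS from the story; the helper name `dollars_per_gflops` is invented for illustration.

```python
def dollars_per_gflops(cost_dollars: float, gflops: float) -> float:
    """Price/performance: lower is better."""
    return cost_dollars / gflops


# Approximate figures as quoted above.
KASY0_COST = 40_000          # ~$40K
KASY0_HPL_64BIT = 187.3      # GFLOPS, 64-bit HPL
KASY0_LINPACK_32BIT = 482.6  # GFLOPS, 32-bit Linpack
BIG_MAC_COST = 5_200_000     # ~$5200K
BIG_MAC_HPL = 7_400.0        # GFLOPS (7.4 TFLOPS preliminary result)

if __name__ == "__main__":
    print(f"KASY0, 64-bit:   ${dollars_per_gflops(KASY0_COST, KASY0_HPL_64BIT):.0f}/GFLOPS")
    print(f"KASY0, 32-bit:   ${dollars_per_gflops(KASY0_COST, KASY0_LINPACK_32BIT):.0f}/GFLOPS")
    print(f"Big Mac, 64-bit: ${dollars_per_gflops(BIG_MAC_COST, BIG_MAC_HPL):.0f}/GFLOPS")
    print(f"cost ratio:  {BIG_MAC_COST / KASY0_COST:.0f}x")
    print(f"speed ratio: {BIG_MAC_HPL / KASY0_HPL_64BIT:.1f}x")
```

If the final Big Mac number comes in higher than the preliminary 7.4 TFLOPS, its dollars-per-GFLOPS figure drops accordingly.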
Regardless, Virginia Tech's "Big Mac" is a very impressive machine. My congratulations to them!
Tim Mattox
You're forgetting the AC costs... If you've ever worked in a DC you know that the room itself can get mighty toasty, and toasty air leads to cooked systems.
Each processor, drive, and switch generates heat which is dissipated into the air. Untouched, that heat accumulates and will kill the entire thing. With 1100 dual processor nodes running constantly (and you can bet they'll each be running at pretty close to full tilt), that's a hell of a lot of heat that needs to be removed from the air.
The G5's memory controller is built into the U3 IC, which is essentially the "north bridge"- it is NOT built into the CPU.
It connects to the CPU via the "Apple Processor Interface" NOT via hypertransport. It connects to its memory controller at 1/2 the CPU speed, unlike Opteron and Athlon 64 which connect to the memory controller at FULL CPU SPEED.
Documentation:
developer.apple.com
apple.com (thanks for the link)
From the U3 Northbridge, G5 uses hypertransport to connect to the other peripherals at 3.2GB/s.
Opteron supports a hypertransport rate of 6.4 GB/s directly from the CPU.
The Opteron 4xx and 8xx models also happen to have THREE of these hypertransport channels connected in a cross-bar configuration for SMP systems, giving EACH CPU a dedicated 6.4GB/s connection, rather than the G5 architecture which must share that connection (since there is only one U3 chip in a dually G5).
Support for PCI-X in the G5 by standard is a great thing. I wish more AMD systems contained it... I appreciate their native support of FireWire and gigabit Ethernet. But seriously... do you really want to argue architecture against a workstation class CPU? I'm a bit disappointed by the Athlon 64, but the Athlon 64 FX (desktop version of Opteron) and Opteron live up to most of my expectations and I expect to see more speeds out in the near future.
Stewey
There are 10 kinds of people in the world. Those who understand binary and those who don't.