Japan's Petaflop Supercomputer
slashthedot writes "Japan has built the fastest supercomputer in the world. While the BlueGene/L contains 130,000 processors, Japan has managed to create the first Petaflop supercomputer, called MDGrape-3, with just 4808 chips, and it cost just $9 million to develop."
Imagine a Beowulf cluster of these!
Making that computer must have been harder than getting a story from MSN posted on the main page of Slashdot!
Yeah but does it run Linux?
Religion for nerds. Stuff that really matters
It now costs 15 dollars per gigaflop. In the early 90s, a million dollars per gigaflop was normal.
This should be used in conjunction with the topic from the previous article: creating countless means not only to find vulnerabilities in things like JavaScript, but equally to construct fixes for those vulnerabilities. Once it creates an open door, it generates the fix for closing it and keeping it closed. Machines like this can think thousands of times faster than your average black-hat-crackah, so why not use them as a fight-fire-with-fire tool?
Everyone is so concerned with internet safety, one would think that at some point massive resources will be set forth in order to deal effectively with the flaw-finding few out there who make it difficult for the rest of us to simply enjoy the benefits of the internet.
The article says that this machine is much more efficient than other supercomputers. Is it actually cheaper to run large programs like SETI@HOME on a supercomputer? Electricity isn't cheap.
The original article seems to be unreachable, so I can't read it, but the precis has the wrong chip count: It does have 4808 LSI chips, but it also has 19,122 Xeon processors.
Will this run Vista at a decent speed, or should I wait for the Rev B and SP1?
If this petaflop supercomputer really only costs $9 million and only occupies the space of a large walk-in closet, why don't they mass-produce it and sell it? No, not to individuals but to corporations and governments. Folding@Home and Seti@Home could suddenly be like, sorry guys, we don't need you anymore, we got something better. Having hundreds of copies of this supercomputer could quickly solve problems across the globe that much slower supercomputers are currently having trouble with!
... and in the DRM, bind them.
NOT what the VP of Marketing wants to hear:
"Not just a flop, but a flop a million billion times over."
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
The supercomputer is quite cheap. They can probably sell a lot of these machines and will sweep the top500 list. However, the article mentioned that the processor is specialized for astrophysics calculations. I am not sure if this will be useful for other fields.
But the good thing about it is that it is more energy efficient. It seems the trend in desktops/servers right now is also reaching supercomputers. Maybe they could include a performance-per-watt ratio in the top500 list as well.
Live your life each day as if it was your last.
But, you still can't get 100 gigaflops for 1,500 dollars. :(
Japan has managed to create the first Petaflop supercomputer, called MDGrape-3, with just 4808 chips...
FLOP = floating-point operation [per second].
PETA = 10 ^ 15, or "a quadrillion".
(10 ^ 15) / 4808 = about 207,986,688,852, which would indicate that each chip is running at several hundred TERA-hertz [and, even then, the machine would have to possess an operating system so efficient that it could consistently perform one floating point operation per clock increment, which seems extraordinarily unlikely].
Or is this an "analog" computer and are these "analog" FLOPS?
And no, I did not RTFA.
Does that mean it's a giant cluster of unwanted aibos?
Do not try to read the dupe, that's impossible. Instead, only try to realize the truth
What truth?
There is no dupe
Great. 9 million dollars to build the thing, 15 million dollars to build the infrastructure to power and cool it, probably.
Read carefully, and you'll discover some A-level mathematics in this sentence. Wow.
You're wrong here.
It's not about being fast. It's about creative ways to do things that interfaces weren't intended for.
Your idea would work out as soon as you have a way to replace artists with computers.
Nuff said.
Where are the really neato results we should be getting from these? I'm tired of "Country X builds massive TeraWatt computer system." I want to read about "Country X mapped the cancer genome" or some such.
Besides, these are not that impressive. Sure, in the 50s, 60s, 70s, 80s we were maturing the technology. Inventing new technology, analyzing it, etc. Now it's more of the same. Huge budget, lots of space and infiniband connections...
Show me the MFlops/Watt rating of this? Are they improving it? Are we wasting fewer resources? The irony of this is they pollute by wasting tons of energy, all so we can predict global warming or whatever.
Tom
Someday, I'll have a real sig.
...its Geforce MX 420.
I have a GeForce MX440 so trust me, that's not funny.
ROFL at the "From the renders-a-million-tentacles-a-minute dept" ... nice choice!
"Show me the MFlops/Watt rating of this?"
No problemo!
The number of flops: (10 ^ 15) / 4808 = about 207,986,688,852 flops per chip, - from a previous poster.
The number of watts: 300,000 - from the manufacturers' site = 62 watts/chip
207,986,688,852 / 62 = 3,354,624,013 flops / watt, or about 3,355 MFlops / watt.
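For anyone who wants to rerun that arithmetic, here is a minimal Python sketch. It assumes the figures quoted in this thread (1 petaflops peak, 4808 chips, 300 kW); the function names are made up for illustration and none of this comes from the manufacturer.

```python
# Back-of-the-envelope efficiency check using the figures quoted in this thread.
PEAK_FLOPS = 1e15       # claimed peak: 1 petaflops
CHIPS = 4808            # MDGRAPE-3 chip count from the story summary
TOTAL_WATTS = 300_000   # power draw quoted in the comment above

def flops_per_chip(peak=PEAK_FLOPS, chips=CHIPS):
    """Peak flops divided evenly across the custom chips."""
    return peak / chips

def mflops_per_watt(peak=PEAK_FLOPS, watts=TOTAL_WATTS):
    """Whole-machine efficiency in decimal megaflops per watt."""
    return peak / watts / 1e6

print(f"{flops_per_chip() / 1e9:.0f} GFLOPS per chip")
print(f"{TOTAL_WATTS / CHIPS:.1f} W per chip")
print(f"{mflops_per_watt():.0f} MFlops per watt")
```

Dividing the whole-machine figures directly gives roughly 3,333 MFlops per watt; rounding the per-chip power down to 62 W before dividing gives a slightly higher number.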
As someone else already said, and as mentioned in the parent's link, this is a very specific machine for molecular dynamics simulations: everything from memory handling to processing is optimized only for handling particles and doing force calculations on them. Therefore, it'll serve a relatively small market.
That said, I'm very curious to see how fast it'll run GROMACS, the MD program I use. It is pretty well optimized for parallel simulations already, and I'm able to do the calculations I need on a small Opteron cluster in no time.
The biggest problem now might be finding useful research questions to simulate on it! Actually, that is the main reason why computational medicine hasn't really taken over yet. The good thing is that this machine will give researchers time to think about this instead of spending their time thinking about how to get enough computing power.
molmod.com - computing tips from a molecular modeling
"the machine may be ineligible because of its specialized hardware"
What specialized hardware? I would really like to read a more technical article about this machine. I would guess that the Japanese focused on vector processing like they did in the design of the Earth-Simulator.
The best supporting evidence I have for this conclusion is the comparison of Japan's last two supercomputers:
Sun Fire X64 Cluster
Earth-Simulator
Sun Fire has 10,368 processors with an Rmax (GFlops) of 38,180.
Earth-Simulator has 5,120 processors with an Rmax (GFlops) of 35,860.
That's 51% fewer processors with 94% of the processor power*.
Here's the original article link:
http://www.businessweek.com/globalbiz/content/jul2006/gb20060726_150659.htm?chan=topStories_ssi_5/
*Only comparing one aspect of performance.
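The comparison above can be rechecked with a short Python sketch. The processor counts and Rmax values are the ones quoted in the comment; the dictionary layout and function name are just illustrative.

```python
# Processor counts and Rmax (GFlops) exactly as quoted in the comment above.
systems = {
    "Sun Fire X64 Cluster": {"processors": 10_368, "rmax_gflops": 38_180},
    "Earth-Simulator": {"processors": 5_120, "rmax_gflops": 35_860},
}

def gflops_per_processor(name):
    """Sustained Rmax divided by the processor count for one system."""
    s = systems[name]
    return s["rmax_gflops"] / s["processors"]

for name in systems:
    print(f"{name}: {gflops_per_processor(name):.2f} GFlops per processor")

sun, es = systems["Sun Fire X64 Cluster"], systems["Earth-Simulator"]
processor_ratio = es["processors"] / sun["processors"]
rmax_ratio = es["rmax_gflops"] / sun["rmax_gflops"]
print(f"Earth-Simulator has {processor_ratio:.0%} of the processors "
      f"and {rmax_ratio:.0%} of the Rmax")
```

Per processor, the vector machine comes out at roughly twice the sustained rate, which is the point the comment is making. As the footnote says, this is only one aspect of performance.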
Your "Grumpy Old Man" impression is passable, but it's nowhere near as funny as Dana Carvey's was.
From the article: Meteorologists use supercomputers to predict climate patterns decades into the future by analyzing huge databases of statistics.
It all makes sense now. When they predict 90% chance of rain three days in a row and we don't see a drop, they really meant that it will rain sometime between now and thirty or forty years from now.
Oh, please. This machine only uses 300kW - that's maybe the equivalent of 150 American homes. These folks are building a specialized (as in not "more of the same") machine to support a particular bit of science (molecular dynamics simulations) that isn't gonna make for flashy headlines, and I say more power to them. I'd rather there were more scientists out there doing basic research that may actually be useful, than have them chasing after stuff for headlines that will make you happy.
And if you're trolling, yeah, you got me, so congratulations.
[b.belong('us') for b in bases if b.owner() == 'you']
From the article it sounds like the whole thing is based on a large collection of specialised processors designed only for protein folding calculations, so while it may be able to do those at a petaflop rate, it probably can't do anything else at nearly that rate. (In the same way, the WWII Colossus computer could beat a 486 at cracking the Lorenz cipher, but it certainly wasn't faster in terms of actual computing speed.)
9 million, sign me up, where I can get one.
but I thought Japan already had a lot of studies on protein?
I've seen the videos of it a few times and stumbled across entire collections of them! they call it something like bukkake.
You've all been had by a reporter with an overactive imagination talking to a researcher selling his own shit. The MDGrape is a specialized processor (you can actually buy it commercially as a separate board for your computer) that does exactly one thing: particle simulation using traditional laws of physics. This will allow it to do computational molecular dynamics on the small scale or universe modeling on the large scale. All it understands is data input in the form of particle positions and will output the new positions in the next time step. Can you place two numbers in a register and ask it to add the results? No. Can it do any piece of the HPL benchmark required to get on the supercomputing list? No. It does one thing, but it does it well. This whole article is like comparing the rendering capabilities of your new Nvidia GPU and the latest AMD CPU, then concluding AMD is full of idiots who can't engineer because the Nvidia chip renders more polygons.
Check out the company Mathstar (http://www.mathstar.com/). They just taped out a chip the other day that, when it comes to market, will do about 500 Gflops a chip. The technology is quite incredible, and although it is not specifically a general purpose chip, the chip can be programmed to work in any way you like, allowing you to get max performance for the applications that you need to run. Honestly I would like to get a hold of about, say, 50 of these and see what I could make them do in parallel (as they are made to be hooked up in parallel also). From what I have heard these would be competitive in price with processors nowadays and therefore likely less than $1,000 apiece. That works out to under $2 a Gflop!
The problem with that is that this computer is very specialised to molecular simulations. It can't very easily do other things, like seti or folding (okay, well, maybe that it can do). It was easy to design and cheap because it didn't have to be general purpose and adaptable, like BlueGene/L is.
I love deadlines. I like the whooshing sound they make as they fly by. - Douglas Adams
Cancer research sounds a little better than preventing-your-browser-from-misbehaving research. But at only 9 mil a piece, why not both?
In fact, you could put thousands of these machines together for less than 10 billion. For 10 billion dollars you could crack any reversible cryptographic algorithm in the universe on a weekend. I call that world domination.
Maybe Gates still has interesting things to do with his life after all.
(10^15)/4808 = 207 986 688 852, i.e. ~208 billion flops, i.e. if the chip executed only 1 instruction per clock, it would be 208GHz (not THz as you imply). Except of course the chip does more than 1 instruction per clock. Modern x86 chips do multiple flops per cycle. A Cell should be able to do at least 9 per cycle. I imagine that a dedicated vector processor, of the sort that NEC used to make, can do tens of flops per cycle.
Furthermore, many processor architectures have instructions that do several basic floating point operations in one step. For instance, PowerPC has a one-cycle multiply-accumulate instruction (multiply and add in one step), so for marketing purposes, a PowerPC has twice the flops. Now, imagine if you have a vector processor that has a highly-optimized instruction for taking square roots or doing trig in one cycle. A square root operation will translate into dozens of basic flops (add, multiply, subtract). Such a processor might therefore be rated at 208 gigaflops even though its operating frequency is <1GHz.
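To make the clock-versus-flops argument concrete, here is a small Python sketch. It assumes the 1 petaflops and 4808 chip figures from the story; the flops-per-cycle values are illustrative guesses (9 is the Cell figure mentioned above) and are not MDGRAPE-3 specifications.

```python
# Figures from the thread: 1 petaflops peak spread over 4808 chips.
PEAK_FLOPS = 1e15
CHIPS = 4808

def required_clock_hz(flops_per_chip, flops_per_cycle):
    """Clock rate a chip needs to reach flops_per_chip at a given per-cycle throughput."""
    return flops_per_chip / flops_per_cycle

per_chip = PEAK_FLOPS / CHIPS

# Illustrative throughputs only; the real per-cycle figure is not given in the thread.
for flops_per_cycle in (1, 2, 9, 64, 512):
    ghz = required_clock_hz(per_chip, flops_per_cycle) / 1e9
    print(f"{flops_per_cycle:>4} flops/cycle -> {ghz:7.2f} GHz")
```

At one flop per cycle the chip would need about 208 GHz, but once a chip retires a few hundred counted flops per cycle, a clock well under 1 GHz is enough.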
If it costs $15/gigaflop, then they would have paid... $15 million
A $6 million subsidy (40%) isn't small change.
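The same sums as a quick Python sketch. It assumes the $9 million cost and 1 petaflops peak from the story and the $15 per gigaflop figure quoted earlier in the thread; the helper names are hypothetical.

```python
PEAK_GFLOPS = 1_000_000          # 1 petaflops expressed in gigaflops
REPORTED_COST_USD = 9_000_000    # development cost quoted in the story summary
QUOTED_USD_PER_GFLOP = 15        # per-gigaflop price quoted earlier in the thread

def cost_per_gflop(cost_usd, gflops):
    """Dollars per gigaflop of peak performance."""
    return cost_usd / gflops

def implied_cost(usd_per_gflop, gflops):
    """Total cost implied by a per-gigaflop price."""
    return usd_per_gflop * gflops

reported_rate = cost_per_gflop(REPORTED_COST_USD, PEAK_GFLOPS)
implied_total = implied_cost(QUOTED_USD_PER_GFLOP, PEAK_GFLOPS)
gap = implied_total - REPORTED_COST_USD

print(f"${reported_rate:.2f} per gigaflop at the reported cost")
print(f"${implied_total:,} implied by ${QUOTED_USD_PER_GFLOP} per gigaflop")
print(f"${gap:,} gap, {gap / implied_total:.0%} of the implied total")
```

So the two quoted numbers only agree if someone else covered about $6 million of the bill, which is the subsidy the comment is pointing at.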
[Fuck Beta]
o0t!
No, a Petaflop is when an animal rights activist throws themselves in the path of a fishing trawler, cattle car or some other vehicle used in the meat or fur industry. It is similar to, but not quite the same as the terraflop which is more used in anti-logging activities.
"Waste not one watt!" - CZ
Of all the MD 20/20 varieties...grape stands out as the best.
Blar.
Something for the Java Swing Developers and Users out there :-)
I guess it would depend on the definition, whether it has to be capable of general purpose or only specialized. Technically, it should be possible to easily get petaflop performance by putting a few million into a computer using chips designed only to run LINPACK.
Personally, I don't think it should qualify. Otherwise the EFF's $250,000 Deep Crack, which could only crack DES (although faster than tens of thousands of regular computers at that time), would qualify too.
How many petaFLOPS will IBM get out of a new Blue Gene made from Cell processors?
--
make install -not war
......will it run linux??
It wasn't a troll. I honestly believe we don't step back and say "should we do this". Just because you CAN do something doesn't mean you SHOULD. Of the computers on top500.org, how many of them have led to new discoveries or tested hypotheses?
It's not just computers though. Look at the number of people who subscribe to the notion that they need their own personal vehicle, bottled water, blah blah blah.
So a group built an overgrown home computer. Big deal. Let's wait and see what they accomplish with it.
Tom
Someday, I'll have a real sig.
Though the theoretical performance of this computer is higher than that of BlueGene, and it may have higher real-world performance too, you can't compare this supercomputer with BlueGene and other TOP500 supercomputers since it can't run LINPACK. It's just too specialized for its use.
...how many bogomips does it do?
If you can just take their n^3 algorithm (with quantum it's more like n^8), and make it n^2, you can do all that on your desktop :)
Not all progress needs to be brute force. But brute force is much more fun to brag about.
-
- Adam L. Beberg - The Cosm Project - http://www.mithral.com/
From TFA:
Experts believe that the nation with the most machines near the top of the ranking generally has the most competitive economy.
Oh come on - were these American experts by chance? How about flops/head? But let's think for a moment. Do raw flops count, or is it what you do with them? Once you have a big computer, it's easy to generate lots of numbers. The art of science, though, is to abstract your question, so you can make some useful predictions. Otherwise you might as well just measure the world that's out there, in all its complexity.
Measuring in yards/meters is easier than measuring in nanometers.
Predicting long term weather trends is easier than daily weather conditions in your area.
When fluid dynamics and computers are to a level to handle compressible fluids at the scale needed, the predictions will still be off to places that aren't the focus. Frequently the predictions for my city only come true to part of the city.
Democracy Now! - uncensored, anti-establishment news
Tech specs for MDGRAPE-3
...is it willing to come out of the closet?
The trouble with science is that the value of research is very hard to measure. However the more we know about the world the better we understand it. Just because supercomputer research hasn't produced a certain amount of value yet doesn't mean it never will. I'm all for "wasting" money on learning more, because the more humans learn the more likely we are to discover stuff that is useful.
From the article, it looks like they built a special-purpose ASIC that solves their problem, with the ASICs controlled by standard CPUs. Depending on the algorithm, you can get a performance advantage of multiple orders of magnitude from a special-purpose chip instead of a general-purpose computing chip.
Let's do an order-of-magnitude computation here. A pair of general-purpose CPU cores uses about 100M transistors, not counting cache. An adder takes about 1,000 transistors. So with a CPU's transistor budget you get 100,000 adders running in parallel. Overall, the performance difference would be around 1000x for the ASIC design over the general-purpose solution. Leaving cache out of the count matters, since you probably want the on-die storage for temporary values anyway, and cache transistor density is far higher than that of logic. That is neither the best case nor the worst case, but more or less what to expect as a general rule, as long as you don't saturate the memory. If you do, you should add more or faster memory channels or change to a less bandwidth-limited algorithm; either way you can still make trade-offs that no off-the-shelf CPU could reasonably make, and overall you still get at least a 10x performance increase over standard CPUs. So expect anywhere from 10x to 1000x, even against code that runs EXTREMELY well on a general-purpose chip. Of course you CAN construct a case where a general-purpose computer beats the special-purpose one, but more often than not that case cannot use lots of processors; once you can parallelize, the special-purpose chip wins.
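The transistor-budget step of that estimate, as a tiny Python sketch. Both constants are the rough guesses from the comment, not measured figures, and the function name is made up.

```python
# Rough figures from the comment above, not measurements.
CPU_LOGIC_TRANSISTORS = 100_000_000   # pair of general-purpose cores, cache excluded
TRANSISTORS_PER_ADDER = 1_000         # order-of-magnitude cost of one adder

def parallel_units(transistor_budget, transistors_per_unit):
    """How many identical functional units fit in a given transistor budget."""
    return transistor_budget // transistors_per_unit

adders = parallel_units(CPU_LOGIC_TRANSISTORS, TRANSISTORS_PER_ADDER)
print(f"{adders:,} adders fit in the same logic budget")
```

This only counts units. The 10x to 1000x speedup range in the comment also depends on memory bandwidth and on how well the algorithm parallelizes, which this sketch does not model.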
The problem with special purpose is that you cannot do everything, you can do one thing and that thing VERY WELL.
You just change the control logic to a logic solving the problem.
©God
hmm...here's what I see with this... ...And then...there's the military...
With such great power and so few processors, this will cause other (but not all) computing technology to migrate in that direction.
I can see the average PC doing 15 teraflops within the next 5 years. This, if I am accurate, would put the home PC in the processing realm of the human brain. Is it possible that an AI which could pass the Turing test with near 100% of the subjects is not long behind? Humanoid robots and robotic transportation?
Should we put a "Three Laws Treaty" on the international table?
Does this mean that the animation of Anime will be better? If not, so what.
207,986,688,852 / 62 = 3,354,624,013 flops/watt, or about 3,355 MFlops/watt. My AMD64 3000+ has around 42 MFlops/watt.
What new technology was developed to produce this machine?
Or was it a case of having loads of money, room and a friendly merchant at Fry's?
That's my complaint. It was different with the first Crays. Nothing like it existed before. They had to invent new technology to accomplish it. This is more a case of networking via gige and optical then stacking box upon box.
Tom
Someday, I'll have a real sig.
Already the article suggests it may not be capable of running Linpack. The other question is whether these are 32-bit precision operations or 64-bit precision. Linpack explicitly measures 64-bit precision. This is one reason why, despite some clustered deployments that are inevitable with the Cell processor, those won't be impressive top500-wise despite the cries of 'OMFG, cell has uber gigaflops'. Cell brags about its gigaflops, but Cell as announced today is only interesting for 32-bit precision. Its 64-bit precision won't blow away the conventional Power/PPC chips, which are impressive Linpack-wise.
XML is like violence. If it doesn't solve the problem, use more.
Exhibit A. 1 peta flops is 10 to the 15th power floating point operations per second.
Exhibit B. This computer has ~5000 chips.
This means each chip should be capable of 200 giga floating point ops per second.
I know of no technology which can allow any floating point unit to be clocked at 200 GHz.
Even if it were possible, the kind of power it will consume would make P4s look like mere tiny fuzzy little animals.
This means that each of these chips has to have multiple FPUs running in parallel. For low-power apps, going over a 1GHz clock (at today's chip process technologies) is generally not viable. Assuming that to be the case, this would need 200 FPUs in each chip, amounting to the equivalent of 1 million nodes (just distributed over 5000 chips). Why does this matter? The larger the number of nodes, the larger the complexity of splitting the application into so many threads of execution, and the larger the communication bottleneck. Yes, integrating 200 FPUs on a single chip would certainly ease the design of the communication system, but that also means that going off chip will in general carry with it a large, large, large "communication penalty".
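A rough Python sketch of the FPU count worked out above. It assumes the rounded figures from this comment (about 5000 chips, a 1 GHz clock ceiling, one flop per FPU per cycle); all of these are the poster's assumptions, not published specs.

```python
import math

def fpus_per_chip(flops_per_chip, clock_hz, flops_per_fpu_per_cycle=1):
    """Number of parallel FPUs a chip needs to reach flops_per_chip at clock_hz."""
    return math.ceil(flops_per_chip / (clock_hz * flops_per_fpu_per_cycle))

CHIPS = 5_000                      # rounded chip count used in the comment
PEAK_FLOPS = 10**15                # 1 petaflops
CLOCK_HZ = 10**9                   # assumed 1 GHz ceiling for low-power parts

per_chip = PEAK_FLOPS / CHIPS      # about 200 gigaflops per chip
fpus = fpus_per_chip(per_chip, CLOCK_HZ)
print(f"{fpus} FPUs per chip, {fpus * CHIPS:,} FPUs in total")
```

If each FPU retires more than one counted flop per cycle (a fused multiply-add, say), pass a larger `flops_per_fpu_per_cycle` and the count drops proportionally.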
Also, in that case, I would consider the article deliberately misleading, as they make it a point to mention the lower number of chips being used in this design as evidence of it being better than the other super comps.
As to having so many FPUs on a chip, there are dozens of companies out there making massively parallel chips...1024 and 2048 fpus per chip has already been done...
there's more to this than meets the eye...
if anyone here has more info, care to share?
-ghoul2
Sigura Non Grata
so does that mean in the future we'll have the MD-Grape 20/20???
I compiled some quick facts which compare those three supercomputers and added pointers to other resources for your convenience:
http://www.bloglines.com/blog/ITnomad?id=126
Cheers, Alex.
You look like a million dollars. All green and wrinkled.
To triple previous speeds with so few processors some radical engineering took place; strangely enough, the bus topology closely resembles that of a four-dimensional domo-kun.
It is theorized that a complex topology resembling a four-dimensional Hello Kitty will run roughly twenty times as fast.
~
http://slashdot.org/comments.pl?sid=06/07/30/138234&threshold=1&commentsort=0&mode=nested&cid=15810814
19,122 Xeons.
(1 * 10 ^ 15) / (2 * 10 ^ 4 ) = 5 * 10^10.
That's 50 billion floating-point operations per second. If each Xeon is dual-core, it's 25 billion ops per core per second. If they're 4GHz processors, then it's about 6.25 ops/cycle. I'm not sure how it achieves that. Even fused multiply-add instructions only do 2 ops per cycle.
I still have to ask if this is achievable.
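The per-core estimate above can be rerun with a few lines of Python. The dual-core and 4 GHz figures are the poster's assumptions, and the function name is made up for illustration.

```python
def ops_per_cycle(peak_flops, processors, cores_per_processor, clock_hz):
    """Flops each core must retire per clock cycle to reach peak_flops."""
    per_core = peak_flops / (processors * cores_per_processor)
    return per_core / clock_hz

# Rounded to 20,000 Xeons, as in the comment above.
rounded = ops_per_cycle(1e15, 20_000, 2, 4e9)
# Using the 19,122 Xeon count quoted earlier in the thread.
exact = ops_per_cycle(1e15, 19_122, 2, 4e9)

print(f"rounded count: {rounded:.2f} ops/cycle per core")
print(f"19,122 Xeons:  {exact:.2f} ops/cycle per core")
```

Either way the Xeons alone would need more than 6 flops per cycle per core, which fits the other comments saying the petaflop figure comes from the custom chips rather than from the host processors.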
http://lkml.org/lkml/2005/8/20/95
This computer, like all the previous (MD)GRAPE generations, is a central force potential calculation accelerator.
It does nothing but calculate 1/sqrt(dx^2+dy^2+dz^2)*variable, but really really often.
GRAPE-6, 5 years or so ago, was already running at 200 MHz, with a throughput of one force calculation per pipeline per cycle and 6 pipelines on one chip. So it counts as 1.2 billion force calculations per second, each being (1 inverse, 1 sqrt, 3 adds, 3 squares, 2 fmul, etc.).
A lot of flops, but totally useless as general purpose computers.
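As a rough software illustration of that kernel, here is a minimal Python sketch. It only mirrors the 1/sqrt(dx^2+dy^2+dz^2)*variable expression and the clock-times-pipelines arithmetic from the comment; it is not the actual GRAPE pipeline, and the function names are hypothetical.

```python
import math

def inverse_distance_sum(target, sources):
    """Sum weight / distance over all sources, the expression quoted above.

    sources is a list of ((x, y, z), weight) pairs. A source sitting exactly
    on the target would divide by zero; this sketch does not handle that case.
    """
    tx, ty, tz = target
    total = 0.0
    for (x, y, z), weight in sources:
        dx, dy, dz = x - tx, y - ty, z - tz
        total += weight / math.sqrt(dx * dx + dy * dy + dz * dz)
    return total

def interactions_per_second(clock_hz, pipelines):
    """Peak pairwise evaluations per second at one result per pipeline per cycle."""
    return clock_hz * pipelines

sources = [((3.0, 4.0, 0.0), 10.0), ((0.0, 0.0, 2.0), 1.0)]
print(inverse_distance_sum((0.0, 0.0, 0.0), sources))   # 10/5 + 1/2 = 2.5
print(f"{interactions_per_second(200e6, 6):.2e} interactions per second per chip")
```

The hardware does the loop body as a fixed pipeline, one pair per clock, which is why the flop count is so high and why nothing else can run on it.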
HI O WISE PRINCE. WHT TOOK U SO DAM LONG?
From their 2004 press releases:
http://www.taborcommunications.com/hpcwire/hpcwireWWW/04/0827/108259.html
http://news.com.com/Japan+designers+shoot+for+supercomputer+on+a+chip/2100-1008_3-5322558.html
http://www.peta.co.jp/md2/faq_en.html
http://grape.astron.s.u-tokyo.ac.jp/grape/computer.htm
http://www.primidi.com/2004/09/01.html
For the first time, I have become worried about an unbalanced singularity. If one country reaches the singularity first, the power they would gain might allow them to prevent a singularity in other countries. The US should invest in technology to speed and guide the development of singularity technology here at home. We can't afford to let the singularity happen somewhere else first.
-John Fenley
Very good explanation. You could even compare this to the Human brain, which only operates at about 50Hz (if I remember my AI class properly) but can have every single one of the trillions of Neurons doing its own little threshold calculation. Granted, it's difficult to compare Neural nets to non-linear circuit systems in a meaningful way, but it does demonstrate the ridiculous extreme of parallelisation.
I wonder if the Googleplex machines and its distributed systems have a throughput near this and, if so, does that qualify as a supercomputer?
The article is badly written. It cost Riken $9m, because NEC (as SGI Japan) paid for most of the hardware, and because Hitachi and Intel provided all but three of the workers.
In short, Riken had almost nothing to do with the process, except for the design of the single custom chip involved, and even then, most of the work was done by outside firms who wanted the press. And even then, it still cost the host organization $9 million!
StoneCypher is Full of BS
FLOPS is not the plural of FLOP. FLOPS is FLoating point Operations Per Second. Man, it drives me nuts when clueless journalists think they can just call it one petaflop. I know it sounds funny to say one petaflops, but that is exactly what it is. Quit propagating erroneous acronyms - please.
What can I say to the morons who commented on this (really astonishing) design in the article.
Two words.
Sour grapes.
Maybe more....yes, it's a better design than yours....you can't HELP but learn from it, as it just made your idiotic computer science ideas so much rubbish.
No, but they should be worried when a 'technology magazine' sees the need to explain that 298 is a larger number than 250. Yes, this might be shocking, but after you subtract 298 from 500, you are only left with 202. And no, 202 is not larger than 298, even if you take the whole of it. So, yes, if you have 298 apples of a total of 500, no one will be able to have more than you. Next, we will have a closer look at the letter 'G'.
Hmm, and with a cluster of these, local news stations may now be able to accurately predict the weather six days in advance.
Matthew Brundage
Silver Spring, MD
The honest answer is "we don't know", and that we should continue on (for whatever that means) doing what we do...
Reason is the Path to God - Anon
No, this is nothing like a beowulf cluster. While the basic architectural outline is classical, using a general-purpose computer to feed instructions and manage I/O for a whopping big array processor, there are numerous small, critical innovations which contribute to the enormous flop count. BlueGene/L you might consider just a big flipping stack of workstations, but there is an order of magnitude difference in flops between that kind of commodity system and MDGRAPE-3.
Gig-E is a pretty sad sort of MPP interconnect, BTW. Infiniband is a big step up, and HyperTransport 3 is another hop skip jump beyond that. When the VPUs are talking over a direct interconnect, magic can happen.
-I like my women like I like my tea: green-