Linux Supercomputer Wins Weather Bid
Greg Lindahl writes "The Forecast Systems Laboratory, a division of NOAA, selected HPTi, a Linux cluster integrator, to provide a $15 million supercomputing system over the next 5 years. The computational core of this system is a cluster of Compaq Alphas running Linux, using Myrinet interconnect.
Check out www.hpti.com for information on the company."
I work at a company that is working on a very complex artificial intelligence architecture, and for a variety of reasons it is written in Java (since the other most popular AI languages use VMs or are directly interpreted, expect the AI community at large to want good interpreters on Linux).
We looked at putting together a Beowulf Linux cluster to run our software, which is very memory and processor intensive, but Linux could not do the job because JVMs on Linux are absolutely terrible. We wound up on WinNT (we couldn't afford Suns, but plan to upgrade when we can) because the JVMs were the best.
Because people making large software systems are fed up with reengineering for new hardware, expect other people to start choosing Java for large, intensive applications that were previously written in C, Fortran, C++, etc.
If Linux can't compete with other OSes for running large Java programs, these projects will not be able to consider Linux as their OS of choice (which we all WANTED to do here, we were very upset to go to NT).
Right now the fastest Java environment we've found is Java 2 with HotSpot, running on NT (we're testing Solaris now, as we might be able to afford Suns soon). Can the Linux community do any better, or even as well? So far, no.
o/~ we are pissed, we are pissed, we have to resist... o/~ - ec8or
Check out these pages to see what 15,000,000 USD could really do... It is quite interesting to see an el cheapo cluster of 96 PIII-500s beating 48 21264-450s... Both systems were running Linux...
I think the major problem is getting GCC and PPC/Linux to show up on their radar screen. For example, Motorola *could* be turning out specs for G3/G4-based motherboards and encourage Abit and Tyan to rollout consumer-level boards that bolt into ATX boxes and use SDRAM and EIDE drives. But they don't. Getting cheap PPC hardware and optimizations for GCC are pretty much hitting the same wall...Motorola is out-to-lunch.
That's bull; there have been no improvements in any field of mathematics since Pythagoras's postulations.
I believe they are planning to use the Legion system developed at the U. of Virginia. http://legion.virginia.edu
Perhaps in the future a cluster of G4s will be used. The gcc compiler should (or may) generate more efficient code in the future as improvements are made. IIRC, Apple is using gcc in the development of the forthcoming MacOS-X.
Nonetheless, it is nice to see the federal government go this route.
Something is up with that link you gave. I know the K7 is superior to the P3, but if you compare the K7 vs. the Alpha you will find the K7 is twice as fast. Hmmmm.
I am also looking at the speed of the PowerPC G3 for a new PowerPC Linux box, and the standard P3 was almost twice as fast. I am a former Mac guy and I used to regard anything from Apple in benchmarking as fact, but either Apple is really lying (they probably are) or this test is biased. I think you should find out what this test was trying to prove. Something is really screwed up.
"Never stick an electrical appliance down your pants." -Tim Allen
The major controlling factor is the model. For fluid dynamics, approximations are made to make the problem solvable. Stuff like, a... Of course, the input parameters/data can play a major role. If the problem is chaotic, one has to run a whole bunch of scenarios to obtain a statistical model.
My only dispute with what you said is that if the model is wrong, the results may be wrong. Running three models with limitations that yield the same result may not give you the right answer. Additionally, chaotic effects can lead to bad results.
And to the idiot who commented about no advances in math, I would like to say that while the math (e.g., 1+1=2) may remain the same, the physical model may be different.
FSL's proposal stated that Computational Performance was the primary evaluation measure.
Scientific, vector processor tuned codes are known to run fastest on the Alpha 21264 + Tsunami memory chipset, so it is the only choice for a no-compromise, fastest computer in the world solution.
Take a look at the benchmark numbers (albeit limited) on http://www.hpti.com/clusterweb/ for some initial results.
Now, on the choice of Myrinet... This is a more interesting question.
Any takers?
No_Target
Your point is well taken in that there is a need for I/O balance in all supercomputing systems due to the need to save the results, particularly in those calculations that involve dynamic phenomena, like weather. The faster the computer, the faster results come out of it.
An enabler for cluster effectiveness is the Fibre Channel Storage Area Network, a technology that allows multiple hosts to read _and_ write to the same file at the same time at very high bandwidth.
In fact, the I/O bandwidth of a cluster in this context is still limited by the speed of the PCI busses on one node if you are serializing the I/O to that one node. If this is the case, the XP1000 will sustain about 250+ MB/s with three-four Fibre Channel Host Bus Adapters on its two independent PCI busses. If your software can distribute the I/O to multiple nodes, like FSL's parallel weather forecasting API can (SMS), then your I/O bandwidth is essentially limited by your budget for RAID systems, Fibre Channel Switches and HBAs.
No_Target
If you are using Java for performance reasons, I would suspect that your intelligence is somewhat artificial.
Reimplement it in C and save yourself a cluster.
I have seen these waste-of-time AI projects before. Lots of "good ideas" implemented in a stupid way.
After reading through all these comments, I have come to the conclusion that rather than posting clueless messages on Slashdot, some reading may be in order. Take a look at the Linux Parallel Processing and Beowulf HOWTOs; there is also a Beowulf FAQ, a "How to Build a Beowulf" book, and much more.
One thing about the computer business is that it is full of people who "do not know they do not know". RTFM
What does that have to do with it being a 64 bit processor? MMX does some 64 bit arithmetic, and I truly don't think it makes x86 machines 64 bit.
The BIOS is impressive?
Linux does not use the BIOS for much more than booting the system and collecting some configuration information...
A few other thoughts:
- Not only are the Alpha 264s unmatched in terms of both floating point performance and memory bandwidth (although the next-generation PPC is very good in that regard also), they are also among the best at dealing with the data-dependencies and access-latencies which occur in real scientific codes.
- DEC^H^H^HCompaq probably has the best compiler technology of anybody out there commercially (IBM are also very good technically, but as Toon Moene of the Netherlands Met Office put it, "XLF was the first compiler I ever encountered that made you write a short novel on the command line in order to get decent performance.").
- Note for AC # 68 State-of-the-art weather models are not spectral models. Spectral models are appropriate only for very coarse scales at which cloud effects are only crudely parameterized (and to some extent are only appropriate on vector-style machines (and not current microprocessor/parallel) because of the way they generate humongous vector-lengths). At the WRF scales, the flow is not weakly compressible! Note that the global data motion implied by the FFTs in hybrid spectral/explicit models is a way to absolutely kill scalability for massively parallel systems. Finally, spectral models do not support air quality forecasting, such as we are doing (see http://envpro.ncsc.org/projects/NAQP/).
- Weather modeling is a problem which has exponentially-growing divergence of solutions (two "nearby" initial conditions lead to different solutions that diverge exponentially in time), so as coyote-san suggests, there is a tendency to run multiple "ensemble" forecasts, each of which is itself a computationally-intense problem. So far, I haven't managed to get the funding to develop a stochastic alternative (which will be a fairly massive undertaking -- any volunteers?). This means weather modeling can soak up all available CPU power for the (foreseeable)^2 future. At least the individual runs in ensemble forecasts are embarrassingly parallel.
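The exponential divergence of "nearby" initial conditions is easy to see on a toy system. The sketch below uses the logistic map purely as a stand-in (it is not a weather model, and nothing here comes from FSL's codes); the function names and the starting values 0.2 and 1e-10 are arbitrary choices for illustration.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x), a standard toy chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def separation(x0, eps, steps):
    """Distance at each step between two runs whose starting points differ by eps."""
    a = logistic_trajectory(x0, steps)
    b = logistic_trajectory(x0 + eps, steps)
    return [abs(p - q) for p, q in zip(a, b)]


# Two runs that start 1e-10 apart (hypothetical values).
gap = separation(0.2, 1e-10, 80)
for step in (0, 10, 20, 30, 40):
    print(step, gap[step])
```

The printed gap stays tiny for the first few steps and then grows until it is as large as the values themselves, which is why a single deterministic run says little about the far end of a forecast.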
An aside to LHOOQtius ov Borg: have you tried the GNU Java compiler (now a part of the gcc system)? For the intensive apps, generating native machine code is much faster.
Hi, Greg! Didn't know you were here!
"My opinions are my own, and I've got *lots* of them!"
HPTi press release says it's an Alpha Linux supercomputer.
EV5 = 21164 and at 450, it's going to be the old style ones. Try again. That chip came out around the same time as the PPRO 200
gcc is gcc version 2.95.1 19990816 (release). Compile time options: -O9 -mcpu=ev56
ccc is Compaq C T6.2-001 on Linux 2.2.13pre6 alpha. Compile time options: -fast -noifo -arch ev56
The benchmark consisted of running two scripts through the CGI version of PHP4. We compare user times as measured by time(1). The tests were run three times, the shown results are mean values. The scripts are available from the Zend homepage. PHP was configured with --disable-debug.
The test shows that the code ccc produced was about 10% faster than gcc's. Other conclusions are left as an exercise to the reader.
Too bad Windows 2000 can't handle bad weather, otherwise it would have been the logical choice. ;)
Beowulf clusters are a concept, not a physical item...
I'm crying foul on the moderations I've been given on this story. It's true that the government finds ways to mess things up, e.g. crypto laws, software patents, etc.
M2 has seemed to make moderations a bit more accurate, but I don't see it working out for me here. Unless somebody actually goes to the page and sees what I'm talking about -- "Alpha" in ten hours, and the EV series are cranking out units faster than LensCrafters...
I didn't make up those "CPU's". They are actually listed on the page! Please follow the link and see for yourself.
--
--
E2 IN2 IE?
Linux clusters are also making their way into business solutions!
Check out: http://linuxtoday.com/stories/10157.html
.signature not found
Last I heard, the fastest available JVM on Intel platforms is the one made by IBM for... get ready... OS/2! Yes, I believe it is from 15 to 25 percent faster than any JVM on NT.
Hm; I need to write a perl script to generate this type of junk.
Did I mention another of my graduate classes was chaotic dynamics? :-)
The very definition of "chaos" is high sensitivity to changes in the initial conditions. If a weather front appears in the same place (within the resolution of the data grid) on all 120-hour forecasts despite a reasonable variation in the initial conditions, you can be pretty sure it isn't in a chaotic realm and your forecasts will be fairly accurate.
On the other hand, if a modest amount of variation in the initial conditions result in wildly different predictions, the system is obviously in a chaotic realm and you can't make decent predictions.
As odd as it sounds, for something as large as a planetary atmosphere it's quite reasonable for parts of the system to be chaotic while other parts are boringly predictable. That's why they were starting to compare the predictions from different models, the same models with slightly different initial conditions, etc. That might give the appropriate officials enough information to decide to evacuate a coastline (at $1M/mile), or to hold off another 6 hours since the computers predict the storm will turn away.
P.S., the models do make mistakes, but fewer than you might expect. It's been years since I've thought about it, but as I recall most models work in "isentropic" coordinates and are mapped to the coordinates that humans care about at the last step. The biggest problem has been the resolution of the grids; when I left I think the RUC model was just dropping to 60km; by now it's probably 40 or 30km. To get good mesoscale forecasts (which cover extended metro areas, and should be able to predict localized flooding) you probably need a grid with 5 or 10 km resolution.
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
> I don't know about now, but five years ago, a state-of-the-art code for weather forecasting used spectral approximations (Fourier or Chebychev expansion functions) in the X- and Y-directions (Latitude and Longitude, say) and some high-order ...
:)
Dude... I think you just compressed an entire episode of Star Trek into six sentences.
I've finally had it: until slashdot gets article moderation, I am not coming back.
I assume this is a pun and goof, not proof that you haven't heard of the company Digital, which invented the Alpha and was bought by Compaq...
Usually, when one is investing in the kind of high end networking hardware necessary to make a clustered supercomputer, one uses FTP or NFS instead of floppies... Only an idiot would compile a program individually on each of 100 nodes of a cluster anyway.
www.unix.digital.com/linux/software.htm it's free now...
Digital? Have you seen their power supply? I don't see any digital in that, they should give credit where it is due.
As far as their product, you sure are right when you say that they are in the alpha stage. How many electrons can they pump thru their chip in 1 second? Heck, our lighthouse-sized vacuum tubes could suck in electrons like it's a black hole. Don't forget, in the pure analog world, thruput=power=speed, with each discrete electron acting as a piece of information. This is what real VLIW is supposed to be.
Absolutely. This is NOT a Beowulf cluster.
Beowulf refers to the tools created at NASA Goddard CESDIS
This cluster uses MPI and tools developed by the University of Virginia's Legion Project
Beowulf has become, to some, a generic term for a Linux cluster, like Kleenex to tissues.
Mark Vernon HPTi
FSL runs their RUC model globally with a 40km resolution today. They expect to run RUC globally with a 10km resolution on the new system. However, there is a lot of weather that wants even finer resolutions.
The fact that NOAA doesn't mention Linux in the press release means that NOAA doesn't care what the box is, as long as it meets the performance requirements.
If SGI or IBM (the two other leading competitors) had won, the press release wouldn't have mentioned Irix or AIX either.
HPTi could deliver 10,000 trained monkeys in a box if it met the performance requirements.
The fact that a Linux solution could exceed the performance of an SGI or IBM supercomputing solution is important to the Linux community, but not directly to NOAA.
Mark Vernon
HPTi
To say that I'm favorably impressed by the performance of the Compaq ccc compiler would be a major understatement. IMHO, with the release of this compiler, they have just overcome the Intel price/performance issue.
I've seen 280% speedups over gcc's best effort, more than justifying the 100% price premium of the hardware over (for instance) dual PIII boxen.
If I was going to put in a number crunching cluster (and I may) AlphaLinux would be the best way for me to go, cutting 40% from my TCO over IntelLinux.
Thanks Compaq!
> You build your NUMA box that has 1 fat highway, and it turns up like the subway systems in the metropolitan areas. The whole purpose of hypercube or 5-D torus is to have a shortest path to as many places as possible, instead of hopping onto that megapipe and making a stop at every node to see who wants to get off.
Technically you are correct. What I wanted to illustrate, though, is that in big NUMA boxes you have one copy of the kernel running all processors. With a Beowulf system, and a Cray T3E I believe, you have a local copy of the kernel on each node of one or two processors. This sidesteps the SMP problems of Linux on multi-CPU machines.
it's beta. if they were giving it away for free
there would be no reason not to just make their
own back end to gcc for the alpha.
i still don't know why compaq's doing this...
Granted, it's not a 2 to (1 + 1) performance ratio in the truest sense but the concept is valid if not the accuracy of my description.
On top of that, the previous post said nothing about running on 32bit. Alpha and several other currently available systems are running 64bit today (and for the past several years). True, x86 is not 64bit. IA-64 is not really an x86 processor but the next generation from Intel. IA-64 will bring Intel more in line with what other chip manufacturers have been doing for extreme high end systems for years and will bring it to prominence on the desktop.
D. Keith Higgs
CWRU. Kelvin Smith Library
My office has been taken over by iPod people.
You would need each node to boot an OS image and I would prefer the optomized one. either way you need a diskette in each node to boot or use special ethernet cards that boot from a central server. This would be bad because it would hurt performance on the bottlneck of the super computer which would be the speed of the ehternet. The floppies would also need to contain the special messaging software. Again the ethernet would clog everything if its from a central server. Besides you only boot once and after its booted the diskette is no longer used. The other method is to install beowolf on each harddriver. This would take too long to install.
"Never stick an electrical appliance down your pants." -Tim Allen
I made a few spelling errors. I also meant hard drive on the second to last sentence. Sorry
"Never stick an electrical appliance down your pants." -Tim Allen
Ok everyone, here's your chance to talk about that bitchin' Beowulf cluster...
I am curious as to what the determining factors were for selecting Alphas over Pentium-based systems.
I've installed Linux once on an Alpha box and the BIOS is truly impressive, much better than PCs. But what are some of the other reasons? Wider data/cpu buses? Larger memory configurations?
Anyone who actually uses Linux on Alphas is encouraged to reply.
Ugh, no mention of Linux in the press release. Fear of the penguin?
That may kick ass, but imagine a Beowulf cluster made out of... oh wait, it already IS a cluster. :) I guess they need to get to work on that internet tunnelling massive computing surface initiative if they want to make this computer part of a Beowulf cluster...
Good to see that using Linux as a tool, a company can provide a commercial grade super computer at what appears to be a very attractive cost/performance ratio.
Along with the use of Linux in digital VCRs and other Internet appliances this goes a long way to validating Linux as a viable, and very flexible commercial platform.
-josh
Either way, you could make Toy Story in about 10 minutes on this thing once it's up.
$15 million will buy a lot of beowulf. Anyone see how many nodes? I didn't find the number listed.
Kythe
> I've installed Linux once on an Alpha box and the BIOS is truly impressive, much better than PCs. But what are some of the other reasons? Wider data/cpu buses? Larger memory configurations?
The big thing about the Alpha for people like NOAA (who run big custom number-crunching apps written in FORTRAN) is its stellar FP performance. A 500MHz 21264 Alpha peaks at 1 GFLOPS and can sustain 25-40% of that, because of the memory bandwidth available. A Pentium III Xeon at the same clock rate peaks at 500MFLOPS and can sustain 20-30% of that.
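The peak and sustained figures above reduce to a few lines of arithmetic. This is only a sketch: the flops-per-cycle values (2 for the 21264, 1 for the Pentium III Xeon) are simply what the quoted peaks imply at 500 MHz, and the function names are made up for illustration.

```python
def peak_mflops(clock_mhz, flops_per_cycle):
    """Peak MFLOPS = clock rate (MHz) times floating-point results per cycle."""
    return clock_mhz * flops_per_cycle


def sustained_range(peak, low_pct, high_pct):
    """Sustained MFLOPS range for a given percentage band of peak."""
    return (peak * low_pct / 100, peak * high_pct / 100)


# Flops-per-cycle values are inferred from the post's peak numbers (assumption).
alpha_peak = peak_mflops(500, 2)   # 500 MHz 21264
xeon_peak = peak_mflops(500, 1)    # 500 MHz Pentium III Xeon

alpha = sustained_range(alpha_peak, 25, 40)
xeon = sustained_range(xeon_peak, 20, 30)
print("Alpha:", alpha_peak, alpha)
print("Xeon: ", xeon_peak, xeon)
```

On these numbers the Alpha's sustained range sits at roughly 2.5x the Xeon's at both ends of the band, which is a bigger gap than the 2x difference in peak alone.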
That doesn't fly for everybody, though. Where I work, we have a huge hodgepodge of message-passed, shared-memory, and vector scientific codes, plus needs for some canned applications that aren't available on the Alpha. We picked quad Xeons for our cluster and bought the Portland Group's compiler suite to try to get some extra performance out of the Intel chips.
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
Although HPTi may believe in Linux as a clustering solution, it would appear that they have trusted their web page to IIS 4.0. It also seems that their web authoring tool is MS based, judging from the occurrence of "?" where normal punctuation would be found.
This is good news, but it only affirms the role of Linux in niche markets. It will be some time before it is accepted widely as a general purpose business or desktop solution.
If the G4 can sustain >1gflops, then why not build a cluster of G4s running LinuxPPC? Jeremy Fincher
General Processor Info.
Compare the SPECfp scores of high-end Intel and Alpha offerings. Take a look at a 600MHz PIII Xeon and a 667MHz Alpha 21264.
The reason to choose Alpha should be obvious.
Total and utter BS. If you don't need an address range larger than 2 gigs in your application, 64 bit can only slow it down.
i hear all of the great tales of lore about beowulf clusters and their amazing speed, yet i am forced to ask if it will perform as advertised. as i understand it (and i may be way off here, so please correct me), beowulf clusters do not completely overcome the problems that linux has with multiple processors. of course this is something hoped to be fixed in later kernel releases, but does the noaa really have the time to bring down a system such as this for kernel recompiles? a very fast machine? yes. but will it ever live up to its full potential? i hope it does, but i still have to wonder.
I worked at FSL for several years, although on a different project. I knew people working on the weather models, and I took a class on parallel processing from the CU professor who shared the old Paragon supercomputer with NOAA. I even had an account on the Paragon briefly (for that class) after leaving NOAA.
NOAA needs to solve partial differential equations (PDEs). A *lot* of PDEs. My class spent a lot of time on solving numerical methods, and my entire undergraduate class in the early 80's was covered in the first lecture of my graduate class a few years ago. My Palm Pilot, running multigrid analysis, could beat the pants off a Cray-XMP running the best known algorithm from 15 years ago.
AI programs may not scale well, but the type of work done at NOAA *does*. Furthermore the hot topic a few years ago was applying some ideas from chaos theory to weather forecasts - take a dozen systems, insert just a little bit of noise into the initial data (essentially, instrument noise in your observations), then let them all run. If all models show the same weather phenomena, you can be pretty sure that they will occur. If the models show wildly different results (e.g., Hurricane Floyd slams into Key West in one run, but NYC in the other) you know that you can't make any firm predictions. As an educated layman's guess, I expect that the reason the hurricane forecasts are so much better than just a few years ago is precisely this type of variational analysis.
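The "dozen systems with a little noise" recipe can be sketched on a toy model. Everything below is an illustrative assumption rather than anything FSL runs: the logistic map stands in for the forecast model, and the noise level, member count, tolerance, and function names are all invented for the example.

```python
import random


def run_member(x0, steps, r=4.0):
    """One 'forecast': iterate the logistic map (a toy chaotic model) from x0."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def ensemble_spread(x0, noise, members, steps, seed=0):
    """Run several members from slightly perturbed starts; return max - min of results."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    finals = [run_member(x0 + rng.uniform(-noise, noise), steps)
              for _ in range(members)]
    return max(finals) - min(finals)


def verdict(spread, tolerance=0.05):
    """Trust the forecast only if the members still agree to within the tolerance."""
    return "members agree" if spread < tolerance else "no firm prediction"


short = ensemble_spread(0.2, 1e-6, members=12, steps=3)
long_ = ensemble_spread(0.2, 1e-6, members=12, steps=60)
print("short lead time:", verdict(short))
print("long lead time: ", verdict(long_))
```

The same dozen members that agree closely at a short lead time scatter across the whole range at a long one, which is the signal for holding off on a firm prediction.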
For every complex problem there is an answer that is clear, simple, and wrong. -- H L Mencken
> If the G4 can sustain >1gflops, then why not build a cluster of G4s running LinuxPPC?
I'm not convinced the G4 can sustain 1 GFLOP/s in any kind of real calculation -- it simply doesn't have enough memory bandwidth. The G4 uses the standard PC100 memory bus, AFAIK. That's 64 bits wide running at 100MHz = 800MB/s peak. So without help from the caches, the absolute best you can do on *any* PC100 based system is 200 MFLOP/s using 32-bit FP or 100 MFLOP/s using 64-bit FP. In practice you can only sustain about 300-350 MB/s out of the PC100 memory bus, so things get even worse. The caches will help quite a bit (maybe a factor of 2-4), but I have trouble imagining the G4 being able to sustain over 500 MFLOP/s even on something small like Linpack 100x100 because of the limited bandwidth and latency of the PC100 bus. Other processors that have similar peak FP ratings have much higher memory bandwidths; we've benchmarked an Alpha 21264 (1 GFLOP/s peak, ~400 MFLOP/s sustained) at about 1 GB/s memory bandwidth (that's measured, not peak), and a Cray T90 CPU (1.8 GFLOP/s peak, ~700 MFLOP/s sustained) at 11-13 GB/s (again, measured not peak).
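The bandwidth ceiling quoted above can be reproduced with a short calculation. A minimal sketch, assuming the worst case the post describes: every floating-point operation has to pull one operand from main memory and the caches give no help. The function names are hypothetical.

```python
def peak_bandwidth_mb_s(bus_bits, bus_mhz):
    """Peak bus bandwidth in MB/s: bytes per transfer times million transfers/s."""
    return bus_bits / 8 * bus_mhz


def flops_bound_mflops(bandwidth_mb_s, operand_bytes):
    """MFLOP/s ceiling if each flop must fetch one operand from main memory.

    Assumption for illustration only: one operand per flop, no cache reuse.
    """
    return bandwidth_mb_s / operand_bytes


pc100 = peak_bandwidth_mb_s(64, 100)                 # 64-bit bus at 100 MHz
print("peak MB/s:       ", pc100)
print("32-bit FP bound: ", flops_bound_mflops(pc100, 4))
print("64-bit FP bound: ", flops_bound_mflops(pc100, 8))

# With the 300-350 MB/s that the post says is sustainable in practice:
print("64-bit, measured:", flops_bound_mflops(300, 8), "-", flops_bound_mflops(350, 8))
```

Even the suggested 2-4x help from the caches leaves the measured-bandwidth 64-bit figure well short of 500 MFLOP/s.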
There's also the question of compilers. You have to have a compiler that recognizes vectorizable loops and generates the appropriate machine code to use the vector unit. Unless Motorola's feeling *really* magnanimous, I don't see that kind of technology making it into gcc (and g77, more importantly for scientific codes) any time soon. Otherwise, you're at the mercy of a commercial Fortran compiler vendor like Portland Group or Absoft. PGI hasn't shown any interest in PowerPC to this point, and Absoft currently does PPC compilers only for MacOS 8, not OSX or LinuxPPC.
I'd love to be proven wrong on this, but based on my experience I don't see how you could do it.
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
Doesn't this make the problem embarrassingly parallel? If I have to run 30 times on slightly perturbed input data, can't I get a near-30x speedup this way, without a lot of painful parallel PDE programming?
Just imagine: weather@home (yeah I know it's probably not *that* embarrassingly parallel).
I would suspect that there's a number of reasons why NOAA went with the solution they did, and not just merely because it's a fast set of machines running a fast operating system.
Every six hours the National Weather Service sends out to all of its forecast offices around the country a series of models to help in local forecasting. Each model is based on a massive amount of information that comes in to their central office, and that information is used in preparing the next set of forecasts. Now, you would want a) a system that is capable of processing all of this information rapidly and reliably, with b) redundancy built in so that if a part of the system goes down, you're still able to digest and transmit those models. Using a cluster of systems gives you that backup redundancy, and using a stable operating system gives you the speed and reliability to churn out models on schedule.
The people at NOAA likely couldn't care less about advocacy in this respect. What they want is a system that they can use, one that provides the reliability and performance they demand at a reasonable cost. $15 million for a distributed cluster that gives them a lot more bang for the buck is definitely money well spent. And remember, this IS your tax dollars at work, one of the few times you will ever see it spent for a truly worthwhile cause.
-Tal Greywolf
A friend of mine tried Linux on Alpha and the performance was quite bad. After posting to a newsgroup she found out that the gcc compiler is not optimized for Alpha. She had to buy an expensive C/C++ one for the Alpha box, and after a recompile the performance was great. I wonder how hard it was to get the cluster going with the compiler issue. I would hate to make 80 diskettes for all the machines because of licensing issues with the compiler. I heard Alpha Linux lacked some features of the standard Intel one. Is this true, or was it referring to the unoptimized compiler that comes with Alpha Red Hat Linux?
"Never stick an electrical appliance down your pants." -Tim Allen
Give me a buzz when Java gets Design By Contract or even the C language's simple "assert()".
come on now! we all know that plot is irrelevant these days!!! its all about explosions and breasts.
-- your knees hurt, don't they?
"My opinions are my own, and I've got *lots* of them!"
what are the areas that typically require heavy-duty processing power? all i know of is weather modeling and graphics rendering...
-- your knees hurt, don't they?
Don't know about the latest Alpha based systems, but by the terms of some supercomputer apps, such as matrix algebra stuff (FE, CFD, etc.) the bigger DEC servers were nothing to write home about around 18-24 months ago when I was doing a lot of benchmarking.
The peak total memory bandwidth available then was 2.4Gb/sec in the AlphaServer 8400, and it really had an impact on big calcs - can't speak for SPECfp, but for a big matrix algebra calc you need (asymptotically approaching) 4 bytes/sec per "flop", and these systems just didn't cut it.
I won't even speak about 32-bit Intel boxes - the 100MHz cache bus sucks enormous rocks, and the 4Gb memory limit (3Gb with NT, less with Linux IIRC) cuts it out of the big job league anyway. This is maybe OK if it's a node in a large MPP system, but these days you want to be able to bring 64Gb or more of RAM to bear on a single problem.
The question we used to hear from our engineering staff was along the lines of: "Hey, my desktop PC is n-zillion MHz, and it runs this tiny test calc almost as fast as the big machine, why don't we just get a lot of big twin Xeon PC's with XYZ graphics cards?". Or occasionally, the same thing in favour of SGI workstations - engineers love toys just like the rest of us.
This is the classic misconception caused by benchmarks in the FE industry; a lot of test calcs will fit in the cache on a Xeon PC or an R10k or UltraSparc workstation, and show pretty acceptable performance, but the dropoff when you move to a larger problem size and start hitting RAM is sudden and dramatic.
By comparison, if you look at real supercomputers, like the high end Crays or NEC SX series, memory bandwidths of 2 to 4 Gb/sec *per processor* are the norm.
The machine we ended up buying to replace a low-end vector Cray was - an HP V-Class.
The PA-RISC has excellent scoreboarding and memory bus, and the Convex architecture keeps it well fed. We tested on the Convex S-Class hardware running at 180MHz with SPP-UX, and HP guaranteed that the delivered system running HP-UX would meet the clock over clock speedup ratio, which it did with room to spare. We saw well over 700 MFlops *sustained* per CPU on a 200MHz PA-8200 using rather nondescript FORTRAN, against a theoretical peak of 800.
The picture with the newer PA-8500 machines is not so rosy, as the memory bandwidth does not seem to have been scaled up with the capabilities of the new CPUs, especially with double the number of CPUs per board. Nevertheless, as the previous posters' figures would indicate, I believe the sustained throughput still exceeds that of the latest Alpha based systems for certain types of job, and the price/performance is very good.
Of course, for the rabidly religious, Linux is still not well supported on PA-RISC, and doesn't handle the high end hardware.
All your points are valid and I'll briefly explain the nitty gritty:
1) global circulation models are actually done by people in the US; downscaling via nested regional models is limited to this part of the world and, if and when the system becomes operationalised, is expected to be distributed. Think cooperating groups around the world sharing the CPU burden.
2) the 100m models are interfaced to streamflow and catchment models which are only a comparatively small region set within the wider desert (rather uninteresting). Think sparse multi-resolutional hierarchy.
3) further submodels are inherently linear in space/time; while the climate fields are calculated once, the bulk of the operational landscape runs the scientists are interested in are multiple ensembles which require lots of memory, hence some rather painful use of staging and compression. Think conversion to streaming media rather than static files.
If you're interested in more details, send me your email and I'll point you to some of my papers.
Regards,
LL
Wow, that is one fast machine. All for just weather! Sure, it's not the fastest machine out there, but 4 TFlops for finding out if it's going to rain on Saturday? Heh. Just joking. The mathematical models used in weather forecasting, and the complexities of even a single supercell (which produces thunderstorms and tornadic activity), are mind-boggling.
In Soviet Russia...michael would be rotting in Siberia!
Greg - and the team at Digital^H^H^H^H^H^HCompaq responsible for the compiler involved: Congratulations ! This is a first, but it won't be the last ... Toon Moene.
AC said: "Either way, you could make Toy Story in about 10 minutes on this thing once it's up."
Yes, but what kind of plot? Would it be Woody and Mr. Potatohead lost in a hurricane with a large number of penguins?
Raw power is cool, but art takes a bit more than that.
Will in Seattle
Remember those vast performance diffs between the 80386SX-16 and the 80386DX-16? That's what we got here.
<Note-to-Microsoft> Nanny-nanny-nah-nah, our OS runs on IA-64 and yours won't.</Note-to-Microsoft>
D. Keith Higgs
CWRU. Kelvin Smith Library
My office has been taken over by iPod people.
For $15,000,000 to buy an Alpha Beowulf, it sounds like they might have 2,500 nodes with a 'decent' Alpha system. But if they go really high end, they'll have about 750 nodes (for the 'killer' $20,000 Alpha machines).
That doesn't include the cost of the Myrinet cards and switches, racks, 3rd party software, support people, power, cooling, etc. Believe me, if you're paying $15M for a machine, part of it better be going for support personnel and infrastructure. The configuration's probably more like 250-500 nodes with a corresponding number of Myrinet cards and switch ports, 30-75 racks (8 nodes/rack if you're lucky), a *buttload* of power and air conditioning, and 2-5 onsite support people working in it full time.
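The arithmetic behind these two estimates can be sketched in a few lines. The $15M budget and the $20,000 node price come from the posts above; the $6,000 price is simply what 2,500 nodes implies, and the 30% hardware share is an illustrative assumption, not a figure from the bid. The function name is hypothetical.

```python
def node_count(budget_usd, node_share_percent, cost_per_node_usd):
    """Nodes affordable when only node_share_percent of the budget buys nodes.

    Integer arithmetic throughout, so the result is rounded down.
    """
    node_budget = budget_usd * node_share_percent // 100
    return node_budget // cost_per_node_usd


BUDGET = 15_000_000

# Spending the whole budget on nodes reproduces the optimistic estimates.
print(node_count(BUDGET, 100, 6_000))   # 'decent' nodes, price implied by 2,500 nodes
print(node_count(BUDGET, 100, 20_000))  # 'killer' $20,000 nodes

# ASSUMPTION: only 30% of the budget goes to the nodes themselves; the rest
# covers interconnect, racks, power, cooling, software and support staff.
print(node_count(BUDGET, 30, 20_000))
print(node_count(BUDGET, 30, 6_000))
```

Changing the share and the node price moves the answer a lot, which is the point of the reply: the node count is set by the split of the budget at least as much as by the sticker price of a node.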
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
Just imagine a recursive $15M cluster of multiple Beowulf clusters... how's that?
It wouldn't surprise me one bit -- U.S. government agencies seem to find ways of being excessive, duplicitous, overly redundant, and do things in an excessively superfluous manner.
Maybe Rob can use some of the quick IPO cash from Bendover and put it into this site -- or maybe they've already gotten advances, and that's why Slashdot's been up and working for a change this afternoon?
I'm no troll -- in fact, I pretty much stay away from bridges altogether...
--
--
E2 IN2 IE?
Hold on folks, this isn't necessarily a beowulf. I could not find the word "Beowulf" on the HPTi page. (Maybe I didn't look hard enough though).
Not every Linux cluster is a Beowulf. The fastest alpha Linux cluster in existence is not a Beowulf.
Anyone know what they plan to use?
All that is being said is that Linux is being used for one type of supercomputing task: weather forecasting models. Some people are joyful about that. But that does not imply they think Linux is a solution for everything. It is reasonable to infer, though, that since you are making an incorrect logical inference, your logic may be flawed in other areas of reasoning. I can't say for sure whether your JAVA/NT solution is the best solution for your application. But since we have already established that you are a person of flawed logic, I wouldn't place a lot of confidence in your decision to use NT.
visit the SETI@home CPU type statistics page. -- Alpha EV6 and EV67's are rockin' ass^H^H^H, if not as much as the "Intel Puntium" or "PowderPC" chips...
--
--
E2 IN2 IE?
Buying the hardware is only 15-30% of the total cost. Also, in a production environment, you should not be fixated by the CPU. The question should be, within the capital budget, what is the best combination of resources that maximises the effectiveness of achieving your mission.
To give you some real-world experience, a group I'm working with is looking at continental-scale simulation at a 5km resolution with the aim of going down to 100m. Now despite what most people think, the bottleneck (in this example) is in fact the I/O, with estimated total requirements of 30 TBytes. Doing the sums shows that to keep up with the CPU (say hypothetically 1 run/24 hours), you would need average throughput of 350 MByte/sec. Hardware that supports both this volume and capacity is NOT cheap. We would joke that we paid x million for the I/O and SGI would throw in the Cray for free :-).
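The sum behind the 350 MByte/sec figure is easy to reproduce. This is only a sketch of the arithmetic, assuming decimal units (1 TByte = 10^12 bytes) and that the whole 30 TBytes moves once per 24-hour run; the function name is hypothetical.

```python
def required_throughput_mb_per_s(total_bytes, run_seconds):
    """Average I/O rate in MByte/s needed to move total_bytes in run_seconds."""
    return total_bytes / run_seconds / 1e6


TOTAL_BYTES = 30e12        # 30 TBytes, decimal units (assumption)
RUN_SECONDS = 24 * 3600    # one run per 24 hours

rate = required_throughput_mb_per_s(TOTAL_BYTES, RUN_SECONDS)
print(f"{rate:.0f} MByte/s sustained")  # about 347, i.e. roughly 350
```

Note that this is a sustained average over the whole day; any idle time in the I/O path pushes the required peak rate higher.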
Now as for how an Alpha cluster could be used, it would fit very nicely into the dedicated batch box category. It has a very high CPU rate and some decent compiler optimisation. As such it would augment whatever environment already exists, reducing the workload of the more expensive machines for development which generally have better tools (just you try debugging a multi-gigabyte core dump). The biggest problem nowadays is not the algorithms, but managing the data traffic to the CPUs, and this is where Linux clusters are weak, with relatively slow interconnects, unbalanced memory hierarchies, and cheaper but higher latency memory. You have to accept the disadvantages and shift off the jobs which are not suited to this architecture. A bit of smarts goes a long way in stretching the budget.
LL
Anybody else noticed that Linux is not mentioned anywhere in the NOAA press release while it's prominently displayed in the integrator's?
Is the NOAA afraid to say that they are basing a 15 million dollars investment on free software rather than on something from Microsoft/Sun/IBM/whatever ?
-- the cake is a lie
I think another (better?) answer is that gcc/egcs doesn't have much in the way of DSP type stuff, where you do parallel computations. Alphas get performance inherently, as its FPUs are very good, and it does not have to d!ck with SIMD instructions - something that many compilers don't do well anyways - usually you have to call hand coded assembly to get good performance out of SIMD (= single instruction multiple data, where one instruction is executed on multiple sets of data - like MMX, KNI (SSE), AltiVec, etc)
And the raw bandwidth of even the unreleased G4s trails that of three year old Alpha designs anyway, and now there's the switch-matrix arch that gets close to twice that of the new G4's theoretical bandwidth (EV6 500 -> ~2.6 GB/s, G4 (7400) -> ~0.8 GB/s). This is the 'theoretical' figure; Alphas still get 1.3GB/s in sustained throughput, over 50% more than the G4's theoretical.
Did you even bother to benchmark how well your
stuff runs under Solaris x86, on the same hardware
you have now?
"Affording Suns" has nothing to do with anything.
Solaris x86 is cheaper than an NT license, even.
You seem to know what you're talking about so I'm worried I may be missing something, but I don't see why Motorola needs to feel magnanimous to contribute optimisations for their chips to gcc. Wouldn't they just need good business sense? Anything that increases the value of their processors must be a good thing for them. Or is vectorizing loops so hard a problem that they'd spend more than they'd gain?
--
Fuck the system? Nah, you might catch something.
Heck, my research center has a vacuum tube computer that is faster than ASCI Red + all the flavors of Blue (9000 PPro + 6000 MIPS + 2000 Power3). You see, the trick is in the implementation. If you take 1 wavelength of an analog signal, there could easily be 100,000,000 discrete levels (especially with a 10 kV plate voltage). Fine tuning of the voltage differentiation amplifier would probably quadruple the speed even more. Now we only have to upgrade the holographic scanner for the punchcard readers.
Forget about any of these digital OSes, we even implemented our own ANALinux, which uses OS technology that was originally implemented for the quantum computers that are slow to come about. Except for the fact that the probability wave algorithm in the kernel was reimplemented with the electron wave method (more discrete).
We can't open source it yet, since the whole kernel runs via negative feedback, so it is constantly being upgraded. We could take a snapshot of the loaded kernel image by detaching all the ferrule doughnuts at the same time, but the source would all be in analog stream and useless unless you have another valve box.
It easily interfaces with outside systems even though it is 100% analog inside, due to the (ported) quantum kernel's interface, which utilizes the duality of the wave and sends discrete signals to outside the box. The only problem is the primitiveness of current technology. Since petabit networking has not been implemented, we basically watch the tubes' change in brightness as I/O. Current internet access by outsiders is via our webcam pointed at the tubes.
This OS is totally unhackable since nobody knows how to hack it. Input is via variosistors instead of toggle switches, so all the script gramps who hacked their way into Univacs would not know how to break in.
So all you digiphiles, put your toys down and use the computer that works the way humans do.
I don't know about now, but five years ago, a state-of-the-art code for weather forecasting used spectral approximations (Fourier or Chebychev expansion functions) in the X- and Y-directions (Latitude and Longitude, say) and some high-order (compact) finite difference method in the Z-direction (Altitude). Incompressible (or weakly compressible) fluid flow, extra scalar transport equations for humidity, etc. Fractional step time integration method with the pressure correction equation being solved with a combination of Fast Fourier Transforms (in the X- and Y-directions) and a line-implicit solver in the Z-direction. No idea what they use for turbulence models...an anisotropic Reynolds Stress model maybe or a dynamic subgrid model if they are doing (Very) Large-Eddy Simulations. Things may have changed too...may have dumped the spectral approximations to get more flexibility in modelling surface contours (mountains and such). Spectral elements maybe? Who knows? I do automobiles and combustors for a living.
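To make the "FFT in X and Y, line-implicit in Z" step a bit more concrete: after the Fourier transforms, each horizontal wavenumber k leaves an independent 1-D problem in Z, which discretizes to a tridiagonal system. The sketch below is not taken from any actual forecasting code. It assumes plain second-order central differences (not the compact high-order scheme mentioned above), a model equation p'' - k^2 p = f, and zero boundary values; all function names are hypothetical.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    lower[i] multiplies x[i-1], diag[i] multiplies x[i], upper[i] multiplies
    x[i+1]; lower[0] and upper[-1] are ignored by the mathematics.
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    # Forward elimination.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_vertical_mode(k, f, dz):
    """Solve p'' - k^2 p = f on interior Z points, with p = 0 at both ends.

    ASSUMPTION: second-order central differences on a uniform grid.
    """
    n = len(f)
    off = 1.0 / dz ** 2
    main = -2.0 / dz ** 2 - k * k
    lower = [0.0] + [off] * (n - 1)
    upper = [off] * (n - 1) + [0.0]
    diag = [main] * n
    return thomas_solve(lower, diag, upper, f)
```

Each solve costs O(n) in the number of vertical levels, and the solves for different wavenumbers are independent of one another, which is part of why this layout parallelizes well.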
You build your NUMA box that has 1 fat highway, and it turns up like the subway systems in the metropolitan areas. The whole purpose of hypercube or 5-D torus is to have a shortest path to as many places as possible, instead of hopping onto that megapipe and making a stop at every node to see who wants to get off.
And who is preventing the 10-D cube from having a 100-lane highway? The only limitation is that you end up with the traveling salesman with too many routes to follow (but with enough routes, there is a very good chance your destination is only 1 hop away).
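A rough way to see the hop-count argument: in a hypercube, two nodes are neighbours when their addresses differ in exactly one bit, so the hop count is the Hamming distance between addresses. The sketch below compares that with a simple ring, which stands in here for the "one fat pipe that stops at every node" picture; the ring is an illustrative assumption, not a description of any particular NUMA machine, and the function names are hypothetical.

```python
def hypercube_hops(src, dst):
    """Minimum hops between two hypercube nodes: Hamming distance of addresses."""
    return bin(src ^ dst).count("1")


def ring_hops(src, dst, n_nodes):
    """Minimum hops on a bidirectional ring of n_nodes nodes."""
    d = abs(src - dst) % n_nodes
    return min(d, n_nodes - d)


def average_hops(n_nodes, hops):
    """Average hop count from node 0 to every other node."""
    return sum(hops(0, dst) for dst in range(1, n_nodes)) / (n_nodes - 1)


N = 1024  # a 10-D hypercube has 2**10 nodes

print("hypercube worst case:", max(hypercube_hops(0, d) for d in range(N)))
print("ring worst case:     ", max(ring_hops(0, d, N) for d in range(N)))
print("hypercube average:   ", average_hops(N, hypercube_hops))
print("ring average:        ", average_hops(N, lambda s, d: ring_hops(s, d, N)))
```

This counts hops only. It says nothing about link bandwidth or routing cost, which is where the "too many routes" caveat above comes in.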
Well, 10x could be true for the code these guys may be running. (SPEC is not everything; this is very important for memory-intensive code.) Take a look at STREAM (memory bandwidth bench):
PIII ~ 300MB/s
Alpha DS20 ~ 1300MB/s
And since these systems use EV6 "buses", each processor gets all that bandwidth to itself in multiprocessor systems. But back to SPEC, here are some more numbers.
Published results at www.specbench.org:
(Compaq XP1000 667 MHz) 65.5 SPECfp95, 37.5 SPECint95
(Compaq GS140 700 MHz) 68.1 SPECfp95, 39.1 SPECint95
Informal results (www.novaglobal.com.sg) (these systems have better memory systems than those above):
(AlphaServer DS20 667 MHz) 72 SPECfp95, 38 SPECint95
And you can get a well-equipped system (DS10) from www.dcginc.com for only $3500.
What sort of software is it running? What exactly uses all this power (I know how fiendishly complex weather predictions are...I'm just curious what kind of software exists/is being developed for it...)?
Well, hmmmm, while I am not completely convinced that you aren't trolling, that is a shorter version of the arguments that I have used to get Alphas here at work. The bandwidth issues alone make such a huge difference that for really large data sets you can get close to mainframe class throughput with a UNIX platform and tools and a radically smaller price tag and better performance. Not a mainframe, but able to deal with big, fat pipes really well. Now, if we can get the PC hardware to make 133 or 266MHz 64 bit PCI buses so common that 6 channel LVD or FC-AL (or SSA, if the 320MB/s stuff ever gets released) can really keep the pipes full and we get two more buses on the duals, we could have some nice, Cray-type performance with the bandwidth as well. Now that would be cool.