IBM Demos Cray-Matching Linux Cluster
An anonymous reader sent us a link to an InfoWorld story
where you can read about IBM slapping together an
Open Source Supercomputer
capable of matching a Cray on PovRay benchmarks.
It's basically just a cluster of Xeon based Netfinitys.
Smooth.
Gee, I wonder if this will force SGI to start selling Crays at Barnes and Noble :P.
They're even *emphasizing* the fact that they could just use Linux off-the-shelf to put together a world-class supercomputer at a fraction of the cost.
:-) Yeah, right, for about 5 million and 37 MCSE's on-call to reboot the machines periodically...
Proprietary operating systems are just looking worse and worse... could you do that with NT?
Linux 2.2.2 is in books already? The POV bench
says they were using 2.2.2..... Something fishy
going on here.
Should IBM be doing this? How much have they spent on Deep Blue? Could a Linux cluster equal the Deep Blue for half what a Deep Blue costs?
and while we are on the subject, a link from a post made yesterday.
http://www.brightstareng.com/
laptop cluster?
IBM's really into the PowerPC, I wonder how this set-up would turn out using one of them...
One of the things missed by most of the Linux supercomputers is inter-nodal bandwidth. If you can break the data apart with minimal communications, and you're bound by CPU, not data, then the "cluster" is a great solution. "True Supercomputers" (which are more and more rare) are capable of moving 10s of gigabytes a second thru the CPU using memory that has no latency, etc. Also, things like the SP2 (IBM's BIG toy), the Cray T3E and the Origin 2000 have very high bandwidth I/O interconnections, none of which is less than 1000x faster than "fast ethernet". What the Linux world needs is a SERIOUS I/O solution. Myrinet is as close as it gets, but it doesn't scale past 64 nodes without some pain.
Clusters of machines like this are great for tasks that are highly distributed like rendering. You just hand off a frame at a time to each machine.
Other tasks that aren't so parallelizable (e.g. scientific simulations, where you need to know what happens in the timeslice you're on before you can calculate the next one), won't perform as well.
My Login's Not working...
--PoochieReds
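To make the frame-at-a-time handoff above concrete, here is a minimal sketch in Python. The function name and the round-robin policy are assumptions for illustration only; nothing in the story says how IBM's demo actually distributed the work.

```python
def assign_frames(frame_count, node_count):
    """Deal frames out to nodes round-robin (hypothetical scheduler).

    Frame i goes to node i % node_count. No node needs another node's
    output, which is why rendering farms out so easily.
    """
    schedule = {node: [] for node in range(node_count)}
    for frame in range(frame_count):
        schedule[frame % node_count].append(frame)
    return schedule


if __name__ == "__main__":
    for node, frames in assign_frames(10, 4).items():
        print(f"node {node}: frames {frames}")
```

A time-stepped simulation has no equivalent of this, since step N+1 cannot be handed out until step N has finished everywhere.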
Still, you might be able to get close if you had FDDI or GB ethernet between the nodes. I'd like to see how it performs on a scientific (or engineering) simulation test. Maybe something with fluid dynamics.
Red Hat rules once again. There is no future for these other also-ran distros. If any of them were any good, IBM (and others) would invest in them.
Red Hat is here to stay.
Way To Go Linux!!
An IBM SP/2 is a cluster. It doesn't have the high-performance architecture of a cray. I don't know what they're using these days but it's probably no better than fast ethernet or FDDI.
I don't think you would stand much of a chance getting that to work, unless int13 can handle gigabytes of diskspace, since no ServeRAID driver exists for win95/98 (assuming that they used the IBM ServeRAID controller for this test).
Not to mention that win95/98 does not support SMP.
POVBENCH measures how long it takes to parse and render a specific POV-Ray scene (skyvase.pov) with specific settings and write the resulting image to the disk in chunks of 1000KB. The method in use is ray tracing using floating point math. This method is not the same method used to render Toy Story. The resulting image contains 640 * 480 pixels. For each pixel 1 to 9 rays are traced. In addition to those, reflected rays are traced, too. 3 seconds is awesome. You can download POV and try for yourself.
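For a rough sense of the workload, the numbers in the description above (640 * 480 pixels, 1 to 9 rays per pixel) give these bounds on the primary ray count. This is only arithmetic on the stated figures; it ignores the reflected rays, and the function name is made up for the example.

```python
WIDTH, HEIGHT = 640, 480                          # image size quoted for POVBENCH
MIN_RAYS_PER_PIXEL, MAX_RAYS_PER_PIXEL = 1, 9     # range quoted above


def primary_ray_bounds(width=WIDTH, height=HEIGHT):
    """Return (fewest, most) primary rays traced, excluding reflected rays."""
    pixels = width * height
    return pixels * MIN_RAYS_PER_PIXEL, pixels * MAX_RAYS_PER_PIXEL


if __name__ == "__main__":
    low, high = primary_ray_bounds()
    print(f"between {low} and {high} primary rays")
```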
Check it out at www.top500.org
It's # 113
This is hardly a fair comparison. Let's take a problem that is clearly trivially parallelizable and has marginal internodal communication and then claim that it's a true measure of a cluster.
What about the large memories and IO rates needed to feed a supercomputer?
Anyway, I could spend some more time on this, but in this forum it's just a religious pissing match anyway.
It's amazing what passes for technical merit and critical thinking in the linux community.
They used 36 PIIs in the cluster. Avalon uses 140 Alphas. How fast is that Cray thing anyway?
Oh, and BTW, wouldn't using Alphas be more cost effective than this Intel crap?
this one is just asking for it, so let's not start.
First, don't use IBM's prices! One can get much cheaper dual systems than from IBM. Hardly a reasonable comparison.
And, x86 isn't so bad compared to the alpha when running linux because there aren't any good compilers for the alpha on linux. yeah, one can violate the license agreement and use the Compaq compilers but that's hardly proper.
x86 does very well for the money.
More info...
According to the books sitting on my desk for the SP2 we own, the "maximum internodal bandwidth" is 120MB/s (which is 1.2Gb/sec) and raw data across this will reach this bandwidth. Once you layer IP or PVM or anything else on top of it you're losing data. IP across 100Mb Ethernet won't do 10MB/sec, you're limited by non-deterministic protocols and the overhead.
POVRAY is a useful benchmark for certain applications (those which go totally parallel with minimal internodal communications and which work with small data sets). One of the things we use our SP2 for is mining thru terabytes of data which is all related and requires as much bandwidth as possible between nodes.
I'm not arguing Linux clusters aren't valuable, they are, just not for everything, and not for a lot of the classic problems. In almost all problems 10 CPUs of 1GFLOP are more usable than 1000 at 10MFLOP. Communications between CPUs will always be too slow.
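One way to see the "10 CPUs of 1GFLOP vs 1000 at 10MFLOP" point is a simple Amdahl-style model: both machines have the same aggregate peak, but any part of the job that cannot be split runs at single-CPU speed. The sketch below is an illustration under assumed numbers (a 5% serial fraction, 1e12 floating point operations of work, no communication cost at all), not a measurement of any real system.

```python
def run_time(work_flop, serial_fraction, cpu_count, flops_per_cpu):
    """Seconds to finish a job whose serial part runs on one CPU only."""
    serial = work_flop * serial_fraction / flops_per_cpu
    parallel = work_flop * (1.0 - serial_fraction) / (cpu_count * flops_per_cpu)
    return serial + parallel


if __name__ == "__main__":
    work = 1e12          # assumed job size
    serial = 0.05        # assumed serial fraction
    few_fast = run_time(work, serial, cpu_count=10, flops_per_cpu=1e9)
    many_slow = run_time(work, serial, cpu_count=1000, flops_per_cpu=1e7)
    print(f"10 x 1 GFLOP:    {few_fast:.0f} s")
    print(f"1000 x 10 MFLOP: {many_slow:.0f} s")
```

With zero serial work the two come out identical; the gap opens as soon as any of the job is serial, and real communication overhead would widen it further.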
NT 5.0 will kill all of you lusers off! Just ask any of the branches of the military. I'm going to tell my biatches in Washington to send a cluster of SmartShips(TM) your way!
Ed "Got the pentagon in my pocket" Muth
"SmartShips, NT5.0 are trademarks of the MicroShaft Corporation"
That's interesting. Maybe I should run a search on Big Blue, learn more about it. Thanks!
You get points for enthusiastic Red Hat/Linux support. However, the blatant bashing of other distros, esp. Debian, is bound to lose some attention. And that part about how people would invest in the others if they were any good is rather cliche. Perhaps something along the lines of the other distros being too difficult to configure to trust on a supercomputer. Overall, a C-- post. Keep working, though. With enough practice, anybody can be a first class troll!!!
Proprietary operating systems are just looking worse and worse... could you do that with NT? :-) Yeah, right, for about 5 million and 37 MCSE's on-call to reboot the machines periodically...
This reminds me of the story about one of the first vacuum tube computers in America. I guess the tubes burned out often enough that they had a platoon from the army swapping out bad tubes. The whole room was hot enough that they were dressed in their skivvies.
"Jackson... #37 BSOD'ed again... Reboot!"
"Cramer! hit the switches on 67 and 68!"
"Hey Wilson... you're goin to the brig! You haven't rebooted a machine all day you lazy puke... What's that you got there, a cd? Some sort of penguin band or something?..."
As interesting as this technology is, and as highly as I think of Linux and Beowulf, IBM's demo was extremely misleading. POVray falls under the category of "EMBARRASSINGLY PARALLEL TASK", for which the node-interconnects are only used to distribute the metadata and collect the end result afterwards. Supercomputers need much lower-latency interconnects than Gb ethernet for solving the kinds of problems they are purchased to solve, like FFT and normalization of very large, very sparse 4D matrices. IBM's cluster would completely tank on an actual supercomputer-class problem because the nodes would need to communicate very quickly during computation.
That being said, I think there is a bright future for clusters. Commodity network hardware is getting faster (and latency is decreasing) at an exponential rate, FPGA's are getting faster, denser, and cheaper, making customizable parts cheap and easy to build (i.e., for short interconnects), and the consolidation of ALU, memory, and glue-logic onto the same die will also make custom hardware easier to build. Moreover, clusters have the quality of having memory bandwidth scale linearly with node count, a quality shared to a lesser extent by ccNUMA's and excluded by SMP's/UMA's. Software technology being developed for intelligent use of memory hierarchies (i.e., splitting up a task into cache-sized data sets) will be directly applicable to cluster architectures as well, and will take advantage of this linear-scale aggregate throughput. I think traditional supercomputers will remain top dog for certain problems for which these benefits will not apply, but I also think that the set of such problems will be growing smaller with time, as methods are discovered for effectively applying cluster technology to problems traditionally solved by supercomputers.
--- Guges ---
See the "Trolling" thread currently running in comp.sys.super.
(Can't be arsed to login, too much trouble.)
Note that while gcc's optimization for the Alpha is lackluster compared to that of Compaq's compiler, gcc's optimization on x86 is worse. Cygnus is trying to fix this now, but it remains to be seen if they can. The compiler was really built for architectures with 20+ spare GPR's, and gcc's register allocation et al curl up and die with only 6 GPR's, total, to work with (eax, ebx, ecx, edx, esi, edi). We really need to rewrite huge chunks of gcc some day (and egcs isn't enough of a rewrite).
00:00:06 2466.67 acer
intel celeron 366 MHz
windows 98
An Acer? At 6 seconds?
same test? not sure.
but it is interesting
I work with the SP systems as well and I suggest anyone seriously interested in SP switch performance take a look at the following URL:
http://www.rs6000.ibm.com/resource/technology/spswperf.html#applperf
I don't think you can just give a blanket "the SP switch inter-node communication is such and such". There's a lot more to it than that.
By the way, has anyone considered the impact that low-cost supercomputers will have on the security of encrypted communications? It's looking like it's going to be a lot easier to crack codes.
Deep Blue and Beowulf are analogous. Deep Blue is a special cluster running AIX 4.2, with a very fast switch for passing messages. The shared disk subsystem is also highly optimized.
It is incorrect to compare SMP to parallel clustering. The difference boils down to whether processors share memory or use communication. SMP never scales well for large numbers of processors, because of cost, and that's why Beowulf etc. use clusters. Linux has limited SMP scalability, but SMP isn't nearly as interesting as clustering anyway.
ah, well.. I work with the latest silver nodes and I see much better performance than you report. such is life.
Why the fsck should I care about IBM? I use Debian, it works. My friends use Debian, they like it. There is a whole user and developer community who use it and like it. In which way will IBM hurt Debian by using RedHat? SuSE or Caldera are the ones who might die (worst scenario) since they actually have to make some profit. In fact, RedHat does not hurt the other distributions but benefits them by drawing attention to Linux. More attention to Linux means more applications (which will run on all distributions of course) and it also means there will be more Linux-ready systems available (again it is your choice what to install on them).
Debian (and others) benefit from RedHat the same way the FreeBSD community benefits from Linux. E.g. all those software projects inspired by Linux actually produce software that runs on BSD. KDE, GNOME, GIMP, etc. and all your beloved Linux applications are all available to the FreeBSD community at no cost.
I think in OSS community we all benefit each other.
I think Linux clusters can really erode SGI's supercomputers' market share..
It probably makes sense to take some of the numbers at haveland.com with a grain of salt. So I could say that a cluster of 5 286s running Windoze 3.0 was able to do the benchmark in 2 seconds. It would probably be removed but some of you would believe it. Use some common sense people!
PovRay benchmarks..
Mohaha.. whata joke.
give me LINPACK numbers baby!
The 9 second time was not a single machine - it was a cluster of 10 machines (remember these are PARALLEL results, single results is a different spreadsheet). Their list of hardware can be found at http://WWW.CE.UniPR.IT/research/parma2/
....
The Povbench spreadsheet is not clear on the number of nodes and CPU's used, only the type of CPU used.
Parma 2 had
8 450mhz PII processors
2 PPro 200mhz
4 Pentium 100 machines
I would count the 2 PPro 200's as one 400mhz and the 4 100mhz Pentiums as 1 400Mhz - for a total of 10 400Mhz Pentium II processors.
Also, we did achieve 3 seconds with several nodes down - using 28 processors. To get from 9 seconds to 3 seconds scaling at 100%, they would need to use 3 X the number of processors. Hmmm
10 X 3 = 30
Sounds about right.
Jay Urbanski
Netfinity Systems Engineer
IBM Advanced Technical Support
MCSE, PSE, Certified Solaris Systems Administrator
(817)962-3597 TL 522-3597
(817)962-7307 fax
(800)413-9093 pager
urbanski@us.ibm.com
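The "10 X 3 = 30" estimate above is just ideal linear scaling: to cut the time by a factor k at 100% efficiency you need k times the processors. A minimal sketch follows, with an `efficiency` knob added as an assumption for playing with less-than-perfect scaling. The function name is hypothetical.

```python
def processors_needed(base_processors, base_seconds, target_seconds, efficiency=1.0):
    """Processors required to hit target_seconds, given a measured baseline.

    efficiency=1.0 means perfect (100%) scaling, as assumed in the post above.
    """
    speedup = base_seconds / target_seconds
    return base_processors * speedup / efficiency


if __name__ == "__main__":
    # 10 processor-equivalents at 9 seconds, target of 3 seconds
    print(processors_needed(10, 9, 3))
    # same target if scaling were only 90% efficient (assumed figure)
    print(processors_needed(10, 9, 3, efficiency=0.9))
```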
uh, if the resolution is very small maybe.
Pixar used over 3000 nodes to render their "A Bug's Life" movie, but recall they render very high resolution frames. (4096x4096?)
I guess 800x600 realtime rendering can be done with an ordinary cluster, the frames are only 50-100kb each if JPEG/GIF
If using overclocked Celeron 300A you will get a lot of nodes! 2500 node celery cluster for $150k
I understand the interest of extremely low cost intel hardware, but wouldn't it be good to have fewer nodes and have more processors in each node? I'd think that would make the problem of internode communication start to head downward.
And if we want a multiprocessor node, wouldn't it be good to use Suns, after all Sparc is Scalable Processor ARChitecture.
And if we can have IP over scsi, why not IP over ultra2 scsi, or differential ultra2scsi, or fibre channel?
Compaq CPlant Cluster (150 Alpha/Linux nodes)
Rmax: 54240
Rpeak: 150000
Rank: #97 in Top 500 world fastest supercomputers.
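For anyone wondering what those two numbers say together: Rmax over Rpeak is the fraction of theoretical peak the machine actually sustained on the benchmark. The sketch below only divides the figures quoted above; the units are whatever the list uses, and the helper name is made up.

```python
def peak_efficiency(rmax, rpeak):
    """Fraction of theoretical peak (Rpeak) achieved on the benchmark (Rmax)."""
    return rmax / rpeak


if __name__ == "__main__":
    RMAX, RPEAK, NODES = 54240, 150000, 150   # figures from the entry above
    print(f"efficiency: {peak_efficiency(RMAX, RPEAK):.1%}")
    print(f"sustained per node: {RMAX / NODES:.1f}")
```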
Could you imagine a K-rad b30wulf cluster made from these thi... oh... um...
nevermind.
If it were necessary, they could have used IBM's SP Cluster Switch. It runs at 2.5 Gigabits, I think. They are bringing this technology from the RS6000 SPs to Netfinity clusters.
Your comments are 100% correct. As one of the IBMers who helped set up the cluster demo, I can attest that the story has been somewhat mis-reported. We never made the claim that this was anything other than a CPU-intensive benchmark, and a neat thing to show off. I don't suspect that the DOD or EPA will be doing A-bomb particle drift sims anytime soon on a similar system ~ but it WAS fun to see what could be done with an off-the-shelf copy of LINUX.
One thing to keep in mind is TCP is a lousy way to communicate across ultra high performance topologies... HIPPI+TCP sucks in general. There are lower level protocols necessary to eke out the bandwidth. You can check out some of the research done by Myricom (http://www.myri.com) on their 1.2Gb fabric, and IP sucks for this application. From talking to people at IBM, I get the feeling that they never intended people to use sockets across the fabric! Unfortunately, for many people it's IP or the highway, which can be a very limiting ideology to wrap oneself in.
Yes, but check the specs. Each of the Netfinity machines can be pulled off and run as a stand-alone, with its own SCSI, Ethernet, graphics card, etc.
Each of the Alpha cluster nodes can just function as a node. No SCSI, no graphics card. It's built to perform purely as a node.
Check the pricing for Alpha vs. PII. Crap or not, the Intel chips are cheaper. It's just a matter of how the cards are configured.
Do consider that the Cray T3E-900 has 48x450MHz alphas -- and that it gets matched by the Xeons.
The Microway cluster won't even have the memory
bandwidth of the Cray either. I think what this benchmark shows is the importance of having a large L2 cache running at the core speed of the processor -- since that's where the main difference in architectures lies.
And also how QUICKLY the solution can be put together -- Cray couldn't even install such a system in the time it took these guys to set this thing up.
The current best-foot forward that SGI/Cray has to
offer is the O2000. That's what the ASCI Blue Mountain site has several of, not a T3E. So why didn't IBM compare with that? Because they wanted to be able to say "We beat Cray", having lost the performance war to the ASCI Blue Mountain site.
the notes floating around internally from the few guys that set it up are that RedHat was the one on the shelf.....they'd have grabbed whatever they found.
this literally was 'slapped together' just to see what the heck would happen.
IBM as a whole didn't run out and decide to pick Red Hat as their 'approved' version of linux, the two main guys that set it up found Linux at Barnes and Noble and used it the day before the benchmark was run...
opinions are not IBMs, just my own...
An Anonymous IBM Employee
A lot of people have already mentioned that using network technology (ethernet or myrinet) to interconnect beowulf nodes results in a system that isn't terribly useful for tasks requiring greater bandwidth.
My question is, has there been any work done on alleviating this situation? Would using interconnected PCI slots help measurably? Or will we quickly run into the brick wall that is due to the x86 architecture's poor memory bandwidth?
It might be possible with something like gigabit ethernet but with 100baseT you would run out of bandwidth after adding just a few more nodes to the cluster IBM used to demonstrate.
I know the PowerPC chips don't seem very spiffy when you look at SPEC benchmarks, but you must remember that most Unix boxen have considerably greater memory I/O bandwidth than their x86 brethren. Add this to the fact that IBM's SP clusters are connected with a very high bandwidth pipe. An RS/6000 SP cluster may not look very cost effective compared to a Beowulf, but it would probably still be faster. For absolute performance, Sun and SGI will probably hang on top for quite a while.
As far as I know, Beowulf is passe.
The only companies making x86 NUMA machines are Sequent and Data General. Unfortunately, they are both pandering to Microsoft so we'll probably never see Linux running on any of their boxes.
Any other manufacturers making exotic hardware like this for Linux?
They would have had to compile a new kernel to get Beowulf capabilities, since those aren't in the vanilla kernel.
What about 1000Base-T?????
Max bandwidth is in the same ballpark as the PCI bus
Cards are around $350 now but forget about finding a switch for a decent price
Stupid question but 4XAGP will be the fastest I/O bus on the P.C.
As BeoWulf nodes do not need video would it be possible to make an AGP I/O card to interconnect nodes????
Couldn't they have installed the system from the CD, then simply upgraded to the 2.2.2 kernel (as I'm doing this very moment)?
And yes, penguins do smell a little like fish.
which service are you? you almost sound like a cto
It may be possible. AGP 1X was basically 66MHz PCI. Usually bandwidth is not the problem but latency. The gigabit backplane in a Sun Ultra Enterprise class machine is quite different from gigabit ethernet. Gigabit ethernet still has a lot of latency while the Sun backplane is probably 100 to 1000 times less.
You should look into things like hippi if you really need the bandwidth.
aoeu',.pyf
Get a real keyboard layout!!!
Ya'know, SGI may not be worried but I bet the state department is. Iran, Iraq, China, Pakistan, India have a very nice and cheap way to make a super computer to further their nuclear programs.
Actually, those countries may find many civilian uses for these boxes.
http://www.cray.com
A Cray T3E, which is the fastest massively parallel Cray, runs at up to 1.2 TeraFLOPS, with 2048 DEC Alphas. I believe that there is an upgrade to get it to run 600 MHz Alphas instead of 450's, which would increase the max speed. I remember reading that they put together a machine that did over 1 TeraFLOPS on real code. So basically, the answer to "how fast is a Cray" is "Very".
Forest Godfrey
The bandwidth numbers you're talking about (25 MB/s) used to be true with the network as of 1994; perhaps you have access to an old model of SP?
With the latest model I found more than 100 MB/s.
HIPPI's 100/200 Mbytes/sec would not be achieved on the PCI bus. Hence the question regarding AGP I/O adapters.
Probably never will happen as the market is way too small and unfortunately PCI rules.
I was all hot and bothered by the 1.6GByte/Sec XIO bus speed on our SGI O2000. Started digging into the details on the four channel XIO SCSI card. You guessed it: card internally uses a PCI bus.....
The pc architecture is really hurting for high bandwidth I/O. Every time we purchase a new system for throwing around our multi-GB files we review the NT options, but unfortunately the proprietary expensive unix solutions always win......
uh? since when ms compilers are considered good at x86 code generation? my pgcc is far better and I think Intel's C too is.
Have you actually seen the code that VC++ puts out? It's far better than any GCC derived compiler I've ever seen including various versions of pgcc and egcs. I think that this stems from a few different reasons. First, GCC is designed to compile to multiple processors with the same front-end. There will always be processor specific optimizations that you really have to put into the front end to make them work. It's a trade-off -- performance for portability. Second, as good as the open-source model is, I'm not convinced that it produces optimal results for something like a compiler. Compilers require a lot of cutting edge computer science to make fast, and (donning asbestos suit) people who have such skills generally don't work for free. They get jobs at Microsoft or at SGI/Cray or any of the other compiler vendors where they can pull down serious money for their efforts. There are other major problems with open source compilers, in the area of language features and proper implementation. Egcs has some pretty major C++ bugs in it, and I know all of the OSS wankers love C, but for people who prefer not to live in the 80s, OOP is a real software engineering win, and OOPish C like the GTK people have just doesn't cut it.
I don't know... I just have this odd feeling that a Cray 2048-processor T3e/1200 setup would beat out an RS/6000 to the Nth degree. =)
It just can't be classified in the same category... I mean, it doesn't even make your neighbor's lights dim.... =)
Certainly isn't any worse than whatever genius at Microsoft decided to name their embedded OS "WinCE". I mean, yeah, I wince whenever I think of the thing, but...
And I thought marketing was supposed to be Microsoft's -strong- point...
reconsider that. what you say isn't logical. You're saying that the fact that IBM and others only invest in RedHat proves that the other distros don't have merit, but I could line up thousands of slashdotters who'd argue that debian or suse or whatever kicks redhat in the arse. I'm a redhat user, mostly because that's what I have used in the past, and it fits how I need the OS to install/function. I'd imagine that IBM and others are choosing RedHat to be their Linux prodigy child because it's a smart marketing move. From where I stand, redhat is the frontrunner in the corporate world, and companies will just run with that because of redhat's established name.
The other distros have qualities to them that are better for some people than redhat's distro... IBM picking redhat is purely a marketing move and says very little about the quality of other distros compared to rh.
Posted by Olaimi:
..
What if IBM had this package in a box scheme along with
- UDB (AKA DB2)
- VisualAge suite
- Lotus Notes (Domino)
- e.commerce
- well the list is very long i guess !
I bet Microsoft have no future in corporate IT Departments!
Cheers
Ok, so how fast is this? Does the benchmark measure how long it takes to render a "typical" frame? If so, does that mean it would take (3 seconds/frame * 24 frames/second * 17 nodes) 1224 nodes to render a movie in real-time?
Something like that could make a really cool video game. Of course, in ten years your Playstation will be able to do it.
It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail. - Abraham Maslow
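The real-time estimate in the post above works out as follows, assuming perfectly linear scaling (each extra node adds exactly one node's worth of throughput) and that the 3-second benchmark frame is "typical". Both are big assumptions, and the function name is only for illustration.

```python
def nodes_for_realtime(seconds_per_frame, frames_per_second, nodes_in_benchmark):
    """Nodes needed to render frames as fast as they are shown.

    Assumes throughput scales linearly with node count, which the
    benchmark itself does not demonstrate.
    """
    return seconds_per_frame * frames_per_second * nodes_in_benchmark


if __name__ == "__main__":
    # 3 seconds/frame on 17 nodes, played back at 24 frames/second
    print(nodes_for_realtime(3, 24, 17), "nodes")
```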
IBM has been around long enough that I think they've run out of acronyms! For example, WAS already stands for "warehouse administration system" and "work activity system". And that means it's one of the least used set of initials out there!
I especially liked the part when they said they got Linux from a bookstore the day before. Heh.
"Hey, Bob! We got this nifty IBM cluster here. What'd you want to do with it?"
"Wait a minute. I'll run out and grab a Linux cd from the bookstore down the street."
Hahahahhahaha. That's fucking great!
"shop smart:shop s-mart" ash
But look carefully at that entry for the Dual Pentium IIs, and you'll see that the total cost of the system is listed at $12,000. Either that is one KICK ASS system, or more likely it's a Beowulf cluster of dual P-IIs, and they forgot to mention how many CPUs were involved.
The next Cmdr Taco duplicate will be ready soon, but subscribers can beat the rush and see it early!
That was only a 64-processor air-cooled T3E, the smallest one SGI makes. SGI makes Origin 2000s twice as big (128 CPU SMP) and T3Es *32* times as big (2048 CPU MPP). Also, I'd be more impressed if IBM had pointed to a more widely used (and less temperamental) benchmark than parallel POV-Ray, which ends up being mainly I/O bound. NAS Parallel Benchmark numbers would be nice.
Beowulf clusters are nice if you've got a parallel problem that only scales well out to a moderate number of processors (32-64 at most), or if price/performance matters more than raw performance. They get clobbered if the problem is bounded by I/O, communication latency, or per-CPU memory bandwidth.
Frankly, IBM has a lot more to worry about from Beowulf clusters than SGI does. Their supercomputer class machine, the SP, is just a cluster of rackmount RS/6000s with a very high speed internal network, and it has all the same problems as a Beowulf cluster relative to a more tightly coupled parallel system like a T3E or an Origin. Plus, AIX is eeeeeeevil; IRIX is much nicer IMHO.
And before anybody asks, yes, I work with both traditional supercomputers (Cray T94, Cray T3E/600-136LC, SGI Origin 2000/24xR10-250, IBM SP-2/8) *and* a Beowulf cluster. We've been doing benchmarks to compare our Beowulf to our big machines; in some cases, the Beowulf wins, and in others, the big machines win. It really depends on the problem. We (i.e. my group at OSC) may be announcing some benchmark pages here in a few weeks.
--Troy
"My life's work has been to prompt others... and be forgotten." --Cyrano de Bergerac
Before you keep posting messages about how unfair the comparison was, if you read the article again, you'll notice that the point they want to convey is not how fast the Linux cluster was compared to the Cray, but instead how easy and inexpensive it was to set it up and get running, using just a few x86 boxes (I admit that Netfinities aren't exactly what I think about when I hear the word "cheap") and software that can be acquired for free. Damn, they got the software from Barnes & Noble! How much easier than that can it get?
In Soviet Russia, Jesus asks: "What Would You Do?"
Hell, no, not SCSI for internode communication. Not even Ultra2. Yes, it's considerably faster than Ethernet in data transfer rates, but (a) you can't put a switch on a SCSI bus to avoid having a node wait for another one to finish sending data to start transmitting; (b) a device in a SCSI bus can't arbitrarily send data to another device in the chain; (c) you're limited to up to 16 devices in the bus (I'm not sure about whether recent developments change that limitation, though). It would take a fugly hack to make it suitable for the job. Fibre Channel and FireWire could be better choices, but I don't know jack about them, so no further comments.
Not yet for Linux, but wait for SGI to come up with some ccNUMA stuff in the not-so-distant future... or so it seems to be.
Is the comparison to a Cray fair when you consider
the inter-node communication/bandwidth needs?
In clustering and parallel computation, bandwidth
counts. My guess is that a different application
that requires much more communication between
nodes, the T3E would step on the Netfinitys. 100
megabit ethernet does it for low-communication
jobs, but what about those that require much more
intensive inter-node communication.
---Check the pricing for Alpha vs. PII. Crap or not, the Intel chips are cheaper. It's just a matter of how the cards are configured.---
Oooh graphics cards are expensive aren't they?
You have not covered the performance aspect. Alpha systems have twice the FPU power of any Intel system at the same price, new (per MHz is a little different, but cost-performance is more important than CPI). For clustering, that is very good. Remember that all new Alphas currently have 64-bit PCI slots, such as those used for gigabit or four-port duplex 100bTX cards, reducing the memory system bottleneck and increasing raw comm throughput for parallel cluster/node computing like this. Communication is the key for parallel.
JRDM
The comparison you make is bad... The Cray test was done about 15 months ago AND used older software (POVRAY 2.2 vs 3.02), older compilers, older generation CPU, and no one uses 450MHz Alpha CPUs anymore. Cost wise, a new DS20 dual Alpha computer is less expensive than a new quad Xeon-450 AND outperforms it.
If Microway makes a new cluster, its memory performance/bandwidth will probably multiply by 10 given the new chipset.
Deep Blue runs on PowerPC processors. There's a port of Linux for PowerPC processors.
The cluster may be able to outrun big blue... but if you install Linux on Big Blue, that should speed it up quite a bit i believe.
Then again, I seem to recall something about Linux being not quite THAT scalable, although I could be wrong..
Can Linux handle thousands of processors?
Try http://www.starbridgesystems.com/Pages/technology.html
It's actually information on another supercomputer, but it compares the system to the IBM Blue Pacific.
Under the processors section, it says that the IBM system uses "5,856 Power PC 604 processors"
I think what's interesting about this is what it didn't say -- that the old record of 9 seconds was on a Dual Pentium II. So having *18 times* the number of processors only got it three times the performance...
Any comments on this? Obviously a dual Pentium II is pretty damn good at this too, being only 1/3 the speed of a $5.5 million supercomputer. Anyone have any idea why adding so many processors to the Linux cluster would improve results so little?
That's a point, but I doubt that's the cause in this case. I suspect it's something else. If I was rendering images (as opposed to large amounts of calculations that deal with the results of other sets of calculations), then I'd just split the image up into 36 chunks and have one processor blast through each chunk, and stick them all back together at the end. My assumption is that's how this test works because they mentioned how a few scanlines were dropped when one node went off line. So it's not true PVM-style clustering like Beowulf.
In this case, 18 times the processors should give 18 times the speed -- unless the test really isn't processor bound. I'm not sure what it would be bound by, however... lousy implementation? I/O? Network?
If the test isn't really processor bound then the comparison to the cray is meaningless, because there's something wrong with the way the software is coded to work on parallel machines, I'd think.
I disagree that the $12k cost means it was a cluster of Pentium II's. I've bought a couple Pentium II systems in that range; it's easy to get up there when you add a lot of RAM, a lot of hard drive space, etc. Using name-brand parts jacks the price up a lot. (ie, VA's selling systems with Intel boards rather than Supermicro or some other lower-cost company...)
On a side note, I remember reading a year or two ago that someone was working on a networking layer that allowed IP and other protocols to be routed between cluster machines over a 40MB/sec SCSI bus. Anyone know if that ever got to completion? A four-fold jump in network speed would make quite a difference to I/O bound applications. (And SCSI cards are a lot cheaper than Gigabit ethernet or other real high-speed networking technologies...)
I doubt it...
SGI's NUMA architecture means data can be pumped *much* more quickly between nodes, 100-1000 times as fast. Network-based Linux clustering is useful only for calculations that are fairly self-contained and don't need a lot of data to process.
What I think would be more interesting, given SGI's leanings towards supporting Linux on their MIPS and Intel platforms, is if they eventually tweak the multiprocessing in the kernel to support NUMA style multiprocessing and I can throw Linux on an Origin server. Or maybe better yet a NUMA-architecture Intel machine (I'm not really up on floating point speed comparisons between newer MIPS and newer Intel chips). Since they've dropped real PC-compatibility on their new Intel machines, that sort of a shift is a lot less painful than the initial dropping of support for DOS/16 bit apps.
So SGI doesn't get hurt by Linux. Linux *can't* really compete with a Cray at any real-world tasks (not yet...). And SGI is in a *real* good spot to be the ones selling the Linux-compatible hardware that actually could. In which case, why would they care? Their profit may be lower on a $500k Cray-comparable NUMA Linux system than on the Cray, but I'd bet they'd sell enough more of them to make up the difference.
Time will tell.
Ah, that clears it up. It's always good to hear the real deal from the source. :)
...which is equivalent to loading a service pack in the Windows world. How many people run stock NT 4.0? Even clueless users know better: they make sure to keep up with service packs. Using RH 5.2 plus updates is as close to "off the shelf" as a standard production line Windows system.
--Lenny
Clusters are definitely a better idea for something as blatantly parallel as non-real-time rendering. Commodity parts are vastly more cost effective. However, this is a very special case. Most applications require far more bandwidth for heavy interprocess communication. Try such an application on a Beowulf-type system and you can watch it fall flat on its face. Suddenly computation is I/O bound, and the Cray really earns its keep.
Things like Crays are expensive mainly because they have very special, very fast hardware for this purpose. It may be extraneous hardware for something straight ahead like a render farm, but there are many cases where such massive bandwidth is very necessary. Thus, for most applications, replacing a Cray with a Beowulf cluster just isn't an acceptable solution.
Beowulf clustering has been proven to be a cost-effective non-real-time rendering system, however.
--Lenny
I'm still running 2.0.36 but I believe a 2.2 kernel rpm is available. If their copy of Linux was Red Hat based couldn't they have bought it off the shelf, installed it, and then downloaded and installed the kernel rpm? I'd say that if they have to get a single file via free download and type 'rpm -i' that still counts as off the shelf.
I was most impressed with the graceful failover. Unplugging one machine and having nothing more dramatic than a slight delay in one portion of the result is the kind of presentation that really makes an impression with "results" type people - you know, the ones who say, "I don't care how it works. Just show me that it does work."
I'm also pleased with IBM's recent decision to release their Websphere Application Server on Linux - although the person in marketing who thought up that name should be demoted. The acronym is "IBM WAS." Both passive and past-tense. Sheesh!
I noticed some of those results too. I am thinking that a governing body isn't involved with this process. Several other VERY strange results permeated the benchmarks. I downloaded the results and will keep em around just for kicks.
Food: It's whats for dinner
Wouldn't it be possible to export all of your Linux header files over to, say, a windows box, compile the code with VC++, then link it against the appropriate libraries?
People did this with BeOS when CodeWarrior/x86 was spitting out terrible machine code. Say what you must about M$, but their x86 codegen kicks serious butt.
For alphas, how about building the code on DEC Unix with the cool compilers?
Shouldn't your 13-year-old ass be in school right now?
Get a clue.
.
we did it off the shelf. we're a redhat and kernel mirror, so we built our own 2.2.x kernels on top of a 5.2 install. we tweaked, of course, but hey, if you are building a supercomputer or a cluster, you better get set to tweak.
100bT and a switched hub, DEC Tulips bought on sale, donated hardware, etc... we paid approx $400 for our 8 node cluster.
i mean, who the heck wants to drop in a 2.2.x kernel rpm? c'mon! sources have been out for a while....
jose nazario jose@biocserver.cwru.edu
Where does one find software capable of clustering? I've looked at MOSIX, which seems like it might work. But I've heard a lot about Beowulf (excuse the ignorance of this next part). Is Beowulf a software package I can download somewhere? What other options are there for Linux, OpenBSD, or FreeBSD? Thanks..
It's misleading to say that Deep Blue runs on RS/6000 processors. The chess engine is all in custom VLSI chess processors, the RS/6000 just acts as a control processor, which isn't particularly interesting as a supercomputer application.
The failover is part of PVMPOV, and has nothing to do with the configuration of the systems. It was coded this way because it used to run on a room full of unreliable machines that ran at wildly different speeds that other people were using, and even if the machine didn't die, it was possible that a system was busy with other things at the time, so I didn't want to wait for renderings to finish when other CPUs were idle... Check out the PVMPOV Home Page for more info. Yes, 32s was impressive at one time (it used to be at the top of the list).
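For anyone wondering what that kind of scheme looks like, here is a toy sketch of the general idea: hand out blocks one at a time and put a block back in the queue if the worker holding it dies. This is not PVMPOV's actual code. The round-robin hand-out, the `fails_on` table, and all names are assumptions made so the example runs on its own:

```python
from collections import deque

def render_with_failover(blocks, workers, fails_on):
    """Simulate handing out blocks; re-queue a block if its worker dies.

    blocks   -- iterable of block ids (e.g. bands of scanlines)
    workers  -- list of worker names
    fails_on -- dict: worker name -> block id on which that worker dies
    Returns a dict mapping each block id to the worker that finished it.
    """
    pending = deque(blocks)
    alive = list(workers)
    done = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left to finish the render")
        worker = alive[turn % len(alive)]   # stand-in for "next idle worker"
        block = pending.popleft()
        if fails_on.get(worker) == block:
            alive.remove(worker)            # worker is gone for good
            pending.append(block)           # its block goes back in the queue
            continue
        done[block] = worker
        turn += 1
    return done

# Worker "b" dies while rendering block 1; the block is finished by someone else.
done = render_with_failover(range(6), ["a", "b", "c"], fails_on={"b": 1})
print(done)
```

The render still completes, and the only visible effect of the failure is that the lost block shows up later than its neighbours.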
First, a few posters forgot their reading glasses and failed to notice that there were 17 machines, each of which had 36 PIIs in them. If you could have done that with 17 machines having only ONE PII, THAT would be news.
Second, the article says they used Xeons. I don't know about what prices IBM gets, but the cheapest I could find a Xeon was about $700. At this price, just the Xeons would cost half a million. The $150k price tag on this setup is just unbelievable, unless either (a) I really misunderstood how many processors they have, and/or (b) $150k was just what they had to buy in addition to what they already had lying around.
So that would be what? 16 2-headed Xeons and a 4-headed Xeon? 6 4-headed Xeons, 1 2-headed Xeons and 10 1-headed Xeons? I figured since it was IBM they could come up with practically anything on short notice. Wait...what am I SAYING? Anyhoo, nowhere does it say 36 total Xeons. Also nowhere does it say 36 for each server. I just did the math and balked at 2.11765 processors per server.
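For the curious, a small sketch that lists every way to reach 36 CPUs across 17 servers, assuming only 1-, 2- and 4-way boxes are on offer (that restriction is my assumption, taken from the guesses above):

```python
def server_mixes(servers: int = 17, cpus: int = 36):
    """Yield (one_way, two_way, four_way) counts using every server and every CPU."""
    for four in range(servers + 1):
        for two in range(servers - four + 1):
            one = servers - four - two
            if one * 1 + two * 2 + four * 4 == cpus:
                yield one, two, four

mixes = list(server_mixes())
for one, two, four in mixes:
    print(f"{four} four-way, {two} two-way, {one} one-way")
```

Both guesses above (16 two-ways plus a four-way, and 6 four-ways plus a two-way plus 10 singles) turn up in the list.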
ts.ts. 36 Xeons overall. read carefully next time.
aint no 36 headed xeons arounds.
<^>_<(ô ô)>_<^>
Overall it depends on the application. GCC is not that bad on Alpha integer. The performance loss is mostly in floating point and math libraries.
This shouldn't be true for much longer. Compaq released their math library for the Alpha last week (see here for details), and, according to posts to comp.lang.fortran, they will be releasing their Fortran compiler as well (as a commercial product, not for free). This should make Alphas much more appealing for cluster use.
-jason
In vectorcomputing, each node computes a very small part of the big picture, making communication time a very big (or small in this case) bottleneck. Thus these computers need super fast, specialized networking connections.
Umm, I think you're confused. Vector computers, such as the older Cray machines, use special vector processors that can operate very efficiently on long vectors of data, applying the same operations (hence the name). Things get a little more confusing with later machines which are actually parallel-vector computers (i.e. they had multiple vector processors that worked in parallel).
It is generally accepted that parallel computers, whether they are "big iron" type machines, such as the T3E, Origin 2000 or SP2, or clusters of workstations and PCs, are the way to go for high-performance computing. Of course, some people would point to the latest vector machines from Japan to contradict this...
You are right that the true measure of performance is based on applications, and there are applications suited to each of these architectures. We have found that for our problems (large-scale reservoir modelling) clusters of commodity PCs perform quite well in comparison to an SP or T3E, even with 100 Mbps networking, but there certainly are other applications with more fine-grained communication requirements for which even a T3E or O2k is barely sufficient.
And a very expensive one:
Check www.microway.com for an Alpha cluster priced at $2,500 per node and $4,500 for the master console. This means that for the $150,000 used by IBM one could assemble a 50+ node Alpha cluster instead of 17 PCs...
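The arithmetic, as a quick sketch using only the prices quoted above. It assumes one master console and ignores switches, cabling and anything else not mentioned:

```python
BUDGET = 150_000        # figure quoted for the IBM setup
NODE_PRICE = 2_500      # per Alpha node, as quoted above
CONSOLE_PRICE = 4_500   # master console, as quoted above

# Whatever is left after the console goes to compute nodes.
nodes = (BUDGET - CONSOLE_PRICE) // NODE_PRICE
print(nodes)
```

That leaves some headroom under the "50+" figure for network gear.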
God, when will people ever learn that x86 just is not worth it...
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
Microway has NDP compilers and libraries that are comparable to Compaq's. Unfortunately not all of them are available for Linux. Unfortunately they also cost money.
;-)
Still, I would bet that you can actually get a very decent special deal if you purchase all the stuff together.
Overall it depends on the application. GCC is not that bad on Alpha integer. The performance loss is mostly in floating point and math libraries.
Anyway, I will bet for the Alpha for most of the cases
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
Sorry dude. This does not scale. You are going to start running into some real network equipment problems and heavy expenses after 16-24 nodes.
So getting a bunch of sloppy boxen is not a good idea. There has to be a compromise between box speed, box quantity and price of network equipment.
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
Er, the IBM SP2 inter node bandwidth is nothing to write home about.. I'm working with these beasts, and the SP switch can do around 30 megabytes/second at maximum. If you reach 25 MB/s in real life then you're lucky. I haven't heard about any big improvements on that speed on the newest models either, in any event they must increase that speed by several orders of magnitude if you want to compare with e.g. Origin 2000. And considering the price of an SP2 rack the price/performance is, eh, interesting..
TA
Oh I have read that page, and I have used all the tuning tricks in the (IBM) books, and the SP switch bandwidth *still* sucks. Other companies we work with on a big project have done a lot of testing as well, and the TCP bandwidth is just bad. As I said, if you get 30MB/sec in real life then you're good (and don't even think about UDP, that's really terrible). Now, we're not using the latest and greatest hardware, the nodes we use are 133 MHz. But we don't see much improvement from the 66 MHz nodes.
No, I'm not impressed by the SP switch. And besides, it's a terrible beast to work with.
TA
Yeah, we're using slightly old models, as I mentioned in another posting. That's the deal with IBM, however I didn't think they're pushing 1994 models on us! The first rack had 66MHz nodes and the HP switch, the next (which came a couple of months later) had changed to the SP switch (less reliable from our experience btw), the newest nodes are now 133MHz which are still a bit behind the specs you can find on the latest and greatest.
It's interesting that you have measured 100MB on the latest equipment, the application should in theory be running on new hardware when it gets operational. It's very useful to have an idea of how the switch will perform, so thanks a lot for that info.
TA
It's amusing to note that IBM didn't compare the Linux cluster to its own hardware. I'd be curious to know how a 12-way RS/6000 S70/S7A running AIX under HACMP would stand up to the Netfinity/Linux assault.
InitZero
Well, (this is just speculation) I don't think they just used one Dual PII. I'm mainly making this guess based on the cost ($12,000). Perhaps the Dual PII is what one node is, and they have several of these nodes?
---- Dave
I use Red Hat myself (well, Mandrake, but it amounts to the same), but what Big Blue have done is exactly what you have just said: they have highlighted the open source community in a way that no one else has the profile to do... The reason they used Red Hat is because it's the distribution that was in the back of the book they bought, not because it's any better or worse than any other distribution or UNIX derivative. I wouldn't have been surprised to see the same article around FreeBSD, but it was Red Hat and Linux... and mighty top stuff it was too!!
Notice that 5 of the top 10 systems were Linux based - from http://www.haveland.com/cgi-bin/getpovb.pl?search=Parallel%3A&submit=List+all+Parallel+Results.
D
It's amusing to note that IBM didn't compare the Linux cluster to its own hardware. I'd be curious to know how a 12-way RS/6000 S70/S7A running AIX under HACMP would stand up to the Netfinity/Linux assault.
This demo was done at a Linux show. Linux on RS/6000 is still a work in progress, so it is not surprising that they aren't ready to show that. Doing a demo with AIX at that show would have been a political faux pas.
bnf
this space intentionally left blank (oops)
The graceful failover is a red herring to me. It was ascribed to IBM's "X architecture" but it sounded more like an application level adaptation.
Bogus:
Neat but not really. A cluster of PCs connected over fast ethernet is not as flexible as a Cray. On the other hand, a Cray is a waste of money for rendering.
There has for some time been a rule of thumb that adding a second (or third, or 17th) processor to a problem doesn't get you double the performance, because there is overhead deciding "which processor is going to do what."
What this means is that at some point, adding parallel processors to a problem ceases to be a cost-effective answer.
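One common way to model that rule of thumb is Amdahl's law, where some fixed fraction of the job cannot be spread across processors. That is my choice of model, not something claimed above, and the 5% serial fraction is an arbitrary illustration rather than a measured value:

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup over one processor when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# Hypothetical job where 5% of the work is serial bookkeeping.
for n in (1, 2, 4, 17, 36):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

With any nonzero serial fraction the speedup can never exceed 1 / serial_fraction, no matter how many processors are added, so each extra box buys less than the one before it.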
See my other note on the main thread (this one got submitted first, so be a little patient!!) as to what I think is of greater long-term significance to the Linux world.
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
This demonstration is all about mind share -- something that Microsoft doesn't want Linux to achieve. Let me explain.
For many years, Cray built absolutely the highest performing mainframe number crunching computer in the world -- and every computer scientist knew it. We used to joke about having our own desktop Crays -- if someone would just lend us (in this case) $5.5 million per workstation. So here's the point that IBM was really trying to get across to the corporate IS people out there -- that Linux is competitive with anything Microsoft can produce. Follow the steps:
- Take off the shelf hardware (in this case, IBM Netfinitys).
- Take a common Linux distribution (Red Hat, from the back of a book purchased at Barnes and Noble).
- Give a set of assumedly talented engineers an EXTREMELY limited period of time (they bought the book ONE day before Linux World) to:
- install and configure a set of 17 parallel machines,
- set up the network,
- install the software,
- Tune the installation...
[Note: in my book, just setting up the machines to run in parallel in the time they did is awesome enough!] But IBM chose this specifically to demonstrate to the world that this relatively inexpensive Linux cluster could match the performance of a Cray. Whether or not it could be done with NT-based machines misses the point.
Although I'm not always a fan of Big Blue, in this case we should all thank them for a great job in once again proving the power of Linux to the rest of the computing world.
Take that, Microsoft!!
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
I have not actually built any Xeon cluster (or Celeron cluster for that matter), but when you run an application like povray, why the heck would you need such a large L2 cache anyway? The bottleneck is still with the FPU and I/O activity such as the network. FPU-wise, a 450MHz Xeon does not outperform an o/c'ed 450MHz Celeron.
It's math - you require double the ponies to get the time in half - and it gets worse as computational times approach zero. For an analogy, in 1980 a TF Dragster could do the 1/4 mile in just under 6 seconds, with ~2000HP. Today, it takes 6000 HP to do the 1/4 in 4.5. It would take 9000 to get under 4 they say - and so it goes.
"Depression is merely anger without enthusiasm." - Anonymous
My brilliant coworkers woulda loaded Win98 on it with no hesitation.
Last year Jim Gray (now at Microsoft Research, bleh) was out here at Stanford giving a talk on what he thought the future of computing was. He seemed to think that clusters were the way to go--- when I asked him about the latency issue, he seemed to be of the opinion that all the interesting computational tasks of the future _were_ "embarrassingly parallel", and that anything that wasn't was pretty much good already.
:)
I'm not sure I agree with that assessment... But I'm just a dumb systems researcher. What do I know about applications?
Everyone knows that a cluster won't perform well for computations that can't be easily parallelized without massive internodal communication - no one would use a cluster for these types of problems. The point is that it _is_ a fair test for the types of computations that you _would_ use a cluster for. For these types of applications, you're better off spending $150,000 for a PC cluster than millions for a Cray.
Just in case anyone was in a coma all last year and doesn't know what Avalon is, here's the link.
I wonder if the Avalon folks ever tried anything as trivial as Ray Tracing.
I am not your blowing wind, I am the lightning.
Ok, someone who actually knows what (s)he's talking about.
There's nothing special about this news other than the fact that the individual nodes are running linux. Which basically makes this an SP2 minus the superfast network (and the dent in the wallet).
The measuring stick for all computer hardware issues is application. There is supercomputing (vectorcomputing) (like Cray, traditionally) and there is parallel computing (like any old cluster of workstations). The distinction is the type of operation. In vectorcomputing, each node computes a very small part of the big picture, making communication time a very big (or small in this case) bottleneck. Thus these computers need super fast, specialized networking connections. There are many problems/programs which may be parallelized and yet, still have a significant sequential segment, causing the bulk of the processor cycles to be spent on processing, as opposed to waiting for data communication.
The problems described by the latter are becoming more and more popular. Vector computing, however, is primarily core scientific applications (physics, math, weather prediction, etc.) which have not seen dramatic computational advances in the last decade.
An SP2 is sort of in the middle of the spectrum since on top of having high powered nodes, it has a fast network. A COW running Linux catches the bottom end of this spectrum: its nodes are high powered but its network is slow. With ethernet, fast ethernet, or ATM it could never match the performance of Crays or Connection Machines. But then what do you expect for $2000 a node?
Also
I concur.
(by the way, I bet I could probably piss further than you)
Whatever. These are old school Alphas...who would build a new Alpha 450 now?? Besides...not to knock Linux clusters (our ACM chapter just brought one online), but this is kind of a bad comparison...as this kind of stuff doesn't show the HUGE difference in internodal bandwidth between these two systems. If you get something that needs a lot of talking between nodes going, the Cray would pretty much rape the cluster like no tomorrow...the latency on switched fast Ethernet (even Gbit Ethernet) just can't compare to these whack (and horrendously expensive) supercomputer interconnection systems.
CJK
I don't know if that's a fair statement...
Sure, the architecture of the system may be old school, but that's not why people set up Beowulf clusters...they buy them for the untouchable price/performance for coarsely-grained problems. End of story.
Don't worry, though...Linux development won't stand still...we'll see changes in the future to allow for more flexible architectures.
CJK