Fermi's 2000 Node Beowulf Cluster
Our eagle eye Chris DiBona is stuck in Chicago this week at Comdex, but he did yell to us that Fermi Labs has announced a 2000 Node Beowulf cluster.
If anyone can send me some concrete data, I'd appreciate it.
Typical high energy physics experiments do not need the fine-grained parallelism supplied by systems like Beowulf. An interaction is recorded as 5 Kbytes to 100 Kbytes of information, which is called an event. All of the data for one event is sent to one processor. The next event is taken and sent to the next processor. There is no need for direct communication between those processors, so the network topology is simple. The new experiments could use 2000 nodes profitably. The compute time for many types of jobs is large compared to the time to get data into the processors. It may take seconds to finish one event, so the bandwidth into one machine only needs to be hundreds of Kbytes per second. What is needed is a queuing system to send events to processors as they become available.
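The kind of queuing system described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for separate machines, and `process_event` and `run_farm` are hypothetical names, not anything an experiment actually runs. The point it shows is that each event goes to whichever worker is free next, and workers never exchange messages with each other.

```python
import queue
import threading


def process_event(event: bytes) -> int:
    """Hypothetical stand-in for per-event analysis: just sums the bytes."""
    return sum(event)


def run_farm(events, n_workers=4):
    """Hand each event to whichever worker is free next.

    Workers only talk to the shared queue, never to each other.
    """
    work = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more events for this worker
                break
            event_id, event = item
            value = process_event(event)
            with lock:  # protect the shared results dict
                results[event_id] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for event_id, event in enumerate(events):
        work.put((event_id, event))
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Twenty fake events of about 5 Kbytes each (assumed size, low end of the range).
    fake_events = [bytes([i % 256]) * 5000 for i in range(20)]
    out = run_farm(fake_events, n_workers=4)
    print(len(out), "events processed")
```

On a real farm the queue would live on a dispatcher machine and the workers would be separate nodes, but the shape of the problem is the same: one queue in, independent results out.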
High energy theory calculations can use the fine-grained parallelism of Beowulf, but I doubt they would try to build a cluster as big as 2000 nodes.
I admit I'm not too familiar with everything, but I was wondering where the parallelism/load balancing in the programs they run on massive clusters like this comes from?
I work with a research group at my school which develops an architecture and a programming language designed to explicitly tell the program what functions you want run in parallel and where. The language is, for lack of a better description, rather complicated relative to C, which it's based on. We have ports for SMP machines, Beowulf clusters, SP2s, etc., but this is the only distributed multithreaded implementation I've been exposed to, and since ours is not public, I was wondering what everybody else was using. Pthreads can't be used for a distributed system, can they? Thanks for any info.
--
Don't get your panties in a bunch - there was obviously a typo in the person's email, probably due to the '0' key sticking.
Yes, I gave a tour to several students from my alma mater in Stillwater, MN.
No, I didn't show them a 600+ node cluster of old 486s (I can't even think of one 486 on site - they are there, I just don't know about them). I did show them the 20 node Run II prototype farm, the 10 node SAM farm and the production 37 node farm. We don't do Beowulf clusters. We do compute farms. Why? They are 2 totally different things. A Beowulf cluster is specifically designed to analyze a little bit of data with a lot of message passing between processors and nodes. This is for tightly bound compute processes. A farm is a cluster specifically designed to crunch huge amounts of data with absolutely no message passing between processes, CPUs, or nodes.
Quick physics lesson on what Fermilab does:
We take protons and anti-protons and accelerate them to darn near the speed of light. Then we collide them in the Tevatron at particular points on the ring that have detectors (currently CDF and D0 (D-Zero)). Each detector has about a million individual detectors and the amount of data that is actually produced is about a million megabytes of data per second... no, we don't keep it all! The 1st level trigger throws most of the data away 'cause it just isn't interesting. The second level trigger throws some more away. The 3rd level trigger, which will be a 128 node, dual CPU Linux farm, will throw some more away, but it will pass something like 20-250 MB/sec on to the tape library. This number is dependent on the budget allocated for tapes. Roughly, we'll be saving 1.5PB/year (that's petabytes, or 10^15 bytes) to tape. After a while, the physicists will want to take a better look at all that data. How? Linux farms. You see, the data that is taken, say, at 2:30:43.8238 on April 23, 2001 will have nothing to do with the data that is taken at 2:30:43.8239 on April 23, 2001, so it's safe to treat that as one individual data set. By doing so, we can put it on a single processor and let that processor grind on it for a while before spitting out the answer. Well, so the question is, should we send out one single data set to a worker node in the farm, let it analyze it, then send it another, or should we dump A LOT of data off to the worker node and let it grind for a long time before writing the finished data back to tape? It appears that it's best just to send a big ole chunk (~15GB) of data off to a node, and let it grind for some large (~24 hours) amount of time.
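For a feel of what those numbers imply, here is a back-of-the-envelope sketch. It only uses figures from the text (1.5 PB/year, ~15 GB chunks) plus two assumptions of mine: decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), and an average taken over a full calendar year. The accelerator doesn't run every second of the year, so the rate while actually taking data would be higher than this average.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s, ignoring leap years
BYTES_PER_YEAR = 1.5e15             # 1.5 PB/year to tape, from the text
CHUNK_BYTES = 15e9                  # ~15 GB per chunk, from the text


def average_tape_rate_mb_per_s(bytes_per_year=BYTES_PER_YEAR):
    """Average rate to tape in MB/s if the writing were spread over a whole year."""
    return bytes_per_year / SECONDS_PER_YEAR / 1e6


def chunks_per_year(bytes_per_year=BYTES_PER_YEAR, chunk_bytes=CHUNK_BYTES):
    """How many ~15 GB chunks one year of saved data breaks into."""
    return bytes_per_year / chunk_bytes


if __name__ == "__main__":
    print(f"average rate to tape: {average_tape_rate_mb_per_s():.1f} MB/s")
    print(f"chunks per year:      {chunks_per_year():.0f}")
```

The average comes out a bit under 50 MB/s, which sits inside the 20-250 MB/sec range quoted above, and a year of data is on the order of a hundred thousand chunks.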
The "plan" for this year is to purchase about 200 (that's hundred) dual P-II/III machines, depending on our budget. The year after that the "plan" is to purchase about 400 machines, and the year after that (when Run II starts in the Tevatron) the "plan" is to purchase another 400 machines. So, let's do some math - (200+400+400)*2 = 2000 _processors_, but only a thousand nodes.
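The same tally in code, with a rough estimate of how long a full pass over one year of data might take. The purchase numbers, 1.5 PB/year, ~15 GB chunks and ~24 hours per chunk are from the posts above. The assumption that each processor works on one chunk at a time is mine (the text says "node", so halve the throughput if it's really one chunk per node), and the function names are just for illustration.

```python
PURCHASES = [200, 400, 400]  # machines bought per year, from the text
CPUS_PER_NODE = 2            # dual P-II/III machines


def farm_size(purchases=PURCHASES, cpus_per_node=CPUS_PER_NODE):
    """Return (nodes, processors) once all planned purchases are in."""
    nodes = sum(purchases)
    return nodes, nodes * cpus_per_node


def days_to_reprocess(total_bytes, workers, chunk_bytes=15e9, hours_per_chunk=24.0):
    """Days for `workers` independent workers to grind through `total_bytes`.

    Assumes one chunk per worker at a time and no idle time between chunks.
    """
    chunks = total_bytes / chunk_bytes
    return chunks * hours_per_chunk / workers / 24.0


if __name__ == "__main__":
    nodes, cpus = farm_size()
    print(f"{nodes} nodes, {cpus} processors")
    print(f"one chunk per processor: {days_to_reprocess(1.5e15, cpus):.0f} days")
    print(f"one chunk per node:      {days_to_reprocess(1.5e15, nodes):.0f} days")
```

Either way it's weeks to months per pass over a year of data, which is why nobody cares about message passing here and everybody cares about how many boxes fit in the room.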
Yes, the cabling is a problem.
Hope that clears up some confusion.
Cheers,
Dan