Slashdot Mirror


Fermi's 2000 Node Beowulf Cluster

Our eagle-eyed Chris DiBona is stuck in Chicago this week at Comdex, but he did yell to us that Fermilab has announced a 2000-node Beowulf cluster. If anyone can send me some concrete data, I'd appreciate it.

15 of 88 comments

  1. Hmmm by Russ+Steffen · · Score: 2

    Anyone know what they are using for the backplane fabric? I wouldn't imagine switched 100Mb ethernet would work too well for that many nodes. I wonder if they are using Myrinet or gigabit ethernet.

  2. Yes, they have 2000 nodes by gavinhall · · Score: 2

    Posted by !ErrorBookmarkNotDefined:

    The trouble is there's this bug, see, and the cluster thinks there's only 1900 nodes for some reason. Any ideas?

    -----------------------------
    Computers are useless. They can only give answers.

  3. What's the network infrastructure? by Jerky+McNaughty · · Score: 2

    I'd be most interested in knowing how they'll be hooking those 2000 nodes together. That could get really expensive. Sounds like a cabling nightmare! :-)

  4. time for some infrastructure integration by xeno · · Score: 2

    With that volume of hardware & wattage, it probably makes sense to invest in some good heat exchange systems in the facility and toss the traditional HVAC.

    Why not locate the facilities in more extreme northern/southern/high altitude climes, and put all that heat to good use? The model waste treatment plant in Seattle uses its own methane to power the waste treatment systems, and even sells electricity back to the city. Apply the same idea to a computing facility -- if you know the average heat production of a system (over a large enough population the variances between systems wouldn't be that great even as you perpetually upgraded individual systems), it wouldn't be that hard to design it into a facility HVAC plan and make very efficient use of it.

    --
    I think not...(*poof*)
  5. x86 scales up poorly because of power requirements by maynard · · Score: 2

    Wow. That must consume something like 250KW of power for CPU alone, never mind air conditioning power requirements. We have a ~300 machine Linux/Solaris/Irix cluster in my department and I can tell you the AC and UPS support costs are pretty serious.

    PPC might be a good challenger to x86 for this market, and ARM is an even better contender from a power consumption standpoint, but those new PlayStation 2's will supposedly support firewire and very fast single and double precision floating point. Sony has said that they will support a Linux software development kit for the PS2 (though probably a cross compiler on x86 with a cartridge2PC dongle)... but if Linux gets ported to the PS2 hardware directly (and Sony provides a memory upgrade kit), I could see these in scientific and image rendering machine rooms all over the world. If not the PS2, then something like the ARM based Netwinder which consumes an astonishing 14 Watts of power per unit... unfortunately the Netwinder has very poor floating point performance.

    What we need is a new machine which is small, consumes low power, and does floating point like mad. If the PS2 draws less than 20 Watts per unit, it would really fit that bill!

  6. x86 scales up poorly because of power requirements by Jeffrey+Baker · · Score: 2

    You would be right except for one thing. Next to a superconducting supercooled particle accelerator, a few KW doesn't seem like a very big deal anymore.

  7. Hmmm (summary of Fermi's computing architecture) by mcelrath · · Score: 2
    The use of these "farms" is a little unusual compared to other "supercomputers". The ratio of computation to communication is large. Basically each computer is given one proton-antiproton collision and all its associated data to play with. It crunches this for a few seconds, and sends the results back for storage on tape. Nodes do not communicate with each other, so a massive network structure is not needed. Still, I know it's not going to be straight ethernet. ;)

    BTW, I saw another poster mention 1 TB of data a second. Realize that there are 3 levels of "triggers" designed to isolate interesting events (1 event = 1 proton-antiproton collision). The full 1 TB doesn't reach the computing cluster (the 2000 nodes are probably the Level 3 trigger system and/or batch reconstruction). From my brief perusal of Fermilab's Farms page it seems that they will use several I/O PCs connected via fast or gigabit ethernet to a gigabit ethernet switch, which will be connected to the farm. The switch will also be connected to the central mass storage system.

    -- Bob

    --
    1^2=1; (-1)^2=1; 1^2=(-1)^2; 1=-1; 1=0.
  8. 2000 seems a bit high by Spaceman7 · · Score: 2

    From my tour of Fermi on Friday, I recall seeing a 600+ CPU custom cluster of low-end 486 class machines, and a ~37 node dual PII 450 cluster, as well as some SGI stuff. I think I was told they would be purchasing a few hundred dual PII's, meant for clustering, each year for the next few years. I don't remember the specific numbers, but I know for a fact they are Intel machines - because they are the biggest bang for the buck.
    Still, it's a lot. They run a custom version of PVM, which processes part of the 1 terabyte of data that is collected every second.

  9. Linux clusters in particle physics... by rdm · · Score: 2

    We at CERN do basically the same as the guys at Fermilab. To find out a bit more on how clusters are used in particle physics see http://hp-linux.cern.ch/.

  10. Probably not Beowulf but 2000 node farm likely by bquark · · Score: 3

    Typical high energy physics experiments do not need the fine grained parallelism supplied by systems like Beowulf. An interaction is recorded with 5 Kbytes to 100 Kbytes of information, which is called an event. All of the data for one event is sent to one processor. The next event is taken and sent to the next processor. There is no need for direct communication between those processors, so the network topology is simple. The new experiments could use 2000 nodes profitably. The compute time for many types of jobs is large compared to the time to get data into the processors. It may take seconds to finish one event, so bandwidth into one machine only needs to be 100's of Kbytes per second. What is needed is a queuing system to send events to processors as they become available.
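
    The queuing idea described here can be sketched in a few lines. This is only an illustration, not Fermilab's actual software: threads on one machine stand in for farm nodes, and the names reconstruct, worker and run_farm, the event layout, and the "sum the hits" reconstruction are all made up for the example. The point it shows is that each worker pulls the next event when it is free and never talks to another worker.

```python
import queue
import threading


def reconstruct(event):
    """Hypothetical stand-in for per-event reconstruction: just sums the hits."""
    return {"id": event["id"], "energy": sum(event["hits"])}


def worker(events, results):
    """Pull events until the queue is drained; no worker-to-worker messages."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return
        results.put(reconstruct(event))


def run_farm(event_list, n_workers=4):
    """Dispatch events to whichever worker becomes available next."""
    events, results = queue.Queue(), queue.Queue()
    for event in event_list:
        events.put(event)

    threads = [
        threading.Thread(target=worker, args=(events, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # All workers have finished, so the results queue is complete.
    finished = []
    while not results.empty():
        finished.append(results.get())
    return sorted(finished, key=lambda r: r["id"])


if __name__ == "__main__":
    sample = [{"id": i, "hits": [i, 2 * i]} for i in range(10)]
    for row in run_farm(sample, n_workers=3):
        print(row)
```

    Because the only shared state is the two queues, adding more workers needs no change to the per-event code, which is why the network topology for such a farm can stay simple.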

    High energy theory calculations can use the fine grained parallelism of Beowulf but I doubt they would try to build a cluster as big as 2000 nodes.

  11. Seems like a good time for this question. by Fizgig · · Score: 2

    This reminds me of a question that I've had but haven't gotten answered. If I have a 300W power supply, does that mean the computer will be drawing 300W constantly or will it just draw what it needs? I ask because, when building a server or some other type of computer that would be up 24-7 but not necessarily saturated, it would be a pretty serious concern if you could maybe get by with 250W, saving about 50MW/yr (I've got to be doing that math wrong, but still a lot). So, who knows?
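
    For the arithmetic part of the question, here is a quick sketch. It assumes the machine really draws a constant 300W or 250W around the clock, which is only an assumption made for the sake of the sum; the function name annual_energy_kwh is made up for the example. Energy is power multiplied by time, so a year of savings comes out in kilowatt-hours rather than megawatts.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_energy_kwh(watts):
    """Energy used in one year by a load drawing `watts` continuously."""
    return watts * HOURS_PER_YEAR / 1000.0


if __name__ == "__main__":
    saved = annual_energy_kwh(300) - annual_energy_kwh(250)
    print(f"Saving 50 W around the clock is about {saved:.0f} kWh per year")
```

    Under that constant-draw assumption the difference works out to 438 kWh per year per machine.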

  12. I work at Fermi by rw2 · · Score: 2

    I work on the data access systems at Fermi and here is some skinny.

    1) All of the work I have been seeing is going to be done on farms. As stated elsewhere on Slashdot, the problems being solved here don't really get much help from many CPUs working on the same data set concurrently. The problems are shipped out essentially an event at a time and the analysis (e.g. event reconstruction) is sent back an event at a time. These reconstructed data sets are then used for lots of analysis processes, but the worst part of the workload is the reconstruction, which is where the farms are used.

    2) There will be lots of machines in the farm, but I'm not sure it will be 2000 machines; that seems pretty high. In all likelihood these boxes will be Linux based dual processor machines.

    3) For cooperative computing there are lots of multi-processor SGI boxes (O2K's).

    On the 'interesting bit of trivia' side, the data volumes we are talking about are on the order of petabytes. Pretty interesting challenges! It's kind of fun when your thumbnail information approaches the limitations of most relational databases.

  13. Program Parallelism? PVM and MPI by tap · · Score: 2

    Most programs written for Beowulf clusters use PVM or MPI. PVM is older; most new codes use MPI. At least, all the ones I've written use MPI.

  14. Program Parallelism? by Kastagir · · Score: 3

    I admit I'm not too familiar with everything, but I was wondering where the parallelism/load balancing in the programs they run on massive clusters like this comes from?

    I work with a research group at my school which develops an architecture and a programming language designed to explicitly tell the program what functions you want run in parallel and where. The language is, for lack of a better description, rather complicated relative to C, which it's based on. We have ports for SMP machines, Beowulf clusters, SP2's, etc., but this is the only distributed multithreaded implementation I've been exposed to, and since ours is not public, I was wondering what everybody else was using. Pthreads can't be used for a distributed system, can they? Thanks for any info.

    --

  15. Mysterious Fermi 2000 node cluster..... by Dan+Yocum · · Score: 3
    Hey Everyone,


    Don't get your panties in a bunch - there was obviously a typo in the person's email, probably due to the '0' key sticking.


    Yes, I gave a tour to several students from my alma mater in Stillwater, MN.


    No, I didn't show them a 600+ node cluster of old 486's (I can't even think of one 486 on site - they may be there, I just don't know about them). I did show them the 20 node Run II prototype farm, the 10 node SAM farm and the production 37 node farm. We don't do Beowulf clusters. We do compute farms. Why? They are 2 totally different things. A Beowulf cluster is specifically designed to analyze a little bit of data with a lot of message passing between processors and nodes. This is for tightly bound compute processes. A farm is a cluster specifically designed to crunch huge amounts of data with absolutely no message passing between processes, CPU's, or nodes.


    Quick physics lesson on what Fermilab does:


    We take protons and anti-protons and accelerate them to darn near the speed of light. Then we collide them in the Tevatron at particular points on the ring that have detectors (currently CDF and D0 (D-Zero)). Each detector has about a million individual detectors and the amount of data that is actually produced is about a million megabytes of data per second... no, we don't keep it all! The 1st level trigger throws most of the data away 'cause it just isn't interesting. The second level trigger throws some more away. The 3rd level trigger, which will be a 128 node, dual CPU Linux farm, will throw some more away, but it will pass something like 20-250 MB/sec on to the tape library. This number is dependent on the budget allocated for tapes. Roughly, we'll be saving 1.5 PB/year (that's petabytes, or 10^15 bytes) to tape. After a while, the physicists will want to take a better look at all that data. How? Linux farms. You see, the data that is taken, say, at 2:30:43.8238 on April 23, 2001 will have nothing to do with the data that is taken at 2:30:43.8239 on April 23, 2001, so it's safe to treat that as one individual data set. By doing so, we can put it on a single processor and let that processor grind on it for a while before spitting out the answer. So the question is: should we send out one single data set to a worker node in the farm, let it analyze it, then send it another, or should we dump A LOT of data off to the worker node and let it grind for a long time before writing the finished data back to tape? It appears that it's best just to send a big ole chunk (~15GB) of data off to a node, and let it grind for some large (~24 hours) amount of time.
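

    To put the figures above side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 MB = 10^6 bytes) and a 365 day year, and it averages over the whole year as if data were taken continuously, which is a simplification. The function name average_rate_mb_per_s is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def average_rate_mb_per_s(total_bytes, seconds):
    """Average transfer rate in MB/sec (decimal megabytes)."""
    return total_bytes / seconds / 1e6


# 1.5 PB written to tape over one year.
tape_rate = average_rate_mb_per_s(1.5e15, SECONDS_PER_YEAR)

# One ~15 GB chunk handed to a worker node every ~24 hours.
node_rate = average_rate_mb_per_s(15e9, SECONDS_PER_DAY)

if __name__ == "__main__":
    print(f"Average rate to tape:     {tape_rate:.1f} MB/sec")
    print(f"Average rate into a node: {node_rate:.2f} MB/sec")
```

    Under those assumptions 1.5 PB/year averages out to a bit under 48 MB/sec, inside the quoted 20-250 MB/sec range, while a node chewing on 15 GB per day needs well under 1 MB/sec of input on average.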


    The "plan" for this year is to purchase about 200 (that's hundred) dual P-II/III machines, depending on our budget. The year after that the "plan" is to purchase about 400 machines, and the year after that (when Run II starts in the Tevatron) the "plan" is to purchase another 400 machines. So, let's do some math - (200+400+400)*2 = 2000 _processors_, but only a thousand nodes.
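

    The same sum, written out as a tiny sketch. The labels "year 1" to "year 3" and the function name farm_totals are just for the example; the purchase numbers and the two CPUs per machine come from the plan described above.

```python
CPUS_PER_NODE = 2  # dual P-II/III machines

# Planned purchases per year (numbers from the plan above; labels are illustrative).
PLANNED_PURCHASES = {"year 1": 200, "year 2": 400, "year 3": 400}


def farm_totals(purchases, cpus_per_node=CPUS_PER_NODE):
    """Return (nodes, processors) for a set of planned purchases."""
    nodes = sum(purchases.values())
    return nodes, nodes * cpus_per_node


if __name__ == "__main__":
    nodes, processors = farm_totals(PLANNED_PURCHASES)
    print(f"{nodes} nodes, {processors} processors")
```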


    Yes, the cabling is a problem. ;)


    Hope that clears up some confusion.


    Cheers,


    Dan