Slashdot Mirror


CERN Collider To Trigger a Data Deluge

slashthedot sends us to High Productivity Computing Wire for a look at the effort to beef up computing and communications infrastructure at a number of US universities in preparation for the data deluge anticipated later this year from two experiments coming online at CERN. The collider will smash protons together hoping to catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang. From the article: "The world's largest science experiment, a physics experiment designed to determine the nature of matter, will produce a mountain of data. And because the world's physicists cannot move to the mountain, an army of computer research scientists is preparing to move the mountain to the physicists... The CERN collider will begin producing data in November, and from the trillions of collisions of protons it will generate 15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries in the United States seven times over. It would be the equivalent of 22 Internets, or more than 1,000 Libraries of Congress. And there is no search function."

7 of 226 comments

  1. OT: The size of the internet by chriss · · Score: 5, Informative

    Okay, the Library of Congress has been estimated to contain about 10 Terabytes, so I buy the 1000 * LoC = 15 Petabytes. But archive.org alone expanded its storage capacity to 1 Petabyte in 2005, so CERN is not going to generate anything near "22 Internets" (whatever that might be). This estimate from 2002 calculates the size of the internet as about 530 Exabytes, 440 Exabytes of which are email, and 157 Petabytes for the "surface web".

    1. Re:OT: The size of the internet by Anonymous Coward · · Score: 5, Funny

      We are from NASA, and would like to offer you a job in mission planning.

  2. All pages are identical by Laxator2 · · Score: 5, Interesting

    The main difference between the LHC data and the Internet is that all 15 PB of data will come in a standard format, so a search is much easier to perform. In fact, most of the search will consist of discarding non-interesting stuff while attempting to identify the very rare events that may show indications of new particles (Higgs, for example). The Internet is a lot more diverse; the variety of information dwarfs the limited number of patterns LHC is looking for, so "no search available" for LHC data sounds more like "no search needed".

  3. Re:No Search Function by Benson+Arizona · · Score: 5, Funny

    Buy Higgs Boson now at e-bay.com

    Buy books about Bosons at Amazon.com

  4. Re:Remember by bockelboy · · Score: 5, Informative

    I do work with one of the LCG projects, so let me share some of my personal opinions with you (all this info is mostly available on the web, if you can find it. We keep no secrets.).

    "I don't think CERN has HTTP/FTP servers right on an OC Internet backbone, or the server structure (think magnitudes greater than Google's) to drive the data."
    Oh yes we do. You are right though - buying network bandwidth is a lot more straightforward than building a disk / server infrastructure to handle all the data. It's difficult, but it is being accomplished.

    I think total - transatlantic fiber plus the European equivalent of Internet2 - bandwidth to CERN will amount to 100 Gbps - about 10 OC-192s. Universities buy into private global fiber networks, which are independent of the public internet.
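A quick sanity check of those link figures, as a minimal Python sketch. It assumes decimal units (1 PB = 10^15 bytes), the standard OC-192 line rate of 9.95328 Gbit/s, and the 15 PB per year quoted in the article summary; the function names are made up for illustration.

```python
# Back-of-the-envelope check of the link figures quoted above.
OC192_GBPS = 9.95328                 # SONET OC-192 line rate in Gbit/s
LINKS = 10                           # "about 10 OC-192s"
PB_PER_YEAR = 15                     # volume quoted in the article summary
SECONDS_PER_YEAR = 365 * 24 * 3600   # non-leap year


def aggregate_gbps(links: int = LINKS, per_link: float = OC192_GBPS) -> float:
    """Total line rate of `links` parallel circuits, in Gbit/s."""
    return links * per_link


def average_gbps(pb_per_year: float = PB_PER_YEAR) -> float:
    """Sustained rate needed to ship one year's data exactly once, in Gbit/s.

    Assumes decimal petabytes (10**15 bytes); this is an assumption, not
    something the comment states.
    """
    bits = pb_per_year * 1e15 * 8
    return bits / SECONDS_PER_YEAR / 1e9


if __name__ == "__main__":
    print(f"10 x OC-192: {aggregate_gbps():.1f} Gbps")
    print(f"15 PB/year : {average_gbps():.2f} Gbps sustained")
```

Under these assumptions, one copy of a year's data needs a little under 4 Gbps sustained, so the headroom in a 100 Gbps figure would go to replication across sites, reprocessing, and bursts rather than to a single pass.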

    We then use GridFTP as a transport, which is basically PKI-protected FTP that transfers over N parallel TCP streams. Then we use a protocol called SRM to control the GridFTP transfers, and we (well, the CMS experiment) use a higher-level application called PhEDEx to control worldwide data movement. Right now, PhEDEx directs about 8-10 Gbps worldwide, and we aren't "doing anything" big.
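To picture what "N parallel TCP streams" implies, here is a minimal sketch of one way a file could be carved into byte ranges, one per stream. `split_ranges` is a hypothetical helper written for this illustration; it is not part of GridFTP, SRM, or PhEDEx, and the real protocol may partition and interleave data differently.

```python
from typing import List, Tuple


def split_ranges(size: int, streams: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering [0, size) in near-equal parts.

    Hypothetical illustration only: each pair is the slice of the file that
    one parallel stream would carry.
    """
    if size < 0 or streams < 1:
        raise ValueError("size must be >= 0 and streams must be >= 1")
    base, extra = divmod(size, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue  # more streams than bytes: leave the rest idle
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    for offset, length in split_ranges(10, 3):
        print(f"stream carries bytes [{offset}, {offset + length})")
```

The point of splitting is that each TCP stream has its own congestion window, so several streams together can fill a long, fat pipe that a single stream would struggle to saturate.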

    GridFTP is a fairly effective protocol. I can get near-line speed - 2 Gbps from a channel-bonded RAID device. Locally, we've been buying large RAIDs - 30 TB a box, building up to 200 TB this fall. Some sites take a more "clustered" approach - they put a few 500-750 GB drives in each of the cluster's worker nodes, and build up to 200 TB that way. Costs are lower, but you have to keep 2 copies of each file in the cluster, and you have the headache of swapping out drives. Of course, I like our method better. In addition, larger T1 sites have a few petabytes in tape silos.
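The two storage approaches can be compared with a rough count of hardware, sketched below. The assumptions are loud ones: 1 TB is taken as 1000 GB, quoted capacities are treated as usable capacity (no RAID parity or filesystem overhead), and the function names are invented for this example.

```python
import math

TARGET_TB = 200  # usable capacity both approaches are building toward


def raid_boxes(target_tb: float = TARGET_TB, box_tb: float = 30) -> int:
    """Number of RAID boxes needed, ignoring parity overhead (assumption)."""
    return math.ceil(target_tb / box_tb)


def cluster_drives(target_tb: float = TARGET_TB,
                   drive_gb: float = 750,
                   copies: int = 2) -> int:
    """Number of worker-node drives needed when every file is kept `copies` times."""
    raw_gb = target_tb * 1000 * copies  # decimal TB -> GB, times replication
    return math.ceil(raw_gb / drive_gb)


if __name__ == "__main__":
    print("RAID boxes at 30 TB each:", raid_boxes())
    print("500 GB drives, 2 copies :", cluster_drives(drive_gb=500))
    print("750 GB drives, 2 copies :", cluster_drives(drive_gb=750))
```

Under these assumptions the clustered approach means several hundred individual drives to track and swap, against a handful of RAID boxes, which is the headache the comment alludes to.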

    Funding agencies don't just throw money into projects for years at a time, then wait for results. Two years ago, we did a test at 25% of the turn-on "complexity" (in terms of jobs run and data movement). Last year, we increased that to 50% complexity. Toward the end of this summer, we will have a challenge called CSA07, which should be at 75-100% complexity. Finally, turn-on should be around November this year.

    This is a multi-billion dollar project which has been under development for 10-15 years. We've been doing lots and lots of careful planning.

  5. Re:60% by lexarius · · Score: 5, Funny

    Talk like that gives me a large hadron.

  6. Re:Never underestimate the bandwidth of a 747 by fbjon · · Score: 5, Informative

    We obviously want to use maximum storage per HD weight, which currently means the Hitachi Deskstar 7K1000: 1,000,000,000,000 bytes in a maximum of 700 grams.

    Using the maximum payload weight of an A380F (freighter model), we get with Google calc: (152 400 kg / 700 grams) * 1Tbytes = 193.36913 petabytes, which is 12.8912753 years worth of CERN CMS data over a maximum distance of 5,600 nautical miles.

    The maximum useful load of a Cessna 172 is 371 kg, which gives a meager 0.0313823042 years worth of data over a maximum distance of 687 nm.

    The raw distance between CERN and Purdue University (not including distances to airports and such) is about 3838 nm, well within range of the A380F. The Cessna 172 falls into the ground/ocean long before that, however. Since there's no air-refueling option for the Cessna, the plan calls for a fleet of at least 179 Cessna 172s constantly working in relay, just to keep up with the data production rate!

    So, to answer your question: If you want the same leisurely pace of using one A380F, you'll need a massive 2148 Cessnas flying for a full year, every 12 years (the total weight of which is equivalent to 531 A380Fs, which should tell you something about the efficiency of said plan).

    --
    True confidence comes not from realising you are as good as your peers, but that your peers are as bad as you are.
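
The payload arithmetic in the comment above can be reproduced with a short Python sketch. The assumptions are worth spelling out: each drive holds 10^12 bytes and weighs 0.7 kg, a "petabyte" is the binary 2^50 bytes (the reading that matches the quoted Google Calculator result), and the data rate is 15 PB per year. The function names are invented for this example.

```python
DRIVE_BYTES = 10**12   # one 1 TB drive, decimal bytes as sold
DRIVE_KG = 0.7         # maximum drive weight quoted in the comment
PETABYTE = 2**50       # binary petabyte; assumption that matches the quoted result
PB_PER_YEAR = 15       # data volume per year from the article summary


def payload_pb(payload_kg: float) -> float:
    """Petabytes carried when the whole payload is drives (fractional drives allowed)."""
    drives = payload_kg / DRIVE_KG
    return drives * DRIVE_BYTES / PETABYTE


def years_of_data(payload_kg: float) -> float:
    """How many years of output fit in one flight."""
    return payload_pb(payload_kg) / PB_PER_YEAR


if __name__ == "__main__":
    for name, kg in (("A380F", 152_400), ("Cessna 172", 371)):
        print(f"{name}: {payload_pb(kg):.2f} PB, {years_of_data(kg):.4f} years")
```

With these inputs the sketch gives about 193.37 PB (12.89 years) for the A380F and about 0.0314 years for the Cessna 172, matching the figures in the comment. It ignores packaging, racks, and the time to copy data onto and off the drives.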