CERN Collider To Trigger a Data Deluge
slashthedot sends us to High Productivity Computing Wire for a look at the effort to beef up computing and communications infrastructure at a number of US universities in preparation for the data deluge anticipated later this year from two experiments coming online at CERN. The collider will smash protons together hoping to catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang. From the article: "The world's largest science experiment, a physics experiment designed to determine the nature of matter, will produce a mountain of data. And because the world's physicists cannot move to the mountain, an army of computer research scientists is preparing to move the mountain to the physicists... The CERN collider will begin producing data in November, and from the trillions of collisions of protons it will generate 15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries in the United States seven times over. It would be the equivalent of 22 Internets, or more than 1,000 Libraries of Congress. And there is no search function."
Okay, the Library of Congress has been estimated to contain about 10 Terabytes, so I buy the 1000 * LoC = 15 Petabytes. But archive.org alone expanded its storage capacity to 1 Petabyte in 2005, so CERN is not going to generate anything near "22 Internets" (whatever that might be). This estimate from 2002 calculates the size of the internet as about 530 Exabytes, 440 Exabytes of which are email, and 157 Petabytes for the "surface web".
Well, there _is_ a search function, and that's what the tier-2 sites will be running. The data describes individual experiments (that is, individual collisions) and comes off the LHC at a whacking rate. There's some front-end processing to throw away a lot of it before what's left gets sent to the tier-1 sites for further distribution.
The data is suitable for high-throughput (i.e., batch) processing, and the idea is to keep copies of the experimental data in several places during processing. Interesting results get flagged up by the batch processing for further study.
They could run a fiber across the Atlantic that could handle 4 Gbps.
They have been getting sustained performance (with simulated data) of more than that for several years now. This is the sort of thing that Internet2 does well, when it's not on fire.
That's a highly misleading figure (whatever figure you had in mind).
When you add up the time, money, kit and effort that would go into either burning that many optical disks or filling that many hard drives, then connecting them at the other end and reading them out, it becomes less attractive than fiber optics.
On the other hand, if the 747 is crammed full of ultra-high-capacity hard drives (say, the new Hitachi 1TB) in high-density racks that do not need unloading from the aircraft (it lands, it plugs into a power/multiple-10GbE grid, offloads the data to a local ground facility, then goes out for the next run), you get something that'd possibly be competitive with fiber, as well as a possible business model avenue.
You would, of course, need someone to be willing to pay the rough equivalent of .. say .. 500 economy airline tickets (shooting from the hip here, I tried compounding business/first-class costs) to get that through. That's a lot of cash. Then again, at 1TB/drive, it's a LOT of data.
For a lot of the physics, the researchers know what they are looking for. For example, with the Higgs boson, theories constrain the decay and production to certain channels that have characteristic signatures. So they would be looking for events that have a muon at a certain energy with a hadron jet with another given energy coming off x degrees away and so on. There have been Monte Carlo simulations and other calculations done to predict what the interesting events should look like under various theories. Of course there may be interesting events that pop up that no one has predicted, but everyone has a fairly good idea of what the expected events should look like.
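To make the "muon at a certain energy with a jet x degrees away" idea concrete, here is a toy sketch of a signature cut. Everything in it is made up for illustration: the Event fields, the thresholds and the angle window are hypothetical and do not come from any real ATLAS or CMS selection.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical, heavily simplified summary of one collision event."""
    muon_energy_gev: float
    jet_energy_gev: float
    opening_angle_deg: float  # angle between the muon and the jet


def matches_signature(ev, muon_min=20.0, jet_min=50.0, angle_range=(150.0, 180.0)):
    """Return True if the event passes all three (invented) cuts."""
    lo, hi = angle_range
    return (
        ev.muon_energy_gev >= muon_min
        and ev.jet_energy_gev >= jet_min
        and lo <= ev.opening_angle_deg <= hi
    )


events = [
    Event(35.0, 80.0, 170.0),  # passes every cut
    Event(5.0, 90.0, 160.0),   # muon too soft
    Event(40.0, 60.0, 90.0),   # jet at the wrong angle
]
selected = [ev for ev in events if matches_signature(ev)]
print(len(selected))  # 1
```

A real analysis would derive the cut values from the Monte Carlo predictions mentioned above rather than hard-coding them.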
Truth: There are several news agencies that have booked flights to descend upon CERN at the "supposed" start of the LHC in November. What will they come and see? Lots of hype and not much else!
What will happen? Single beam commissioning earliest in May. Collisions probably in August. Not earlier.
I hate being an Anon Coward, but there you go... Yes, I am sitting at a CERN office right now.
I think total - transatlantic fiber plus the European equivalent of Internet2 - bandwidth to CERN will amount to 100 Gbps - about 10 OC-192s. Universities buy into private global fiber networks, which are independent of the public internet.
We then use GridFTP as a transport, which is basically PKI-protected FTP that transfers over N parallel TCP streams. On top of that, we use a protocol called SRM to control the GridFTP transfers, and we (well, the CMS experiment) use a higher-level application called PhEDEx to control worldwide data movement. Right now, PhEDEx directs about 8-10 Gbps worldwide, and we aren't "doing anything" big.
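For anyone wondering what "N parallel TCP streams" means in practice, the general idea is to carve one file into byte ranges and give each range to its own connection. The sketch below only shows that partitioning step. It is not the GridFTP wire protocol or any real client API, and split_ranges is a name invented here.

```python
def split_ranges(total_bytes, n_streams):
    """Split [0, total_bytes) into up to n_streams contiguous (start, end) ranges.

    The end offset is exclusive. Sizes differ by at most one byte, and
    empty ranges are dropped when there are more streams than bytes.
    """
    if n_streams < 1:
        raise ValueError("need at least one stream")
    base, extra = divmod(total_bytes, n_streams)
    ranges = []
    start = 0
    for i in range(n_streams):
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue
        ranges.append((start, start + length))
        start += length
    return ranges


print(split_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each range could then be sent over a separate connection and reassembled by offset at the far end, which is roughly why several streams can fill a long, fat pipe better than one.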
GridFTP is a fairly effective protocol. I can get near-line speed - 2Gbps from a channel bonded RAID device. Locally, we've been buying large RAIDs - 30TB a box, building up to 200TB this fall. Some sites take a more "clustered" approach - they put a few 500-750 GB drives in each of the cluster's worker nodes, and build up to 200TB that way. Costs are lower, but you have to keep 2 copies of each file in the cluster, plus have the headache of swapping out drives. Of course, I like our method better. In addition, larger, T1 sites have a few petabytes in tape silos.
Funding agencies don't just throw money into projects for years at a time, then wait for results. Two years ago, we did a test at 25% of the turn-on "complexity" (in terms of jobs run and data movement). Last year, we increased that to 50% complexity. Toward the end of this summer, we will have a challenge called CSA07 which should be between 75-100% complexity. Finally, turn-on should be around November this year.
This is a multi-billion dollar project which has been under development for 10-15 years. We've been doing lots and lots of careful planning.
Another thing to point out is that, at least for ATLAS, researchers don't get their data directly from CERN. CERN has fat dedicated pipes to what are called Tier-1 data centers, which are spread around the world. I think these centers build the raw data into structured events. Then there are smaller Tier-2 data centers (I worked for one of the universities hosting a Tier-2 center) which get these structured events, and that is where Joe Physicist gets his data from. Also, these data centers have processing power on site to run programs submitted by physicists, so most of this data will never touch the everyday internet.
For some reason, ATLAS and CMS don't use the same techniques and technologies for just about anything from detector design down to the style of pen carried in their pocket protectors. So anything said for ATLAS does not necessarily hold true for CMS (the other big detector on the LHC).
They're not going to run the particle accelerator for a day and then spend half a year transferring all the data generated. The lifetime of a particle accelerator is longer than 173 days.
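A quick back-of-the-envelope check of the numbers in this subthread. The 173 days looks like one year's 15 PB pushed through a 1 GB/s (8 Gbps) link, but that is my guess at where the figure came from, not something stated above. The function names are just for this sketch.

```python
SECONDS_PER_DAY = 86_400
YEAR_OF_DATA_BYTES = 15e15  # 15 PB per year, decimal petabytes, from the summary


def transfer_days(data_bytes, link_bits_per_second):
    """Days needed to move data_bytes over a link running flat out."""
    return data_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


def average_rate_gbps(data_bytes, days):
    """Sustained rate in Gbps needed to move data_bytes within the given days."""
    return data_bytes * 8 / (days * SECONDS_PER_DAY) / 1e9


# Assumed 8 Gbps link (1 GB/s): about half a year for one year of data.
print(f"{transfer_days(YEAR_OF_DATA_BYTES, 8e9):.1f} days")  # 173.6 days

# Rate needed to keep pace with production over a full year.
print(f"{average_rate_gbps(YEAR_OF_DATA_BYTES, 365):.1f} Gbps")  # 3.8 Gbps
```

So streaming the data out as it is produced needs a bit under 4 Gbps sustained, ignoring protocol overhead and the extra copies sent to multiple sites.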
Using the maximum payload weight of an A380F (freighter model), we get with Google calc: (152 400 kg / 700 grams) * 1Tbytes = 193.36913 petabytes, which is 12.8912753 years worth of CERN CMS data over a maximum distance of 5,600 nautical miles.
The maximum useful load of a Cessna 172 is 371 kg, which gives a meager 0.0313823042 years worth of data over a maximum distance of 687 nm.
The raw distance between CERN and Purdue University (not including distances to airports and such) is about 3838 nm, well within range of the A380F. The Cessna 172, however, falls into the ground/ocean long before that. Since there's no air-refueling option for the Cessna, the plan calls for a fleet of at least 179 Cessna 172s constantly working in relay, just to keep up with the data production rate!
So, to answer your question: If you want the same leisurely pace of using one A380F, you'll need a massive 2148 Cessnas flying for a full year, every 12 years (the total weight of which is equivalent to 531 A380Fs, which should tell you something about the efficiency of said plan).
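For anyone who wants to redo the payload arithmetic, here is a small sketch using the figures quoted above (700 g per 1 TB drive, 15 PB of data per year). One assumption on my part: the quoted 193.37 PB and 12.89 years seem to come from the calculator treating a terabyte as 10^12 bytes but a petabyte as 2^50 bytes, so the binary_pb flag reproduces that, while the default uses decimal units throughout.

```python
DRIVE_MASS_KG = 0.7          # 700 g per drive, as in the comment
DRIVE_BYTES = 1e12           # 1 TB per drive, decimal
PETABYTES_PER_YEAR = 15      # yearly data volume quoted for CERN


def payload_bytes(payload_kg):
    """Bytes carried if the whole payload is drives (no racks, no packaging)."""
    return payload_kg / DRIVE_MASS_KG * DRIVE_BYTES


def years_of_data(payload_kg, binary_pb=False):
    """Years of data one flight carries, at 15 PB per year."""
    petabyte = 2**50 if binary_pb else 1e15
    return payload_bytes(payload_kg) / petabyte / PETABYTES_PER_YEAR


# A380F, 152,400 kg maximum payload.
print(f"{years_of_data(152_400, binary_pb=True):.2f}")  # 12.89 (matches the comment)
print(f"{years_of_data(152_400):.2f}")                  # 14.51 (decimal units throughout)

# Cessna 172, 371 kg useful load.
print(f"{years_of_data(371, binary_pb=True):.4f}")      # 0.0314
```

The sketch deliberately stops at the per-flight capacity. The fleet sizes depend on range and turnaround assumptions that the comment does not spell out.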
More like 6 or more extra zeros, actually. There seems to be a lot of confusion about this, so let me try to explain.
Generally the data coming out of these experiments is filtered in two or more stages. It has to run in real time since the data volume is enormous. A detector like this can easily spew out several TB a second of raw data. The first layer of filtering will look at very small portions of the data and make very loose requirements on it, but can run very fast in dedicated electronics. This might discard 99.99% of the events and keep 90% of the interesting stuff, for instance. Now you have a much smaller volume of data, so you can afford to spend more time on it. So maybe you run a pared-down version of the full reconstruction software. This is much more sophisticated software, so maybe you can get rid of 99% of what remains and only toss out 10% of the interesting interactions. This stage might be done on a cluster of 1000 computers or more. At the end, you've kept one out of a million events and only thrown away about 20% of what might be useful. But you need both steps. Skip the first step and you need a network with 10,000 times as much bandwidth and a computer cluster of 10 million computers. Skip the second step and instead of 100 PB of storage, you need 10,000. And you need to deal with all that data in the next step.
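The two-stage arithmetic above can be checked in a few lines. This sketch just multiplies the per-stage fractions from the comment (keep 0.01% of events and 90% of the signal, then keep 1% and 90% again). The cascade function is invented for the illustration and has nothing to do with real trigger software.

```python
def cascade(stages):
    """Combine filter stages given as (fraction_of_events_kept, signal_efficiency)."""
    kept = 1.0
    efficiency = 1.0
    for keep_fraction, signal_efficiency in stages:
        kept *= keep_fraction
        efficiency *= signal_efficiency
    return kept, efficiency


stages = [
    (1e-4, 0.90),  # stage 1: fast electronics, discards 99.99% of events
    (1e-2, 0.90),  # stage 2: pared-down reconstruction, discards 99% of the rest
]
kept, efficiency = cascade(stages)

print(f"{kept:.0e}")            # 1e-06, i.e. one event in a million survives
print(f"{efficiency:.2f}")      # 0.81 of the interesting events survive
print(f"{1 - efficiency:.2f}")  # 0.19 lost, the "about 20%" in the text
```

Dropping a stage from the list shows the other half of the argument: without stage 1 the second stage sees 10,000 times more events, and without stage 2 storage has to hold 100 times more.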
The initial filtering is not the end of the story. The one event in a million that passes will be reconstructed with the full, best software available along with the other billion events that pass. Then those will be filtered again based on different types of physics signatures and sent to the researchers looking at that one particular type of interaction. This process also requires thousands of CPUs. The big LHC experiments will have 40 million interactions/second and each interaction might contain 25 collisions. The vast majority of these are understood (not interesting), but the challenge is to sort through those 1 billion collisions a second in a finite amount of time to find the interesting ones. The two stages I've described are called "triggering" and "offline event reconstruction and filtering" if you want to try to find out more.
There go the mod points I assigned earlier in this discussion.
wow, i almost spit coffee all over my laptop when i read that. careful, yo.