CERN Collider To Trigger a Data Deluge
slashthedot sends us to High Productivity Computing Wire for a look at the effort to beef up computing and communications infrastructure at a number of US universities in preparation for the data deluge anticipated later this year from two experiments coming online at CERN. The collider will smash protons together hoping to catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang. From the article: "The world's largest science experiment, a physics experiment designed to determine the nature of matter, will produce a mountain of data. And because the world's physicists cannot move to the mountain, an army of computer research scientists is preparing to move the mountain to the physicists... The CERN collider will begin producing data in November, and from the trillions of collisions of protons it will generate 15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries in the United States seven times over. It would be the equivalent of 22 Internets, or more than 1,000 Libraries of Congress. And there is no search function."
Okay, the Library of Congress has been estimated to contain about 10 terabytes, so I buy the "more than 1,000 LoC" = 15 petabytes (that works out to about 1,500 LoC). But archive.org alone expanded its storage capacity to 1 petabyte in 2005, so CERN is not going to generate anything near "22 Internets" (whatever that might be). This estimate from 2002 calculates the size of the internet as about 530 exabytes, 440 exabytes of which are email, and 157 petabytes for the "surface web".
memomo: free web based language trainer DE-EN-ES-FR-IT
I hope they're planning on running their own fiber optic line across the Atlantic, or shipping a lot of hard drives, because that's too much data to pass over the public internet.
FYI 15 petabytes per year = 120 petabits per year = 120,000,000 gigabits per year
120,000,000 gigabits per year / ~30,000,000 seconds per year = 4 Gbps of continuous transmission. They could run a fiber across the Atlantic that could handle 4 Gbps.
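A quick sanity check of the arithmetic above, as a minimal Python sketch. It assumes decimal units (1 petabyte = 10^15 bytes) and a 365-day year; the function name is made up for illustration.

```python
# Seconds in a 365-day year; the post rounds this to ~30,000,000.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def sustained_gbps(petabytes_per_year: float) -> float:
    """Average line rate (in Gbps) needed to move a yearly data volume.

    Assumes decimal units: 1 PB = 1e15 bytes, 1 Gbps = 1e9 bits per second.
    """
    bits_per_year = petabytes_per_year * 1e15 * 8
    return bits_per_year / SECONDS_PER_YEAR / 1e9


if __name__ == "__main__":
    # 15 PB/year, the figure quoted in the article.
    print(f"{sustained_gbps(15):.2f} Gbps")
```

With the exact number of seconds the result comes out slightly under 4 Gbps, so the rounded estimate holds. This is an average; it says nothing about bursts or protocol overhead.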
Google it?
If Google is so awesome, maybe they can put their money where their mouth is and do something commendable. Of course, they'll probably have a hard time turning this data into marketing material.
The CERN collider will begin producing data in November, and from the trillions of collisions of protons it will generate 15 petabytes of data per year... [This] would be the equivalent of all of the information in all of the university libraries in the United States seven times over. It would be the equivalent of 22 Internets, or more than 1,000 Libraries of Congress. And there is no search function.
And 60% of it will be porn.
-- You can't take something off the Internet! That's like trying to take pee out of a swimming pool.
What about the backups?
The real fundamental question is not about the beginning of the universe, but something much, much more important: Are they going to back up the data?
On the other hand, I'm sure it will be available on some torrent soon.
You know with the right sort of particle accelerator you could send messages straight through the Earth and save a heap of latency.
http://michaelsmith.id.au
Lepton dancers wearing gluons.... WHOA!
So long as it's not needed right now pretty much any amount of data can be transmitted.
Would that be 0.84 Internets per fortnight? Or 1 kiloLibrary per Congress session? How much in tubes?
Well, yeah, but the probability is about the same as that of you generating a small black hole by clapping your hands together really hard.
qntm.org
The main difference between the LHC data and the Internet is that all of that 15 PB of data will come in a standard format, so a search is much easier to perform. In fact, most of the search will consist of discarding uninteresting stuff while attempting to identify the very rare events that may show indications of new particles (the Higgs, for example). The Internet is a lot more diverse; its variety of information dwarfs the limited number of patterns the LHC is looking for, so "no search available" for LHC data sounds more like "no search needed".
Physics locker room.
No folly is more costly than the folly of intolerant idealism. - Winston Churchill
This is really bad news. By defining the amount of data in LoC's, they leave themselves open to a huge exploit... If the LoC ever includes this data, then there will be a recursive loop of definitions and the LoC will expand to fill the universe.
Okay... maybe not, but if they ever did put this data in the LoC, the effort required to re-factor all the LoC based measurements would bankrupt the world. And the confusion that goes on while this re-factoring is happening will surely crash at least one probe into Mars, where the English have used the new LoC units and the Americans will have used the old LoC units.
FTA:"catch a glimpse of the subatomic particles that are thought to have last been seen at the Big Bang."
Who was at the Big Bang to see them then? I suspect that the numbers are a lot lower than the number of people that heard that tree fall in the woods and heard the sound of one hand clapping put together.
[The Universe] has gone offline.
That line is some of the worst hyperbole ever. Here's why. First, there was (almost by definition) no one there to 'see' anything at the Big Bang. (Supernatural explanations aside, and this purports to be a science article.) Second, these subatomic particles are formed frequently in nature, as high-energy astronomy has found various natural particle accelerators that are FAR more powerful than anything we're likely to build on Earth.
One hopes the author will do better next time.
Galileo: "The Earth revolves around the Sun!"
Score: -1 100% Flamebait
I'm willing to bet that they're all over it. And have even considered the possibility of a lot more than your 'average' figures given that a significant event may increase this data deluge. There is a lot at stake with this experiment (series of). A lot of future funding is dependent on how well this project has been managed, down to the smallest (pun originally not intended) detail.
Truth: There are several news agencies that have booked flights to descend upon CERN at the "supposed" start of the LHC in November. What will they come and see? Lots of hype and not much else!
What will happen? Single beam commissioning earliest in May. Collisions probably in August. Not earlier.
I hate being an Anon Coward, but there you go... Yes, I am sitting in a CERN office right now.
I think total bandwidth to CERN (transatlantic fiber plus the European equivalent of Internet2) will amount to 100 Gbps, about 10 OC-192s. Universities buy into private global fiber networks, which are independent of the public internet.
We then use GridFTP as a transport, which is basically PKI-protected FTP that transfers over N parallel TCP streams. On top of that, we use a protocol called SRM to control the GridFTP transfers, and we (well, the CMS experiment) use a higher-level application called PhEDEx to control worldwide data movement. Right now, PhEDEx directs about 8-10 Gbps worldwide, and we aren't "doing anything" big.
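To illustrate the "N parallel TCP streams" idea, here is a minimal Python sketch that splits a file into contiguous byte ranges, one per stream. It is not GridFTP's actual wire protocol or API; the function name and the even-split policy are assumptions made purely for illustration.

```python
def split_into_ranges(file_size: int, n_streams: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that together cover file_size bytes.

    Hypothetical helper: each pair is the chunk one parallel stream would send.
    Any remainder bytes are spread over the first few streams.
    """
    if n_streams < 1:
        raise ValueError("n_streams must be at least 1")
    if file_size < 0:
        raise ValueError("file_size must not be negative")

    base, extra = divmod(file_size, n_streams)
    ranges = []
    offset = 0
    for i in range(n_streams):
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # fewer bytes than streams; the remaining streams stay idle
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    # A 1 GB file (decimal) over 4 streams.
    for offset, length in split_into_ranges(1_000_000_000, 4):
        print(offset, length)
```

Each stream can then move its own range independently, which is why parallel transfers help on long, fat links where a single TCP connection struggles to fill the pipe.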
GridFTP is a fairly effective protocol. I can get close to line speed, 2 Gbps, from a channel-bonded RAID device. Locally, we've been buying large RAIDs - 30 TB a box, building up to 200 TB this fall. Some sites take a more "clustered" approach - they put a few 500-750 GB drives in each of the cluster's worker nodes, and build up to 200 TB that way. Costs are lower, but you have to keep 2 copies of each file in the cluster, plus have the headache of swapping out drives. Of course, I like our method better. In addition, the larger T1 sites have a few petabytes in tape silos.
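A rough sizing sketch for the "clustered" approach, in Python. The post doesn't say whether 200 TB is raw or usable space; this assumes it is usable capacity, that every file is stored twice, and decimal units (1 TB = 1000 GB). The function name is hypothetical.

```python
import math


def drives_needed(usable_tb: float, drive_gb: float, copies: int = 2) -> int:
    """Number of drives required to hold usable_tb when each file is kept `copies` times."""
    raw_gb = usable_tb * 1000 * copies  # raw capacity, including replicas
    return math.ceil(raw_gb / drive_gb)


if __name__ == "__main__":
    # Drive sizes mentioned in the post: 500 GB and 750 GB.
    for drive_gb in (500, 750):
        print(f"{drive_gb} GB drives: {drives_needed(200, drive_gb)}")
```

Under those assumptions the drive count lands in the hundreds either way, which gives a feel for the "headache of swapping out drives".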
Funding agencies don't just throw money into projects for years at a time, then wait for results. Two years ago, we did a test at 25% of the turn-on "complexity" (in terms of jobs run and data movement). Last year, we increased that to 50% complexity. Toward the end of this summer, we will have a challenge called CSA07, which should be between 75% and 100% complexity. Finally, turn-on should be around November this year.
This is a multi-billion dollar project which has been under development for 10-15 years. We've been doing lots and lots of careful planning.
The NSA will have to scan the data for potential terrorist Tachyons hiding among the Bosons. That will slow things down a bit.
There are some other benefits to building such a huge network of high powered computers. And it's not the teleportation you thought, it's more copying of metadata and re-creating the original.
Think about it, the only thing stopping us is the ability to store and transfer large amounts of data necessary to describe the precise makeup of a human being. I have a feeling this project will branch off into that area.
Sounds like the article was written by Senator Stevens. Nothing to fear, 22 emails can't possibly clog our tubes.
"A deadlock has been reached. One task must die. We must now choose between murder and suicide."
My quantum computer has been working on downloading the torrent for the past few weeks.
Hmmm, witty sig or funny sig? Maybe elitist techy sig!
Another thing to point out is that, at least for ATLAS, researchers don't get their data directly from CERN. CERN has fat dedicated pipes to what are called Tier-1 data centers, which are spread around the world. I think these centers build the raw data into structured events. Then there are smaller Tier-2 data centers (I worked for one of the Universities hosting a Tier-2 center) which get these structured events and that is where Joe physicist gets his data from. Also, these data centers have processing power on site to run programs submitted by physicists, so most of this data will never touch the everyday internet.
For some reason, ATLAS and CMS don't use the same techniques and technologies for just about anything from detector design down to the style of pen carried in their pocket protectors. So anything said for ATLAS does not necessarily hold true for CMS (the other big detector on the LHC).
That's a lot of tubes.
oh marmalade.