Slashdot Mirror


disCERNing Data Analysis

technodummy writes: "Wired is reporting how CERN is driving the Linux-based, EU funded, DataGRID project. And no, they say, it's nothing like Seti@Home. The description on the site of the project is: ' The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'" If you're interested in this, check out the Fermi Lab work with LinuxNetworkX data as well as the all-powerful Google search on the Fermi Collider Linux project. As jamie points out, "Colliders produce *amazing* amounts of data in *amazingly* short time periods... on the order of "here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives".

30 of 82 comments

  1. Technology transfer by Matey-O · · Score: 2, Funny

    And when this stuff becomes commodity hardware, Quake can have a Real Quantum Effects Railgun(tm)!

    --
    "Draco dormiens nunquam titillandus."
  2. EU funding by reachinmark · · Score: 2, Flamebait
    It's good to see that this EU funded project isn't wasting precious money on things like website design. We can't have people getting past the first page and actually joining in on the project now can we!

    Has anyone actually seen an IT related EU project that achieved something? The company I work for has been involved in two EU project proposals so far, and nothing came of either of them -- though they both consumed a large amount of resources from universities to get through the three failed applications each.

    1. Re:EU funding by san · · Score: 5, Informative

      The WWW, developed at CERN by Tim Berners-Lee springs to mind..

    2. Re:EU funding by pubjames · · Score: 5, Informative

      Has anyone actually seen an IT related EU project that achieved something?

      Government funded work, in the EU, US and internationally, actually drives changes in the IT industry a lot more than most people realise (or perhaps would care to admit).

      For christssakes, the web itself came out of a CERN project! Also many other web standards originated in EU funded projects, for instance JPEG and MPEG. So, the most common formats on the web for text (HTML), images (JPEG), and video (MPEG), all owe something to funding from the EU.

      And of course the Internet itself comes from US government funded projects. Even commonly used business processes have resulted from government funded work (project management methodologies).

      Both Americans and Europeans like to bitch about the inefficiencies of their governments, but the fact of the matter is that if you look at the history of IT, more fundamental innovations come from government funded work than from industry. Of course Bill Gates, Larry Ellison etc. don't want you to think that, but that's the way it is.

    3. Re:EU funding by pubjames · · Score: 3, Interesting

      Has anyone actually seen an IT related EU project that achieved something? The company I work for has been involved in two EU project proposals so far, and nothing came of either of them -- though they both consumed a large amount of resources from universities to get through the three failed applications each.

      Perhaps you are expecting the wrong results.

      I have been involved in a couple of large EU funded projects, and have spoken to the project managers about the aims and motives of the projects.

      One principal point is that just because a new successful product/standard/format whatever does not arise from a project, does not mean that it has been a failure.

      The EU is made up of lots of different countries with lots of different types of people speaking different languages and with different working mentalities. This is a major competitive disadvantage for us compared to a country like the US. If a company in San Francisco wants to work with a company in New York, there aren't many barriers to them doing that. In the EU, there are lots of barriers. One of the main aims of EU funded projects (and the EU in general) is to break down these barriers by getting different companies and universities working together across the EU. If new technologies come out of these projects, so much the better, but that's not necessarily the principal aim.

  3. sheer quantity of data by Alien54 · · Score: 3, Interesting
    The first reaction I have is to wonder whether a distributed model of computing would even be able to make a dent if the amount of data is that big.

    Does anyone have an idea of how much data Seti at home has processed? This would certainly be useful as a yardstick of sorts.

    --
    "It is a greater offense to steal men's labor, than their clothes"
    1. Re:sheer quantity of data by ajcpi · · Score: 2, Informative

      332,321,524 results (on average, 98.27 results per user)

      Chunks of data are perhaps 0.5 MB
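
      As a rough yardstick, the two figures above can be multiplied out. This is only a minimal sketch: it assumes every result corresponds to one 0.5 MB chunk and uses decimal units (1 TB = 10^6 MB). The variable names are illustrative, and because results may include repeated work units the total is best read as an upper-end estimate.

```python
# Rough estimate of the data volume behind the SETI@home result count quoted above.
results_received = 332_321_524   # figure quoted in the post
chunk_size_mb = 0.5              # "perhaps 0.5 MB" per chunk, per the post

total_mb = results_received * chunk_size_mb
total_tb = total_mb / 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB

print(f"about {total_tb:.1f} TB processed")
```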

    2. Re:sheer quantity of data by ajcpi · · Score: 2, Informative

      Actually, perhaps this is more useful, (from the Seti Site)

      http://setiathome.ssl.berkeley.edu/totals.html

      Total
      Users 3383619 1872
      Results received 399604453
      Total CPU time 799230.603 years
      Floating Point Operations 1.142642e+21 (29.64 TeraFLOPs/sec)
      Average CPU time per work unit 17 hr 31 min 13.7 sec

  4. distributed computing by sam@caveman.org · · Score: 5, Interesting

    here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives.

    let's see. 1 GB in 10 ms works out to 100 GB per second. how recently did GB ethernet come about? and what would the average bandwidth of users be? i would guess much less, but let us assume 100KB per second.

    so you have 107374182400 bytes of data per second. your users can take 102400 bytes per second each. even if everyone was connected directly to your network (no delays or bottlenecks... ha!) you would still require 1048576 users (that is over 1 million).

    and this is not taking into account sending any data BACK to the source or actual computation time on the users.
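
    The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes binary units (1 GB = 1024^3 bytes, 1 KB = 1024 bytes), which is what the byte counts in the post imply; the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the figures in the post above.
GIB = 1024 ** 3   # bytes in one (binary) gigabyte
KIB = 1024        # bytes in one (binary) kilobyte

source_rate = 100 * GIB     # 1 GB every 10 ms is 100 GB per second
per_user_rate = 100 * KIB   # assumed 100 KB per second per user

# Number of users needed just to receive the stream, ignoring all overhead.
users_needed = source_rate // per_user_rate

print(source_rate)     # 107374182400
print(per_user_rate)   # 102400
print(users_needed)    # 1048576
```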

    -sam

    --
    burn the computers. go back to the abacus.
    1. Re:distributed computing by fiziko · · Score: 5, Informative

      The data figure stated above is at the actual data collection stage, not the analysis stage, so it's not being transmitted via ethernet. The project I'm working on (ATLAS, which should be running on the LHC when it gets built in the next few years) has actually found that magnetic media cannot keep up with the data rate, so they had to figure out another means of storing the data while they were sorting it between particle bursts. They decided on a switched capacitor array, since that can keep up. The data actually goes through (IIRC) three stages of analysis before it's finally approved and recorded indefinitely. This filtered data is the stuff that will be transmitted via the Grid.

      --
      - W. Blaine Dowler
      http://www.bureau42.com
    2. Re:distributed computing by PSC · · Score: 3, Informative

      let's see. 1 GB in 10 ms works out to 100 GB per second. how recently did GB ethernet come about? and what would the average bandwidth of users be? i would guess much less, but let us assume 100KB per second.

      Well 100 GB per second is the raw data rate, as read out (heavily parallel) from the detector, i.e. the data rate the DAQ (Data AcQuisition) system has to keep up with. That's pretty difficult really, but done completely in hardware: the readout chips have relatively large on-chip buffers for each read-out channel. MOST OF THIS DATA IS DISCARDED RIGHT AWAY by the so-called Level 1 Trigger, whose purpose is to throw away the most obviously uninteresting collisions.

      Since the data rate after L1 is still WAY too large to be all stored, another trigger, unimaginatively called Level 2 Trigger, sorts out even more crap. Since the data rate is lower than for L1, L2 can use more sophisticated algorithms to figure out which event is crap and which is an ever-famous Higgs decay :-)

      One more trigger, Level 3 (you guessed it), is used to even further reduce the amount of data, again with more sophisticated means.
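
      The idea of cascading triggers, where each level sees fewer events and can afford a more expensive test, can be sketched in software roughly as follows. This is an illustration only: real Level 1 triggers are implemented in hardware, and the event fields, thresholds and function names below are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, Sequence

Event = dict  # hypothetical event record; real detector events look nothing like this
Trigger = Callable[[Event], bool]


def cascade(events: Iterable[Event], triggers: Sequence[Trigger]) -> Iterator[Event]:
    """Yield only the events accepted by every trigger level, in order."""
    for event in events:
        # all() short-circuits, so a later (more expensive) level only runs
        # on events that every earlier level has already accepted.
        if all(trigger(event) for trigger in triggers):
            yield event


# Hypothetical selection criteria, cheapest first.
def level1(event: Event) -> bool:
    return event["energy"] > 10.0


def level2(event: Event) -> bool:
    return event["tracks"] >= 2


def level3(event: Event) -> bool:
    return event["score"] > 0.9


events = [
    {"id": 1, "energy": 5.0, "tracks": 3, "score": 0.95},   # dropped by level 1
    {"id": 2, "energy": 50.0, "tracks": 1, "score": 0.99},  # dropped by level 2
    {"id": 3, "energy": 50.0, "tracks": 4, "score": 0.50},  # dropped by level 3
    {"id": 4, "energy": 80.0, "tracks": 5, "score": 0.97},  # kept
]

kept = list(cascade(events, [level1, level2, level3]))
print([event["id"] for event in kept])  # [4]
```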

      Still, the required bandwidth is quite impressive. At CDF II, the data rate after Level 3 will be about 75 events per second, at half a meg each, summing up to 30-40 MB per second (well enough to saturate Gbit ethernet), which are all reconstructed right away. Note that for the LHC experiments (CMS, ATLAS) the amount of data is more than an order of magnitude larger than for CDF and D0 (at Fermilab).
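
      The post-Level-3 figure for CDF II follows directly from the two numbers quoted. A minimal sketch, assuming "half a meg" means 0.5 MB per event (variable names are illustrative):

```python
# Output rate after the Level 3 trigger, from the figures quoted above.
events_per_second = 75
mb_per_event = 0.5   # assumption: "half a meg" is 0.5 MB

rate_mb_per_s = events_per_second * mb_per_event
print(rate_mb_per_s)  # 37.5, within the quoted 30-40 MB per second
```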

      The LHC data will be spread all over the world, using a multi-tier architecture with CERN being Tier 0, and national computing centers as Tier 1 centers, universities being Tier 2, etc. No national computing center will be able to store ALL data, so the idea is that e.g. your Higgs search will be conducted on the U.S. Tier 1 center, B physics on the German Tier 1 center and so on. Obviously not only US scientists will search for the Higgs, so others will also submit analysis jobs on the US Tier 1 and vice versa. To get this working, the GRID is designed. A current implementation is GLOBUS.

      Having said this, it is important to note that right now, the GRID is nowhere near this goal. To submit jobs in this "fire and forget" way is not possible yet. There is a shitload of problems yet to solve, the most important ones: trust and horsepower.

      Trust: you must allow complete strangers to utilize your multi-million dollar cluster, and they haven't even signed a term-of-use form.

      Horsepower: everybody expects to get more CPU cycles out of the GRID than he/she contributes. Obviously, this will not work. (Albeit the load leveling might improve the overall performance.)

      --
      --- The light at the end of the tunnel is probably a burning truck.
    3. Re:distributed computing by fiziko · · Score: 2

      Complete BS, huh? Have you read the papers you can find here, here, and other places not on the first page of results on a Google search? Maybe I should have mentioned that speed wasn't the only concern, but it was the prime concern. In any event, I've got evidence to back up what I said. You?

      --
      - W. Blaine Dowler
      http://www.bureau42.com
  5. Here's how you do it... by Anonymous Coward · · Score: 2, Funny

    Daisy-chain 10^400 Timex Sinclair computers. Make sure you buy the preassembled kind cause otherwise it would take too long to set up.

  6. Storage to the rescue by KarmaBlackballed · · Score: 2, Insightful

    "here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives".

    ...or just write it all as it comes in and analyze it later. That's how most other science takes place. Since when is scientific analysis "real-time?"

    In general, the scientific process does not require conclusions during an experiment. I think CERN should cite a different reason for this project, there are many valid ones.

    --

    --- -- - -
    Give me LIBERTY, or give me a check.
    1. Re:Storage to the rescue by fiziko · · Score: 3, Informative

      That data rate doesn't apply to the analysis stage. Magnetic media can't keep up with the data as it comes in, so it has to be sifted through in a first pass to eliminate the boring cases. (These would be the times two particles passed each other in the detector without colliding, and things like that.) Most of the analysis is done later on. (In fact, the analysis I'm doing today is on data collected in August 2000.)

      --
      - W. Blaine Dowler
      http://www.bureau42.com
    2. Re:Storage to the rescue by jamie · · Score: 4, Informative
      BTW, don't hold me to those exact numbers, Hemos copy'n'pasted something I just typed into IRC without reflecting on it too much :)

      The problem is that there's way too much data to write to any storage medium to analyze later. The bandwidth makes hard drives look like tiny, tiny straws. When they throw the switch and the protons or whatever start smacking into each other, they get many collisions in a row, several every millisecond, maybe dozens every millisecond (depending on collider circumference I imagine). The huge array of detectors around the collision point streams out big chunks of data for each collision. The first line of defense is a network of computers that get handed each collision, or parts of it broken down, in round-robin order or something. Their job is to sift through X megabytes very quickly to decide whether there's anything "interesting" in this collision that warrants being remembered. If no flags go up, the data just gets dropped on the floor.

      The datagrid described in the article is, as far as I can tell, set up to process data after that "first line of defense" -- even after dropping the majority of the bits on the floor, there is still a prodigious amount that has to be sifted through, just to check that the Higgs didn't leave a track or something. That's a different sort of engineering project.

      My point was just that, yes, the amount of data involved here really is amazingly large.

    3. Re:Storage to the rescue by sam@caveman.org · · Score: 4, Informative

      or just write it all as it comes in and analyze it later.

      1 GB per 10 ms comes out to 100 GB per second. after 24 hours of experimentation, you find yourself with 8.6 million gigabytes. hard drives are cheap, but not THAT cheap. and even if you had LOTS of 100 GB hard drives, you still need to find a place to PUT 86 thousand of them.

      every 24 hours.

      after 1 week's worth of data collection, you have 600 thousand 100 GB hard drives of data.

      this is why 'store now, analyze later' is not as good of an option for collision data. you have to take that 100 GB of data per second, and first filter and say, 'which of these collisions might be interesting to look at? which ones produced the particles we are trying to study?'
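
      For anyone who wants to check the storage figures above, here is a minimal sketch. It assumes a constant 100 GB per second around the clock and 100 GB drives, exactly as in the post; the variable names are illustrative.

```python
# Storage needed for "store now, analyze later" at the raw data rate.
rate_gb_per_s = 100          # 1 GB every 10 ms
seconds_per_day = 24 * 60 * 60
drive_size_gb = 100

gb_per_day = rate_gb_per_s * seconds_per_day
drives_per_day = gb_per_day // drive_size_gb
drives_per_week = 7 * drives_per_day

print(gb_per_day)       # 8640000  (about 8.6 million gigabytes)
print(drives_per_day)   # 86400    (about 86 thousand drives)
print(drives_per_week)  # 604800   (about 600 thousand drives)
```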

      -sam

      --
      burn the computers. go back to the abacus.
    4. Re:Storage to the rescue by tr1n0 · · Score: 2, Insightful

      40 million _events_ per sec, actually; multiplied by about 20 collisions per event gives almost a billion collisions per sec. Which results in a primary data rate ( before low level triggering ) of about 100 Terabyte - 1 Petabyte per sec.
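
      Multiplying these figures out also gives a feel for the data size per collision. This is only a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and the data rate spread evenly over all collisions, neither of which is stated in the post.

```python
# Collision rate and implied raw data per collision, from the figures above.
events_per_second = 40_000_000
collisions_per_event = 20

collisions_per_second = events_per_second * collisions_per_event
print(collisions_per_second)  # 800000000, i.e. "almost a billion"

# Assumption: decimal units, and the rate divided evenly among collisions.
low_rate_bytes = 100 * 10**12    # 100 Terabyte per second
high_rate_bytes = 10**15         # 1 Petabyte per second

print(low_rate_bytes // collisions_per_second)   # 125000 bytes per collision
print(high_rate_bytes // collisions_per_second)  # 1250000 bytes per collision
```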

  7. Grid computing? by Exmet+Paff+Daxx · · Score: 3, Insightful

    Taking a look at Gridcomputing, it's pretty clear to see that Grid Computing is actually... Distributed Computing. There's no new concept here - so why the new name? It doesn't make sense until you read the sound bite: "I believe grid computing will revolutionize the way we compute".

    Yes, if you can't invent an idea, rename it, and maybe you'll get some credit. What the hell, it's worked before.

    Oh well. More power to them. It looks like a great opportunity for the world to learn that Linux is a powerful tool.

    --
    If guns kill people, then CmdrTaco's keyboard misspells words.
    1. Re:Grid computing? by sam@caveman.org · · Score: 2

      iirc, the main difference between grid computing and distributed computing is that for grid computing, it works like a utility company, you pay for the processing power as you use it. either that, or some networking professor wanted to write a book with a new title.

      -sam

      --
      burn the computers. go back to the abacus.
    2. Re:Grid computing? by Technodummy · · Score: 2

      It doesn't make sense until you read the sound bite: "I believe grid computing will revolutionize the way we compute"

      In that article it says:

      "One example of how this is not done is SETI," said Ellis, referring to the popular screensaver program beloved by millions of home and work computer users. The program processes chunks of satellite data for the Search for Extra Terrestrial Intelligence project.

      "It's not real-time, and it's not online," said Brian Coghlan of Trinity College Dublin, an Irish participant in DataGRID. "You go to SETI and laboriously download data and then laboriously send it back."

      With DataGRID, they're talking about a network that can do real-time processing of petabytes of data -- a barely imaginable amount of information. One petabyte is a quadrillion bytes -- equal to all the information that could be held on 125,000 PC hard drives."


      SETI data can be delayed. If you don't get online for awhile, your data is held back from the grid. Doesn't that make it different?
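
      The drive comparison in the quoted article implies a particular drive size, which is easy to work out. A minimal sketch, assuming the article's decimal definition of a petabyte (a quadrillion bytes); the variable names are illustrative.

```python
# Drive capacity implied by "1 petabyte = 125,000 PC hard drives".
petabyte_bytes = 10**15      # "a quadrillion bytes", per the article
drive_count = 125_000

bytes_per_drive = petabyte_bytes // drive_count
print(bytes_per_drive)             # 8000000000
print(bytes_per_drive // 10**9)    # 8 (GB per drive)
```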

  8. Solid State Niche by Cesaro · · Score: 2, Interesting

    This is exactly what the niche market for solid state drives is. You have gigs of data you need to get there FAST...then you can worry about picking it apart afterwards. After you have it on the solid state drive, then as long as you don't lose power and your UPS power, you can leisurely use however many computers you want to nit pick it without having to worry about missing data.

  9. It's a little-known fact... by JohnPM · · Score: 2, Interesting

    That 1 petabyte, if stored as an area of black and white 8mm square bathroom tiles with 2mm grout would cover an area of 900,720 square kilometres which is about 741 times the area of Los Angeles.
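
    The tile figure can be reproduced with a short calculation. This sketch assumes one bit per tile, a binary petabyte (2^50 bytes), and a 10 mm pitch per tile (8 mm tile plus 2 mm grout); those assumptions are inferred from the numbers, not stated in the post.

```python
# Area covered by one petabyte stored as one-bit bathroom tiles.
bits = 8 * 2**50              # assumption: binary petabyte, one bit per tile
pitch_mm = 8 + 2              # 8 mm tile plus 2 mm grout
tile_area_mm2 = pitch_mm ** 2

MM2_PER_KM2 = 10**12          # (10^6 mm per km) squared
area_km2 = bits * tile_area_mm2 / MM2_PER_KM2

print(round(area_km2))        # 900720

# Area of Los Angeles implied by the "741 times" comparison, in km^2.
print(f"{area_km2 / 741:.0f}")
```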

    Bring on the pixie dust!

    (source)

    --
    Karma police, I've given all I can, it's not enough, I've given all I can, but we're still on the payroll.
  10. Customization of PCs by hrieke · · Score: 2

    Why not just use something like PC on a card (ala Transmeta or one of the others mentioned here at /. any number of times) to have each PC house multiple systems to compute the results?
    I'm sure that a custom system could be designed and built for the problem on the cheap side (using off the shelf products and parts) and the cost could be spread around the various colliders around the world.
    Heck, it would make for a good DARPA grant- hint hint.
    Also, thinking about the amount of data generated, I'm sure that the collectors have some sort of system to buffer all that data (ungodly amount of RAM anyone?) which is then sent down the wire to storage over multiple NICs.
    I also don't think that colliders are run 24/7 as someone else suggested / wrote.

    Henry

    --
    III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIIIV IIVIIIIIIVIII...
    1. Re:Customization of PCs by krlynch · · Score: 2

      I also don't think that colliders are run 24/7 as someone else suggested / wrote.

      Actually, they are :-) They are run 24/7 for months at a time, then taken down for a few days/weeks for minor repairs, swap outs, and minor upgrades, then they go back up. And they do this for a few years on end. Then they go down for major overhauls and upgrades, and hopefully a few more runs.

  11. A few Corrections by Roger+W+Moore · · Score: 4, Informative

    Actually the Fermilab article pointed to concerns a cluster of machines used for the L3 trigger of the D0 experiment (of which I'm a member). This actually has very little to do with the GRID since it is used as the final stage of a three stage trigger process which decides when an "interesting" event has been produced by the collider. The previous stage, L2, also uses Linux/Alpha machines but is not really a cluster since these custom built boards sit in various crates of electronics and process only a fraction of the data that the L3 sees (however our time budget at L2 is 100 microseconds compared to L3's 100 milliseconds!).

    However, that said, D0 is heavily involved with the GRID project and has what is arguably one of the first production GRID applications, called SAM. This system essentially manages all of our data files around the entire globe and allows any member to run an analysis job on a selected set of data files. SAM then handles the task of getting those files to the machine where the job is running using whatever means is required (rcp or fetching it from a tape store). SAM also allows remote institutes to add data to the store which is used primarily by large farms of remote Linux boxes which run event simulations. We are also currently working on integrating SAM into our desktop Linux cluster which will allow us to use the incredibly cheap disk and CPU which is available for Linux machines. For more details you can consult the following web pages:

    http://www-d0.fnal.gov/ - the D0 homepage
    http://d0db.fnal.gov/sam - the SAM homepage

  12. I was there! by GroovBird · · Score: 3, Funny

    I can see it already.


    *** jamie(~who@gives.a.fl.us) joined #slashdot
    <CmdrTaco> lookin' for cyber msg me
    <Hem0s> Hey jamie
    * KatzAWAY is now away [logger:on]
    <jamie> hey hemos
    <Hem0s> whazzup?
    <jamie> oh got this gr8 link here but got no access to the backend right now. can u help me out?
    <Hem0s> sure thing.. what you got?
    <CmdrTaco> jamie a/s/l?
    <jamie> i found this link about this grid computing whatsimagigger and i just thought it's cool ... u know linux and all
    <Hem0s> u uh
    <CmdrTaco> jamie a/s/l?
    <jamie> shut up taco
    <Hem0s> so what's the link?
    <timothy> boooooring
    <jamie> i found it while zapping through wired somehow my browser crashed on me again can u go find it?
    <Hem0s> sure ... hold on a sec
    <CmdrTaco> timothy a/s/l?
    *** CmdrTaco (rob@home) Quit (Connection reset by peer)
    <jamie> gotta tell you i LOVE that post you did on OpenGL a minute ago
    <Hem0s> thx ... can't find it though
    <jamie> it's there somewhere
    *** CmdrTaco (rob@home) joined #slashdot
    *** bill{Taco} sets mode: +b CmdrTaco
    <jamie> ok lemme try again
    <Hem0s> hurry jamie i already fired up mah mozilla dont know how long she stays put
    <CmdrTaco> lookin for a good time? msg me
    *** KatzAWAY left #slashdot
    <jamie> here it is ... CERN is driving the Linux-based, EU funded, DataGRID project.
    <jamie> The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'
    <Hem0s> great stuff... lemme copy'npaste here..
    <jamie> somethin bout amazing amounts of stuff in short timed periods ... like you know here's a gig fill it and you've got a split second to pull the goods outtathere
    <Hem0s> you don't mind if i edit this a bit don't you
    <jamie>gotta go bye!
    <Hem0s> you don't mind if i redo this a bit don't you?
    *** jamie left #slashdot (gotta reboot bye)
    <CmdrTaco> lookin for cyber. msg me
    <Hem0s> great ... now i gotta work this
    *** michael sets mode: +ms
    *** You were kicked by michael (spyin on us?)

  13. Virtual science by Sloppy · · Score: 3, Interesting

    This reminds me of an astronomy-related story I saw yesterday. Some projects are generating more data than the people doing the projects can handle.

    --
    As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
  14. From the ATLAS TDR... by krlynch · · Score: 3, Informative

    So I went and found the ATLAS Technical Design Report, which gives all the numbers:

    • The detector itself will experience events at the rate of 10^9 per second.
    • Now, not all of that data even makes it out of the detector; it would correspond to somewhere around 10^11 MB/s (yes, megaBYTES per second) if they tried to dump all the data out to computers, so the detector has a number of levels of "triggering", which is specialized recognition hardware distributed all over the detector that "recognizes" and integrates the information coming off of small clusters of detector elements and decides whether or not there is anything "interesting" in that information, without reference to what is going on in the rest of the detector. This data bubbles up through a small number of layers of triggers that integrate increasingly larger segments of the detector, until it actually comes out and is sent into computers to be "recognized" as actual events.
    • Those events are analyzed by banks of computers, which sift through about 100 events per second, and if they are all stored, it adds up to about 100MB/s (of course not all of those events will be stored, but that is an approximate ceiling on the storage rate).


    The final data rate is expected to be about 1PB/year (1 PB = 10^15 B = 10^9 MB). The LHC collider will probably run for about 25 years, there will be at least two experiments (and maybe up to four) running for most of that time ... you do the math on how much data will be collected and have to be analyzed :-)
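
    Doing the math, as a minimal sketch: this takes 1 PB = 10^15 B = 10^9 MB (decimal units), treats the 100 MB/s ceiling and 1 PB/year as exact, and assumes every experiment records for the full 25 years. The variable names are illustrative.

```python
# How the ~100 MB/s storage rate relates to ~1 PB/year, and the lifetime total.
storage_rate_mb_per_s = 100
mb_per_pb = 10**15 // 10**6      # 1 PB = 10^15 B and 1 MB = 10^6 B, so 10^9 MB

# Seconds of data taking needed to fill 1 PB at the storage-rate ceiling.
seconds_per_pb = mb_per_pb // storage_rate_mb_per_s
print(seconds_per_pb)            # 10000000

# Fraction of a calendar year that this corresponds to.
seconds_per_year = 365 * 24 * 60 * 60
print(f"{seconds_per_pb / seconds_per_year:.0%} of a year")

# Lifetime totals for two to four experiments, assuming 1 PB/year each.
pb_per_year = 1
years = 25
total_low_pb = pb_per_year * years * 2
total_high_pb = pb_per_year * years * 4
print(total_low_pb, total_high_pb)   # 50 100
```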
  15. Re:And you forget... by vrt3 · · Score: 2, Informative

    Actually MP3 is a part of MPEG: MP3 is short for "MPEG Audio Layer 3". More info at this page of the Fraunhofer institute. And yes, I believe it was government funded (but I'm not sure).

    --
    This sig under construction. Please check back later.