disCERNing Data Analysis
technodummy writes: "Wired is reporting how CERN is driving the Linux-based, EU-funded DataGRID project. And no, they say, it's nothing like SETI@home. The description on the site of the project is: 'The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'" If you're interested in this, check out the Fermilab work with LinuxNetworkX data as well as the all-powerful Google search on the Fermi Collider Linux project. As jamie points out, "Colliders produce *amazing* amounts of data in *amazingly* short time periods... on the order of 'here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives'."
And when this stuff becomes commodity hardware, Quake can have a Real Quantum Effects Railgun(tm)!
"Draco dormiens nunquam titillandus."
Has anyone actually seen an IT-related EU project that achieved something? The company I work for has been involved in two EU project proposals so far, and nothing came of either of them -- though they both consumed a large amount of resources from universities to get through the three failed applications each.
Does anyone have an idea of how much data SETI@home has processed? This would certainly be useful as a yardstick of sorts.
"It is a greater offense to steal men's labor, than their clothes"
here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives.
let's see. 1 GB in 10 ms works out to 100 GB per second. how recently did GB ethernet come about? and what would the average bandwidth of users be? i would guess much less, but let us assume 100KB per second.
so you have 107374182400 bytes of data per second. your users can take 102400 bytes per second each. even if everyone was connected directly to your network (no delays or bottlenecks... ha!) you would still require 1048576 users (that is over 1 million).
and this is not taking into account sending any data BACK to the source or actual computation time on the users.
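a quick sanity check of the numbers above, as a python sketch. the 100KB per second per user is only the guess made above, not a measured figure, and the variable names are made up for illustration:

```python
# back-of-the-envelope check: how many 100 KB/s users does it take
# to absorb 1 GB every 10 ms? (binary units, matching the figures above)

GIB = 1024 ** 3               # bytes in one gigabyte (binary)
KIB = 1024                    # bytes in one kilobyte (binary)

burst_bytes = 1 * GIB         # one gigabyte per burst
burst_interval_ms = 10        # a new burst every 10 milliseconds

# assumed per-user download rate; a guess, not a measurement
user_rate_bytes_per_s = 100 * KIB

bursts_per_second = 1000 // burst_interval_ms
source_rate_bytes_per_s = burst_bytes * bursts_per_second
users_needed = source_rate_bytes_per_s // user_rate_bytes_per_s

print(source_rate_bytes_per_s)  # 107374182400
print(users_needed)             # 1048576
```

change user_rate_bytes_per_s to see how the head count moves with a different guess about the average connection.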
-sam
burn the computers. go back to the abacus.
Daisy-chain 10^400 Timex Sinclair computers. Make sure you buy the preassembled kind cause otherwise it would take too long to set up.
"here's a gigabyte, you have 10 milliseconds to pull whatever's valuable out of it before the next gigabyte arrives".
...or just write it all as it comes in and analyze it later. That's how most other science takes place. Since when is scientific analysis "real-time?"
In general, the scientific process does not require conclusions during an experiment. I think CERN should cite a different reason for this project, there are many valid ones.
--- -- - -
Give me LIBERTY, or give me a check.
I think they are assuming they'll be able to actually get all this raw data out to people around the world. That's going to be a problem for people on dial-up (still the majority in the US; what about Europe?). Plus there's the fact that it's going to cost a hell of a lot of money to keep their end of the data pipe from bursting. Even if they only have a couple hundred megabytes per second, that's quite a bit to maintain.
I know broadband is getting more accepted, but I don't think real-time is going to work on this kind of scale. SETI is successful because anyone can run it (even if it is slow) and there's competition to get the most work units done. Without something to keep people interested, no one is going to run anything from CERN. Without the ability for a broad range of people to run a client or something, there aren't going to be enough people anyway.
Hard drive space is cheap (compared to a super-collider), so why can't they store all these petabytes of data? When the project gets more successful, they'll be able to actually analyse all the extra data they've got. I mean, if you're going to spend that much money on a collider, you might as well get as much info as you can from it.
good luck,
sopwath
Taking a look at Gridcomputing, it's pretty clear to see that Grid Computing is actually... Distributed Computing. There's no new concept here - so why the new name? It doesn't make sense until you read the sound bite: "I believe grid computing will revolutionize the way we compute".
Yes, if you can't invent an idea, rename it, and maybe you'll get some credit. What the hell, it's worked before.
Oh well. More power to them. It looks like a great opportunity for the world to learn that Linux is a powerful tool.
If guns kill people, then CmdrTaco's keyboard misspells words.
This is exactly what the niche market for solid state drives is. You have gigs of data you need to get there FAST... then you can worry about picking it apart afterwards. Once you have it on the solid state drive, then as long as you don't lose both mains power and your UPS power, you can leisurely use however many computers you want to nitpick it without having to worry about missing data.
That 1 petabyte, if stored as an area of black and white 8mm square bathroom tiles with 2mm grout would cover an area of 900,720 square kilometres which is about 741 times the area of Los Angeles.
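For anyone who wants to check the tile figure, here is a rough sketch of the arithmetic in Python. It assumes a binary petabyte (2^50 bytes), one bit per tile, and a 10mm pitch (8mm tile plus 2mm grout). The Los Angeles area of about 1,215 square kilometres is an assumption picked to match the ratio quoted above, not a figure from the source.

```python
# How much floor does one petabyte cover at one bit per bathroom tile?

PETABYTE_BYTES = 2 ** 50            # assumed: binary petabyte
bits = PETABYTE_BYTES * 8           # one tile per bit (black or white)

tile_mm = 8                         # tile edge length
grout_mm = 2                        # grout width between tiles
pitch_mm = tile_mm + grout_mm       # each tile occupies a 10mm x 10mm cell

total_area_mm2 = bits * pitch_mm ** 2
MM2_PER_KM2 = 10 ** 12              # (10^6 mm per km) squared
total_area_km2 = total_area_mm2 / MM2_PER_KM2

LOS_ANGELES_KM2 = 1215              # assumed city area, see note above
ratio = total_area_km2 / LOS_ANGELES_KM2

print(round(total_area_km2))        # 900720
print(round(ratio))                 # 741
```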
Bring on the pixie dust!
(source)
Karma police, I've given all I can, it's not enough, I've given all I can, but we're still on the payroll.
MP3... created by the Fraunhofer Institut, and yes, that is Germany. Thus European, but I don't know if it was government funded.
Why not just use something like a PC on a card (a la Transmeta or one of the others mentioned here at /. any number of times) to have each PC house multiple systems to compute the results?
I'm sure that a custom system could be designed and built for the problem on the cheap side (using off the shelf products and parts) and the cost could be spread around the various colliders around the world.
Heck, it would make for a good DARPA grant, hint hint.
Also, thinking about the amount of data generated, I'm sure that the collectors have some sort of system to buffer all that data (ungodly amount of RAM anyone?) which is then sent down the wire to storage over multiple NICs.
I also don't think that colliders are run 24/7 as someone else suggested / wrote.
Henry
III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIII
There is a great deal of activity here in the US w.r.t. the transfer of large amounts of data via advanced networks. Internet2 is working with the international physics community from the US side via the HENP (High Energy and Nuclear Physics) Networking Working Group. Additionally, there is work with the National Earthquake Engineering Simulation Grid. NEES is going to be collecting similar amounts of information from earthquake simulation experiments.
Some of the most interesting work is being done by those involved with the End to End Performance Initiative. These folks are trying to figure out what it takes to support the data transfer rates that will soon be necessary.
It continues to amaze me that it is now possible to use a network to transfer data to a disk/array faster than the disk/array can process it. I believe that many have pointed out that hardware (in terms of Moore's law and data acquisition/processing) is not keeping up with the rate of data creation. But that is probably a bit obvious to most of us.
Actually the Fermilab article pointed to concerns a cluster of machines used for the L3 trigger of the D0 experiment (of which I'm a member). This actually has very little to do with the GRID since it is used as the final stage of a three stage trigger process which decides when an "interesting" event has been produced by the collider. The previous stage, L2, also uses Linux/Alpha machines but is not really a cluster since these custom built boards sit in various crates of electronics and process only a fraction of the data that the L3 sees (however our time budget at L2 is 100 microseconds compared to L3's 100 milliseconds!).
However, that said, D0 is heavily involved with the GRID project and has what is arguably one of the first production GRID applications, called SAM. This system essentially manages all of our data files around the entire globe and allows any member to run an analysis job on a selected set of data files. SAM then handles the task of getting those files to the machine where the job is running using whatever means is required (rcp or fetching it from a tape store). SAM also allows remote institutes to add data to the store, which is used primarily by large farms of remote Linux boxes which run event simulations. We are also currently working on integrating SAM into our desktop Linux cluster, which will allow us to use the incredibly cheap disk and CPU which is available for Linux machines. For more details you can consult the following web pages:
http://www-d0.fnal.gov/ - the D0 homepage
http://d0db.fnal.gov/sam - the SAM homepage
I can see it already.
*** jamie(~who@gives.a.fl.us) joined #slashdot ... u know linux and all ... hold on a sec ... can't find it though ... CERN is driving the Linux-based, EU funded, DataGRID project. ... like you know here's a gig fill it and you've got a split second to pull the goods outtathere ... now i gotta work this
<CmdrTaco> lookin' for cyber msg me
<Hem0s> Hey jamie
* KatzAWAY is now away [logger:on]
<jamie> hey hemos
<Hem0s> whazzup?
<jamie> oh got this gr8 link here but got no access to the backend right now. can u help me out?
<Hem0s> sure thing.. what you got?
<CmdrTaco> jamie a/s/l?
<jamie> i found this link about this grid computing whatsimagigger and i just thought it's cool
<Hem0s> u uh
<CmdrTaco> jamie a/s/l?
<jamie> shut up taco
<Hem0s> so what's the link?
<timothy> boooooring
<jamie> i found it while zapping through wired somehow my browser crashed on me again can u go find it?
<Hem0s> sure
<CmdrTaco> timothy a/s/l?
*** CmdrTaco (rob@home) Quit (Connection reset by peer)
<jamie> gotta tell you i LOVE that post you did on OpenGL a minute ago
<Hem0s> thx
<jamie> it's there somewhere
*** CmdrTaco (rob@home) joined #slashdot
*** bill{Taco} sets mode: +b CmdrTaco
<jamie> ok lemme try again
<Hem0s> hurry jamie i already fired up mah mozilla dont know how long she stays put
<CmdrTaco> lookin for a good time? msg me
*** KatzAWAY left #slashdot
<jamie> here it is
<jamie> The objective is to enable next generation scientific exploration which requires intensive computation and analysis of shared large-scale databases, from hundreds of TeraBytes to PetaBytes, across widely distributed scientific communities.'
<Hem0s> great stuff... lemme copy'npaste here..
<jamie> somethin bout amazing amounts of stuff in short timed periods
<Hem0s> you don't mind if i edit this a bit don't you
<jamie>gotta go bye!
<Hem0s> you don't mind if i redo this a bit don't you?
*** jamie left #slashdot (gotta reboot bye)
<CmdrTaco> lookin for cyber. msg me
<Hem0s> great
*** michael sets mode: +ms
*** You were kicked by michael (spyin on us?)
Well, I didn't accept any cookies from the datagrid page and the result was amusing - I've never seen such an artistic error page. Try for yourself.
Tested under Netscape 6.2 only...
e-mail: karol at tls-technologies.com
www: http://www.tls-technologies.com
sig: not found
This reminds me of an astronomy-related story I saw yesterday. Some projects are generating more data than the people doing the projects can handle.
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
So I went and found the ATLAS Technical Design Report, which gives all the numbers:
The final data rate is expected to be about 1PB/year (1 PB = 10^15 B = 10^9 MB). The LHC collider will probably run for about 25 years, and there will be at least two experiments (and maybe up to four) running for most of that time.
You can read about it at: www.griphyn.org
Buried on the web site is the original proposal they made, and it gives you some idea of the amount of data we're working with.
Some approximate statistics from the paper:
SDSS gets data at 8MB/s, 10TB/year.
LIGO will get data at 10MB/s, 250TB/year.
CMS will get data at 100MB/s, 5 Petabytes per year.
Work has already been done with simulated data for CMS, and a demo of virtual data (may be pre-calculated, or calculated on demand) for CMS was shown at the Supercomputing 2001 conference last week. They used Condor clusters from a few different sites. I'm not sure which sites made it into the final demo, but it may have included U. Florida, Argonne, and U. Wisconsin.
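One way to read the rates and yearly volumes listed above is to compare each yearly volume with what the quoted rate would produce if it ran nonstop. A rough Python sketch, assuming decimal units (1 TB = 10^6 MB) and a 365-day year; the numbers are the ones from the list, and the ratio is only a consistency check, not something the proposal states:

```python
# Compare quoted yearly volumes with "rate x one full year".

SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds
MB_PER_TB = 10 ** 6                  # assumed: decimal units

# name: (rate in MB/s, quoted volume in TB/year), from the list above
experiments = {
    "SDSS": (8, 10),
    "LIGO": (10, 250),
    "CMS": (100, 5000),              # 5 PB = 5000 TB
}

def yearly_tb_if_continuous(rate_mb_per_s):
    """TB per year if data arrived at this rate around the clock."""
    return rate_mb_per_s * SECONDS_PER_YEAR / MB_PER_TB

for name, (rate, quoted_tb) in experiments.items():
    continuous = yearly_tb_if_continuous(rate)
    ratio = quoted_tb / continuous
    print(f"{name}: {continuous:.0f} TB/year if continuous, "
          f"{quoted_tb} TB/year quoted, ratio {ratio:.2f}")
```

A ratio well below 1 would be consistent with an instrument that only records part of the time, and a ratio above 1 suggests the yearly total counts more than the raw stream at that rate. The list alone doesn't say which.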
It's just a "little" bit of data to store. The particle accelerator at Fermilab called the Tevatron can cause 2.5 million particle collisions per second. The new CERN PA will be able to produce 100 times more collisions per second; it's due to come online around 2006.
Fun facts of the Fermilab PA:
700 scientists and engineers work there.
1000 giant superconducting magnets.
$10 million in annual electricity bills.
15 miles of pipes to carry the liquid helium to the magnets.
An optimist believes we live in the best world possible; a pessimist fears this is true.
You can have a project that gets tons of hits yet no one seems to actually want to maintain. Case in point, what has to be one of the most pointless pieces of eye candy ever created, CMatrix. Gets lots of downloads, mind you, but I posted (one year to the day as a matter of fact, what kind of coincidence is that) about needing a new maintainer. There have been about three volunteers since then, none of the applicants really mentioned having any experience in programming in curses (the toolkit used), let alone managing a project written in it.
Basically what it comes down to is most people (even GNU/Linux users) want to download and run the program, MAYBE poke at the code a little. But take over actual maintainership (even if it's next to no actual work), fugedabouit!
v2sw7CUPhw5ln6pr5Pck4ma7u7LFw0m6g/l7Di5e6t5Ab6TH.
I did the same kind of calculation here:
5 31 974
... not bad.
http://slashdot.org/comments.pl?sid=23464&cid=2
900,720 km^2
The United States of America is 9,372,143 km^2
Alaska is 1,518,800 km^2
Texas is 692,405 km^2
Arizona is 295,024 km^2
The Atlantic Ocean is 82,362,000 km^2
Europe is 10,360,000 km^2
Denmark (my home country) is a measly 43,069 km^2
Great Britain is 244,044 km^2
Germany is 356,733 km^2
France is 547,026 km^2
The Pacific Ocean is 181,300,000 km^2
Australia is 7,686,810 km^2
Greenland (the largest island in the world) is 2,175,600 km^2
We do not live in the 21st century. We live in the 20 second century.
I miss it too. I've got a really fucking huge diary entry though.
501 Not Implemented
In August Berkeley posted on their website that they had reached the ZettaFLOP (10 to 21st power floating-point operations) mark - a world record!
A SETI workunit is 330-360K of data and needs 3.5 to 4 trillion FLOPs. I average 8 hours per WU in the background on my PIII@733.
Is SETI a complete waste of time? Maybe, but if we never look we will never know if anything is there. My own inspiration comes from the "Vimmin" flying cars that the Vedas described 30,000 years ago.
DISCLAIMER: I work for one of the centres involved in the DataGrid project.
One of the things DataGrid is designed to do is to give researchers easy access to the data they need.
It's kind of like a distributed data store with a tree like structure. The collider feeds data to national centres, they feed data to regional centres, regional centres feed data to local research groups, the researchers analyse the data.
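As a very rough illustration of that tree, here is a toy sketch in Python. It is not DataGrid code: the class, the tier names and the site names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Site:
    """One node in a hypothetical tiered distribution tree."""
    name: str
    children: List["Site"] = field(default_factory=list)
    received: List[str] = field(default_factory=list)

    def feed(self, dataset: str) -> None:
        """Store a dataset here, then pass it on to every site below."""
        self.received.append(dataset)
        for child in self.children:
            child.feed(dataset)

    def leaves(self) -> List["Site"]:
        """Return the sites at the bottom of the tree (the researchers)."""
        if not self.children:
            return [self]
        result: List["Site"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# collider -> national centre -> regional centre -> local research groups
group_a = Site("research-group-a")
group_b = Site("research-group-b")
regional = Site("regional-centre", [group_a, group_b])
national = Site("national-centre", [regional])
collider = Site("collider", [national])

collider.feed("run-0001")
print([site.name for site in collider.leaves()])
print(group_a.received)
```

A real system would presumably send each site only the data it asks for; the sketch pushes everything downstream just to show the shape of the tree.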
What's more interesting is what happens when these researchers start to exchange their results... terabytes of data flying around in all directions, not just downstream.
As for Grid Computing, yes - most of the technology isn't new, but then again neither was the World Wide Web. The Web was successful because it took existing good ideas, added a killer application (Mosaic) and proved to be useful to other fields than the one it was developed for.
The problem is that "grid" computing is being used to describe a number of distinctly different things: distributed data stores, clustered supercomputers, run-anywhere computing resources, commodity computing...
See the GlobalGridForum pages at: http://www.gridforum.org
for more details about Grid research and projects across the world.