Storing CERN's Search for God (Particles)
Chris Lindquist writes "Think your storage headaches are big? When it goes live in 2008, CERN's ALICE experiment will use 500 optical fiber links to feed particle collision data to hundreds of PCs at a rate of 1GB/second, every second, for a month. 'During this one month, we need a huge disk buffer,' says Pierre Vande Vyvre, CERN's project leader for data acquisition. One might call that an understatement. CIO.com's story has more details about the project and the SAN tasked with catching the flood of data."
Interesting article.
Many years ago, when the SSC (Superconducting Super Collider) was still being built in Texas, I went to an HP users group meeting, as I was working primarily with HP-3000 systems at the time. The fellow addressing the meeting was the head of the physics department at the SSC. It was a really neat presentation, in which he described a similar, though orders of magnitude smaller, data storage requirement; he was talking terabytes of data per month, IIRC. At the time, they were planning on using two arrays of 40 workstation computers to handle the load. This would have been a fairly early loosely coupled setup, similar to a Beowulf cluster.
After the presentation I went up to him and told him that all I wanted to do is sell him mag-tapes.
These types of experiments evidently produce tons of data. I wonder if the processing could be parcelled out like Stanford's Folding@Home or SETI to speed up data correlations.
This is an ex-parrot!
They're probably using an object-based parallel filesystem like Lustre or something similar. I heard that at Sun they build these all the time, with one customer striping data across 214 PCs acting as data engines, all within one Lustre filesystem. All the storage is direct-attached, but a SAN can't even come close to the speeds generated, and all the equipment being used is commodity hardware.
A standard dual-CPU, dual-core HP server with Windows can keep a 4Gb FC link pretty full if set up correctly. I work for a large bank, and we have many a Solaris box that can keep 4 or even 8 2Gb FC cards full into our FC and SATA disk arrays. Not to trivialize the extreme coolness of what they are doing at all, but a PB of data with a few PB of I/O in a day isn't what it used to be. I'm just glad to see they don't use Polyserve; it is worthless for clustering and has caused more downtime at work than it has ever prevented. If they really have that much data they should use 10Gb FC or InfiniBand. Even our stodgy old bank is implementing its first InfiniBand system so we can move I/O at 12Gb instead of over the slow 4Gb links.
Based on 1GB/sec * ((3600 * 24) * 31), that means over 2.5 Petabytes.
Wow.
Something like 3000 of the current 1TB drives.
How long until Exabyte level storage is required for some project or another?
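The estimate above can be checked with a few lines of Python. This is only a sketch under stated assumptions: decimal units (1 PB = 1,000,000 GB, 1 TB = 1,000 GB), a constant 1 GB/s for a full 31-day month, and no allowance for redundancy or filesystem overhead. All variable names are made up for illustration.

```python
import math

RATE_GB_PER_S = 1            # sustained rate quoted in the story
SECONDS_PER_DAY = 3600 * 24
DAYS = 31                    # assumption: a full 31-day month of running

# Total volume written over the month, in decimal gigabytes.
total_gb = RATE_GB_PER_S * SECONDS_PER_DAY * DAYS

# Convert to petabytes (decimal: 1 PB = 1,000,000 GB).
total_pb = total_gb / 1_000_000

# Number of 1 TB drives needed to hold it, ignoring RAID and spares.
drives_1tb = math.ceil(total_gb / 1000)

print(f"{total_gb} GB = {total_pb:.2f} PB, about {drives_1tb} x 1TB drives")
```

So "over 2.5 Petabytes" and "something like 3000" drives are both in the right ballpark before any redundancy is added.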
Trying to associate Microsoft with "fun" is like trying to associate Satan with aromatherapy. -Tycho
I'm not so sure about the "huge disk buffer". Smaller disks can be spun faster and tend to have lower latency. I'd like to see the drum drive make a comeback for disk cache...expensive, but fast!
"You're young, you're drunk, you're in bed, you have knives; shit happens." -- Angelina Jolie
From my experience, generic blue work clothes (preferably with your name on the breast pocket) work best. I once got into some research facility (they had lasers and everything) because I got out of the elevator on the wrong floor and some guy in a lab coat opened the door for me (I was wearing my work clothes because I was on my lunch break). I wandered about at the place for something like 10 minutes before I found a way out. There was even a security guy of some type sitting at a hallway but he lost interest in me after I looked him in the eye and said hello.
Absolutely correct. (I didn't read the article - I work with the Grid [LCG])
Just two points which many seem to ignore:
Firstly, the data is of no use if it just sits on some tape/disk drives at CERN, because it has to be analyzed as well if you actually want to find something. Back when the whole thing started, it was deemed too expensive to build a central analysis facility at CERN, so the LHC Computing Grid was created: some ~100 datacenters around the world with lots (>20k) of CPUs and lots of disk space. The data from CERN is automatically distributed over high-speed links to the main site in every "cloud" (called Tier 1, for example Karlsruhe in Germany) and then from there to the smaller centers. Then, if a physicist sends an analysis job, it finds its way to the site where the data is and works there, so there is no unnecessary copying.
Secondly, in addition to the real data coming out of the detector, physicists also need quite a lot of simulated "Monte-Carlo" data. The production and storage of that has already been going on for some time, and is already taking up some millions of Gigabytes.
By the way, the data storage management system preferred by a lot of the LHC guys is called D-Cache ( dcache.org ), developed at DESY in Hamburg and free for non-commercial use (this is only for you if you have lots of disks, and preferably a tape robot as a backend).
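The "job goes to the data" idea from the first point can be sketched in Python. This is purely illustrative and is not how the LCG middleware is actually implemented: the replica catalogue, the site names, the slot counts and the `pick_site` function are all hypothetical.

```python
# Hypothetical replica catalogue: dataset name -> sites holding a copy.
REPLICAS = {
    "run-0001": ["tier1-karlsruhe", "tier2-a"],
    "run-0002": ["tier2-b"],
}

# Hypothetical number of free CPU slots at each site.
FREE_SLOTS = {"tier1-karlsruhe": 5, "tier2-a": 40, "tier2-b": 0}


def pick_site(dataset, replicas, free_slots):
    """Return the site that already holds `dataset` and has the most
    free slots, or None if no such site has a free slot right now."""
    candidates = [
        site for site in replicas.get(dataset, [])
        if free_slots.get(site, 0) > 0
    ]
    if not candidates:
        return None  # caller would queue the job rather than copy data
    return max(candidates, key=lambda site: free_slots[site])


print(pick_site("run-0001", REPLICAS, FREE_SLOTS))
print(pick_site("run-0002", REPLICAS, FREE_SLOTS))
```

The point of the sketch is only that the scheduler never considers a site without a local copy, so the job moves instead of the dataset.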
Right now, the average event size for ATLAS is 1.6 MByte and the system is designed to keep around 200 events per second, or roughly 300 MByte per second. This isn't much of course, but you have to consider that the bunch crossing rate (i.e. the rate at which bunches of protons will collide and generate events) is 40 MHz.
So you have to design a system that boils this rate from 40 MHz down to 200 Hz and only keeps the interesting parts, while also buffering all the data in the meantime. For this reason, the first trigger level is entirely implemented in hardware right in the detector and reduces the rate down to 75 kHz with a latency of 2.5 µs. The rest of the trigger works on clusters using Linux computers and has a latency of O(1 s).
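To put the trigger numbers above side by side, here is a small Python sketch that works out the rejection factor of each stage and the resulting storage rate. It uses only the rates and event size quoted in this comment; the constant and function names are made up for illustration.

```python
BUNCH_CROSSING_HZ = 40_000_000   # 40 MHz bunch crossing rate
LEVEL1_OUTPUT_HZ = 75_000        # rate after the hardware first-level trigger
STORED_HZ = 200                  # events per second finally kept
EVENT_SIZE_MBYTE = 1.6           # average ATLAS event size


def reduction_factor(rate_in_hz, rate_out_hz):
    """How many input events there are for each one passed on."""
    return rate_in_hz / rate_out_hz


level1_factor = reduction_factor(BUNCH_CROSSING_HZ, LEVEL1_OUTPUT_HZ)
software_factor = reduction_factor(LEVEL1_OUTPUT_HZ, STORED_HZ)
overall_factor = reduction_factor(BUNCH_CROSSING_HZ, STORED_HZ)

# Data rate that actually has to be written out.
stored_mbyte_per_s = STORED_HZ * EVENT_SIZE_MBYTE

print(f"hardware trigger keeps about 1 in {level1_factor:.0f}")
print(f"software trigger keeps 1 in {software_factor:.0f}")
print(f"overall: 1 in {overall_factor:.0f}")
print(f"stored: {stored_mbyte_per_s:.0f} MByte/s")
```

In other words, only about one bunch crossing in 200,000 ends up on disk, which is why the stored rate looks so modest.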
...all this data will be distributed to a handful of TIER1 sites (CERN is TIER0) all over the world (about 10). At the TIER1 sites the data will be preprocessed. The TIER1 sites distribute their preprocessed data to TIER2 sites, which are the places where the international scientists work. I work at a TIER1 site and we face a lot of technical challenges with this project. At a TIER1 site, as I mentioned, the data is preprocessed too, so we will need a compute cluster and the necessary bandwidth internally to move the data around. With each new software release (about every six months), ALL raw data has to be reprocessed with the new software. All results have to be stored. So for every part of raw data we will have to store preprocessed data for every software release. Of course a lot of data will be stored on tape, but we expect that the dataflow from CERN (for us 150 MB/s to disk and 75 MB/s to tape) will be the least of our problems. Moving the data around and preprocessing the data is probably a bigger problem in the long run. And given that the machine will be running for about 15 years or so, this will be a very long run!
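A rough Python sketch of what those figures add up to. It assumes the 150 MB/s and 75 MB/s streams run around the clock (in practice the accelerator does not run every day), uses decimal units (1 TB = 1,000,000 MB), and takes "every six months" and "15 years" at face value. All names are illustrative.

```python
SECONDS_PER_DAY = 3600 * 24
DISK_MB_PER_S = 150      # inbound from CERN to disk at this TIER1 site
TAPE_MB_PER_S = 75       # inbound from CERN to tape
RELEASES_PER_YEAR = 2    # a new software release about every six months
YEARS_RUNNING = 15       # expected lifetime of the machine


def tb_per_day(mb_per_s):
    """Daily volume in decimal terabytes for a constant stream."""
    return mb_per_s * SECONDS_PER_DAY / 1_000_000


disk_tb_per_day = tb_per_day(DISK_MB_PER_S)
tape_tb_per_day = tb_per_day(TAPE_MB_PER_S)

# Each release triggers a full reprocessing pass over all raw data,
# and each pass produces another set of results to keep.
reprocessing_passes = RELEASES_PER_YEAR * YEARS_RUNNING

print(f"disk: {disk_tb_per_day:.2f} TB/day, tape: {tape_tb_per_day:.2f} TB/day")
print(f"full reprocessing passes over the lifetime: {reprocessing_passes}")
```

The inbound stream is steady and predictable; the repeated reprocessing passes over an ever-growing raw dataset are what make the internal data movement the harder problem.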
I managed to see this at Easter. It's huge. I've posted some photos at: http://grantpe.googlepages.com/cernpics. The last shows one of the rooms of computers they're using. The others are just views of the huge detector. It's in a man-made cavern 100 metres tall and 100 metres wide, all below ground! All just taken on the visitors' tour.
It is! Sorta, at least... On my experiment (CMS), data gets a first pass handling on site at CERN, then gets parceled out to about 7 other sites (of which Fermilab is one) where their section of data gets another look. Each Tier 1 station, as it's called, also services requests from affiliated research institutions, both to get reconstructed data, and also to run and store their simulated data.
It's a really neat system that makes the geek in me happy =)
The folks at CERN maintain a set of libraries for analyzing nuclear and high-energy physics data sets, known as ROOT. These also include the Parallel ROOT Processing Facility, or PROOF. I'm guessing that PROOF will play an important role in the analysis of this experiment once it comes online.