Computer Science Tools Flood Astronomers With Data
purkinje writes "Astronomy is getting a major data-gathering boost from computer science, as new tools like real-time telescopic observations and digital sky surveys provide astronomers with an unprecedented amount of information — the Large Synoptic Survey Telescope, for instance, generates 30 terabytes of data each night. Using informatics and other data-crunching approaches, astronomers — with the help of computer science — may be able to get at some of the biggest, as-yet-unanswerable cosmological questions."
All the data storage capacity in the world cannot change the fundamental laws of optics, the speed of light, or the coming government cuts to scientific research spending.
Methinks it's the sensors that are doing the flooding, not "computer science tools".
But I think "tools" is a bit offensive. They're trying to help the astronomers in a meaningful way.
My biggest issue would be if there is too much information. What if the scientists are using the wrong search queries and missing something important? Or maybe something important is just buried on page 931 of a 2,000-page data report. Still, it's better than the opposite problem, of just not having the data to search.
Many sciences are experiencing this trend. A branch of biochemistry known as metabolomics is a growing field right now (in which I happen to be participating). Using tools like liquid chromatography coupled to mass spectrometry we can get hundreds of megabytes of data per hour. Even worse is the fact that a large percentage of that data is explicitly relevant to a metabolomic profile. The only practical way of analyzing all of this information is through computational analysis, either through statistical techniques used to condense and compare the data, or through searches on painstakingly generated metabolomic libraries.
That is just my corner of the world, but I imagine that many of the low-hanging fruits of scientific endeavor have already been picked. Going forward, I believe that the largest innovations will come from the people willing to tackle data sets that a generation ago would have been seen as insurmountable.
Try GalaxyZoo.com. It uses data from the Sloan Digital Sky Survey mentioned in the article and supplies it to any user who signs up to view it. They give you a windowsill position, so to speak: an open window into the world of astronomy that lets you supply your own inferences or generally process the data they have bottlenecked for the terms of a white paper.
Plenty of items in the array, from planet hunting to spectroscopic analysis. Granted, those fit in the same overlap on the Venn diagram, of course.
I'm not an expert in Astronomy, but in general, I don't think you can collect too much data, as long as it's stored in an at least somewhat intelligible format. This way, even if professional astronomers miss something today, amateurs and/or future astronomers will have tons of data to pick apart and scavenge tomorrow.
Plus, more data should make it easier to test hypotheses with more certainty. Hopefully, the data will be made publicly available after the gatherers have had a shot or two at it.
That's a tremendous amount of data to get from a telescope that hasn't even been built yet.
"the Large Synoptic Survey Telescope, for instance, generates 30 terabytes of data each night."
You know what they say, garbage in, garbage out...
I have no idea if that telescope is garbage or not, but I do know that if we keep canceling new telescope development then we will quickly be left with just the garbage.
30TB per day works out to about 11 petabytes per year. If you compare this to the total amount of data produced in a year (from all human sources), around a zettabyte, it's not that huge. In fact, IIRC, the yearly transfer rate of the internet is around 250 exabytes. The people with the really hard job of data processing are internet search engines. Not only do they have to go through several orders of magnitude more data, they have to do it faster, and with much less clearly defined queries.
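For anyone who wants to check the arithmetic, here is a quick back-of-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes) and 365 observing nights a year, which no real telescope manages, so treat the yearly figure as an upper bound. The function names are just made up for illustration.

```python
import math

# Decimal (SI) units, assumed for this rough estimate.
TB, PB, EB, ZB = 10**12, 10**15, 10**18, 10**21


def yearly_volume(bytes_per_night, nights=365):
    """Total bytes per year, assuming data is taken every single night."""
    return bytes_per_night * nights


def orders_of_magnitude(big, small):
    """How many powers of ten separate two quantities."""
    return math.log10(big / small)


lsst_year = yearly_volume(30 * TB)
print(f"LSST per year: {lsst_year / PB:.2f} PB")
print(f"vs. ~1 ZB produced per year: "
      f"{orders_of_magnitude(1 * ZB, lsst_year):.1f} orders of magnitude")
print(f"vs. ~250 EB yearly internet traffic: "
      f"{orders_of_magnitude(250 * EB, lsst_year):.1f} orders of magnitude")
```

The zettabyte and 250-exabyte figures are the rough ones quoted from memory above, not measured values.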
I sometimes wonder how generally useful something like Google's PageRank system is. It might be possible (if Google ever runs out of other things to do :) to apply this to arbitrary scientific datasets. This could tremendously speed up such calculations. Unfortunately, it may be a while before it is possible for any of the major search engines to release significant parts of their algorithm without being at a serious disadvantage competitively.
However, there is another obstacle as well: one dataset doesn't cross-reference itself anywhere near as much as the internet does, and it is fairly certain that Google (at least) uses this for a good part of its ranking. So we would also want to incorporate the opinions of scientists (both amateur and professional), and many datasets, to give the PageRank system the detail it would need.
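To make the cross-referencing point concrete, here is a minimal sketch of the textbook PageRank power iteration. This is the published basic form, not whatever Google actually runs today, and the dataset names and links are entirely hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each node to the list of nodes it references.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical cross-references between datasets.
links = {
    "sky_survey": ["star_catalog"],
    "star_catalog": ["sky_survey", "spectra"],
    "spectra": ["star_catalog"],
    "orphan_dataset": ["star_catalog"],
}
ranks = pagerank(links)
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

A dataset that nothing else references ends up with the minimum score no matter how good it is, which is exactly the obstacle: without plenty of cross-references (or scientists' opinions standing in for them) the ranking has little to work with.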
For some reason, that word scares me..
For justice, we must go to Don Corleone
*WILL* generate. LSST isn't operating yet.
And yes, 30TB is a lot of data now, but we have some time before they finally have first light.
Operations isn't supposed to start 'til 2019 : http://www.lsst.org/lsst/science/timeline
We just need network and disk drive sizes to keep doubling at the rate they have, and we'll be laughing about how we thought 30TB/night was going to be a problem.
SDO finally launched last year with a data rate of over 1TB/day ... and all through planning, people were complaining about the data rates ... it's a lot, but it's not insurmountable as it might've been 8 years ago, when we were looking at 80 to 120GB disks.
Although, it'd be nice if monitor resolutions had kept growing ... if anything, they've gotten worse the last couple of years.
(Disclaimer : I work in science informatics; I've run into Kirk Bourne at a lot of meetings, and we used to work in the same building, but we deal with different science disciplines)
Build it, and they will come^Hplain.
In fact, they just started blasting the site. I actually live next door to the LSST's architect, which is pretty cool.
Astronomers generate a tremendous amount of data, bested only by particle physicists. Storing it all is a challenge, to put it mildly. Backup is basically impossible.
The real problem is that the data lines that go from the summit to the outside world are still not fast. The summits here are pretty remote and even when you get to a major road, it's still in farm country. And then getting it out of the country is tough--all of our network traffic to North America hits a major bottleneck in Panama, so if you're trying to mirror the database or access the one in Chile, it can be frustratingly slow.
Astronomers are used to taking pictures and storing them, but that doesn't mean that it is the best way to operate.
The fact that you can store the data doesn't mean that you have to or should do so. Why not capture it again when it is needed again, the way other monitoring systems do?
As far as I understand it, the data will be available also to the general public. I assume that means they will need to have a global network of caches?
Astronomers generate a tremendous amount of data, bested only by particle physicists.
Earth scientists will merrily generate far more — they're purely limited by what they can store and process, since deploying more sensors is always possible — but they're mostly industrially funded, so physicists and astronomers pretend to not notice.
"Little does he know, but there is no 'I' in 'Idiot'!"
As far as I understand it, the data will be available also to the general public. I assume that means they will need to have a global network of caches?
Possibly. It depends on how much the general public actually wants to download the data; if it is just selected images instead of the bulk (most of which will be boring "not much happening here" stuff) then serving it from a single site will be quite practical.
"Little does he know, but there is no 'I' in 'Idiot'!"
*WILL* generate. LSST isn't operating yet.
This, unless they have a time machine. ;)
The first Pan-STARRS scope with its 1.3-gigapixel camera has been doing science for a little while now, and I think it might do something like 2.5TB a night. That's still a lot of disk (and keep in mind that they originally planned to have 4 of those scopes), but I think their pipeline reduces it all to coordinates for each bright thingy in the frame and then throws away the actual image (though I could be wrong).
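As a toy illustration of that "reduce the frame to coordinates" idea, here is a sketch that keeps only pixels above a threshold that are also brighter than all their neighbours. This is not the actual Pan-STARRS pipeline (which I haven't seen); the function name, the threshold and the tiny image are all made up.

```python
def find_bright_sources(image, threshold):
    """Return (row, col, value) for local maxima at or above threshold.

    image is a list of equal-length rows of pixel values.
    """
    rows, cols = len(image), len(image[0])
    sources = []
    for r in range(rows):
        for c in range(cols):
            value = image[r][c]
            if value < threshold:
                continue
            # Collect the up-to-8 surrounding pixels, clipped at the edges.
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                sources.append((r, c, value))
    return sources


# Made-up 5x5 frame: flat background with two bright spots.
image = [
    [10, 11, 10, 12, 10],
    [11, 90, 12, 10, 11],
    [10, 12, 11, 10, 10],
    [12, 10, 10, 75, 30],
    [10, 11, 10, 28, 12],
]
catalog = find_bright_sources(image, threshold=50)
print(catalog)  # [(1, 1, 90), (3, 3, 75)]
```

Twenty-five pixels become two catalog rows, which is where the disk savings come from. The cost is that anything the detector didn't flag is gone once the image is thrown away.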
Where I work, our highest-resolution toy is 80 megapixels right now, but we're supposed to get a shiny new one next year with a FOV three times wider and close to a gigapixel of resolution... that'll chew through disk and bandwidth like crazy.
Village idiot in some extremely smart villages.
Theoreticians surely generate most because they're only limited by how fast a CPU can churn out floating-point numbers.
Ok, I know this doesn't solve the problem of actually ANALYZING the data but for storing and moving the data around, what's the best compression algorithm for astronomical (I mean the discipline, not the size!) data?
I used to work for a company that developed a really good compression algorithm using wavelets. At the time it was the only one to be accepted by A-list movie directors (the people with the real power in Hollywood); they refused to go with any of the JPEG or MPEG variants (this was before JPEG 2000 which I understand also uses wavelets). We pitched to JPL the idea that they use some of this technology for some of their mission imaging requirements but they said the data was almost priceless and they couldn't risk losing any data to an admittedly "lossy" compression algorithm. Of course they were forced to break this policy with Galileo because the main antenna never opened, so it only had a tiny fraction of the bandwidth that they originally planned. (Interestingly enough after the Columbia disaster, our equipment was later heavily used by NASA for imaging requirements related to observing the space shuttle. It helped make the conversion to a digital workflow practical which really sped up the time needed to distribute Hi-Res launch videos to all the NASA engineering sites around the country.)
So do astronomers use some lossless compression algorithm? In the case of space based data collection, do they have the computer power to compress it on board? Do they "clean up" the images first to make it easier to compress?
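To show what "lossless" buys you, here is a small sketch that compresses a synthetic 16-bit frame with zlib from the Python standard library and checks that every pixel comes back unchanged. The frame is random made-up data (flat background, small noise, a few bright pixels), and zlib is only a stand-in; I'm not claiming it is what any observatory or spacecraft actually uses.

```python
import random
import zlib
from array import array


def synthetic_frame(width=256, height=256, seed=0):
    """Made-up sky frame: flat background, small noise, rare bright pixels."""
    rng = random.Random(seed)
    pixels = array("H")  # unsigned 16-bit integers
    for _ in range(width * height):
        value = 1000 + rng.randint(-5, 5)
        if rng.random() < 0.001:
            value += rng.randint(1000, 50000)  # occasional bright source
        pixels.append(value)
    return pixels


def compress_lossless(pixels):
    """Compress the raw pixel bytes with zlib at maximum effort."""
    return zlib.compress(pixels.tobytes(), 9)


def decompress(blob):
    """Rebuild the pixel array from a zlib blob."""
    pixels = array("H")
    pixels.frombytes(zlib.decompress(blob))
    return pixels


frame = synthetic_frame()
blob = compress_lossless(frame)
restored = decompress(blob)

assert restored == frame  # lossless: every pixel is identical
print(f"raw: {len(frame.tobytes())} bytes, compressed: {len(blob)} bytes")
```

How much you save depends heavily on the noise: the noisier the background, the less any lossless scheme can squeeze out, which is one reason the "clean up first" question matters.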
Deploying more telescopes is always possible as well.
This isn't a race about who can fill up storage space the quickest.
Slashdot social media options: AIM, ICQ, Yahoo, Jabber and Mobile Text. Why no MySpace?
“Data is a lot like garbage. You need to know what you are going to do with it before you start collecting it.” - Mark Twain
At least you aren't at Dome A. You might have to use some tropospheric (to not pay outrageous SAT usage rates).
Atlas Shrugged : Thematic Story