Creating The UniServer
bmongar writes "DrDobbs has an article about a project for a mirrored universal astronomy database. Jim Gray basically wants a network of observatories around the world to publish their data and mirror other observatories' data. Basically creating a quadruple redundant system of data all available online. He wants to create a new type of astronomer, the astronomer that is a data miner." As the article also says, the guy behind this is the guy behind the TerraServer as well.
Well there's an insightful AC.
It would help me if you pointed out what you thought was naive and why.
There is a whole gaggle of scientific work that at first seemed totally worthless commercially but eventually had commercial uses within 100 years or even much, much faster.
It makes me wonder. Researchers usually have their own datasets, and they spend gobs of time working on them with their specialized programs. It seems to me that the really valuable stuff out there is in these closed datasets, not in the everyday stuff that's available on the internet.
As an example, you can get a gazillion CD-ROMs with the Magellan data from Venus. But what good is that raw data? Not much. You'd probably want to get a look at the data on the Venus-geologist's computer instead, because it's been analyzed and selected and generally picked over to produce something meaningful.
If tits were wings it'd be flying around.
And finally, we are discovering universes that are farther away and therefore younger than any we have previously discovered
I REALLY hope you meant to say galaxies instead of universes. I am a physics major with a healthy lean towards computational cosmology, and I would really really hope that if we had discovered entire new universes, I would know about it. SDSS, however, has discovered quite a few new galaxies, so I will assume that is what you were referring to.
If you start out with mounds of raw data you aren't a scientist.
A scientist starts out with a hypothesis.
This is not true. The father of modern science, Francis Bacon, believed that science should be done by collecting as much data as possible and seeing what conclusions the data support.
Hypothesis-driven research is actually in a sense cheating, because in such research the data gathered is biased -- the researcher is not considering all the data which could bear upon the situation but only those data which the researcher believes could support or refute a preconceived hypothesis. Nevertheless, hypothesis-driven research is the norm in science because until recently, that was the only efficient way to do science.
But with new techniques in data mining, we can begin to recapture the promise of Baconian science.
Your average astronomer is already a major data miner. From the Hubble Deep Field to the images taken in the back yard with a home-built CCD camera, much of modern observational astronomy is entirely built around being able to mine those images for correspondence, object attributes, and clustering in position, colour, or some other feature. Even a basic catalogue built off one single-wavelength plate will assign position, size, brightness, orientation, semi-major and semi-minor size, positional error, orientation error, brightness error, isophotal brightness, local background level and half-a-dozen other attributes to each object in the catalogue. There may be several thousand objects in a single frame. Making sense of this data set requires time, some ideas about what you are searching for and some luck.
All that said, you'd be missing a lot as an astronomer if all you looked at was optical images. Going to other images for the same area of sky, be it infra-red, radio, x-ray and so on, will give you a deeper insight into the likely environment of your object and also into any likely confusions due to multiple structures along the line of sight.
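To make the idea of going to other wavelengths for the same area of sky concrete, here is a minimal sketch of a positional cross-match between two catalogues. Everything in it is an assumption for illustration: the dictionary layout, the names `angular_separation_deg` and `cross_match`, and the 2 arcsecond tolerance are hypothetical, and a real survey pipeline would use a spatial index rather than this brute-force loop.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(optical, radio, max_sep_arcsec=2.0):
    """Pair each optical object with its nearest radio object within the tolerance.

    Both inputs are lists of dicts with hypothetical keys "id", "ra", "dec"
    (degrees). Returns a list of (optical_id, radio_id, separation_arcsec).
    """
    matches = []
    for o in optical:
        best = None
        for r in radio:
            sep = angular_separation_deg(o["ra"], o["dec"], r["ra"], r["dec"]) * 3600
            if sep <= max_sep_arcsec and (best is None or sep < best[1]):
                best = (r, sep)
        if best is not None:
            matches.append((o["id"], best[0]["id"], best[1]))
    return matches
```

Objects that find no counterpart within the tolerance are simply left out of the result, which is one way the "likely confusions" along the line of sight would show up in practice.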
So having a vast data repository is important, and astronomers have had the tools to go and query multiple surveys at multiple wavelengths for several years. So there is nothing new here either from a data access point of view. The only really new thing in this proposal is to collate all the data together onto four super-mirrors and ensure that these supermirrors remain in sync, so if one system dies, it can be restored from the other mirrors without having to go back to tape backups.
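As a rough illustration of what keeping the super-mirrors in sync could involve, the sketch below compares per-file checksums across mirrors and lists which copies need to be restored from a healthy mirror. It is not taken from the proposal: the in-memory `mirrors` layout, the majority-vote rule and the function names are all assumptions made for the example.

```python
import hashlib
from collections import Counter

def digest(data: bytes) -> str:
    """SHA-256 checksum of a file's contents."""
    return hashlib.sha256(data).hexdigest()

def find_repairs(mirrors):
    """Work out which mirror copies are missing or disagree with the majority.

    mirrors: {mirror_name: {file_name: bytes}} (hypothetical in-memory stand-in
    for real storage). Returns a list of (mirror, file_name, source_mirror),
    meaning "restore file_name on mirror from source_mirror".
    """
    all_files = set()
    for files in mirrors.values():
        all_files.update(files)

    repairs = []
    for name in sorted(all_files):
        digests = {m: digest(files[name]) for m, files in mirrors.items() if name in files}
        # Assumption: the checksum held by most mirrors is the correct one.
        majority, _ = Counter(digests.values()).most_common(1)[0]
        source = sorted(m for m, d in digests.items() if d == majority)[0]
        for m in sorted(mirrors):
            if digests.get(m) != majority:
                repairs.append((m, name, source))
    return repairs
```

With four mirrors, a single corrupted or missing copy is outvoted three to one, which is the sense in which a dead system could be rebuilt from the others without going back to tape.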
Cheers,
Toby Haynes
Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
.. as you can throw enough storage space at the problem. Just having a giant stockpile of data isn't going to be of much use (except for archival purposes) unless we also have efficient access to the data onsite (we don't want to send terabytes over the network) and have the correct tools to allow different datasets to be compared and correlated. The possibility of doing large-scale data comparisons, or of comparing datasets across a wide range of wavelengths, is surely what is most interesting here and the major point of having an online store (as opposed to a data archive). I wonder what research tools are proposed?
----------------------------------- My Other Sig Is Hilarious -----------------------------------
Give me a break! Does this guy know anything about the field of astronomy from a professional point of view?
Most astronomers/astrophysicists don't spend the time looking through the telescopes themselves - the majority use data that someone else has already gathered. I agree that this would greatly increase their ability to find pertinent data; however, it would hardly bring about a new 'type' of astronomer, since the majority are already data miners.
UBU
Apart from having more observatories publish their data (most already do), having a central point to index it (not really here today, but if you want it you can generally find it - if it's not in the sky survey, it's not in the sky), and having M$ run things (please, no!), what does he hope to accomplish?
I love vegetarians - some of my favorite foods are vegetarians.
----
It's called 'SETI at Home' isn't it?
A pizza of radius z and thickness a has a volume of pi z z a
Scientists are already and have been for a long time working together, standing hand in hand. Maybe it seems Utopian from a selfish viewpoint but it's very natural to scientists.
Well, what I am interested in is that this would make it so that astronomers don't have to use a telescope. Remember, we only have one sky. For each image, there would be attributes for the time-date the image was taken, the celestial coordinates it was taken at, the magnification, the geographical coordinates it was taken at, and perhaps even the weather conditions.
I don't see a reason why images that aren't in the visible spectrum can't be put into this database. Then you would need perhaps a spectrum range attribute.
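The attribute list above maps fairly directly onto a record type. The following is only a sketch of how such a record and a simple lookup might look; the field names, units (degrees, nanometres) and the `images_of_region` helper are assumptions for illustration, and the lookup ignores the wrap-around of right ascension at 0/360 degrees.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical per-image metadata, following the attributes listed above."""
    observed_at: datetime          # time-date the image was taken
    ra_deg: float                  # celestial coordinates (right ascension)
    dec_deg: float                 # celestial coordinates (declination)
    site_lat_deg: float            # geographical coordinates of the telescope
    site_lon_deg: float
    magnification: float
    wavelength_min_nm: float       # spectrum range covered by the image
    wavelength_max_nm: float
    weather: Optional[str] = None  # weather conditions, if recorded

def images_of_region(records, ra_range, dec_range, wavelength_nm=None):
    """Return records inside a sky box, optionally covering a given wavelength,
    ordered by observation time (oldest first)."""
    ra_lo, ra_hi = ra_range
    dec_lo, dec_hi = dec_range
    hits = []
    for rec in records:
        if not (ra_lo <= rec.ra_deg <= ra_hi and dec_lo <= rec.dec_deg <= dec_hi):
            continue
        if wavelength_nm is not None and not (
            rec.wavelength_min_nm <= wavelength_nm <= rec.wavelength_max_nm
        ):
            continue
        hits.append(rec)
    return sorted(hits, key=lambda rec: rec.observed_at)
```

Sorting by `observed_at` is what would make a time-ordered sequence of the same patch of sky possible, given enough images.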
The exciting thing in my opinion is what can be done with all this data. Imagine creating a starmap of the entire sky based on real observation; it may be zoomable at some points. Every time a telescope takes a picture of the sky, it gets put into this database. That could yield a huge amount of data in relatively short time. I can very much see astronomers using this data instead of their own observations. Imagine a "video" of the same part of the sky in twenty years.
This can be done from software if all the data is there. I know I would love this kind of thing to be publicly archivable. If I see something in the sky, I can then look on the internet to see if there were any other images of it.
Sorry if my post is less than coherent, but this seems exciting to me.
Um, I don't think this was a white paper. I think it was an idea...perhaps a proposal to astronomers.
I don't think it matters what OS they use or what database system they use, etc. etc. until they start implementing it.
I think the astronomers would very much appreciate this use of technology. It is one of the purest uses of technology I have known.
But I am interested in details as well though. So for those of you who specialize in this sort of stuff, how would you go about implementing this sort of system? Would GNU/Linux be able to handle it?
404 - Universe Not Found
Please contact the Universe Master at...
Hammer of Truth
The article definitely gets the ol' geek hairs on the back of your neck standing up. Petabyte backups, tape recovery that takes 5 days..
Lots of stuff that makes geek men howl.
However, it leaves out a *TON*. Like, what technology are they going to use to DO data mining? What database will run this monster? Which OS will it run on?
Further, what license/restrictions are there on the data once it gets published? Is it totally public knowledge, free of copyright?
Fundamental questions of large scope and size, not easily ignored.
However, the question *I* have is, why not do the data storage at online companies KNOWN for hosting data, instead of at observatories, which have little experience at that?
GPL'd web-based tradewars themed space game
Although I'm not aware of any database of pure raw data, NASA at least have the Distributed Astronomy Library, described here, which is a repository of astronomical *information*. An example is here
Free Anne Tomlinson!!
*FUD start*This kind of thing reminds me of some M$ ideas on concentrating everything all around the world in one bucket. Somehow this resulted in the .NET idea. So now we are up to the Universe...*FUD end*
Well, anyway the idea is not so bad at all. But I don't see how to realise it without making some radical changes in the system. First we have to deal with communication channels. For volumes like those of astronomical databases, they are highly unreliable. We are not going to run petabytes on them, but surely there will be gigabytes going back and forth. Let's note: a raw Mars image from PDS sometimes weighs up to 20 Megabytes. Processing such images sometimes leads to data volumes 10-30 times bigger. In some cases it is possible to apply JPEG to compress these images, but sometimes it is highly undesirable to do so. So we get something weighing 100-200 Megs. On a 100Mb network, that will take a few minutes to pass from station to station. Now imagine a widespread, worldwide network working that way.
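The transfer-time estimate above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch, assuming "100Mb" means a nominal 100 megabits per second and decimal megabytes; the `efficiency` parameter is a hypothetical knob standing in for protocol overhead, sharing and long-haul losses, not a measured figure.

```python
def transfer_seconds(size_megabytes, link_megabits_per_s, efficiency=1.0):
    """Time to move a file over a link.

    size_megabytes: file size in decimal megabytes (1 MB = 8 megabits).
    link_megabits_per_s: nominal link speed.
    efficiency: fraction of the nominal speed actually achieved (assumed).
    """
    size_megabits = size_megabytes * 8
    return size_megabits / (link_megabits_per_s * efficiency)

# A 200 MB processed image on a 100 megabit/s link:
ideal = transfer_seconds(200, 100)                       # full line rate
congested = transfer_seconds(200, 100, efficiency=0.1)   # a tenth of line rate
```

At full line rate the 200 MB file needs 16 seconds; the "few minutes" figure corresponds to an effective throughput of roughly a tenth of nominal, which may well be realistic for a shared worldwide link.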
On one side we have archives spread all over the world. On the other side there is a growing community of astronomers also working all over. It will be a big challenge to achieve such a thing. And a big financial adventure. Maybe dumb burrocritters will think that data will be cheaper if it keeps rotting on magnetic tape.
It is happening in other sciences. For example, my field "bioinformatics" deals with analyzing molecular biological data, much of which is in public databases such as GenBank. Once experimental molecular biologists could be expected to analyze all their data themselves because there just wasn't very much of it. That just isn't true anymore.
This should have been implemented a long time ago, because the amount of information we are pulling in right now is tremendous and it will only increase as we send up more and more satellites. We need this database for three very important reasons:
First, we are all concerned, due to recent movies, that we might get hit by an asteroid, which is a valid concern, so we need to carefully track the asteroids that we find, because we are currently searching only 10 percent of the sky. Secondly, with newer and more powerful telescopes we are mapping more and more planets outside our solar system every day; soon they will roll in by the dozens a day. And finally, we are discovering universes that are farther away and therefore younger than any we have previously discovered
jbischof
Somehow the Utopian thought behind it makes my logical circuits sputter...
I'm sure you are jesting, but anyhoo...
There are people that are only motivated by money that can't seem to understand that not everyone is motivated by same. If everyone were motivated solely for financial windfall, would Linux exist at all?
Outside of the "hacker" community, I believe that the academic and scientific type communities have contributed the most effort to Linux software in the first ten years (is it 10 years old yet? Maybe eight years), so it's not that much of a stretch. Scientific papers are about trying to share information in the hope of furthering knowledge.
People wanting to get master's and doctorates were able to contribute some effort on their thesis papers.
The biggest problem is, of course, data entry. A lot of the texts pose a challenge for OCR for a number of reasons, including the large number of special characters often used.
Another problem is people who insist on copyrighting and refusing to freely share their collections of online documents in the older languages, which is a real shame, because it prevents me from creating all kinds of interesting derived works (e.g. web pages of Old English texts where you can click any word to get information about it). It basically means that all this work has to be repeated by anyone who wants to make those texts freely available-- never mind that we're talking about works over 1000 years old!
I don't know if it was angular momentum that Kepler figured out from his data, but he did study Tycho's data.
:-) by using stellar precession, etc.
Tycho Brahe may not have been much of an astrophysicist, scientist, or whatnot, but he was a hell of an observer, ESPECIALLY when you consider the crappy tools he had -- an eyeball, a sextant, and an optical telescope.
Scientists today still study his data, because there is so much of it, for such a long time, with such a high degree of accuracy. It's useful for all kinds of things; dating stars (or human events, like pyramid building
--
Do daemons dream of electric sleep()?
I'm no scientist, but I don't think they should use a lossy file format for this kind of thing.
Scientist: Hmm...what's this shady pixel on Mars here? Could it be...could it be life!
Geek: Nahh...that's just a result of the JPEG algorithm making up pixels it lost in its compression.
Putting multimedia data into the file system is the implementation strategy many commercial databases (including some versions of DB2) take behind the scenes for storing multimedia objects, even if they hide it behind a database API. They can still provide all the database facilities (transactions, indexing, access control, etc.) on top of such an implementation.
With that kind of architecture, you don't need a very powerful machine or high performance database to be able to serve image data at disk bandwidth or network bandwidth.
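A minimal sketch of that architecture, assuming nothing about how DB2 or any other commercial product does it internally: the image bytes live as ordinary files, and a small SQLite table holds only the index (name, path, size). The table layout and the helper names `open_store`, `put_image` and `get_image` are hypothetical, and `name` is assumed to be safe to use as a file name.

```python
import sqlite3
import tempfile
from pathlib import Path

def open_store(root: Path) -> sqlite3.Connection:
    """Create the storage directory and the index database if needed."""
    root.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(root / "index.db"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "name TEXT PRIMARY KEY, path TEXT NOT NULL, size INTEGER NOT NULL)"
    )
    return conn

def put_image(conn: sqlite3.Connection, root: Path, name: str, data: bytes) -> None:
    """Write the bytes to the file system, then record the file in the index."""
    path = root / f"{name}.bin"
    path.write_bytes(data)
    with conn:  # commits the index update as one transaction
        conn.execute(
            "INSERT OR REPLACE INTO images (name, path, size) VALUES (?, ?, ?)",
            (name, str(path), len(data)),
        )

def get_image(conn: sqlite3.Connection, name: str) -> bytes:
    """Look the file up in the index and read it straight from disk."""
    row = conn.execute("SELECT path FROM images WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    return Path(row[0]).read_bytes()

def round_trip_demo() -> bytes:
    """Store one fake image in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "store"
        conn = open_store(root)
        try:
            put_image(conn, root, "mars_0001", b"\x00\x01raw pixels")
            return get_image(conn, "mars_0001")
        finally:
            conn.close()
```

The database only ever handles small rows, so serving an image is bounded by how fast the file can be read and sent, not by the database engine.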
Wasn't it Kepler who looked over Brahe's work to work out his law of conservation of angular momentum?
--- It is not the things we do which we regret the most, but the things which we don't do.
the academic and scientific type communities have contributed the most effort to Linux software in the first ten years (is it 10 years old yet? Maybe eight years)
If you count from when emacs started being worked on in the mid 70s, the Linux software canon is about 25 years old.
But the 1.0 kernel was released in mid 1994, so six years counting from then.
There's an article out in Slashdot that pans the Space Station, but then gets into some actually interesting matter, like the increasing ability to actually do data mining. Data mining has long been a staple of hard science fiction, but the benefits of being able to /really/ do it are immense - less pollution, really clean data. There's just that nasty get-the-material to the factory issue. But that's why we need a space elevator, right?
Got Rhinos?
As usual, Microsoft is late to the party and comes with their own agenda. Microsoft products are oriented towards small business and desktop applications. That's what their evolution is driven by and that's what they are designed for. Whether this kind of data should be in a relational database is questionable to begin with. And it certainly doesn't need to be on an expensive, proprietary operating system and in a proprietary format.
Scientists already have excellent open-source tools to build long-term, stable, large-scale data collections. They would be foolish to tie research projects that can span decades to the fortunes of a company in the middle of a battle for the US business computing market, merely to gain some trinkets and give that company a publicity boost.
Scientists are sometimes co-operative, sometimes bitterly competitive. Sometimes they share their data, sometimes they guard it jealously. Sometimes they go to great lengths to sneak a look at each other's data.
For an example, see The Double Helix by James Watson, where Watson and Crick win a Nobel prize, partly by gaining access to Rosalind Franklin's X-ray pictures of DNA.
You're painfully naive if you think the opportunists who've taken over academia in the last generation and a half are anything but self-serving parasites.
Finally something I can work with, at least more informative than just accusing someone of being naive without doing the slightest thing to fix the problem. I had forgotten about this.