Creating The UniServer
bmongar writes "DrDobbs has an article about a project for a mirrored universal astronomy database. Jim Gray basically wants a network of observatories around the world to publish their data and mirror other observatories' data, creating a quadruple-redundant system of data, all available online. He wants to create a new type of astronomer, the astronomer that is a data miner." As the article also says, the guy behind this is the guy behind the TerraServer as well.
Your average astronomer is already a major data miner. From the Hubble Deep Field to the images taken in the back yard with a home-built CCD camera, much of modern observational astronomy is built entirely around being able to mine those images for correspondence, object attributes, and clustering in position, colour, or some other feature. Even a basic catalogue built from a single-wavelength plate will assign position, size, brightness, orientation, semi-major and semi-minor size, positional error, orientation error, brightness error, isophotal brightness, local background level and half-a-dozen other attributes to each object in the catalogue. There may be several thousand objects in a single frame. Making sense of this data set requires time, some ideas about what you are searching for and some luck.
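To make the "catalogue as a mining target" point concrete, here is a minimal sketch of what one catalogue row and one query over it might look like. The field names, units, and the `select_elongated_bright` helper are assumptions made for illustration; they do not follow any particular survey's schema. Brightness is treated as a flux-like number where larger means brighter.

```python
from dataclasses import dataclass


@dataclass
class CatalogueObject:
    """One detected object. Field names are illustrative, not a real survey schema."""
    x: float                # position on the plate (pixels)
    y: float
    brightness: float       # flux-like value, larger means brighter (assumed)
    brightness_err: float
    semi_major: float       # fitted ellipse axes (pixels)
    semi_minor: float
    orientation_deg: float
    background: float       # local background level


def axis_ratio(obj):
    """Minor/major axis ratio: 1.0 is round, values near 0 are very elongated."""
    return obj.semi_minor / obj.semi_major


def select_elongated_bright(objects, min_brightness, max_axis_ratio):
    """Return objects that are both bright enough and elongated enough."""
    return [
        o for o in objects
        if o.brightness >= min_brightness and axis_ratio(o) <= max_axis_ratio
    ]
```

With several thousand rows per frame, even a simple filter like this is already a small data mining query; real searches combine many such attributes at once.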
All that said, you'd be missing a lot as an astronomer if all you looked at was optical images. Going to other images for the same area of sky, be it infra-red, radio, x-ray and so on, will give you a deeper insight into the likely environment of your object and also into any likely confusions due to multiple structures along the line of sight.
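Going to another wavelength for the same patch of sky usually comes down to a positional cross-match between two catalogues. The sketch below is a brute-force version under stated assumptions: each catalogue is a list of `(name, ra_deg, dec_deg)` tuples, and the function names and the matching radius are hypothetical. Real tools use spatial indexing and take positional errors into account.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula), inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(optical, radio, max_sep_arcsec):
    """For each optical source, find the nearest radio source within the radius.

    Returns a list of (optical_name, radio_name, separation_arcsec).
    Brute force, O(N*M): fine for a sketch, too slow for whole surveys.
    """
    max_sep_deg = max_sep_arcsec / 3600.0
    matches = []
    for name_o, ra_o, dec_o in optical:
        best = None
        for name_r, ra_r, dec_r in radio:
            sep = angular_separation_deg(ra_o, dec_o, ra_r, dec_r)
            if sep <= max_sep_deg and (best is None or sep < best[1]):
                best = (name_r, sep)
        if best is not None:
            matches.append((name_o, best[0], best[1] * 3600.0))
    return matches
```

An optical source with several candidates inside the radius is exactly the "confusion along the line of sight" case; this sketch simply keeps the nearest one.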
So having a vast data repository is important, and astronomers have had the tools to query multiple surveys at multiple wavelengths for several years. So there is nothing new here from a data access point of view either. The only really new thing in this proposal is to collate all the data onto four super-mirrors and ensure that these super-mirrors remain in sync, so that if one system dies, it can be restored from the other mirrors without having to go back to tape backups.
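One common way to check that mirrors are in sync, and to work out what a damaged mirror needs from a healthy one, is to compare checksum manifests. The sketch below is not taken from the article; the function names are hypothetical and the "files" are small in-memory byte strings standing in for archive files.

```python
import hashlib


def manifest(files):
    """Map each file name to the SHA-256 hex digest of its contents.

    `files` is a dict of name -> bytes, standing in for files on a mirror.
    """
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def files_to_restore(damaged, healthy):
    """Names that are missing from, or differ on, the damaged mirror.

    Both arguments are manifests as returned by manifest().
    """
    return sorted(
        name for name, digest in healthy.items()
        if damaged.get(name) != digest
    )
```

Only the manifests need to cross the network for the comparison; the bulk data moves only for the files that actually differ.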
Cheers,
Toby Haynes
Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
----
Scientists already work together, hand in hand, and have done so for a long time. Maybe it seems Utopian from a selfish viewpoint, but it's very natural to scientists.
404 - Universe Not Found
Please contact the Universe Master at...
Hammer of Truth
The article definitely gets the ol' geek hairs on the back of your neck standing up. Petabyte backups, tape recovery that takes 5 days..
Lots of stuff that makes geek men howl.
However, it leaves out a *TON*. Like, what technology are they going to use to DO data mining? What database will run this monster? Which OS will it run on?
Further, what license/restrictions are there on the data once it gets published? Is it totally public knowledge, free of copyright?
Fundamental questions of large scope and size, not easily ignored.
However, the question *I* have is, why not do the data storage at online companies KNOWN for hosting data, instead of at observatories, which have little experience at that?
GPL'd web-based tradewars themed space game
*FUD start*This reminds me of some M$ ideas about concentrating everything from all around the world in one bucket. Somehow this resulted in the .NET idea. So now we are up to the Universe...*FUD end*
Well, anyway, the idea is not so bad at all. But I don't see how to realise it without making some radical changes in the system. First we have to deal with the communication channels. For volumes like those of astronomical databases, they are highly unreliable. We are not going to run petabytes over them, but surely there will be gigabytes going back and forth. Note that a raw Mars image from PDS sometimes weighs up to 20 megabytes. Processing such images sometimes leads to data volumes 10-30 times bigger. In some cases it is possible to apply JPEG to compress these images, but sometimes it is highly undesirable to do so. So we get something weighing 100-200 megs. On a 100Mb network, that will take a few minutes to pass from station to station. Now imagine a widespread, worldwide network working this way.
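The transfer-time estimate can be checked with a few lines of arithmetic. This sketch assumes a megabyte is 10^6 bytes and a "100Mb network" means 100 megabits per second; the `efficiency` parameter is a hypothetical knob for how much of the line rate is actually achieved, not a measured figure.

```python
def transfer_seconds(size_megabytes, link_megabits_per_s, efficiency=1.0):
    """Time to move a file over a link, in seconds.

    efficiency is the fraction of the nominal line rate actually achieved
    (1.0 = ideal link with no protocol overhead or contention).
    """
    size_megabits = size_megabytes * 8
    return size_megabits / (link_megabits_per_s * efficiency)


# A 200-megabyte processed image on an ideal 100 Mbit/s link.
ideal = transfer_seconds(200, 100)

# The same image when only a tenth of the line rate is achieved.
congested = transfer_seconds(200, 100, efficiency=0.1)
```

At the ideal line rate the 200-megabyte case comes to about 16 seconds. A figure of "a few minutes" corresponds to an effective throughput of roughly a tenth of the nominal rate, which is where the reliability of the channel starts to matter more than its headline speed.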
On one side we have archives spread all over the world. On the other side there is a growing community of astronomers, also working all over the world. It will be a big challenge to achieve such a thing. And a big financial adventure. Maybe dumb burrocritters will think that data will be cheaper if it keeps rotting on magnetic tape.
Putting multimedia data into the file system is the implementation strategy many commercial databases (including some versions of DB2) take behind the scenes for storing multimedia objects, even if they hide it behind a database API. They can still provide all the database facilities (transactions, indexing, access control, etc.) on top of such an implementation.
With that kind of architecture, you don't need a very powerful machine or high performance database to be able to serve image data at disk bandwidth or network bandwidth.
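A minimal sketch of the "files on disk, database on top" layout described above, using SQLite from the Python standard library. This illustrates the general pattern only; it is not how DB2 or any other product implements it, and the table layout and function names are assumptions.

```python
import os
import sqlite3


def open_store(root):
    """Create the directory and the index database if they do not exist yet."""
    os.makedirs(root, exist_ok=True)
    db = sqlite3.connect(os.path.join(root, "index.db"))
    db.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "name TEXT PRIMARY KEY, path TEXT NOT NULL, size INTEGER NOT NULL)"
    )
    return db


def put_image(db, root, name, data):
    """Write the bytes to a plain file, then record its location in the index."""
    path = os.path.join(root, name + ".bin")
    with open(path, "wb") as f:
        f.write(data)
    with db:  # commits on success, rolls back if the insert raises
        db.execute(
            "INSERT OR REPLACE INTO images (name, path, size) VALUES (?, ?, ?)",
            (name, path, len(data)),
        )


def get_image(db, name):
    """Look the file up in the index, then read it straight from disk."""
    row = db.execute("SELECT path FROM images WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    with open(row[0], "rb") as f:
        return f.read()
```

The database only ever handles small metadata rows, so the bulk bytes are served by ordinary file reads, which is why a modest machine can still deliver images at disk or network bandwidth.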
There's an article out in Slashdot that pans the Space Station, but then gets into some actually interesting matter, like the increasing ability to actually do data mining. Data mining has long been a staple of hard science fiction, but the benefits of being able to /really/ do it are immense - less pollution, really clean data. There's just that nasty get-the-material to the factory issue. But that's why we need a space elevator, right?
Got Rhinos?
As usual, Microsoft is late to the party and comes with their own agenda. Microsoft products are oriented towards small business and desktop applications. That's what their evolution is driven by and that's what they are designed for. Whether this kind of data should be in a relational database is questionable to begin with. And it certainly doesn't need to be on an expensive, proprietary operating system and in a proprietary format.
Scientists already have excellent open-source tools to build long-term, stable, large-scale data collections. They would be foolish to tie research projects that can span decades to the fortunes of a company in the middle of a battle for the US business computing market, merely to gain some trinkets and give that company a publicity boost.