Obama Administration Places $200 Million Bet On Big Data
wiredmikey writes "As the Federal Government aims to make use of the massive volume of digital data being generated on a daily basis, the Obama Administration today announced a 'Big Data Research and Development Initiative' backed by more than $200 million in commitments to start. Through the new Big Data initiative and associated monetary investments, the Obama Administration promises to greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data. Interestingly, as part of a number of government announcements on big data today, The National Institutes of Health announced that the world's largest set of data on human genetic variation – produced by the international 1000 Genomes Project (at 200 terabytes so far) – is now freely available on the Amazon Web Services (AWS) cloud. Additionally, the Department of Defense (DoD) said it would invest approximately $250 million annually across the Military Departments in a series of programs. 'We also want to challenge industry, research universities, and non-profits to join with the Administration to make the most of the opportunities created by Big Data,' Tom Kalil, Deputy Director for Policy at OSTP, noted in a blog post. 'Clearly, the government can't do this on its own. We need what the President calls an 'all hands on deck' effort.'"
All the taxes paid over a lifetime by the average American are spent by the government in less than a second. -- Jim Fiebig
When it comes to big data, there's going to be little privacy.
Clearly, the government can't do this on its own. We need what the President calls an 'all hands on deck' effort
So Obama wants to pick and choose how this will be handled, but he wants everyone else to do it? Whatever happened to representation?
I'm a hard science/computer science guy whose livelihood is working on various NIH/NSF projects. A common theme in talking to other scientists over the past few years has been that the tools for data analysis have not kept pace with the tools for data acquisition. Companies like National Instruments sell sub-$1000 USB DAQ boards with resolution and bandwidth that would make a scientist from the early 1990s weep for joy. But most data analysis is done the same way it's been done since that same era: with a desktop application working with discrete files, and maybe some ad-hoc scripts. (Only now the scripts are Python instead of C...)
The funny thing is, most researchers haven't yet wrapped their brains around the notion of offloading data onto cloud computing solutions like Amazon AWS. I was at an AWS presentation a couple months ago, and the university's office of research gave an intro talking about their new supercomputer that has 2000 cores, only to get upstaged 10 minutes later when the Amazon guys introduced their 17000 core virtual supercomputer (#42 on the top 500 list, IIRC). There's a lot of untapped potential right now for using that infrastructure to crunch big data.
Amazon is using the idle time of their huge cloud when it's not being used for Christmas shopping ... so the cost of CPU is relatively cheap. Bandwidth and storage are *not*, with most cloud services.
So, say I need to calibrate a year's worth of SDO/AIA data ... that'd mean pushing to them somewhere in the range of 500TB of data, and then pulling it back again. They've changed their pricing so the transfer in is now free ... but if I'm doing the math right, that'd cost somewhere on the order of $30k for the transfers, and if we assume we're pushing it in and deleting it as soon as it's done, we don't need a lot of storage. For other processes, people *do* need the storage, which runs around $100/TB/month, so $50k ... per month.
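For anyone who wants to check the math, here is a rough sketch of it in Python. The only inputs are the figures quoted above (500TB, roughly $30k to pull it back out, $100/TB/month for storage); the per-GB transfer rate is back-derived from those numbers and is an assumption, not a quoted AWS price. The function names are made up for illustration.

```python
# Back-of-envelope check using only the figures from the comment above.
# None of these are current or official AWS prices.

DATA_TB = 500                     # roughly one year of SDO/AIA data
TRANSFER_OUT_TOTAL_USD = 30_000   # estimated bill for pulling 500 TB back out
STORAGE_USD_PER_TB_MONTH = 100    # quoted storage rate


def implied_transfer_rate_per_gb(total_usd, data_tb):
    """Per-GB outbound rate implied by a total transfer bill (1 TB = 1000 GB)."""
    return total_usd / (data_tb * 1000)


def monthly_storage_cost(data_tb, usd_per_tb_month):
    """Monthly cost of keeping the whole data set in cloud storage."""
    return data_tb * usd_per_tb_month


if __name__ == "__main__":
    rate = implied_transfer_rate_per_gb(TRANSFER_OUT_TOTAL_USD, DATA_TB)
    storage = monthly_storage_cost(DATA_TB, STORAGE_USD_PER_TB_MONTH)
    print(f"Implied transfer-out rate: ${rate:.2f}/GB")
    print(f"Storage for {DATA_TB} TB: ${storage:,}/month")
```

So the $30k figure works out to about six cents per GB going out, and the storage number is just 500 times 100.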
It's not as impressive, but it's more cost effective in the long run to build in your own processing near the data. Would it be nice to redo two years of calibration in a day, rather than the ~3hrs to process 1 day's data that it takes now? Yes, but we don't have the funding to pay for it. (Every launch delay costs money (gotta keep the scientists employed, store satellites in machine rooms, pay for offices, etc.) ... and that money, without fail, gets taken from the actual running of the mission and the data analysis.)
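To put a number on "two years in a day": a quick sketch, assuming each day of data can be calibrated independently (so the work parallelizes cleanly) and taking a year as 365 days. Both are assumptions for illustration; only the ~3hrs per day of data comes from the text above.

```python
# How much parallel speedup would "two years of calibration in a day" need?
# Assumes each day of data is calibrated independently of the others.

HOURS_PER_DAY_OF_DATA = 3   # "~3hrs to process 1 day's data"
DAYS_OF_DATA = 2 * 365      # two years, ignoring leap days
TARGET_HOURS = 24           # finish the whole reprocessing in one day


def serial_hours(days_of_data, hours_per_day_of_data):
    """Total wall-clock hours if the days are processed one after another."""
    return days_of_data * hours_per_day_of_data


def speedup_needed(days_of_data, hours_per_day_of_data, target_hours):
    """Factor by which the current throughput must grow to hit the target."""
    return serial_hours(days_of_data, hours_per_day_of_data) / target_hours


if __name__ == "__main__":
    total = serial_hours(DAYS_OF_DATA, HOURS_PER_DAY_OF_DATA)
    factor = speedup_needed(DAYS_OF_DATA, HOURS_PER_DAY_OF_DATA, TARGET_HOURS)
    print(f"Serial reprocessing: {total} hours (about {total / 24:.0f} days)")
    print(f"Speedup needed to finish in {TARGET_HOURS} hours: {factor:.2f}x")
```

In other words, it's about three months of processing at the current rate, or roughly 90 times the current capacity for a single day.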
What I'd personally like to see is more large scale infrastructure coordination, and for any project where the PI team is composed entirely of physicists, yet they're designing and implementing their own data system, to be immediately de-funded.
I'm not going to say that everyone should be using iRODS or OODT or whatever the next new sexy thing is ... but a physicist writing the drivers that run the tape drives? That's a sign something's gone horribly wrong, and yet it's still happening.
Build it, and they will come^Hplain.