Laying the Groundwork For Data-Driven Science
aarondubrow writes The ability to collect and analyze massive amounts of data is transforming science, industry and everyday life. But what we've seen so far is likely just the tip of the iceberg. As part of an effort to improve the nation's capacity in data science, NSF today announced $31 million in new funding to support 17 innovative projects under the Data Infrastructure Building Blocks (DIBBs) program, including data infrastructure for education, ecology and geophysics. "Each project tests a critical component in a future data ecosystem in conjunction with a research community of users," said Irene Qualters, division director for Advanced Cyberinfrastructure at NSF. "This assures that solutions will be applied and use-inspired."
This sounds suspiciously like something written by someone with an online MBA: "Each project tests a critical component in a future data ecosystem in conjunction with a research community of users," said Irene Qualters, division director for Advanced Cyberinfrastructure at NSF. "This assures that solutions will be applied and use-inspired."
If we want the public to continue to support federal funding of the sciences we have to do better than this. I understand the point, but this is needlessly laden with buzz-phrases and it is clumsy.
Alan Alda and others have done work with scientists when it comes to communicating to non-scientists, and I'm grateful for it.
from TFA:
"In fiscal year (FY) 2014, its budget is $7.2 billion. NSF funds reach all 50 states through grants to nearly 2,000 colleges, universities and other institutions. Each year, NSF receives about 50,000 competitive requests for funding, and makes about 11,500 new funding awards. NSF also awards about $593 million in professional and service contracts yearly."
and: "awards support research in 22 states"
This particular investment is a tiny fraction of the budget. A low priority.
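To put a number on "tiny fraction", here is a quick back-of-the-envelope sketch using only the figures quoted above. It assumes the $31 million can be compared directly against the single-year FY2014 budget, which is rough, since the awards may well span several years.

```python
# Figures quoted in the summary and in TFA.
dibbs_funding = 31_000_000          # new DIBBs funding, USD
nsf_budget_fy2014 = 7_200_000_000   # NSF FY2014 budget, USD
projects = 17                       # number of funded projects

# Assumption: treat the whole $31M as if it came out of one year's budget.
share = dibbs_funding / nsf_budget_fy2014
average_award = dibbs_funding / projects

print(f"Share of FY2014 budget: {share:.2%}")          # 0.43%
print(f"Average per project: ${average_award:,.0f}")   # $1,823,529
```

So under that assumption it is well under half a percent of the budget, averaging a bit under $2 million per project.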
Note that each congressperson attempts to get government funding for his/her state as part of the obligatory re-election process. Often this funding is for nonsense activity that may provide jobs or incentive for corporate supporters.
Not saying that any of this is pork, but I'd like to know.
...omphaloskepsis often...
If the NSF grant process is like the one for NASA, there's still a little bit of flexibility for the program manager after they've gotten the scores.
I know because I was on a panel that specifically gave two proposals 'poor' reviews (the lowest possible), and the program manager asked us to consider changing it. In this case, he's a rather nice guy, and it may just be that he didn't want to have to write the 'your proposal sucks' letter to them ... but those of us on the panel knew that there is _no_ way for them to fund a 'poor'. They have leeway with any other score, and could give something with a marginal rating some seed money (fund 'em for a year, so they might be able to put in a more competitive bid next round).
We told the program manager that no, we wanted to make sure that there was no possible way that those two proposals could get funded.
Build it, and they will come^Hplain.
... is that data isn't evidence. And the simple fact that most people don't understand that simply underscores the danger of it.
Now, science must be empirical. It must be based on observation, experimentation, and the results should drive theory.
However, something that has been worrying for years is a lazy tendency for people... scientists included... to grab a data set, point out some correlating variables, and then conclude a discovery... or propose a theory that is supposed to be taken seriously.
That is wrong. And we all know that is wrong. I'm fine with it if we don't take the study seriously or if they don't just cite correlative statistics. But they do that with depressing consistency. Correlative statistics are not evidence. They are data. But basically anything is data. Having data isn't an accomplishment. It is having some readings or information that could mean anything, including nothing at all.
Serious efforts have to be taken to ensure the data is free of distorting influences. And then you have to set up devil's advocate tests/experiments to make sure that there is some causation going on in the data. As often as not, this isn't happening, especially with the "data driven" science, which insofar as I have seen is code for people who sit at their computers calling up spreadsheets and then concluding things from them. That isn't good enough.
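A minimal sketch of the kind of devil's advocate check meant here, on entirely made-up data. Everything in it is an assumption for illustration: a hidden variable `z` drives both `x` and `y`, so they correlate strongly even though neither causes the other, and the correlation disappears once `z` is accounted for. Real data rarely hands you the confounder this neatly.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


random.seed(0)  # fixed seed so the made-up data is reproducible

# Hypothetical setup: z is a hidden common cause of both x and y.
z = [random.gauss(0, 1) for _ in range(5000)]
x = [v + random.gauss(0, 0.5) for v in z]
y = [v + random.gauss(0, 0.5) for v in z]

# Naive look at the spreadsheet: x and y are strongly correlated.
naive = pearson(x, y)

# Devil's advocate: remove the part explained by z and look again.
# (Subtracting z directly works only because we built the data that way.)
x_resid = [p - v for p, v in zip(x, z)]
y_resid = [q - v for q, v in zip(y, z)]
controlled = pearson(x_resid, y_resid)

print(f"correlation of x and y:        {naive:.2f}")
print(f"after accounting for hidden z: {controlled:.2f}")
```

The first number comes out high and the second near zero, which is the whole point: the correlation was real, but it said nothing about x causing y.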
Here someone is going to say I know nothing and that people who call up spreadsheets are doing entirely valid science. That ignores all sorts of points, such as their not knowing exactly where the data came from or how it was collected. Oh sure, they might have some notation that says how that was done... but who really knows. Scientists need to be willing to get their hands dirty and get the data themselves. The armchair stuff is good to a point when the data is known to be good or when someone else went to the effort to sort out all the problems with it. But that isn't terribly common. Most of the sets have issues even after being declared good.
Anyway... I hope this all works out for the best and that my fears in this matter are unfounded. Truly. I just worry that this is going to be more of a giant waste of time.
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... log in.
Perhaps they should start with an annual walrus census (see story following this one).
This sounds suspiciously like something written by someone with an online MBA: "Each project tests a critical component in a future data ecosystem in conjunction with a research community of users," said Irene Qualters, division director for Advanced Cyberinfrastructure at NSF. "This assures that solutions will be applied and use-inspired."
If we want the public to continue to support federal funding of the sciences we have to do better than this. I understand the point, but this is needlessly laden with buzz-phrases and it is clumsy.
I understand your point about the technobabble. However, Ms. Qualters' résumé appears to be somewhat less fluffy than the quote would suggest.
If it weren't for deadlines, nothing would be late.
The NSF groups its funding towards various targets. If you know what they are targeting, then it's easier to get funding.
In this case, they are targeting computer-related things, and grouping them under the name "Data Infrastructure Building Blocks." The actual funding goes towards things as diverse as a MOOC and some kind of scientific library for supercomputers.
"First they came for the slanderers and i said nothing."
All science is data driven. Without data there is no hypothesis, and without hypothesis there is nothing to test (falsify). This is just another hype, like nanotechnology or now nanobiotechnology etc. Nearly all molecules are nanoscale: their size is measured in nanometers, and in the same way all science is data driven.
There is nothing wrong with good old "science driven science" where people think, do experiments, and think again.
Science may be data-driven, but historically scientists have not been trained to be good data custodians. They know reasonably well how to use data, but they don't know how to store it, label it, transfer it, etc. Go pick an article from 5 years ago which is data-heavy and try to get the original dataset from the authors: 95 times out of a hundred you'll spend a month emailing people and you'll end up with nothing. Four more out of the 100 you'll get an Excel spreadsheet without labels on the columns. Scientists desperately need to become better at managing data.
Personally, I think that this program is targeting a small subset of the people who need help, and as such it won't be very effective. These look like infrastructure projects, but infrastructure only drives trends in extremely rare cases. Here's a quote from one funded proposal:
This project develops web-based building blocks and cyberinfrastructure to enable easy sharing and streaming of transient data and preliminary results from computing resources to a variety of platforms, from mobile devices to workstations, making it possible to quickly and conveniently view and assess results and provide an essential missing component in High Performance Computing and cloud computing infrastructure.
Will that project help teach scientists they shouldn't email files to themselves as a method of long-term archival? Yes, that really is extremely common. We should be focusing on building data tools which are extremely simple, extremely broad in scope, and encourage or force adoption of those tools.
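As a rough idea of how simple such a tool could be, here is a sketch that refuses to save a table without column labels and writes a small description file next to it. The function name `save_with_metadata`, the sidecar layout and the sample readings are all made up for illustration; this is not taken from any of the funded projects.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path


def save_with_metadata(rows, columns, csv_path, description):
    """Write rows to CSV with a header row, plus a JSON sidecar describing them.

    `columns` maps each column name to its unit, e.g. {"depth_m": "metres"}.
    """
    csv_path = Path(csv_path)
    names = list(columns)
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(names)  # the labels travel with the data
        for row in rows:
            if len(row) != len(names):
                raise ValueError("row length does not match the column labels")
            writer.writerow(row)

    sidecar = {
        "description": description,
        "columns": columns,
        "row_count": len(rows),
        # Checksum lets a later reader confirm the file was not altered.
        "sha256": hashlib.sha256(csv_path.read_bytes()).hexdigest(),
    }
    sidecar_path = csv_path.with_suffix(".json")
    sidecar_path.write_text(json.dumps(sidecar, indent=2))
    return sidecar_path


# Usage with made-up readings, written to a temporary directory.
workdir = Path(tempfile.mkdtemp())
data_path = workdir / "readings.csv"
meta_path = save_with_metadata(
    rows=[(1, 12.5), (2, 13.1)],
    columns={"sample_id": "none", "temperature_c": "degrees Celsius"},
    csv_path=data_path,
    description="Made-up example readings",
)
print(meta_path.read_text())
```

Nothing clever, and that is the point: whoever asks for the dataset five years later at least gets column names, units and a way to check the file is intact.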
I can save the NSF a bunch of money with this initiative. There's a data center in Utah that's not being used (for anything legal) with a huge amount of data storage capacity. The NSF should have it.