Slashdot Mirror


Laying the Groundwork For Data-Driven Science

aarondubrow writes The ability to collect and analyze massive amounts of data is transforming science, industry and everyday life. But what we've seen so far is likely just the tip of the iceberg. As part of an effort to improve the nation's capacity in data science, NSF today announced $31 million in new funding to support 17 innovative projects under the Data Infrastructure Building Blocks (DIBBs) program, including data infrastructure for education, ecology and geophysics. "Each project tests a critical component in a future data ecosystem in conjunction with a research community of users," said Irene Qualters, division director for Advanced Cyberinfrastructure at NSF. "This assures that solutions will be applied and use-inspired."

7 of 55 comments

  1. Difficult, not impossible. by oneiros27 · · Score: 3, Insightful

    If the NSF grant process is like the one for NASA, there's still a little bit of flexibility for the program manager after they've gotten the scores.

    I know because I was on a panel that specifically gave two proposals 'poor' reviews (the lowest possible), and the program manager asked us to consider changing them. In this case, he's a rather nice guy, and it may just be that he didn't want to have to write the 'your proposal sucks' letter to them ... but those of us on the panel knew that there is _no_ way for them to fund a 'poor'. They have leeway with any other score, and could give something with a marginal rating some seed money (fund 'em for a year, so they might be able to put in a more competitive bid next round).

    We told the program manager that no, we wanted to make sure that there was no possible way that those two proposals could get funded.

    --
    Build it, and they will come^Hplain.
  2. Re:The problem with data driven science.. by binarstu · · Score: 4, Insightful

    The problem with data driven science... is that data isn't evidence.

    Correlative statistics are not evidence.

    I think you are confusing "evidence" with "proof". Data, and more specifically, the patterns in data, most certainly are evidence. If that were not true, then there would be no reason to even try doing science.

    Having data isn't an accomplishment.

    Any scientist who has spent years obtaining a hard-won dataset would strongly disagree with you. Consider, for example, the ground-breaking data generated a few years ago by the Human Genome Project, or the current explosion of data about exoplanets. These data most certainly do represent substantial intellectual and technical accomplishments. Now, if what you mean is that simply downloading someone else's data from the Web is not an accomplishment, then I agree with you.

    Scientists need to be willing to get their hands dirty and get the data themselves.

    I think you will find that, in the hard sciences at least, that's usually how it's done. The researchers who write the papers are usually the same people who were involved in collecting the data. However, for very large-scale studies (e.g., global biodiversity research), there is no way that a single scientist, or even a single research team, could gather all of the necessary data. In these cases, the only way to make the research tractable is to integrate multiple datasets.

    Your points about the importance of understanding where the data one uses in a study came from, how they were collected, and any potential biases are all well taken. However, ignoring any of these factors is simply sloppy science, regardless of whether the researcher collected the data himself or herself or someone else collected them.

  3. Re:The problem with data driven science.. by slimme · · Score: 2

    I also see a trend where people look for correlations, find them, and then draw conclusions without any proof of causation. It strikes me most in economics. Public policy is set based on those correlations.

    It is very counterintuitive, but correlation research means nothing, especially in economics. Correlation research would be an amusing way to spend your time and get to know some variables, but correlation research is being used to influence people. Repeat after me: correlation means nothing. If you find a correlation, luck has hit you. Or luck has been manipulated to serve some point. Correlation means nothing whatsoever. Articles describing correlation are a waste of your time. You should not act based on correlation research.

    Now if you take big datasets with lots of variables and you test correlations between those variables, you will find strong correlations. Correlation here, correlation there, correlation everywhere. If you do millions of tests and tweak your parameters, correlation is all yours.

    But luckily now you know: correlation has no practical use in your life.
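
    The multiple-testing effect described above (test enough pairs of variables and some will correlate by chance) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the 40 variables, the 20 samples per variable and the 0.5 threshold are all made up for illustration and come from nowhere in this thread. Every variable is independent random noise, so any "strong" correlation it reports is spurious by construction.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_vars=40, n_samples=20, threshold=0.5, seed=0):
    """Count pairs of independent noise variables with |r| above threshold.

    All parameter values are illustrative assumptions, not recommendations.
    Returns (number of pairs over the threshold, number of pairs tested).
    """
    rng = random.Random(seed)
    # Every variable is pure Gaussian noise; none depends on any other.
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = [(i, j) for i, j in pairs
            if abs(pearson(data[i], data[j])) > threshold]
    return len(hits), len(pairs)


if __name__ == "__main__":
    hits, total = count_spurious()
    print(f"{hits} of {total} pure-noise pairs have |r| > 0.5")
```

    With 40 variables there are already 780 pairs to test, so a handful of them clearing the threshold is expected even though nothing real connects them. Raising n_samples or correcting for the number of tests shrinks that count, which is the usual argument for treating an unplanned correlation as a hypothesis rather than a finding.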

  4. science driven science? by louic · · Score: 4, Insightful

    All science is data driven. Without data there is no hypothesis, and without hypothesis there is nothing to test (falsify). This is just another hype, like nanotechnology or now nanobiotechnology etc. Nearly all molecules are nanoscale: their size is measured in nanometers, and in the same way all science is data driven.

    There is nothing wrong with good old "science driven science" where people think, do experiments, and think again.

  5. Re:The problem with data driven science.. by StripedCow · · Score: 2

    Data isn't evidence, but it can be used to find useful hypotheses, starting points for further research.

    Remember:

    The most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka” but “That’s funny...”

    (Isaac Asimov)

    --
    If Pandora's box is destined to be opened, *I* want to be the one to open it.
  6. Re: The problem with data driven science.. by Vesvvi · · Score: 2

    Volume 338, 2012: "Detecting Causality in Complex Ecosystems" http://www.uvm.edu/~cdanfort/c...

  7. science driven science? by Vesvvi · · Score: 2

    This particular push may not be effective, but it's not hype.

    Science may be data-driven, but historically scientists have not been trained to be good data custodians. They know reasonably well how to use data, but they don't know how to store it, label it, transfer it, etc. Go pick a data-heavy article from 5 years ago and try to get the original dataset from the authors: 95 times out of 100 you'll spend a month emailing people and you'll end up with nothing. Four more times out of 100 you'll get an Excel spreadsheet without labels on the columns. Scientists desperately need to become better at managing data.

    Personally, I think that this program is targeting a small subset of the people who need help, and as such it won't be very effective. These look like infrastructure projects, but infrastructure only drives trends in extremely rare cases. Here's a quote from one funded proposal:

    This project develops web-based building blocks and cyberinfrastructure to enable easy sharing and streaming of transient data and preliminary results from computing resources to a variety of platforms, from mobile devices to workstations, making it possible to quickly and conveniently view and assess results and provide an essential missing component in High Performance Computing and cloud computing infrastructure.

    Will that project help teach scientists they shouldn't email files to themselves as a method of long-term archival? Yes, that really is extremely common. We should be focusing on building data tools which are extremely simple and extremely broad in scope, and on encouraging or forcing adoption of those tools.
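
    As a rough illustration of how small an "extremely simple" data tool could be, the sketch below writes a table with a header row naming every column, plus a JSON sidecar recording each column's unit and meaning. It addresses only the unlabeled-spreadsheet problem mentioned above. The function names and the column names in the usage lines are hypothetical; this is not taken from any funded project or existing tool.

```python
import csv
import io
import json


def write_labeled_csv(rows, columns, out):
    """Write rows as CSV, starting with a header row naming every column."""
    writer = csv.DictWriter(out, fieldnames=[c["name"] for c in columns])
    writer.writeheader()
    writer.writerows(rows)


def describe_columns(columns):
    """Return a JSON sidecar recording each column's name, unit and meaning."""
    return json.dumps({"columns": columns}, indent=2)


if __name__ == "__main__":
    # Hypothetical columns and values, for illustration only.
    columns = [
        {"name": "site_id", "unit": "", "description": "sampling site label"},
        {"name": "temp_c", "unit": "degC", "description": "air temperature"},
    ]
    rows = [{"site_id": "A1", "temp_c": 12.5}]

    buffer = io.StringIO()
    write_labeled_csv(rows, columns, buffer)
    print(buffer.getvalue())
    print(describe_columns(columns))
```

    Keeping the description in a separate plain-text file means it survives even if the table is later opened and re-saved in a spreadsheet program.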