Slashdot Mirror


Freeing and Forgetting Data With Science Commons

blackbearnh writes "Scientific data can be both hard to get and expensive, even if your tax dollars paid for it. And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them. That's the argument that John Wilbanks makes in a recent interview on O'Reilly Radar, describing the problems that have led to the creation of the Science Commons project, which he heads. According to Wilbanks, scientific data should be easy to access, in common formats that make it easy to exchange, and free for use in research. He also wants to see standard licensing models for scientific patents, rather than the individually negotiated ones that now make research based on an existing patent so financially risky." Read on for the rest of blackbearnh's thoughts. "Wilbanks also points out that as the volume of data grows from new projects like the LHC and the new high-resolution cameras that may generate petabytes a day, we'll need to get better at determining what data to keep and what to throw away. 'We have to figure out how to deal with preservation and federation because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is well, if it's meaningful, it'll be in the Google cache or the internet archives. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?'"

20 of 114 comments

  1. Again with the IP by Anonymous Coward · · Score: 1, Insightful

    Einstein said "If I have seen farther than most it is because I have stood on the shoulders of giants."
    Where does that begin to apply in a society of lawyers, profiteers, and billion dollar industries based on exploiting shortsighted IP management?

    1. Re:Again with the IP by wisty · · Score: 3, Insightful

      There is a rumor that Newton meant it as an insult to Hooke. Newton had refined Descartes' wave theory, while Hooke had backed the corpuscular theory. Also, Hooke was a short man.

  2. I don't know! by blue l0g1c · · Score: 2, Insightful

    I was reading through the summary quickly and almost had a panic attack at the deluge of questions at the end. We get the point already!

  3. What's most important to keep. by MoellerPlesset2 · · Score: 2, Insightful

    What's most important to keep is quite simple and obvious really:
    The results. The published papers, etc.

    It's an important and distinctive feature of Science that results are reproducible.

    1. Re:What's most important to keep. by Anonymous Coward · · Score: 2, Insightful

      How can the results be reproducible if you don't keep the original data?

    2. Re:What's most important to keep. by MoellerPlesset2 · · Score: 4, Insightful

      How can the results be reproducible if you don't keep the original data?

      The relevant results are supposed to be included in the paper, as well as the information necessary to reproduce the work. Most data doesn't fall into that category.

      To make an analogy the computer geeks here can relate to: All you need to reproduce the output of a program is the source code and parameters. You don't need the executable, the program's debug log, the compiler's object files, etc, etc.

      The point is you want to reproduce the general result. You don't usually want to reproduce the exact same experiment with the exact same conditions. Supposedly you already know what happens then.

    3. Re:What's most important to keep. by mako1138 · · Score: 5, Insightful

      Let's say the LHC publishes its analysis, and then throws away the data. What happens when five years later it's discovered that a flawed assumption was used in the analysis? Are we going to build another LHC any time soon, to verify the result?

      For a billion-dollar experiment like the LHC, that dataset is the prize. The dataset is the whole reason the LHC was built. Physicists will be combing the data for rare events and odd occurrences, many years down the road.

    4. Re:What's most important to keep. by MoellerPlesset2 · · Score: 2, Insightful

      Let's say the LHC publishes its analysis [..]

      Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule.
      First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility. They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.
      Second: Primary data, actual measurement results, are already kept, as a rule.
      Third: The vast majority of experiments are never ever reproduced to begin with. You're lucky enough to get cited, really. Most papers don't even get cited apart from by those who wrote them.
      Fourth: Very little science is done by re-interpreting existing results. That only applies to the unique cases where the actual experiment can't be reproduced easily.

      What happens when five years later it's discovered that a flawed assumption was used in the analysis? Are we going to build another LHC any time soon, to verify the result?

      Truth is, you'd still have to rebuild the LHC then, because you didn't test your 'corrected' assumption against the actual machine to show that your 'corrected' results are valid. Until the actual experiment is re-done it'll remain an unanswered question.

    5. Re:What's most important to keep. by oneiros27 · · Score: 4, Insightful

      Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule. First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility. They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.

      There are two types of science. What you're referring to is called "Little Science" (not to be derogatory), but it's the type of thing that a small lab can do, with a reasonable amount of funding. And then there's what we call "Big Science" like the LHC, Hubble Space Telescope, Arecibo Observatory, Large Synoptic Survey Telescope, etc.

      Second: Primary data, actual measurement results, are already kept, as a rule.

      I wish. Well, okay, it might be kept, but the question is by who, and have they put it somewhere that people can analyze it?

      I was at the AGU last year, and there was someone from a solar observatory that I wasn't familiar with. As I do work for the Virtual Solar Observatory, I asked them if we could put up a web service to connect their repository to our federated search. They told me there was no repository for the observatory -- the data walks out the door with whoever the observer was.

      Then there's the issue of trying to tell from the published research exactly what the original data was. But then, I've been harping on the need for data citation for years now ... it's an issue that's starting to get noticed.

      Third: The vast majority of experiments are never ever reproduced to begin with. You're lucky enough to get cited, really. Most papers don't even get cited apart from by those who wrote them.

      For the type of data that I deal with, none of it is technically reproducible, because it's observations, not experiments. And that's precisely why it's important to save the data.

      Fourth: Very little science is done by re-interpreting existing results. That only applies to the unique cases where the actual experiment can't be reproduced easily.

      In your field, maybe. But we have folks who try to design systems to predict when events are going to happen and need training data. Others do long-term statistical analysis with years or decades of data at a time. Still others find a strange feature that hadn't previously been identified as important (eg, coronal dimmings) and want to go back through all of the data to try to identify other occurrences.

      --
      Build it, and they will come^Hplain.

    6. Re:What's most important to keep. by mako1138 · · Score: 2, Insightful

      You seem to be using "results" in a wider sense than "published papers". Yes, nobody is going to throw out papers. But the raw data from instruments? It is not clear whether those will be kept.

      You say that the analysis and interpretations can be thrown out, but those portions are precisely what go into published papers. And for small-scale science, it makes little sense to throw away anything at all.

    7. Re:What's most important to keep. by mako1138 · · Score: 2, Insightful

      I agree that there is no simple answer, but I am uneasy with your "experiment is larger than the data" concept. Today we think of the Michelson-Morley and Millikan experiments as canonical and definitive investigations in Physics. But we do not often remember that each was preceded by a string of less-successful experiments, and followed by confirmations. It is the accumulation of a body of data that leads to the gradual acceptance of a physical concept.

      See chart:
      http://en.wikipedia.org/wiki/Michelson-Morley_experiment#The_most_famous_failed_experiment

  4. What's the goal, really? by Rostin · · Score: 4, Insightful

    I'm a working scientist (ok, PhD student), so I read journal articles pretty often. I can understand the rub in principle, but let's say that we come up with some way for all scientific data to be freely shared. So what? In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?

    It reminds me of the XKCD this morning...

    1. Re:What's the goal, really? by TapeCutter · · Score: 1, Insightful

      "I'm a working scientist (ok, PhD student), so I read journal articles pretty often."

      And how would you read them if your institution did not foot the bill for subscriptions?

      "In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists."

      When you amalgamate "almost all cases" you end up with "almost all publications". The rest of your post smacks of elitism, trivializes scientific curiosity and completely ignores the social and scientific impact of radical improvements in communicating knowledge.

      I would have thought working scientists would actually be proud of their work and want to disseminate it to the largest audience possible but in your case I'm obviously mistaken.

      --
      And did you exchange a walk on part in the war for a lead role in a cage? - Pink Floyd.

    2. Re:What's the goal, really? by Beetle B. · · Score: 3, Insightful

      Typical comments from someone in the first world.

      First, just on the side, I know lots of people who got PhD's but did not really stay in research and academia. They still want to read papers, though, as they still maintain an interest.

      But the main benefit of opening up journal papers is for the rest of the world to benefit. Yes, if you have a very narrow perspective, you could just dismiss that as charity. If you're open minded, you'll realize that shutting out most of the world from scientific output means much less science globally, and far fewer benefits to you as a result.

      Imagine if all researchers in Japan published papers only in Japanese, and the journals had a copyright condition that prevented the content from ever being translated to another language, and you'll see what I mean. Whereas current journals require a lot of money for access, these ones also have a price: Just learn Japanese. It's not exactly promoting science.

      Then again, of course, journals do need a base amount of money to operate. It's just that companies like Elsevier charge far more than they need to make a profit.

      --
      Beetle B.

  5. not results- grant dollars by SuperBanana · · Score: 1, Insightful

    The results. The published papers, etc. It's an important and distinctive feature of Science that results are reproducible.

    Having worked around academic groups that do medical research for three years now, I can tell you that is absolutely not what drives research.

    Researchers will love to tell you about how it is the quest for knowledge and other pie-in-the-sky ideals, but when it comes down to it- it's mostly about making a living (or more than a living), and fame/prestige.

    See, journals have what's called an "impact factor." An impact factor is, roughly, the average number of times an article in a particular journal ends up being cited by other papers. In one lab I worked at, it was closely tracked who was published where, and how many times.

    At the end of the year, when it came time to decide who went and who stayed, the scores were lined up and however many people needed to go came from the bottom. The top ones get a little closer to becoming a PI (Principal Investigator, aka someone who has postdocs and grad students working for them.)

    PIs, all the people you read about in the paper- they survived the process, but they're now nothing more than management. They don't do lab work, they don't do research. They solicit ideas from their postdocs, put the final polish on a grant proposal the postdoc slaved over, and get big fat checks from NIH for millions of dollars. The PIs then pass the work down to postdocs, who dole it out to grad students. The grad students do it because a PhD is dangled in front of them while they run on the treadmill of endless, monotonous, repetitive lab work and analysis work. The postdocs do it because faculty positions and PI slots are dangled in front of them.

    The problem with "the system" is that nobody is rewarded for reaching that brass ring. Just like Ford has no incentive to build a very durable car (no service/parts sales after the vehicle hits the end of the warranty, and the market quickly becomes saturated) researchers have no incentive to completely solve issues facing us today; their incentive is to come close enough to say "aha, look, we did find SOMETHING, so your grant money wasn't wasted."

    What incentive does a massive industry have to solve cancer, when it would put them out of business? Tens of thousands of people have dedicated most of their adult lives, usually to studying specific mechanisms and biological functions so narrow that if cancer were cured tomorrow, they would be useless- their training and knowledge is so focused, so narrow- they cannot compete with the existing population of researchers in other biomedical fields. Journals which charge big bucks for subscriptions also would be useless. Billions of dollars of materials, equipment, supplies, chemicals- gone. "Centers", hospitals, colleges, universities which each rake in hundreds of millions of dollars in private, government, and non-profit sourced money would be useless.

    1. Re:not results- grant dollars by smallfries · · Score: 4, Insightful

      What incentive does a massive industry have to solve cancer, when it would put them out of business? Tens of thousands of people have dedicated most of their adult lives, usually to studying specific mechanisms and biological functions so narrow that if cancer were cured tomorrow, they would be useless- their training and knowledge is so focused, so narrow- they cannot compete with the existing population of researchers in other biomedical fields. Journals which charge big bucks for subscriptions also would be useless. Billions of dollars of materials, equipment, supplies, chemicals- gone. "Centers", hospitals, colleges, universities which each rake in hundreds of millions of dollars in private, government, and non-profit sourced money would be useless.

      That's an old argument and although it sounds reasonable it is completely unsound. An industry does not function as a single cohesive entity with wants and desires. It is composed of many different individuals with their own wants and desires.

      I know enough academics to say for certain that if any one of those individuals could discover a cure that would put their entire employer out of business then they would leap at the chance. The fame that would follow would make another job easy enough to get, and the recognition is what they're really in it for anyway.

      --
      Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php

  6. Science is hard - news at 11 by jstott · · Score: 2, Insightful

    And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them.

    I know that this is a real shock to you humanities majors, but science is hard. And yes, for the record, I do have degrees in both [physics and philosophy, or will as of this May — and the physics was by far the harder of the two].

    Here's another shocker. If you think the papers are hard to read, you should see the amount of work that went into processing the data until it's ready to be written up in an academic journal. Ol' Tom Edison wasn't joking when he said it's "1% inspiration and 99% perspiration." If you think seeing the raw data is going to magically make everything clear, well, I'm sorry, the real world just doesn't work that way. Finally, if you think professional scientists are going to trust random data they downloaded off the web of unknown provenance, well, I'm sorry but that isn't going to happen either. I spend enough time fixing my own problems; I certainly don't have time to waste fixing other people's data for them.

    -JS

    --
    Vanity of vanities, all is vanity...

    1. Re:Science is hard - news at 11 by Anonymous Coward · · Score: 1, Insightful

      I fully agree.

      Furthermore, I've read the entire, long interview and get the feeling this is a person looking for a problem. Yes, taxpayer-funded research should be freely available. Yes, we could all benefit from more freely available data. But he builds up a massive and poorly defined manifesto with very little meat around a few good points.

      I'd love to have access to various data sets that I know exist, because others have published their results and described the data collection. But they likely invested multiple years of experimental work (and grant writing) to generate said data - so I see why they may be reluctant to hand it out to others. The solution must be based on giving credit where credit is due, yet this is precisely the problem: if I use someone else's experimental data, even old data, for a new analysis, how do I ensure they receive credit? My sense is that even if my paper clearly stated the generous source of the data, and cited the publications describing the original research, many readers would read past these bits and focus on my results only.

      I write this from the point of view of someone who a) has data and would like to share it, b) spends most of his time painstakingly generating more experimental data, c) is a bit conflicted, because I'd feel bad to see someone else get credit for an analysis of my hard-earned data without some of that rubbing off on me.

      As for the storage problem: there is none, in general. Certain special disciplines, yes, but those call for specific, appropriate solutions. The cost of storing most experimental data that I am aware of is completely dwarfed by, say, the cost of a single experiment's reagents, or the monthly health insurance of a single graduate student.

      Finally, the suggestion given in the interview that scientists must come up with standard "ontologies", etc. is misguided at best. Every area of specialization already maintains, and long has maintained, an ongoing, iterative process whereby new terms are introduced, debated, used, and accepted. And yet, as methods and ideas change, sometimes new things don't fit into the established vocabulary: but it would be wrong to suppose (or require) a consensus procedure for such cases. It's science, it's original, it's creative, and every scientist feels some ownership for their ideas and their specific use of terminology. So let us sort it out, we're not bad at that stuff. Besides, "standard ontologies" usually become out of date the moment they are "ratified" ... better to be flexible and unstructured. That's how Google sees the world's information, and I'm fine with that.

  7. Re:And the scientists goes mooo! by wisty · · Score: 2, Insightful

    Why should science be more complex than necessary? For every String Theory area (where complexity is unavoidable) there are plenty of theories like economics, which just rely on weird jargon to fence out the interlopers.

  8. Scientific data is neither free nor cheap... by w0mprat · · Score: 2, Insightful

    Research data is typically large. In the mid-late 90s I recall a researcher planning to move 10 TB of data internationally. It wasn't exactly unprecedented either. The internet was simply not capable of such a transfer. Eventually they had to ship it on many disks.

    The problem with such raw data, e.g. from a radio telescope, is that you need all of it; you can't really cut any out before it's even processed.

    This is a lot less of an issue today with research networks all hooked into multi-gigabit pipes. But there are still very large datasets researchers are attempting to work with that are simply not cheap to handle.

    I think this is a great idea, and it's nice to be able to share data, but as far as the really sexy big research going on these days goes, I don't see it being much of a point-click-download service!

    --
    After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
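
As a rough illustration of the transfer problem described in the comment above, the sketch below estimates how long 10 TB would take to move at a few link speeds. The link speeds are assumptions chosen for illustration (a T1-class line, 1 Gbit/s, 10 Gbit/s), not figures from the thread, and the calculation assumes a fully utilized link with no protocol overhead.

```python
# Back-of-the-envelope transfer-time estimate.
# Assumptions (not from the thread): decimal units (1 TB = 10**12 bytes),
# a fully utilized link, and no protocol overhead.

def transfer_seconds(size_bytes, link_bits_per_second):
    """Return the ideal time in seconds to move size_bytes over the link."""
    return size_bytes * 8 / link_bits_per_second

DATASET_BYTES = 10 * 10**12  # the 10 TB mentioned in the comment

# Hypothetical link speeds, chosen only for illustration.
LINKS = {
    "T1-class line (1.544 Mbit/s)": 1_544_000,
    "1 Gbit/s": 10**9,
    "10 Gbit/s": 10**10,
}

for name, rate in LINKS.items():
    days = transfer_seconds(DATASET_BYTES, rate) / 86_400
    print(f"{name}: {days:.1f} days")
```

Under these assumptions the slow line would need well over a year, while the gigabit-class links finish in hours, which fits the comment's point that shipping disks was the practical option at the time.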