Slashdot Mirror


Major Scientific Journal Publisher Requires Public Access To Data

An anonymous reader writes "PLOS — the Public Library of Science — is one of the most prolific publishers of research papers in the world. 'Open access' is one of their mantras, and they've been working to push the academic publishing system into a state where research isn't locked behind paywalls and subscription services. To that end, they've announced a new policy for all of their journals: 'authors must make all data publicly available, without restriction, immediately upon publication of the article.' The data must be available within the article itself, in the supplementary information, or within a stable, public repository. This is good news for replicating experiments, building on past results, and science in general."

29 of 136 comments

  1. Good policy by MtnDeusExMachina · · Score: 5, Interesting

    It would be nice to see this result in pressure on other publishers to require similar access to data backing the papers in their journals.

    1. Re:Good policy by Pseudonym · · Score: 2, Interesting

      You know who needs to introduce this rule? The ACM.

      I'm fed up with so-called scientific papers with results based on proprietary software. It doesn't even have to be open source, though that would clearly be good for peer review. If I can't (given appropriate hardware and other appropriate caveats) run your software, I can't replicate your results. If I can't replicate your results, it's not science.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
  2. Fantastic. by jpellino · · Score: 2

    Will cut a lot of nonsense out of reading stuff into the results.

    --
    "Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
  3. Does it really say ALL data? by RogueWarrior65 · · Score: 5, Insightful

    And not just the data that was cherry-picked to support the hypothesis?

  4. Not such good news for getting paid. by Kaz+Kylheku · · Score: 2

    Public results? Anyone can take your work and use it for something profitable, while you scrape for grants to continue.

  5. U.S. funding agencies too by PvtVoid · · Score: 4, Informative

    Actually, the Obama administration has mandated open data for all federally supported research. Good news indeed.

  6. Yes! by AndyKron · · Score: 2

    Awesome. Simply awesome

  7. good and bad by eli+pabst · · Score: 3, Interesting

    Will be interesting to see how this is balanced with patient privacy, in particular with the increasing numbers of human genomes being sequenced. I know a large proportion of the samples I work with in the lab have restrictions on how the data can be used/shared due to the wording of the informed consent forms. Many would certainly not allow public release of their genome sequence, so publishing in PLOS (or any other journal with this policy) would be impossible. So while I think the underlying principle is good, I think an unintended consequence might be less privacy for patients wanting to participate in research (or fewer patients electing to participate at all).

    1. Re:good and bad by canowhoopass.com · · Score: 3, Informative

      The linked blog specifically mentions patient privacy as an allowable exception. They also have exceptions for private third party data, and endangered species data. I suspect they want to keep the GPS locations for white rhinos hidden.

  8. Bad news for ecologists--new license needed by Bueller_007 · · Score: 4, Insightful

    This is bad news for ecologists and others with long-term data sets. Some of these data sets require decades of time and millions of dollars to produce, and the primary investigators want to use the data they've generated for multiple projects. Current data licensing for PLOS ONE (and--as far as I know-- all others who insist on complete data archiving) means that when you publish your data set, it is out there for anyone to use for free for any purpose that they wish; not just for verification of the paper in question. There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

    Requiring full accessibility of data makes many people reluctant to publish in such a journal, because it means giving away the data they were planning on using for future publications. A scientist's publication list is linked not only to their job opportunities and their pay grade, but also to the funding that they can get for future grants. And of course those grants are linked to continuing the funding of the long-term project that produced the data in the first place.

    What is needed is a new licensing model for published data that says "anyone is free to use these data to replicate the results of the current study, however it CANNOT be used as a basis for new analyses without written consent of the primary investigator of this paper or until [XX] years after publication." Journals would also need to agree that they would not accept any publications based on data that was used without consent.

    It seems to me that this arrangement would satisfy the need to get data out into the public domain while respecting the scientists who produced it in the first place.

    1. Re:Bad news for ecologists--new license needed by JanneM · · Score: 4, Insightful

      On the other hand, if I don't have your data I can't check your results. If you want to keep your data secret for a decade, you really should plan to not publish anything relying on it for that time either. Release all the papers when you release the data.

      Also, who gets to decide when a study is a replication and when it is a new result? Few replication attempts are doing exactly the same thing as the original paper, for good reason. If you want to see if it holds up you want to use different analysis or similar anyway. And "use" data? What if another group produces their own data and compares with yours? Is that "using" the data? What if they compare your published results? Is that using it?

      A partial solution, I think, is for a group such as yours to pre-plan the data use already when collecting it. So you decide from start to publish a subset of that data early and publish papers based on that. Then publish another subset for further results and so on.

      But what we really need is for data to be fully citeable. A way to publish the data as a research result by itself - perhaps the data, together with a paper describing it (but not any analysis). Anyone is free to use the data for their own research, but will of course cite you when they do. A good, serious data set can probably rack up more citations than just about any paper out there. That will give the producers the scientific credit they deserve.

      --
      Trust the Computer. The Computer is your friend.
    2. Re:Bad news for ecologists--new license needed by Arker · · Score: 2

      "What is needed is a new licensing model for published data that says "anyone is free to use these data to replicate the results of the current study, however it CANNOT be used as a basis for new analyses without written consent of the primary investigator of this paper or until [XX] years after publication." "

      I could not disagree more.

      What is needed here is to deal with the real problem - the issues that force working scientists into a position where doing good science (publishing your data) can harm your career.

      Slapping a band-aid on a symptom without addressing the fundamental malfunction here is guaranteed to make things worse, not better.

      --
      =-=-=-=-=-=-=-=-=-=-=-=-=-=-
      Friends don't let friends enable ecmascript.
    3. Re:Bad news for ecologists--new license needed by Crispy+Critters · · Score: 2
      "There are plenty of scientists out there who poach free online data sets and mine them for additional findings."

      Right. This leads to a two-class system where the scientists that collect the data (and understand the techniques and limitations) are treated as technicians while those that perform high-level analysis of others' results get the publications. This can lead to unsound, unproductive science in many cases. Those who understand the details are not motivated, and the superficial understanding of those that write the publications leads to errors.

    4. Re:Bad news for ecologists--new license needed by the+gnat · · Score: 3, Interesting

      Some of these data sets require decades of time and millions of dollars to produce, and the primary investigators want to use the data they've generated for multiple projects. . . There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

      I work in a field (structural biology) that had this debate back when I was still in grade school: the issue was whether journals should require deposition of the molecular coordinates in a public database, or later, should these data be released immediately on publication, or could the authors keep them private for a limited time. The responses at the time were very instructive: one of the foremost proponents of data sharing was accused of trying to "destroy crystallography as we know it", to which his response was yes, of course, but how was that a bad thing? Skipping to the punchline: nearly every journal now requires release of coordinates and underlying experimental data immediately upon publication, and in the meantime the field has grown exponentially and there have been at least six Nobel prizes awarded for crystallography (at least one of which went to an early opponent of data sharing). The top-tier journals (Science, Nature) average about a paper per week reporting a new structure. Not only did the predicted dire consequences never happen, the availability of a large collection of protein structures has actually accelerated the field by making it easier to solve related structures (and easier to test new methods), and facilitated the emergence of protein structure prediction and design as a major field in its own right.

      The question I'm worried about: what form do the data need to take? Curating and archiving derived data (coordinates and structure factors) is already handled by the Protein Data Bank, but the raw images are a few orders of magnitude larger, and there is no public database available. Most experimental labs simply do not have the resources to make these data easily available. (The exceptions are a few structural genomics initiatives with dedicated computing support, but those are going away soon.)

    5. Re:Bad news for ecologists--new license needed by Bueller_007 · · Score: 2

      Release all the papers when you release the data.

      Not going to happen. You need to publish during the data collection period in order to continue getting the funding you need for data collection.

      Few replication attempts are doing exactly the same thing as the original paper, for good reason.

      Right, but replication of the experiment is the EXACT reason that we're making the data available. If you want to use the data for something else, that's fine, but if it's data that the original author is still using, then you should contact them about it first.

      A partial solution, I think, is for a group such as yours to pre-plan the data use already when collecting it. So you decide from start to publish a subset of that data early and publish papers based on that. Then publish another subset for further results and so on.

      Again, this is not realistic in the overwhelming majority of cases. One of the benefits of long-term studies is the unexpected findings. Imagine that I've been collecting data on a population of lemmings over the last 20 years. It seems to me that the lemmings have been getting smaller since I first started capturing them, so one day I decide to regress body size on year and I discover that the lemmings have indeed been shrinking, and I can show that it is probably linked to changes in vegetation driven by climate change. I shouldn't have to give away my entire 20-year data set (which I had been collecting for a different purpose) for anybody to use for any purpose in order for me to get this one study out in a timely fashion.
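
      For readers unfamiliar with the phrase, "regress body size on year" just means fitting a straight line to body size as a function of year. A minimal sketch follows, assuming one mean body mass per season; the numbers and the helper name `ols_fit` are invented for illustration and come from no real lemming study.

```python
# Hypothetical illustration: the yearly mean body masses below are invented,
# not taken from any real data set.

def ols_fit(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Twenty field seasons (assumed years).
years = list(range(1994, 2014))

# Made-up masses: 60 g at the start, shrinking 0.4 g per year,
# with a small alternating wobble standing in for measurement noise.
masses = [60.0 - 0.4 * (y - 1994) + (0.5 if y % 2 else -0.5) for y in years]

slope, intercept = ols_fit(years, masses)
print(f"estimated change in body mass: {slope:.3f} g per year")
```

      A negative slope is what "the lemmings have indeed been shrinking" would look like; a real analysis would also need individual measurements, uncertainty estimates and the vegetation covariates.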

      Besides, many researchers are already dealing with data sets that are >50 years old, and your "plan to release the data before you start collecting the data" suggestion is moot for those people with inherited data sets.

      But what we really need is for data to be fully citeable.

      Getting your data cited is not NEARLY the same as publishing. Not even close. To get academic positions, pay increases, grants, etc., you need authorship. No one really cares about how often your paper or your data has been cited. That info isn't even on your CV or your grant applications, so no one will even have a rough idea unless it's a particularly preeminent paper.

    6. Re:Bad news for ecologists--new license needed by the+gnat · · Score: 2

      This leads to a two-class system where the scientists that collect the data (and understand the techniques and limitations) are treated as technicians while those that perform high-level analysis of others' results get the publications.

      Maybe in some fields, but in genomics and molecular biology, the result tends to be exactly the opposite: the experimentalists (and their collaborators) get top-tier publications, while the unaffiliated bioinformaticists mostly publish in specialty journals.

    7. Re:Bad news for ecologists--new license needed by Michael+Woodhams · · Score: 2

      There are plenty of scientists out there who poach free online data sets and mine them for additional findings.

      And this is a good thing, despite your word "poach". Analyses which would not have occurred to the original experimenters get done, and we get more science for our money. For many big data projects (e.g. the human genome project, astronomical sky surveys), giving 'poaching' opportunities is the primary purpose of the project.

      A former boss of mine once, when reviewing a paper, sent a response which was something like this:

      "This paper should absolutely be published. The analysis is completely wrong, but it is a wonderful data set, and somebody will quickly publish a correct analysis once the data is available."

      Now I need to stop wasting time on /. and return to my work in hand, which, as it happens, is 'poaching' data from
      Ingman, M., H. Kaessmann, S. Paabo, and U. Gyllensten. 2000.
      Mitochondrial genome variation and the origin of modern humans. Nature 408:708--713.

      --
      Quattuor res in hoc mundo sanctae sunt: libri, liberi, libertas et liberalitas.
  9. Practicalities by Roger+W+Moore · · Score: 5, Interesting

    Open data is a great idea but it is not always practical. Particle physics experiments generate petabytes of extremely complex, hard to understand data. Making this publicly accessible is extremely expensive and ultimately useless since, unless you understand the innards of the detector and how it responds to particles and spend the time to really understand the complex analysis and reconstruction code there is nothing useful that you can do with the data. In fact one of the previous experiments I worked on went to great trouble to put their data online in a heavily processed and far easier to understand format in the hope that theorists or interested members of the public would look at the data. IIRC they got about 10 hits on the site per year and 1 access to the data.

    So I agree with the principle that the public should be able to access all our data but for experiments with massive, complex datasets there needs to be a serious discussion about whether this is practical given the expense and complexity of the data involved. Do we best serve the public interest if we spend 25% of our research funding on making the data available to a handful of people outside the experiments with the time, skills and interest to access it given that this loss in funds would significantly hamper the rate of progress?

    Personally I would regard data as something akin to a museum collection. Museums typically own far more than they can sensibly display to the public and so they select the most interesting items and display these for all to see. Perhaps we should take the same approach with scientific data. Treat it as a collection of which only the most interesting selections are displayed to/accessible by the public even though the entire collection is under public ownership.

    1. Re:Practicalities by RDW · · Score: 3, Informative

      There could be significant issues with biomedical data, too. For example, the policy gives the example of 'next-generation sequence reads' (raw genomic sequence data), but it's hard to make this truly anonymous (as legally and ethically it may have to be). For example, some researchers have identified named individuals from public sequence data with associated metadata: http://www.ncbi.nlm.nih.gov/pu...

    2. Re:Practicalities by Anonymous Coward · · Score: 3, Insightful

      Uploading and hosting it in the first place to meet such a requirement would be an extremely difficult & costly endeavor.

      Perhaps the compromise is to include a clause that requires the author to permit others to obtain a copy and/or access the data, but only if the receiver of the data pays for the cost to transfer/access the data. This is similar to state open records access laws, where you must pay for things like the cost to make copies of documents. So in the above case, satisfying the "must permit access" clause might be as simple as permitting the researcher to come to the facility and access the data from a terminal and browse or whatever it is they do to explore/analyze the data that results from these experiments, thus no costly copying of data is required.

      If that isn't agreeable or feasible for the author/institution, then perhaps such research would simply be more appropriately published in a different journal that isn't as focused on openness and verifiability.

    3. Re:Practicalities by Crispy+Critters · · Score: 3, Insightful
      "petabytes of extremely complex, hard to understand data"

      The point seems to be missed by a lot of people. RAW DATA IS USELESS. You can make available a thousand traces of voltage vs. time on your detector pins, but that is of no value whatsoever to anyone. The interpretation of these depends on the exact parameters describing the experimental equipment and procedure. How much information would someone require to replicate CERN from scratch?

      Some (maybe most, but not all) published research results can be thought of as a layering of interpretations. Something like detector output is converted to light intensity which is converted to frequency spectra and the integrated amplitudes of the peaks are calculated and are fit to a model and the parameters fit giving you a result that the amplitude of a certain emission scales with temperature squared. Which of these layers is of any value to anyone? Should the sequence of 2-byte values that comes out of the digitizer be made public?

      It is not possible to make a general statement about which layer of interpretation is the right one to be made public. Higher levels, closer to the final results, are more likely to be reusable by other researchers. However, higher levels of interpretation provide the least information for someone attempting to confirm that the total analysis is valid.
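
      The layering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the calibration constant, the temperatures, the digitizer counts and the function names are all invented, and the "peak finding" is deliberately crude.

```python
import math

# Assumed digitizer calibration (hypothetical value).
GAIN_VOLTS_PER_COUNT = 0.001

def counts_to_volts(counts):
    """Layer 1: raw 2-byte digitizer values -> calibrated voltages."""
    return [c * GAIN_VOLTS_PER_COUNT for c in counts]

def peak_amplitude(volts):
    """Layer 2: one voltage trace -> one number (peak height above baseline)."""
    return max(volts) - min(volts)

def fit_exponent(temps, amps):
    """Layer 3: least-squares slope of log(amplitude) against log(temperature)."""
    xs = [math.log(t) for t in temps]
    ys = [math.log(a) for a in amps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Invented raw traces: a flat baseline of 100 counts with a single peak
# whose height was constructed to grow as temperature squared.
temps = [100, 200, 300, 400]
raw_traces = [[100, 100, 100 + round(0.02 * t * t), 100] for t in temps]

amps = [peak_amplitude(counts_to_volts(trace)) for trace in raw_traces]
exponent = fit_exponent(temps, amps)
print(f"amplitude scales as T^{exponent:.2f}")
```

      Publishing only `exponent` tells a reader nothing about the calibration or the peak finding; publishing only `raw_traces` is useless without `GAIN_VOLTS_PER_COUNT` and the code for the later layers.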

    4. Re: Practicalities by Obfuscant · · Score: 4, Informative

      Whether or not you make the data publically available, you have to store and make it privately available,

      I have boxes and boxes of mag tapes with data on it from past experiments. That's privately available. It will never be publicly available.

      putting in public access is a matter of creating a read-only user and opening a firewall port.

      It is clear that you have never done such a thing yourself. There is a bit more to it than what you claim. I've been doing it for more than twenty years, keeping a public availability to much of the data we have (but not all -- tapes are not easily made public that way), and there is a lot more to dealing with a public presence than just "a read-only user and a firewall port".

      The sad thing is that most scientists don't actually store their data properly, it sits on removable hard drives, cd or an older variant of portable media

      And now you point out the biggest issue with public access to data: the cost of making it online 24/7 so the "public" can maybe sometime come look at the data. Removable hard drives are perfectly good for storing old data, and they cost a lot less than an online raid system. For that data, that is storing it "properly".

      If you want properly managed, publicly open data for every experiment, be prepared to pay more for the research. And THEN be prepared to pay more for the archivist who has to keep those systems online for you after the grants run out. And by "you", I'm referring to you as the public.

      Researchers get X amount of dollars to do an experiment. Once that grant runs out there is no more money for maintenance of the online archive, if there was money for that in the first place. For twenty two years our online access has been done using stolen time and equipment not yet retired. When the next grant runs out, the very good question will be who is going to be maintaining the existing systems that were paid for under those grants. Do they just stop?

    5. Re:Practicalities by Pseudonym · · Score: 2

      There's precedent for this. In many biology experiments, the "raw data" is an actual organism, like a colony of bacteria or something. There are scientific protocols for accessing that "data", but you have to be able to prove that you are an institution that can handle it. Even if the public "owns" it, technically speaking, no reputable scientist is going to send an E. coli sample to just anyone.

      So I think we all understand that, in practice, we mean different things by "public access". Sometimes that means that anyone should be able to download the data, and sometimes that means that anyone should be allowed to go there and examine it for themselves.

      --
      sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
    6. Re:Practicalities by Jane+Q.+Public · · Score: 2

      "But if I have to spend $100k on lobbying before I get public funding, I don't want to have to share the results with freeloaders who didn't pony up the lobbying cash and didn't put the manpower into the research."

      You are describing exactly why the current system is broken.

      First off, if the research is worthwhile you shouldn't have to spend $100,000 to lobby for it. And I would argue that is an unethical practice: what about the little guy who is doing promising research but doesn't have the funds to lobby?

      Second: quite frankly I don't give a flying fuck how much you spent to get the grant. Public money is public money. If I'm paying for it, it belongs to me. Period. And I don't care even a little if you don't like that.

      "The rest of society benefits from the public funds after they have bought my product."

      Then go pay to get a patent on your own, and leave public funds out of it. Why should the public pay so that you can profit? Independent inventors do it all the time without public funding. What makes you so special?

      "Take Google, for instance."

      Is Google doing publicly-funded research? That's news to me. If so, I object very strongly.

      I suspect you are being sarcastic here. If you're not, I simply disagree with you. Very much.

    7. Re: Practicalities by guruevi · · Score: 2

      I actually do this for a living; having data available for projects does require it to be on large data systems which are properly backed up etc. Heck, any halfway decent staged system (Sun used to make really good ones) will allow you to read tapes as if it were a regular network share. The problem will be (which is inevitable) that your PI is going to ask for the data 3 years after they left the institute and your tapes will be unreadable (either because they degrade or because you can't find a reader and associated busses and software).

      The mag tapes in boxes problem we fixed years ago by simply putting everything on spinning rust with ZFS. As capacity increases (we're 3 generations in now - 750GB, 2TB and now 4TB drives), the old stuff simply takes up a diminishing percentage of any expansion we put in. Individual data sets from ~10 years ago were 100MB, now they're close to 2GB, those 100MB sets aren't even a noticeable portion today whereas back in the day they filled up the entire *gasp* 3TB array.
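
      The shrinking share is simple arithmetic. A rough sketch, assuming (hypothetically) that every expansion adds four drives of the then-current size and that the old data fills the original 3TB array; only the drive sizes and the 3TB figure come from the comment above.

```python
# Assumption: each expansion adds this many drives (hypothetical number).
DRIVES_PER_EXPANSION = 4

# Drive generations named in the comment, in TB.
drive_sizes_tb = [0.75, 2.0, 4.0]

# The original array, filled by the old data sets.
old_data_tb = 3.0

total_tb = 0.0
shares = []
for size in drive_sizes_tb:
    total_tb += DRIVES_PER_EXPANSION * size
    share = old_data_tb / total_tb
    shares.append(share)
    print(f"{size} TB drives: total {total_tb:.0f} TB, old data is {share:.0%}")
```

      Under these assumptions the old data falls from the whole array to roughly a ninth of total capacity after two expansions, without anyone deleting anything.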

      I do understand the grant issues; most of those grants will actually mandate a 20-year-or-so archival period but never have the money for it. I've figured out that future grants will simply pay for today's "large amount" of data storage in a small overhead because 10 years from now, 2TB of storage for a study will be like today's 100MB for a study.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
  10. This is not new at all by umafuckit · · Score: 2

    Standard policy. Nature have been doing this for some time. They state: authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. So have Cell Press and Science. I stopped searching at this point, but I'm sure other major journals do the same thing.

  11. Prolific publishing by hubie · · Score: 2
    one of the most prolific publishers of research papers in the world.

    Their journals aren't in my field (they are all bio journals), so I have not heard of them, but is it true that they are that big? Their web site wasn't much help in terms of information on subscriptions or article numbers, or I simply missed it. Can anyone familiar with them provide any input?

    Their data policy might work for the biosciences, but good luck requiring all the many TB of raw data from a particle physics experiment to be put up somewhere. And in some instances, like that one, the raw data will most likely be useless without knowing what it all means, what the detectors were, what the detector responses are, etc. etc. etc. For experiments where it takes man-months or man-years to collect and process the data, making it all available in raw format will largely be a waste of time.

    In general, at least for experiments done in the lab that use specialized equipment, raw data will not be very useful if you don't understand what you're collecting or aren't familiar with the equipment. You can end up with situations like that guy who took the Mars rover images and kept zooming in until he saw a life form.

  12. Re:HIPAA by Crispy+Critters · · Score: 2
    Unfortunately, it has been shown already that the few details relevant to medical studies can often be used to uniquely identify individuals even after name and address are removed. "Yaniv Erlich shows how research participants can be identified from 'anonymous' DNA" http://www.nature.com/news/pri...

    Same will be true for various kinds of employment data and census data.

  13. RIP PLOS by wanax · · Score: 2

    It goes way beyond just genes and patient data. First, there's the issue of regulation. In most biology/psychology related fields, there's a raft of regulations from funding sources, internal review boards, the Dept. of Agriculture (which oversees animal facilities) and IACUCs for example that make it impossible to comply with this requirement, and will continue to do so for a long time. No study currently being conducted using animal facilities can meet this criterion, because many records related to animal facilities (including the all-important experimental protocol) must remain confidential by statute (with the attestation of compliance from the IRB and IACUC). Likewise in the case of (any) human research, you'll have to get a protocol past the IRB for protecting subject anonymity, and given the likelihood of inadvertent identity disclosure that will be extremely difficult to do.

    Second, there's a deep flaw in how the policy is written and how it conceives of data. To wit, the policy defines: "Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances."

    Now for starters, there's a loophole big enough to drive several trucks through: In many experimental contexts material necessary for complete understanding of the 'raw data' are not in digital form, but rather in say, lab notebooks. Which leads to the broader issue: what most researchers would be actually interested in seeing publicly disclosed is the 'data set' which is not 'raw data', but data that's processed into a useful, compact form that's suitable for statistical analysis.

    However, in many experiments all of the material necessary to understand the 'raw data' (which I'll define here as the measured result of an assay in a very general sense) is distributed between lab notebooks, digital data collection, calibration and compliance records in facilities archives and several levels of processing often using proprietary and very expensive software. Even if all of those things could be published (see above), the 'raw data' would be mostly worthless because of the vast amount of time and effort required in many cases to turn the 'raw data' into the 'data set'.

    The third problem of course, which has been addressed in several places already on this thread is that there's no money in grants to fund the required repositories.

    I think at some level this policy is a noble idea, but it's been implemented in a terrible way, and obviously written by people in fields that already have functioning, funded public databases. Either people are going to stop publishing in PLOS from many fields, or they'll drive the truck through the loopholes and it'll be just as toothless as Science and Nature's sharing requirements.

    If they really wanted to effectively push for greater transparency, what they should be pushing at the moment is simultaneous publication of the 'data set', which would let fields that don't have standardized databases in place design standards that would allow their creation.