Major Scientific Journal Publisher Requires Public Access To Data
An anonymous reader writes "PLOS — the Public Library of Science — is one of the most prolific publishers of research papers in the world. 'Open access' is one of their mantras, and they've been working to push the academic publishing system into a state where research isn't locked behind paywalls and subscription services. To that end, they've announced a new policy for all of their journals: 'authors must make all data publicly available, without restriction, immediately upon publication of the article.' The data must be available within the article itself, in the supplementary information, or within a stable, public repository. This is good news for replicating experiments, building on past results, and science in general."
It would be nice to see this result in pressure on other publishers to require similar access to data backing the papers in their journals.
Will cut a lot of nonsense out of reading stuff into the results.
And not just the data that was cherry-picked to support the hypothesis?
Public results? Anyone can take your work and use it for something profitable, while you scrape for grants to continue.
Actually, the Obama administration has mandated open data for all federally supported research. Good news indeed.
Awesome. Simply awesome
Will be interesting to see how this is balanced with patient privacy, in particular with the increasing numbers of human genomes being sequenced. I know a large proportion of the samples I work with in the lab have restrictions on how the data can be used/shared due to the wording of the informed consent forms. Many would certainly not allow public release of their genome sequence, so publishing in PLOS (or any other journal with this policy) would be impossible. So while I think the underlying principle is good, I think an unintended consequence might be less privacy for patients wanting to participate in research (or fewer patients electing to participate at all).
This is bad news for ecologists and others with long-term data sets. Some of these data sets require decades of time and millions of dollars to produce, and the primary investigators want to use the data they've generated for multiple projects. Current data licensing for PLOS ONE (and, as far as I know, all others who insist on complete data archiving) means that when you publish your data set, it is out there for anyone to use for free for any purpose that they wish; not just for verification of the paper in question. There are plenty of scientists out there who poach free online data sets and mine them for additional findings.
Requiring full accessibility of data makes many people reluctant to publish in such a journal, because it means giving away the data they were planning on using for future publications. A scientist's publication list is linked not only to their job opportunities and their pay grade, but also to the funding that they can get for future grants. And of course those grants are linked to continuing the funding of the long-term project that produced the data in the first place.
What is needed is a new licensing model for published data that says "anyone is free to use these data to replicate the results of the current study; however, they CANNOT be used as a basis for new analyses without written consent of the primary investigator of this paper or until [XX] years after publication." Journals would also need to agree that they would not accept any publications based on data that were used without consent.
It seems to me that this arrangement would satisfy the need to get data out into the public domain while respecting the scientists who produced it in the first place.
Open data is a great idea but it is not always practical. Particle physics experiments generate petabytes of extremely complex, hard-to-understand data. Making this publicly accessible is extremely expensive and ultimately useless since, unless you understand the innards of the detector and how it responds to particles and spend the time to really understand the complex analysis and reconstruction code, there is nothing useful that you can do with the data. In fact, one of the previous experiments I worked on went to great trouble to put their data online in a heavily processed and far easier to understand format in the hope that theorists or interested members of the public would look at the data. IIRC they got about 10 hits on the site per year and 1 access to the data.
So I agree with the principle that the public should be able to access all our data but for experiments with massive, complex datasets there needs to be a serious discussion about whether this is practical given the expense and complexity of the data involved. Do we best serve the public interest if we spend 25% of our research funding on making the data available to a handful of people outside the experiments with the time, skills and interest to access it given that this loss in funds would significantly hamper the rate of progress?
Personally I would regard data as something akin to a museum collection. Museums typically own far more than they can sensibly display to the public and so they select the most interesting items and display these for all to see. Perhaps we should take the same approach with scientific data. Treat it as a collection of which only the most interesting selections are displayed to/accessible by the public even though the entire collection is under public ownership.
Standard policy. Nature have been doing this for some time. They state: authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. So have Cell Press and Science. I stopped searching at this point, but I'm sure other major journals do the same thing.
Their journals aren't in my field (they are all bio journals), so I have not heard of them, but is it true that they are that big? Their web site wasn't much help in terms of information on subscriptions or article numbers, or I simply missed it. Can anyone familiar with them provide any input?
Their data policy might work for the biosciences, but good luck requiring all the many TB of raw data from a particle physics experiment to be put up somewhere. And in some instances, like that one, the raw data will most likely be useless without knowing what it all means, what the detectors were, what the detector responses are, etc. etc. etc. For experiments where it takes man-months or man-years to collect and process the data, making it all available in raw format will largely be a waste of time.
In general, at least for experiments done in the lab that use specialized equipment, raw data will not be very useful if you don't understand what you're collecting or aren't familiar with the equipment. You can end up with situations like that guy who took the Mars rover images and kept zooming in until he saw a life form.
Same will be true for various kinds of employment data and census data.
It goes way beyond just genes and patient data. First, there's the issue of regulation. In most biology/psychology related fields, there's a raft of regulations from funding sources, institutional review boards, the Dept. of Agriculture (which oversees animal facilities) and IACUCs, for example, that make it impossible to comply with this requirement, and will continue to do so for a long time. No study currently being conducted using animal facilities can meet this criterion, because many records related to animal facilities (including the all-important experimental protocol) must remain confidential by statute (with the attestation of compliance from the IRB and IACUC). Likewise in the case of (any) human research, you'll have to get a protocol past the IRB for protecting subject anonymity, and given the likelihood of inadvertent identity disclosure that will be extremely difficult to do.
Second, there's a deep flaw in how the policy is written and how it conceives of data. To wit, the policy defines: "Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances."
Now for starters, there's a loophole big enough to drive several trucks through: in many experimental contexts, material necessary for complete understanding of the 'raw data' is not in digital form, but rather in, say, lab notebooks. Which leads to the broader issue: what most researchers would actually be interested in seeing publicly disclosed is the 'data set', which is not 'raw data', but data that's processed into a useful, compact form that's suitable for statistical analysis.
However, in many experiments all of the material necessary to understand the 'raw data' (which I'll define here as the measured result of an assay in a very general sense) is distributed between lab notebooks, digital data collection, calibration and compliance records in facilities archives and several levels of processing, often using proprietary and very expensive software. Even if all of those things could be published (see above), the 'raw data' would be mostly worthless because of the vast amount of time and effort required in many cases to turn the 'raw data' into the 'data set'.
The third problem, of course, which has been addressed in several places already on this thread, is that there's no money in grants to fund the required repositories.
I think at some level this policy is a noble idea, but it's been implemented in a terrible way, and obviously written by people in fields that already have functioning, funded public databases. Either people from many fields are going to stop publishing in PLOS, or they'll drive the truck through the loopholes and it'll be just as toothless as Science and Nature's sharing requirements.
If they really wanted to effectively push for greater transparency, what they should be pushing at the moment is simultaneous publication of the 'data set', which would let fields that don't have standardized databases in place design standards that would allow their creation.