Major Scientific Journal Publisher Requires Public Access To Data
An anonymous reader writes "PLOS — the Public Library of Science — is one of the most prolific publishers of research papers in the world. 'Open access' is one of their mantras, and they've been working to push the academic publishing system into a state where research isn't locked behind paywalls and subscription services. To that end, they've announced a new policy for all of their journals: 'authors must make all data publicly available, without restriction, immediately upon publication of the article.' The data must be available within the article itself, in the supplementary information, or within a stable, public repository. This is good news for replicating experiments, building on past results, and science in general."
It would be nice to see this result in pressure on other publishers to require similar access to data backing the papers in their journals.
Will cut a lot of nonsense out of reading stuff into the results.
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
Tables with improvement percentage readings will be less excellent.
And not just the data that was cherry-picked to support the hypothesis?
Public results? Anyone can take your work and use it for something profitable, while you scrape for grants to continue.
Actually, the Obama administration has mandated open data for all federally supported research. Good news indeed.
Awesome. Simply awesome
They have resisted showing data for years.
I hope this helps, though the warmistas have their own favorite journals...not PLOS.
It would be nice also if journals got on the bandwagon and accepted open formats (OpenDocument) instead of proprietary file formats like .doc and not fully open formats like .docx.
Will be interesting to see how this is balanced with patient privacy, in particular with the increasing numbers of human genomes being sequenced. I know a large proportion of the samples I work with in the lab have restrictions on how the data can be used/shared due to the wording of the informed consent forms. Many would certainly not allow public release of their genome sequence, so publishing in PLOS (or any other journal with this policy) would be impossible. So while I think the underlying principle is good, I think an unintended consequence might be less privacy for patients wanting to participate in research (or fewer patients electing to participate at all).
This may have severe repercussions for how patient samples are collected. Especially in this day and age with so many privacy concerns left and right.
This is bad news for ecologists and others with long-term data sets. Some of these data sets require decades of time and millions of dollars to produce, and the primary investigators want to use the data they've generated for multiple projects. Current data licensing for PLOS ONE (and--as far as I know-- all others who insist on complete data archiving) means that when you publish your data set, it is out there for anyone to use for free for any purpose that they wish; not just for verification of the paper in question. There are plenty of scientists out there who poach free online data sets and mine them for additional findings.
Requiring full accessibility of data makes many people reluctant to publish in such a journal, because it means giving away the data they were planning on using for future publications. A scientist's publication list is linked not only to their job opportunities and their pay grade, but also to the funding that they can get for future grants. And of course those grants are linked to continuing the funding of the long-term project that produced the data in the first place.
What is needed is a new licensing model for published data that says "anyone is free to use these data to replicate the results of the current study; however, they CANNOT be used as a basis for new analyses without written consent of the primary investigator of this paper or until [XX] years after publication." Journals would also need to agree that they would not accept any publications based on data that were used without consent.
It seems to me that this arrangement would satisfy the need to get data out into the public domain while respecting the scientists who produced it in the first place.
Open data is a great idea but it is not always practical. Particle physics experiments generate petabytes of extremely complex, hard-to-understand data. Making this publicly accessible is extremely expensive and ultimately useless since, unless you understand the innards of the detector and how it responds to particles, and spend the time to really understand the complex analysis and reconstruction code, there is nothing useful that you can do with the data. In fact one of the previous experiments I worked on went to great trouble to put their data online in a heavily processed and far easier to understand format in the hope that theorists or interested members of the public would look at the data. IIRC they got about 10 hits on the site per year and 1 access to the data.
So I agree with the principle that the public should be able to access all our data but for experiments with massive, complex datasets there needs to be a serious discussion about whether this is practical given the expense and complexity of the data involved. Do we best serve the public interest if we spend 25% of our research funding on making the data available to a handful of people outside the experiments with the time, skills and interest to access it given that this loss in funds would significantly hamper the rate of progress?
Personally I would regard data as something akin to a museum collection. Museums typically own far more than they can sensibly display to the public and so they select the most interesting items and display these for all to see. Perhaps we should take the same approach with scientific data. Treat it as a collection of which only the most interesting selections are displayed to/accessible by the public even though the entire collection is under public ownership.
Standard policy. Nature have been doing this for some time. They state: authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. So have Cell Press and Science. I stopped searching at this point, but I'm sure other major journals do the same thing.
soylentnews.org
I am getting a migraine. Can someone tell me how to beta-block?
Their journals aren't in my field (they are all bio journals), so I have not heard of them, but is it true that they are that big? Their web site wasn't much help in terms of information on subscriptions or article numbers, or I simply missed it. Can anyone familiar with them provide any input?
Their data policy might work for the biosciences, but good luck requiring all the many TB of raw data from a particle physics experiment to be put up somewhere. And in some instances, like that one, the raw data will most likely be useless without knowing what it all means, what the detectors were, what the detector responses are, etc. etc. etc. For experiments where it takes man-months or man-years to collect and process the data, making it all available in raw format will largely be a waste of time.
In general, at least for experiments done in the lab that use specialized equipment, raw data will not be very useful if you don't understand what you're collecting or aren't familiar with the equipment. You can end up with situations like that guy who took the Mars rover images and kept zooming in until he saw a life form.
I guess they don't want any more publications from medicine. There is no way to truly, fully anonymize patient data. This is why the data is rarely provided, or locked behind a "prove you're a researcher" wall, or only a small subset given decade(s) later such that it would be much harder to trace.
Well, that really wraps it up for the global warming crowd.
If their source data has to be publicly accessible, it'll be laughed off the stage before their "studies" get any traction.
And many scientists that get published in these high profile journals are scofflaws when it comes to sharing... It's been covered many times but compliance is near zero.
I'm worried about the wording around ALL DATA. In many experiments ALL DATA could easily be interpreted as the entire data sets, running into many terabytes or even petabytes. Making this much data publicly available could be prohibitively expensive for many papers.
There is a great deal of science, and public policy, that would benefit from public exposure. But medical and sociological research benefits from the privacy of the subjects, who then feel more free to be truthful. The same is true of political survey data, and "anonymizing" it can be a lengthy, expensive, and uncertain process, especially when coupled with various metadata that is being collected with the experiments or in parallel with them. It can also be very expensive to make public, even without privacy issues, because transferring it from obsolete media and making it available for public download often takes real engineering time. Long term science projects can span decades, and the first sets of data are often on obsolete media.
Overall, it seems an excellent policy, but exceptions will have to be made.
I'd agree with that. I once tried, very politely, to get data from authors of an NPG paper. They stalled and it became awkward. In the end I gave up because my interest was purely motivated by curiosity and I didn't want to make an enemy (even if the person in question was in a different field). Glad I backed off now as I've ended up moving into that field...
In the case of publicly funded research, all the advantage accrues to those who receive grants
Really? That's a rather ironic argument given that you are posting it on the web which was something invented and developed at CERN using publicly funded research money.
Your idea of practicality has nothing to do with open access, it's a justification for keeping a lid on it.
So why are you also not complaining that museums with publicly owned collections are not displaying every single item they own? Do you want them to stop researching collections and making acquisitions in the public interest and instead spend money on building thousands of square metres of new display space so every item they own can be displayed?
The public may own the data but there is a cost to making that data publicly available. My own experience has shown that even when that cost is met the public actually have almost no interest in looking at that data. I have absolutely zero objections to making all the data publicly accessible provided someone is going to pay for all the network bandwidth, servers, system administration, disk and tape storage, network connections etc. needed to access the data. However as a member of the public I would question whether that is a sensible way to spend all the money required to provide that access and argue that that money would be better going on research. After all that additional money going on data access corresponds to fewer postdocs and graduate students working on the experiment which, unless the data is wildly popular, probably means fewer people using it not more.
"This is good news for replicating experiments, building on past results, and science in general."
It is, unless the data can't be made "publicly available, without restriction" (very important emph. added), in which case you can't publish there. Yes, there are others, but demanding that all restrictions be dropped in all cases is simply an approach blind to reality. Also, if they demand this, they must provide free storage, which in some cases could range to multiple GB of data - and you won't want to pay for indefinite storage of large datasets, for certain.
Also, I wish to repeat my hatred towards the kind of open access publication methods most (if not all) major sci outlets use, namely charging the author many thousands of USD/EUR for publication, costs which most grants don't cover (e.g. my institute mandates open access publications, but of course they don't provide the financial resources to do so). This in turn shifts the focus, since now it's in the best interest of a publisher to accept as many as they can (keep the money flowing), instead of accepting the best ones and getting the money from interested readers (and yes, if it's good, they come). Of course politician-scientists like the publicity they get from folks for trying to 'set science free'. I just wish they'd do a bit more thinking, they are scientists after all (or so they claim to be).
I am putting myself to the fullest possible use, which is all I can think that any conscious entity can ever hope to do.
Yeah, it's getting colder outside on the global scale. Just look out the window every winter. It's all the proof I need. This winter the snow excess here was a football field-size snowflake. Those damn alarmists don't know what they're saying, let's just wait and see how wrong they are.
uhm...
There would seem to be a relatively easy solution to this problem - make the raw data available from the article itself, or at least as an attachment. If that requires petabytes of storage, then presumably PLOS will provide the necessary infrastructure. That way they can ensure that as long as the article is being offered, all data used is also available. Does that sound unreasonable considering their requirement?
Because agribusiness, Biotech and Medical is Big Bux, they don't want free and open access to their data. They won't let people send to PLOS.
It goes way beyond just genes and patient data. First, there's the issue of regulation. In most biology/psychology related fields, there's a raft of regulations from funding sources, institutional review boards (IRBs), the Dept. of Agriculture (which oversees animal facilities) and IACUCs for example that make it impossible to comply with this requirement, and will continue to do so for a long time. No study currently being conducted using animal facilities can meet this criterion, because many records related to animal facilities (including the all important experimental protocol) must remain confidential by statute (with the attestation of compliance from the IRB and IACUC). Likewise in the case of (any) human research, you'll have to get a protocol past the IRB for protecting subject anonymity, and given the likelihood of inadvertent identity disclosure that will be extremely difficult to do.
Second, there's a deep flaw in how the policy is written and how it conceives of data. To wit, the policy defines: "Data are any and all of the digital materials that are collected and analyzed in the pursuit of scientific advances."
Now for starters, there's a loophole big enough to drive several trucks through: In many experimental contexts the material necessary for a complete understanding of the 'raw data' is not in digital form, but rather in, say, lab notebooks. Which leads to the broader issue: what most researchers would actually be interested in seeing publicly disclosed is the 'data set', which is not 'raw data', but data that's processed into a useful, compact form that's suitable for statistical analysis.
However, in many experiments all of the material necessary to understand the 'raw data' (which I'll define here as the measured result of an assay in a very general sense) is distributed between lab notebooks, digital data collection, calibration and compliance records in facilities archives and several levels of processing often using proprietary and very expensive software. Even if all of those things could be published (see above), the 'raw data' would be mostly worthless because of the vast amount of time and effort required in many cases to turn the 'raw data' into the 'data set'.
The third problem, of course, which has been addressed in several places already on this thread, is that there's no money in grants to fund the required repositories.
I think at some level this policy is a noble idea, but it's been implemented in a terrible way, and obviously written by people in fields that already have functioning, funded public databases. Either people from many fields are going to stop publishing in PLOS, or they'll drive the truck through the loopholes and it'll be just as toothless as Science and Nature's sharing requirements.
If they really wanted to effectively push for greater transparency, what they should be pushing at the moment is simultaneous publication of the 'data set', which would let fields that don't have standardized databases in place design standards that would allow their creation.
Rather than publishing on proprietary data of uncertain characteristics, this will essentially force researchers to use common, known, and available data sets. A smattering of what's available and reputable:
http://www.itl.nist.gov/div898...
http://www.keypress.com/x2814....
http://lib.stat.cmu.edu/DASL/
http://www.statsci.org/dataset...
http://data.gc.ca/eng/facts-an...
http://library.med.cornell.edu...
"Consensus" in science is _always_ a political construct.
We won't know the result BUT yeah, finally researchers will have to really provide transparency on their work.
That works both ways though. Now Exxon et al will also have to show their justifications with hard numbers whose origins are clearly replicable.
This would be revolutionary if applied to healthcare. It would mean that datasets could be recycled and meta-analysed for rare tumors, rare cancers etc. It would also mean that drug companies will have to behave. Problems which may lead to panic, such as how confidential the data would be, are often addressed by institutional review boards, which vet the ethics of any study prior to its initiation at most institutions in the Western Hemisphere. Based on personal experience dealing with studies involving patient data, anything like filing numbers or ID codes is rarely used, and neither is complete genetic data (pragmatically, a study with the complete genetic code for 500 patients would be prohibitively expensive). Overall I hope this becomes a trend similar to that of open access, which they have championed in the past.
http://elife.elifesciences.org...