Research Data: Share Early, Share Often
Shipud writes "Holland was recently in the news when a psychology professor at Tilburg University was found to have committed large-scale fraud over several years. Now, another Dutch psychologist is suggesting a way to avert these sorts of problems, namely by 'sharing early and sharing often,' since fraud may start with small indiscretions due to career-related pressure to publish. In Wicherts' study, he requested raw data from the authors of some 49 papers. He found that the authors' reluctance to share data was associated with 'more errors in the reporting of statistical results and with relatively weaker evidence (against the null hypothesis). The documented errors are arguably the tip of the iceberg of potential errors and biases in statistical analyses and the reporting of statistical results. It is rather disconcerting that roughly 50% of published papers in psychology contain reporting errors and that the unwillingness to share data was most pronounced when the errors concerned statistical significance.'"
"Trust me I'm a scientist" isn't good enough anymore?
...most people who do it are downright bad at it. That they might take more time and care to be good at it without the perpetual axe of publish-publish-publish and grant funding hanging over their heads is another issue altogether.
Don't believe anything that hasn't been verified by an independent group of researchers.
I do research in textual web mining and from time to time I have other researchers ask me for my collections which I spider myself from copyrighted web sources. While my work is purely academic, I am covered by fair use. But since US intellectual property laws are obtuse and overbearing (imho), I cannot take the risk of sharing my collections with others for fear of running afoul of copyright law (since I can't control what is done with the collection once it is out of my hands and how do I know they would use it in a manner consistent with fair use). So it may be more than an unwillingness out of statistical fudging and more an unwillingness to become a target of copyright lawyers.
I called it a mighty Sperm Whale, she called it Finding Nemo.
One reason scientists don't share is that if the data gets out early and gets around (damn slutty data), other scientists might steal/copy/scoop/whatever the data. Unless there is a great way to prevent this, the suggestion proposed here will never go anywhere.
It is very difficult to make a man understand something when his job depends on not understanding it. If psychology research were made to adhere to any kind of stringent scientific standard, there would be no psychology research.
Sounds like you have some issues with authority. Would you like to discuss it?
Faster! Faster! Faster would be better!
Some probes, like the Mars Rovers, Cassini, and SOHO, post their data on the web within days. Others, like Kepler and ESA-Express, have posted very little of their data. The tradition is for Principal Investigators to embargo the data for one year.
What? The IPCC was just collecting already published data; there were no 'new' studies done.
Careful - your bias is shining through.
Einstein was unable to find a teaching post, and was working in a patent office when he published his annus mirabilis papers. Things have changed over the years though. John Dewey discovered a century ago how children best learned - let the child direct his own learning, and have an adult to facilitate this. This, of course, is not how children are taught. Things nowadays are very test-heavy, and becoming even more so, not as a means to help students in seeing what their deficiencies are, but as a punishment system - and the teachers, and the administrators are under the same punishment system. The carrot of reward is very vague and ill-defined and far-off. It is a system designed to try to squelch the curiosity of those handful of students who had been curious and wanted to learn. Businesses want to get into the education gravy train, and all this charter school stuff is being embraced by both parties, which isn't surprising if you look at the funding behind it.
At the university, the financial incentives are all aligned so that publishing is a necessity. If one does not publish, one does not get tenure, and then all those years of work were for naught as the academic career is over. And what gets published? An average series of experiments done by the scientific method would usually lead to either inconclusive data and results, or just wind up in a dead end. And what journal wants to publish those results after months of work? One of the most popular PhD Comics strips is this one. It seems fairly obvious to me - the more financial incentives are tied to getting published, the more bogus studies are going to be published. As far as the idea of honesty, integrity or whatever, these things will gradually subside for most people when they come into conflict with keeping a roof over one's head and food on the table.
...most people who do it are downright bad at it. That they might take more time and care to be good at it without the perpetual axe of publish-publish-publish and grant funding hanging over their heads is another issue altogether.
I agree and I can think of something to illustrate your point.
I was listening to a This American Life episode a few weeks back and there was a story done on two people -- one a music professor and the other a respected oncologist -- who were investigating a long defunct theory that certain electromagnetic wavelengths can kill cancer cells and only cancer cells leaving healthy cells completely fine. When left to run the test, the music professor failed to maintain the control correctly and many other things. But after being corrected by the respected researcher they started getting positive sets of preliminary results. The respected researcher requested that the music professor not share this with anyone and not to attach his name to it just yet.
Well, the music professor did not follow this advice because he was so excited about the preliminary results and had, I guess, sort of felt like the respected researcher had short changed him and suppressed him. What the music professor wanted to do was blow the lid off this thing with possibly flawed data and sent it to other oncologists with the original researcher's name attached to it -- possibly misrepresenting it as flawed data. Now I can see why a researcher might fly off the handle when data is released extremely early. They were having problems recreating their own findings (with sham-control) which caused the original researcher to want to keep this very much out of the public's eye. You might claim he was just trying to save himself embarrassment but there's nothing embarrassing about finding out your hypothesis is wrong in science, I just think the best researchers avoid these "failures" and the subsequent investment of resources into them.
I think that scientists figure out how to create the most data and separate the wheat from the chaff in a very lengthy (think decades) process, whereas the first sign of a breakthrough might cause more inexperienced researchers to show the world. And the reason, as you mentioned, is probably the immediate funding they can get with it. But I think it badly neuters scientific news, the reward system and even the direction that research takes. But to release and share early on and often might just make everyone look bad when the whole background of the data is unknown to someone who receives it.
My work here is dung.
Ultimately, everyone agrees that open sharing of research data funded by the taxpayers would be A Good Thing(TM). The problem is: how do you persuade people to actually do it? It's much like how advanced safety features on cars, free college tuition, and taxes on big banks sound like great ideas, until you look at what it will actually cost to implement them. Not just "cost" in terms of money for infrastructure development, data storage, and support, but in terms of persuading an entire culture to change their workflow.
In our lab, we already spend an extraordinary amount of time on administrative tasks only indirectly related to our research. Adding in a mandatory data sharing task and fielding questions from random people who wanted to use it would be a serious additional chore. Then there's the embarrassment aspect... we actually had a project a couple months ago where there was another group doing an experiment that we wanted to do, and they had software already written. So we thought, "great, we'll just ask them for the code". So we fired off an email... and after a couple weeks we finally got a reply to the effect of "this is actually my first program, and I don't feel comfortable sharing it." So we had to spend 2-3 months writing our own version to do exactly the same thing.
The NSF is now requiring this as part of grant applications. You have to have a data management plan that includes the public deposit of both the data and results from grant funded work. Other funding orgs are following suit.
This is a fairly major project at the university I work for, both from the in-process data management perspective (keeping field researchers from storing their only copies on thumbdrives and laptops) and from the long-term repository perspective for holding the data when the grant is completed (that's what I'm involved with).
Storage is cheap. Convincing university administrators to pay for keeping it accessible is another problem, but the NSF position is helping.
If Star Trek had the internet: Captain, we've received an IM from the romulans. "Surrender or be destroyed. LOL. o.O"
And continues. Phil Jones, for example, has stonewalled requests for the raw data used to e.g. create HadCRUT3 etc, although recently it seems that one reason he hasn't shared it is that he lost it and literally can't share it. So we have a rather important temperature series, openly available on the web and used by many, many climate researchers and nobody can reconstruct it, including the original author. The problem continues -- it is like pulling teeth, getting members of the hockey team to share data and/or methods so anyone can check them.
Since the few times somebody has bulled through until they've succeeded, e.g. Steve McIntyre vs Michael Mann, what has been discovered is that the published result (the infamous MBH "hockey stick") is nothing but amplified, distorted white noise that has absolutely no correlation with the data used to produce it, let alone skill at reconstructing actual past temperatures, it doesn't bode well for the discipline.
I've recently written a guest article on WUWT calling for data/methods transparency in climate research. By transparent, I mean that you should not be allowed to publish a paper that could potentially influence lawmakers and public policy to the tune of hundreds of billions of dollars unless you simultaneously publish all contributory raw data (including any data you for any reason left out) and the actual computer code used to process it into figures and conclusions. Something this important needs full open source open data transparency even more than medical research (another discipline where reproducibility of results is abysmal, where there are vested interests galore, and where we spend/waste a phenomenal amount of both money and human morbidity and mortality on crap results).
rgb
Even when the experts all agree, they may well be mistaken. --- Bertrand Russell.
I wonder if we just haven't quite mastered the techniques necessary to deal scientifically with highly complex systems. Psychology, economics, climatology, etc., all are theoretically understandable, but are so chaotic that our standard scientific methodology can't be applied... you can't, for instance, repeat an experiment. You can't isolate one changing variable.
Are you kidding? What's the cost of storage on a webserver, per byte? Would that be "zero" compared to the size of any reasonable dataset in the discipline? It would. You could put up a single e.g. 10 TB server in a single lab for a few thousand dollars and it would cost a few hundred dollars a year to run and would handle all the data associated with all the publications in psychology in a decade.
What is expensive and wastes taxes is bozos who do crap research, publish the crap results, hide the crap data and crap methods, and are cited repeatedly in other people's work, a circle of error and corruption that often lasts for years before it is finally discovered and weeded out. We pay for that work already; we need to make people accountable for it by requiring data/methods transparency (if you are e.g. not privately funded). That way the bozos would have research careers that are either over instantly or they'd get so sharply corrected by their peers that they'd wake up and do things right.
rgb
Sorry, but you're an intellectual bigot who resorts to citing well-known celebrities rather than actually researching what the content of a field actually is and making a principled argument. Unfortunately, your bigotry is only ameliorated by its ubiquity in communities such as Slashdot.
A number of points need to be made:
First, most people have a stereotyped idea of what psychology is, because they don't actually know what it is. It's the scientific study of human behavior and experience. If you think it's couches and Freud, you're uninformed. My guess is that Feynman took psychology courses and had his primary exposure to the field during the mid-20th century, when psychoanalysis was dominant in *one branch of psychology*, and it isn't even dominant in that area anymore. Psychologists study molecular neurobiology, multivariate statistics, neurophysiology, immunology, and any number of other topics. Be prepared to argue that those fields aren't science (or math) if you're prepared to argue that psychology isn't a science.
Second, it's worth noting that this fraud case (and the way the story is framed) focuses on psychology, but similar problems happen in other fields. E.g.:
http://en.wikipedia.org/wiki/Controversy_over_the_discovery_of_Haumea
http://abcnews.go.com/Health/Wellness/chronic-fatigue-researcher-jailed-controversy/story?id=15076224
Finally, what would you propose to do instead? Study human behavior and experience nonscientifically? That's what you seem to be suggesting.
And continues. Phil Jones, for example, has stonewalled requests for the raw data used to e.g. create HadCRUT3 etc, although recently it seems that one reason he hasn't shared it is that he lost it and literally can't share it.
That is complete and utter bullshit.
First, he has never stonewalled requests for the raw data. It's been out there for ANYONE to obtain. The problem is that, for some of it, you have to PAY to get it, and UEA was forbidden by contract to give away said data for free because then people wouldn't PAY for it anymore. So, if you want to piss and moan about access to the raw data, then apply your angst and woe to the most responsible parties, the Met offices which want to profit from their weather data-gathering businesses.
Second, the "lost data" canard is a crock. Since the raw data is not owned or generated by UEA, but instead obtained from outside sources, they have NO mandate to keep the original raw data once they have processed it. They (and you and anyone else) can go and get it from the same sources at any time. Whip out your checkbook and get to it.
So we have a rather important temperature series, openly available on the web and used by many, many climate researchers and nobody can reconstruct it, including the original author. The problem continues -- it is like pulling teeth, getting members of the hockey team to share data and/or methods so anyone can check them.
You (and they) most certainly can get the original raw data and reconstruct it. There are literally mountains of data that have been released to the public on a large part of climate science. You just need to learn who and how to ask properly and, in some cases, how much it costs.
Here's a huge FREE repository of all kinds of climate-related data, from the climate scientists themselves.
Since the few times somebody has bulled through until they've succeeded, e.g. Steve McIntyre vs Michael Mann, what has been discovered is that the published result (the infamous MBH "hockey stick") is nothing but amplified, distorted white noise that has absolutely no correlation with the data used to produce it, let alone skill at reconstructing actual past temperatures, it doesn't bode well for the discipline.
Mann's work has been vindicated and replicated time and time again, McIntyre's (and others') quixotic attempts to discredit it notwithstanding.
I've recently written a guest article on WUWT..
That explains the ignorance of your previous comments a bit.
..calling for data/methods transparency in climate research. By transparent, I mean that you should not be allowed to publish a paper that could potentially influence lawmakers and public policy to the tune of hundreds of billions of dollars unless you simultaneously publish all contributory raw data (including any data you for any reason left out) and the actual computer code used to process it into figures and conclusions. Something this important needs full open source open data transparency even more than medical research (another discipline where reproducibility of results is abysmal, where there are vested interests galore, and where we spend/waste a phenomenal amount of both money and human morbidity and mortality on crap results).
In large part, this is precisely what happens, with a few exceptions. Those exceptions usually revolve around whether any kind of contract with a private entity, whether to obtain said data or to develop software/hardware, is in effect that would preclude giving them away. That said, the research should (and usually does) document the specifications for said hardware/software, and include where the original data came from so anyone can pay to obtain it themselves.
As a software developer who actually writes software for scienti
-SS "Teach the ignorant, care for the dumb, and punish the stupid."
Hmmm, you really do need to read the climategate 2 letters, don't you.
From message 4241.txt, a communication from Rob Wilson to Ed Cook (and others):
I first generated 1000 random time-series in Excel – I did not try and approximate the persistence structure in tree-ring data. The autocorrelation therefore of the time-series was close to zero, although it did vary between each time-series. Playing around therefore with the AR persistent structure of these time-series would make a difference. However, as these series are generally random white noise processes, I thought this would be a conservative test of any potential bias.
I then screened the time-series against NH mean annual temperatures and retained those series that correlated at the 90% C.L.
48 series passed this screening process.
Using three different methods, I developed a NH temperature reconstruction from these data:
1. simple mean of all 48 series after they had been normalised to their common period
2. Stepwise multiple regression
3. Principle component regression using a stepwise selection process.
The results are attached.
Interestingly, the averaging method produced the best results, although for each method there is a linear trend in the model residuals – perhaps an end-effect problem of over-fitting.
The reconstructions clearly show a ‘hockey-stick’ trend. I guess this is precisely the phenomenon that Macintyre has been going on about.
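For anyone who wants to try the screening test described in the email above, here is a minimal Python sketch using only the standard library. It is not the Excel sheet from the email and not anyone's published method. It only covers the first reconstruction method listed (the simple mean), and several things are assumptions on my part because the email does not give them: the series length (`N_YEARS`), the length of the calibration window (`CALIB`), a plain linear ramp standing in for the NH temperature record, and the cutoff `R_CRIT`, which is roughly the one-sided 5% critical correlation for 50 points.

```python
import random
import statistics

N_SERIES = 1000   # number of random series, as in the quoted email
N_YEARS = 150     # ASSUMPTION: series length is not given in the email
CALIB = 50        # ASSUMPTION: length of the calibration window at the end
R_CRIT = 0.235    # ASSUMPTION: approx. one-sided 5% critical r for n = 50


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ols_slope(y):
    """Least-squares slope of y against its index 0, 1, 2, ..."""
    x = range(len(y))
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def screening_demo(seed=0):
    """Screen white-noise series against a rising target, then average them.

    Returns (number of series kept, slope of the averaged series before
    the calibration window, slope inside the calibration window).
    """
    rng = random.Random(seed)
    # ASSUMPTION: a linear warming ramp stands in for the temperature record.
    target = [0.01 * i for i in range(CALIB)]
    kept = []
    for _ in range(N_SERIES):
        series = [rng.gauss(0.0, 1.0) for _ in range(N_YEARS)]
        # Keep only series that correlate positively with the target
        # over the calibration window.
        if pearson(series[-CALIB:], target) >= R_CRIT:
            mu, sd = statistics.fmean(series), statistics.pstdev(series)
            kept.append([(v - mu) / sd for v in series])  # normalise
    # Simple mean of all retained series, year by year.
    recon = [statistics.fmean(year) for year in zip(*kept)]
    return len(kept), ols_slope(recon[:-CALIB]), ols_slope(recon[-CALIB:])


n_kept, pre_slope, calib_slope = screening_demo()
print("series kept:", n_kept)
print("slope before calibration window:", pre_slope)
print("slope inside calibration window:", calib_slope)
```

The point of the sketch is the selection effect: the retained series are pure noise outside the calibration window, so their average is nearly flat there, while inside the window every retained series was chosen for trending upward, so the average rises. Whether and how far this carries over to real proxy data with real persistence structure is exactly what the sketch does not address.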
Surely this vindicates Mann -- by proving that it does indeed turn white noise into hockey sticks! Not only is Mann wrong, but the hockey team knows it perfectly well! There are letters where people openly lament being involved with the hockey stick type reconstructions (and other places, e.g. where they "hid the decline" in tree ring data) because they are terrible science and because they are openly worried that sooner or later people will catch on. As indeed they have, although they have won the PR war (another great Mann quote) to such an extent that even though they themselves know that the hockey stick is bogus and that white noise fit according to Mann's cherrypicking methodology will produce nothing but hockey sticks, it just won't die, will it? Thanks to people like you!
We could review the specific Climategate 2 letters where Jones talks about deliberately trying not to give away data to the people who requested it (something I would call "stonewalling", except that the circumstance in question is a FOIA request that was only a missed deadline away from being "a crime" upon the release of the CG emails), or about the points where it turns out that he does a lousy job of keeping records (problems with Excel spreadsheets) and no longer can reproduce his own results because he doesn't know what data he used, if you like.
Or we could look at the many, many other places where internal communications show that the hockey team is well aware of many problems with their own results and consistently choose not to let the general public know about them lest we be led to doubt their conclusion. Then we could read Feynman's lovely article on "Cargo Cult Science": http://www.lhup.edu/~DSIMANEK/cargocul.htm. See how close you think the hockey team comes to Feynman's fairly modest standard for good, honest science, while reading Mann going on about the importance of winning the PR war, getting journal editors fired, and generally doing his very best to eliminate all challenge to his papers, or, if he can't manage that, eliminating the challengers themselves.
But really, read them yourself. Don't accept what people tell you about them, read them! Then tell me that this is honest science, well done.
rgb
No. Those who requested the data requested that if all the data couldn't be provided, then the freely available data should be provided. They were refused.
Bzzt. Wrong. Try again.
30. First, in answer to the question of whether the raw data are accessible and verifiable, Professor Jones told us that:
The simple answer is yes, most of the same basic data are available in the United States in something called the Global Historical Climatology Network. They have been downloadable there for a number of years so people have been able to take the data, do whatever method of assessment of the quality of the data and derive their own gridded product and compare that with other workers.
31. In addition, of course, there are the sources of the data, the weather stations, to which any individual is free to go and collect the data in the same way that CRU did. This is feasible because the list of stations that CRU used was published in 2008.
41. Professor Jones contested these claims. According to him, “The methods are published in the scientific papers; they are relatively simple and there is nothing that is rocket science in them”. He also noted: “We have made all the adjustments we have made to the data available in these reports; they are 25 years old now”. He added that the programme that produced the global temperature average had been available from the Met Office since December 2009.
51. Even if the data that CRU used were not publicly available—which they mostly are—or the methods not published—which they have been—its published results would still be credible: the results from CRU agree with those drawn from other international data sets; in other words, the analyses have been repeated and the conclusions have been verified.
When asked for a list of what data was used, but not the data itself, they refused. Even if the data is available for free on the net, how can the results be replicated if they will not say which data was used?
Jones PERSONALLY refused. The information about what data was used has been available since the original papers and research were performed! IT'S IN THE RESEARCH, DURRRR. Have you ever read any of it?
It has only been replicated by his buddies.
Bzzt! Wrong. Try again.
BEST was funded by the Koch brothers, owners of a giant oil/petrochemical company. Most DEFINITELY NOT "buddies" with Mann. Even still, being "buddies" in science doesn't mean diddly-squat; it's not about WHO you know, but WHAT you know, and HOW WELL you know it. So far, Mann's work has been REPEATEDLY vindicated.
There can be no vindication for trying to "hide the decline".
Ya know, for a minute there, I thought you might be trying to be genuinely serious and skeptical. Then you trot THAT out. /facepalm
It is a well established rule of science that you don't leave out data that casts doubt on your conclusion.
You are correct, it is, and the vast majority of climate scientists and their research faithfully follow that rule, no matter how many intellectually dishonest, ignorant, and gullible idiots falling for charlatans and snake oil salesmen like Watts, Michaels, Singer, et cetera ad nauseam, try to spin otherwise.
You've fallen for their story.
No, I've fallen for the FACTS of the matter. I've done my homework; I've looked beyond anyone's story; what's YOUR excuse?
Many of us used to think the alarmists were good willed, and we assumed they were honest. I still think they are good
-SS "Teach the ignorant, care for the dumb, and punish the stupid."
So you admit that they stonewalled on the station list till 2008? And you admit that they didn't release their software until after they had been exposed by the climate gate email release? I may not have been clear, but I didn't mean to imply that they still haven't released stuff, but only that they were stonewalling at one time.
Do you know what a "station list" is? It's a list of weather stations all over the world. It's not exactly a secret, ya know. He didn't stonewall on releasing anything that wasn't already accessible by the public. You get a list of all the Meteorological offices across the world, and you ask them for their list of stations. Some may require you to PAY for that information. What's so super seekrit squirrel about that?
They didn't release their software until they had PERMISSION to do so. I bet you also didn't know that some weather station data is STILL not published to this day. You have to get it from the MOs who SELL it, if you want a copy.
If they have proprietary info that they CAN NOT release, by contract, no amount of whining about "stonewalling" is going to change a damn thing about that. Get over it.
Strange. Why didn't he just give the URL for the files instead of refusing. But of course you've quoted a source admitting he didn't release the station list till 2008. So it doesn't look like it was "IN THE RESEARCH".
Because..he..didn't..have..rights..to..release..the..data. What part of this is unclear? The SOURCES (the MOs) of the data WERE in the research. He said that much in the report.
You cite BEST as replication by some other than buddies, but I was referring to replication of the hockey stick. BEST did not replicate the hockey stick. Furthermore, BEST was led by an alarmist, so that is not clearly replication by other than buddies.
No, BEST does not do paleoclimate reconstruction, but the "blade" of the "stick", which is what many deniers actively dispute about it anyway, matches with a high degree of confidence to the BEST results.
Also, Richard Muller and Judith Curry are HARDLY "alarmist", considering Muller sided with McIntyre and McKitrick over the MBH98 reconstruction. He still voices opposition to it, but he's no longer doubting the temperature record, and where it is heading. Curry has been dissenting against the "mainstream" climate take on purely social grounds for a while now.
Anthony Watts admitted after his own study that the average temperature trend of the urban stations was no higher than the good rural stations. Of course he then minimized it and tried to make a seemingly insignificant issue of the difference between the trends in the diurnal temperature range.
Anthony Watts doesn't admit he's wrong about squat. In that dodge, he avoided any fallout from any admission of fault. It wasn't "mea culpa, I was wrong, maybe I should rethink things a bit", it was more like "meh.. even if I was wrong, it doesn't matter anyway; AGW STILL IS WRONG!!!!1!1!!oneoneone!1".
I see tons of ignorance on the skeptic side. The alarmist side actually seems to be much more grounded in facts.
That's nice of you to say, but...
But now we're seeing that the alarmist facts may not be as solid as was once thought.
Such as? Got the data? Research?
And you simply dismissed my criticism of the attempt to "hide the decline", but you gave no reasoned defense.
That's because it is an irrational and stupid canard that has had the snot beaten out of it so much that I can't see how anyone can STILL use it with a straight face.
OK, if you insist. Reasoned defense: You DO understand the context of that comment, right? Here, this video will 'splain things.
That is understandable given it appears to be indefensible.
-SS "Teach the ignorant, care for the dumb, and punish the stupid."