Freeing and Forgetting Data With Science Commons
blackbearnh writes "Scientific data can be both hard to get and expensive, even if your tax dollars paid for it. And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher such papers. That's the argument that John Wilbanks makes in a recent interview on O'Reilly Radar, describing the problems that have led to the creation of the Science Commons project, which he heads. According to Wilbanks, scientific data should be easy to access, in common formats that make it easy to exchange, and free for use in research. He also wants to see standard licensing models for scientific patents, rather than the individually negotiated ones that now make research based on an existing patent so financially risky."
Read on for the rest of blackbearnh's thoughts.
"Wilbanks also points out that as the volume of data grows from new projects like the LHC and the new high-resolution cameras that may generate petabytes a day, we'll need to get better at determining what data to keep and what to throw away. 'We have to figure out how to deal with preservation and federation because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is well, if it's meaningful, it'll be in the Google cache or the internet archives. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?'"
That's not true. Any tax-funded study requires more documentation and publication than a private one. Anyone who reads them knows.
All studies worth anything are aimed at an audience proficient in the subject; they are not meant for general audiences, and they are often proven wrong, so you need repeatable results.
Your faith in Wikipedia is misplaced; it was both, actually.
Perhaps Sir I.N. was the first, so you do earn the proverbial "first quote."
Trickle-down. Dissemination of knowledge.
You don't know it yet (not meant as a jibe but it is something that clicks in after your PhD) but your primary function as a scientist is not to make discoveries. It is spreading knowledge. Sometimes that dissemination will occur in a narrow pool, through journal papers between specialists in that narrow pool of talent.
This is not the primary goal of science, although it can seem like it when you are slogging away at learning your first specialisation well enough to get your doctorate. Occasionally a wave from that little pool will splash over the side - maybe someone will write a literature review that is read by a specialist in another field. A new idea will be found - after all sometimes we know the result before we know the context that it will be applied to.
The pools get bigger as you move further downstream. Journal articles pass into conference publications, then into workshops. Less detail but carried through a wider audience. Then after a time, when the surface seems to have become still, textbooks are written and the knowledge is passed on to another generation. We tend to stick around and help them find the experience to use it as well. This is why all PhD students have an advisor to point out the best swimming areas.
That was the long detailed answer to your question. The simple version is that you don't know who your target audience is yet. And limiting it to people in institutions that pay enormous access fees every year is not science. As a data-point - a lot of European institutes don't bother with IEEE fees. They run to about £50k/year, which simply isn't worth it. As a consequence, results published in IEEE venues are cited less in Europe. So even amongst the elite, access walls have an effect.
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Actually IEEE allows you to make your paper available on the internet at *one* location. However the material must not be reprinted/republished without permission from the IEEE. They also don't allow making your work part of another world-wide indexed collection. That's still far from perfect but at least it allows you to make your work accessible on your homepage or your university's Digital Commons repository. I don't know what the future plans of IEEE are.
On the subject of reproducibility, I am reminded of a situation with Wei-Chyung Wang, a climate scientist.
He was involved in the paper Jones et al (1990), which is where the situation begins.
After *17 YEARS* of requests, Jones FINALLY released some of the data used in Jones 1990 through demands under the terms of the U.K. Freedom of Information policy on publicly funded research.
Wang himself is free from FOI requests because Wang is an American and operates in America, where FOI requests regarding publicly funded studies have no legal weight.
The result of the eventual disclosure by Jones is that several researchers have concluded that Wang fabricated research steps: that some of the steps could not have been performed, then or even now, and that for many of the climate stations used in his work the existing station histories directly contradict Wang's stated assessments about his data set.
Specifically, he claimed that only a few of these recording stations had been moved during the time-frame significant to the research, and that they were free from significant urbanization changes (the research was to measure the "Urban Heat Island" (UHI) Effect.) In short, Wang claimed that the station histories showed that they were largely "homogeneous."
According to the DOE CAS study, in regards to the quality of Wang's other station data, "details regarding instrumentation, collection methods, changes in station location or observing times are not known." The CAS bills itself as the most comprehensive history of Chinese climate available to date. Note that Wang actually cited the CAS as one of the sources for his data.
Essentially both Wang et al 1990 and Jones et al 1990 were fraudulent pieces of work that were never independently verified, and could not have been verified given both the straight out fraud and the failure to disclose the data set used.
(Jones denies knowledge of Wang's fabrication of data.)
Sparked by this controversy, new research specifically addressing the UHI based on the Chinese climate record paints an entirely different picture with regards to China: that the effect is in fact much more significant than concluded by Jones et al 1990.
FULL DATA DISCLOSURE IS NEEDED.
This is especially true in some areas of science, where all the big players not only know each other, BUT WORK, PUBLISH, AND PEER REVIEW TOGETHER.
One specific small group of people is directly influencing global policies regarding climate change through their direct involvement with the IPCC, all the while hiding their own work and obstructing validation of their work.
"His name was James Damore."
How can the results be reproducible if you don't keep the original data?
As others noted, there are cases where raw data is king, and others where raw data is virtually useless. LHC raw data will be invaluable. Raw data from genetic sequencing is a waste of time to keep. Why store huge graphics files when the only thing we will ever want from them is the sequence of a few letters? One must be able to distinguish between these two possibilities (and more subtle, less black and white cases, too), and there is no one size fits all solution.
That said, you may be surprised how well really valuable data is stored by good principal investigators. I recently helped my PI re-digitize a prized result from 1988 (showing the first example of a synthetic enediyne compound cleaving DNA). The journal did not do a good job of scanning it, and it therefore was hard to interpret in the printed journal. So we dug up the original raw data (the original UV photograph of the DNA gel showing this result), which had been carefully filed away in our offsite storage location all these years, and re-digitized the image for a recent review article.
I've been doing research in the biological sciences for 12 years now, including some work that was at least tangentially related to human health. I am not in it for the paycheck--if that's all I wanted, my friends and I joke that we'd go to KFC School of Business Management and be assistant managers at fast food restaurants making more than we do in science. I, and the majority of the people I know, don't want to be professors either. It's extremely rare for a professor to actually do any lab work themselves, but if you ask they'll tell you they miss it. Besides, there are 300 people applying for each professorship at a decent university. Then if you are unlucky enough to get the job, you have to successfully fight in a viciously competitive funding environment to get tenure and not lose your mind or your liver in the process. It's actually hard enough to keep a job in academic science, period. My boss and I are applying for grants. Hers are in part to keep my position funded; she's got one out and is writing a second. I've got one out, and am applying for two or possibly three more. Contrary to what you wrote, my grants are largely my ideas and my writing, and should I get funded, it is my money, not the boss's. However, science funding is so obscenely bad (most grants have a ~5% success rate; the best one I'm applying for has ~25%) that I'm also going to look for a new job, with the boss's full knowledge and support, even though we'd both very much like me to stick around for another couple years and get our proposed butt-kicking science done.
So why do it if there's nothing but nonstop stress, Burger King assistant manager pay, and institutionalized job insecurity? I get to solve problems. I get to figure things out. I get to do things (sometimes, not often, but sometimes) that nobody has ever done before, see things nobody else has ever seen before. Work in a small way on projects that could impact millions of people's lives. I'll never be famous, which is fine with me. I'll never be rich, which, well, I can tolerate. I might not ever have job security...which okay, I'll admit is seriously grinding down my enthusiasm and idealism. But the things I've gotten to do--even paid a pittance to do--I wouldn't trade. Catching jellyfish off the docks in Oregon. Turned loose on a billion-dollar synchrotron, unsupervised at 3 am to understand how an enzyme known to be a virulence factor in several diseases functions at an atomic level. Making radioactively labeled mosquitoes to understand lipid trafficking, working with cell culture (It's a cell from an insect's midgut...that under laboratory conditions can endlessly propagate itself. How cool! And here's what I'm going to do with it...), genetically engineering fluorescent organisms, using high-throughput screening to find new drug lead compounds. A lot of hard work, but sometimes that's damn good fun. Plus along the way you get to understand phenomena on a level that most people don't even know exists. I'm of course not claiming god-king knowledge here, but I could spend a long time talking about the terrible beauty of host:pathogen and vector:pathogen relationships for example, or protein structure, or anything else I've studied a while, just like any other scientist. That's fun too, although not cool in most of society. But my mom still thinks I'm cool. Ok, no, she doesn't.
If you expect to get rich and famous doing science, no wonder your post seems bitter. It isn't going to happen and isn't a right reason to do science in the first place. Those pie-in-the-sky ideals are.
Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule.
First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility.
They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.
I admit that I jumped on the LHC as an extreme example. But even in an "ordinary" lab these days, you'll find some specialized and complex equipment. This is true for the cutting edge of any field.
Second: Primary data, actual measurement results, are already kept, as a rule.
As oneiros27 notes, this is not guaranteed, either by design or circumstance.
Third: The vast majority of experiments are never ever reproduced to begin with. You're lucky enough to get cited, really. Most papers don't even get cited apart from by those who wrote them.
Not sure what kind of point you're trying to make here.
Fourth: Very little science is done by re-interpreting existing results. That only applies to the unique cases where the actual experiment can't be reproduced easily.
It's not necessarily a matter of re-interpreting existing results. You may be adding an old dataset to a new dataset, and finding new results in the combined set, or finding a glimmer of something new in an old dataset. Even for "small" experiments, having somebody else's raw dataset can make your life a lot easier.
Truth is, you'd still have to rebuild the LHC then, because you didn't test your 'corrected' assumption against the actual machine to show that your 'corrected' results are valid. Until the actual experiment is re-done it'll remain an unanswered question.
No, I am talking strictly about analysis. For example, the use of neural networks in particle/track finding has recently met greater acceptance in High Energy Physics. But what happens if, a few years down the road, evidence turns up that neural networks are fundamentally flawed? If you have kept the data, you can re-run the analysis with different methods. If you have thrown out the data, it's time to build a new LHC.
Granted, High Energy Physics, with its requirements for large datasets in order to find extremely rare processes, is perhaps the only branch of science to require so much data. In HEP, we want to keep as much as possible, but there are realistic limits. In other fields, since there are no difficulties, why not keep everything?