Freeing and Forgetting Data With Science Commons
blackbearnh writes "Scientific data can be both hard to get and expensive, even if your tax dollars paid for it. And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them. That's the argument that John Wilbanks makes in a recent interview on O'Reilly Radar, describing the problems that have led to the creation of the Science Commons project, which he heads. According to Wilbanks, scientific data should be easy to access, in common formats that make it easy to exchange, and free for use in research. He also wants to see standard licensing models for scientific patents, rather than the individually negotiated ones that now make research based on an existing patent so financially risky."
Read on for the rest of blackbearnh's thoughts.
"Wilbanks also points of that as the volume of data grows from new projects like the LHC and the new high-resolution cameras that may generate petabytes a day, we'll need to get better at determining what data to keep and what to throw away. We have to figure out how to deal with preservation and federation because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is well, if it's meaningful, it'll be in the Google cache or the internet archives. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?'"
Einstein said "If I have seen farther than most it is because I have stood on the shoulders of giants."
Where does that begin to apply in a society of lawyers, profiteers, and billion dollar industries based on exploiting shortsighted IP management?
I was reading through the summary quickly and almost had a panic attack at the deluge of questions at the end. We get the point already!
Exactly 1000 more bytes? Wow!
After all, we are a nation of cowards.
What's most important to keep is quite simple and obvious really:
The results. The published papers, etc.
It's an important and distinctive feature of Science that results are reproducible.
That's not true. Any tax funded study requires more documentation and publication than a private one. Anyone who reads them knows.
All studies worth anything are aimed at an audience proficient in the subject; they are not meant for general audiences, and are often proven wrong. You need repeatable results.
"And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them. "
I predict the dumbing down of science.
Shai Schticks:"You don't make peace with friends, you make peace with enemies"
Has nobody ever read The tragedy of the commons?
However, in the case of the non-physical, I guess no one can "waste" or "steal" it, only copy and use.
This issue is a bit more complicated than you think.
On a more serious note, a common ground for data formats would be nice. You already have some generic formats, like HDF5 and others, but I must admit that right now it is a bit of a jungle in the astrophysics department, and it is not going to change anytime soon (unless someone makes an awesome, generic, one-size-fits-all library in... Fortran77...).
EULA : By reading the above message, you agree that I now own your soul.
I'm a working scientist (ok, PhD student), so I read journal articles pretty often. I can understand the rub in principle, but let's say that we come up with some way for all scientific data to be freely shared. So what? In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?
It reminds me of the XKCD this morning...
Epitaph: At last! Root access!
Data storage is something we've gotten very good at and we've made it very cheap. A Petabyte a day is not as staggering as it was even five years ago.
The results. The published papers, etc. It's an important and distinctive feature of Science that results are reproducible.
Having worked around academic groups that do medical research for three years now, I can tell you that is absolutely not what drives research.
Researchers will love to tell you about how it is the quest for knowledge and other pie-in-the-sky ideals, but when it comes down to it, it's mostly about making a living (or more than a living), and fame/prestige.
See, journals have what's called an "impact factor." An impact factor is, roughly, a measure of how often articles in a particular journal end up being cited by other papers. In one lab I worked at, it was closely tracked who was published where, and how many times.
At the end of the year, when it came time to decide who went and who stayed, the scores were lined up and however many people needed to go came from the bottom. The top ones get a little closer to becoming a PI (Principal Investigator, aka someone who has postdocs and grad students working for them.)
PIs, all the people you read about in the paper- they survived the process, but they're now nothing more than management. They don't do lab work, they don't do research. They solicit ideas from their postdocs, put the final polish on a grant proposal the postdoc slaved over, and get big fat checks from NIH for millions of dollars. The PIs then pass the work down to postdocs, who dole it out to grad students. The grad students do it because a PhD is dangled in front of them while they run on the treadmill of endless, monotonous, repetitive lab work and analysis work. The postdocs do it because faculty positions and PI slots are dangled in front of them.
The problem with "the system" is that nobody is rewarded for reaching that brass ring. Just like Ford has no incentive to build a very durable car (no service/parts sales after the vehicle hits the end of the warranty, and the market quickly becomes saturated) researchers have no incentive to completely solve issues facing us today; their incentive is to come close enough to say "aha, look, we did find SOMETHING, so your grant money wasn't wasted."
What incentive does a massive industry have to solve cancer, when it would put them out of business? Tens of thousands of people have dedicated most of their adult lives, usually to studying specific mechanisms and biological functions so narrow that if cancer were cured tomorrow, they would be useless; their training and knowledge are so focused, so narrow, that they cannot compete with the existing population of researchers in other biomedical fields. Journals which charge big bucks for subscriptions also would be useless. Billions of dollars of materials, equipment, supplies, chemicals: gone. "Centers", hospitals, colleges, universities which each rake in hundreds of millions of dollars in private, government, and non-profit sourced money would be useless.
Please help metamoderate.
I'm sorry, but that makes no sense. 'Points of'???? Come on.
I know that this is a real shock to you humanities majors, but science is hard. And yes, for the record, I do have degrees in both [physics and philosophy, or will as of this May — and the physics was by far the harder of the two].
Here's another shocker. If you think the papers are hard to read, you should see the amount of work that went into processing the data until it's ready to be written up in an academic journal. Ol' Tom Edison wasn't joking when he said it's "1% inspiration and 99% perspiration." If you think seeing the raw data is going to magically make everything clear, well, I'm sorry, the real world just doesn't work that way. Finally, if you think professional scientists are going to trust random data of unknown provenance that they downloaded off the web, well, I'm sorry but that isn't going to happen either. I spend enough time fixing my own problems; I certainly don't have time to waste fixing other people's data for them.
-JS
Vanity of vanities, all is vanity...
Is it that I'm posting on slashdot? Is it that we all read this article? Is it that you read this comment? Is it that you read this comment on slashdot? Is it that you read slashdot which had this article which had this comment? Or is it?
Euclid once said to a king "there is no royal road to geometry". The nature of some things is in fact complex, and there is no way to represent them that is both easy and accurate.
Is it a goal of science, or of religion, that the universe be made in such a way that it is easy to explain to humans?
Tried to use it in a recent paper, and got this reply:
"IEEE have advised that they are unable to accept
the Science Commons license at this time.
If you want your paper to be published, you will
need to sign off a plain IEEE copyright form
and scan/email it to me."
This is just as likely to add burden as to remove it.
I can't count the number of times I've seen attempts to 'standardize' data, or even just notation, in a given field. It all works very well for data to that point, but then the field expands or changes, or new assumptions become important, and the whole thing becomes either unwieldy or obsolete. This is one reason why every different field, it seems, has their own standards in their literature.
Speaking of the literature, most of these proposals are quickly followed by a 'let's just ask authors to conform to this now' approach to adopting these things. Papers get rewritten (or rejected), key points get lost, and the community gets weaker, all so that some standard with a half life of 12 months can be implemented.
This might be different. I applaud people trying to solve hard problems, and this is certainly one. I do think that more of the burden should be on demonstrating that the standardization is applicable for 12 months or more AFTER final development in a given field, never mind several.
Generally, though, we shouldn't fear context. We should embrace it.
=======
Science -- Sealed, Delivered.
What is a lot harder is knowing how the data sets were measured and whether it is valid to combine them with data sets measured in other ways.
At least half the Global Warming bun-fight is about the validity of comparison between different data sets and the same goes for pretty much any non-trivial data sets.
Engineering is the art of compromise.
Excluding experimental data, those fields don't really have the problem that this guy is talking about. Perhaps someone should give him/her a lesson in the Scientific Method. Then maybe his/her words would reflect some rigour. Well, that and a link to the arXiv (http://arxiv.org/).
Why is this so? Because, these communities are so small, that just about everyone knows or knows of everyone else('s work). Of course, that's a slight hyperbole. BUT, /just/ a *slight* one.
This sort of project only really applies to the non-fundamental sciences. Not that it's not useful. Of course it'd be a good thing to get this going. But, we just have to be honest about its true scope. And of course it'd be nice if this guy would tone down the rhetoric. Coming off that naively idealistic only works against things.
But they don't have the patience and 5 years to explain to aspiring noobs.
I guess you don't have PhD students then? You should try one - mine make me think hard about things I thought I already knew.
This sort of project only really applies to the non-fundamental sciences.
And what are fundamental sciences?
I keep hearing this type of argument: (some) physicists think biology is not a fundamental science; (some) biologists think sociology is not a fundamental science... each science is fundamental to those who want to understand the phenomena that it deals with.
metageek
IAAP (I am a physicist), and agree that the model of charging researchers to access their own papers is ridiculous and broken - I submit preprints to arxiv.org in addition to print journals (everyone needs citations).
Any researcher will tell you that writing papers is a giant pain - it takes a long time which we would rather spend running experiments/simulations.
Whether they are published in open or closed journals, papers do have a useful function: they summarise the important results and (should) clearly explain the caveats and errors.
What this guy seems to be advocating is that the raw data from experiments be openly available. I work on large experiments (tokamaks), where diagnostics are one-offs, built specifically for that experiment. The data is full of errors and subtleties which only those most familiar with the experiment can assess properly. For this reason any papers to be published externally are first thoroughly reviewed internally to ensure that the data has not been misinterpreted.
Whilst freely publishing the resulting papers is a Good Thing (TM), freely allowing access to the raw data is not.
Does anyone, -- I mean there's me obviously -- think that the way the structure of the articles doesn't, in the sense that it's sort of an exact word for word -- transcription of someone *speaking* -- is extremely jarring when you see it -- by that I mean in the written form?
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
Research data is typically large. In the mid-late 90s I recall a researcher planning to move 10 TB of data internationally. It wasn't exactly unprecedented either. The internet was simply not capable of such a transfer. Eventually they had to ship it on many disks.
The problem with such raw data, e.g. from a radio telescope, is that you need all of it; you can't really cut any out before it's even processed.
This is a lot less of an issue today with research networks all hooked into multi-gigabit pipes. But there are still very large datasets researchers are attempting to work with that are simply not cheap to handle.
I think this is a great idea, and it's nice being able to share data, but as far as the really sexy big research going on these days goes, I don't see it being much of a point-click-download service!
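For a rough sense of scale on the transfer problem described above, here is a minimal back-of-the-envelope sketch in Python. The link speeds are illustrative assumptions, not figures from the comment; the calculation uses decimal terabytes and ignores protocol overhead, retries, and disk speed.

```python
# Back-of-the-envelope transfer times for a large research dataset.
# All link speeds below are illustrative assumptions, not measured values.

SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the ideal transfer time in days, ignoring overhead and retries."""
    return size_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


DATASET_BYTES = 10e12  # 10 TB (decimal), the dataset size mentioned above

# Hypothetical link speeds, in bits per second.
ASSUMED_LINKS = {
    "1.5 Mbit/s": 1.5e6,
    "45 Mbit/s": 45e6,
    "1 Gbit/s": 1e9,
    "10 Gbit/s": 10e9,
}

if __name__ == "__main__":
    for name, speed in ASSUMED_LINKS.items():
        days = transfer_days(DATASET_BYTES, speed)
        print(f"{name:>10}: {days:10.2f} days")
```

Under these assumed speeds, the slowest link would need well over a year for 10 TB, while a fully dedicated gigabit link would need a bit under a day, which fits with shipping disks having been the practical option back then.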
After logging in slashdot still does not take you back to the page you were on. It's been that way for 20 years.
Finally somebody mentioned the arxiv.
By the way, it's quite funny to see all these guys telling somebody how to do his job better, mostly when they have absolutely no idea what they're talking about.
Some nice sentences from the article:
-"It's taken me some time to learn how to read them"... what!!??
-"Because you're trying to present what happened in the lab one day as some fundamental truth", hahaaha, that one is good.
-"So what we need to do is both think about the way that we write those papers, and the words and the tone and how that really keeps people out of science. It really reduces the number of scientists". Yea. From under which rock on another planet have they taken this guy? Keeps people out of science... yes, they see equations they don't understand, and don't want to make the needed effort. I can imagine the solution is that we throw science away and begin writing "easy papers" that any analphabet can understand. That would be progress!
Now seriously, change the patent system to reward theoreticians, not only experimentalists. And make the population less ignorant!!!
Very simple:
Make a law that forces any tax funded research to end up in the public domain.
Problem solved.
We all benefit if policy is based on reality, rather than bad science or bad data. We all lose if our money is wasted based on bad science. And the policy should make everything public, as you don't know which data will affect you (and you might not be able to get the data you need for your project).
Recently, outsiders have spotted bad data from Antarctica and mistakes in Arctic ice data.
From the article, regarding scientific literature: "Because you're trying to present what happened in the lab one day as some fundamental truth. And the reality is much more ambiguous. It's much more vague. But this is an artifact of the pre-network world. There was no other way to communicate this kind of knowledge other than to compress it."
A statement like this suggests that the speaker is either unfamiliar with the way scientific data is actually turned into papers, or inappropriately optimistic about the utility of making the data "available." It is true that scientific data can be voluminous, but the overwhelming majority of papers do not "compress" data. To stretch an inadequate analogy, scientific literature is much more akin to metadata. Imagine scientific data as a large set of digitized recordings of music, all jumbled about. The paper would represent the list of song titles, artists, etc. that someone had to put together. The metadata is not so much a compression as a re-representation and categorization of the data.
As a neuroscientist responsible for sharing my results with the world, I've taken reasonable steps to ensure that all of the data used in my papers is freely available (under the Science Commons license, which I'm quite grateful to Wilbanks & co. for). Similarly, the code I wrote to extract meaningful parameters from the data and present them in an aesthetically pleasing way is also freely available. I maintain no illusions as to the utility of the database: nobody is really interested in recreating the figures in the paper from the original data, nor in reanalyzing the data. However, I do know that some of the insights I've presented have influenced those (few) that have read my papers and struggled to understand the ideas presented within.
There is nothing wrong with the idea that scientific data and biological materials ought to be readily available to those who would use them. But the notion that somehow the hard-won insights that come to those who spend years collecting and thinking about the data will somehow follow is fanciful at best. Peer-reviewed, editor-selected papers are not compressed versions that are easier to transmit, but rather the collected insights and interpretations that allow us confidence in the work we've done. So by all means, if Mr. Wilbanks can find people to pay for it, make it easy to disseminate data. Just don't be surprised to find that "decompressing" papers doesn't do all that much to advance knowledge.
Try to love the questions themselves -- Rilke
Don't feed the trolls - when an AC says something stupid, let it slide.
What about when an AC says something smart?
This issue is a bit more complicated than you think.
What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data?
The original data is of paramount importance, software for processing and analysis not so much... Science requires the ability to independently redo experiments and analyze data... getting the same result IS the method of verification that makes the "Scientific Method" valid. Getting the same result using different tools for analysis is even better... Mann's "Hockey Stick" graph is one of the failures of that system since he either can't recall which data sources he used or lost the original data... (not a problem for him since random noise conveniently generates the same hook in the graph)