Freeing and Forgetting Data With Science Commons

Posted by Soulskill on Friday February 20, 2009 @02:59PM from the bringing-it-all-together dept.

blackbearnh writes "Scientific data can be both hard to get and expensive, even if your tax dollars paid for it. And if you do pay the big bucks to a publisher for access to a scientific paper, there's no assurance that you'll be able to read it, unless you've spent your life learning to decipher them. That's the argument that John Wilbanks makes in a recent interview on O'Reilly Radar, describing the problems that have led to the creation of the Science Commons project, which he heads. According to Wilbanks, scientific data should be easy to access, in common formats that make it easy to exchange, and free for use in research. He also wants to see standard licensing models for scientific patents, rather than the individually negotiated ones that now make research based on an existing patent so financially risky." Read on for the rest of blackbearnh's thoughts. "Wilbanks also points out that as the volume of data grows from new projects like the LHC and the new high-resolution cameras that may generate petabytes a day, we'll need to get better at determining what data to keep and what to throw away. 'We have to figure out how to deal with preservation and federation because our libraries have been able to hold books for hundreds and hundreds and hundreds of years. But persistence on the web is trivial. Right? The assumption is well, if it's meaningful, it'll be in the Google cache or the internet archives. But from a memory perspective, what do we need to keep in science? What matters? Is it the raw data? Is it the processed data? Is it the software used to process the data? Is it the normalized data? Is it the software used to normalize the data? Is it the interpretation of the normalized data? Is it the software we use to interpret the normalization of the data? Is it the operating systems on which all of those ran? What about genome data?'"

Comment removed by account_deleted · 2009-02-20 15:15 · Score: 4, Informative

Comment removed based on user account deletion
Re:What's most important to keep. by MoellerPlesset2 · 2009-02-20 15:27 · Score: 4, Insightful

How can the results be reproducible if you don't keep the original data?
The relevant results are supposed to be included in the paper, as well as the information necessary to reproduce the work. Most data doesn't fall into that category.

To make an analogy the computer geeks here can relate to: All you need to reproduce the output of a program is the source code and parameters. You don't need the executable, the program's debug log, the compiler's object files, etc, etc.

The point is you want to reproduce the general result. You don't usually want to reproduce the exact same experiment with the exact same conditions. Supposedly you already know what happens then.
Re:What's most important to keep. by repepo · 2009-02-20 15:28 · Score: 3, Interesting

It is a basic assumption in science that given some set of conditions (or causes) you get the same effect. For this to happen it is important to properly record how to set up the conditions. This is the kind of thing that scientific papers describe (in principle at least!).
Re:What's most important to keep. by mako1138 · 2009-02-20 15:47 · Score: 5, Insightful

Let's say the LHC publishes its analysis, and then throws away the data. What happens when five years later it's discovered that a flawed assumption was used in the analysis? Are we going to build another LHC any time soon, to verify the result?
For a billion-dollar experiment like the LHC, that dataset is the prize. The dataset is the whole reason the LHC was built. Physicists will be combing the data for rare events and odd occurrences, many years down the road.
What's the goal, really? by Rostin · 2009-02-20 16:24 · Score: 4, Insightful

I'm a working scientist (ok, PhD student), so I read journal articles pretty often. I can understand the rub in principle, but let's say that we come up with some way for all scientific data to be freely shared. So what? In almost all cases, the only people who actually benefit from access to particular data are a small handful of specialists. Could someone explain to me why this is a real problem and not just something that people with too much time on their hands (and who would never actually read, let alone understand, real research results) get worked up about?
It reminds me of the XKCD this morning...
1. Re:What's the goal, really? by Beetle B. · 2009-02-20 18:13 · Score: 3, Insightful
  
  Typical comments from someone in the first world.
  First, just on the side, I know lots of people who got PhDs but did not really stay in research and academia. They still want to read papers, though, as they still maintain an interest.
  But the main benefit of opening up journal papers is for the rest of the world to benefit. Yes, if you have a very narrow perspective, you could just dismiss that as charity. If you're open minded, you'll realize that shutting out most of the world from scientific output means much less science globally, and far fewer benefits to you as a result.
  Imagine if all researchers in Japan published papers only in Japanese, and the journals had a copyright condition that prevented the content from ever being translated to another language, and you'll see what I mean. Whereas current journals require a lot of money for access, these ones also have a price: Just learn Japanese. It's not exactly promoting science.
  Then again, of course, journals do need a base amount of money to operate. It's just that companies like Elsevier charge so much more than is needed to make a profit.
  
  --
  Beetle B.
2. Re:What's the goal, really? by smallfries · 2009-02-20 23:22 · Score: 3, Informative
  
  Trickle-down. Dissemination of knowledge.
  You don't know it yet (not meant as a jibe but it is something that clicks in after your PhD) but your primary function as a scientist is not to make discoveries. It is spreading knowledge. Sometimes that dissemination will occur in a narrow pool, through journal papers between specialists in that narrow pool of talent.
  This is not the primary goal of science, although it can seem like it when you are slogging away at learning your first specialisation well enough to get your doctorate. Occasionally a wave from that little pool will splash over the side - maybe someone will write a literature review that is read by a specialist in another field. A new idea will be found - after all sometimes we know the result before we know the context that it will be applied to.
  The pools get bigger as you move further downstream. Journal articles pass into conference publications, then into workshops. Less detail but carried through a wider audience. Then after a time, when the surface seems to have become still, textbooks are written and the knowledge is passed on to another generation. We tend to stick around and help them find the experience to use it as well. This is why all PhD students have an advisor to point out the best swimming areas.
  That was the long detailed answer to your question. The simple version is that you don't know who your target audience is yet. And limiting it to people in institutions that pay enormous access fees every year is not science. As a data-point - a lot of European institutes don't bother with IEEE fees. They run to about £50k/year which simply isn't worth it. As a consequence results published in IEEE venues are cited less in Europe. So even amongst the elite access walls have an effect.
  
  --
  Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php
Re:What's most important to keep. by Mr Z · 2009-02-20 16:24 · Score: 5, Interesting

With a large and expensive dataset that can be mined many ways, yes, it makes sense to keep the raw data. This is actually pretty similar to the raw datasets that various online providers have published over the years for researchers to datamine. (AOL and Netflix come to mind.) Those data sets are large and hard to reproduce, and lend themselves to multiple experiments.
But, there are other experiments where the experiment is larger than the data, and so keeping the raw data isn't quite so important as documenting the technique and conclusions. The Michelson-Morley interferometer experiments (to detect the 'ether'), the Millikan oil-drop experiment (which demonstrated quantized charges)... for both of these the experiment and technique were larger than the data, so the data collected doesn't matter so much.
Thus, there's no simple "one size fits all" answer.
When it comes to these ginormous data sets that were collected in the absence of any particular experiment or as the side effect of some experiment, their continued existence and maintenance is predicated on future parties constructing and executing experiments against the data. This is where your LHC comment fits.

--
Program Intellivision!
Re:What's most important to keep. by oneiros27 · 2009-02-20 16:52 · Score: 4, Insightful

Let's stop right there. There are no general lessons to be had from the LHC. It's an exception, not the rule. First: 99.9% of scientists are not working at LHC, or any other billion dollar, world-unique facility. They are working in ordinary labs, with ordinary equipment that's identical or similar to equipment in hundreds of other labs around the world.

There are two types of science. What you're referring to is called 'Little Science' (not to be derogatory), but it's the type of thing that a small lab can do, with a reasonable amount of funding. And then there's what we call "Big Science" like the LHC, Hubble Space Telescope, Arecibo Observatory, Large Synoptic Survey Telescope, etc.

Second: Primary data, actual measurement results, are already kept, as a rule.

I wish. Well, okay, it might be kept, but the question is by who, and have they put it somewhere that people can analyze it?
I was at the AGU last year, and there was someone from a solar observatory that I wasn't familiar with. As I do work for the Virtual Solar Observatory, I asked them if we could put up a web service to connect their repository to our federated search. They told me there was no repository for the observatory -- the data walks out the door with whoever the observer was.
Then there's the issue of trying to tell from the published research exactly what the original data was. But then, I've been harping on the need for data citation for years now ... it's an issue that's starting to get noticed.

Third: The vast majority of experiments are never ever reproduced to begin with. You're lucky enough to get cited, really. Most papers don't even get cited apart from by those who wrote them.

For the type of data that I deal with, none of it is technically reproducible, because it's observations, not experiments. And that's precisely why it's important to save the data.

Fourth: Very little science is done by re-interpreting existing results. That only applies to the unique cases where the actual experiment can't be reproduced easily.

In your field, maybe. But we have folks who try to design systems to predict when events are going to happen and need training data. Others do long-term statistical analysis with years or decades of data at a time. Still others find a strange feature that hadn't previously been identified as important (eg, coronal dimmings) and want to go back through all of the data to try to identify other occurrences.

--
Build it, and they will come^Hplain.
Re:Again with the IP by wisty · 2009-02-20 17:54 · Score: 3, Insightful

There is a rumor that Newton meant it as an insult to Hooke. Newton had refined Descartes' wave theory, while Hooke had backed the corpuscular theory. Also, Hooke was a short man.
Re:not results- grant dollars by smallfries · 2009-02-20 23:09 · Score: 4, Insightful

What incentive does a massive industry have to solve cancer, when it would put them out of business? Tens of thousands of people have dedicated most of their adult lives, usually to studying specific mechanisms and biological functions so narrow that if cancer were cured tomorrow, they would be useless- their training and knowledge is so focused, so narrow- they cannot compete with the existing population of researchers in other biomedical fields. Journals which charge big bucks for subscriptions also would be useless. Billions of dollars of materials, equipment, supplies, chemicals- gone. "Centers", hospitals, colleges, universities which each rake in hundreds of millions of dollars in private, government, and non-profit sourced money would be useless.
That's an old argument and although it sounds reasonable it is completely unsound. An industry does not function as a single cohesive entity with wants and desires. It is composed of many different individuals with their own wants and desires.
I know enough academics to say for certain that if any one of those individuals could discover a cure that would put their entire employer out of business then they would leap at the chance. The fame that would follow would make another job easy enough to get, and the recognition is what they're really in it for anyway.

--
Slashdot: where don knuth is an idiot because he cant grasp the awesome power of php