Neglect Causes Massive Loss of 'Irreplaceable' Research Data
Nerval's Lobster writes "Research scientists could learn an important thing or two from computer scientists, according to a new study (abstract) showing that data underpinning even groundbreaking research tends to disappear over time. Researchers also disappear, though more slowly and only in terms of the email addresses and the other public contact methods that other scientists would normally use to contact them. Almost all the data supporting studies published during the past two years is still available, as are at least some of the researchers, according to a study published Dec. 19 in the journal Current Biology. The odds that supporting data is still available for studies published between 2 years and 22 years ago drop 17 percent every year after the first two. The odds of finding a working email address for the first, last or corresponding author of a paper also dropped 7 percent per year, according to the study, which examined the state of data from 516 studies between 2 years and 22 years old. Having data available from an original study is critical for other scientists wanting to confirm, replicate or build on previous research – goals that are core parts of the evolutionary, usually self-correcting dynamic of the scientific method on which nearly all modern research is based. No matter how invested in their own work, scientists appear to be 'poor stewards' of their own work, the study concluded."
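To get a feel for what "17 percent every year after the first two" compounds to, here is a rough sketch in Python. The 17 percent figure and the two-year grace period are taken from the summary; the function names and the starting odds of 9-to-1 are made up for illustration, and the study's own statistical model may well differ.

```python
# Illustrative only: the 17% yearly decline and the 2-year grace period come
# from the summary; the function names and starting odds are assumptions.

YEARLY_DECLINE = 0.17   # odds fall 17 percent per year (from the summary)
GRACE_YEARS = 2         # the decline starts after the first two years


def relative_odds(age_years: float) -> float:
    """Fraction of the starting odds left for a paper of the given age."""
    years_decaying = max(0.0, age_years - GRACE_YEARS)
    return (1.0 - YEARLY_DECLINE) ** years_decaying


def odds_to_probability(odds: float) -> float:
    """Convert odds, defined as p / (1 - p), back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    starting_odds = 9.0  # hypothetical: 9-to-1, i.e. a 90% chance at year 2
    for age in (2, 5, 10, 22):
        odds = starting_odds * relative_odds(age)
        print(f"{age:>2} years: odds x{relative_odds(age):.3f}, "
              f"probability {odds_to_probability(odds):.1%}")
```

Under these assumptions, a 22-year-old paper is left with only a few percent of its starting odds. Note that the summary talks about odds, not probabilities, which is why the sketch converts between the two.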
Just ask somebody to figure out how to build a battleship, or even the guns off one. Heck, you'd have trouble finding people who know the process of firing them.
Or if you prefer, Greek Fire.
Is the vernacular.
Sounds familiar!
My wife is a wildlife biologist. Her office collects raw field data all year, compiles data, runs stats, writes reports, reads reports, creates a pretty large volume of "product" every year.
I ask her who exactly reads all the required papers and reports they produce. The federal Fish and Wildlife Service demands product. State demands product. Various agencies with funding ties that would confuse anyone all demand product. The real ass-kicker? Almost none of it is actually READ by those who asked for it. The papers that are read, are rarely read by more than one person.
In the end, with thousands and thousands of offices like hers producing real scientific data, it is just too much.
The number of people consuming the product is DWARFED by those producing it. The number of people tasked to archive, organize, store, catalog, and index this torrent of information is even SMALLER than the number who consume it.
These are "real life" scientists out there every day. Now throw in academia, including "research academia".
The bottom line? A true first-world problem. We produce WAY more research than we are prepared to do ANYTHING with.
Maybe because it was posted less than 24 hours ago?
Make it publicly available instead of DRM controlled publications or services.
Seriously, Slashdot editors... is it too fucking hard to look at the news you posted *yesterday* before adding an article? Turn in your keys and ID badge.
They should post their data to Slashdot, which will duplicate that shit so many times it will never vanish.
That's why Slashdot is keen on posting all new studies at least twice, thus increasing the chances they are still available for future generations!
I've found dead links to data in peer-reviewed papers published just a week or less prior to reading them; sometimes these links were never valid to begin with.
Don't worry, Slashdot stories won't suffer the same fate as each one is duplicated later on!
Couldn't resist.
Maybe there should be an option to "ignore" an article or "report as duplicate". The second option would require someone to react to it so it may not work.
Gee, three hours to a dupe.
That has to be some kind of new record.
I do not fail; I succeed at finding out what does not work.
Slashdot is doing its part by posting the same data multiple times. Perhaps one copy will survive the test of time!
Oh. Wait. 15 hours. Maybe it's not a record after all. :P
Forgot about the 24 hour clock. :)
Dupity dupe dupe!
What data? I just need to walk outside. It's end of December in Germany and we have 6C outside. Tomorrow 12C are forecast. I doubt I will see any snow at all this year. When I was a kid, we used to build snowmen and do battle with snowballs at this time.
Right, all those almanac records have just up and disappeared. It's a "conspiracy".
Working in the field, I can pretty much state that far from enough care is taken with data archival and/or transfer to newer storage media when older ones approach obsolescence.
There's:
A: not enough staff to take care of it properly or keep a proper archival environment for the various media
B: not enough money & time to modernize the records/transfer to new media
C: sometimes not enough money to even properly maintain obsolete, long-unsupported and obscure data recording equipment
(I've seen 'rubber' pinch rollers that had turned to tar-like sludge pretty often. Still have no idea what could have caused that. It's a nightmare to clean.)
D: data recording equipment that hasn't been so much as looked at in such a long time that the last guy who knew how to even run the thing died of old age
E: recording medium with true 'shelf' lifespan far shorter than originally stated. (Cue reel-to-reel tapes delaminating, thermal graphic records bleaching/blacking out, etc. -- related to point A)
F: esoteric and variable recording methods and configurations that were not written down at the time the data was initially collected & recorded
G: outright loss and/or disposal of unique equipment due to inattentive staff or inventory management personnel / procurement personnel deciding something is useless and worthless.
Let's not even talk about accidentally overwriting data without ever realizing it (say, 'flipping' tapes in a situation where they shouldn't) because no one would actually check that the data was adequately recorded until years/decades after the fact.
Lonnie Thompson's missing ice core data, unarchived for 20+ yrs, comes to mind, among many. Catastrophic anthropogenic global warming, it's a religion. The smart money is thinking more about the probable cold years after 2018.
They're pretty good at preserving their research data these days...
https://data.csiro.au
What data? I just need to walk outside. It's end of December in Germany and we have 6C outside. Tomorrow 12C are forecast. I doubt I will see any snow at all this year. When I was a kid, we used to build snowmen and do battle with snowballs at this time.
That's very interesting! We had record low temperatures here a couple weeks ago. Colder than I've ever experienced in my life and I've been living here for 30 years. Exciting times! But, unfortunately two data points is not enough to make any kind of conclusion about changes in the global climate. I think you may be confusing meteorology and climatology. We need lots of data to examine climate change, which has been collected for that very reason. It'd be a shame to lose it.
That's very interesting! We had record low temperatures here a couple weeks ago. Colder than I've ever experienced in my life and I've been living here for 30 years. Exciting times! But, unfortunately two data points is not enough to make any kind of conclusion about changes in the global climate.
Record cold here, too. Now we have three points, and it's two to one in favor of global cooling. Woot woot!
We need lots of data to examine climate change, which has been collected for that very reason. It'd be a shame to lose it.
Don't fear. If we lose any real data, the atmospheric modelers will happily create new old data.
Maybe they do it so you can use your mod points on one of the posts and make comments on the dupe.
It's sad the misunderstanding of climate science that your post demonstrates. Modelers don't create data (at least not the data you're thinking about), they compare their model output to real world data to understand how well they model the real world.
And look up the word "hindcast" if you think modelers don't create "old" data.
Think of all the family photos that will get deleted or destroyed by hardware failure, and to think I have family photos (on film) from over 100 years ago.
"If any question why we died, Tell them because our fathers lied."
When a researcher (a postdoc, say) leaves the typical university, her web page gets shut down and her email account deleted. Researchers tend to keep links to their papers, data, and open-source code on these web pages. But university IT departments tend to be super conservative. If the person is gone, so's the data.
It frustrates me a lot, actually. I think there needs to be a new role in IT: the librarian-archivist: someone who is dedicated to keeping data alive. Exiting researchers could apply to have their web pages frozen (optionally with a forwarding URL), and the IT librarian's job would be to review these applications and do the work necessary to keep these pages alive indefinitely. It's all static content, so it's not that hard aside from the storage problem.
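For what it's worth, one small piece of that librarian-archivist job could be sketched like this: record a checksum for every file in a frozen page, then re-check later to catch silent loss or corruption. This is only a minimal sketch in Python; the function names and the idea of keeping the manifest as a plain dictionary are assumptions for illustration, not how any particular university or repository does it.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files from the manifest that are now missing or have changed."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

The archivist would call build_manifest once when the page is frozen, store the result somewhere safe, and run changed_files on a schedule. Reading whole files into memory is fine for typical static pages but would need chunked hashing for large datasets.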
Additionally, even if the institution doesn't want the exiting researcher to send from her institutional email address any longer, the institution could still forward the email to a new account.
I'm perfectly aware of what hindcasting is. The results of a hindcast are never presented as real world data.
Maybe designating the Library of Congress as a repository for scientific data would work. They're pretty good at archiving stuff.
Part of the problem with corresponding with authors of papers more than 2 years old is that there is no good way to uniquely identify an author. If you know that you are interested in a "John Smith" who wrote a Nature paper in 1989, good luck figuring out which "John Smith" is the same one today (if he is still alive). Another good example is how many papers are by "Z Huang": currently over 6,000 in PubMed.
Considering how we expect researchers to change institutions multiple times in their careers in order to advance, this only becomes a more difficult problem over time.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
The results of a hindcast are never presented as real world data.
A. That you know of.
B. There was a conditional clause involved that included a complete loss of valuable real data. If the data was valuable and there is a model that can recreate it, it can be done.
C. If you know about hindcasting, then you know that modelers, as a regular course of business, create "new old data" which they then compare to the real old data. Saying "modelers don't create data" is wrong; they don't routinely or honestly create what they will call real data.
D. Even that last statement isn't totally true. In ocean modeling it is not unheard of for a model output (created data) to be used to correct some real-world measurements for parameters that cannot easily be measured. For example, if tide level data is required from a place that doesn't have a tide gauge, the modeled tide level from a validated model may be used. This leads to second generation "real" data that is directly dependent upon model output.
E. And again, "whoosh". It was a joke. "Woot woot!"
One thing that I lament about scientific publications is that the results are boiled down to a few pages. You rarely see raw data, and generally only the statistical analysis. I would like to see web links in journals that include more of the raw data, the programs that generated that data, etc. We live in a day and age when gigabytes are cheap. It would be a lot easier to duplicate someone's work for peer review if the inherent data & analysis programs were more accessible. Although, there are a fair number of organizations that have no interest in making their data easier to understand because of commercialization and patent issues.
I for one see a lot of EE/CS papers that are devoid of source code. Source code is cumbersome to print, which is why I think it's rarely included, as it would take up too much paper. I do think the inclusion of source code facilitates a better understanding of the authors' intent. I would love to see CS papers include hyperlinks to a database hosted by the journal publisher as a new standard in the "information age".
"Research scientists could learn an important thing or two from computer scientists,..."
What is the error bar on "a thing or two"?
As someone with a foot in each camp, I believe it's more like fifty or a hundred. The methods of scientists regarding computing are often built of slow evolutionary changes upon old familiar methods, while incorporating selected cutting edge hardware or algorithms. It is partly the nature of some science projects to carry out observations over many years, ideally with the same instruments, processing and management. In academic computer science, as well as real world IT, all layers and all aspects of any large system are always changing over time. ("All" = 100% give or take a few %) (And yes, some days, it does seem like over 100%)
It's 6C in Warsaw right now... and last year we'd had snow for two months by this time.
A handful of data points does not a trend make.
https://en.wikipedia.org/wiki/John_Lott#Disputed_survey
Disputed survey
In the course of a dispute with Otis Dudley Duncan in 1999–2000,[55][56] Lott claimed to have undertaken a national survey of 2,424 respondents in 1997, the results of which were the source for claims he had made beginning in 1997.[57] However, in 2000 Lott was unable to produce the data, or any records showing that the survey had been undertaken. He said the 1997 hard drive crash that had affected several projects with co-authors had destroyed his survey data set,[58] the original tally sheets had been abandoned with other personal property in his move from Chicago to Yale, and he could not recall the names of any of the students who he said had worked on it. Critics alleged that the survey had never taken place,[59] but Lott defends the survey's existence and accuracy, quoting on his website colleagues who lost data in the hard drive crash.[60]
Perhaps this is an opportunity for journals to update their business models?
Warehouse and convert data, as well as curate contact lists for papers.
Were that I say, pancakes?
Oops, I guess I fell victim to Poe's law.
But when you get down to it pretty much everything in science is a model of the real world in one way or another.
... are condemned to repeat it.
Tell that to the warmists during a heat wave or a hurricane. Both are considered indisputable proof of global warming. There is indeed a double standard.
Some people think that global warming is related to solar flares, not human activity.
I think it is really convenient to not give direct access to raw data. Thus your claims are harder to verify. Just in case, I'm sure they have a subset ready that will work perfectly with their results
I'm late to the party here, but I thought it was worth mentioning that the Purdue University Research Repository (https://purr.purdue.edu) is designed as a Trusted Digital Repository for research data. The default lifetime is 10 years, but the Purdue Libraries will add noteworthy datasets to its permanent digital collection after their default lifetime expires. (And yes, I am a programmer on the project.)
"Display some adaptability" -- Doug Shaftoe, _Cryptonomicon_