Scientific Data Disappears At Alarming Rate, 80% Lost In Two Decades
cold fjord writes "UPI reports, 'Eighty percent of scientific data are lost within two decades, disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated. The finding comes from a study tracking the accessibility of scientific data over time, conducted at the University of British Columbia. Researchers attempted to collect original research data from a random set of 516 studies published between 1991 and 2011. While all data sets were available two years after publication, the odds of obtaining the underlying data dropped by 17 per cent per year after that, they reported. "Publicly funded science generates an extraordinary amount of data each year," UBC visiting scholar Tim Vines said. "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.' — More at The Vancouver Sun and Smithsonian."
And in 20 years, these results too shall be lost.
does it reappear?
That's okay, the NSA has a backup.
Trying to ignore that a paper about the unavailability of scientific data is locked behind a paywall.
This is nothing new, though. I do occasional conversions from ancient data formats, and people need to pay better attention. Imagine trying to read an 8" CP/M floppy today.
As libraries move to digital storage rather than the dead tree that's been fine for thousands of years, they are inviting a catastrophe, possibly only one well-aimed coronal mass ejection away from massive data loss.
By 2030, there won't be any left! We must act now!
So the institutions do not have any data lifecycle management for research data. Are we supposed to be surprised? Ensuring that data are not lost is a huge undertaking and cannot be left to the individual researcher. It may also require a change in the research culture at many institutions. As long as research is measured by the publications, that is where the resources go and where the focus will be.
Will this change? Probably not.
This is bang on. As a system administrator for a STEM department at a Canadian institution, my budget is 0 for data retention. Long term data retention is just not in the mindset of researchers.
...100% is retained for 2 years, and 17% is lost every year after that, then after 20 years I get about 3.5% of the data still being accessible, not 20%. WTF? Or did someone lose the data for this study, and the article is really just a guess?
... poorly collected unreliable data also vanishes at at least the same rate (hopefully faster). And assuming shoddy data disappears faster than good data, then the quality of available data should continually increase.
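The arithmetic in this comment can be checked with a short sketch. It assumes, as the comment does, that a fixed 17% of whatever remains is lost each year after a two-year grace period. Note that the summary says the "odds" dropped by 17 per cent per year, which is not the same thing as the probability, so a second helper shows that reading as well. The function names and the starting probability passed to the second helper are made up for illustration and do not come from the study.

```python
def remaining_fraction(years_since_publication, grace_years=2, annual_loss=0.17):
    """Fraction still available if a fixed share of the remainder
    is lost every year once the grace period is over."""
    decay_years = max(0, years_since_publication - grace_years)
    return (1.0 - annual_loss) ** decay_years


def probability_from_odds_decay(start_probability, years, annual_odds_drop=0.17):
    """Probability of availability if the ODDS (p / (1 - p)), rather than
    the probability itself, shrink by a fixed share each year.
    start_probability is an assumed illustration value, not study data."""
    odds = start_probability / (1.0 - start_probability)
    odds *= (1.0 - annual_odds_drop) ** years
    return odds / (1.0 + odds)


# The commenter's reading: 18 years of 17% loss after 2 free years.
print(f"{remaining_fraction(20):.3f}")  # 0.035, i.e. about 3.5%

# The odds-based reading with an assumed 90% starting probability.
print(f"{probability_from_odds_decay(0.9, 18):.3f}")
```

The two readings give quite different answers over 18 years, which may account for at least part of the gap between 3.5% and the headline figure.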
... at the NSA?
I'm a researcher and I don't have time or space to keep old data as I'm generating too much new data. We work hard to maximize the use of these data and analyses when we write and publish papers. If this was talking about the papers (or presentations), that were the product of the data, being lost at this rate it would be one thing, but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities. This just seems like ammunition for the climate change deniers to bitch about. It's unreasonable to keep the old data indefinitely without a massive public repository that will be poorly indexed and organized.
subject to change like the 'weather' & everything else in time space & circumstance. unperfectness abounds what a gig. free the innocent stem cells
I think it is ridiculous that Slashdot keeps posting articles that are behind paywalls. How the hell are we supposed to see them? Do you expect us to pay for subscriptions to services we'd only use once? You, OP, are out of your mind. Articles such as this should be rejected, as most users, if not all, can't even access the story. This site really has gone downhill in the last few years, overpopulated with clueless simpletons, frauds, so-called armchair IT experts and -obvious- subscription-pushing trolls.
Many things are based on this data... and when the data is gone it cannot be audited which makes it impossible to verify the finding of the data which is later simply referenced... but the data upon which it is based... *poof*
This practice also gives free rein to fraudsters, because if you don't catch them quickly they can claim the data was just in their other pair of trousers.
I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
Any organization worth its salt has vast libraries of data that go back many decades. Not all 'institutions' are so poorly run that data from 20+ years ago cannot be accessed. Must be a 'Canadian' thing... not that we'll know, since the story is behind a bloody paywall.
"Will this change? Probably not."
I have access to vast libraries of data that date back 30+ years; some datasets (this includes computer software too) date back to the early 70s. Why? Because these institutions/corporations were organized, they knew that retaining data is important, and they kept up with technology to ensure that no data is lost. There is no excuse to lose vast amounts of data. The only excuse for not retaining such data that I can think of is cost. The longer you leave datasets rotting away on old tapes, disks and hard drives, the harder they become to salvage, and finding people who are experts at retrieving data from old media gets harder and more expensive.
Publish under GPL license and save it forever.
"I'm a researcher and I don't have time or space to keep old data as I'm generating too much new data."
Well, if that isn't the silliest thing I've ever read. There's no excuse for not retaining data, no matter how large the sets may be. Storage in 2013 is incredibly cheap, and there are many different systems with incredible amounts of storage space you could use to back it all up on, but I figure this is more of a financial reason than your 'I'm generating too much data' nonsense.
"but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities"
I seriously doubt you are a researcher of any kind based on the quote above. It doesn't really matter about the 'context' or 'poorly documented technicalities' as you so elegantly put it. You cannot just assume that if someone were to pick up your data they won't understand the 'context'. That is ridiculous. It's all to do with unorganized researchers/institutions and money.
"This just seems like ammunition for the climate change deniers to bitch about."
"climate change deniers". very amusing. if you want your data and research to stand up to scrutiny then keep all your datasets. what have you to hide? are you hiding the fact that you became a climate researcher so can you stick your hand out for free research money while producing data that is laughable? I have a feeling that's the case :)
... wait what was it again ... it's gone!
that I used for my paper 15 years ago. It is on a tape, that is somewhere in a drawer, that I have no tape drive for. On the other hand, the LaTeX file and the C and FORTRAN programs I used to evaluate and create the data and write the paper are still on a hard drive that is running on a computer in my network and I can access it right now. I probably can't compile the program without changes (it was written for Solaris and DEC machines) and maybe not even run LaTeX on it without getting some of the included styles, but still it is there.
Since my work was in theoretical physics and numerical, the loss of the raw data is probably not as bad as long as you still have the software, but I guess for an experimental physicist it would be a much greater problem to keep the massive amounts of data they sometimes have and, if it is lost, to reproduce it.
***Quis custodiet ipsos custodes***
Whichever side of the "data is" vs. "data are" argument one falls on, I hope we can all agree that mixing both forms within the same sentence is definitely wrong.
Some idiot sub-editor wrote a misleading figure caption here. The article (which I've read) says nothing about how data is lost with age. It only says something about how much data is lost for papers of a given age as of now.
In other words it does not mean that in 10 years time, 10 year old papers will have such drastic data loss. The world 20 years ago was a very different place in terms of communication, scientific practice, and data storage than it was 10 years ago or is now.
The Slashdot article repeats the fallacy by saying "scientific data disappears". No it doesn't. Some has disappeared, but the paper cannot say anything about whether it is still disappearing.
Come back in 10 years time for that conclusion.
a) because it's behind a paywall; and b) how can the original data even hope to be located when a majority of the population can't even read the paper?
Defund the NSA, kick them out of the Utah data center - and do something useful with it. Like giving all the lost data a permanent home.
smilies are for reetards
> many other data sets are expensive to regenerate...
Or maybe impossible to regenerate (for certain values of impossible). I remember reading a classified technical report (dating from the 1940s) related to military life-jacket development, wherein the question arose as to whether a particular design would reliably turn an unconscious person face-up in the water. The experimental design used was to dress some servicemen (sailors, possibly, but I don't recall) in the prototype design, anaesthetise them and drop them in a large body of water, checking for face-down floaters to disprove the null hypothesis. Somehow, I don't think that those data are going to be regenerated any time soon. I hope to God not, anyway.
NSA has backups :p
This sounds like the sort of "big problem" Google would love to tangle with, considering their mission statement.
I'm....losing...my..mind..Dave......Dave....Would you like me to sing a song?
My ism, it's full of beliefs.
The very fact that "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.", makes me wonder if this could even be considered "scientific data" anymore. Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process. Given that, should the lack of reproducibility mean that lost scientific data should be redefined as experimental data or hypothesis data? It also brings up the idea in my mind that scientific data has a half life since it can degrade back to hypothesis or experimental data if not properly stored.
And keep proven logs to show it is not tampered with?
Will you pay for the researchers to keep the data forever? Will you insist that they stop researching anything new because the data storage exponentiates and the old stuff will need moving to new media, checking and eventually more work goes into looking after the media than on archiving new stuff?
Will you accept higher taxes to pay for this, and taxes that increase year-on-year exponentially to cover it?
No?
Then you're going to "lose" data.
Reproduction of results isn't "add the numbers that they produced to see if they sum to the value they said it did". That isn't replication of science.
Since the science is supposed to be repeatable and the paper (if valid science, not pseudoscience bollocks) contains enough information to do the assessment again (e.g. like a patent is supposed to), then you MUST consider it BETTER to redo the experiment, collect your OWN data and see if the data fits the result of the previous paper.
What if, for example, there was a bias on the original potentiometer, making all voltages appear different from what they are? The result would be WRONG, but your method of "redoing the experiment" would NEVER show this. Doing the experiment again and producing your OWN data would.
I mean, I would like to check that you do find more data, so you have it, right? The raw data?
Is it torrented?
And the programs for manipulating, are they available too? And the results from it?
That makes 2x as much data you have.
Of course, if I reanalyse it, if I have any data, I now must archive it. 3x.
If anyone else wants to recreate it... 4x
Alternatively, for the cost of 20 years' storage, it may be possible to redo all the measurements with UAVs and nanobots in the future. And for less than the cost of 30 years' storage...
The Long Now Foundation has devised an interesting mechanism for storing important information which, although not optimal for machine readability, is dense and has an obvious format: a metal disk etched with microprinting, whose exterior shows text getting progressively smaller as an obvious way of saying "look at me under a microscope to see more":
http://rosettaproject.org/
I highly recommend reading The Clock of the Long Now if you're interested in the theory and practice of making things last.
Koans and fables for the software engineer
Given the responses you've got, looks like you nailed it.
"Oh, how arrogant!" from one poster. From another "You don't know anything". Isn't that second one arrogant?
Note too how they claim you don't know what you're talking about but even insist they have no method to know better.
So looks like you nailed it.
Before science gets hot and bothered about the loss of data scientists need to do something about the quality of the data they produce to begin with. Frankly given the complete lack of quality controls that a lot of scientists use the loss of their data is probably for the best. Depending on the field as much as 60% of all scientific research cannot even be reproduced. Work that cannot be reproduced by another team is far from isolated to one field either:
http://online.wsj.com/news/articles/SB10001424052970203764804577059841672541590
http://www.popsci.com/science/article/2013-05/half-cancer-scientists-have-been-unable-reproduce-studies-survey-finds
http://www.slate.com/articles/health_and_science/science/2012/08/reproducing_scientific_studies_a_good_housekeeping_seal_of_approval_.html
https://www.xsede.org/gateways-for-open-science
http://www.eusci.org.uk/articles/data-doesnt-lie-scientists-do
Depending on the study, that means that either the data has been fabricated by unethical scientists, or the data has been misrepresented for political purposes. Studies are often improperly interpreted by failing to take into account sound statistical modeling, and noise is reported as science. In some fields politics has effectively taken over (e.g. social sciences) and standards are used that would never be tolerated in other scientific fields.
The very culture of science that demands quantity over quality needs to change, as it is the rat race that inspires junk science to begin with. I can't think of any other field where those kinds of failure rates in the reproducibility of your work would do anything other than get you fired for fraud and destroy your career. I like science, I have since I was a young child, but the junk that's getting labeled as science doesn't deserve the label.
Except that as mentioned in TFA, many data sets are unique to a time and place, and thus can never be replicated. They may for example reflect the social temperaments, behaviors, or material/physical qualities of a particular population at a particular point in time.
When you say "put the data into the public", how much storage space and how does it get there?
Will you pony up storage and taxes for this?
Will you ask the same of the "private" data of corporations that rely on government largesse to exist?
And when it's passed to the public, on a thousand servers, how do you know if the one you happened to get to first is genuine or been fudged by someone with an agenda against the science? Do you think AIG would mirror honestly the genetic proof of evolution?
And at what point is it no longer the science institution's requirement to pass this data to the public? Because until then, you'll still need to pay for that access and storage. Then what's to stop every public copy being deleted because nobody cares any more? "The public" won't change storage media and verify contents forever, you know.
It's very easy to claim as you have done, but what do you mean by it?
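On the "is the mirror I happened to reach genuine?" question raised above, one common approach is for the originating institution to publish a cryptographic digest of the data set, so that any mirrored copy can be checked against it. Below is a minimal sketch using SHA-256 from Python's standard hashlib module; the function names and the sample bytes are hypothetical. It only detects changes relative to the published digest, and it does not settle who you trust to publish that digest in the first place.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Hex SHA-256 digest of a data set's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_copy(data: bytes, published_digest: str) -> bool:
    """True if a mirrored copy matches the digest the source published."""
    return sha256_digest(data) == published_digest


# Hypothetical example: the source publishes the digest once...
original = b"station_id,year,temp_c\n42,1991,11.3\n"
published = sha256_digest(original)

# ...and anyone can later check whichever mirror they downloaded from.
honest_mirror = original
fudged_mirror = original.replace(b"11.3", b"10.3")

print(verify_copy(honest_mirror, published))  # True
print(verify_copy(fudged_mirror, published))  # False
```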
Universities should band together to distribute all data from published material on P2P networks so it's redundantly stored at multiple locations. This has the side benefit of making a legitimate use of P2P obvious.
Higher Logics: where programming meets science.
Some years ago I picked up a copy of "Dark Ages II -- When the Digital Data Die" by Bryan Bergeron (2002) but have only now gotten around to finishing it (for some reason I never got past the first chapter at the time). When I bought it I had just had my own experience with the not-so-long life of digital data (some CDs I'd burned a few years earlier were already unreadable). The book's a bit dated (it says that there are many people out there with Zip drives connected to their PCs) as, obviously, technology marches on, leaving older media in the dust, but that's the point of the book and the ideas are still relevant. Worth looking for at your public library if you're still of the mind that a digital format is superior to everything else for long-term storage. Personally, I think we're looking at trouble if everything's converted to bits on the assumption that it'll always be available. Continued access to one of those aforementioned 8" CP/M floppies is a good example. My failed CD-Rs are another.
CUR ALLOC 20195.....5804M
The InterPARES Project
The International Research on Permanent Authentic Records in Electronic Systems (InterPARES) aims at developing the knowledge essential to the long-term preservation of authentic records created and/or maintained in digital form and providing the basis for standards, policies, strategies and plans of action capable of ensuring the longevity of such material and the ability of its users to trust its authenticity. The findings and products of the first three phases of the project can be found on this website.
Out of mind, out of sight, gone forever
much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
slashdot used to purge -1 and 0 rated comments from old stories. "So what?", you say. "Why should they store goatse links and ascii art penises?" But before the misnamed lameness filter, there was a vibrant troll culture. These were works of art that spawned adequacy.org and had a lot of time, creativity, and effort put into them. Much more interesting than the "linux good, microsoft bad" groupthink that made it to +5 informative and wasn't purged.
Do you even lift?
These aren't the 'roids you're looking for.
As a former paper industry professional (recycled pulp), Paper is fine except that people limit its use to readable font. That is what led to Microfiche (which is now being dumped by the truckload at recycling stations as "obsolete tech"). If you printed a hard copy of everything either to microfiche or extremely small 1-point font, you could store the data in a type of seedbank or gene bank.
A salt mine may not be appropriate, but I'd like to start a business where everyone could send their hard drives to a giant 100 year Time Capsule Vault in the Sonoran desert. We are shredding retired professors' hard drives which the professors probably would prefer to see preserved. The "half life" of privacy risk is different for different data... experiments, emails, credit card numbers, and porn browsing cookies do not pose the same posthumous risk/benefit. We are cremating too many of our future fossils.
IMHO the biggest threat to raw data is misplaced or randomized fear of privacy combined with copyright planned obsolescence (or mandated "e-waste" shredding for working tech, out of fear that poor people will misuse a display device). Certain data does need to be destroyed, and certain papers shredded. Treating all "data" as having the same expiration date has something to do with the loss of the data in the article.
Gently reply
[OP] "disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated"
Well so much for the study. Money changes everything. Eventually one hundred thousand copies of the abstract will exist on the Internet, but the authors' future descendants will find only one actual link that leads to content, which terminates at a page saying "this domain is for sale".
You'd think that even science data of extremely low bit rate such as original weather station temperature data should be out there somewhere. A lot of other people did too... but all that is available now might be "value added" adjusted data. Not an evil conspiracy per se, it's human nature at its best and worst.
A handy chronology of the history of data retention:
[2500BC] "King Fuckemup boldly slew the enemy and I, Scribe Asskissus hath inscribed it in stone. He is an asshole who owes me back wages."
[1500] "With quivering quill I will write mine own data."
[1866] "Data published at great expense into leather-bound volumes. Dust sold separately."
[1970] "This is really important. we should print it and store it in a binder."
[1971] They didn't.
[1983] "I'll write it to floppy disk with a notsosticky label"
[1985] "After a long and desperate search, the label has been found!"
[1987] "Unlabeled floppy disk keeps coffeemaker level."
[1995] "Roxio CD storage is forever, and Real Scientists don't close their data sessions."
[2003] "Microsoft Word has experienced a problem updating from an older document format and will now close. Save your work as soon as possible."
[2005] "I'll just email it to myself and shut the computer off immediately, then pick it up at work."
[2009] "Yes, three copies! In the safe. There was a fire. Yes, inside the safe. It was a fireproof safe, so no one noticed."
[2010] "This is really important. I should print it and store it in a binder. But my ink cartridge is dry."
[2013] "Our data has been uploaded to the Cloud where it will live forever."
[2500] "King Grapeape slew the primitive humans and buried their statue on the beach. I, Scribe Anthopoapologus hath inscribed it in stone."
Perhaps the most mystifying data retention escapade of Modern Times is the missing Apollo 11 SSTV moon tapes, which contained a multiplexed stream of raw telemetry and the original slow-scan TV signal broadcast from the moon. Not 'missing' really; rather, we know they were re-used and recorded over because everyone assumed it was someone else's job to ensure that at least one copy was in a safe place. While the earth station operators dutifully sent their tapes to NASA, where the sharpest signal of the moon landing was sure to be preserved for posterity (not), fortunately there were some librarians on duty, and you can acquire DVDs of the moonwalk with better quality than the recordings you've seen in countless movies -- an 8mm film camera pointed at an original SSTV monitor at Honeysuckle Creek, and the best quality scan-converted version.
In the Foundation series, Asimov envisioned Gaia, a world in which a telepathic network of sentient (and sensuous) beings kept a 'working set' of retrievable data in-memory -- but also, via access to progressively less sentient and non-sentient objects such as plants and even rocks, a vast archive. Ask the mountain, it will answer in time, a long time.
Our own Earth has a Gaia storage mechanism, a record of its magnetic field over geologic time stored as polarization in crystallized lava flows. But it i
<blink>down the rabbit hole</blink>
I once read a story, don't know if it's true, that the team that discovered those Yttrium-Barium-Copper oxide high-temperature superconductors had made a silly mistake in sending their very important breakthrough paper for peer review: they had (by accident of course) changed every Y for Yttrium into Yb for Ytterbium.
;-)
The peer review process took *ages*. Eventually the paper was accepted. A quick erratum "change Yb to Y everywhere. oops. our secretary made a typo."
The Nobel prize the very next year!!
Meanwhile, several large competing labs in the world had been buying Ytterbium like there was no tomorrow and writing articles about experiments in superconductivity with Ytterbium (which doesn't work)
The NIDDK was aware of this years ago and had commissioned a feasibility study on creating a storage mechanism that all grant-funded research would have to use. Unfortunately, after a successful feasibility study, the reviewers for the follow-up real grant responded with "I do not see the scientific value of this research" and the grant went away with Vanderbilt as the only applicant. I've heard through the grapevine that someone picked up a new similar grant to work on it, but I haven't seen anything from it yet. The big problem is that researchers do not want to share their unpublished research. From what I've gleaned they want to keep things in their back pocket for future grants/publications.
The site was http://dkcoin.org/
And paywalls and the overall exclusivity-oriented nature of academia are to blame for this.
When you do stuff in the open and share it, it's (at least in our current information age) immortal.
When you're a prick about it. It's lost. And most of academia is composed of pricks.
Our scientific research system is built around the process of joining a lab, mastering the work there, and then leaving. There are very few long term research partnerships. The people who stay in place are the professors, who generally do not do the research work.
So you join a lab, produce a few terabytes of data a year, pull a few publishable nuggets out of that and then leave. I have a few backup hard drives that move around with me with what I consider my most important data, probably total 1/10 of the data I have taken. After a few years, this data is really unimportant to me as the labs I have left have done a good job of continuing the research and I have to spend my time and money on something else.
The original data is eventually overwritten by researchers a few "generations" removed from me and that's the end of it.
How is that different from the previous state of affairs?
Before the digital age, scientists would have work booklets that would get lost or destroyed when they changed jobs, or when they became too numerous.
Drowning in an overflow of data is about as useful as having no data at all. It could be argued that forgetting is actually a good thing that puts forward the important matters, those that we care to keep because they are valuable. Sure, some valuables get lost in the process, but anyway, who would go and sort through all the data they ever generated, even if they had it available forever?
You can still do just as much with NEW data as you could with OLD data, you just have to pay to collect it, which is a cost, but then again, you're not going to chip in on the cost of someone else's costs so you can save something later potentially, are you.
NGDC cuts are because so many merkins are 100% anti-tax.
Storing the data costs.
You cut funding, they have to cut costs.
When John Knoll (yes, THE John Knoll, co-creator of Photoshop and VFX wizard extraordinaire) wanted to reproduce the Apollo moon landing in CG he ran into a small problem. He went to NASA to obtain the telemetry data for altitude and orientation but apparently the data had been tossed a long time ago. However, he was able to find physical prints of graphs of the telemetry channels. So he scanned them in, made them an underlay in a 3D modeling program, and painstakingly traced them by hand in order to extract the data. The results can be seen in Magnificent Desolation Apollo 15 landing sequence. And BTW, that's his modeling work for the lander too.
Often data is subject to strict retention and destruction policies.
This isn't news.
So feel like backing that up on your RAID array for free?
I am thinking back to one lab I used to work in that had boxes and boxes of old tape spools sitting out in the hallway, it was always sad to wonder what might be on them since the machine used to create the data had already been disassembled to make space.
And then I think about the actual project I was working on, which produced something like 1GB/hour every hour every day. Only a fraction of the raw data really made it through cooking, but if there turned out to be a flaw in that initial processing our ability to go back and reprocess was limited by 'do we happen to have that run still?'.
From what I've heard, National Science Foundation is worried a lot about scientific data preservation. Here is some reading http://en.wikipedia.org/wiki/Datanet
between data and information. Information is data which reduces confusion. Data can actually carry negative information value if it increases confusion. Any data which is highly informative survives. And just because money was spent to obtain it, doesn't mean it was fruitful. Research is, almost by definition, a walk in the dark. It attempts to reduce confusion. And, as such, is bound to have misses more often than hits.
Any guest worker system is indistinguishable from indentured servitude.
Check out the Dilbert comics from Sept 6 - Sept 16
Damn. This just had me realize the original raw data collected for my most significant publication is gone.
Of course. Your publication should generally stand on its own, providing enough details in methodology and statistical handling to make the raw data less valuable.
That said, I've felt since the creation of the web that all data generated using public funding should be easily and simply accessed, so that others may evaluate or even expand on your work. Including programs developed (and source code and details of systems used.)
Ideally, we should work towards the kind of open databases that amateur astronomers now have access to, continuously adding to the value of the collected data.
We Need Legacy Support - I keep saying this and the little kids keep dissing me but we desperately need to maintain legacy support. In 30 more years what else will we have lost through rapid obsolescence?
Companies like Apple and Microsoft need to reach back and provide it all the way to their earliest systems forward. We need to be able to access our old data and that means being able to run our old applications.
Congress needs to put forth the legal framework that allows all software to be legally cross-compiled, enveloped and emulated so that it can run on future hardware and in future operating systems.
This does not require ballooning of operating systems. It can be done through fairly simple emulation or better yet cross compilation and enveloping. We have the technology.
I have a box with about 200 3.5" floppy disks of facility data. And another box with several laser disks from HP data systems (1980s, running RMB) because those floppies could only store four hours of data. The data is not "scientific" but facility pressure, temperature, stresses, etc. I don't know what to do with all this. I don't think it's as important as data from Voyager or Pioneer, but one never knows. We don't have the equipment to read it anymore. Maybe we can find it used, eBay perhaps? I remember those HP instrument controllers ***never crash***. There may have been times when someone pulled the power cord. The only crashes I experienced were inadvertent divides by zero, where the program halts. But the data is still there, including values in the variables, i.e. TSPTEMP still has temperature data.
mfwright@batnet.com
Business is booming and look who's buying. :-)
Its the new new new.
Sudy
Perhaps, though if you jump to the bottom of the article, you can see that they are making an effort to keep the data by archiving it with Dryad.
You might want to look at LOCKSS (Lots Of Copies Keep Stuff Safe (http://www.lockss.org/)) -- we are integrating PURR with the MetaArchive Private LOCKSS Network at Purdue (PURR is the Purdue University Research Repository, which is a Trusted Digital Repository for research data).
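The "lots of copies" idea can be illustrated with a toy majority vote: keep several independent copies, hash each one, and treat the content that most copies agree on as the good version. To be clear, this is not the actual LOCKSS protocol, just a sketch of the general principle; the function name and sample data are hypothetical.

```python
import hashlib
from collections import Counter


def majority_copy(copies):
    """Return the content that a strict majority of copies agree on,
    or None if there is no strict majority (or no copies at all)."""
    if not copies:
        return None
    digests = [hashlib.sha256(copy).hexdigest() for copy in copies]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes <= len(copies) // 2:
        return None
    return copies[digests.index(winner)]


# Hypothetical example: three sites hold a copy, one has suffered bit rot.
site_a = b"pressure,temp\n101.3,22.5\n"
site_b = b"pressure,temp\n101.3,22.5\n"
site_c = b"pressure,temp\n101.3,2?.5\n"

good = majority_copy([site_a, site_b, site_c])
print(good == site_a)  # True
```

With only two copies that disagree there is no way to tell which one is damaged, which is one argument for keeping more than two.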
"Display some adaptability" -- Doug Shaftoe, _Cryptonomicon_
Scientific data by themselves are probably useless. So we have a bunch of numbers. What was the setup of the experiment that generated those numbers? What exactly was the instrument, what are the units of measurement? Did you make any major modifications to the instrument? How was it calibrated? Where is your control? Are those numbers from a good test or a test where someone spilled coffee on the sample? Was that data taken during one of the trials where you left the lens cap on? Reminds me of a bad sci fi movie. That disk has random "scientific data" on it. Any "scientist" should be able to read it and instantly see what is going on here.
Your notes and documentation are probably more important than just the numbers you collect, and those are often still stored in lab notebooks. You know what is really important? The journal articles and papers that you write that show all your methods and have pretty pictures showing your good data. A lot of those are still on paper, so they aren't going away. So we are losing a lot of random numbers from obsolete equipment from setups that no one remembers anymore. I am not going to lose sleep over it, assuming we still have backups of the papers people published that talked about their setups and outlined their final results.
I made a diagram (derived from a diagram in an earlier publication) that presents this data (and metadata) loss really well: Research Data and Metadata at Risk: Degradation over Time as part of a paper I co-authored on this subject, Facilitating Data Sharing in the Behavioral Sciences.
Second URL should be: http://dx.doi.org/10.1890/1051-0761(1997)007%5B0330:NMFTES%5D2.0.CO;2
https://www.jstage.jst.go.jp/article/dsj/11/0/11_11-DS4/_article
A couple of us just rescued some 20-year-old data that had been stored on 3.5 inch floppies. We actually had to go to one of our old retired colleague's houses because he was the only person we could find who had a computer with a floppy drive capable of reading them. Even so, some of the data was unrecoverable.
I know probably the best option right now for preservation in digital form would be several copies on CD/DVDs of the proper archival type, but I'm wondering if there are any free online services such as Amazon Web Services (which has free accounts for limited usage) where there'd be a prayer they'd keep it around for decades. After all the stuff that Google has abandoned over the years, I'd never count on them, but is there anyone else who might be any better?
Hello,
Our mindset at my research institution is very different. We generate a certain amount of data per year (several terabytes), but the cost of storage decreases so fast we just copy old data onto new media and never delete ANYTHING.
In fact, we consider the cost of actually figuring out what data to delete to be higher than simply buying more storage.
I would not call it "well-indexed" however.
Our backup strategy is tailored to the nature of our data. Most of our data is simulation results. We back up "lightweight" data and analyzed results, input files, and log files. "Heavyweight" data we do not back up, since we consider the cost of reproducing this data (given the input files and the log files) modified by the low probability of actually ever needing it to be lower than the cost of backing it up. This results in our backup requirement to be maybe 5% of our "live" data archive.
If it gets to the point where we can't afford the storage anymore, we'll delete the "heavyweight" data ourselves to reduce the data footprint.
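The trade-off described above (back up a data set only when the expected cost of regenerating it exceeds the cost of backing it up) can be written as a short rule. This is a minimal sketch; the `Dataset` fields, the cost units, and the example numbers are assumptions for illustration, not figures from our institution.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    backup_cost: float        # cost of keeping a backup copy (any consistent unit)
    regeneration_cost: float  # cost of re-running from the input and log files
    p_needed: float           # estimated probability the data is ever needed again


def should_back_up(ds: Dataset) -> bool:
    """Back up when the expected regeneration cost is at least the backup cost."""
    expected_regeneration_cost = ds.regeneration_cost * ds.p_needed
    return expected_regeneration_cost >= ds.backup_cost


# Hypothetical numbers: small analyzed results vs. bulky raw simulation output.
lightweight = Dataset("analyzed_results", backup_cost=1.0, regeneration_cost=100.0, p_needed=0.5)
heavyweight = Dataset("raw_output", backup_cost=500.0, regeneration_cost=1000.0, p_needed=0.01)
```

With these made-up values the lightweight set is backed up and the heavyweight set is not, which matches the policy described in the comment.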
--PeterM
For one example, let's say that for one project I have roughly 300GB of simulation data. Out of that data, how much will be used to generate figures for publication? Maybe 1%? The rest of it is from testing, fine tuning, and exploring the parameter space. The real problem isn't where to save it all, but that there is extremely little incentive to go through the trouble of sifting through and archiving the important stuff. 80% is probably a lower bound, IMHO. Furthermore, let's say you save that important, precious data. Good luck, future scientist, in figuring out what is in those files and how to analyze it.
I realize that not all science is like this, but I think I'm speaking about the majority, not the minority.
This problem occurs even for people in the same group, who often have trouble repeating the simulations from our own papers, even ones as recent as one year ago. The problems typically come from people leaving (PhD finished, grants that expire, people that move to a different job), changes in the simulation tools, etc.
In our Computer Architecture research group we employ Mercurial for versioning the simulator code. Thus, we can know when each change was applied. For each simulation, we store both the configuration file that is used to generate that simulation (which also includes the Mercurial version of the code which is being used) and the simulation results, or at least only the interesting results. Multiple simulators allow for different verbosity levels, and in most cases most of the output is useless, so we typically store the interesting data (such as latency and throughput) because otherwise we would have no disk space.
Even with this setup, we often have problems trying to replicate the exact results of our own previous papers, for example because of poor documentation (this is typical in research, since homebrew simulation tools are not maintained as one would expect from commercial code), changes that introduce subtle effects, code that gets lost when some person leaves, or simply large files that get deleted to save disk space (for example, simulation checkpoints or network traces, which are typically very large).
However, you typically do not need to look back and replicate results, so keeping all the data is a useless effort. I completely understand that research data gets lost, but I think that it is largely unavoidable.
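The bookkeeping described above (keeping the configuration and the Mercurial revision of the simulator next to the interesting results) might look roughly like the following. This is a sketch under assumptions, not our group's actual tooling: the file names `config.txt` and `run_record.json`, the JSON layout, and both function names are made up for the example. The only Mercurial call used is `hg identify --id`.

```python
import json
import subprocess
from pathlib import Path


def current_hg_revision(repo_dir: Path) -> str:
    """Ask Mercurial for the working-directory revision id.

    A trailing '+' in the output means there are uncommitted changes.
    """
    result = subprocess.run(
        ["hg", "identify", "--id"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def write_run_record(run_dir: Path, config_text: str, revision: str, results: dict) -> Path:
    """Store the configuration, simulator revision, and key results together."""
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "config.txt").write_text(config_text)
    record = {
        "simulator_revision": revision,  # e.g. output of current_hg_revision()
        "config_file": "config.txt",
        "results": results,              # only the interesting metrics
    }
    record_path = run_dir / "run_record.json"
    record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record_path
```

A record like this does not prevent the losses listed above (deleted checkpoints, undocumented changes), but it at least ties each stored result to one exact code revision.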
And part of the reason is Pay Walls... Just like the one blocking the paper from the public.
to reinvent the wheel?
Loss is irrelevant to the argument, because loss can occur to both paper copies and electronic copies. The argument is about what you can do with the media if it is *not* lost. Paper copies can be read for centuries (at least on acid-free paper). Hard drives probably last 10 to 30 years (we'll know in 30 years, although we can get some idea sooner by exposing hard drives to high temps etc.). CDs, surprisingly (ok, it surprises me; ymmv) don't last much longer (at least we don't think they do).
Interesting. Would be even more interesting to have this disc backed with a silicon wafer. You could store a lot of data in 300mm2 of ROM even at something conservative like 0.1micron.
..presumably until the end of time, or until they can find some nefarious use for it.
Now, just -what- do I pay taxes for every April 15?