Slashdot Mirror


Scientific Data Disappears At Alarming Rate, 80% Lost In Two Decades

cold fjord writes "UPI reports, 'Eighty percent of scientific data are lost within two decades, disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated. The finding comes from a study tracking the accessibility of scientific data over time, conducted at the University of British Columbia. Researchers attempted to collect original research data from a random set of 516 studies published between 1991 and 2011. While all data sets were available two years after publication, the odds of obtaining the underlying data dropped by 17 per cent per year after that, they reported. "Publicly funded science generates an extraordinary amount of data each year," UBC visiting scholar Tim Vines said. "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate."' More at The Vancouver Sun and Smithsonian."
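
As a rough illustration of what a 17 per cent yearly drop in the odds implies, here is a minimal Python sketch. It assumes the decline applies to the odds (p / (1 - p)) rather than to the probability itself, and that it compounds at a constant rate after year two. The function names and the caller-supplied starting probability are hypothetical and do not come from the study.

```python
# Sketch only: assumes the 17% yearly decline applies to the odds
# (p / (1 - p)) and compounds at a constant rate after year two.
YEARLY_DECLINE = 0.17


def odds_multiplier(years_after_two: int) -> float:
    """Fraction of the year-two odds remaining after the given number of years."""
    return (1 - YEARLY_DECLINE) ** years_after_two


def probability_after(start_probability: float, years_after_two: int) -> float:
    """Turn a caller-supplied starting probability into odds, decay them, convert back."""
    if not 0 < start_probability < 1:
        raise ValueError("start_probability must be strictly between 0 and 1")
    odds = start_probability / (1 - start_probability)
    odds *= odds_multiplier(years_after_two)
    return odds / (1 + odds)


if __name__ == "__main__":
    # 18 further years takes a paper from year 2 to year 20.
    print(f"odds remaining after 18 more years: {odds_multiplier(18):.1%}")
```

Because the summary says every data set was available at year two, the starting odds are not well defined there. Any starting probability passed to `probability_after` is therefore a guess, and the output shows the shape of the decay, not the study's figures.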

17 of 189 comments

  1. And in 20 years... by Anonymous Coward · · Score: 5, Insightful

    And in 20 years, these results too shall be lost.

    1. Re:And in 20 years... by queazocotal · · Score: 5, Insightful

      That's not the point.
      The actual published results, even if published in an obscure journal, tend to stick around _much_ longer.

      Even old journals which go out of publication get their archives and the rights to distribute them bought - as there is some small amount of value there, in addition to the copies in the various reference libraries around the world.

      The problem is that if you are wondering about that graph on page 14 of the paper that the whole paper rests on, you can't get the original data to recreate that graph.

      This is a major problem because the only way to check that graph is now to redo the whole experiment.

  2. Concerning... by AdamColley · · Score: 5, Insightful

    Trying to ignore that a paper about the unavailability of scientific data is locked behind a paywall.

    This is nothing new though, I do occasional conversion from ancient data formats, people need to pay better attention, imagine trying to read an 8" CP/M floppy today.

    As libraries move to digital storage rather than the dead tree that's been fine for thousands of years they are inviting a catastrophe, possibly only one well-aimed coronal mass ejection away from massive data loss.

    1. Re:Concerning... by Dutch+Gun · · Score: 4, Insightful

      Paper has its own issues. Talk to me about the durability of paper after you recover the books lost throughout time due to natural decay, burning (intentional or otherwise), floods, wars, and social forces (politics, religion, etc.). Digital data can be easily copied and archived (when not behind a paywall, of course). It seems to me that redundancy is the best form of insurance against data loss. A coronal mass ejection is not going to wipe out every computer with a copy of important data on it, and all the relevant backups. And if it does, we're probably in a lot more trouble for reasons other than losing some scientific research.

      Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing? I'm only half joking, of course, since it's difficult to ascertain the value of research immediately in some cases, but wouldn't it stand to reason that any important or groundbreaking research will naturally be widely disseminated, and thus protected against loss?

      --
      Irony: Agile development has too much inertia to be abandoned now.
    2. Re:Concerning... by Eunuchswear · · Score: 5, Insightful

      Digital data can be easily copied and archived

      Can be. But mostly isn't.

      --
      Watch this Heartland Institute video
    3. Re:Concerning... by _Shad0w_ · · Score: 4, Informative

      I'd go to one of the British deposit libraries and ask to see their copy; deposit libraries have existed since the Statute of Anne in 1710. The British Library has 28,765 books and 1,480 journals in its catalogue from 1910...

      --
      Yeah, I had a sig once; I got bored of it.

    4. Re:Concerning... by serviscope_minor · · Score: 4, Interesting

      Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing?

      Probably not that much. I'm not claiming this is good, but I don't think it's as bad as it appears.

      If a paper is unimportant and more or less sinks without a trace (perhaps a handful of citations), then the data is probably of no importance since someone is unlikely to ever want it. Generally this is because papers tend to get more obscure over time and also get superseded.

      For important papers, the data just isn't enough: if a paper is important then it will establish some technique or result. In 20 years people will have generally already reanalysed the data and likely also independently verified the result if it is important enough. After 20 years I think the community will have moved on and the result will either be established or discredited.

      I think the exception is for things that are "hard" to find or non-repeatable, such as fossil finds. Then again the Natural History Museum has boxes and boxes and boxes of the things in the back room. They still haven't gotten round to sorting all the fossils from the Beagle yet (this is not a joke or rhetoric: I know someone who worked there).

      So my conclusion is that it's not really great that the data is being lost, but it's not as bad as it initially sounds.

      --
      SJW n. One who posts facts.
    5. Re:Concerning... by clickclickdrone · · Score: 4, Informative

      As an extreme case, the BBC has reported that scrolls from Pompeii and Herculaneum that were 'destroyed' by Vesuvius are now starting to reveal their secrets, thanks to some pretty impressive techniques. http://www.bbc.co.uk/news/magazine-25106956

      --
      I want a list of atrocities done in your name - Recoil
    6. Re:Concerning... by martin-boundary · · Score: 4, Insightful

      This is nothing new though, I do occasional conversion from ancient data formats, people need to pay better attention, imagine trying to read an 8" CP/M floppy today.

      It's not that it's a new problem as such, it's that for the first time in history we have a simple way to solve it, yet we have stupid greedy rich people who sponsor and enact laws to stop us from solving the problem.

      The way to solve the problem is through massive duplication of all the data, over and over again through time. We have the technical means to do this on an unprecedented scale.

      Even 1000 years ago, people had to painstakingly copy books, by hand, one at a time. And after a handful of copies were produced, there still weren't enough to guarantee that most would survive the ages, wars, fires, censorship, etc. So we generally have tiny collections from the past.

      But now it's digital data. Anyone could copy it. We could have millions of copies of some obscure scientific work, all perfect duplicates. If even 0.1% of these copies survive, that's still thousands of copies.

      And what do we do? We let a bunch of 1 percenters, who themselves barely know how or care to read, sponsor draconian copyright laws to stop everyone from copying all that stuff, just on the off chance that they might copy a bunch of songs or movies that are outmoded within two years. And the commercial scientific publishers are some of the worst.

      It's pathetic.
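
The redundancy argument in the comment above comes down to simple arithmetic, sketched here in Python. The copy counts are stand-ins for "a handful" and "millions", and treating each copy as surviving independently with the same probability is an assumption. Real copies often share failure modes (same format, same host, same law), so this is an idealised upper bound, not a forecast.

```python
# Sketch only: treats each copy as surviving independently with the same
# probability, which is an idealisation.


def expected_survivors(copies: int, survival_rate: float) -> float:
    """Expected number of copies that are still readable."""
    return copies * survival_rate


def chance_any_survives(copies: int, survival_rate: float) -> float:
    """Probability that at least one copy survives."""
    return 1 - (1 - survival_rate) ** copies


if __name__ == "__main__":
    # 0.1% survival rate, as in the comment; copy counts are illustrative.
    for copies in (5, 1_000, 1_000_000):
        print(
            f"{copies:>9,} copies: "
            f"expected survivors {expected_survivors(copies, 0.001):,.3f}, "
            f"chance any survives {chance_any_survives(copies, 0.001):.4f}"
        )
```

With a handful of hand-copied books the work is very likely lost at that survival rate. With a million digital copies the expected number of survivors is about a thousand, and several million copies gives the "thousands" the comment mentions.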

    7. Re:Concerning... by Anonymous Coward · · Score: 5, Interesting

      I designed and built the equipment for scientific experiments that will never be repeated: cochlear implant stimulation of one ear, done in an MRI. This was safe because the older implant technology had a jack that stuck out of the subject's head, and which we could connect to electronics outside the MRI itself. But the old "Ineraid" implants have been replaced, clinically, with implants using embedded electronics and usually magnets. Those are hideously unsafe to even bring into the same *room* as an MRI, much less actually scan the brain of a person wearing one.

      So that experiment is unlikely to ever be repeated. Losing the data, and losing the extensive clinical records of those subjects, would be an immense loss to science. In particular, there is historical data from decades of testing on these subjects that shows the long term effects of their implants, or of different types of redesigned external stimulators. That data is scientifically priceless. When I started that work, we used mag-tape for data, and scientific notebooks for recording measurements. I helped reformat and transfer that data to increasingly modern storage devices several times. We went through 3 different types of storage media in 10 years, and I remember having to write software to allow Exabyte drives to find the end of the tape and add data. (Exabytes had no End-Of-Tape marker.) Preserving that data... was a lot of work.

    8. Re:Concerning... by Teun · · Score: 4, Insightful

      In the nineties I had a friend working for a company that bought a lot of old Soviet geophysical data.

      It needed some very special transcription technology but once in the clear and fed to modern 3D seismic software it revealed a lot more than the original reports gave.

      Retaining old reports is nice, retaining old raw data even nicer.

      --
      "The likes of Facebook and WhatsApp are free to those whose privacy is of zero value."
    9. Re:Concerning... by Lisias · · Score: 5, Insightful

      Wishful thinking.

      Let's make a deal: *first*, the gene therapy works. *THEN* we assume we can afford to lose the data the grandparent talks about.

      --
      Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
  3. Lifecycle management by FaxeTheCat · · Score: 4, Interesting

    So the institutions do not have any data lifecycle management for research data. Are we supposed to be surprised? Ensuring that data are not lost is a huge undertaking and cannot be left to the individual researcher. It may also require a change in the research culture at many institutions. As long as research is measured by the publications, that is where the resources go and where the focus will be.

    Will this change? Probably not.

  4. On the bright side... by ron_ivi · · Score: 4, Interesting

    ... poorly collected, unreliable data also vanishes at least as fast (hopefully faster). And assuming shoddy data disappears faster than good data, the quality of available data should continually increase.

  5. So...? by Anonymous Coward · · Score: 4, Insightful

    I'm a researcher and I don't have time or space to keep old data as I'm generating too much new data. We work hard to maximize the use of these data and analyses when we write and publish papers. If this were about the papers (or presentations) that were the product of the data being lost at this rate, it would be one thing, but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities. This just seems like ammunition for the climate change deniers to bitch about. It's unreasonable to keep the old data indefinitely without a massive public repository that will be poorly indexed and organized.

  6. is/are by LMariachi · · Score: 5, Interesting

    Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.

    Whichever side of the "data is" vs. "data are" argument one falls on, I hope we can all agree that mixing both forms within the same sentence is definitely wrong.

  7. Re:Why must you have their data? by n1ywb · · Score: 5, Interesting

    No, but it is amazing what NEW science you can do with OLD data. I've worked with the Transportable Array project for example http://www.usarray.org/researchers/obs/transportable it's over a decade old and scientists are still discovering new ways to take advantage of the data and will likely be doing so for decades to come. On the other hand a lot of data is just junk due to poor quality metadata; when was that instrument calibrated? I dunno. Damn. At least in geophysics we have the National Geophysical Data Center to curate this stuff http://www.ngdc.noaa.gov/ at least until Congress cuts its funding.

    --
    -73, de n1ywb
    www.n1ywb.com