Slashdot Mirror


Scientific Data Disappears At Alarming Rate, 80% Lost In Two Decades

cold fjord writes "UPI reports, 'Eighty percent of scientific data are lost within two decades, disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated. The finding comes from a study tracking the accessibility of scientific data over time, conducted at the University of British Columbia. Researchers attempted to collect original research data from a random set of 516 studies published between 1991 and 2011. While all data sets were available two years after publication, the odds of obtaining the underlying data dropped by 17 per cent per year after that, they reported. "Publicly funded science generates an extraordinary amount of data each year," UBC visiting scholar Tim Vines said. "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate."' More at The Vancouver Sun and Smithsonian."

189 comments

  1. And in 20 years... by Anonymous Coward · · Score: 5, Insightful

    And in 20 years, these results too shall be lost.

    1. Re:And in 20 years... by Anonymous Coward · · Score: 0

      I feel terrible that I laughed. This is terrible news...

    2. Re:And in 20 years... by Z00L00K · · Score: 1

      Unless it's published in a newspaper or magazine that is widespread. But printed matter seems to be in decline.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    3. Re:And in 20 years... by queazocotal · · Score: 5, Insightful

      That's not the point.
      The actual published results, even if published in an obscure journal, tend to stick around _much_ longer.

      Even old journals which go out of publication get their archives and the rights to distribute them bought - as there is some small amount of value there, in addition to the copies in the various reference libraries around the world.

      The problem is that if you are wondering about that graph on page 14 of the paper that the whole paper rests on, you can't get the original data to recreate that graph.

      This is a major problem because the only way to check that graph is now to redo the whole experiment.

    4. Re:And in 20 years... by ObsessiveMathsFreak · · Score: 1, Funny

      Well, they're currently behind a paywall, so I don't see how most of us were even supposed to find them in the first place.

      --
      May the Maths Be with you!
  2. but when by Anonymous Coward · · Score: 0

    does it reappear?

  3. lulz by Anonymous Coward · · Score: 3, Funny

    That's okay, the NSA has a backup.

    1. Re:lulz by Anonymous Coward · · Score: 0

      Unfortunately the NSA is like a write only memory, so it is there, but you can't get it out.

    2. Re:lulz by Anonymous Coward · · Score: 0

      You mean until Wikileaks or Anonymous releases some of it.

  4. Concerning... by AdamColley · · Score: 5, Insightful

    Trying to ignore that a paper about the unavailability of scientific data is locked behind a paywall.

    This is nothing new though, I do occasional conversion from ancient data formats, people need to pay better attention, imagine trying to read an 8" CP/M floppy today.

    As libraries move to digital storage rather than the dead tree that's been fine for thousands of years, they are inviting a catastrophe, possibly only one well-aimed coronal mass ejection away from massive data loss.

    1. Re:Concerning... by Dutch+Gun · · Score: 4, Insightful

      Paper has its own issues. Talk to me about the durability of paper after you recover the books lost throughout time due to natural decay, burning (intentional or otherwise), floods, wars, and social forces (politics, religion, etc). Digital data can be easily copied and archived (when not behind a paywall, of course). It seems to me that redundancy is the best form of insurance against data loss. A coronal mass ejection is not going to wipe out every computer with a copy of important data on it, and all the relevant backups. And if it does, we're probably in a lot more trouble for reasons other than losing some scientific research.

      Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing? I'm only half joking, of course, since it's difficult to ascertain the value of research immediately in some cases, but wouldn't it stand to reason that any important or groundbreaking research will naturally be widely disseminated, and thus protected against loss?

      --
      Irony: Agile development has too much inertia to be abandoned now.
    2. Re:Concerning... by Anonymous Coward · · Score: 2, Insightful

      The problem is not just an issue of digital storage, but also a problem of redundancy.

      In the "old days", people understood and accepted the risk that a paper copy would be lost. In fact, it was a GIVEN that they would eventually be lost (or damaged or misplaced or stolen or checked out and simply never returned). So multiple copies were kept because centuries of experience dictated that some copies would be lost no matter how strong, carefully maintained and well preserved the originals were.

      Nowadays, people simply think it's a matter of "copy and paste". But, as you point out, it's not. Different hardware formats on top of different software formats. The card catalog with its rigid but well defined categories was switched for a nebulous and vague "tagging" system. And god help you if the files are corrupted.

    3. Re:Concerning... by Eunuchswear · · Score: 5, Insightful

      Digital data can be easily copied and archived

      Can be. But mostly isn't.

      --
      Watch this Heartland Institute video
    4. Re:Concerning... by thunderclap · · Score: 3

      Well, dead tree has its own issues. Try finding a book written and published in 1910. Most likely you won't. The paper is so fragile that it has to be specially sealed to survive. Rag paper, on the other hand, still looks good for its age.

    5. Re:Concerning... by clickclickdrone · · Score: 3, Interesting

      That's still 100 years which is a lot better than the data being talked about here.

      There was a documentary on the radio this week about the loss of letter writing as a form and how alarmed biographers were getting because it's getting very hard to trace someone's life, thoughts, actions etc without a paper trail, as stuff like emails, digital photos etc generally get lost when someone dies.

      Personally, I find the increasing rate of loss quite alarming - so much of our lives are digital and so little is properly curated with a view to future access. We know so much about the past from old documents, often hundreds if not thousands of years old but these days we're hard pushed to find something published ten years ago.

      --
      I want a list of atrocities done in your name - Recoil
    6. Re:Concerning... by _Shad0w_ · · Score: 4, Informative

      I'd go to one of the British deposit libraries and ask to see their copy; deposit libraries have existed since the Statute of Anne in 1710. The British Library has 28,765 books and 1,480 journals in its catalogue from 1910...

      --

      Yeah, I had a sig once; I got bored of it.

    7. Re:Concerning... by serviscope_minor · · Score: 4, Interesting

      Besides which, I sort of wonder if scientific data also follows the 80/20 rule. If so, how much are we really losing?

      Probably not that much. I'm not claiming this is good, but I don't think it's as bad as it appears.

      If a paper is unimportant and more or less sinks without a trace (perhaps a handful of citations), then the data is probably of no importance since someone is unlikely to ever want it. Generally this is because papers tend to get more obscure over time and also get superseded.

      For important papers, the data just isn't enough: if a paper is important then it will establish some technique or result. In 20 years people will have generally already reanalysed the data and likely also independently verified the result if it is important enough. After 20 years I think the community will have moved on and the result will either be established or discredited.

      I think the exception is for things that are "hard" to find or non-repeatable, such as finding fossils. Then again the Natural History Museum has boxes and boxes and boxes of the things in the back room. They still haven't gotten round to sorting all the fossils from the Beagle yet (this is not a joke or rhetoric: I know someone who worked there).

      So my conclusion is that it's not really great that the data is being lost, but it's not as bad as it initially sounds.

      --
      SJW n. One who posts facts.
    8. Re:Concerning... by dbIII · · Score: 1

      imagine trying to read an 8" CP/M

      If I didn't already know of two companies that could do that I'd look in the yellow pages. I get your point though and there are older or rarer formats than that which would require a bit of legwork or possibly even reverse engineering.

    9. Re:Concerning... by clickclickdrone · · Score: 4, Informative

      As an extreme case, the BBC has reported that scrolls from Pompeii and Herculaneum that were 'destroyed' by Vesuvius are now starting to reveal their secrets, thanks to some pretty impressive techniques. http://www.bbc.co.uk/news/magazine-25106956

      --
      I want a list of atrocities done in your name - Recoil
    10. Re:Concerning... by martin-boundary · · Score: 4, Insightful

      This is nothing new though, I do occasional conversion from ancient data formats, people need to pay better attention, imagine trying to read an 8" CP/M floppy today.

      It's not that it's a new problem as such, it's that for the first time in history we have a simple way to solve it, yet we have stupid greedy rich people who sponsor and enact laws to stop us from solving the problem.

      The way to solve the problem is through massive duplication of all the data, over and over again through time. We have the technical means to do this on an unprecedented scale.

      Even 1000 years ago, people had to painstakingly copy books, by hand, one at a time. And after a handful of copies were produced, there still weren't enough to guarantee that most would survive the ages, wars, fires, censorship, etc. So we generally have tiny collections from the past.

      But now it's digital data. Anyone could copy it. We could have millions of copies of some obscure scientific work, all perfect duplicates. If even 0.1% of these copies survive, that's still thousands of copies.

      And what do we do? We let a bunch of 1 percenters, who themselves barely know how or care to read, sponsor draconian copyright laws to stop everyone from copying all that stuff, just on the off chance that they might copy a bunch of songs or movies that are outmoded within two years. And the commercial scientific publishers are some of the worst.

      It's pathetic.
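
      The redundancy arithmetic in the comment above can be checked with a rough sketch. It assumes each copy survives or is lost independently with the same probability, which real archives will not satisfy exactly; the function names and the manuscript example are illustrative, not from the thread.

```python
def expected_survivors(copies: int, survival_rate: float) -> float:
    """Expected number of surviving copies for a given per-copy survival rate."""
    return copies * survival_rate


def chance_all_lost(copies: int, survival_rate: float) -> float:
    """Probability that every copy is lost, assuming independent losses."""
    return (1.0 - survival_rate) ** copies


# One million digital copies with 0.1% surviving, as in the comment.
print(expected_survivors(1_000_000, 0.001))

# Five hand-copied manuscripts at the same (hypothetical) survival rate.
print(chance_all_lost(5, 0.001))
```

      Under these assumptions, a million copies leave roughly a thousand survivors, while a handful of manuscripts are almost certain to be lost entirely. The survival rate matters far less than the number of copies.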

    11. Re:Concerning... by Yvanhoe · · Score: 1

      In this fight, Aaron Swartz came very close to making the whole world totally different.

      --
      The Wise adapts himself to the world. The Fool adapts the world to himself. Therefore, all progress depends on the Fool.
    12. Re:Concerning... by bfandreas · · Score: 3, Insightful

      The combination of insane copyright claims and the overreliance on comparatively volatile storage technology is steering us directly into another dark age.
      That's one take on things.
      On the other hand we have already lost so much stuff over the centuries that perhaps what I just said is idiotic alarmism. After all we have rebuilt western civilisation after the fall of Rome (that just took the Dark Ages) and we didn't all die off after the Great Library of Alexandria burned down. The stuff that gets often replicated will probably not be lost. But let's hope it isn't a retweet of Miley Cyrus' knickers.

      --
      20 minutes into the future
    13. Re:Concerning... by Anonymous Coward · · Score: 1

      In my experience books and journals from 1910 have survived very well. (These were for the most part printed in the UK and USA; I don't know how well publications from, say, Japan or China have fared.)

    14. Re:Concerning... by Anonymous Coward · · Score: 5, Interesting

      I designed and built the equipment for scientific experiments that will never be repeated: cochlear implant stimulation of one ear, done in an MRI. This was safe because the older implant technology had a jack that stuck out of the subject's head, and which we could connect to electronics outside the MRI itself. But the old "Ineraid" implants have been replaced, clinically, with implants using embedded electronics and usually magnets. Those are hideously unsafe to even bring in the same *room* as an MRI, much less actually scan the brain of a person wearing one.

      So that experiment is unlikely to ever be repeated. Losing the data, and losing the extensive clinical records of those subjects, would be an immense loss to science. There is especially historical data from decades of testing on these subjects that show the long term effects of their implants, or of different types of redesigned external stimulators. That data is scientifically priceless. When I started that work, we used mag-tape for data, and scientific notebooks for recording measurements. I helped reformat and transfer that data to increasingly modern storage devices several times. We went through 3 different types of storage media in 10 years, and I remember having to write software to allow Exabyte drives to find the end of the tape and add data. (Exabytes had no End-Of-Tape marker.) Preserving that data.... was a lot of work.

    15. Re:Concerning... by Alien1024 · · Score: 1

      the dead tree that's been fine for thousands of years

      Not so fine.... The Alexandria library fire was perhaps the most catastrophic loss of human knowledge ever. For example, it destroyed the details of a heliocentric theory, which was postulated by Greek astronomer Aristarchus of Samos, millennia before Copernicus brought it to mainstream.

    16. Re:Concerning... by jabuzz · · Score: 1, Interesting

      No it won't, because in 20-30 years we will be able to do gene therapy to "grow" or "regrow" the stereocilia and hence cochlear implants will be considered as barbaric as medieval blood letting. Consequently the data will only be of obscure historical interest.

    17. Re:Concerning... by Teun · · Score: 4, Insightful

      In the nineties I had a friend working for a company that bought a lot of old Soviet geophysical data.

      It needed some very special transcription technology but once in the clear and fed to modern 3D seismic software it revealed a lot more than the original reports gave.

      Retaining old reports is nice, retaining old raw data even nicer.

      --
      "The likes of Facebook and WhatsApp are free to those whose privacy is of zero value."
    18. Re:Concerning... by Lisias · · Score: 5, Insightful

      Wishful thinking.

      Let's make a deal: *first*, the gene therapy works. *THEN* we assume we can afford to lose the data the grandparent talks about.

      --
      Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
    19. Re:Concerning... by dj245 · · Score: 1

      We went through 3 different types of storage media in 10 years, and I remember having to write software to allow Exabyte drives to find the end of the tape and add data. (Exabytes had no End-Of-Tape marker.) Preserving that data.... was a lot of work.

      Do what everybody else does. Encrypt it using a strong password, then upload it to The Pirate Bay or the Semi-centralized Filesharing Platform Which Shall Not Be Named, and call it "insurance file xxxxx".

      --
      Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    20. Re:Concerning... by Anonymous Coward · · Score: 0

      So what have YOU done to rectify the situation, besides whine?

    21. Re:Concerning... by Richard_at_work · · Score: 1

      I have a collection of books on my shelf that date from the 1870s, all in top condition and never stored in any special way.

    22. Re:Concerning... by DrLang21 · · Score: 1

      Considering that there is active research on original Chinese commentaries dating back to the Tang dynasty, I would say they are holding up pretty good.

      --
      I see the glass as full with a FoS of 2.
    23. Re:Concerning... by secretcurse · · Score: 1

      Why would a cochlear implant ever be considered as barbaric as medieval blood letting? The implants aren't perfect, but they provide a huge increase in the quality of life for a large number of patients. A potential better solution that's decades down the line doesn't make a currently effective treatment barbaric...

      --
      I'm using all of my mod points to mod ancient memes down. Please join me.
    24. Re:Concerning... by X0563511 · · Score: 2

      Just because it was sitting around in a library is no guarantee anything would have happened with it.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    25. Re:Concerning... by neoritter · · Score: 1

      Because a flood, war, catch-all for everything else you couldn't think of couldn't do the same amount of harm to an HDD or SSD. The natural decay of paper archives is so much slower than that of digital media it's laughable. We still have copies of the Bible from hundreds of years ago. I'd love to see an HDD or SSD last that long.

    26. Re:Concerning... by Impy+the+Impiuos+Imp · · Score: 1

      Tell me about it.

      Scientific Data Disappears At Alarming Rate, 80% Lost In Two Decades

      Man Bummed: Porn Disappears At Alarming Rate, 80% Lost In Two Decades

      --
      (-1: Post disagrees with my already-settled worldview) is not a valid mod option.
    27. Re:Concerning... by Beetle+B. · · Score: 1

      Can be. But mostly isn't.

      Kinda like paper.

      --
      Beetle B.
    28. Re:Concerning... by Anonymous Coward · · Score: 0

      Confirmation bias - while some books from that era are still around, a great many are long since lost. Having AN HDD or SSD still work is possible, having a large number, not so much. It's that same thing with "the Romans were such great engineers that their works endured" - no, their great works endured, so we think they were better engineers than they actually were.

    29. Re:Concerning... by Anonymous Coward · · Score: 0

      Trying to ignore that a paper about the unavailability of scientific data is locked behind a paywall.

      That paywall ensures that the paper will never be lost.

    30. Re:Concerning... by Anonymous Coward · · Score: 0

      Thanks to antiquated laws written by monopolists after their own interests (and not the public interest).

    31. Re:Concerning... by the+gnat · · Score: 1

      We let a bunch of 1 percenters, who themselves barely know how or care to read, sponsor draconian copyright laws to stop everyone from copying all that stuff, just on the off chance that they might copy a bunch of songs or movies that are outmoded within two years. And the commercial scientific publishers are some of the worst.

      Commercial scientific publishers do indeed tend to be bottom-feeders, but if I'm understanding the article correctly, they're not the root cause here - the issue is not that articles are being lost, it's that the underlying data used to generate them are lost. The journals can't help with that, because they're in the publication business, not the data archival business. We're talking about some grad student's lab notebook that contains the raw numbers used to generate a box plot, which then gets thrown out by mistake the next time the lab moves, or when the professor retires, etc.

      Many fields (genomics, structural biology) have mandatory data-deposition policies that ensure that the raw data is available to everyone, without charge (except for patents on commercial uses, but that's a separate rant). The problem here tends to be that we're usually archiving derived data, which is still a lot better than nothing at all, but limits the types of analyses that can be done.

    32. Re:Concerning... by Darinbob · · Score: 1

      Problem is that this data is funded and paid for while the research is active. After it's done the data is put in the closet so to speak. No one is being paid to go back every 5 years and re-read the data to convert to a new media format. Paper doesn't help because it can't store all this data anyway.

      For example: a satellite collects lots of data on solar activity, and it all gets stored on 9-track tapes, which then end up on a minicomputer for data analysis, image enhancement, etc. Project funded through government contract or grant. Papers get written, results provide for more grants, etc. But eventually the researchers move on to new projects, new companies or institutions, and so forth. Mag tapes are put in a box and archived. 30 years later, no one knows how to read the tapes, or even where to find them. Someone may have a new theory that would be greatly helped if the old data could be used to verify it, but is out of luck and has to obtain new data. Climate change deniers dismiss the theory because there's no historical data to support it.

    33. Re:Concerning... by neoritter · · Score: 1

      Current thinking on any HDD storage as an archive is that with ideal conditions the drive would be able to retain data for 10-15 years. That number gets pushed to 25-30 years if it was brand new, zeroed, and written to once. There's no confirmation bias here: there are millions of pieces of paper that have lasted over 30 years with probably far more punishment than the ideal-condition disk. Just go to your library and I'm sure you'll find plenty of books there that are over 30 years old. Those hundreds-of-years-old Bibles are home runs, and their life span is far greater than how long we think the data on these disks will last.

    34. Re:Concerning... by Anonymous Coward · · Score: 0

      It's also helpful to use a format with a mind towards making it easier to migrate. Simple is better. Comma-delimited ASCII can be imported into almost anything, but it's getting pretty difficult to open an old Quattro Pro spreadsheet, for example.
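
      As a minimal sketch of the "simple is better" point, the following writes a small table as comma-delimited ASCII and reads it back using only the Python standard library. The column names and values are hypothetical, made up purely for illustration.

```python
import csv
import io

# Hypothetical measurements; the columns and values are illustrative only.
rows = [
    {"sample_id": "A1", "temperature_c": "21.5", "mass_g": "0.82"},
    {"sample_id": "A2", "temperature_c": "19.0", "mass_g": "1.07"},
]


def to_csv(records):
    """Serialize a list of dicts as comma-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Parse comma-delimited text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


text = to_csv(rows)
print(text)
print(from_csv(text) == rows)
```

      The header row is what keeps the file interpretable later; units in the column names (as assumed here) save a future reader from guessing.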

    35. Re:Concerning... by Alien1024 · · Score: 1

      Fair enough, it was just one piece of a massive collection of knowledge lost.

      I was just questioning the claim that paper has "been fine for thousands of years".

      While paper (if carefully stored and looked after) is more durable than any digital media invented so far, you can't ignore the many advantages of digital media: extremely easy and quick to copy with no loss of quality, possibility to do that remotely, far less bulky...

    36. Re:Concerning... by Alien1024 · · Score: 1

      paper (if carefully stored and looked after) is more durable than any digital media invented so far

      Except for punched cards and maybe other extremely bulky storage devices, now that I think about it.

    37. Re:Concerning... by mcswell · · Score: 1

      I'm a linguist. In our field, much of the data is irreplaceable: particularly data about languages that are going extinct, where it will be impossible to re-elicit the data. I know of a library that retains copies of every dictionary of African languages that gets published in paper. Those are safe for a century, longer if they've been printed on acid-free paper, and still longer if archival microfilms were made. I also had the opportunity to work on an electronic copy of a dictionary. It had been "archived" on 3 1/2" floppies in the 90s. One of the floppies had since gone missing, and two of the remaining 12 or so were more or less unreadable. That was in the early 2000s; I'm not even sure where I'd go to get a computer with a floppy reader now. Hard drive cable formats change rapidly, too. It's *far* easier for electronic data to go bad than paper data; even CDs and DVDs are not permanent. On the other hand, it's *far* easier to copy electronic data. And for archival purposes, that's what is standardly done: the electronic data is routinely copied over every few years. There's also an issue of data formats; data archived in old formats (old relational databases, for example) may be unusable even if it's still readable. That's largely solvable, by picking appropriate archival formats; XML and Unicode are the current standards for at least lexicographic data. Archival PDF formats also exist, but it's harder to get computer-processable data out of them.
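
      To make the XML-plus-Unicode suggestion concrete, here is a rough sketch using Python's standard library. The element names (lexicon, entry, headword, pos, gloss) are invented for illustration and do not follow any particular lexicographic standard; the single Yoruba entry is likewise just an example.

```python
import xml.etree.ElementTree as ET


def build_entry(headword, part_of_speech, gloss):
    """Build one dictionary entry; the element names are hypothetical."""
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pos").text = part_of_speech
    ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    return entry


lexicon = ET.Element("lexicon")
lexicon.append(build_entry("ọmọ", "noun", "child"))

# Serialize as UTF-8 so diacritics are stored as plain Unicode text.
xml_bytes = ET.tostring(lexicon, encoding="utf-8")

# Round trip: parse the bytes back and read the headword.
restored = ET.fromstring(xml_bytes)
print(restored.find("entry/headword").text)
```

      The point of the round trip is that the archived bytes carry both the structure and the characters, so the data stays processable without the software that produced it.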

  5. In The Future by Anonymous Coward · · Score: 0

    By 2030, there won't be any left! We must act now!

  6. Lifecycle management by FaxeTheCat · · Score: 4, Interesting

    So the institutions do not have any data lifecycle management for research data. Are we supposed to be surprised? Ensuring that data are not lost is a huge undertaking and cannot be left to the individual researcher. It may also require a change in the research culture at many institutions. As long as research is measured by the publications, that is where the resources go and where the focus will be.

    Will this change? Probably not.

    1. Re:Lifecycle management by TubeSteak · · Score: 1, Troll

      Vines is calling on scientific journals to require authors to upload data onto public archives as a condition for publication.

      If authors put their data into the public sphere, people might notice how much of it is fudged.

      --
      [Fuck Beta]
      o0t!
    2. Re:Lifecycle management by ColdWetDog · · Score: 1

      The answer to both problems is to publish everything in the Journal of Irreproducible Results

      --
      Faster! Faster! Faster would be better!
    3. Re:Lifecycle management by N1AK · · Score: 1

      Even more reason for us to want it put there. Publishing research based on falsified information should be a pretty major crime and shouldn't be tolerated. It misleads the public, wastes scientists' time trying to build on it, etc.

    4. Re:Lifecycle management by Anonymous Coward · · Score: 0

      So the institutions do not have any data lifecycle management for research data. Are we supposed to be surprised? Ensuring that data are not lost is a huge undertaking and cannot be left to the individual researcher. It may also require a change in the research culture at many institutions. As long as research is measured by the publications, that is where the resources go and where the focus will be.

      It is very expensive and even if the funding existed, how do you decide what data to keep? The institution can't make the decision. Plenty of data is poorly organized. Students tend to do research using spreadsheets and keep stuff in random places. We try to train them to do better, but they still publish papers before learning those lessons.

      Will this change? Probably not.

      It is changing. The fact that we're having this discussion is a sign it's changing. The best effort is coming from those who fund research. They have the cash and they believe the data has value. Many granting agencies now require that data products promised by a proposal are submitted to a data center, which is funded separately by the granting agency. Self documenting formats (netcdf and hdf) are being developed and used. Change has already been made.

  7. Precisely by Anonymous Coward · · Score: 2, Insightful

    This is bang on. As a system administrator for a STEM department at a Canadian institution, my budget is 0 for data retention. Long term data retention is just not in the mindset of researchers.

    1. Re:Precisely by Anonymous Coward · · Score: 0

      As a technologist for a Biology department at an American institution I see the same thing, and was horrified by it when I was hired on.

    2. Re:Precisely by cold+fjord · · Score: 2

      One of the places that I've worked did various sorts of science / engineering type project work. Quarterly backups of filesystems were archived indefinitely. Even if the data was staying online, at the completion of every project an archive was made of the data on a minimum of two pieces of backup media along with various bits of metadata regarding the media and data. The archival copies were tested by restore and diffed before actually going into the archive. Of course they kept examples of the different tape drives, and sometimes systems, around to use as needed for quite some time.

      Having seen the ugliness of tape drives eating archive media I would be inclined to suggest at least 3 copies.

      --
      much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
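
      The "restore and diff" check described in the comment above can be approximated with checksums. This is a minimal sketch, not the procedure that site used: it compares SHA-256 digests instead of running a byte-for-byte diff, and the file names and helper functions are made up for the demonstration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(original, copies):
    """Map each copy's path to True if its digest matches the original's."""
    expected = sha256_of(original)
    return {str(copy): sha256_of(copy) == expected for copy in copies}


def demo():
    """Create one intact and one corrupted copy in a temp dir, then verify."""
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir)
        original = root / "dataset.bin"
        original.write_bytes(b"raw measurements\n" * 1000)
        good = root / "copy_a.bin"
        good.write_bytes(original.read_bytes())
        bad = root / "copy_b.bin"
        bad.write_bytes(original.read_bytes()[:-1] + b"X")  # flip last byte
        results = verify_copies(original, [good, bad])
        return {Path(name).name: ok for name, ok in results.items()}


print(demo())
```

      Storing the digest alongside the archive also lets a later reader verify a copy without access to the original.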
    3. Re:Precisely by Z00L00K · · Score: 1

      Just increase the disk array size and copy the data as it grows to larger and larger storage systems. Data that's offline is useless.

      --
      If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
    4. Re:Precisely by cold+fjord · · Score: 1

      Some types of work generate enormous amounts of data in relatively short periods. The only way to keep things under control is to generate the data and ruthlessly pare back whatever isn't needed, preferably as you go. Big datasets cost real money to keep online, especially when there are many of them. Data that isn't needed for current work isn't helpful and doesn't need to be online, but you may have to bring it back online in the future. People say that disk is cheap, and it is, until it has to be high performance, highly reliable, accessible 24x7 to large user bases for simultaneous use, backed up, secured, and managed. Various forms of hierarchical storage can help but don't eliminate the issues unless your budget is very robust or your data creation is low paced.

      --
      much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
    5. Re:Precisely by X0563511 · · Score: 1

      I don't think you understand what "archive" means.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
    6. Re:Precisely by Anonymous Coward · · Score: 0

      No surprise to me either. Some people understand the value of data, some don't.

      When I worked in IT at a university I saw some who kept all data to ensure reproducibility and some who overwrote original data every time they made a transformation.

      What worried me the most was the horrible data management around medical research. Of course a lowly worm with a BSc. can't tell a tenured PhD what they are doing wrong.

  8. If... by Bartles · · Score: 1

    ...100% is retained for 2 years, and 17% is lost every year after that, then after 20 years, I get about 3.5% of the data still being accessible, not 20%. WTF? Or did someone lose the data for this study, and the article is really just a guess?
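
    The arithmetic in the comment above can be reproduced with a short sketch. It follows the commenter's reading (a 17% drop in the surviving fraction each year after a two-year grace period); the summary actually speaks of the "odds" dropping by 17 per cent per year, which is not the same quantity, so this is a check of the comment's calculation rather than of the study. The function names are illustrative.

```python
def fraction_remaining(years_since_publication, grace_years=2, annual_loss=0.17):
    """Fraction of data sets still obtainable under simple geometric decay."""
    decay_years = max(0, years_since_publication - grace_years)
    return (1.0 - annual_loss) ** decay_years


def implied_annual_loss(remaining, decay_years):
    """Constant yearly loss rate that leaves `remaining` after `decay_years`."""
    return 1.0 - remaining ** (1.0 / decay_years)


# The commenter's case: 20 years after publication, 18 of them decaying.
print(f"{fraction_remaining(20):.1%}")

# The yearly loss that would leave 20% after the same 18 years.
print(f"{implied_annual_loss(0.20, 18):.1%}")
```

    On this reading, 0.83 to the 18th power comes to about 3.5%, matching the comment, while a 20% survival figure after 18 decaying years would correspond to a loss of between 8% and 9% per year.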

  9. On the bright side... by ron_ivi · · Score: 4, Interesting

    ... poorly collected unreliable data also vanishes at at least the same rate (hopefully faster). And assuming shoddy data disappears faster than good data, then the quality of available data should continually increase.

    1. Re:On the bright side... by Anonymous Coward · · Score: 1

      That's fine, until you want to know (for example) what some star was doing twenty years ago, so you can compare to what it's doing today. Then the archival data is your only chance - and shoddy data is better than no data. (Crudely-drawn ancient-Greek star-maps, for example, have been used to study the motion of stars over periods of thousands of years.)

    2. Re:On the bright side... by Anonymous Coward · · Score: 0

      ... poorly collected unreliable data also vanishes at at least the same rate (hopefully faster). And assuming shoddy data disappears faster than good data, then the quality of available data should continually increase.

      I disagree: it means that we will take the conclusions of the shoddy data at face value.

      "In 1998, 80% of /. posters were female - unfortunately we don't have the original data anymore, but that is to be expected."

    3. Re:On the bright side... by 140Mandak262Jamuna · · Score: 1
      That is a very unreasonable assumption.

      There are many entities with vested interest to keep the data that supports their point of view, or their profit motive or their meal ticket alive. For example data collected meticulously by an underfunded biology professor about the allopatric speciation of the salamanders around the lake hole-in-the-mud would disappear in a jiffy. But flawed research supporting the efficacy of a patented clot busting drug would be perpetuated. Epidemiological studies showing the adverse side effects of the same drug would be hunted down and eradicated.

      History is written by the winners. At least part of the data/research preservation is done by people with vested interests preserving it selectively.

      --
      sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
  10. Aren't there always backup copies... by Anonymous Coward · · Score: 0

    ... at the NSA?

  11. So...? by Anonymous Coward · · Score: 4, Insightful

    I'm a researcher and I don't have time or space to keep old data as I'm generating too much new data. We work hard to maximize the use of these data and analyses when we write and publish papers. If this was talking about the papers (or presentations), that were the product of the data, being lost at this rate it would be one thing, but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities. This just seems like ammunition for the climate change deniers to bitch about. It's unreasonable to keep the old data indefinitely without a massive public repository that will be poorly indexed and organized.

    1. Re:So...? by N1AK · · Score: 1

      You're an AC posting about something not remotely controversial so you're either lazy or lying and I'll take your claim with a pinch of salt on those grounds. I don't think anyone is claiming that keeping the data available is either simple or cheap; but those points don't make it any less important. If the data a paper is based on isn't available then the paper itself loses value because anyone can write a paper showing anything and if they don't need to provide the data then it's much harder to investigate. You are absolutely right that simply having the data available isn't always enough to be able to use it, however we've also seen examples of dubious or wrong mathematical methods being applied to data in academic research so it's important that information on this is available with the results of the research.

    2. Re:So...? by Anonymous Coward · · Score: 0, Insightful

      but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities

      Wow, what a load of patronising bollocks. "My data was so important that I published it in a peer reviewed journal, but nobody else is smart enough to review it."

      It's unreasonable to keep the old data indefinitely without a massive public repository

      The bound experiment notebook that any undergrad worth anything was taught to keep in the pre-computer era is just as reasonable to demand today. YOU make your living with data, YOU learn how to maintain backups, or use the democratic process in your academic institution to get someone else to do it.

      I do absolutely acknowledge that the move away from paper has made this vastly harder. Paper kept in a dry environment takes at least a lifetime to rot, and nearly every adult human in the developed world knows how to read and copy a sheet of paper. Maintaining electronic backup media usually takes far more frequent intervention, and greater expertise - not just with the hardware, but to ensure on-going readability of the data format. This is one of those things where the technologists who are entirely hep with every buzzword of the last 5 years forget that the world's history is just slightly longer, and what seems like the only important set of tools in the world today will be a footnote in history tomorrow.

    3. Re:So...? by rnturn · · Score: 1

      ``the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities''

      Wouldn't documenting your experimental method be part of your job? There's really no reason why raw data should be this mysterious entity that nobody can possibly understand unless they were there when it was collected. IMHO, your results -- whatever they are (I only hope it doesn't have anything to do with a drug that physicians might be prescribing to patients) -- are highly questionable if the experiment cannot be reproduced. On the positive side, at least you admit that your documentation efforts were inadequate.

      --
      CUR ALLOC 20195.....5804M
    4. Re:So...? by mspohr · · Score: 1

      On several occasions I have tried to get data from researchers. Most of them guard their data jealously and will give any number of excuses for not distributing it, including:
      - telling me that I don't have the knowledge or context to properly understand the data
      - fear of me stealing their precious, precious secrets
      - fear of me "misrepresenting" the data
      - (unspoken) fear of me finding problems with their data or analysis
      Unfortunately, most researchers live in a very closed, secretive world and fear exposure. It's unusual for them to allow access to their data. They would just as soon destroy the data and have only their published paper persist as the record of their research.
      Drug research studies are a special case where the FDA has mandated that the data from all studies (not just the ones that have "good" results) be published and available but pharma routinely ignores this rule.

      --
      I don't read your sig. Why are you reading mine?
    5. Re:So...? by the+gnat · · Score: 1

      On several occasions I have tried to get data from researchers. Most of them guard their data jealously

      I should note that this almost certainly violates the terms of publication for most journals, and possibly the terms of their research grants as well. I actually had one professor complain to me that it was "his" data and I had no right to it - conveniently ignoring the fact that he (like me) was being funded by taxpayers (albeit in different countries). My views on this subject aren't particularly radical - I do think scientists should be allowed to keep data private until they publish (or give up) - but any academic researcher with this kind of attitude needs to find a new job.

      For some projects the NIH has gone even further and said that data need to be publicly archived immediately, regardless of publication plans. This is problematic for most fields, but at least the funding agencies are being militant about this.

    6. Re:So...? by Anonymous Coward · · Score: 0

      Maybe slow down and be more thorough.

    7. Re:So...? by bingoUV · · Score: 1

      On several occasions I have tried to get data from researchers. Most of them guard their data jealously and will give any number of excuses for not distributing it

      Did you have any right to the data? Moral / legal / procedural? If publicly funded, most people should have right to the data, but there might be a procedure to access it. I wouldn't blame anyone for establishing a light procedure to bug their scientists.

      --
      Bingo Dictionary - Pragmatist, n. A myopic idealist.
  12. we're in 'perfect' shape again here by Anonymous Coward · · Score: 0

    subject to change like the 'weather' & everything else in time space & circumstance. unperfectness abounds what a gig. free the innocent stem cells

  13. what the hell? by Anonymous Coward · · Score: 2, Insightful

    I think it is ridiculous that Slashdot keeps posting articles that are behind paywalls. How the hell are we supposed to see them? Do you expect us to pay for subscriptions to services we'd only use once? You, OP, are out of your mind. Articles such as this should be rejected as most users, if not all, can't even access the story. This site really has gone downhill in the last few years, overpopulated with clueless simpletons, frauds, so-called armchair IT experts and -obvious- subscription pushing trolls.

    1. Re:what the hell? by Anonymous Coward · · Score: 1

      Here you go, let's just say I'm saving this research paper from being "lost" in 20 years - http://imgur.com/a/ozwBa

    2. Re:what the hell? by Anonymous Coward · · Score: 0

      Most of the articles that have scientific papers also have an account in the general media, usually several with varying degrees of utility. Read those instead. They are generally enough to at least engage in discussion at some level on Slashdot.

      If you actually need the academic paper, then do one of several things: keep searching the internet, including the authors' pages, it may be there or in some less obvious place, or put it on the list to read the next time you go to a research library that has the journal, or pay for it. The fact that you don't read it today doesn't make the information disappear.... at least if you don't wait a couple of decades. It should still be there next week, month, or year.

      Your complaint is difficult to separate from a troll.

    3. Re:what the hell? by Anonymous Coward · · Score: 0

      what are they supposed to do? the article is not available anywhere else, it's not even PRINTED yet. and, otoh, i guess one out of two slashdot readers know someone with access to the paper (a lot of universities have, and since the access there is IP based, you have access to it from every box connected to the university's LAN).

  14. Not the worst of it by Karmashock · · Score: 1

    Many things are based on this data... and when the data is gone it cannot be audited which makes it impossible to verify the finding of the data which is later simply referenced... but the data upon which it is based... *poof*

    This practice also gives free rein to fraudsters because if you don't catch them quickly they can claim the data was just in their other pair of trousers.

    --
    I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
    1. Re:Not the worst of it by serviscope_minor · · Score: 1


      This practice also gives free rein to fraudsters because if you don't catch them quickly they can claim the data was just in their other pair of trousers.

      No, the timespan is 20 years. Within 20 years the results will either be sunk without a trace, disproven or replicated. A fourth option is very unlikely.

      For example, I doubt the original measurements of superconductivity are still around. If they are, they'd be interesting from a historical perspective, but you could replicate the results yourself with some fairly standard cryolab equipment. No one's going to doubt superconductivity because the original results have gone.

      --
      SJW n. One who posts facts.
    2. Re:Not the worst of it by Karmashock · · Score: 1

      In most cases I would agree. However there are some large tables of data that are at least that old that are referred to currently. And they have not been replicated since.

      Further, the tables are themselves not raw data but modified data with the raw data and methodology no longer available.

      --
      I've decided to stop wasting my time responding to AC trolls/sockpuppets... so if you want a response from me... login.
  15. lifecycle management isn't obvious to everyone by Anonymous Coward · · Score: 0

    any organization worth its salt has vast libraries of data that go back many decades. not all 'institutions' are so poorly run that data from 20+ years ago cannot be accessed. must be a 'canadian' thing.. not that we'll know.. since the story is behind a bloody paywall.

    "Will this change? Probably not."

    I have access to vast libraries of data that date back 30+ years, some datasets (this includes computer software too) date back to the early 70s. why? because these institutions/corporations were organized, they knew that retaining data is important and they kept up with technology to ensure that no data is lost. there is no excuse to lose vast amounts of data. the only excuse for not retaining such data that I can think of is cost. the longer you leave datasets rotting away on old tapes, disks and hard drives, the harder it becomes to salvage and finding people who are experts at retrieving data from old media gets harder and more expensive.

  16. GNU it by Faisal+Rehman · · Score: 0

    Publish under GPL license and save it forever.

  17. derp by Anonymous Coward · · Score: 0

    "I'm a researcher and I don't have time or space to keep old data as I'm generating too much new data."

    well if that isn't the silliest thing I've ever read. there's no excuse for not retaining data, no matter how large the sets may be. storage in 2013 is incredibly cheap and there are many different systems, with incredible amounts of storage space, you could use to back it all up on, but I figure this is more of a financial reason than your excuse of 'I'm generating too much data' nonsense.

    "but the raw data isn't usually very useful to anyone without context or knowledge of subtle and poorly documented technicalities"

    I seriously doubt you are a researcher of any kind based on the quote above. It doesn't really matter about the 'context' or 'poorly documented technicalities' as you so elegantly put it. You cannot just assume that if someone were to pick up your data they won't understand the 'context'. that is ridiculous. It's all to do with unorganized researchers/institutions and money.

    "This just seems like ammunition for the climate change deniers to bitch about."

    "climate change deniers". very amusing. if you want your data and research to stand up to scrutiny then keep all your datasets. what have you to hide? are you hiding the fact that you became a climate researcher so can you stick your hand out for free research money while producing data that is laughable? I have a feeling that's the case :)

  18. It's just entropy by Chrisq · · Score: 2

    ... wait what was it again ... it's gone!

  19. I still have the raw data by mocm · · Score: 1

    that I used for my paper 15 years ago. It is on a tape, that is somewhere in a drawer, that I have no tape drive for. On the other hand, the LaTeX file and the C and FORTRAN programs I used to evaluate and create the data and write the paper are still on a hard drive that is running on a computer in my network and I can access it right now. I probably can't compile the program without change (was written for Solaris and DEC machines) and maybe not even run LaTeX on it without getting some of the included styles, but still it is there.
    Since my work was in theoretical physics and numerical, the loss of the raw data is probably not as bad as long as you still have the software, but I guess for an experimental physicist the problems would be much greater to keep the massive amount of data they sometimes have and if lost to reproduce the data.

    --
    ***Quis custodiet ipsos custodes***
  20. is/are by LMariachi · · Score: 5, Interesting

    Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.

    Whichever side of the "data is" vs. "data are" argument one falls on, I hope we can all agree that mixing both forms within the same sentence is definitely wrong.

    1. Re:is/are by Anonymous Coward · · Score: 0

      Dr. Egon Spengler: There's something very important I forgot to tell you.
      Dr. Peter Venkman: What?
      Dr. Egon Spengler: Don't cross the streams.
      Dr. Peter Venkman: Why?
      Dr. Egon Spengler: It would be bad.
      Dr. Peter Venkman: I'm fuzzy on the whole good/bad thing. What do you mean, "bad"?
      Dr. Egon Spengler: Try to imagine all life as you know it stopping instantaneously and every molecule in your body exploding at the speed of light.
      Dr Ray Stantz: Total protonic reversal.
      Dr. Peter Venkman: Right. That's bad. Okay. All right. Important safety tip. Thanks, Egon.

    2. Re:is/are by Anonymous Coward · · Score: 0

      Actually, he is being consistent.

      He says 'data are' throughout. The only thing he says 'is' about is the word 'much'.

      Grammar Nazi needs to learn grammar first.

    3. Re:is/are by LMariachi · · Score: 1

      On the off chance you're reading this: Wrong. Much of these data are... and is thus irreplaceable. You can argue that "is" refers to the "much" rather than the "data" -- it's ambiguous due to the inconsistency -- but "much" definitely refers to the "data," and "much" is singular.

  21. Misleading figure caption by Bazman · · Score: 1

    Some idiot sub-editor wrote a misleading figure caption here. The article (which I've read) says nothing about how data is lost with age. It only says something about how much data is lost for papers of a given age as of now.

    In other words it does not mean that in 10 years time, 10 year old papers will have such drastic data loss. The world 20 years ago was a very different place in terms of communication, scientific practice, and data storage than it was 10 years ago or is now.

    The Slashdot article repeats the fallacy by saying "scientific data disappears". No it doesn't. Some has disappeared, but the paper cannot say anything about whether it is still disappearing.

    Come back in 10 years time for that conclusion.

    1. Re:Misleading figure caption by Anonymous Coward · · Score: 0

      I agree. I also noticed the subject they surveyed was somewhat limited. If I was cynical I would suggest a funding tin was being waggled....

      In all seriousness the public needs to keep a few things in mind.
      1) Staff move
      2) institutions of all types are usually only funded year to year
      3) Not all data loss is a bad thing. Depends on subject, depends on findings.
      4) Not all data is lost; sometimes it is superseded. This is particularly true in molecular biophysics/bioinformatics etc... Ironically, I think climate modelling keeps an archive of their "best guesses"
      5) There is too much data - just about every HEP or sequencing or microarray...

      I will repeat, this article (yes behind a paywall) picked a somewhat narrow subject and as the poster above said "misleading" would capture it.

  22. And this study will be lost as well by Psychotria · · Score: 1

    a) because it's behind a paywall; and b) how can the original data even hope to be located when a majority of the population can't even read the paper?

    1. Re:And this study will be lost as well by Anonymous Coward · · Score: 0

      If it is an academic article published in a peer-reviewed journal, the majority of the population doesn't have the background to understand it anyway. That's why they have press releases or articles written by places (Scientific American, etc.) that specialize in translating research for a general audience.

    2. Re:And this study will be lost as well by Anonymous Coward · · Score: 0

      this is a touch condescending.

      Open Access after 6 months would suffice for many, but any publication worth reading should be interpretable with a graduated scale of knowledge. There is a world of difference in understanding a published field and publishing in it. We all benefit by publications being made available to the widest possible audience.

  23. Simple solution by frrrp · · Score: 0

    Defund the NSA, kick them out of the Utah data center - and do something useful with it. Like giving all the lost data a permanent home.

    --
    smilies are for reetards
  24. Lost forever by The+Cornishman · · Score: 3, Interesting

    > many other data sets are expensive to regenerate...
    Or maybe impossible to regenerate (for certain values of impossible). I remember reading a classified technical report (dating from the 1940s) related to military life-jacket development, wherein the question arose as to whether a particular design would reliably turn an unconscious person face-up in the water. The experimental design used was to dress some servicemen (sailors, possibly, but I don't recall) in the prototype design, anaesthetise them and drop them in a large body of water, checking for face-down floaters to disprove the null hypothesis. Somehow, I don't think that those data are going to be regenerated any time soon. I hope to God not, anyway.

  25. NSA by Anonymous Coward · · Score: 0

    NSA has backups :p

  26. Hmm, is Google working on this? by Anonymous Coward · · Score: 0

    This sounds like the sort of "big problem" Google would love to tangle with, considering their mission statement.

    1. Re:Hmm, is Google working on this? by emj · · Score: 1

      ad revenue on data.. hmph..

  27. Dave...Dave... by MrKaos · · Score: 1

    I'm....losing...my..mind..Dave......Dave....Would you like me to sing a song?

    --
    My ism, it's full of beliefs.
  28. And It Makes Me Wonder by Zamphatta · · Score: 1

    The very fact that "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.", makes me wonder if this could even be considered "scientific data" anymore. Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process. Given that, should the lack of reproducibility mean that lost scientific data should be redefined as experimental data or hypothesis data? It also brings up the idea in my mind that scientific data has a half life since it can degrade back to hypothesis or experimental data if not properly stored.

    1. Re:And It Makes Me Wonder by dj245 · · Score: 1

      The very fact that "Much of these data are unique to a time and place, and is thus irreplaceable, and many other data sets are expensive to regenerate.", makes me wonder if this could even be considered "scientific data" anymore. Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process. Given that, should the lack of reproducibility mean that lost scientific data should be redefined as experimental data or hypothesis data? It also brings up the idea in my mind that scientific data has a half life since it can degrade back to hypothesis or experimental data if not properly stored.

      Completely incorrect! How can you study "how X has changed over time" if you don't have data from other times? It is also impossible in many, if not most, cases to gather such historical data in the present time.

      --
      Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
    2. Re:And It Makes Me Wonder by the+gnat · · Score: 2

      Since the data is unique to a time & place and irreplaceable, it would completely destroy the reproducibility aspect of the scientific process.

      This gets tricky in some fields, however. I work in a field where generating the data is a notoriously difficult and haphazard process, subject to many non-experimental variables, such that the use of a different pipette or stock solution can make the difference - or even just the speed of the researcher's manual labor. Temperature and humidity play a role too, and these are not as precisely calibrated as one might like. So if an experiment is performed at 8pm on a Saturday night by a grad student in Colorado, there is no guarantee that a postdoc in Singapore will be able to do the same thing based on reading the paper. (Actually, from past experience, there's no guarantee that the original experimenter will be able to reproduce it either!) But the data may be just as good, and they're difficult to fake, and they're deposited in a public database. Since everyone in the field is accustomed to the complexities of the process and we have decent archival policies, this problem is accepted as a fact of life.

      I am quite certain that some of my (published) results from grad school would be difficult at best to reproduce exactly. I stand by my data - and am happy to share them - but it is always troubling to wonder if someone else in a different environment would have reached different conclusions.

    3. Re:And It Makes Me Wonder by Zamphatta · · Score: 1

      But you can't study "how X has changed over time" if you don't even have the original data that you'd be comparing it to?

      Still, that's not really my point. I'm saying that without the original data (and remember this is data that cannot be gotten again even with effort), one cannot re-do the study and see if the results are reproducible. Therefore, the entire scientific process is impossible with studies that have lost & irretrievable data.

    4. Re:And It Makes Me Wonder by Zamphatta · · Score: 1

      If I could mod that up, I would. Very interesting & insightful.

  29. So will you host it? by Anonymous Coward · · Score: 0

    And keep proven logs to show it is not tampered with?

    Will you pay for the researchers to keep the data forever? Will you insist that they stop researching anything new because the data storage exponentiates and the old stuff will need moving to new media, checking and eventually more work goes into looking after the media than on archiving new stuff?

    Will you accept higher taxes to pay for this, and taxes that increase year-on-year exponentially to cover it?

    No?

    Then you're going to "lose" data.
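
    For what it's worth, the "proven logs" part is the cheap bit. A minimal sketch of a tamper-evident log, assuming a simple SHA-256 hash chain where each record commits to the one before it. The record layout and the function names are hypothetical, and this is only one way of doing it, not a description of what any archive actually runs.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash, payload):
    """Hash the payload together with the previous record's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a record whose hash covers its payload and its predecessor."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload,
                "hash": _digest(prev_hash, payload)})
    return log


def verify(log):
    """Return True only if every record matches its hash and links correctly."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if _digest(prev_hash, entry["payload"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, {"file": "run_001.csv", "note": "hypothetical deposit"})
    append_entry(log, {"file": "run_002.csv", "note": "hypothetical deposit"})
    print("intact:", verify(log))
    log[0]["payload"]["note"] = "quietly edited"
    print("after edit:", verify(log))
```

    Someone who can rewrite the whole log can also recompute every hash, so the latest hash would still have to be published somewhere independent. The storage, media migration and curation costs above are untouched by any of this.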

  30. Why must you have their data? by Anonymous Coward · · Score: 0

    Reproduction of results isn't "add the numbers that they produced to see if they sum to the value they said it did". That isn't replication of science.

    Since the science is supposed to be repeatable and the paper (if valid science not pseudoscience bollocks) contains enough information to do the assessment again (e.g. as a patent is supposed to), then you MUST consider it BETTER to re-do the experiment again and collect your OWN data and see if the data fits the result of the previous paper.

    What if, for example, there was a bias on the original potentiometer, making all voltages appear different from what they are? The result would be WRONG, but your method of "redoing the experiment" would NEVER show this. Doing the experiment again and producing your OWN data would.
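
    A toy illustration of the point, with entirely made-up numbers: a true value of 5.00 V read through an instrument with an unnoticed +0.30 V offset. Re-analysing the archived readings reproduces the published figure exactly, offset included, while a fresh measurement on a different instrument does not. Every name and value here is hypothetical.

```python
from statistics import mean

# Hypothetical values for illustration only.
TRUE_VOLTAGE = 5.00   # what the quantity really is
OFFSET = 0.30         # systematic bias of the original instrument

# Readings archived with the original paper (bias baked in).
original_readings = [TRUE_VOLTAGE + OFFSET + noise
                     for noise in (-0.02, 0.01, 0.00, 0.03, -0.02)]

# Readings from an independent replication on an unbiased instrument.
independent_readings = [TRUE_VOLTAGE + noise
                        for noise in (0.01, -0.03, 0.02, 0.00, 0.00)]

published_mean = mean(original_readings)
recomputed_mean = mean(original_readings)      # "checking the sums" on the same data
replicated_mean = mean(independent_readings)   # actually redoing the experiment

if __name__ == "__main__":
    print(f"published:  {published_mean:.2f} V")
    print(f"recomputed: {recomputed_mean:.2f} V")
    print(f"replicated: {replicated_mean:.2f} V")
```

    The recomputed value can only confirm the arithmetic. The disagreement with the replicated value is what exposes the instrument problem.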

    1. Re:Why must you have their data? by n1ywb · · Score: 5, Interesting

      No but it is amazing what NEW science you can do with OLD data. I've worked with the Transportable Array project for example http://www.usarray.org/researchers/obs/transportable it's over a decade old and scientists are still discovering new ways to take advantage of the data and will likely be doing so for decades to come. On the other hand a lot of data is just junk due to poor quality metadata; when was that instrument calibrated? I dunno. Damn. At least in geophysics we have the National Geophysical Data Center to curate this stuff http://www.ngdc.noaa.gov/ at least until Congress cuts its funding.

      --
      -73, de n1ywb
      www.n1ywb.com
    2. Re:Why must you have their data? by 0123456 · · Score: 1

      Reproduction of results isn't "add the numbers that they produced to see if they sum to the value they said it did". That isn't replication of science.

      1. The data may not support their results. Without it, you can't verify that.
      2. The data may be, let's say, 'adjusted' to give better results without admitting it. You may be able to show that by statistical checks, but you can't do that without the data.

      Yes, you could completely re-do the experiment, but a) it may be historical data which can't be measured again (e.g. deceleration of a space probe from 1980 to 2000) and b) that may massively increase your costs.

    3. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1
      This.

      "1. The data may not support their results. Without it, you can't verify that. 2. The data may be, let's say, 'adjusted' to give better results without admitting it. You may be able to show that by statistical checks, but you can't do that without the data."

      And lots of people didn't seem to understand or care that this is why others caused an uproar when "original data" went missing from UEA and CRU right around the time of "climategate".

      Without the original data, there is no way to reproduce the science to see if it was done responsibly. Without pointing fingers at anybody in particular, we know that in at least some cases, it is not.

    4. Re:Why must you have their data? by Anonymous Coward · · Score: 0

      I think the general points are that there's a cost to data storage and that data quality is important. An argument's been made (too lazy to find citation) that data quality can suffer from non-use, in other words it becomes stale over time as it becomes divorced from context in which it was collected. I used to argue w/one of my managers about getting rid of old data we'd collected that we never used nor would ever likely use. He couldn't see the cost in storing it.

    5. Re:Why must you have their data? by khayman80 · · Score: 2

      "Any independent researcher may freely obtain the primary station data. It is impossible for a third party to withhold access to the data. Regarding data availability, there is no basis for the allegations that CRU prevented access to raw data. It was impossible for them to have done so." [Muir Russell Review, p48,53]

    6. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1
      CRU website:

      "We are not in a position to supply data for a particular country not covered by the example agreements referred to earlier, as we have never had sufficient resources to keep track of the exact source of each individual monthly value. Since the 1980s, we have merged the data we have received into existing series or begun new ones, so it is impossible to say if all stations within a particular country or if all of an individual record should be freely available. Data storage availability in the 1980s meant that we were not able to keep the multiple sources for some sites, only the station series after adjustment for homogeneity issues. We, therefore, do not hold the original raw data but only the value-added (i.e. quality controlled and homogenized) data."

      Source: www.cru.uea.ac.uk

    7. Re:Why must you have their data? by khayman80 · · Score: 2

      That's why it was "impossible" for CRU to have withheld access to the raw data. Because they didn't collect it in the first place. Anyone who was actually interested in the data could always have gotten them from the same sources that CRU did.

    8. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      "That's why it was "impossible" for CRU to have withheld access to the raw data. Because they didn't collect it in the first place. Anyone who was actually interested in the data could always have gotten them from the same sources that CRU did."

      I didn't claim that it was withheld. I merely stated that it was missing.

      Further, initially others could NOT access that data, because National Meteorological Services in various countries refused to release the data to anyone else.

      Granted, that situation has been largely fixed, but it WAS the situation when the "uproar" over the data was originally taking place. And without access to that data, there was simply no way to evaluate the quality of CRU's work.

      According to the record, it is only because some people made a big stink about the original data, that it is available now at all.

    9. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      I will correct myself, however: the phrase "went missing" was probably not the right one to use.

      For a while there was a perception that original data was "missing", but as you correctly point out, it was uncovered that most of the original data could (later) be obtained from the original sources. But it wasn't without a bit of a struggle with some of those sources.

    10. Re:Why must you have their data? by khayman80 · · Score: 1

      Years ago, I explained in excruciating detail that this played absolutely no role in evaluating the quality of CRU's work because the majority of data in CRU's dataset "are derived from the same freely-available raw data sets used by NOAA and NASA." The Muir Russell review reproduced the necessary code in two days without any help from CRU.

      And, of course, this isn't CRU's fault because “the authority for releasing unpublished raw data to third parties should stay with those who collected it.” Oddly, many people seem to ignore this point and blame CRU.

      By the way, I debunked the misinformation that you and Lonny Eachus were spreading about Cowtan and Way 2013. Feel free to retract your misinformation (or double down on it) here. Lonny Eachus is welcome to do the same, but for some reason he never replied.

    11. Re:Why must you have their data? by khayman80 · · Score: 1

      it was uncovered that most of the original data could (later) be obtained from the original sources

      I didn't notice this comment before I wrote mine, otherwise I'd have been forced to correct this incorrect claim too. Again, the majority of data in CRU's dataset "are derived from the same freely-available raw data sets used by NOAA and NASA." Most of the data was already in the public domain, which is why the FOIA blizzard against CRU was so hysterically pointless.

    12. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      That's a straw-man. A really great straw-man, but a straw-man nevertheless.

      Repeat: access to the RAW DATA was NOT available. Only data that has already been "massaged" (to an unknown degree) was available before the "official" release, and that release was prompted by complaints about this very (and very valid) issue.

      July 2011, and 5,113 weather stations, to be more precise, in that particular release. Even then, some countries were holding out. (Most notably Poland.)

      Whether the Muir Russell review managed to come up with similar results is irrelevant to the point being discussed here: the fact that access to original data is vital to verifying and reproducing results.

      The fact that results might have been reproduced in one (or however many) cases makes no difference to that point whatever.

    13. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      "Most of the data was already in the public domain, which is why the FOIA blizzard against CRU was so hysterically pointless."

      I agree with you that much of the data was already in the public domain. However, CRU could have avoided the FOIA requests if they'd simply handled things in a professional, reasonable manner, as opposed to one that was blatantly arrogant and dismissive.

      They needlessly pissed a lot of people off. When you do that, you should not expect them to not piss you off in return.

    14. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      Just to avoid an argument over something I'm NOT saying, I would like to just clarify my point again:

      1. Correctly, or incorrectly, there was a perception that data was missing or being withheld.

      2. The importance of original data, which was perceived to be missing, was why people were raising a stink over it.

      I'm not trying to say data was actually "missing", but it is true that some of it was not available. And CRU's documented attitude regarding requests about it contributed to an atmosphere of distrust.

    15. Re:Why must you have their data? by khayman80 · · Score: 1

      access to the RAW DATA was NOT available

      Previously, you could have used your ignorance as an excuse. Now you're just lying. And apparently neither you or Lonny Eachus have enough intellectual integrity to retract your latest steaming pile of civilization-paralyzing misinformation. This flood of misinformation isn't just staining "Jane Q. Public's" sock puppet legacy. It's also staining Lonny Eachus's real human legacy. Please stop.

    16. Re:Why must you have their data? by bbsalem · · Score: 1

      Unless the science is an historical science, meaning that the data source is an historical artifact that can be lost. An example that comes to mind is a fossil collection. Consider how few fossils the human lineage is based on. It is quite small; losing any of that material could be damaging to revisiting the scientific reasoning behind the resulting phylogeny. Science is not about the result, ever, it is about the steps taken to get the result, and quite often the steps have to be repeated to refine or modify the resulting model, and for historical science, that means preserving and revisiting historical artifacts.

      Even in physical science there are historical artifacts. Whenever a nova blows up, especially close to us, astronomers go back and look at old images of the region of the sky to look for the progenitor star, which they often find, and sometimes the information is on fragile glass plates, so someone had to take care of them.

    17. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      "Previously, you could have used your ignorance as an excuse. Now you're just lying. And apparently neither you or Lonny Eachus have enough intellectual integrity to retract your latest steaming pile of civilization-paralyzing misinformation. This flood of misinformation isn't just staining "Jane Q. Public's" sock puppet legacy. It's also staining Lonny Eachus's real human legacy. Please stop."

      I'm "lying"? WTF?

      That's straight from EAU's own website!

      Further as I wrote elsewhere, all this is STILL irrelevant to the point I was trying to make. It was YOU who wanted to argue about it. Well, suck it up, read the goddamned article from EAU's own website, and stop accusing people of "lying" when they're pointing you to clearly documented facts.

      I really don't think I -- or that other person -- have anything to worry about, from simply telling the truth.

    18. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      And here is the announcement of the release of that data, direct from the Met Office. Note that the date given for the release is July, 2011.

      You can download the data yourself HERE, compare it to previous HadCRUT data that was available, and see what information is new in this release. If you count, you will find approximately 5,000 weather stations that weren't in previously-released data.

      Met Office Announcement of new data release.

    19. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      And here's another source, if for some reason you don't like your own:

      OK, climate sceptics: here's the raw data you wanted

    20. Re:Why must you have their data? by khayman80 · · Score: 1

      Again: "Any independent researcher may freely obtain the primary station data. It is impossible for a third party to withhold access to the data. Regarding data availability, there is no basis for the allegations that CRU prevented access to raw data. It was impossible for them to have done so."

      Your continued attempts to smear CRU while refusing to retract your latest misinformation are noted. Since you and Lonny Eachus keep spreading misinformation which threatens the future of our civilization, I have no choice but to keep debunking you and Lonny Eachus. Stay tuned.

    21. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      "Your continued attempts to smear CRU while refusing to retract your latest misinformation are noted. Since you and Lonny Eachus keep spreading misinformation which threatens the future of our civilization, I have no choice but to keep debunking you and Lonny Eachus. Stay tuned."

      WTF are you talking about? I did no such thing.

      I WROTE EARLIER, as others can clearly read for themselves, that I was NOT accusing them of "withholding" data. What I wrote was that it was not available, but I did not -- even once -- claim in this exchange that it was being "withheld" on purpose.

      Your repeated accusations that I have done things that I have in fact provably NOT done is exactly WHY I thought -- and still think -- you're such a flaming, large-bore asshole. And the fact that you do it whenever somebody shows you to be wrong amplifies my opinion manyfold.

    22. Re:Why must you have their data? by khayman80 · · Score: 1

      ... this is why others caused an uproar when "original data" went missing from EAU and CRU right around the time of "climategate". ... there was simply no way to evaluate the quality of CRU's work. access to the RAW DATA was NOT available. Only data that has already been "massaged" (to an unknown degree) was available before the "official" release, and that release was prompted by complaints about this very (and very valid) issue. ... access to original data is vital to verifying and reproducing results. ... CRU could have avoided the FOIA requests if they'd simply handled things in a professional, reasonable manner, as opposed to one that was blatantly arrogant and dismissive. They needlessly pissed a lot of people off. When you do that, you should not expect them to not piss you off in return. ... I'm not trying to say data was actually "missing", but it is true that some of it was not available. And CRU's documented attitude regarding requests about it contributed to an atmosphere of distrust. ...

      Jane Q. Public, please use your feminine voice to tell Lonny Eachus that when he finds himself deep in a hole, he should use his masculine strength to... stop digging.

    23. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      I have not the slightest idea what you're talking about.

      I mentioned that some data was "not available" at first. Then I proved it. (It's straight from the Met Office's own website, and I cited another reliable source as well.)

      I no nothing of this "hole" you refer to or any of those other things you're ranting about.

    24. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      And I will amend the previous comment to summarize EXACTLY what went on here:

      I used the unfortunate phrase "went missing". I should have written "was perceived to be missing". I recognized this and corrected myself.

      But the facts, according to both EAU and the Met Office and New Scientist magazine -- which I firmly established later with the citations I provided above -- are these:

      (1) Data from a full 5113 weather stations used in CRU statistics were not available to others at the time. The claimed reason the data was not available, which I have no reason to doubt, was that it was proprietary information from Meteorological Services that provided the information to EAU and CRU on the condition of confidentiality.

      (2) According to EAU's own statements, (which I showed you and which are still online), this "raw" data was not kept by EAU. They only preserved data that had already been manipulated. They claimed storage space was the reason. (Which may be true but I don't know and I don't care.)

      (2) Because of the stink that many people raised over the unavailability of this data, the Met Office (and possibly EAU as well) decided to negotiate with those sources so the data could be released. It even released some of the information in spite of the objections of the sources (Trinidad and Tobago). Poland held out, and flatly refused to release data.

      (3) The result of all this was that the so-called "missing" data was released by the Met Office... some time after the release of the other HadCRUT data. (More specifically, in July of 2011.)

      Now... I don't know where you think there is a "lie" in any of this, considering that I showed you statements from EAU and the Met Office that say these things, and a link to a New Scientist article that further backs them up. But I did not assert, anywhere in this thread, that CRU or EAU were "deliberately withholding" this information from others. I only stated that it was not available.

      Q.E.D.

      So if there is any "hole" here, it is on your side. I haven't the foggiest idea what this stuff is you're blathering on about.

    25. Re: Why must you have their data? by Anonymous Coward · · Score: 0

      Nobody will be surprised that you "no nothing".

    26. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      Meh. Numbering got off during editing. Points should have been numbered 1 through 4. And in the prior comment "no" should have been "know".

    27. Re:Why must you have their data? by Jane+Q.+Public · · Score: 1

      And one more thing:

      While Jane Q. Public is obviously a pseudonym, I can (and shall) use any pseudonym I want, when and how I want. As for this other person you think I am: I find that pretty laughable, but almost certainly not for any reasons you think.

    28. Re:Why must you have their data? by khayman80 · · Score: 1

      To nobody's surprise, Jane "pulled a Jane" again. Retracting the two words "went missing" ignores all your other baseless smears, which I helpfully listed here. It's strange that you say I think you are another person. Anyone who reads this thread can confirm that I never said any such thing.

  31. So you have that raw data, archived, yes? by Anonymous Coward · · Score: 0

    I mean, I would like to check that you do find more data, so you have it, right? The raw data?

    Is it torrented?

    And the programs for manipulating, are they available too? And the results from it?

    That makes 2x as much data you have.

    Of course, if I reanalyse it, if I have any data, I now must archive it. 3x.

    If anyone else wants to recreate it... 4x

    Alternatively, for the cost of 20 years storage, it may be possible to redo all the measures with UAVs and nanobots in future. And for less cost than 30 years storage...

  32. The Rosetta Project: building a 10000 year library by QilessQi · · Score: 3, Interesting

    The Long Now Foundation has devised an interesting mechanism for storing important information which, although not optimal for machine readability, is dense and has an obvious format: a metal disk etched with microprinting, whose exterior shows text getting progressively smaller as an obvious way of saying "look at me under a microscope to see more":

    http://rosettaproject.org/

    I highly recommend reading The Clock of the Long Now if you're interested in the theory and practice of making things last.

  33. Given the responses you've got by Anonymous Coward · · Score: 0

    Given the responses you've got, looks like you nailed it.

    "Oh, how arrogant!" from one poster. From another "You don't know anything". Isn't that second one arrogant?

    Note too how they claim you don't know what you're talking about but even insist they have no method to know better.

    So looks like you nailed it.

  34. There's a bigger problem by onyxruby · · Score: 1

    Before science gets hot and bothered about the loss of data scientists need to do something about the quality of the data they produce to begin with. Frankly given the complete lack of quality controls that a lot of scientists use the loss of their data is probably for the best. Depending on the field as much as 60% of all scientific research cannot even be reproduced. Work that cannot be reproduced by another team is far from isolated to one field either:

    http://online.wsj.com/news/articles/SB10001424052970203764804577059841672541590
    http://www.popsci.com/science/article/2013-05/half-cancer-scientists-have-been-unable-reproduce-studies-survey-finds
    http://www.slate.com/articles/health_and_science/science/2012/08/reproducing_scientific_studies_a_good_housekeeping_seal_of_approval_.html
    https://www.xsede.org/gateways-for-open-science
    http://www.eusci.org.uk/articles/data-doesnt-lie-scientists-do

    Depending on the study that means that either the data has been fabricated by unethical scientists, or the data has been misrepresented for political purposes. Studies are often improperly interpreted by failing to take into account sound statistical modeling and noise is reported as science. In some fields politics have effectively taken over (e.g. social sciences) and standards are used that would never be tolerated in other scientific fields.

    The very culture of science that demands quantity over quality needs to change, as it is the rat race that inspires junk science to begin with. I can't think of any other field where those kinds of failure rates about the reproducibility of your work would do anything other than get you fired for fraud and destroy your career. I like science, I have since I was a young child, but the junk we're getting labeled as science doesn't deserve the label.

  35. except... by Anonymous Coward · · Score: 0

    Except that as mentioned in TFA, many data sets are unique to a time and place, and thus can never be replicated. They may for example reflect the social temperaments, behaviors, or material/physical qualities of a particular population at a particular point in time.

  36. How much storage space is that? by Anonymous Coward · · Score: 0

    When you say "put the data into the public", how much storage space and how does it get there?

    Will you pony up storage and taxes for this?

    Will you ask the same of the "private" data of corporations that rely on government largesse to exist?

    And when it's passed to the public, on a thousand servers, how do you know if the one you happened to get to first is genuine or been fudged by someone with an agenda against the science? Do you think AIG would mirror honestly the genetic proof of evolution?

    And at what point is it no longer the science institution's requirement to pass this data to the public? Because until then, you'll still need to pay for that access and storage. Then what's to stop every public copy being deleted because nobody cares any more? "The public" won't change storage media and verify contents forever, you know.

    It's very easy to claim as you have done, but what do you mean by it?
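
    One common answer to the "genuine or fudged" question is for the originating institution to publish a cryptographic digest of each data file, so that anyone holding a mirrored copy can check it. The sketch below is only an illustration under that assumption: the function names and the sample CSV row are hypothetical, and it says nothing about how the digest itself would be distributed or trusted.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """True if data hashes to the digest announced by the original publisher."""
    return hmac.compare_digest(sha256_hex(data), published_hex.strip().lower())


# Hypothetical data file and the digest its publisher would announce.
original = b"station_id,month,temp_c\n001,1991-01,-3.2\n"
published = sha256_hex(original)

# A mirror that altered a single value no longer matches the digest.
tampered = original.replace(b"-3.2", b"-2.2")

print(matches_published_digest(original, published))
print(matches_published_digest(tampered, published))
```

    This only detects changes; it does not stop copies from being deleted, which is the other half of the question above.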

  37. Best legitimate use of P2P by naasking · · Score: 1

    Universities should band together to distribute all data from published material on P2P networks so it's redundantly stored at multiple locations. This has the side-benefit of making a legitimate use of P2P obvious.
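
    A rough way to see why redundancy helps: if each copy were lost independently with probability p over some period, the chance that every one of n copies is gone would be p to the power n. The sketch below uses the headline's 80% figure purely as an illustrative p, and the independence assumption is a strong one (real copies tend to be neglected for the same reasons at the same time), so treat the numbers as an upper bound on the benefit.

```python
def prob_all_copies_lost(p_single: float, copies: int) -> float:
    """Probability that every copy is lost, assuming independent losses."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


# Illustrative only: p = 0.8 is borrowed from the headline, not measured per copy.
for n in (1, 2, 5, 10):
    print(f"{n:2d} copies: {prob_all_copies_lost(0.8, n):.3f}")
```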

  38. Odd coincidence... by rnturn · · Score: 2

    Some years ago I picked up a copy of "Dark Ages II -- When the Digital Data Die" by Bryan Bergeron (2002) but only now have gotten around to finishing it (for some reason I never got past the first chapter at the time). When I bought it I had just had my own experience with the not-so-long life of digital data (some CDs I'd burned a few years earlier were already unreadable). The book's a bit dated (it says that there are many people out there with Zip drives connected to their PCs) as, obviously, technology marches on, leaving older media in the dust, but that's the point of the book and the ideas are still relevant. Worth looking for at your public library if you're still of the mind that a digital format is superior to everything else for long-term storage. Personally, I think we're looking at trouble if everything's converted to bits thinking that it'll always be available. Continued access to one of those aforementioned 8" CP/M floppies is a good example. My failed CD-Rs are another.

    --
    CUR ALLOC 20195.....5804M
    1. Re:Odd coincidence... by X0563511 · · Score: 1

      Personally, I think we're just fine if everything is converted to bits and we remember that there is no guarantee that a set of bits might not be damaged or lost.

      Just like books - there's no guarantee those won't fall victim to water damage, or fire, etc. You have to take care of it, and guard against the applicable failure modes. With digital data this is just as possible, but the techniques are as different as the failure modes.

      --
      For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
  39. The InterPARES Project by cold+fjord · · Score: 1

    The InterPARES Project

    The International Research on Permanent Authentic Records in Electronic Systems (InterPARES) aims at developing the knowledge essential to the long-term preservation of authentic records created and/or maintained in digital form and providing the basis for standards, policies, strategies and plans of action capable of ensuring the longevity of such material and the ability of its users to trust its authenticity. The findings and products of the first three phases of the project can be found on this website.

    Out of mind, out of sight, gone forever

    --
    much of left-wing thought is a kind of playing with fire by people who don't even know that fire is hot - George Orwell
  40. not just scientific data by larry+bagina · · Score: 1

    slashdot used to purge -1 and 0 rated comments from old stories. "So what?", you say. "Why should they store goatse links and ascii art penises?" But before the misnamed lameness filter, there was a vibrant troll culture. These were works of art that spawned adequacy.org and had a lot of time, creativity, and effort put into them. Much more interesting than the "linux good, microsoft bad" groupthink that made it to +5 informative and wasn't purged.

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

    1. Re:not just scientific data by ron_ivi · · Score: 1

      Interesting and sad.

      Those links would probably also be great for NSA & Google like data mining to monetize those users too.

  41. Need extremely fine, fine print by retroworks · · Score: 1

    As a former paper industry professional (recycled pulp), I'd say paper is fine, except that people limit its use to readable font sizes. That is what led to microfiche (which is now being dumped by the truckload at recycling stations as "obsolete tech"). If you printed a hard copy of everything either to microfiche or in an extremely small 1-point font, you could store the data in a type of seedbank or gene bank.

    A salt mine may not be appropriate, but I'd like to start a business where everyone could send their hard drives to a giant 100 year Time Capsule Vault in the Sonoran desert. We are shredding retired professors' hard drives which the professors probably would prefer to see preserved. The "half life" of privacy risk is different for different data... experiments, emails, credit card numbers, and porn browsing cookies are not posing the same posthumous risk/benefit. We are cremating too many of our future fossils.

    IMHO the biggest threat to raw data is misplaced or randomized fear of privacy combined with copyright planned obsolescence (or mandated "e-waste" shredding for working tech, out of fear that poor people will misuse a display device). Certain data does need to be destroyed, and certain papers shredded. Treating all "data" as having the same expiration date has something to do with the loss of the data in the article.

    --
    Gently reply
  42. WE ARE LIVING IN A FUTURE DARK AGE by TheRealHocusLocus · · Score: 2

    [OP] "disappearing into old email addresses and obsolete storage devices, a Canadian study (abstract, article paywalled) indicated"

    Well so much for the study. Money changes everything. Eventually one hundred thousand copies of the abstract will exist on the Internet, but the authors' future descendants will find only one actual link that leads to content, which terminates at a page saying "this domain is for sale".

    You'd think that even science data of extremely low bit rate such as original weather station temperature data should be out there somewhere. A lot of other people did too... but all that is available now might be "value added" adjusted data. Not an evil conspiracy per se, it's human nature at its best and worst.

    A handy chronology of the history of data retention:

    [2500BC] "King Fuckemup boldly slew the enemy and I, Scribe Asskissus hath inscribed it in stone. He is an asshole who owes me back wages."
    [1500] "With quivering quill I will write mine own data."
    [1866] "Data published at great expense into leather-bound volumes. Dust sold separately."
    [1970] "This is really important. we should print it and store it in a binder."
    [1971] They didn't.
    [1983] "I'll write it to floppy disk with a notsosticky label"
    [1985] "After a long and desperate search, the label has been found!"
    [1987] "Unlabeled floppy disk keeps coffeemaker level."
    [1995] "Roxio CD storage is forever, and Real Scientists don't close their data sessions."
    [2003] "Microsoft Word has experienced a problem updating from an older document format and will now close. Save your work as soon as possible."
    [2005] "I'll just email it to myself and shut the computer off immediately, then pick it up at work."
    [2009] "Yes, three copies! In the safe. There was a fire. Yes, inside the safe. It was a fireproof safe, so no one noticed."
    [2010] "This is really important. I should print it and store it in a binder. But my ink cartridge is dry."
    [2013] "Our data has been uploaded to the Cloud where it will live forever."
    [2500] "King Grapeape slew the primitive humans and buried their statue on the beach. I, Scribe Anthopoapologus hath inscribed it in stone."

    Perhaps the most mystifying data retention escapade of Modern Times is the missing Apollo 11 SSTV moon tapes which contained a multiplexed stream of raw telemetry and the original slow-scan TV signal broadcast from the moon. Not 'missing' really, rather we know they were re-used and recorded over because everyone assumed it was someone else's job to ensure that at least one copy was in a safe place. While the earth station operators dutifully sent their tapes to NASA where the sharpest signal of the moon landing was sure to be preserved for posterity (not), fortunately there were some librarians on duty, and you can acquire DVDs of the moonwalk with better quality than the recordings you've seen in countless movies -- an 8mm film camera pointed at an original SSTV monitor at Honeysuckle Creek, and the best quality scan-converted version.

    In the Foundation series, Asimov envisioned Gaia, a world in which a telepathic network of sentient (and sensuous) beings kept a 'working set' retrievable data in-memory -- but also via access to progressively less and non-sentient objects, such as plants and even rocks -- a vast archive. Ask the mountain, it will answer in time, a long time.

    Our own Earth has a Gaia storage mechanism, a record of its magnetic field over geologic time stored as polarization in crystallized lava flows. But it i

    --
    <blink>down the rabbit hole</blink>
    1. Re:WE ARE LIVING IN A FUTURE DARK AGE by Anonymous Coward · · Score: 0

      Now universities have a department whose job it is to keep the dust off. They're called the "information technology" department. We may have lost some data over the last 20 years, while how society dealt with data was in flux, but now we have a new profession whose job is maintaining data, and we have new standardized ways to save and share data. As time passes we'll become better and better at maintaining data using our new tools, which are much better than the old tools we had.

  43. superconductors interesting example by Anonymous Coward · · Score: 0

    I once read a story, don't know if it's true, that the team that discovered those Yttrium-Barium-Copper oxide high-temperature superconductors had made a silly mistake in sending their very important breakthrough paper for peer review: they had (by accident of course) changed every Y for Yttrium into Yb for Ytterbium.

    The peer review process took *ages*. Eventually the paper was accepted. A quick erratum "change Yb to Y everywhere. oops. our secretary made a typo."
    The Nobel prize the very next year!!



    Meanwhile, several large competing labs in the world had been buying Ytterbium like there was no tomorrow and writing articles about experiments in superconductivity with Ytterbium (which doesn't work) ;-)

  44. The NIDDK was aware of this years ago. by guru42101 · · Score: 1

    The NIDDK was aware of this years ago and had commissioned a feasibility study on creating a storage mechanism that all grant paid research would have to use. Unfortunately after a successful feasibility study the reviewers for the follow-up real grant responded with "I do not see the scientific value of this research" and the grant went away with Vanderbilt as the only applicant. I've heard through the grapevine that someone picked up a new similar grant to work on it, but I haven't seen anything from it yet. The big problem is that researchers do not want to share their unpublished research. From what I've gleaned they want to keep things in their back pocket for future grants/publications.

    The site was http://dkcoin.org/

  45. thank elitism by Anonymous Coward · · Score: 0

    And paywalls and the overall exclusivity-oriented nature of academia are to blame for this.

    When you do stuff in the open and share it, it's (at least in our current information age) immortal.

    When you're a prick about it. It's lost. And most of academia is composed of pricks.

  46. how this happens by Goldsmith · · Score: 3, Informative

    Our scientific research system is built around the process of joining a lab, mastering the work there, and then leaving. There are very few long term research partnerships. The people who stay in place are the professors, who generally do not do the research work.

    So you join a lab, produce a few terabytes of data a year, pull a few publishable nuggets out of that and then leave. I have a few backup hard drives that move around with me with what I consider my most important data, probably total 1/10 of the data I have taken. After a few years, this data is really unimportant to me as the labs I have left have done a good job of continuing the research and I have to spend my time and money on something else.

    The original data is eventually overwritten by researchers a few "generations" removed from me and that's the end of it.

  47. Yes and? by Meeni · · Score: 1

    How is that different from the previous state of affairs?

    Before digital age, Scientists would have work booklets that would get lost or destroyed when they change job, or when they become too numerous.

    Drowning in an overflow of data is about as useful as having no data at all. It could be argued that forgetting is actually a good thing that puts forward important matter, those that we care to keep because they are valuable. Sure, some valuables get lost in the process, but anyway, who would go sort through all data they ever generated, even if they had them available forever?

  48. Doesn't matter how much that is by Anonymous Coward · · Score: 0

    You can still do just as much with NEW data as you could with OLD data; you just have to pay to collect it, which is a cost. But then again, you're not going to chip in on someone else's costs so you can potentially save something later, are you?

    NGDC cuts are because so many merkins are 100% anti-tax.

    Storing the data costs.

    You cut funding, they have to cut costs.

  49. Magnificent Desolation - Behind the scenes by RogueWarrior65 · · Score: 2

    When John Knoll (yes, THE John Knoll, co-creator of Photoshop and VFX wizard extraordinaire) wanted to reproduce the Apollo moon landing in CG he ran into a small problem. He went to NASA to obtain the telemetry data for altitude and orientation but apparently the data had been tossed a long time ago. However, he was able to find physical prints of graphs of the telemetry channels. So he scanned them in, made them an underlay in a 3D modeling program, and painstakingly traced them by hand in order to extract the data. The results can be seen in Magnificent Desolation Apollo 15 landing sequence. And BTW, that's his modeling work for the lander too.

  50. Destruction policies by Anonymous Coward · · Score: 0

    Often data is subject to strict retention and destruction policies.

    This isn't news.

  51. How much raw data from LHC? by Anonymous Coward · · Score: 0

    So feel like backing that up on your RAID array for free?

    1. Re:How much raw data from LHC? by Anonymous Coward · · Score: 0

      Tape is the word you're looking for.

  52. Ah memories... by jythie · · Score: 1

    I am thinking back to one lab I used to work in that had boxes and boxes of old tape spools sitting out in the hallway. It was always sad to wonder what might be on them, since the machine used to create the data had already been disassembled to make space.

    And then I think about the actual project I was working on, which produced something like 1GB/hour every hour every day. Only a fraction of the raw data really made it through cooking, but if there turned out to be a flaw in that initial processing our ability to go back and reprocess was limited by 'do we happen to have that run still?'.
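
    As a rough back-of-the-envelope check on that rate (assuming a steady 1 GB/hour with no downtime and decimal units, which is my own simplification), the yearly raw volume can be sketched like this:

```python
# Back-of-the-envelope raw data volume at a constant acquisition rate.
GB_PER_HOUR = 1        # rate quoted above
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365    # assumption: the instrument never stops

gb_per_day = GB_PER_HOUR * HOURS_PER_DAY
gb_per_year = gb_per_day * DAYS_PER_YEAR
tb_per_year = gb_per_year / 1000   # decimal terabytes

print(gb_per_day, gb_per_year, tb_per_year)
```

    So that is roughly 24 GB a day and close to 9 TB a year of raw data, which is why keeping every run around was never a given.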

  53. NSF is worried by weakref · · Score: 1

    From what I've heard, the National Science Foundation is worried a lot about scientific data preservation. Here is some reading: http://en.wikipedia.org/wiki/Datanet

  54. that's the difference by superwiz · · Score: 1

    between data and information. Information is data which reduces confusion. Data can actually carry negative information value if it increases confusion. Any data which is highly informative survives. And just because money was spent to obtain it, doesn't mean it was fruitful. Research is, almost by definition, a walk in the dark. It attempts to reduce confusion. And, as such, is bound to have misses more often than hits.

    --
    Any guest worker system is indistinguishable from indentured servitude.
  55. Reminds me of a dilbert cartoon by ryanmc1 · · Score: 1

    Check out the Dilbert comics from Sept 6 - Sept 16

  56. Science paper data lost! by MonsterMasher · · Score: 1

    Damn. This just made me realize the original raw data collected for my most significant publication is gone.

    Of course, your publication should generally stand on its own, providing enough details in methodology and statistical handling to make the raw data less valuable.

    That said, I've felt since the creation of the web that all data generated using public funding should be easily and simply accessed, so that others may evaluate or even expand on your work. Including programs developed (and source code and details of systems used.)

    Ideally, we should work towards the kind of open databases that amateur astronomers now have access to, continuously adding to the value of the collected data.

  57. We Need Legacy Support by pubwvj · · Score: 1

    We Need Legacy Support - I keep saying this and the little kids keep dissing me but we desperately need to maintain legacy support. In 30 more years what else will we have lost through rapid obsolescence?

    Companies like Apple and Microsoft need to reach back and provide it all the way to their earliest systems forward. We need to be able to access our old data and that means being able to run our old applications.

    Congress needs to put forth the legal framework that allows all software to be legally cross-compiled, enveloped and emulated so that it can run on future hardware and in future operating systems.

    This does not require ballooning of operating systems. It can be done through fairly simple emulation or better yet cross compilation and enveloping. We have the technology.

  58. floppy and laser disks by k6mfw · · Score: 1

    I have a box with about 200 3.5" floppy disks of facility data. And another box with several laser disks from HP data systems (1980s that ran RMB) because those floppies could only store four hours of data. Data is not "scientific" but facility pressure, temperature, stresses, etc. Don't know what to do with all this. I don't think it is important like data from Voyager or Pioneer, but one never knows. We don't have the equipment anymore to read it. Maybe we can find it used, eBay perhaps? I remember those HP instrument controllers ***never crash***. There may have been times when someone pulled the power cord. The only crashes I experienced were from an inadvertent divide by zero, so the program halts. But the data is still there, including values in the variables, i.e. TSPTEMP still has temperature data.

    --
    mfwright@batnet.com
  59. Re re search by Anonymous Coward · · Score: 0

    Business is booming and look who's buying. :-)

    It's the new new new.

  60. Save the Rainforest by Anonymous Coward · · Score: 0

    Sudy

  61. Dryad by Anonymous Coward · · Score: 0

    Perhaps, though if you jump to the bottom of the article, you can see that they are making an effort to keep the data by archiving it with Dryad.

  62. LOCKSS (Lots Of Copies Keep Stuff Safe) by Mark+Leighton+Fisher · · Score: 1

    You might want to look at LOCKSS (Lots Of Copies Keep Stuff Safe (http://www.lockss.org/)) -- we are integrating PURR with the MetaArchive Private LOCKSS Network at Purdue (PURR is the Purdue University Research Repository, which is a Trusted Digital Repository for research data).

    --
    "Display some adaptability" -- Doug Shaftoe, _Cryptonomicon_
  63. Data is probably useless by tie_guy_matt · · Score: 1

    Scientific data by themselves are probably useless. So we have a bunch of numbers. What was the setup of the experiment that generated those numbers? What exactly was the instrument, what are the units of measurement? Did you make any major modifications to the instrument? How was it calibrated? Where is your control? Are those numbers from a good test or a test where someone spilled coffee on the sample? Was that data taken during one of the trials where you left the lens cap on? Reminds me of a bad sci-fi movie. That disk has random "scientific data" on it. Any "scientist" should be able to read it and instantly see what is going on here.

    Your notes and documentation are probably more important than just the numbers you collect, and those are often still stored in lab notebooks. You know what is really important? The journal articles and papers that you write that show all your methods and have pretty pictures showing your good data. A lot of those are still on paper so they aren't going away. So we are losing a lot of random numbers from obsolete equipment from setups that no one remembers anymore. I am not going to lose sleep over it, assuming we still have backups of the papers people published that talked about their setups and outlined their final results.

  64. Research Data and Metadata degradation over time by figlet · · Score: 1

    I made a diagram (derived from a diagram in an earlier publication) that presents this data (and metadata) loss really well: Research Data and Metadata at Risk: Degradation over Time as part of a paper I co-authored on this subject, Facilitating Data Sharing in the Behavioral Sciences.

  65. Re:Research Data and Metadata degradation over tim by figlet · · Score: 1
  66. Re:Research Data and Metadata degradation over tim by figlet · · Score: 1
  67. We just rescued some data by John+Jorsett · · Score: 1

    A couple of us just rescued some 20-year-old data that had been stored on 3.5 inch floppies. We actually had to go to the house of one of our retired colleagues because he was the only person we could find who had a computer with a floppy drive capable of reading them. Even so, some of the data was unrecoverable.

    I know probably the best option right now for preservation in digital form would be several copies on CD/DVDs of the proper archival type, but I'm wondering if there are any free online services such as Amazon Web Services (which has free accounts for limited usage) where there'd be a prayer they'd keep it around for decades. After all the stuff that Google has abandoned over the years, I'd never count on them, but is there anyone else who might be any better?

  68. I beg to differ.... We archive "forever" by PeterM+from+Berkeley · · Score: 1

    Hello,

        Our mindset at my research institution is very different. We generate a certain amount of data per year (several terabytes), but the cost of storage decreases so fast we just copy old data onto new media and never delete ANYTHING.

          In fact, we consider the cost of actually figuring out what data to delete to be higher than simply buying more storage.

        I would not call it "well-indexed" however.

        Our backup strategy is tailored to the nature of our data. Most of our data is simulation results. We back up "lightweight" data and analyzed results, input files, and log files. "Heavyweight" data we do not back up, since we consider the cost of reproducing this data (given the input files and the log files), weighted by the low probability of actually ever needing it, to be lower than the cost of backing it up. This results in our backup requirement being maybe 5% of our "live" data archive.
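
        That rule of thumb can be sketched as a simple expected-cost comparison. This is only an illustration: the function name and every number below are hypothetical, not our actual figures or tooling.

```python
def should_back_up(backup_cost, regen_cost, p_needed):
    """Return True when backing up costs less than the expected cost of
    regenerating the data (regeneration cost times the chance we need it)."""
    expected_regen_cost = regen_cost * p_needed
    return backup_cost < expected_regen_cost


# Hypothetical "heavyweight" output: expensive to store, rarely needed again.
heavy = should_back_up(backup_cost=500.0, regen_cost=2000.0, p_needed=0.05)

# Hypothetical "lightweight" inputs and logs: cheap to store.
light = should_back_up(backup_cost=5.0, regen_cost=2000.0, p_needed=0.05)

print(heavy, light)
```

        With these made-up numbers the heavyweight data is not worth backing up (expected regeneration cost of 100 versus 500 to store), while the lightweight files are.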

        If it gets to the point where we can't afford the storage anymore, we'll delete the "heavyweight" data ourselves to reduce the data footprint.

    --PeterM

  69. It's a matter of organization by countach44 · · Score: 1

    For one example, for one project let's say I have roughly 300GB of simulation data. Out of that data, how much will be used to generate figures for publication? Maybe 1%? The rest of it is from testing, fine tuning, and exploring the parameter space. The real problem isn't where to save it all, but that there is extremely little incentive to go through the trouble of sifting through and archiving the important stuff. 80% is probably a lower bound, IMHO. Furthermore, let's say you save that important, precious data. Good luck, future scientist, in figuring out what is in those files and how to analyze it.
    I realize that not all science is like this, but I think I'm speaking about the majority, not the minority.

  70. Our approach in our research group by enriquevagu · · Score: 1

    This problem occurs even for people in the same group, who often have trouble repeating the simulations from our own papers, even ones as recent as one year old. The problems typically come from people leaving (PhD finished, grants that expire, people that move to a different job), changes in the simulation tools, etc.

    In our Computer Architecture research group we employ Mercurial for versioning the simulator code. Thus, we can know when each change was applied. For each simulation, we store both the configuration file that is used to generate that simulation (which also includes the Mercurial version of the code which is being used) and the simulation results, or at least only the interesting results. Multiple simulators allow for different verbosity levels, and in most cases most of the output is useless, so we typically store the interesting data (such as latency and throughput) because otherwise we would have no disk space.
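
    A minimal sketch of that kind of per-simulation record, assuming the revision string is obtained separately (for example from `hg id -i`). The function names, field names and sample values are hypothetical and are not our actual tooling:

```python
import hashlib
import json


def build_manifest(config_text, code_revision, results):
    """Bundle what is needed to trace a result back to its inputs."""
    return {
        # Changeset of the simulator code used for this run.
        "code_revision": code_revision,
        # Hash of the configuration, to detect later edits to the file.
        "config_sha256": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "config": config_text,
        # Only the interesting metrics, not the full verbose output.
        "results": results,
    }


def save_manifest(manifest, path):
    """Write the manifest as JSON next to the stored results."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)


# Hypothetical run: every value here is made up for illustration.
manifest = build_manifest(
    config_text="routers=64\ntraffic=uniform\n",
    code_revision="a1b2c3d4e5f6",
    results={"latency_cycles": 41.7, "throughput": 0.83},
)
```

    Keeping the revision, the configuration and the key metrics in one small file means the record stays useful even after the bulky raw output has been deleted.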

    Even with this setup, we often find problems trying to replicate the exact results of our own previous papers, for example because of poor documentation (this is typical in research, since homebrew simulation tools are not maintained as one would expect from commercial code), changes that introduce subtle effects, code that gets lost when some person leaves or simply large files that get deleted to save disk space (for example, simulation checkpoints or network traces, which are typically very large).

    However, you typically do not need to look back and replicate results, so keeping all the data is a useless effort. I completely understand that research data gets lost, but I think that it is largely unavoidable.

  71. And part of the reason is Pay Walls by Anonymous Coward · · Score: 0

    And part of the reason is Pay Walls... Just like the one blocking the paper from the public.

  72. so it's actually possible by Anonymous Coward · · Score: 0

    to reinvent the wheel?

  73. Paper vs. electronic by mcswell · · Score: 1

    Loss is irrelevant to the argument, because loss can occur to both paper copies and electronic copies. The argument is about what you can do with the media if it is *not* lost. Paper copies can be read for centuries (at least on acid-free paper). Hard drives probably last 10 to 30 years (we'll know in 30 years, although we can get some idea sooner by exposing hard drives to high temps etc.). CDs, surprisingly (ok, it surprises me; ymmv) don't last much longer (at least we don't think they do).

  74. Re:The Rosetta Project: building a 10000 year libr by Anonymous Coward · · Score: 0

    Interesting. Would be even more interesting to have this disc backed with a silicon wafer. You could store a lot of data in 300 mm² of ROM even at something conservative like 0.1 micron.

  75. Yet the NSA can store all our emails, phone calls, by Anonymous Coward · · Score: 0

    ..presumably until the end of time, or until they can find some nefarious use for it.

    Now, just -what- do I pay taxes for every April 15?