On Data Obsolescence and Media Decay
mouthbeef asks: "What's the future of storage media? With CDs and tapes prone to relatively speedy decay, and hard-drives an entropic nightmare of moving parts, how will we keep our data safe over the long haul? I just got some e-mail from a writer pal who isn't really technologically sophisticated, alarmed because someone told him that his backup CDs would decay and rot in 20 years. He's an sf writer, and he was thinking "big picture:" a coming infopocalypse in which sysadmins devote their every waking moment to re-archiving their old backup data." Is such a scenario likely? Why or why not?
"I wrote back that I didn't think that would happen, because:
- Every time I buy a computer, it's got more storage on-board than all the computers I've owned until then, and I just migrate all the data files I've ever created or saved to the new box, like a hermit-crab changing shells
- With broadband becoming more real and more cheap, it makes sense that in the long run we'll store most (if not all) of our data on remote servers -- encrypted, of course -- that are managed by trained pros with access to mirror drives, climate-controlled vaults, etc. etc.
- Even if this doesn't happen, most of your data files will be in stupid, proprietary formats like Word 3.0 that won't be openable, anyway
How reasonable does this seem to you folks? What do you do with data that you need to preserve for the ages?"
Only guaranteed storage mechanism! Good for thousands of years.
Capacity: 2Kb/tablet
I/O: 1byte/hr
Media cost: £50/tablet
Error rate*: 1 per 100bytes
Note: Error rate assumes fully qualified and certified stone mason.
The tapes in question are 3600 foot, 7-track analog tapes recorded at 15 inches per second. During recovery, the analog experiment data is digitized at 40,000 (16-bit) samples per second. That comes out to about 230 megabytes per track. Not all of the tracks are used for experiment data, some are used for frequency reference, time code and low rate spacecraft PCM data, others are unused. Assuming one track with experiment data, the result is about 250 megabytes per tape.
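As a rough sanity check on the per-track figure, the arithmetic can be sketched in a few lines of Python. The function and parameter names are made up for illustration, and a "megabyte" is assumed to mean 10^6 bytes.

```python
def track_bytes(length_feet=3600, speed_ips=15,
                sample_rate_hz=40_000, bytes_per_sample=2):
    """Bytes produced by digitizing one analog track from end to end."""
    # Playing time in seconds: tape length in inches divided by inches per second.
    seconds = length_feet * 12 / speed_ips
    # 16-bit samples are 2 bytes each.
    return int(seconds * sample_rate_hz * bytes_per_sample)


if __name__ == "__main__":
    total = track_bytes()
    print(f"{total} bytes, about {total / 1e6:.0f} MB per track")
```

That works out to 48 minutes of tape at 80,000 bytes per second, or roughly 230 million bytes, which agrees with the per-track number quoted.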
Mea navis aericumbens anguillis abundat
I can't speak for anyone else here, but so far my personal experience has been that Maxell CD-Rs are the absolute worst available out there and Verbatim have so far been the best.
By that I mean that I bought my first CD-R about three years ago, and of the 20 or so Maxell disks that I've archived data onto, only one is still readable by any CD-ROM/CD-R drive that I insert it into. By contrast, every one of the Verbatim disks that I've burned, which were stored in exactly the same environment as the Maxells, is fully readable, and I haven't had any problems with them.
I've also used a few Sony and Memorex disks with which I haven't had any problems (that I'm aware of), but I have found my Verbatim disks to be incredibly durable. I burned 20 or so audio CDs onto Verbatim disks two years ago before leaving on a cross-country road trip, and despite vast swings between heat and cold, as well as being literally tossed around my car, every one of those CDs is still working.
Again, this is just my personal experience, but whenever I see someone picking up a spindle of 50 or so no-name brand disks at a local computer store, I have to wonder how important the data they're putting on there must be...
--Cycon
Your Brain + EEG + LEGO Robots = Brainstorms
This is a very real problem, but it won't amount to an apocalypse unless we ignore the issue.
As others have pointed out, the exponential increase in storage capacity makes it relatively easy to "keep buying more disk" and migrating your data all the time. Certainly the convenience of having everything online is nice, too. And everything online should have periodic backups happening. I've managed to do this for the past decade with my data, but I've lost the eight or so years before that, and I miss some of it.
But there's logical as well as physical bitrot. The media itself deteriorates, making it hard to get the information back, but understanding what that bitstream represents after a few years can also be a real problem. If you've got binary word processor files from an Apple2 or C64, you'll probably not be able to read them unless you also have the word processor binary and can get it running in an emulator. Given the amazing progress that's been made in the last 150 years deciphering the records of dead civilizations, I wouldn't say that reading your MS Word 5 documents will be impossible in twenty years, but it might not be worth the effort. Open standards and open source really help a lot with this issue. If you can find a document describing the file format, you're saved. And the same applies to hardware formats. Also, it's much easier to keep open source software alive--essentially carrying the 'make a copy on the new system' approach over to executables.
I'd say the solution is pretty much that simple: keep track of your data, plan to make a complete copy every 5-10 years, and choose formats that are publicly documented and that (you hope) will be easy for future software to support.
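For what it's worth, the "complete copy every few years" step could look something like the sketch below. It is only an illustration under my own assumptions: the names `migrate` and `sha256_of` are hypothetical, SHA-256 is simply one reasonable choice of checksum, and the old and new media are assumed to be mounted as ordinary directories.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src_root, dst_root):
    """Copy every file under src_root into dst_root, then re-read both sides.

    Returns the relative paths of files whose copy does not match the source.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    mismatched = []
    # sorted() lists the source files up front, before anything is written.
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also tries to preserve timestamps
        if sha256_of(src) != sha256_of(dst):
            mismatched.append(str(rel))
    return mismatched
```

An empty list back from `migrate("/mnt/old", "/mnt/new")` (both paths hypothetical) means every copy read back identical to its source. It says nothing about whether the source had already rotted before the copy was made.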
This approach will work well with anything that is in daily use by a reasonably large group of people. It also works best with information already stored in digital form. But there is other information worth keeping: historical data, literature, even texts intended to be read only once (adverts, notes, email) may give later generations an insight into present everyday life and hence be worth keeping.
Many of these texts are not yet broadly available in digital form and are not important or interesting enough for enough people to be kept handy. Try looking for some older book by a not so famous author. Even encyclopaedic works are revised for each new edition, and older bits of information have to make way for newer ones.
With historical facts it's even worse: in most cases there are at least two versions of one event, and who was in the right is mostly determined by who survived. Just have a look at how warfare now concentrates on media control, or try to imagine the twisted version of history if the Nazis had won WWII. Even now there are some who deny the existence of the Holocaust.
I think all this information is well worth keeping, and since it's difficult to see today what later generations might find worthy, the 'evolutionary' approach (if we don't want to keep it, later generations won't want it either) doesn't work. And it doesn't suffice to just keep this information somewhere: it has to be kept in an accessible form, on media readable with modern equipment (who will go to the trouble of reading an old magnetic tape?), and indexed (if you have 1GB of unsorted texts and text fragments on a hard disk, are you ever going to wade through that to get the piece of information presently of interest?).
"By the way if anyone here is in advertising or marketing... kill yourself." -- Bill Hicks
I disagree almost entirely.
Very little of the data volume becomes useless, because we don't know what "useless" will mean to the readers of the future. Contemporary archaeologists spend much useful time sifting the contents of rubbish pits and latrines. If that turns out to be interesting, how can we ever say that data won't be? Maybe your schoolwork is dull and uninteresting to you, but how about to an educational historian in a century or so? Wouldn't you like to know how teaching was carried out in the past?
Also the majority (by volume) of data will always be automatically generated sensor data (humans can't type fast enough to keep up), and that tends not to become useless with time. NASA have already lost interesting telemetry data.
Authors have definitely lost early book drafts because modern WPs don't open old WP formats. Word 1.0 isn't old! That's not even a decade ago. What about stuff from the '70s on hardware formats that no longer have players? CP/M WP formats used by some of the first great novelists to work digitally? (Mind you, losing the whole of Pournelle is fine by me.) Personally I'd find it very hard to read my own degree work, and I'd probably have to do it by scanning in the paper copies.
Solutions? I'm not a hardware guy, so I can only talk about the soft data side of it. I think XML (and similar) has a big part to play here. Let's stop thinking of data formats subjectively as "the data format that belongs to SprongWriter 4.2a" and instead work with formats that have objective definitions that extend beyond the client app of the day. Why should I need a copy of that particular WP to open the data, if the data is already in a format that's inherently accessible? We already have the technical skills and tools for this. I call on all developers to make use of them and to stop writing these proprietary data oubliettes.
Book Recommendation: The Clock of the Long Now, by Stewart Brand. Why this sort of thing matters, and what a few people are trying to do about it. Best book I've read this year.
PS - SciAm also had a piece on digital data loss, a year or so back.
This is not a new problem. People have been dealing with the question of recovering data from old media for years. As a first data point, a number of years ago, about 5 IIRC, some people finally decided that some old music tapes had to be rescued.
The method used was to find an old RCA gentleman who had retired more than a few years before. They then went to the Smithsonian and got the last remaining example of the tape recording/playback device that had been used to make the original master tapes. The RCA guy used the specs and his knowledge to tune the tape deck to perfection. They then put a high-quality amp and splitter downstream of the tape deck to feed 2 digital tape decks (the professional version, not DAT: more bits and a bit faster sampling rate) and a couple of analog tape decks as well.
After testing, they carefully placed one of the master tapes on the deck, started all the recorders and pressed "play". As the master tape played it just came apart. They had to keep the heads clean, but this was a one-time, one-chance thing. They succeeded.
From the recordings they made some wonderful CDs. Amazingly enough, the Master tape had almost no "hiss" in it.
Data point two. MIT, I believe it was, decided to move some of their older theses to CDROM for easier online access. The first thing they noticed was that many of the data tapes they had stored things on were 7-track tapes, and of course they had no 7-track tape drives any more. Again people went to the museums, got out a 7-track drive, spent the time to fix it and make it work, then built an interface box to connect it all up, and away they went.
3rd data point. Somebody sent out to a mailing list that they were looking for some old code to run on an emulator for a PDP11(?). We ended up going into our machine room and found some old release tapes. This included a copy of BRLUNIX (based on a BSD release) and, I think, an AT&T Sixth Edition. These were 9-track reel-to-reel tapes. We went into the machine room, powered up the tape drive, and copied the tapes verbatim to disk. We set it up to do the least amount of reading. These tapes were around 15 or 20 years old.
Because of this rescue, which happened late last year, we saved the tape drive when the machine was tossed due to "Inability to prove Y2K compliance". So the tape drive still sits on the machine room floor. The operators turn it on and clean it once a week. It isn't currently hooked up to anything, but we expect it to be hooked up to something again in the next year or two, just to be able to read all those old tapes we still have.
At home I use EXABYTE-8200s for my backups. I have 3 drives and you can still get them refurbished. Each tape only holds 2GB (compared to a max of 150MB for a 9-track tape), but the media is small and low cost. The Exabyte encoding also has a great deal of redundancy in it, making it an excellent choice for long-term storage.
At work they do many of their backups on EXABYTE 8500s. For the Crays, they used to use IBM 3480 tape cartridges; when they changed tape formats, they spent a few weeks moving all the data from the older format to the new format.
Of course our most reliable storage medium to date has been our paper tape and punch cards. Granted, they may be low density, and sometimes we've had to make readers for them (auto-feed to a flatbed scanner which scans the card, process the image for holes, and voilà).
CDROMs have the problem of decaying due to light contamination. If you want to keep them for years and years and years, they have to be kept out of sunlight. And because our long-term, low-cost storage methods keep dropping in cost and increasing in size, I suspect that what we will find in 3 years is that everybody is carefully copying all their data from CDROM to DVDs, which will have a twenty-year life span.
The basic rules on saving your data for the long term are:
Chris
This suggests that ALL data should be made freely available for archiving. If NASA had made an effort to make sure as many people as possible had copies of that data, then you wouldn't need to do all this transferring. It would have been transferred to newer systems by someone already.
Apart from MAME, nobody is making any effort to archive old computer games. The BBC managed to destroy a lot of valuable original video tapes (apparently they taped over their copy of the moon landings). These show that data is kept around much longer if copying is encouraged rather than discouraged.
I am an archivist. My job is to sift through data and decide what is worth saving. Generally about 5 percent of collections of modern records are saved. Popular culture is indeed documented to some degree in any historical library and there are several repositories which are dedicated specifically to the preservation of popular culture.
The filter of decay has served mankind well? How illogical, when you have no idea of what has been destroyed how do you know mankind has been served well? Was mankind well served by the destruction of the Library of Alexandria, the Aztec library destroyed by the Spanish, the historical libraries destroyed by the Serbs in the Balkans?
Sure CDs may last 100 years (we really don't know) but it is unlikely they will be able to be read by anything. Paper is still the most stable format available (although it is impractical for many reasons to transfer digital data to paper as some of my colleagues are prone to doing) and there are many vast libraries of data open to the public. We had well over 40,000 researchers use our library last year and less than 1 percent were scholars.
My profession is wrestling with two technology related questions.
1. How to make paper collections accessible electronically. For example the papers of ONE congressman (approx. 400k documents) took 5 years and nearly 3 million dollars to digitize. We have one collection which has 32M documents. Sure digital copies are cheap - IF the original was electronic and in a form easily translated.
2. How to preserve much of the information which currently only exists in electronic form, be it governmental databases, personal computer files or web pages. We did an interesting experiment a couple of years ago when we captured about six dozen web sites which documented the devastating Red River flood in Minnesota, North Dakota and Canada. Most of these sites existed on the internet for only 2-3 months and were disappearing as we captured them. I think it will be possible to study how the internet was used as a tool in response to catastrophe, from the governmental level to local churches and organizations. Of course current copyright law makes it illegal for us to post this database of websites on the internet, but that's another issue.
Aging Newbie is correct in the assertion that only a small percentage of data need be preserved, yet I feel that conscious, reasoned choices about what should be saved serve mankind far better than the filter of decay. I also believe the solution ultimately will involve a combination of strategies including electronic.
Skavvy (whose firewall apparently won't allow him to register)
WOW, I cannot believe that half of the /. readers are not working on data recovery as we speak. I spent a good couple of months of my life running back and forth across hallways doing tape retrieval because the machines that were made in the late 70s and early 80s couldn't be replaced. This was made even worse by the fact that half the tapes were corrupted. Fact is, we have lost a lot of the data from the Voyager space probe missions. With data centers poorly funded, the race to copy all the data from the older 7-track tape format to new media is slow and grueling. 7-track machines are NO LONGER MADE and the companies outfitting newer tape heads to read the old data are charging way more than the scientific centers can afford. Not only Voyager, but Magellan and so forth... GONE... and going as we speak. As the few machines that can retrieve the data struggle to re-read the tapes literally hundreds of times trying to recover those last missing bits, tapes yet to be re-archived are falling apart. Once the data is stored, what does one DO with half-complete 1970s computer records? There is not yet an "emulator" to read most of this stuff. Fact is, it is gone, and anyone who says this problem isn't going to pop up again has yet to store anything important on a floppy disk.
bortbox
However, I think he was mistaken. Ancient societies left stone tablets, cave paintings and the like behind, and there's no-one who fully understands the languages or the contexts (when an archaeologist says an object is of "ritual significance" he actually means he doesn't know what it's for). We do have the technology now, as the poster says, to migrate our data ever forwards into new storage, assuming no cataclysm occurs. And even if it does, it is far more important, in terms of recovering data, that the language (source code) survives, rather than CD ROM drives, Minidisc players etc (the binaries), because then data recovery is an essentially straightforward task.
I expect acid-free paper to survive long enough after an ecological catastrophe or, say, a meteor strike, to be useful to the survivors (better start moving the engineering textbooks down into the bunkers). And of course, Ship-It awards will outlast the end of time, not to mention non-biodegradable shopping bags.
As a civilisation, if we wish to preserve a legacy, we currently possess the skills and technologies to do so - if we choose to.
From what I've understood, the lifespan of a CD-R is around 20yr for those which are based on cyanine or AZO (and which appear blue or blue-green when you look at them) and around 100yr for those based on phthalocyanine (which appear golden to the eye).
Of course, it depends very much on the way you treat those CDs. If you put one in a light-free, dust-free, safe deposit box, it can probably survive several kyr (uh, thousands of years) without damage.
The unfortunate thing, however, is that because the error correcting codes work so well, it is not always easy to tell that a CD has begun noticeably deteriorating until the data is actually unreadable, and then it is too late. It would be nice if the drives could return some sort of ``CD quality'' status.
I always write down (on paper) the MD5 fingerprint of the raw ISO image when I burn a CD. That way, I can check whether I still have pristine data. (And if I make copies, I can be sure the copy is exactly identical to the original.)
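For anyone who wants to do the same check without a dedicated tool, here is a minimal sketch using Python's standard `hashlib`. The helper names `file_md5` and `matches_recorded` are made up for this example, and it assumes you have already read the disc back into an image file.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_hex):
    """Compare a file against a fingerprint copied from the paper record.

    Surrounding whitespace and letter case in the recorded value are ignored.
    """
    return file_md5(path) == recorded_hex.strip().lower()
```

Calling `matches_recorded("backup.iso", "...")` with the digest from the paper note (the file name is only a placeholder) returns True only if every byte of the image is unchanged. MD5 is fine for catching decay, though it would not be a good choice for guarding against deliberate tampering.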
This information is provided in the hope that it will be useful but WITHOUT ANY WARRANTY. Without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Yamaha CD-R site
Josh
One could always do what Linus did for backing up his work --sharing it with the world. I heard he didn't have a tape drive for many years until he was given an Alpha, but his work could always be found somewhere on the internet in good hands.
The internet will always save your best work and discard the junk.
As someone who just loves books... most are not printed on acid-free paper anymore, and a huge number of them are going to be lost within the next 10 to 30 years.
I'm sorry to hear that. I've been fascinated by this phenomenon in our university library. Up until somewhere in the 1930s, journals are pretty well preserved. Then they suddenly get awful as paper mills switch to new methods: pages are yellowed and brittle. In the 1950s the error is discovered and pages become white again with the switch back to acid-free paper.
Let's hope we don't make the same mistake with digital media. And it could be worse: almost all the film from the first half of the century is lost to self-rot and environmental damage. For all its faults, DVD is probably the best thing that's ever happened to film from a historical perspective.
What I've noticed is that most of the data we're accumulating is quickly becoming useless. 10 year old schoolwork isn't something so worthy of archiving. The data you really want to keep shouldn't be very large anyway...
Modern word processing still opens really old file formats like Windows .WRI and Word 1.0, and I don't see that likely to change in the near future. The filters will probably stay, but be optional. If you want to future-proof your documents, run a mass conversion utility on them and convert them to a more "standard" format than Word or WordPerfect. Say, pure ASCII, HTML or RTF. Sure, you're going to lose formatting, but if those are documents you're not likely to use ever again, yet there may be a slight chance you will, then losing formatting isn't important. If you need the content again, you shouldn't mind too much having to redo the formatting correctly again...
Floppy disks are degrading rapidly, but most people's floppy collection can fit on a single CD-R. Then again, most people just don't care about their floppy collection, and will just let it die. The data contained on it isn't useful anymore.
Let's see about audio CDs. They degrade over time (scratches) and possibly rot. I believe that what will happen is that we're going to convert them to some format like MP3. I'm fairly certain that MP3 capability will continue to be implemented in computers for a very long time. And if it shows signs of getting phased out, then you might simply batch-convert everything to the new format. Or just re-rip your audio CDs that are sitting in storage, if you really care about the quality (since batch conversion will result in degradation, unless we find a way to actually enhance the audio quality... which might or might not happen...)
Movies. VHS tapes degrade... Probably, we'll be converting what we really want onto some kind of optical disk in the future. And the rest will decay, and we won't care about it decaying. When the format (DVD-R perhaps?) is being phased out, since it's in digital format, it should be quite easy to simply transfer our DVD-Rs to the higher-capacity medium... Perhaps 10 discs onto a single one... Saving a lot of space, and having the format live another 20 years. After all, how hard will it be to include MPEG-2 decompression in next-generation video players? The cost of an MPEG-2 decoding circuit probably won't be very high anymore.
The other possibility I see is that bandwidth gets cheap enough so that we may consider remote storage vaults. That has a couple of privacy issues I'm certain you can see... But it's incredibly convenient and will probably be adopted by everyone if we just find a way to have a high-speed switched pipe to everybody's home at a reasonable cost.
If we do indeed have high bandwidth in every house, I can see the media companies getting their acts together and putting up their own gigantic media archives. They could offer a monthly media license that'd give you access to any music or movie you want. Or perhaps just make you pay for every access to the archive. Of course, I can think of so many ways such a thing could go wrong. What if they decide to have only censored material in the archive? What about independent artists? Perhaps we'll just see a protocol to access and pay for access to media archives, and have a dozen appear. Let's say, DisnABCTimeAOL could have theirs, AndoTransmeVAMicrosoChryslerDaimler could have theirs...
This could be so horrible if not properly done: a lot of "non-approved" content could suddenly become unavailable if you killed the distribution channels except those media archives... So. Is this just an incoherent rant? Would you care to add any constructive comment to it? Answers? Questions? Anything at all.
In many later books Lem refers to an informatic catastrophe. Sometimes it is caused by a necro-virus, a product of computer evolution (the arms race was banned from Earth and transported to the Moon, where sophisticated computer systems worked automatically on weapon development. Each nation was allowed to bring its weapons back to Earth, but that meant others could equally prepare; somehow, the automata on the Moon got out of control and started evolving, finally leading to a nanobot-virus thriving on silicon chips, hence the title, "Peace on Earth"). Sometimes it is caused by basic physical properties (in the humorous story "Prof. A. Donda" the title hero discovers a basic equivalence between energy, mass *and* information, and one of the consequences is that if information reaches a certain density it changes into matter, that is, into a new universe. God's word was counting from infinity to zero in an infinitely small time :-) ).
I admit it: I was shaped by Lem's writing. Many of his ideas from the sixties and seventies came to life in the nineties (e.g. virtual reality, or sciences which deal only with information retrieval). I do believe that information storage is a problem, not because the medium won't last forever, but because of the signal/noise ratio you have even in your personal files. Just look at the four Macs we work with in our lab: a couple of gigabytes of data, and then dozens of GB of backups, different versions, obsolete versions, alternate versions, gel pictures you have no idea where they came from or who needs them, and so on, and so on... Yes, there are better solutions than using a Macintosh in a multiuser environment, but that's not the point. I've been using Linux for years and have my personal data at home, and I seem to have a GB or so of data I'm too afraid to remove, just in case. And there are so many alternatives for storage, backup, databases... and I'm just a simple biologist!
Returning to Lem - yes, I do believe we are approaching a critical point, like a bifurcation in a chaotic equation, and the word "chaotic" fits in here especially well. What happens next? He who cometh and giveth us a system (not OS, but an information retrieval system), he hath the power and our souls. Well, mine at least. Hope he doesn't come from Redmond, though.
Regards,
January