Fault Tolerant Archive Solutions?
"There are several solutions for on-line storage: RAID, UPS, and frequent backups. As I fill a CD-R I make several copies of it and send them to relatives who live out of state. So I am fairly well protected against local disaster.
But what happens when the CD-R itself becomes degraded - possibly from scratches or bad lamination - and cannot be read by the normal file system? Murphy's Law would guarantee that all the backup CDs were from the same bad batch or were lost, etc. So I am left with a CD that is 90% good, but that ugly 10% prevents me from getting my file.
I remember studying N-dimensional parity and Hamming codes in Comp Sci class, so I know that it is possible to store a file with significant error correction capabilities. But has any such scheme actually been implemented?
I would expect any such scheme to include the ability to adjust the degree of recoverability (size vs. robustness) and to be able to span volumes. Since most physical damage is contiguous, you would hope that the storage would be non-contiguous. And you would think that this would either represent a unique file system or a custom raw storage methodology usable only by the storage application.
Thanks for your insights."
You could do RAID on a single disk (or, presumably, disc, if you're using a CD). Since you're assuming that most of the media will be good, you simply treat the disc as a collection of, say, ten 70M regions, using nine (or eight) for data and the remaining one (or two) for parity, which would allow for reconstruction of any one (or two) damaged regions.
Of course, RAID assumes that your errors are self-detecting, but I suspect that this is also true of CD media failure.
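To make the single-parity case concrete, here is a minimal sketch in Python. It assumes the simplest scheme (one XOR parity region over nine data regions) and that the drive tells you which region is unreadable; the function names are made up for illustration and nothing here is tied to a real CD layout.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def split_regions(data, count):
    """Split data into `count` equal regions, zero-padding the last one."""
    size = -(-len(data) // count)  # ceiling division
    padded = data.ljust(size * count, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(count)]

def rebuild(regions, parity):
    """Rebuild the single region marked None from the survivors and parity."""
    missing = [i for i, r in enumerate(regions) if r is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one lost region")
    survivors = [r for r in regions if r is not None]
    repaired = list(regions)
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired

data = b"archive payload that must survive a damaged region of the disc"
regions = split_regions(data, 9)   # nine data regions
parity = xor_blocks(regions)       # the tenth region holds the parity

damaged = list(regions)
damaged[4] = None                  # the drive reported this region unreadable
restored = b"".join(rebuild(damaged, parity))[:len(data)]
```

Surviving two lost regions needs more than a second XOR region; RAID 6 style schemes use a second, independent code (typically Reed-Solomon) for that.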
Now the trick is to implement it. You could encode it such that it looks like a normal ISO-9660 CD, except for special "garbage" written to the last 70M or so. You would need a special version of mkisofs, as well as special recovery tools.
In theory, you could have mkisofs figure out exactly how much extra space you have left on the CD and use it for parity of each previous block of that size. If you have more than some threshold of free space, it could use more than one parity block for multi-block failure recovery. Then if you cram the disc to within a meg or two of being full, you are still protected from failure, but only if the failure covers a small area.
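The arithmetic behind that idea can be sketched as follows. This is only an illustration: the 20% threshold for switching to two parity blocks is an assumption (the comment names no value), sizes are whole megabytes, and `plan_parity` is a hypothetical helper, not part of mkisofs.

```python
def plan_parity(capacity_mb, data_mb, multi_parity_threshold=0.2):
    """Return (parity_blocks, block_mb, data_blocks) for one disc.

    The free space left after the data is used for parity. The data is then
    treated as consecutive blocks of the same size as one parity block.
    """
    free_mb = capacity_mb - data_mb
    if free_mb <= 0:
        raise ValueError("no room left for parity")
    # Assumed policy: use two parity blocks only when plenty of space is free.
    plenty = free_mb / capacity_mb > multi_parity_threshold
    parity_blocks = 2 if plenty and free_mb >= 2 else 1
    block_mb = free_mb // parity_blocks
    data_blocks = -(-data_mb // block_mb)  # ceiling division
    return parity_blocks, block_mb, data_blocks

roomy = plan_parity(700, 630)    # 70 MB free
crammed = plan_parity(700, 698)  # 2 MB free
half = plan_parity(700, 500)     # 200 MB free
```

With 630 MB of data on a 700 MB disc this gives one 70 MB parity block over nine data blocks; crammed to 698 MB it gives one 2 MB parity block over 349 data blocks, so only damage of about 2 MB or less could be repaired.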
This sounds like a good project for a data storage class.
Because of the relatively high failure rates of floppies (in my experience), I would always make duplicate images of an original on separate disks. First off, if you're using a CD-R, buy quality recordable media - if you're using the 100 pack that you bought for $30, I'm just going to laugh at you. Make an ISO of the filesystem you want to burn, then make, say, five copies of it. You might want to check that they are all really identical (using dd to re-extract the raw ISO from each disc), otherwise this isn't going to work.
So then, say that you are using an archived CD and it's failing to read a particular file. You can 'dd' the image from the CD (turn off the 'terminate on read error' behaviour and make sure it puts empty blocks in where it wasn't able to read the data; with GNU dd that is conv=noerror,sync). Just record the data sectors that it couldn't read, and splice them in from one of your copies (all blocks are identical, so use 'dd' to extract the ones you need). Rewrite the image to a new disc, and you're done.
"But that seems like an awful lot of work" - and you're right. Chances are that at least one of your five or so discs will work "out of the box" without any splicing needed, but what happens when all of them have issues? A redundant filesystem is a great idea, adding parity so that it "just works" when blocks go bad. But if you're using CD-Rs and want an ISO 9660 filesystem, you get the ISO 9660 filesystem, which doesn't have the parity. The method I outlined above works: I've used it on floppies, I know others who have used it on DAT tapes, and it will even fit in with the "make lots of copies and send them to relatives" approach you already have.
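The splicing step could look roughly like this in Python rather than with repeated dd invocations. It is a sketch under stated assumptions: every disc has already been ripped to an image of the same length with unreadable sectors zero-filled, and you noted the bad sector numbers for each rip yourself. `splice` and `blank` are hypothetical helper names, and the demo uses a tiny sector size instead of the 2048-byte data sectors of a CD.

```python
SECTOR = 2048  # user-data bytes per sector on a data CD (Mode 1)

def splice(images, bad_sectors, sector=SECTOR):
    """Merge several rips of identical discs into one image.

    images      -- list of equal-length bytes objects, one per disc
    bad_sectors -- list of sets; bad_sectors[i] holds the sector numbers
                   that disc i could not read (zero-filled in its image)
    Returns (merged_bytes, sector_numbers_no_disc_could_supply).
    """
    length = len(images[0])
    if any(len(img) != length for img in images):
        raise ValueError("images must be the same size")
    merged = bytearray(length)
    still_missing = []
    for n in range(-(-length // sector)):
        start = n * sector
        for img, bad in zip(images, bad_sectors):
            if n not in bad:
                merged[start:start + sector] = img[start:start + sector]
                break
        else:
            still_missing.append(n)  # every copy failed on this sector
    return bytes(merged), still_missing

def blank(image, bad, sector):
    """Demo helper: simulate a rip by zero-filling the given sectors."""
    out = bytearray(image)
    for n in bad:
        out[n * sector:(n + 1) * sector] = bytes(sector)
    return bytes(out)

original = bytes(range(256)) * 4           # 8 demo sectors of 128 bytes
bad = [{2, 5}, {5, 7}, {0}]                # what each of three discs lost
rips = [blank(original, b, 128) for b in bad]

merged, missing = splice(rips, bad, sector=128)
two_only, missing_two = splice(rips[:2], bad[:2], sector=128)
```

With all three rips the merged image matches the original; with only the first two, sector 5 is reported as missing because both discs lost it.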
The only way to be sure you won't lose your data some day under a bizarre set of circumstances is to have infinite copies of it - unfortunately this costs infinite money and takes infinite time to accomplish, just a small implementation problem.
I currently work for a DoD contractor on the east coast of Florida and we tend to worry about hurricanes and brush fires and aircraft that might miss the runway and land on the building. We use DLT and other mature media that have a 30+ year life. The general idea is that each quarter we make an offsite copy that goes to another facility and sits in a vault. Whenever there is a threat (fire, hurricane, etc.) we pull the most recent full backups and fly them out of town on a plane that goes to some other location to be stored just in case. I am dealing with backups on the order of 1TB per week of fulls, plus 40GB of incrementals on top of that, or I would make a second copy more often; however, management has made a cost/risk decision that quarterly is often enough to make the offsite copy.
At my last job (across the street so same concerns) we had a rotation that sent a copy of last weekend's tapes to another building 5 miles away, the tapes at that location went to Orlando, the tapes in Orlando went to Harrisburg, PA, and the tapes in PA came back home and were recycled the next week. This gave us 5 weeks of tapes in multiple locations around the country at any given time.
The bottom line is this ain't cheap, and you need to determine what the data is worth vs. how much you are willing to spend ($ and time) to protect it, and arrive at an acceptable level of protection.
I'm blithely pretending that CDs will last forever. If they don't, then I should check the integrity of all of my backup CDs every couple of years, and copy the data from failing sets to new sets. This involves doing something like calculating a CRC code for all files in the archive, and storing a copy of this with each copy of the backup tree.
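A periodic check like that might be sketched as below, using CRC-32 from Python's standard zlib module. The manifest here is just an in-memory dict; how it gets stored next to each copy of the backup tree (a text file, for example) is left out, and the function names are illustrative.

```python
import os
import zlib

def file_crc32(path, chunk=1 << 20):
    """CRC-32 of a file, read in chunks so large files fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF

def build_manifest(root):
    """Map each file's path (relative to root) to its CRC-32."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_crc32(full)
    return manifest

def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_crc32(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

Build the manifest once when the set is burned, then run `verify` against each copy every couple of years; any path it returns is a candidate for re-copying from a healthy set.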
I also keep a paper copy of anything really important (that will fit on paper, at least).
There's also an active system that I'm interested in trying, but it would have to be continually maintained (CDs and tapes can be left unattended for years, if they're stored well). The active system would be a bunch of servers with RAID drives that stored the files to be preserved, along with CRC information for the files. These servers actively mirror content from each other, trying to each keep a complete set of the data (updates propagate through the mirroring network). They'd also perform integrity checks on their own data and data from other servers (let the other server know that its data differs from the local copy). As long as the servers are maintained and swapped out when they fail, the data should be preserved intact forever.
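One way the cross-server integrity check could work is sketched below. The majority-vote rule is my assumption (the idea above only says a server should be told when its data differs), the server names are made up, and each server's state is reduced to a dict of path to CRC.

```python
from collections import Counter

def find_divergent(replicas):
    """Compare per-server CRC manifests and report disagreements.

    replicas -- dict mapping server name to {path: crc}
    Returns {path: [servers whose copy differs from the majority]}.
    If no value has a strict majority, every server is listed for that path.
    """
    report = {}
    paths = set().union(*(m.keys() for m in replicas.values()))
    for path in sorted(paths):
        # A server that lacks the file votes None, which counts as differing.
        votes = Counter(m.get(path) for m in replicas.values())
        winner, count = votes.most_common(1)[0]
        if count == len(replicas):
            continue  # all servers agree
        if count <= len(replicas) / 2:
            report[path] = sorted(replicas)
        else:
            report[path] = sorted(
                name for name, m in replicas.items() if m.get(path) != winner
            )
    return report

replicas = {
    "alpha": {"thesis.tex": 0x1234, "photo.jpg": 0xBEEF},
    "beta":  {"thesis.tex": 0x1234, "photo.jpg": 0xBEEF},
    "gamma": {"thesis.tex": 0x9999, "photo.jpg": 0xBEEF},
}
report = find_divergent(replicas)
```

Here `report` flags only gamma's copy of thesis.tex, which that server would then re-fetch from one of the others.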
The catch is that, while you could in principle preserve storage media for a century, I wouldn't want to bet on a server network being maintained (in whatever form) for a century.
The question is about fault tolerant archival. Tapes are a backup mechanism, yes, but not a good archival mechanism. The reason CD-R is mentioned is because the shelf-life of a CD-R is many times that of tape. Tape degrades, and fast.
1) While I have had unrecoverable errors with fingerprints, I will take your word on the scratches. Besides, both can be cleaned/polished to some degree.
2) Exactly! It is the media durability which scares me. Good today, but crap next decade - long after I've removed the files from on-line storage.
3) Which leads me to another favorite lecture of mine - the media needs to last as long as the drive, but not much longer. Digital storage must move from standard to standard or be lost. How much longer before my eight inch floppies are useless?
I suspect that 30 years is sufficient for CD-Rs. They are compatible with the next generation, DVD, but probably not with the generation after that. Jumping generations seems about right and should average about 20 years - assuming that we stay with "consumer" media. And that I avoid such wildly popular formats as the 8-track tape ;-).
Thanks again,
Bob Washburne
See any major corporations using CD-Rs for backups lately? Big guys use Tapes for backup. They have been proven reliable for years. I'd suggest using tapes instead of CD-Rs for backups. If data integrity is paramount it's worth the extra thousand or so for the tape drive. Plus, you can reuse your tapes in a backup cycle! (lowering media cost somewhat)
Still, the other ideas suggested like multiple copies of backups are a good idea too, and when used with tapes make your solution even more effective.
More info: geeky, geekier, geekiest. An interesting tidbit is that the data is interleaved serially, meaning the data and the parity codes are spread across wide arcs of disc. That's why it's recommended to clean discs from the center out, not around the discs (so if you scratch it, you damage unrelated segments).
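Why interleaving helps against a scratch can be shown with a toy block interleaver. This is a simplification for illustration only: real CDs use a cross-interleaved Reed-Solomon code (CIRC), not this layout, and the depth and width below are arbitrary.

```python
def interleave(data, depth):
    """Treat data as `depth` codewords and emit them one symbol at a time."""
    width = len(data) // depth
    if depth * width != len(data):
        raise ValueError("data length must be a multiple of depth")
    return bytes(data[r * width + c] for c in range(width) for r in range(depth))

def deinterleave(stream, depth):
    """Undo interleave(): put every symbol back into its codeword."""
    width = len(stream) // depth
    out = bytearray(len(stream))
    for k, value in enumerate(stream):
        c, r = divmod(k, depth)
        out[r * width + c] = value
    return bytes(out)

def damaged_per_codeword(burst_start, burst_len, depth):
    """Count how many symbols of each codeword a burst on the disc destroys."""
    hits = [0] * depth
    for k in range(burst_start, burst_start + burst_len):
        hits[k % depth] += 1
    return hits

data = bytes(range(128))                 # 8 codewords of 16 symbols each
on_disc = interleave(data, depth=8)
hits = damaged_per_codeword(burst_start=37, burst_len=8, depth=8)
```

An 8-symbol burst on the interleaved stream costs each codeword exactly one symbol, which a modest error-correcting code can repair; the same burst on un-interleaved data would wipe out half of a single codeword.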
So, I think the idea of duplicating your CD-Rs and sending them to your relatives is a good one. For more fault tolerance, just send some more copies to some more relatives.
Couple of things...
First of all, an experiment:
Take a CD / CDR with known good data. Take out your keys. Carve a good scratch into the data surface. Put it in the drive and check the integrity of your data. Yup, it's still there (probably). CDs already incorporate ECC (error correcting codes) similar to RAID.
The real question in my mind is the time durability of your media. People make different claims about CD-Rs, but you can probably count on a minimum of 30 years if you store these things in a box in the dark (the dyes are photo sensitive).
Magnetic media, on the other hand, tends to have a much shorter life span. Think about where you're going to find a tape drive in 20 years. You're much more likely to find something that reads CDs, because they're a consumer technology (and because CD-ROMs can have virtually indefinite shelf lives).
Stick with CD-R / DVD-R, but think about your migration strategies. And you can't beat RAID for on-line storage.
"It is often possible to predict the lifetime of a product (or a key component of a product) by subjecting it to accelerated life testing--that is, by increasing the stress on the component until it fails. It is not clear, however, which stresses will lead to failure, or if increasing the stress accurately predicts what happens at the end of a component's life.
The disc of a CD-ROM is made of polycarbonate plastic and an encapsulated thin, reflective layer of aluminum. The digital information on the disc is imprinted in that aluminum layer. There are a number of possible wear-out mechanisms that could damage or destroy the information on a compact disc. Ultraviolet light can alter the optical properties of the polycarbonate plastic; cold flow of the plastic could lead to mechanical distortion of the disc; and oxidation could impair the readability of the aluminum reflective layer.
Practically speaking, the most likely wear-out mechanism for CD-ROMs will be the changing technology of data storage. Long before the disc itself becomes unreadable, it is likely that the CD-ROM will be replaced by a new medium and that it will not be possible to find a CD-ROM reader, except perhaps in a museum."
Apply it to the ISO filesystem you are creating, so that when the CD gets damaged, you read everything you can get into a file and use this to recover the original ISO image.
The only problem is that error correction is for bit changes, and normally, when you have a broken CD, you can't read part of it at all. So you have to insert the missing bits into the ISO image (plain zeros) at the correct place before using error correction. Don't ask me how; dd conv=noerror will give you all the data it can get, with no idea where something is missing.
One solution might be writing some structured data, where you can find the block numbers from the data itself. The CD doesn't have to have an ISO filesystem, or any filesystem at all. Just create your own data format and write it to the CD. And tell us when you have a solution available!
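A self-describing block format along those lines might look like the sketch below. Everything about the format is invented for illustration: the `ARCB` marker, the 12-byte header (marker, block number, CRC-32 of the payload), and the tiny 32-byte payload. The scanner searches for the marker instead of trusting offsets, so it still finds later blocks after bytes have gone missing.

```python
import struct
import zlib

MAGIC = b"ARCB"                  # hypothetical block marker
HEADER = struct.Struct(">4sII")  # marker, block number, CRC-32 of payload
PAYLOAD = 32                     # tiny payload size, for the demo only
BLOCK = HEADER.size + PAYLOAD

def pack_blocks(data):
    """Cut data into numbered, checksummed blocks and concatenate them."""
    blocks = []
    for n, start in enumerate(range(0, len(data), PAYLOAD)):
        payload = data[start:start + PAYLOAD].ljust(PAYLOAD, b"\0")
        blocks.append(HEADER.pack(MAGIC, n, zlib.crc32(payload)) + payload)
    return b"".join(blocks)

def scan_blocks(stream):
    """Find every intact block in a damaged stream: {block number: payload}."""
    found = {}
    pos = stream.find(MAGIC)
    while pos != -1 and pos + BLOCK <= len(stream):
        _magic, number, crc = HEADER.unpack_from(stream, pos)
        payload = stream[pos + HEADER.size:pos + BLOCK]
        if zlib.crc32(payload) == crc:
            found[number] = payload
            pos = stream.find(MAGIC, pos + BLOCK)
        else:
            pos = stream.find(MAGIC, pos + 1)  # false marker or damaged block
    return found

data = bytes(range(200))                 # packs into 7 blocks, numbered 0..6
stream = pack_blocks(data)
# Simulate a read that silently dropped the first 20 bytes of block 2.
damaged = stream[:2 * BLOCK] + stream[2 * BLOCK + 20:]

found = scan_blocks(damaged)
missing = [n for n in range(7) if n not in found]
```

Once the missing block numbers are known, they are exactly the positions to zero-fill (or hand to an erasure code) before attempting recovery.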
Not that it would be cheap, but have you thought of using several DLT drives in a RAID configuration of some sort?
I know I have heard of mirroring and parity systems using multiple parallel running tape drives - I am not sure how reliable they are/were - but it should be possible (if expensive).
Tape is actually very robust, and lasts a long time. There is a reason it is used so much in the industry.
I guess another possibility would be to find a card punch and a large cache of cards and... oh, never mind...