One Way To Save Digital Archives From File Corruption
storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."
If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error correction data in their files rather than the data they actually wanted to persist for their application. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem and perhaps some at the storage layer.
Also if we can present the same logical file when read to the application even if every 9th byte is parity on the disk that is a plus because it means legacy apps can get the enhanced protection as well.
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
include par2 files
Done. +1 to the poster who said there is some round transportation implement being reinvented here.
>>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"
The ear-eye-brain connection has ~500 million years of development, and has learned the ability to filter-out noise. If for example I'm listening to a radio, the hiss is mentally filtered-out, or if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast when a computer encounters noise or errors, it panics and says, "I give up," and the digital radio or digital television goes blank.
What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then take over and mentally remove the noise from the audio or image.
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
ZFS.
Next topic....
The PNG image format divides the image data into "chunks", typically 8kbytes each, and each having a CRC checksum. You'd archive two copies of each image, presumably in two places and on different media. Years later you check both files for CRC errors. If there are just a few errors, probably they won't occur in the same chunk, so you can splice the good chunks from each stored file to create a new good file.
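The splice described above could be sketched roughly like this in Python. This is only an illustration, not a tested recovery tool: the function names are made up for the example, and it assumes both copies share the same chunk layout and that the 4-byte length fields themselves survived (a damaged length would derail the walk through the file).

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype, body):
    """Serialize one chunk: 4-byte length, 4-byte type, body, CRC-32 of type+body."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


def read_chunks(data):
    """Return a list of (type, body, crc_ok). Assumes the length fields are intact."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("missing PNG signature")
    chunks, pos = [], 8
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        end = pos + 12 + length
        if end > len(data):
            raise ValueError("chunk length runs past end of file")
        ctype = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        crc_ok = (zlib.crc32(ctype + body) & 0xFFFFFFFF) == stored_crc
        chunks.append((ctype, body, crc_ok))
        pos = end
    return chunks


def splice(copy_a, copy_b):
    """Build a clean file by taking each chunk from whichever copy passes its CRC."""
    chunks_a, chunks_b = read_chunks(copy_a), read_chunks(copy_b)
    if len(chunks_a) != len(chunks_b):
        raise ValueError("copies have different chunk layouts")
    out = [PNG_SIGNATURE]
    for index, (a, b) in enumerate(zip(chunks_a, chunks_b)):
        good = a if a[2] else b if b[2] else None
        if good is None:
            raise ValueError(f"chunk {index} is damaged in both copies")
        out.append(make_chunk(good[0], good[1]))
    return b"".join(out)
```

If the same chunk is bad in both copies the sketch simply gives up, which is where a third copy or parity data would come in.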
What files does a single bit error irretrievably destroy? Obviously it may cause problems, even very annoying problems when you go to use the file. But unless that one bit is in a really bad spot that information is pretty recoverable.
Stupid idea. Nowadays digital preservation is more about file format conversion than about bit rot.
Don't save anything.
Life takes interesting turns, but the most interest is when you're off the beaten path.
Of course this can be fixed by "block redundancy" (like RAID does), "block recovery checksums" or old-fashioned backups.
extern warranty;
main()
{
(void)warranty;
}
It is about time that somebody (hopefully some of the commercial vendors AND the open source community too) gets wise to the problems of digital storage.
... I will have to ponder that. Maybe not, my programs seem too ephemeral for that ... Then again, that is what people thought about their 1960s COBOL programs.
I always create files with unique headers and consistent version numbering to allow for minor as well as major file format changes. For storage/exchange purposes, I make the format expandable, where each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.
I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul.
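A rough Python sketch of that kind of record layout, for illustration only. The 1-byte type, 2-byte length, the field numbers and all the names here are assumptions made for the example, not the poster's actual format, and the resynchronisation step is left out.

```python
import struct

# Hypothetical field types that this (older) reader understands.
KNOWN_FIELDS = {1: "title", 2: "body"}
TERMINATOR = b"\x00\x00"  # two NULL bytes close every field


def write_field(field_type, payload):
    """Header (1-byte type, 2-byte big-endian length), payload, terminator."""
    return struct.pack(">BH", field_type, len(payload)) + payload + TERMINATOR


def read_fields(data):
    """Parse all fields, silently skipping types this reader does not know."""
    fields, pos = {}, 0
    while pos + 3 <= len(data):
        field_type, length = struct.unpack(">BH", data[pos:pos + 3])
        end = pos + 3 + length
        if data[end:end + 2] != TERMINATOR:
            raise ValueError(f"field at offset {pos} is damaged")
        if field_type in KNOWN_FIELDS:
            fields[KNOWN_FIELDS[field_type]] = data[pos + 3:end]
        pos = end + 2
    return fields
```

An old reader that meets a field type added in a later minor version just steps over it, which is the backward compatibility described above. A real implementation would scan forward for the next terminator instead of raising when a header is damaged.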
Because all of those are compressed, and take up a tiny fraction of the space that a faithful digital recording of the information on a film reel would take up. If you want lossless-level data integrity, use lossless formats for your masters.
... Efficiency is the enemy of redundancy!
Old documents saved in 'almost like ascii' formats are still 'readable'. I once salvaged a document from some obscure ancient word processor by opening it in a text editor. I also found some "images" (more like icons) on the same disk (a copy of a floppy); even these I could "read" (by changing the page width of my text editor to fit the width of the uncompressed image).
As long as the storage space keeps growing...
It has been done like that for decades. Look at what archival tape does or DVDisaster or modern HDDs.
Also, this does not solve the problem, it just defers it. Why is this news?
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
Oh, yeah. It doesn't.
And who gets to pay for existing apps to be rewritten?
I remember reading a story of a guy who had to download a file from Apple that was over 4 gigabytes, and had to attempt it several times because each came back corrupted due to some problem with his internet. Eventually, he gave up and found the file on bit torrent, but realized if he saved it in the same location as the corrupted file, it would check the file and then overwrite it with the correct information. He was able to fix it in under an hour using bittorrent rather than trying to re-download the file while crossing his fingers and praying for no corruption.
I know it's not a perfect example, but just one way of looking at it.
Name...That...Autocomplete!
Quite frankly data is so duplicated today bit-rot is not really an issue if you know what tools to use, especially if you use tools like quickpar on important data that can handle bad blocks.
Much data is easily duplicated, the data you want to save if it is important should be backed up with care.
Even though much of the data I download is easily downloaded again, the stuff I want to keep I quickpar the archives and burn to disc, and really important data that is irreplaceable I make multiple copies of.
http://www.quickpar.co.uk/
Parchive: Parity Archive Volume Set
It basically allows you to create an archive that's selectively larger, but contains an amount of parity such that you can have XX% corruption and still 'unzip.'
"The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]
KPH
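For anyone wondering how the RAID-like recovery idea works, here is a minimal Python sketch using a single XOR parity block. This is a deliberate simplification and an assumption for illustration: PAR2 itself uses Reed-Solomon coding, which can rebuild several lost blocks, while plain XOR parity can rebuild only one. The function names are invented for the example.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    length = len(blocks[0])
    out = bytearray(length)
    for block in blocks:
        if len(block) != length:
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def make_parity(blocks):
    """The parity block is simply the XOR of all data blocks."""
    return xor_blocks(blocks)


def recover(blocks, parity):
    """Rebuild the one block marked None from the survivors plus the parity block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    restored = xor_blocks(survivors + [parity])
    return blocks[:missing[0]] + [restored] + blocks[missing[0] + 1:]
```

Note that this only repairs a block you already know is bad, so each block still needs its own checksum to tell you which one to throw away.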
Just don't compress anything, if a bit corrupts in a non compressed bitmap file or in a plain .txt file, no more than 1 pixel or letter is lost.
I would like this. Some options I could work with: Extensions to current CD/DVD/Bluray ISO formats, new version of "ZIP" files and a new version of True Crypt files.
If done in an open standards way I could be somewhat confident of support in many years time when I may need to read the archives. Obviously backwards compatibility with earlier iso/file formats would be a plus.
Ten years ago my old company used to advise individuals who wanted to convert paper to digital to first put the documents on microfilm and then scan them. That way, when their digital media got damaged or lost, they could always recreate it. Film lasts for a long, long time when stored correctly. Unfortunately that still seems to be the best advice, at least if you are starting from an analog original.
Quantum computers will save us. They could examine every combination possible to rebuild a file in seconds.
CD/DVD/etc have error correction already.
Excuse me, but please get off my Pennisetum Clandestinum, eh!
I'd recommend just NOT using X-MODEM or Z-MODEM! Bit errors everywhere. Especially when mom picks up the telephone! ggggrrrrrrrrrrrrrrrr
640YB ought to be enough for anybody.
Just use ZFS, it's already been done.
kthxbai.
You can tell how powerful someone is by the magnitude of the crime they can commit and be able to get away with.
As we're on the cusp of moving much of our data to the cloud, we've got the perfect opportunity to improve the resilience of information storage for a lot of people at the same time.
"We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday
I believe that Forward-error correction is an even better model. Already used for error-free transmission of data over error-prone links in radio, and USENET using the PAR format, what better way to preserve data than with FEC?
Save your really precious files as Parchive files (PAR and PAR2). You can spread them over several discs or just one disc with several of the files on it.
It's one thing to detect errors, but it's a wholly different universe when you can also correct them.
http://en.wikipedia.org/wiki/Parchive
Kriston
The problem is not in error correction, but actually in the linearity of the data. Using only 256 pixels you could represent an image the brain can interpret. The problem is that the brain cannot interpret an image from the first 256 pixels, as that would probably be a line half as long as the image width, consisting of mostly irrelevant data.
If I wanted to make a fail-proof image, I would split it into squares of, say, 9 (3x3) pixels, and then put only the central pixel (every 5th px) values in the byte stream. Once that is done, repeat that for the surrounding pixels in the block. That way, even if part of the data is lost, the program would have at least one of the pixels in each 3x3 block and could use one of the nearby pixels as a substitute, leaving it up to the person to try and figure out the data. You could repeat the subdivision once again, achieving a pseudo-random order of bytes.
And this is just a mock up of what could be done to improve data safety in images without increasing the actual file size.
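A quick Python sketch of that ordering, as an illustration only. The pass order after the central pixel, the use of None to mark lost values, and the function names are my own assumptions; the image is a plain list of rows.

```python
# Position inside each 3x3 block visited by each pass; the centre goes first.
PHASES = [(1, 1), (0, 0), (2, 0), (0, 2), (2, 2), (1, 0), (0, 1), (2, 1), (1, 2)]


def pass_order(width, height):
    """Pixel coordinates in storage order: one pixel per 3x3 block per pass."""
    return [(x, y)
            for px, py in PHASES
            for y in range(py, height, 3)
            for x in range(px, width, 3)]


def interleave(image):
    """Flatten the image into the interleaved stream."""
    height, width = len(image), len(image[0])
    return [image[y][x] for x, y in pass_order(width, height)]


def substitute(image, x, y):
    """Return any surviving value from the 3x3 block containing (x, y), else None."""
    x0, y0 = x - x % 3, y - y % 3
    for ny in range(y0, min(y0 + 3, len(image))):
        for nx in range(x0, min(x0 + 3, len(image[0]))):
            if image[ny][nx] is not None:
                return image[ny][nx]
    return None


def rebuild(stream, width, height):
    """Undo the interleave; None entries in the stream are filled from the same block."""
    placed = [[None] * width for _ in range(height)]
    for value, (x, y) in zip(stream, pass_order(width, height)):
        placed[y][x] = value
    return [[placed[y][x] if placed[y][x] is not None else substitute(placed, x, y)
             for x in range(width)]
            for y in range(height)]
```

Losing a contiguous run at the start of the stream now costs every block one or two pixels instead of wiping out whole rows, and the file is no bigger than before.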
In the old days of the internet, designers used lower-resolution images to reduce page loading time, and then gradually swapped in higher-res versions once those had loaded. If it made sense to do it then, maybe we could now use integrated preview images to represent the average of each sector of pixels in the image, and then reverse-calculate missing ones using the pixels we have.
This could also work for audio files, and maybe even archives. I know I could still read the book even if every fifth letter was replaced by an incorrect one.
Cheers,
DLabz
The in-file checksum thing is a good idea, but it may be redundant to disk- or filesystem-level checksums.
Another useful thing is to store information in "chunks" so that if a bit goes bad no more than one "chunk" is lost. A chunk could be a pixel or group of pixels in certain graphics formats, a page, in certain "page" formats such as PDF or multi-page TIFF, a cell in a spreadsheet, a maximum-length run of characters in a word-processing document, etc.
Storing files in "ascii-like" formats where it makes sense to do so is also a good idea from a data-recovery perspective.
For files that represent "events in time" such as music, video, or some scientific data collections, a "chunk" might be a second or some other period of time.
Many of today's data formats already operate at a "chunk" level. Many do not.
On another note, these days, "space is cheap" on disk, but not necessarily so when it comes to networking or the time it takes to make backups. 1TB is under $100 on a home machine, several times that on a server, a relative pittance over the life of the drive. However, copying 1 TB takes a non-trivial amount of time.
Knowledge is how to play a game, intelligence is how to win, wisdom is knowing what game to play.
For any media you actually expect to retain for years, possibly without touching, create par2 files - say 10% to correct any errors later.
#!/bin/sh
for filename in "$@"; do
    # Create 10% recovery data with a block size of 300 KB
    nice par2 create -s307200 -r10 "$filename"
done
Use ZFS if you are in OpenSolaris/FreeBSD land, or use a Reed-Solomon tool - a GPL implementation of mine was Slashdotted 2 years ago, here: http://users.softlab.ntua.gr/~ttsiod/rsbep.html
In the context of Hard Drives. Hard Drives are NOT digital.
They store a value via "majority". If a bit is overwritten once, it can easily be recovered with the right software; after several overwrites you may need some specific hardware.
Why don't OSes (and manufacturers) take advantage of this? There are effectively 2 layers per disc you could use to store data without degradation.
As long as the disc is kept in good condition, you can use this extra layer.
Instead, what we have are companies squeezing sectors closer together and making this method unreliable the higher the density.
Stop treating a magnetic disc as an optical disc, you can store much more on it.
This could drop the cost of drives significantly and still retain the same size as we currently have. (1gig, 2 gigs at a stretch but i still wouldn't risk it)
There are several levels of error correction at the disk level, plus at the RAID level, plus possibly at the file system level. And the whole thing has been wrapped up so well that users don't have to worry about it. If users are still getting bit errors, someone hasn't been paying attention to their SMART, RAID, and file system logs.
No amount of error correction will protect you from that; sooner or later, disks go bad, and you have to replace them before there are too many errors for the system to recover.
It depends on exactly what is encoded with this number. If it is a pixel in an image, flipping one bit won't destroy the image. The same is true for video and plain text. Flipping a bit in a text file would change exactly one letter.
The problem lies in the encoding and the type of information saved.
If, for example, a binary format is used, there has to be a way to identify the borders of the different data formats. For information of fixed length, this can be done with counting. If the information has no fixed length, this can be done with byte stuffing.
If we want to save a linked list in a binary format using byte stuffing, we would take one byte and define it to be the new list character. Then we code the list in a way that our list character never turns up in the expression used to save a list. Normally this is done by defining an escape character which tells us that the next character is not the special character used for byte stuffing.
For example, we want to save a list of sentences using ASCII and define the pipe symbol "|" as special character. The resulting file would look like this:
Hello world|This is a simple example|for the almighty interweb
Now, the only way the file would turn out to be unusable after flipping a bit is, if a bit of the byte used for the pipe symbol was flipped. Flipping any other bit would change the meaning, but the program would be able to load it anyway.
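A small Python sketch of the pipe-separated example above with an escape character added. The backslash as escape character and the function names are my own choice for the illustration.

```python
SEP = "|"    # the special list character from the example
ESC = "\\"   # escape character (an assumption; any reserved character would do)


def encode(items):
    """Join items with SEP, escaping any SEP or ESC that occurs inside an item."""
    escaped = [item.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP) for item in items]
    return SEP.join(escaped)


def decode(text):
    """Split on unescaped SEP and drop the escape characters again."""
    items, current, escaped = [], [], False
    for ch in text:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == ESC:
            escaped = True
        elif ch == SEP:
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
    items.append("".join(current))
    return items
```

A flipped bit in an ordinary letter changes that letter and nothing else. A flipped bit in a separator or escape byte merges or splits two entries, which is the failure case described above.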
So it depends on what and how it is saved. For most media files flipping a bit would not render the file useless. For an account it probably would.
If the data is important, add redundancy, that is additional bits for error correction. If the data is not important, don't add redundancy because it increases the file size.
That is exactly the reason why compressed files are useless after a flipped bit. A compression algorithm removes redundant bits to decrease file size.
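This is easy to try for yourself. The Python sketch below flips single bits in a plain text buffer and in its zlib-compressed form; the sample text and helper names are arbitrary choices for the illustration, and zlib stands in for "a compression algorithm" in general.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def damaged_bytes(original, damaged):
    """Count how many byte positions differ between two equal-length buffers."""
    return sum(a != b for a, b in zip(original, damaged))


def survives_flip(packed, bit_index, original):
    """True if the stream still decompresses to the original after one bit flip."""
    try:
        return zlib.decompress(flip_bit(packed, bit_index)) == original
    except zlib.error:
        return False


text = b"Hello world|This is a simple example|for the almighty interweb\n" * 20
packed = zlib.compress(text)

# Uncompressed: the damage stays confined to a single byte.
print(damaged_bytes(text, flip_bit(text, 1234)))

# Compressed: count how many of the possible single-bit flips are harmless.
harmless = sum(survives_flip(packed, i, text) for i in range(len(packed) * 8))
print(harmless, "of", len(packed) * 8, "flips left the file readable")
```

In the plain buffer exactly one byte changes per flip, whereas nearly every flip in the compressed stream should make decompression fail outright or produce output that no longer matches.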
zfec is much, much faster than par2: http://allmydata.org/trac/zfec
Tahoe-LAFS uses zfec, encryption, integrity checking based on SHA-256, digital signatures based on RSA, and peer-to-peer networking to take a bunch of hard disks and make them into a single virtual hard disk which is extremely robust: http://allmydata.org/trac/tahoe
Injecting ECC into the stored streams may help a bit, but please stop thinking about methods that preserve the content of single streams.
Just duplicate the streams to multiple physical media, it's cheap, easy and can remedy many more situations.
Many of the jpgs that I took with my first digital camera were damaged or destroyed by bit corruption. I doubt I'm the only person who fell victim to that problem; those pictures were taken back in the day when jpg was the only option available on cameras and many of us didn't know well enough to save them under a different file format afterwards. Now I have a collection of images where half the image is missing or disrupted - and many others that just simply don't open at all anymore.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
I've taken to using DVDisaster to actually pad ECC data into the ISO filesystem, so that there's a good chance of recovering the data, even if a file becomes unreadable. Just another layer / method of protection.
"Ending is better than mending". :)
Consumers should welcome file corruption; it's a chance to throw away those old files and buy some brand new ones instead
Actually, I would not be surprised if the media companies were busily trying to invent a self-corrupting DRM format to replace DVDs and suchlike.
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Just resign yourself to the fact that the Code of Hammurabi will outlive your pr0n.
What's with the useless acronym at the end of the summary? I hate useless and made up acronyms (umua)
Come on, ISO, where are you? We all need the best (or alternatively, least-worst) glidepath now. When I retired, the argument was all about proprietary formats for formatted text, and this and that. USians seemed to want to take the lead on everything and thereby 'offer' formal Secretariat (and steering). Now there's something worth doing - fixit, folks - and non-proprietary, pretty-please.
What's wrong with physical storage for say 200 years? Most data we save these days will be fine stored this way. The issue is what we choose to store just because we can, which causes problems. The net is overflowing with the minutiae of boring existences. What we really need is a data scythe to cut the rubbish out and then store the rest. The Victorian era handed down lots of books, magazines and pamphlets. Some of these are preserved and read by historians. In 150 years will we really care about the financial statements for Goldman-Sachs or want the blog of Paris Hilton? (No lewd comments, it was rhetorical).
> The solution proposed by the author: two headers and error correction code (ECC) in every file."
When there are two possibilities, which one do you choose? Three allows the software to have a vote among the headers, and ignore or correct the loser (assuming that there IS one, of course).
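A minimal Python sketch of such a vote, done bit by bit across three copies of the header. The function names are hypothetical, and the sketch assumes all three copies have the same length and that no single bit position is damaged in more than one copy.

```python
def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted (used here to simulate damage)."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def majority_vote(copy_a, copy_b, copy_c):
    """For every bit, keep the value that at least two of the three copies agree on."""
    if not (len(copy_a) == len(copy_b) == len(copy_c)):
        raise ValueError("header copies must have the same length")
    return bytes((a & b) | (a & c) | (b & c)
                 for a, b, c in zip(copy_a, copy_b, copy_c))
```

With only two copies you can detect that they disagree but not which one is right; the third copy is what breaks the tie.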
Also, keeping the headers in text, rather than using complicated encoding schemes to save space where it doesn't much matter, is probably a good idea, as well. Semantic sugar is your friend here.
I second that :
RAID (especially RAID-6) does indeed seem inevitable for long-term storage, combined with a modern FS implementing checksums.
(And an intelligent stack which is able to leverage the FS's checksums while scanning/repairing the array. ZFS is such an example. Linux will probably follow, although this will require support from both the RAID driver *and* the file system or their respective checking tools, e.g. mdadm using info from fsck to repair the corrupted blocks.)
This setup can more easily resist silent bit rot. (If for some rare random reason data is altered but still readable, a simple mirror or a RAID-5 will be able to determine that data was corrupted, but will have a hard time determining which copy is the "good one" if all sectors are still readable after corruption.)
And the best part is that once such a stack is deployed, data survivability comes for free : no additional efforts required, no new libraries to support newer "redundant" formats and/or no need for the user to re-mux the media into a container with its ECC option enabled.
Add to that a file system which supports snapshots and you get also a good protection against accidental file deletion.
The best part : All the necessary pieces are already available for free (both gratis and freedom) now.
Thus it's not complicated for a current user to create the perfect data vault, with only minimal cost (upfront of building the vault and then yearly maintenance to replace parts and media with newer as the hardware ages).
The key problem is not the data integrity (necessary tools are here), nor the hardware (just replace parts as they age).
The key problem with long term archives is the format :
In some distant hypothetical future, how are you going to open a Microsoft Word 97 file correctly, when we'll be using a new VisualBasic-Message-Passing version of Office Singularity running on Microsoft Singularity OS ?
Most media currently rely on pretty well documented formats (JPEG, MPEG, MP3, etc.). Nonetheless there are formats, both on the consumer end (MS-Word) and on the professional end (RAW picture formats), which are poorly documented or kept secret. And as such nothing guarantees that 20 years down the line these will still be openable.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
But then who polices the police bits?
I'd agree with you about the amount and detail of data out there right now. I'm not sure who's going to be archiving this stuff so that there is a danger of storing 100 terabytes of 'what I am eating right now' blogs. However, the two examples you mentioned I think should be preserved as monuments to failitude of society. After all, you know that old saying about repeating history. My boss didn't read about world war 2 and now, every other week we have a Jewish Holocaust at work.
Not enough error correction to accommodate one good scratch. Detecting an error and recovering from it can be very different things.
One time I was downloading a movie that didn't quite finish.
I went to watch the file [XviD in an .avi container] in VLC, and most of the movie played fine; obviously it stuttered over the missing sections.
So, perhaps there's something to say for this format for video-archiving.
(This format doesn't compress much further (with .zip or .rar at least), so we wouldn't face compression-algorithm issues.)
I listen to both RIAA and non-RIAA stuff if I like the music, tangential business/politics notwithstanding.
Create massive redundant copies of each work (with MD5 checksums), and keep copying them to new media on a staggered basis. Whenever one copy fails a checksum check, replace it with a good copy. Memory is cheap; why keep just one or a few copies of anything that important? To the RIAA, let me say this: we are just trying to ensure that none of your valuable intellectual property becomes lost due to data corruption. You release it, we'll archive it for you!
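The check-and-replace routine could be sketched like this in Python. It is an illustration under assumptions: the function names are invented, every listed copy is assumed to exist, and the known-good MD5 digest is assumed to have been recorded somewhere safe when the copies were made.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """MD5 hex digest of a file, read in 1 MiB blocks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def repair_copies(paths, expected_md5):
    """Overwrite every copy whose checksum is wrong with a copy that still matches."""
    good = [path for path in paths if md5_of(path) == expected_md5]
    bad = [path for path in paths if path not in good]
    if not good:
        raise RuntimeError("no intact copy left to repair from")
    for path in bad:
        shutil.copyfile(good[0], path)
    return bad  # the copies that had to be replaced
```

Run on a schedule against copies on different media, this catches a bad copy while good ones still exist. MD5 is fine for spotting accidental corruption, though not for guarding against deliberate tampering.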
I've abandoned my search for truth; now I'm just looking for some useful delusions.
Let me qualify. Digital is pretty much like analog when you're dealing with data that's designed to be consumed non-digitally anyway, say like the smile of Mona Lisa or a pop song. A single off byte shouldn't matter to human eyes and ears. In contrast, an encrypted file is meant to be consumed digitally by the decrypting software before a human reads, watches, or listens to it: every bit counts.
So the problem is clearly with the type of digital storage being used. For most video or audio files, I find that preserving the first and last few megabytes should be enough for the file to be partly accessed. Any errors in between would, or should, result in a digital blip that is no different from a smudge on a piece of paper where only the smudged part is rendered illegible. (I add "should" because in some real cases the player program crashes.)
Of course, severely degraded media would be a problem. But how is this different from getting your precious million-dollar painting damaged in a fire or flood?
Aaah. That must be what happened with the East Anglia files, when the bit went bad in a researcher's brain, and caused him to substitute New Hampshire local data for world average data, and prove his point of the "shepherd's crook."
That must also be what happened when a bit went bad in Geithner's brain causing him to break the law (conspiracy to commit fraud) in forcing Bank of America to eat the losses in that takeover... and break the constitution in bringing forward a bailout... and destroy America's job situation to preserve the Wall Street fornication industry.
And thus our politician's damage is increased by anywhere from 1 to 2 billion or EVEN A LOT MORE.
A software program eventually crashes or stops because of a wrong bit. But a book is still readable if some letters are unclear or a word is missing. - We shouldn't store information that isn't "all or nothing" in an all-or-nothing format.
I remember a storage example from doing neural networks:
An algorithm writes numbers to a matrix. A vector holds "1.2;4.8;0.9". If you add "5", the same memory now holds "1;5;1;5". It could of course be used the other way around as well: loss of one memory cell degrades the quality of the whole, but the whole information is still accessible.
By now, computers should be fast enough to implement a kind of "lossy digital" holographic archival file format. For scans or picture archives this would be great. And it doesn't prevent you from adding additional checksums or correction blocks.
Everyone who is even considering spouting the letters "ZFS" in response to this article should really just STFU. Seriously.
Allow me to explain. Yes, ZFS is a very nice and very robust filesystem with great data protection and recovery features (although still subject to failure and data loss under some conditions, don't even try to deny it, it isn't perfect).
But all the ZFS zealots need to stop and think about all the other filesystems currently in use, and realize that ZFS will NEVER replace most of those filesystems in most situations. There needs to be a solution to bit rot that does not entail switching the entire world to a new filesystem. NTFS, FAT12/FAT16/FAT32, HFS+, Ext3/4, ReiserFS, UDF, all of these and more will continue to be in use in millions of computers and on billions of devices using removable or embedded media for many decades, and more filesystems will be invented in the future. You will never see a digital camera with built-in ZFS support, for instance. ZFS is totally unfeasible for that kind of application. It takes far too much processing power and memory to run ZFS for it to ever become anything resembling a universal filesystem. Filesystems like ZFS are not a panacea, there needs to be a solution (like PAR2) that is portable between ALL different filesystems that are now or ever will be in use.
Basically, things like the PAR2 parity archiving format already solve this type of problem, but in a way that is too limited. It needs to be better integrated into the filesystem or operating system level so that it works automatically on all kinds of different filesystems. Right now, the parity information is something that you have to manually create with a separate software tool like Parchive when you are interested in "archiving" something. This kind of functionality needs to be somehow tacked on to the file storage process so that the parity data is created, updated and continuously checked by whatever is reading and writing to the file, no matter where that file is stored. It needs to be part of the file itself, so that when a file is copied or moved, the parity data is not lost.
As usual, to any particular problem there is an answer that is straightforward, simple and WRONG (I forget what smart person said that first). For this problem, ZFS is not the ultimate answer. It's great for specific situations like file servers, but that's about it. As soon as you remove a file from that file server, poof, you lose access to that parity information. That's just dumb. For important data that needs to be self-repairing, the only real solution is to include the parity information alongside the data, in a portable format.
Personally I've been quite surprised over the years that almost no modern filesystem in use anywhere has the kind of parity information built-in that ZFS has. So much data could be easily recovered if filesystems were robust enough to handle simple things like bit errors or unreadable sectors. Why should my 2GB file be ruined just because a single 512-byte sector became unreadable in a critical location in the file? It's idiotic to need to have multiple complete duplicate copies of every single type of data we ever store in order to be sure we can recover from simple forms of data degradation like bit rot.
Each block uses ECC
On a modern disk, every block already has ECC. Furthermore, there are APIs to query disks about disk failures. There is no reason to reimplement block-level ECC at the file system level. If people aren't checking for hardware failures, what's the point of giving them another set of ECC errors to check?
This truly is the filesystem every other one is playing catchup with.
God, I hope not. I wouldn't want to use anything as poorly designed as ZFS.
I can say that HDDs make a lot of use of ECC. They misread bits all the time, but only seldom do these misreads require a re-read, much less cause actual corruption.
I assume that if an OEM requested higher ECC (at the loss of data capacity) they could do so.
For a good laugh, check out this statue by Michelangelo: Statue of Moses in San Pietro in Vincoli
Result of a slight mistranslation that Moses had horns. Makes him easier to distinguish from the other prophets, I guess...
To be, or not to be: isn't that quite logical, Slashdot Beta?