One Way To Save Digital Archives From File Corruption
storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."
If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error correction data in their files rather than the data they actually wanted to persist for their application. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem and perhaps some at the storage layer.
Also, if we can present the same logical file to the application when it is read, even if every 9th byte on the disk is parity, that is a plus, because it means legacy apps get the enhanced protection as well.
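For what it's worth, here is a minimal Python sketch of the "every 9th byte is parity" idea. Everything in it is an assumption for illustration: the names to_disk and from_disk are made up, the parity is a plain XOR of each group of 8 data bytes, and a single XOR byte can only flag a damaged group, not repair it. A real filesystem would use a proper error-correcting code.

```python
GROUP = 8  # data bytes per parity byte (assumption taken from "every 9th byte")

def xor_parity(chunk: bytes) -> int:
    """XOR of all bytes in the chunk."""
    p = 0
    for b in chunk:
        p ^= b
    return p

def to_disk(logical: bytes) -> bytes:
    """On-disk layout: each group of up to 8 data bytes followed by 1 parity byte."""
    out = bytearray()
    for i in range(0, len(logical), GROUP):
        chunk = logical[i:i + GROUP]
        out += chunk
        out.append(xor_parity(chunk))
    return bytes(out)

def from_disk(raw: bytes):
    """Return the logical bytes plus the numbers of groups whose parity check failed."""
    logical = bytearray()
    bad_groups = []
    for n, i in enumerate(range(0, len(raw), GROUP + 1)):
        block = raw[i:i + GROUP + 1]
        chunk, parity = block[:-1], block[-1]
        if xor_parity(chunk) != parity:
            bad_groups.append(n)
        logical += chunk
    return bytes(logical), bad_groups
```

The application only ever sees what from_disk returns, so a legacy program would not need to know the parity bytes exist.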
Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
include par2 files
>>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"
The ear-eye-brain connection has ~500 million years of development behind it, and has learned to filter out noise. If, for example, I'm listening to a radio, the hiss is mentally filtered out, or if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast, when a computer encounters noise or errors, it panics and says, "I give up," and the digital radio or digital television goes blank.
What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then take over and mentally remove the noise from the audio or image.
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
ZFS.
Next topic....
What files does a single bit error irretrievably destroy? Obviously it may cause problems, even very annoying problems when you go to use the file. But unless that one bit is in a really bad spot that information is pretty recoverable.
Don't save anything.
Life takes interesting turns, but the most interest is when you're off the beaten path.
Of course this can be fixed by "block redundancy" (like RAID does), "block recovery checksums" or old-fashioned backups.
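A rough Python sketch of how block checksums and block redundancy fit together, under some loud assumptions: equal-sized blocks, one XOR parity block (RAID-4/5 style), CRC32 as the per-block checksum, and at most one damaged block. The function names protect and repair are invented for this example.

```python
import zlib

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def protect(blocks):
    """Return (per-block CRC32 list, XOR parity block) for a list of data blocks."""
    return [zlib.crc32(b) for b in blocks], xor_blocks(blocks)

def repair(blocks, crcs, parity):
    """Use the checksums to find a damaged block, then rebuild it from the parity."""
    bad = [i for i, b in enumerate(blocks) if zlib.crc32(b) != crcs[i]]
    if not bad:
        return list(blocks)
    if len(bad) > 1:
        raise ValueError("a single parity block can only rebuild one bad block")
    survivors = [b for i, b in enumerate(blocks) if i != bad[0]]
    fixed = list(blocks)
    fixed[bad[0]] = xor_blocks(survivors + [parity])
    return fixed
```

The checksum says which block is wrong; the parity says what it should have been. Neither is enough on its own.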
extern warranty;
main()
{
(void)warranty;
}
It is about time that somebody (hopefully some of the commercial vendors AND the open source community too) gets wise to the problems of digital storage.
... I will have to ponder that. Maybe not, my programs seem too ephemeral for that ... Then again, that is what people thought about their 1960s COBOL programs.
I always create files with unique headers and consistent version numbering to allow for minor as well as major file format changes. For storage/exchange purposes, I make the format expandable where each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.
I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul.
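A minimal Python sketch of a format along those lines, to make the idea concrete. It is not the poster's actual format: the magic value, the byte layout (1-byte type, 2-byte big-endian length), the field numbers and the function names are all made up here, and resynchronising on the markers after a damaged header is left out.

```python
import struct

MAGIC = b"DEMO"                        # hypothetical 4-byte file signature
END = b"\x00\x00"                      # field terminator: two NULL bytes
KNOWN_TYPES = {1: "title", 2: "body"}  # field types this "old" reader understands

def write_file(major, minor, fields):
    """fields is a list of (type, payload) pairs; each field gets its own header."""
    out = bytearray(MAGIC + struct.pack(">BB", major, minor))
    for ftype, payload in fields:
        out += struct.pack(">BH", ftype, len(payload)) + payload + END
    return bytes(out)

def read_file(raw, supported_major):
    """Read any minor version of a supported major version, skipping unknown fields."""
    if raw[:4] != MAGIC:
        raise ValueError("bad magic")
    major, minor = struct.unpack_from(">BB", raw, 4)
    if major != supported_major:
        raise ValueError("incompatible major version")
    pos, result = 6, {}
    while pos < len(raw):
        ftype, length = struct.unpack_from(">BH", raw, pos)
        pos += 3
        payload = raw[pos:pos + length]
        pos += length
        if raw[pos:pos + 2] != END:
            raise ValueError("terminator missing, header is probably damaged")
        pos += 2
        if ftype in KNOWN_TYPES:       # unknown field types are silently ignored
            result[KNOWN_TYPES[ftype]] = payload
    return (major, minor), result
```

With this, a reader written for major version 2 can open a 2.34 file and simply not see the fields it does not know about, while a 3.x file is rejected up front.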
Because all of those are compressed, and take up a tiny fraction of the space that a faithful digital recording of the information on a film reel would take up. If you want lossless-level data integrity, use lossless formats for your masters.
... Efficiency is the enemy of redundancy!
Old documents saved in 'almost like ascii' are still 'readable'. I once salvaged a document from some obscure ancient word processor by opening it in a text editor. I also found some "images" (more like icons) on the same disk (a copy of a floppy), and even these I could "read" (by changing the page width of my text editor to fit the width of the uncompressed image).
As long as the storage space keeps growing...
It has been done like that for decades. Look at what archival tape does or DVDisaster or modern HDDs.
Also, this does not solve the problem, it just defers it. Why is this news?
Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
I remember reading a story of a guy who had to download a file from Apple that was over 4 gigabytes, and had to attempt it several times because each copy came back corrupted due to some problem with his internet connection. Eventually he gave up and found the file on BitTorrent, but realized that if he saved it in the same location as the corrupted file, the client would check the existing file and then overwrite only the bad parts with the correct information. He was able to fix it in under an hour using BitTorrent rather than trying to re-download the whole file while crossing his fingers and praying for no corruption.
I know it's not a perfect example, but just one way of looking at it.
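Roughly what the client is doing there, as a small Python sketch. The assumptions: the file is split into fixed-size pieces, a trusted list of per-piece SHA-1 hashes is available (as in a .torrent file), and the damaged copy has the right length. The names piece_hashes, repair and fetch_piece are made up, and fetch_piece stands in for "ask a peer for piece n".

```python
import hashlib

PIECE = 4  # absurdly small piece size, for illustration only

def piece_hashes(data, piece=PIECE):
    """SHA-1 digest of every fixed-size piece of the data."""
    return [hashlib.sha1(data[i:i + piece]).digest()
            for i in range(0, len(data), piece)]

def repair(local, hashes, fetch_piece, piece=PIECE):
    """Re-fetch only the pieces of `local` whose hash does not match."""
    fixed = bytearray(local)
    refetched = []
    for n, expected in enumerate(hashes):
        start = n * piece
        if hashlib.sha1(bytes(fixed[start:start + piece])).digest() != expected:
            fixed[start:start + piece] = fetch_piece(n)
            refetched.append(n)
    return bytes(fixed), refetched
```

Only the pieces that fail their hash get downloaded again, which is why fixing a mostly-good 4 GB file took an hour instead of another full download.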
Name...That...Autocomplete!
Parchive: Parity Archive Volume Set
It basically allows you to create an archive that's selectively larger, but contains an amount of parity such that you can have XX% corruption and still 'unzip.'
"The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]
KPH
Just don't compress anything. If a bit gets corrupted in a non-compressed bitmap file or in a plain .txt file, no more than 1 pixel or letter is lost.
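Easy enough to try. A small Python sketch, assuming zlib as the stand-in for "compressed" and a made-up helper flip_bit: flip one bit in the plain text and one bit in the compressed copy, then see what comes back. With zlib the damaged stream will typically fail to decompress at all, since the format carries a checksum over the whole payload.

```python
import zlib

def flip_bit(data, byte_index, bit=0):
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[byte_index] ^= 1 << bit
    return bytes(out)

def count_diff(a, b):
    """Number of byte positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

text = b"All work and no play makes Jack a dull boy. " * 20

# Uncompressed: one flipped bit damages exactly one character.
plain_damaged = flip_bit(text, 100)
plain_lost = count_diff(text, plain_damaged)

# Compressed: one flipped bit in the middle of the stream.
packed = zlib.compress(text)
try:
    recovered = zlib.decompress(flip_bit(packed, len(packed) // 2))
except zlib.error:
    recovered = None  # the whole file is rejected

print("plain text bytes damaged:", plain_lost)
print("compressed copy recovered intact:", recovered == text)
```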
Ten years ago my old company used to advise individuals who wanted to convert paper to digital to put the documents on microfilm first and then scan them. That way, when their digital media got damaged or lost, they could always recreate it. Film lasts for a long, long time when stored correctly. Unfortunately that still seems to be the best advice, at least if you are starting from an analog original.
As we're on the cusp of moving much of our data to the cloud, we've got the perfect opportunity to improve the resilience of information storage for a lot of people at the same time.
"We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday
The problem is not in error correction, but in the linearity of the data. Using only 256 pixels you could represent an image the brain can interpret. The problem is that the brain cannot interpret an image from the first 256 pixels, as that would probably be a line half as long as the image width, consisting of mostly irrelevant data.
If I wanted to make a fail-proof image, I would split it into squares of, say, 9 (3x3) pixels, and then put only the central pixel (every 5th px) values in the byte stream. Once that is done, repeat that for the surrounding pixels in each block. That way, even if part of the data is lost, the program would have at least one of the pixels in a 3x3 block and could use one of the nearby pixels as a substitute, leaving it up to the person to try and figure out the data. You could repeat the subdivision once more, achieving a pseudo-random order of bytes.
And this is just a mock up of what could be done to improve data safety in images without increasing the actual file size.
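Here is one way that ordering could look in Python, strictly as a sketch. Assumptions: the image is a flat row-major list of pixel values, width and height are multiples of 3, and lost data shows up either as a truncated stream or as None entries. The function names are made up.

```python
# Centre of each 3x3 block first, then the other eight positions.
OFFSETS = [(1, 1)] + [(r, c) for r in range(3) for c in range(3) if (r, c) != (1, 1)]

def stream_order(width, height):
    """Row-major pixel indices in the order they are written to the stream."""
    order = []
    for dr, dc in OFFSETS:
        for br in range(0, height, 3):
            for bc in range(0, width, 3):
                order.append((br + dr) * width + (bc + dc))
    return order

def encode(pixels, width, height):
    """Reorder pixels; the stream is exactly as long as the image."""
    return [pixels[i] for i in stream_order(width, height)]

def decode(stream, width, height):
    """Rebuild the image, substituting a surviving pixel from the same block for lost ones."""
    received = [None] * (width * height)
    for idx, value in zip(stream_order(width, height), stream):
        received[idx] = value
    pixels = list(received)
    for i, value in enumerate(received):
        if value is None:
            r, c = divmod(i, width)
            br, bc = r - r % 3, c - c % 3
            for dr, dc in OFFSETS:
                substitute = received[(br + dr) * width + (bc + dc)]
                if substitute is not None:
                    pixels[i] = substitute
                    break
    return pixels
```

If only the first ninth of the stream survives, decode still returns a full-size image in which every 3x3 block is filled with its centre pixel, which is a blurry picture rather than half a scanline.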
In the old days of the internet, designers used lower-resolution images to cut page loading time, and then gradually swapped them for higher-res versions once those had loaded. If it made sense to do it then, maybe we could now use integrated preview images to represent the average of each sector of pixels in the image, and then reverse-calculate missing pixels from the ones we have.
This could also work for audio files, and maybe even archives. I know I could still read the book even if every fifth letter was replaced by an incorrect one.
Cheers,
DLabz
Because no-one has ever managed to pull things from this theoretical "historical" layer without at least something like an electron microscope costing tens or hundreds of thousands, thousands of hours of skilled *manual* work, and having to crack the damn hard drive open and destroy it (if at all)? I believe there is still a challenge going around with a hard drive that was "zeroed" quite simply, and if anyone can recover the password in the single file that was on it before it was zeroed, they can get a few thousand dollars - nobody has even done more than look at it yet. (It certainly can't be done by software alone - are you thinking of unzeroed filesystem residue, which has nothing to do with hardware at all?)
In theory you might think you were right, but digital has nothing to do with historical layering (and it is doubtful whether that exists in a practical sense that can be utilised)... it's the method of recording - 1 or 0, or more possible patterns? Hard drives might store by majority but they do it for a reason - because a single bit is *useless* on such a fine recording medium, because it *can* change over time or just through slight inaccuracies in the recording/reading methods, so you have to swipe a whole bunch of the disk to be assured of reading back a 1 or 0 with your reader (which could never read more than the consensus of 1 or 0 because it's just not that accurate - it has to have a large bunch of magnetised particles to make any reading at all, it doesn't read each individually and then think "Oh, that's enough to be a 1" - when it reads it back, only a certain amount "trigger" it to think the thing is a 0 or 1 - thus it *IS* digital because the only answer it can give is 0 or 1 and not "well, almost a 1").
And if manufacturers thought for a second any of that was do-able in even enterprise drives, it would be done already and sold to the highest bidder. The fact is that it just isn't feasible or even possible - it's almost impossible to do that in a device small enough to fit in your car, or reliably, or without totally destroying the operation or performance of a drive, or for less than the price of a large rack full of storage.
You might be able to find some suggestions on how to fix that on Gopher.
Asking for a definition of ecc turns it up, so it's obviously not that uncommon. And as we're talking about data corruption, it's the obvious one.
Most IT techs would recognise the term from "ECC RAM", which is RAM that is capable of correcting bit errors and is often required by server motherboards.