Slashdot Mirror


One Way To Save Digital Archives From File Corruption

storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."

36 of 257 comments

  1. To much reinvention by DarkOx · · Score: 5, Interesting

    If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error correction data in their files rather than the data they actually wanted to persist for their application. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem and perhaps some at the storage layer.
    Also, if we can present the same logical file to the application when read, even if every 9th byte on the disk is parity, that is a plus because it means legacy apps can get the enhanced protection as well.
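
    The "every 9th byte is parity" idea can be sketched roughly as follows. This is only an illustration, not how any real filesystem lays out its data: the group size, the XOR parity scheme, and the function names to_disk and from_disk are all assumptions made for the example. The point is that the application only ever sees the logical bytes.

```python
from functools import reduce

GROUP = 8  # assumed layout: 8 data bytes followed by 1 parity byte


def xor_bytes(chunk: bytes) -> int:
    """XOR all bytes of a chunk together into a single parity byte."""
    return reduce(lambda a, b: a ^ b, chunk, 0)


def to_disk(logical: bytes) -> bytes:
    """Physical layout: every group of up to 8 data bytes gets a parity byte."""
    out = bytearray()
    for i in range(0, len(logical), GROUP):
        chunk = logical[i:i + GROUP]
        out += chunk
        out.append(xor_bytes(chunk))
    return bytes(out)


def from_disk(physical: bytes) -> bytes:
    """Verify each group's parity and return only the logical data bytes."""
    out = bytearray()
    for i in range(0, len(physical), GROUP + 1):
        group = physical[i:i + GROUP + 1]
        chunk, parity = group[:-1], group[-1]
        if xor_bytes(chunk) != parity:
            raise ValueError(f"parity mismatch in group starting at byte {i}")
        out += chunk
    return bytes(out)
```

    A single XOR parity byte per group only detects a flipped bit; it cannot say which byte was hit, so actual repair would need a stronger code or a second copy.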

    --
    Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    1. Re:To much reinvention by paradxum · · Score: 5, Insightful

      It already exists; it's called ZFS on Solaris boxxen. Each block uses ECC, it can correct itself on each read, and generally can indicate a failing disk. This truly is the filesystem every other one is playing catchup with.

    2. Re:To much reinvention by MrNaz · · Score: 5, Insightful

      Ahem. RAID anyone? ZFS? Btrfs? Hello?

      Isn't this what filesystem devs have been concentrating on for about 5 years now?

      --
      I hate printers.
    3. Re:To much reinvention by Whalou · · Score: 3, Funny

      ReiserFS is good for that also. If you make a deal with the 'file system' it will tell you where your 'file' is hidden.

      --
      English is not this .sig mother tongue...
    4. Re:To much reinvention by Interoperable · · Score: 2, Insightful

      I agree that filesystem-level error correction is a good idea. Having the option to specify ECC options for a given file or folder would be great functionality to have. The idea presented in this article, however, is that certain compressed formats don't need ECC for the entire file. Instead, as long as the headers are intact, a few flipped bits here or there will result in only some distortion; not a big deal if it's just vacation photos/movies.

      By only having ECC in the headers, you would save a good deal of storage space and processing time. It wouldn't need to be supported in every application either, just the codecs. Individual codecs could include it fairly easily as they release new versions, which wouldn't be backward compatible anyway so you don't introduce a new problem. I think it's a good idea, it would keep media readable with very little overhead, just a few odd pixels during playback even in a corrupted file.

      --
      So if this is the future...where's my jet pack?
    5. Re:To much reinvention by bertok · · Score: 3, Insightful

      If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error correction data in their files rather than the data they actually wanted to persist for their application. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem and perhaps some at the storage layer.

          Also, if we can present the same logical file to the application when read, even if every 9th byte on the disk is parity, that is a plus because it means legacy apps can get the enhanced protection as well.

      Precisely. This is what things like torrents, RAR files with recovery blocks, and filesystems like ZFS are for: so every app developer doesn't have to roll their own, badly.

    6. Re:To much reinvention by An+dochasac · · Score: 3, Informative

      "says massive digital archives are threatened by simple bit errors that can render whole files useless.
      Isn't this what filesystem devs have been concentrating on for about 5 years now?

      Not just 5 years. ZFS's checksum on every data block and RAID-Z (no RAID write hole) are innovative and obviously the next step in filesystem evolution. But attempts at redundancy aren't new. I'm surprised the article is discussing relatively low-tech, old-hat ideas such as two filesystem headers. Even DOS's FAT used this RAID 1-style brute-force redundancy by having two FAT tables. The Commodore Amiga's Intuition filesystem did this better than Microsoft back in 1985 by having forward and backward links in every block, which made it possible to repair block pointer damage by searching for a reference to the bad block in the preceding and following blocks.
      And I suppose if ZFS doesn't catch on, 25 or 30 years from now Apple or Microsoft will finally come up with it and say, "Hey look what we invented!"

    7. Re:To much reinvention by Hatta · · Score: 5, Interesting

      Don't forget PAR2. I never burn a DVD without 10%-20% redundancy as par2 files. Even if the filesystem gets too damaged to read, I can usually dd the whole disk and let par2 recover the files.

      --
      Give me Classic Slashdot or give me death!
    8. Re:To much reinvention by Rockoon · · Score: 2, Interesting

      File Systems are in the software domain. If you aren't getting good data (what was written) off the drive, the File System ideally shouldn't be able to do any better than the hardware did with the data. Of course, in reality the hardware uses a fixed redundancy model that offers less reliability than some people like. The danger of software-based solutions is that they allow hardware manufacturers to offer even less redundancy, or even NO redundancy at all, causing a need for even MORE software-based redundancy.

      The ideal solution is to make sure the data is good at every step, rather than allow the device (or transmission medium) to treat bad data as good data. With ZFS or any other File System solution, the device won't know that the data is bad... and that's bad.

      --
      "His name was James Damore."
  2. par files by ionix5891 · · Score: 5, Informative

    include par2 files

  3. It's that computer called the brain. by commodore64_love · · Score: 5, Interesting

    >>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"

    The ear-eye-brain connection has ~500 million years of development, and has learned to filter out noise. If, for example, I'm listening to a radio, the hiss is mentally filtered out, or if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast, when a computer encounters noise or errors, it panics and says, "I give up," and the digital radio or digital television goes blank.

    What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then take over and mentally remove the noise from the audio or image.

    --
    "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    1. Re:It's that computer called the brain. by commodore64_love · · Score: 4, Interesting

      P.S.

      When I was looking for a digital-to-analog converter for my TV, I returned all the ones that displayed blank screens when the signal became weak. The one I eventually chose (x5) was the Channel Master unit. When the signal is weak it continues displaying a noisy image, rather than go blank, or it reverts to "audio only" mode, rather than go silent. It lets me continue watching programs rather than be completely cut off.

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    2. Re:It's that computer called the brain. by ILongForDarkness · · Score: 2, Insightful

      And how well did that work for your last corrupted text file? Or a print job that the printer didn't know how to handle? My guess is you could pick out a few words and the rest was random garble. The mind is good at filtering out noise, but it is an intrinsically hard problem to do a similar thing with a computer. Say a random bit is missed and the whole file ends up shifted one to the left: how does the computer know that the combinations of pixel values it is displaying should start one bit out of sync so that the still-existing data "looks" good? Similarly with a text file, all the remaining bits could be valid characters, so how is a computer to know what characters to show other than by having the correct data?

    3. Re:It's that computer called the brain. by Phreakiture · · Score: 4, Interesting

      What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then take over and mentally remove the noise from the audio or image.

      Audio CDs have always done this. Audio CDs are also uncompressed*.

      The problem, I suspect, is that we have come to rely on a lot of data compression, particularly where video is concerned. I'm not saying this is the wrong choice, necessarily, because video can become ungodly huge without it (NTSC SD video -- 720 x 480 x 29.97 -- in the 4:2:2 colour space, 8 bits per pixel per plane, will consume 69.5 GiB an hour without compression), but maybe we didn't give enough thought to stream corruption.
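
      The 69.5 GiB figure can be checked with a few lines of arithmetic. The only assumption is the usual reading of 4:2:2 at 8 bits per sample: full-resolution luma plus two half-width chroma planes, which averages out to 2 bytes per pixel.

```python
WIDTH, HEIGHT = 720, 480   # NTSC SD frame size from the comment above
FPS = 29.97                # NTSC frame rate
BYTES_PER_PIXEL = 2        # 4:2:2, 8 bits per sample: 1 byte luma + 1 byte chroma on average

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_hour = bytes_per_frame * FPS * 3600
gib_per_hour = bytes_per_hour / 2**30

print(f"{bytes_per_frame} bytes per frame")
print(f"{gib_per_hour:.1f} GiB per hour")
```

      That works out to 691,200 bytes per frame and roughly 69.5 GiB per hour, in line with the number quoted.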

      Mini DV video tape, when run in SD, uses no compression on the audio, and the video is only lightly compressed, using a DCT-based codec, with no delta coding. In practical terms, what this means is that one corrupted frame of video doesn't cascade into future frames. If my camcorder gets a wrinkle in the tape, it will affect the frames recorded on the wrinkle, and no others. It also makes a best-guess effort to reconstruct the frame. This task may not be impossible with denser codecs that do use delta coding and motion compensation (MPEG, DivX, etc), but it is certainly made far more difficult.

      Incidentally, even digital cinemas are using compression. It is a no-delta compression, but the individual frames are compressed in a manner akin to JPEGs, and the audio is compressed either using DTS or AC3 or one of their variants in most cinemas. The difference, of course, is that the cinemas must provide a good presentation. If they fail to do so, people will stop coming. If the presentation isn't better than watching TV/DVD/BluRay at home, then why pay the $11?

      (* I refer here to data compression, not dynamic range compression. Dynamic range compression is applied way too much in most audio media)

      --
      www.wavefront-av.com
  4. Sun Microsystems..... zfs..... by HKcastaway · · Score: 3, Insightful

    ZFS.

    Next topic....

  5. What files does a single bit error destroy? by jmitchel!jmitchel.co · · Score: 2, Insightful

    What files does a single bit error irretrievably destroy? Obviously it may cause problems, even very annoying problems when you go to use the file. But unless that one bit is in a really bad spot that information is pretty recoverable.

    1. Re:What files does a single bit error destroy? by Jane+Q.+Public · · Score: 3, Informative

      That's complete nonsense. Just for one example, if the bit is part of a numeric value, depending on where that bit is, it could make the number off anywhere from 1 to 2 BILLION or even a lot more, depending on the kind of representation being used.
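
      A quick way to see this is to flip individual bits in a stored number. The sketch below assumes a 32-bit unsigned integer and a 64-bit IEEE 754 double; the helper names flip_int_bit and flip_double_bit are made up for the example.

```python
import struct


def flip_int_bit(value: int, bit: int) -> int:
    """Flip one bit of an unsigned 32-bit integer."""
    return (value ^ (1 << bit)) & 0xFFFFFFFF


def flip_double_bit(value: float, bit: int) -> float:
    """Flip one bit of a 64-bit IEEE 754 double (bit 63 is the sign bit)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (damaged,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return damaged


balance = 1000
for bit in (0, 16, 31):
    damaged = flip_int_bit(balance, bit)
    print(f"bit {bit:2}: {balance} -> {damaged} (off by {abs(damaged - balance)})")

# Flipping the top exponent bit of a double is even more dramatic.
print(flip_double_bit(1.0, 62))
```

      Flipping bit 0 changes the value by 1, while flipping bit 31 changes it by 2**31, a little over 2 billion. For the double, flipping the top exponent bit of 1.0 turns it into infinity.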

    2. Re:What files does a single bit error destroy? by Rockoon · · Score: 5, Insightful

      Most modern compression formats will not tolerate any errors. With LZ a single bit error could propagate over a long expanse of the uncompressed output, while with Arithmetic encoding the remainder of the file following the single bit error will be completely unrecoverable.

      Pretty much only the prefix-code style compression schemes (Huffman for one) will isolate errors to short segments, and then only if the compressor is not of the adaptive variety.
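
      A rough way to try this at home is to flip one bit in plain data and one bit in a compressed copy of the same data. The sketch uses Python's zlib module (DEFLATE, i.e. LZ77 plus Huffman coding) as a stand-in for "LZ"; the sample text and the choice of which bit to flip are arbitrary assumptions.

```python
import zlib


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit flipped."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def try_decompress(blob: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


# Compressible but non-repeating sample data.
text = " ".join(str(n) for n in range(5000)).encode()
packed = zlib.compress(text)

# One flipped bit in the raw text damages exactly one byte.
damaged_raw = flip_bit(text, len(text) // 2)
differing = sum(a != b for a, b in zip(text, damaged_raw))
print("raw bytes damaged:", differing)

# One flipped bit in the middle of the compressed stream.
recovered = try_decompress(flip_bit(packed, len(packed) // 2))
print("compressed copy usable:", recovered == text)
```

      With zlib the damaged stream will typically be rejected outright, because the format carries an Adler-32 checksum of the uncompressed data; a format without such a check would hand back garbage from the point of the error onwards instead.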

      --
      "His name was James Damore."
    3. Re:What files does a single bit error destroy? by gzipped_tar · · Score: 2, Funny

      Perhaps that is what the poster meant by "bad spot". If "Hitler" were altered to read as "Hatler", I'm pretty sure the meaning would still be clear from the context.

      Godwin.

      --
      Colorless green Cthulhu waits dreaming furiously.
  6. Easy... by realsilly · · Score: 2, Funny

    Don't save anything.

    --
    Life takes interesting turns, but the most interest is when you're off the beaten path.
  7. What about the "block errors"? by MathFox · · Score: 4, Informative

    Most of the storage media in common use (disks, tapes, CD/DVD-R) already use ECC at sector or block level and will fix "single bit" errors at firmware level transparently. What is more of an issue at application level are "missing block" errors: when the low-level ECC fails, the storage device signals "unreadable sector" and one or more blocks of data are lost.

    Of course this can be fixed by "block redundancy" (like RAID does), "block recovery checksums" or old-fashioned backups.
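
    Block redundancy in its simplest form is XOR parity across equal-sized blocks, which is the idea behind single-parity RAID levels. The sketch below is a toy under that assumption (four data blocks, one parity block, made-up contents); tools such as PAR2 use Reed-Solomon codes instead, which can survive more than one lost block.

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


# Four data blocks plus one parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data_blocks)

# The device reports block 2 as unreadable: rebuild it from the survivors.
lost_index = 2
survivors = [blk for i, blk in enumerate(data_blocks) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data_blocks[lost_index])
```

    This only works because the device says which block is missing; one parity block cannot locate or repair silent corruption on its own.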

    --
    extern warranty;
    main()
    {
    (void)warranty;
    }
    1. Re:What about the "block errors"? by tepples · · Score: 3, Informative

      anyone know of the equivalent RAID model for things like tape?

      Four tapes data, one tape PAR2.

  8. About time by trydk · · Score: 3, Interesting

    It is about time that somebody (hopefully some of the commercial vendors AND the open source community too) gets wise to the problems of digital storage.

    I always create files with unique headers and consistent version numbering to allow for minor as well as major file format changes. For storage/exchange purposes, I make the format expandable, where each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.
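
    A format along those lines might look something like the sketch below. None of the details come from the comment itself: the DEMO signature, the one-byte type and two-byte length header, the field numbers, and the function names are all invented for illustration, and the resynchronisation step assumes payloads never contain NULL bytes.

```python
import struct

MAGIC = b"DEMO"                         # hypothetical file signature
TERMINATOR = b"\x00\x00"                # two NULL bytes end every field
KNOWN_TYPES = {1: "title", 2: "body"}   # field types this reader understands
READER_MAJOR = 2                        # this program reads 2.x files only


def write_file(major, minor, fields):
    """Serialise (type, payload) pairs behind a small versioned header."""
    out = bytearray(MAGIC + bytes([major, minor]))
    for ftype, payload in fields:
        out += struct.pack(">BH", ftype, len(payload)) + payload + TERMINATOR
    return bytes(out)


def read_file(blob):
    """Read known fields, skip unknown ones, resync after a damaged header."""
    if blob[:4] != MAGIC:
        raise ValueError("not a DEMO file")
    major = blob[4]                      # blob[5] is the minor version, not needed here
    if major != READER_MAJOR:
        raise ValueError(f"unsupported major version {major}")
    pos, result = 6, {}
    while pos + 3 <= len(blob):
        ftype, length = struct.unpack_from(">BH", blob, pos)
        end = pos + 3 + length
        if blob[end:end + 2] == TERMINATOR:
            if ftype in KNOWN_TYPES:     # fields from newer minor versions are ignored
                result[KNOWN_TYPES[ftype]] = blob[pos + 3:end]
            pos = end + 2
        else:
            # The header looks damaged: skip ahead to the next terminator.
            nxt = blob.find(TERMINATOR, pos)
            if nxt < 0:
                break
            pos = nxt + 2
    return result
```

    With this layout a 2.x reader returns the title and body of a 2.34 file and silently drops any field type it does not know, while a 3.x file is refused outright. A damaged length field costs only the field it belongs to.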

    I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul ... I will have to ponder that. Maybe not, my programs seem too ephemeral for that ... Then again, so did people think about their 1960s COBOL programs.

  9. Lossy by FlyingBishop · · Score: 2, Insightful

    The participant asked why digital file formats (jpg, mpeg-3, mpeg-4, jpeg2000, and so on) can't allow the same degradation and remain viewable.

    Because all of those are compressed, and take up a tiny fraction of the space that a faithful digital recording of the information on a film reel would take up. If you want lossless-level data integrity, use lossless formats for your masters.

  10. Do not compress! by irp · · Score: 2, Interesting

    ... Efficiency is the enemy of redundancy!

    Old documents saved in 'almost like ascii' are still 'readable'. I once salvaged a document from some obscure ancient word processor by opening it in a text editor. I also found some "images" (more like icons) on the same disk (a copy of a floppy); even these I could "read" (by changing the page width of my text editor to fit the width of the uncompressed image).

    As long as the storage space keeps growing...

  11. Very, very old news.... by gweihir · · Score: 2, Informative

    It has been done like that for decades. Look at what archival tape does or DVDisaster or modern HDDs.

    Also, this does not solve the problem, it just defers it. Why is this news?

    --
    Most ACs are not even worth the keystrokes to insult them. Be generically insulted by this and ignored otherwise.
  12. Also, Bittorrent by NoPantsJim · · Score: 4, Informative

    I remember reading a story of a guy who had to download a file from Apple that was over 4 gigabytes, and had to attempt it several times because each came back corrupted due to some problem with his internet connection. Eventually, he gave up and found the file on BitTorrent, but realized that if he saved it in the same location as the corrupted file, the client would check the file and then overwrite the bad parts with the correct information. He was able to fix it in under an hour using BitTorrent rather than trying to re-download the file while crossing his fingers and praying for no corruption.
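
    The reason this works is that a torrent carries a SHA-1 hash for every fixed-size piece of the file, so a client can check existing data piece by piece and fetch only what fails. A minimal sketch of that check, with a deliberately tiny piece size and made-up function names (a real client reads the hashes from the .torrent metadata and fetches pieces from peers):

```python
import hashlib

PIECE_SIZE = 4  # real torrents use pieces of hundreds of KiB or more


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE):
    """SHA-1 digest of each fixed-size piece of the data."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def bad_pieces(local: bytes, expected, piece_size: int = PIECE_SIZE):
    """Indices of pieces whose hash does not match the expected list."""
    return [i for i, (got, want)
            in enumerate(zip(piece_hashes(local, piece_size), expected))
            if got != want]


def repair(local: bytes, source: bytes, expected, piece_size: int = PIECE_SIZE):
    """Replace only the failing pieces with data from a good source."""
    fixed = bytearray(local)
    for i in bad_pieces(local, expected, piece_size):
        start = i * piece_size
        fixed[start:start + piece_size] = source[start:start + piece_size]
    return bytes(fixed)


original = b"a large download from a flaky connection"
expected = piece_hashes(original)

corrupted = bytearray(original)
corrupted[9] ^= 0xFF                      # damage one byte in piece 2
print(bad_pieces(bytes(corrupted), expected))
print(repair(bytes(corrupted), original, expected) == original)
```

    Only piece 2 fails the check here, so only those four bytes need to be fetched again.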

    I know it's not a perfect example, but just one way of looking at it.

  13. Parchive: Parity Archive Volume Set by khundeck · · Score: 5, Interesting

    Parchive: Parity Archive Volume Set

    It basically allows you to create an archive that's selectively larger, but contains an amount of parity such that you can have XX% corruption and still 'unzip.'

    "The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]

    KPH

  14. Solution: by Lord+Lode · · Score: 2, Insightful

    Just don't compress anything. If a bit corrupts in a non-compressed bitmap file or in a plain .txt file, no more than 1 pixel or letter is lost.

    1. Re:Solution: by igny · · Score: 2, Funny

      II ccaann ssuuggeesstt eevveenn bbeetteerr iiddeeaa.
      II ccaann ssuuggeesstt eevveenn bbeetteerr iiddeeaa.

      --
      In theory there is no difference between theory and practice. In practice there is. - Yogi Berra
  15. Film and digital by CXI · · Score: 2, Interesting

    Ten years ago my old company used to advocate that individuals who wanted to convert paper to digital first put the documents on microfilm and then scan them. That way, when their digital media got damaged or lost, they could always recreate it. Film lasts for a long, long time when stored correctly. Unfortunately that still seems to be the best advice, at least if you are starting from an analog original.

  16. Cloud computing provides an opportunity by davide+marney · · Score: 3, Funny

    As we're on the cusp of moving much of our data to the cloud, we've got the perfect opportunity to improve the resilience of information storage for a lot of people at the same time.

    --
    "We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday
  17. Linearity is the real problem by designlabz · · Score: 2, Insightful

    The problem is not in error correction, but in the linearity of the data. Using only 256 pixels you could represent an image the brain can interpret. The problem is that the brain cannot interpret an image from the first 256 pixels of a file, as that would probably be a line half as long as the image width, consisting of mostly irrelevant data.
    If I wanted to make a fail-proof image, I would split it into squares of, say, 9 (3x3) pixels, and then put only the central pixel (every 5th px) values in the byte stream. Once that is done, repeat that for the surrounding pixels in each block. That way, even if part of the data is lost, the program would have at least one of the pixels in a 3x3 block and could use one of the nearby pixels as a substitute, leaving it up to the person to try and figure out the data. You could repeat the subdivision once again, achieving a pseudo-random order of bytes.
    And this is just a mock-up of what could be done to improve data safety in images without increasing the actual file size.
    In the old days of the internet, designers used lower-resolution images to reduce page loading time, and then gradually exchanged them for higher-res versions once those loaded. If it made sense to do it then, maybe we could now use integrated preview images to represent the average of a sector of pixels in the image, and then reverse-calculate missing ones using the pixels we have.
    This could also work for audio files, and maybe even archives. I know I could still read a book even if every fifth letter was replaced by an incorrect one.
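
    The reordering could be sketched like this. It is only one reading of the idea: the image is assumed to be a list of rows whose width and height are multiples of 3, the "lost data" is modelled as a truncated stream, and the names stream_order, encode and decode are invented for the example.

```python
BLOCK = 3
# Offsets inside a 3x3 block, centre pixel first, then the eight neighbours.
OFFSETS = [(1, 1)] + [(dy, dx) for dy in range(BLOCK) for dx in range(BLOCK)
                      if (dy, dx) != (1, 1)]


def stream_order(width, height):
    """Yield (y, x) in stream order: all block centres first, then the rest."""
    for dy, dx in OFFSETS:
        for by in range(0, height, BLOCK):
            for bx in range(0, width, BLOCK):
                yield by + dy, bx + dx


def encode(image):
    """Flatten an image (list of rows) into the interleaved pixel stream."""
    height, width = len(image), len(image[0])
    return [image[y][x] for y, x in stream_order(width, height)]


def decode(stream, width, height):
    """Rebuild the image; missing pixels borrow their block's centre pixel."""
    image = [[None] * width for _ in range(height)]
    for value, (y, x) in zip(stream, stream_order(width, height)):
        image[y][x] = value
    for y in range(height):
        for x in range(width):
            if image[y][x] is None:
                image[y][x] = image[y - y % BLOCK + 1][x - x % BLOCK + 1]
    return image
```

    If only the first ninth of the stream survives, every 3x3 block still has its centre pixel, so the decoder produces a blocky, low-resolution version of the whole picture rather than a sharp top strip and nothing else.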

    Cheers,
    DLabz

  18. Re:Incorrect... by ledow · · Score: 2, Insightful

    Because no-one yet has ever managed to pull things from this theoretical "historical" layer without at least something like an electron microscope costing tens or hundreds of thousands, thousands of hours of skilled *manual* work and having to crack the damn hard drive open and destroy it (if at all)? I believe there is still a challenge going around with a hard drive that was "zeroed" quite simply, and if anyone can recover the password in the single file that was on it before it was zeroed, they can get a few thousand dollars - nobody has even done more than look at it yet. (It certainly can't be done by software alone - are you thinking of unzeroed filesystem residue that has nothing to do with hardware at all?)

    In theory you might think you were right, but digital has nothing to do with historical layering (and it is doubtful whether that exists in a practical sense that can be utilised)... it's the method of recording - 1 or 0 or more possible patterns? Hard drives might store by majority, but they do it for a reason - because a single bit is *useless* on such a fine recording medium because it *can* change over time or just by slight inaccuracies in the recording/reading methods, so you have to swipe a whole bunch of the disk to be assured of reading back a 1 or 0 with your reader (which could never read more than the consensus of 1 or 0 because it's just not that accurate - it has to have a large bunch of magnetised particles to make any reading at all, it doesn't read each individually and then think "Oh, that's enough to be a 1" - when it reads it back, only a certain amount "trigger" it to think the thing is a 0 or 1 - thus it *IS* digital because the only answer it can give is 0 or 1 and not "well, almost a 1").

    And if manufacturers thought for a second any of that was do-able in even enterprise drives, it would be done already and sold to the highest bidder. The fact is that it just isn't feasible or even possible - it's almost impossible to do that in a device small enough to fit in your car, or reliably, or without totally destroying the operation or performance of a drive, or for less than the price of a large rack full of storage.

  19. Re:Use TCP/IP by nomadic · · Score: 2, Funny

    You might be able to find some suggestions on how to fix that on Gopher.

  20. Re:Ecc? by TheThiefMaster · · Score: 2, Informative

    Asking for a definition of ecc turns it up, so it's obviously not that uncommon. And as we're talking about data corruption, it's the obvious one.

    Most IT techs would recognise the term from "ECC RAM", which is RAM that is capable of correcting bit errors and is often required by server motherboards.
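
    For anyone wondering what "correcting bit errors" looks like in practice, here is a toy Hamming(7,4) code: 4 data bits are stored as 7 bits, and any single flipped bit can be located and fixed. Real ECC memory generally uses wider codes over 64-bit words, so treat this purely as an illustration; the function names are made up.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4      # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4      # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, position):
    """Return a copy of the codeword with one bit (0-based index) flipped."""
    damaged = list(codeword)
    damaged[position] ^= 1
    return damaged


word = [1, 0, 1, 1]
stored = encode(word)
print(decode(flip(stored, 5)) == word)
```

    The price is 3 extra bits for every 4 bits of data, and two flipped bits in the same codeword will be "corrected" to the wrong value, which is why practical schemes add a further parity bit to at least detect double errors.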