Slashdot Mirror


One Way To Save Digital Archives From File Corruption

storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."

20 of 257 comments

  1. To much reinvention by DarkOx · · Score: 5, Interesting

    If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will have to be busy slogging through error correction data in their files rather than the data they actually wanted to persist for their application. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem and perhaps some at the storage layer.
        Also, if we can present the same logical file to the application when it is read, even if every 9th byte is parity on the disk, that is a plus, because it means legacy apps can get the enhanced protection as well.
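
    A minimal sketch of the "every 9th byte is parity" idea, assuming a plain XOR parity byte per group of 8 data bytes. The names encode, decode and parity_byte are hypothetical, not from any real filesystem. Note that this scheme only detects a damaged group; correcting it would need a stronger code.

```python
def parity_byte(chunk: bytes) -> int:
    """XOR of all bytes in the chunk (a simple parity check)."""
    p = 0
    for b in chunk:
        p ^= b
    return p


def encode(data: bytes, group: int = 8) -> bytes:
    """On-disk layout: each group of up to 8 data bytes, then 1 parity byte."""
    out = bytearray()
    for i in range(0, len(data), group):
        chunk = data[i:i + group]
        out += chunk
        out.append(parity_byte(chunk))
    return bytes(out)


def decode(stored: bytes, group: int = 8):
    """Return (logical data, indices of groups whose parity check fails)."""
    data = bytearray()
    bad_groups = []
    for n, i in enumerate(range(0, len(stored), group + 1)):
        block = stored[i:i + group + 1]
        chunk, stored_parity = block[:-1], block[-1]
        if parity_byte(chunk) != stored_parity:
            bad_groups.append(n)
        data += chunk
    return bytes(data), bad_groups
```

    The application only ever sees what decode returns, so a legacy program reading the logical file would not need to know the parity bytes exist.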

    --
    Repeal the 17th Amendment TODAY! Also Please Read http://www.gnu.org/philosophy/right-to-read.html
    1. Re:To much reinvention by paradxum · · Score: 5, Insightful

      It already exists; it's called ZFS, on Solaris boxen. Each block is checksummed, it can repair itself on read when a redundant copy exists, and it can generally indicate a failing disk. This truly is the filesystem every other one is playing catch-up with.

    2. Re:To much reinvention by MrNaz · · Score: 5, Insightful

      Ahem. RAID anyone? ZFS? Btrfs? Hello?

      Isn't this what filesystem devs have been concentrating on for about 5 years now?

      --
      I hate printers.
    3. Re:To much reinvention by Whalou · · Score: 3, Funny

      ReiserFS is good for that also. If you make a deal with the 'file system' it will tell you where your 'file' is hidden.

      --
      English is not this .sig mother tongue...
    4. Re:To much reinvention by bertok · · Score: 3, Insightful

      If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will have to be busy slogging through error correction data in their files rather than the data they actually wanted to persist for their application. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem and perhaps some at the storage layer.

          Also, if we can present the same logical file to the application when it is read, even if every 9th byte is parity on the disk, that is a plus, because it means legacy apps can get the enhanced protection as well.

      Precisely. This is what things like torrents, RAR files with recovery blocks, and filesystems like ZFS are for: so every app developer doesn't have to roll their own, badly.

    5. Re:To much reinvention by An+dochasac · · Score: 3, Informative

      "says massive digital archives are threatened by simple bit errors that can render whole files useless.
      Isn't this what filesystem devs have been concentrating on for about 5 years now?

      Not just 5 years. ZFS's checksum on every data block and RAID-Z (no RAID write hole) are innovative and obviously the next step in filesystem evolution. But attempts at redundancy aren't new. I'm surprised the article is discussing relatively low-tech, old-hat ideas such as two filesystem headers. Even DOS's FAT used this RAID 1 type of brute-force redundancy by keeping two copies of the FAT. The Commodore Amiga's filesystem did this better than Microsoft back in 1985 by having forward and backward links in every block, which made it possible to repair block pointer damage by searching for a reference to the bad block in the preceding and following blocks.
      And I suppose if ZFS doesn't catch on, 25 or 30 years from now Apple or Microsoft will finally come up with it and say, "Hey look what we invented!"

    6. Re:To much reinvention by Hatta · · Score: 5, Interesting

      Don't forget PAR2. I never burn a DVD without 10%-20% redundancy as par2 files. Even if the filesystem gets too damaged to read, I can usually dd the whole disk and let par2 recover the files.

      --
      Give me Classic Slashdot or give me death!
  2. par files by ionix5891 · · Score: 5, Informative

    include par2 files

  3. It's that computer called the brain. by commodore64_love · · Score: 5, Interesting

    >>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"

    The ear-eye-brain connection has ~500 million years of development, and has learned the ability to filter out noise. If, for example, I'm listening to a radio, the hiss is mentally filtered out, or if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast, when a computer encounters noise or errors, it panics and says, "I give up," and the digital radio or digital television goes blank.

    What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then take over and mentally remove the noise from the audio or image.

    --
    "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    1. Re:It's that computer called the brain. by commodore64_love · · Score: 4, Interesting

      P.S.

      When I was looking for a digital-to-analog converter for my TV, I returned all the ones that displayed blank screens when the signal became weak. The one I eventually chose (x5) was the Channel Master unit. When the signal is weak it continues displaying a noisy image rather than going blank, or it reverts to "audio only" mode rather than going silent. It lets me continue watching programs rather than be completely cut off.

      --
      "I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
    2. Re:It's that computer called the brain. by Phreakiture · · Score: 4, Interesting

      What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then take over and mentally remove the noise from the audio or image.

      Audio CDs have always done this. Audio CDs are also uncompressed*.

      The problem, I suspect, is that we have come to rely on a lot of data compression, particularly where video is concerned. I'm not saying this is the wrong choice, necessarily, because video can become ungodly huge without it (NTSC SD video -- 720 x 480 x 29.97 -- in the 4:2:2 colour space, 8 bits per pixel per plane, will consume 69.5 GiB an hour without compression), but maybe we didn't give enough thought to stream corruption.
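
      The 69.5 GiB figure can be checked with a few lines of arithmetic. This is only a worked calculation, assuming 2 bytes per pixel on average: in 8-bit 4:2:2 every pixel has its own luma sample and each pair of pixels shares one Cb and one Cr sample.

```python
WIDTH, HEIGHT = 720, 480      # NTSC SD frame size quoted above
FPS = 29.97                   # frames per second
BYTES_PER_PIXEL = 2           # 8-bit 4:2:2: 1 luma byte + 1 shared chroma byte per pixel on average

bytes_per_second = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS
bytes_per_hour = bytes_per_second * 3600
gib_per_hour = bytes_per_hour / 2**30

print(f"{gib_per_hour:.1f} GiB per hour")  # prints "69.5 GiB per hour"
```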

      Mini DV video tape, when run in SD, uses no compression on the audio, and the video is only lightly compressed, using a DCT-based codec with no delta coding. In practical terms, this means that one corrupted frame of video doesn't cascade into future frames. If my camcorder gets a wrinkle in the tape, it will affect the frames recorded on the wrinkle, and no others. It also makes a best-guess effort to reconstruct the frame. This task may not be impossible with denser codecs that do use delta coding and motion compensation (MPEG, DivX, etc.), but it is certainly made far more difficult.

      Incidentally, even digital cinemas are using compression. It is a no-delta compression, but the individual frames are compressed in a manner akin to JPEGs, and the audio is compressed either using DTS or AC3 or one of their variants in most cinemas. The difference, of course, is that the cinemas must provide a good presentation. If they fail to do so, people will stop coming. If the presentation isn't better than watching TV/DVD/BluRay at home, then why pay the $11?

      (* I refer here to data compression, not dynamic range compression. Dynamic range compression is applied way too much in most audio media)

      --
      www.wavefront-av.com
  4. Sun Microsystems..... zfs..... by HKcastaway · · Score: 3, Insightful

    ZFS.

    Next topic....

  5. What about the "block errors"? by MathFox · · Score: 4, Informative
    Most of the storage media in common use (disks, tapes, CD/DVD-R) already use ECC at the sector or block level and will fix "single bit" errors transparently at the firmware level. What is more of an issue at the application level are "missing block" errors: when the low-level ECC fails, the storage device signals "unreadable sector" and one or more blocks of data are lost.

    Of course this can be fixed by "block redundancy" (like RAID does), "block recovery checksums" or old-fashioned backups.
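
    To illustrate how an error-correcting code can fix a single flipped bit transparently, here is a minimal sketch using the textbook Hamming(7,4) code. This is an assumption chosen for brevity: real drive firmware uses much stronger codes over whole sectors, and the function names are hypothetical.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # parity over positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # parity over positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # parity over positions 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_decode(code: list) -> int:
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4  # 0 means clean, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1
    return c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
```

    Two flipped bits in the same codeword would be miscorrected, which is the small-scale version of the "unreadable sector" case: once the damage exceeds what the code can handle, only redundancy at a higher level helps.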

    --
    extern warranty;
    main()
    {
    (void)warranty;
    }
    1. Re:What about the "block errors"? by tepples · · Score: 3, Informative

      anyone know of the equivalent RAID model for things like tape?

      Four tapes data, one tape PAR2.
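
      A minimal sketch of the four-plus-one idea, assuming plain XOR parity as in RAID 4/5. PAR2 itself uses Reed-Solomon coding, which can tolerate more than one lost block, so this is a simplification; the tape contents and names are made up for illustration.

```python
def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


# Four "data tapes" of equal length and one parity tape computed from them.
data_tapes = [b"tape-one", b"tape-two", b"tape-3!!", b"tape-4??"]
parity_tape = xor_blocks(data_tapes)

# Suppose the tape at index 2 becomes unreadable: XOR of the three
# survivors with the parity tape reproduces the missing contents.
survivors = [t for i, t in enumerate(data_tapes) if i != 2]
rebuilt = xor_blocks(survivors + [parity_tape])
```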

  6. Re:What files does a single bit error destroy? by Jane+Q.+Public · · Score: 3, Informative

    That's complete nonsense. Just for one example, if the bit is part of a numeric value, depending on where that bit is, it could make the number off anywhere from 1 to 2 BILLION or even a lot more, depending on the kind of representation being used.
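
    A short sketch of the point, assuming a 32-bit unsigned integer and a 32-bit IEEE 754 float as the representations. The helper names are hypothetical.

```python
import struct


def flip_int_bit(value: int, bit: int) -> int:
    """Flip one bit of an unsigned 32-bit integer."""
    return (value ^ (1 << bit)) & 0xFFFFFFFF


def flip_float_bit(x: float, bit: int) -> float:
    """Flip one bit in the 32-bit IEEE 754 encoding of x."""
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    (damaged,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return damaged


original = 1000
low = flip_int_bit(original, 0)     # lowest bit: value is off by 1
high = flip_int_bit(original, 31)   # highest bit: value is off by 2**31

# Flipping the top exponent bit of 1.0 turns it into infinity.
blown_up = flip_float_bit(1.0, 30)
```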

  7. About time by trydk · · Score: 3, Interesting

    It is about time that somebody (hopefully some of the commercial vendors AND the open source community too) gets wise to the problems of digital storage.

    I always create files with unique headers and consistent version numbering to allow for minor as well as major file format changes. For storage/exchange purposes, I make the format expandable, where each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.

    I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul ... I will have to ponder that. Maybe not, my programs seem too ephemeral for that ... Then again, that is what people thought about their 1960s COBOL programs.
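
    A minimal sketch of a field layout like the one described, assuming a 1-byte type, a 2-byte big-endian length and a two-NUL terminator. These sizes and the names write_field and read_fields are assumptions, not the poster's actual format, and the file header and version number are left out. The sketch also assumes payloads never contain the marker and that field types are nonzero.

```python
import struct

MARKER = b"\x00\x00"  # assumed end-of-field marker (two NUL bytes)


def write_field(ftype: int, payload: bytes) -> bytes:
    """Serialize one field: type, length, payload, terminating marker."""
    return struct.pack(">BH", ftype, len(payload)) + payload + MARKER


def read_fields(buf: bytes, known: set) -> list:
    """Parse fields, skipping unknown types and resyncing after damage."""
    fields, pos = [], 0
    while pos + 3 <= len(buf):
        ftype, length = struct.unpack_from(">BH", buf, pos)
        end = pos + 3 + length
        if buf[end:end + 2] == MARKER:
            # Header is plausible; an old reader just ignores unknown types.
            if ftype in known:
                fields.append((ftype, buf[pos + 3:end]))
            pos = end + 2
        else:
            # Header looks damaged: skip ahead to just past the next marker.
            nxt = buf.find(MARKER, pos)
            if nxt == -1:
                break
            pos = nxt + 2
    return fields
```

    With a corrupted length in one field header, the reader loses that field but picks up again at the following one, which is the resynchronisation behaviour described above.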

  8. Also, Bittorrent by NoPantsJim · · Score: 4, Informative

    I remember reading a story of a guy who had to download a file from Apple that was over 4 gigabytes, and had to attempt it several times because each attempt came back corrupted due to some problem with his internet connection. Eventually, he gave up and found the file on BitTorrent, but realized that if he saved it in the same location as the corrupted file, the client would check the file and then overwrite the bad parts with the correct data. He was able to fix it in under an hour using BitTorrent rather than trying to re-download the file while crossing his fingers and praying for no corruption.

    I know it's not a perfect example, but just one way of looking at it.
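
    The trick works because the file is checked piece by piece against a list of hashes (SHA-1 per piece in the original BitTorrent protocol), so only the damaged pieces need fetching again. A minimal sketch, assuming a tiny piece size and an in-memory "good source" standing in for the swarm; the function names are hypothetical.

```python
import hashlib

PIECE_SIZE = 4  # tiny for demonstration; real torrents use far larger pieces


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list:
    """SHA-1 digest of each fixed-size piece of the data."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]


def bad_pieces(local: bytes, expected: list, piece_size: int = PIECE_SIZE) -> list:
    """Indices of local pieces whose hash does not match the expected list."""
    local_hashes = piece_hashes(local, piece_size)
    return [i for i, (h, e) in enumerate(zip(local_hashes, expected)) if h != e]


def repair(local: bytes, good_source: bytes, expected: list,
           piece_size: int = PIECE_SIZE) -> bytes:
    """Replace only the failing pieces with data from a good copy."""
    fixed = bytearray(local)
    for i in bad_pieces(local, expected, piece_size):
        start = i * piece_size
        fixed[start:start + piece_size] = good_source[start:start + piece_size]
    return bytes(fixed)
```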

  9. Re:What files does a single bit error destroy? by Rockoon · · Score: 5, Insightful

    Most modern compression formats will not tolerate any errors. With LZ, a single bit error could propagate over a long expanse of the uncompressed output, while with arithmetic coding the remainder of the file following the single bit error will be completely unrecoverable.

    Pretty much only the prefix-code style compression schemes (Huffman for one) will isolate errors to short segments, and then only if the compressor is not of the adaptive variety.
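
    A small sketch of the difference, assuming Python's standard zlib module (DEFLATE, i.e. LZ77 plus Huffman coding, with a checksum) as the compressor. A one-bit flip in the raw text damages one byte; the same flip in the compressed stream makes the whole thing fail to decompress cleanly.

```python
import zlib


def flip_bit(blob: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of blob with a single bit flipped."""
    damaged = bytearray(blob)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def survives(original: bytes, stream: bytes) -> bool:
    """True only if the stream still decompresses to the original data."""
    try:
        return zlib.decompress(stream) == original
    except zlib.error:
        return False


text = b"the quick brown fox jumps over the lazy dog. " * 50
packed = zlib.compress(text)

# One flipped bit in the uncompressed data: exactly one byte differs.
damaged_text = flip_bit(text, 100)
differing = sum(a != b for a, b in zip(text, damaged_text))

# One flipped bit in the middle of the compressed stream.
damaged_packed = flip_bit(packed, len(packed) // 2)
```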

    --
    "His name was James Damore."
  10. Parchive: Parity Archive Volume Set by khundeck · · Score: 5, Interesting

    Parchive: Parity Archive Volume Set

    It basically allows you to create an archive that's selectively larger, but contains an amount of parity such that you can have XX% corruption and still 'unzip.'

    "The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]

    KPH

  11. Cloud computing provides an opportunity by davide+marney · · Score: 3, Funny

    As we're on the cusp of moving much of our data to the cloud, we've got the perfect opportunity to improve the resilience of information storage for a lot of people at the same time.

    --
    "We receive as friendly that which agrees with, we resist with dislike that which opposes us" - Faraday