Slashdot Mirror


The Many Paths To Data Corruption

Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. Much of the information applies to all operating systems. He prefaces his comments by noting that the details are entirely device-specific, then dives right into a fascinating and somewhat disturbing walk-through tracing data from the drive, through the cable, into the bus, main memory, and CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"

6 of 121 comments

  1. End-to-end by Intron · · Score: 4, Informative

    Some enterprise server systems use end-to-end protection, meaning the data block is longer. If you write 512 bytes of data + 12 bytes or so of check data and carry that through all of the layers, it can prevent the data corruption from going undiscovered. The check data usually includes the block's address, so that data written with correct CRC but in the wrong place will also be discovered. It is bad enough to have data corrupted by a hardware failure, much worse not to detect it.
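A minimal sketch of the idea in the comment above, assuming a made-up layout of 512 data bytes followed by an 8-byte block address and a 4-byte CRC-32 (12 bytes of check data). The function names and the layout are illustrative only and do not correspond to any particular vendor's format.

```python
import struct
import zlib

BLOCK_SIZE = 512  # data bytes per block, as in the comment


def protect(lba: int, data: bytes) -> bytes:
    """Append 12 bytes of check data: 8-byte block address + 4-byte CRC-32."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("expected a 512-byte block")
    addr = struct.pack(">Q", lba)
    # The CRC covers the data and the address, so neither can change unnoticed.
    crc = zlib.crc32(data + addr)
    return data + addr + struct.pack(">I", crc)


def verify(expected_lba: int, stored: bytes) -> bytes:
    """Check a 524-byte protected block read back from address expected_lba."""
    data = stored[:BLOCK_SIZE]
    addr = stored[BLOCK_SIZE:BLOCK_SIZE + 8]
    (crc,) = struct.unpack(">I", stored[BLOCK_SIZE + 8:BLOCK_SIZE + 12])
    if zlib.crc32(data + addr) != crc:
        raise ValueError("CRC mismatch: block was corrupted")
    (lba,) = struct.unpack(">Q", addr)
    if lba != expected_lba:
        # The CRC is valid, but the block belongs somewhere else.
        raise ValueError("address mismatch: block was written to the wrong place")
    return data
```

The second check is the point of carrying the address: a block with a perfectly valid CRC still fails verification if it turns up at an address other than the one it was written for.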

    --
    Intron: the portion of DNA which expresses nothing useful.
  2. Hello ZFS by Wesley+Felter · · Score: 4, Informative

    ZFS's end-to-end checksums detect many of these types of corruption; as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data without being detected.

    I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
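A rough sketch of the checksum-on-read idea, assuming a toy in-memory "disk". The `BlockPointer` and `ToyDisk` names are hypothetical and this is not ZFS's on-disk format; it only illustrates keeping the checksum with the pointer to a block rather than next to the block's own data, so that damage to the data cannot also damage its checksum.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockPointer:
    address: int
    checksum: bytes  # stored with the pointer, not with the data it describes


class ToyDisk:
    """Hypothetical in-memory block store used only for illustration."""

    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}

    def write(self, address: int, data: bytes) -> BlockPointer:
        self.blocks[address] = data
        return BlockPointer(address, hashlib.sha256(data).digest())

    def read(self, ptr: BlockPointer) -> bytes:
        data = self.blocks[ptr.address]
        # Verify on every read; anything that changed the bytes after the
        # checksum was computed shows up here.
        if hashlib.sha256(data).digest() != ptr.checksum:
            raise ValueError(f"checksum mismatch at block {ptr.address}")
        return data
```

Because the check happens in software after the data has crossed the drive, cable, and bus, it covers all of those layers at once. It cannot help if the RAM or CPU computing the checksum is itself faulty, which is the caveat the comment makes.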

    1. Re:Hello ZFS by harrkev · · Score: 3, Informative

      I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
      Not gonna happen. The problem is that ECC memory costs more, simply because there is 12.5% more memory. Most people are going to go for as cheap as possible.

      But, ECC is available. If it is important to you, pay for it.
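The 12.5% figure corresponds to 8 check bits for every 64 data bits. As a back-of-the-envelope check, the sketch below assumes a Hamming-style single-error-correct, double-error-detect (SECDED) code, which the comment does not name; the helper function is illustrative only.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SECDED code over data_bits of data."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


check_bits = secded_check_bits(64)
overhead = check_bits / 64
print(check_bits, f"{overhead:.1%}")  # 8 12.5%
```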
      --
      "-1 Troll" is apparently the same as "-1 I disagree with you."
  3. Real-life proof of ZFS detecting problems by E-Lad · · Score: 3, Informative

    Give this blog entry a read:
    http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta

    And you'll understand :)

  4. Re:RAM = the weakest link by Anonymous Coward · · Score: 1, Informative

    Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too given the P965 has been designed for Core 2 Duo and Quad Core CPUs.

    You meant 975x, not 965x. The successor to the 975x is the X38 (Bearlake-X) chipset, which supports ECC DRAM. It should debut this month.

  5. Re:RAM = the weakest link by Anonymous Coward · · Score: 1, Informative

    Sad given that ECC logic is so simple it's basically FREE.

    What's worse? It IS free!
    Motherboard chips (e.g. south bridge, north bridge) are generally limited in size NOT by the transistors inside but by the number of IO connections. There's silicon to burn, so to speak, and therefore plenty of room to add features like this.

    How do I know this? Oh wait, my company made them.... We never had to worry about state-of-the-art process technology because it wasn't worth it. We could afford to be several generations behind for exactly this reason.