Slashdot Mirror


The Many Paths To Data Corruption

Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. This includes much of the information applicable to all operating systems. He prefaces his comments noting that the details are entirely device specific, then dives right into a fascinating and somewhat disturbing path tracing data from the drive, through the cable, into the bus, main memory and CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"

18 of 121 comments (clear)

  1. Keep your porn on separate physical drives! by eln · · Score: 5, Funny

    The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data. As we all know, deleting data alone is not sufficient, as that will only remove the pointer to the data while leaving the block containing it undisturbed. This allows a young piece of data to easily see the old porn data as it is being written to that block. For this reason, it is imperative that you keep all pornographic data on separate physical drives.

    In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.

    Data corruption is a serious problem in computing today, and it is imperative that we take steps to stop our young innocent data from being corrupted.

    1. Re:Keep your porn on separate physical drives! by king-manic · · Score: 3, Funny

      In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.

      indeed, young pornographic data is disturbing. Fortunately there is a legal social firewall of 18.

      --
      "There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."
  2. Paul Cylon by HTH+NE1 · · Score: 3, Funny

    There must be 50 ways to lose your data.

    --
    Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
  3. benchmarks by larien · · Score: 4, Insightful

    As Alan Cox alluded to, there are benchmarks for data transfers, web performance, etc, etc, etc, but none for data integrity, it's kind of assumed, even if it perhaps shouldn't be. It also reminds me of various cluster software which will happily crash a node rather than risk data corruption (Sun Cluster & Oracle RAC both do this). What do you [em]really[/em] want? Lightning fast performance, or the comfort of knowing that your data is intact & correct? For something like a rendering farm, you can probably tolerate a pixel or two being the wrong shade. If you're dealing with money, you want the data to be 100% correct, otherwise there's a world of hurt waiting to happen...

    1. Re:benchmarks by dgatwood · · Score: 5, Interesting

      I've concluded that nobody cares about data integrity. That's sad, I know, but I have yet to see product manufacturers sued into oblivion for building fundamentally defective devices, and that's really what it would take to improve things, IMHO.

      My favorite piece of hardware was a chip that was used in a bunch of 5-in-1 and 7-in-1 media card readers about four years ago. It was complete garbage, and only worked correctly on Windows. Mac OS X would use transfer sizes that the chip claimed to support, but the chip returned copies of block 0 instead of the first block in every transaction over a certain size. Linux supposedly also had problems with it. This was while reading, so no data was lost, but a lot of people who checked the "erase pictures after import" button in iPhoto were very unhappy.

      Unfortunately, there was nothing the computer could do to catch the problem, as the data was in fact copied in from the device exactly as it presented it, and no amount of verification could determine that there was a problem because it would consistently report the same wrong data.... Fortunately, there are unerase tools available for recovering photos from flash cards. Anyway, I made it a point to periodically look for people posting about that device on message boards and tell them how to work around it by imaging the entire flash card with dd bs=512 until they could buy a new flash card reader.

      In the end, I moved to a FireWire reader and I no longer trust USB for anything unless there's no other alternative (iPod, iPhone, and disks attached to an Airport Base Station). While that makes me somewhat more comfortable than dealing with USB, there have been a few nasty issues even with FireWire devices. For example, there was an Oxford 922 firmware bug about three years back that wiped hard drives if a read or write attempt was made after a spindown request timed out or something. I'm not sure about the precise details.

      And then, there is the Seagate hard drive that mysteriously will only boot my TiVo about one time out of every twenty (but works flawlessly when attached to a FW/ATA bridge chipset). I don't have an ATA bus analyzer to see what's going on, but it makes me very uncomfortable to see such compatibility problems with supposedly standardized modern drives. And don't get me started on the number of dead hard drives I have lying around....

      If my life has taught me anything about technology, it is this: if you really care about data, back it up regularly and frequently, store your backups in another city, ensure that those backups are never all simultaneously in the same place or on the same electrical grid as the original, and never throw away any of the old backups. If it isn't worth the effort to do that, then the data must not really be important.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

  4. End-to-end by Intron · · Score: 4, Informative

    Some enterprise server systems use end-to-end protection, meaning the data block is longer. If you write 512 bytes of data + 12 bytes or so of check data and carry that through all of the layers, it can prevent the data corruption from going undiscovered. The check data usually includes the block's address, so that data written with correct CRC but in the wrong place will also be discovered. It is bad enough to have data corrupted by a hardware failure, much worse not to detect it.

    --
    Intron: the portion of DNA which expresses nothing useful.
  5. Hello ZFS by Wesley+Felter · · Score: 4, Informative

    ZFS's end-to-end checksums detect many of these types of corruption; as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

    I am looking forward to the day when all RAM has ECC and all filesystems have checksums.

    1. Re:Hello ZFS by harrkev · · Score: 3, Informative

      I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
      Not gonna happen. The problem is that ECC memory costs more, simply because there is 12.5% more memory. Most people are going to go for as cheap as possible.

      But, ECC is available. If it is important to you, pay for it.
      --
      "-1 Troll" is the apparently the same as "-1 I disagree with you."
  6. I think this has happened to me by jdigital · · Score: 4, Interesting

    I think I suffered from a series of Type III errors (rtfa). After merging lots of poorly maintained backups of my /home file system I decided to write a little script to look for duplicate files (using file size as a first indicator, then md5 for ties). The script would identify duplicates and move files around into a more orderly structure based on type, etc. After doing this i noticed that a small number of my mp3's now contain chunks of other songs in them. My script was only working with whole files, so I have no idea how this happened. When I refer back to the original copies of the mp3s the files are uncorrupted.

    Of course, no one believes me. But maybe this presentation is on to something. Or perhaps I did something in a bonehead fashion totally unrelated.

    --
    :wq ~ ~ ~ ~ ~
  7. MySQL? by Jason+Earl · · Score: 4, Funny

    I was expecting an article on using MySQL in production.

  8. No by ElMiguel · · Score: 3, Interesting

    as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

    Sorry, but that is absurd. Nothing can absolutely protect against data errors (even if they only happen in the hard disk). For example, errors can corrupt ZFS data in a way that turns out to have the same checksum. Or errors can corrupt both the data and the checksum so they match each other.

    This is ECC 101 really.

    1. Re:No by Slashcrap · · Score: 5, Funny

      Or errors can corrupt both the data and the checksum so they match each other.

      This is about as likely as simultaneously winning every current national and regional lottery on the planet. And then doing it again next week.

      And if we're talking about a 512 bit hash then it's possible that a new planet full of lotteries will spontaneously emerge from the quantum vacuum. And you'll win all those too.

  9. Re:Hah by Cajun+Hell · · Score: 3, Insightful

    Sometimes I think we're lucky this stuff works at all.

    --
    "Believe me!" -- Donald Trump
  10. ...but in the market speed sells not correctness. by ozzee · · Score: 3, Interesting

    Ah - this is the bane of computer technology.

    One time I remember writing some code and it was very fast and almost always correct. The guy I was working with exclaimed "I can give you the wrong answer in zero seconds" and I shut up and did it the slower way that was right every time.

    This mentality of speed at the cost of correctness is prevalent, for example I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM. I have assembled many computers and unfortunately there have been a number of times where ECC memory was not an option. In almost every case where I have used ECC memory, the computer was noticably more stable. Case in point, the most recent machine that I built has never (as far as I know) crashed and I've thrown same really nasty workloads it's way. On the other hand, a couple of notebooks I have have crashed more often than I care to remember and there is no ECC option. Not to mention the ridicule I get for suggesting that people invest the extra $30 for a "non server" machine. Go figure. Suggesting that stability is the realm of "server" machines and infer end user machines should be relegated to a realm of lowered standards of reliability makes very little sense to me especially when the investment of $30 to $40 is absolutely minuscule if it prevents a single failure. What I think (see lawsuit coming on) is that memory manufacturers will sell quality marginal products to the non ECC crowd because there is no way of validating memory quality.

    I think there needs to be a significant change in the marketing of products to ensure that metrics of data integrity play a more significant role in decision making. It won't happen until the consumer demands it and I can't see that happening any time soon. Maybe, hopefully, I am wrong.

  11. what we have lost by cdrguru · · Score: 4, Insightful

    It amazes me how much has been lost over the years towards the "consumerization" of computers.

    Large mainframe systems have had data integrity problems solved for a long, long time. It is today unthinkable that any hardware issues or OS issues could corrupt data on IBM mainframe systems and operating systems.

    Personal computers, on the other hand, have none of the protections that have been present since the 1970s on mainframes. Yes, corruption can occur anywhere in the path from the CPU to the physical disk itself or during a read operation. There is no checking, period. And not only are failures unlikely to be quickly detected but they cannot be diagnosed to isolate the problem. All you can do is try throwing parts at the problem, replacing functional units like the disk drive or controller. These days, there is no separate controller - its on the motherboard - so your "functional unit" can almost be considered to be the computer.

    How often is data corrupted on a personal computer? It is clear it doesn't happen all that often, but in the last fourty years or so we have actually gone backwards in our ability to detect and diagnose such problems. Nearly all businesses today are using personal computers to at least display information if not actually maintain and process it. What assurance do you have that corruption is not taking place? None, really.

    A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit. In the right place, such changes could lead to penalties, interest and possible loss of a key customer.

    Why have we gone backwards in this area when compared to a mainframe system of fourty years ago? Certainly software has gotten more complex but basic issues of data integrity have fallen by the wayside. Much of this was done in hardware previously. It could be done cheaply in firmware and software today with minimal cost and minimal overhead. But it is not done.

    1. Re:what we have lost by suv4x4 · · Score: 3, Interesting

      Why have we gone backwards in this area when compared to a mainframe system of fourty years ago?

      For the same reason why experienced car drivers crash in ridiculous situations: they are too sure of themselves.

      The industry is so huge, that the separate areas of our computers just accept the rest is a magic box that should magically operate as is written in the spec. Screwups don't happen too often, and when they happen they are not detectable, hence no one woke up to it.

      That said don't feel bad, we're not going downwards. It just so happened speed and flashy graphics will play important role for another couple of years. Then after we max this out, the industry will seek to improve another parameter of their products, and sooner or later we'll hit back the data integrity issue :D

      Look at hard disks: does the casual consumer need more than 500 GB? So now we see the advent of faster hybrid (flash+disk) storage devices, or pure flash memory devices.

      So we've tackled storage size, we're about to tackle storage speed. And when it's fast enough, what's next, encryption and additional integrity checks. Something for the bullet list of features...

  12. RAM = the weakest link by DigiShaman · · Score: 3, Interesting

    It's well known that ECC and other forms of error correction are found at all levels of software and hardware. For example, hard drives have their own internal error correction while the file system it's formatted with may have another. Also worth mentioning, the CPU, serial busses, network adapters (both the physical IEEE 802.x connection and TCP/IP stack) and other forms of software error correction.

    Basically, the modern computer has various hardware and software layers of error correction stacked on top of each other if not at least by themselves.

    We do have weak link with desktops regarding RAM however. While modern workstations and server are generally installed with ECC RAM, our desktops are not. Also worth mentioning, most custom built clone PCs are for the desktop market. This has become a huge problem given the voltage and timing requirements don't leave much room for tolerance. The fact memory density has been going up only makes the chances for "bit flips" even worse. I can't tell you how many countless times I've ran into data corruption due to improper RAM settings. Running a few passes with Memtest 86+ will reveal this nasty issue. Hell, even Windows Vista now includes a utility to check for faulty RAM read/write issues that's how big the problem has become in the industry. As such, the desktop market severely needs to embrace ECC RAM like the server and workstation market. These days, to not use ECC is asking for trouble. And yes, you would take a 1 to 2% performance hit, but so what; Data integrity is more imporant.

    Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too given the P965 has been designed for Core 2 Due and Quad Core CPUs.

    --
    Life is not for the lazy.
  13. Real-life proof of ZFS detecting problems by E-Lad · · Score: 3, Informative

    Give this blog entry a read:
    http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta

    And you'll understand :)