Slashdot Mirror


The Many Paths To Data Corruption

Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. This includes much of the information applicable to all operating systems. He prefaces his comments noting that the details are entirely device specific, then dives right into a fascinating and somewhat disturbing path tracing data from the drive, through the cable, into the bus, main memory and CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"

121 comments

  1. Keep your porn on separate physical drives! by eln · · Score: 5, Funny

    The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data. As we all know, deleting data alone is not sufficient, as that will only remove the pointer to the data while leaving the block containing it undisturbed. This allows a young piece of data to easily see the old porn data as it is being written to that block. For this reason, it is imperative that you keep all pornographic data on separate physical drives.

    In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.

    Data corruption is a serious problem in computing today, and it is imperative that we take steps to stop our young innocent data from being corrupted.

    1. Re:Keep your porn on separate physical drives! by king-manic · · Score: 3, Funny

      In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.

      indeed, young pornographic data is disturbing. Fortunately there is a legal social firewall of 18.

      --
      "There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."
    2. Re:Keep your porn on separate physical drives! by Neanderthal+Ninny · · Score: 1

      Boy, your expert on this subject! I wonder if the FBI is watching you.

    3. Re:Keep your porn on separate physical drives! by Anonymous Coward · · Score: 1, Insightful

      Funny, but there's a bit of truth in it too. If data corruption happens in the filesystem, it can cause files to become interlinked or point to "erased" data, which might be a surprise that you don't want if you keep porn on the same harddisk as data which is going to be published.

    4. Re:Keep your porn on separate physical drives! by dwater · · Score: 1

      > Boy, your expert on this subject!

      Who's "Boy", how you know Boy is an expert, and what makes Boy the poster's?

      --
      Max.
    5. Re:Keep your porn on separate physical drives! by Anonymous Coward · · Score: 0

      Me, "Tarzan" --- Him, "Boy" --- That, "Jane"
      "Cheeta Bad! --- Cheeta, No make fun Boy"

      Ungowa,
      Tarzan

    6. Re:Keep your porn on separate physical drives! by rts008 · · Score: 1

      From 'eln:21727' "The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data."

      Unfortunately, as to your "Fortunately there is a legal social firewall of 18.", it depends on which block in which city you are cruising as to whether or not they may be at least/over 14, much less 18.

      At least that's what a traveling friend of mine told me....honest!

      --
      Down With Slashdot BETA!!! I've been around the corner and seen the oliphant; you can only abuse me from your perspecti
    7. Re:Keep your porn on separate physical drives! by legirons · · Score: 1

      "The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data."

      Whatever you do, don't dilute the data 200 times by zeroing 9/10 of the data each time, otherwise your drive will be full of porn ;)

  2. Paul Cylon by HTH+NE1 · · Score: 3, Funny

    There must be 50 ways to lose your data.

    --
    Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
    1. Re:Paul Cylon by Anonymous Coward · · Score: 0

      That's not redundant; you're just upset that he spelled "lose" correctly!

    2. Re:Paul Cylon by HTH+NE1 · · Score: 1, Insightful

      Ah well. Perhaps I should have been a bit cleverer and said, "There must be 110010 ways to lose your data."

      --
      Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
    3. Re:Paul Cylon by mcpkaaos · · Score: 1

      That or someone is playing fast and lose with their mod points.

      --
      It goes from God, to Jerry, to me.
  3. benchmarks by larien · · Score: 4, Insightful

    As Alan Cox alluded to, there are benchmarks for data transfers, web performance, etc, etc, etc, but none for data integrity, it's kind of assumed, even if it perhaps shouldn't be. It also reminds me of various cluster software which will happily crash a node rather than risk data corruption (Sun Cluster & Oracle RAC both do this). What do you [em]really[/em] want? Lightning fast performance, or the comfort of knowing that your data is intact & correct? For something like a rendering farm, you can probably tolerate a pixel or two being the wrong shade. If you're dealing with money, you want the data to be 100% correct, otherwise there's a world of hurt waiting to happen...

    1. Re:benchmarks by dgatwood · · Score: 5, Interesting

      I've concluded that nobody cares about data integrity. That's sad, I know, but I have yet to see product manufacturers sued into oblivion for building fundamentally defective devices, and that's really what it would take to improve things, IMHO.

      My favorite piece of hardware was a chip that was used in a bunch of 5-in-1 and 7-in-1 media card readers about four years ago. It was complete garbage, and only worked correctly on Windows. Mac OS X would use transfer sizes that the chip claimed to support, but the chip returned copies of block 0 instead of the first block in every transaction over a certain size. Linux supposedly also had problems with it. This was while reading, so no data was lost, but a lot of people who checked the "erase pictures after import" button in iPhoto were very unhappy.

      Unfortunately, there was nothing the computer could do to catch the problem, as the data was in fact copied in from the device exactly as it presented it, and no amount of verification could determine that there was a problem because it would consistently report the same wrong data.... Fortunately, there are unerase tools available for recovering photos from flash cards. Anyway, I made it a point to periodically look for people posting about that device on message boards and tell them how to work around it by imaging the entire flash card with dd bs=512 until they could buy a new flash card reader.

      In the end, I moved to a FireWire reader and I no longer trust USB for anything unless there's no other alternative (iPod, iPhone, and disks attached to an Airport Base Station). While that makes me somewhat more comfortable than dealing with USB, there have been a few nasty issues even with FireWire devices. For example, there was an Oxford 922 firmware bug about three years back that wiped hard drives if a read or write attempt was made after a spindown request timed out or something. I'm not sure about the precise details.

      And then, there is the Seagate hard drive that mysteriously will only boot my TiVo about one time out of every twenty (but works flawlessly when attached to a FW/ATA bridge chipset). I don't have an ATA bus analyzer to see what's going on, but it makes me very uncomfortable to see such compatibility problems with supposedly standardized modern drives. And don't get me started on the number of dead hard drives I have lying around....

      If my life has taught me anything about technology, it is this: if you really care about data, back it up regularly and frequently, store your backups in another city, ensure that those backups are never all simultaneously in the same place or on the same electrical grid as the original, and never throw away any of the old backups. If it isn't worth the effort to do that, then the data must not really be important.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    2. Re:benchmarks by BSAtHome · · Score: 1

      To paraphrase an RFC: Good, Fast, Cheap; pick two, you can't have all three.

    3. Re:benchmarks by unitron · · Score: 1

      And then, there is the Seagate hard drive that mysteriously will only boot my TiVo about one time out of every twenty (but works flawlessly when attached to a FW/ATA bridge chipset).

      And then there's my 80 Gig Western Digital that was very flakey (as soon as the warranty was up) in BX chipset (or equivalent) motherboard PCs, but I used it to replace the original drive in a Series 1 stand alone Philips Tivo and it's been working flawlessly in it for about a year now. Before you blame WD, I'm writing this on a BX chipset PC that's been running another WD 80 Gig that's almost identical (came off the assembly line a few months earlier) and it's been working fine since before I got the newer one that's now in the Tivo. Go figure.

      By the way, you aren't the only one running a hard drive cemetery. :-)

      --

      I see even classic Slashdot is now pretty much unusable on dial up anymore.

    4. Re:benchmarks by IvyKing · · Score: 1

      In the end, I moved to a FireWire reader and I no longer trust USB for anything unless there's no other alternative (iPod, iPhone, and disks attached to an Airport Base Station). While that makes me somewhat more comfortable than dealing with USB, there have been a few nasty issues even with FireWire devices.


      I don't recall seeing anything with regards to FireWire vs USB that would give FireWire an advantage in data integrity (though I may be missing some finer points about the respective specs). OTOH, I have seen specs (one of the LaCie RAID in a box drives) that give a 10 to 20% performance advantage to FW despite the 'lower' peak speeds - one reason is that FW uses separate pairs for xmit and rcv.
    5. Re:benchmarks by straybullets · · Score: 1

      then the data must not really be important

      Yep, that's it: loads of useless data, produced by a society barely able to perform some relatively weak techno tricks while completely failing to solve basic issues. Something is wrong in this biometric cash flow production model.

      --
      With that aggravating beauty, Lulu Walls.
    6. Re:benchmarks by dgatwood · · Score: 1

      There's no technical reason for FW drives to be more reliable. The limited number of FireWire silicon vendors, however, does mean that each one is likely to get more scrutiny than the much larger number of USB silicon vendors, IMHO.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

    7. Re:benchmarks by Anonymous Coward · · Score: 0

      What do you [em]really[/em] want?

      "<em>really</em>".

    8. Re:benchmarks by dgatwood · · Score: 1

      Close. The FireWire controller is smarter, which allows it to do a lot more work without the CPU being involved. That's the reason FireWire performs faster than USB 2.0 despite being a slower bus.

      Of course, the new USB 3.0 will bump USB speed way up. I'm not holding my breath, though. Considering what USB 2.0 does to the CPU load when the bus is under heavy use, I'd expect that Intel had better increase the number of CPU cores eightfold right now to try to get ahead of the game.... :-)

      this group of comments sums up my opinion pretty well for the most part.

      --

      Check out my sci-fi/humor trilogy at PatriotsBooks.

  4. End-to-end by Intron · · Score: 4, Informative

    Some enterprise server systems use end-to-end protection, meaning the data block is longer. If you write 512 bytes of data + 12 bytes or so of check data and carry that through all of the layers, it can prevent the data corruption from going undiscovered. The check data usually includes the block's address, so that data written with correct CRC but in the wrong place will also be discovered. It is bad enough to have data corrupted by a hardware failure, much worse not to detect it.
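    A minimal sketch of the idea, assuming a made-up layout of 512 data bytes followed by an 8-byte block address and a 4-byte CRC32 (12 check bytes in total). The names `protect` and `verify` are hypothetical and this is not any particular vendor's on-disk format:

```python
import struct
import zlib

SECTOR_SIZE = 512   # data bytes per block
CHECK_SIZE = 12     # assumed layout: 8-byte address + 4-byte CRC32


def protect(data: bytes, lba: int) -> bytes:
    """Append check data (block address + CRC32) to one 512-byte block."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte block")
    address = struct.pack(">Q", lba)
    # The CRC covers the address too, so both travel through every layer together.
    crc = zlib.crc32(data + address)
    return data + address + struct.pack(">I", crc)


def verify(block: bytes, expected_lba: int) -> bytes:
    """Return the data bytes, or raise if the block is damaged or misplaced."""
    if len(block) != SECTOR_SIZE + CHECK_SIZE:
        raise ValueError("wrong block length")
    data = block[:SECTOR_SIZE]
    address = block[SECTOR_SIZE:SECTOR_SIZE + 8]
    (stored_crc,) = struct.unpack(">I", block[SECTOR_SIZE + 8:])
    if zlib.crc32(data + address) != stored_crc:
        raise ValueError("CRC mismatch: data corrupted in transit or at rest")
    (lba,) = struct.unpack(">Q", address)
    if lba != expected_lba:
        # Correct CRC, wrong place: the block was written to or read from
        # the wrong address.
        raise ValueError("block address mismatch: misdirected read or write")
    return data
```

    The second check is the interesting one: a block with a perfectly valid CRC is still rejected if it turns up at an address other than the one it was written for.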

    --
    Intron: the portion of DNA which expresses nothing useful.
  5. Hello ZFS by Wesley+Felter · · Score: 4, Informative

    ZFS's end-to-end checksums detect many of these types of corruption; as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

    I am looking forward to the day when all RAM has ECC and all filesystems have checksums.

    1. Re:Hello ZFS by harrkev · · Score: 3, Informative

      I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
      Not gonna happen. The problem is that ECC memory costs more, simply because there is 12.5% more memory. Most people are going to go for as cheap as possible.

      But, ECC is available. If it is important to you, pay for it.
      --
      "-1 Troll" is the apparently the same as "-1 I disagree with you."
    2. Re:Hello ZFS by Wesley+Felter · · Score: 1

      Intel or AMD could force ECC adoption if they wanted to; the increase in cost would be easily hidden by Moore's Law.

    3. Re:Hello ZFS by drsmithy · · Score: 1

      Not gonna happen. The problem is that ECC memory costs more, simply because there is 12.5% more memory. Most people are going to go for as cheap as possible.

      It'll happen for the same reason RAID5 on certain types of arrays will be obsolete in 4 - 5 years. Eventually memory sizes are going to get so big that the statistical probability of a memory error will effectively guarantee they happen too frequently to ignore.

    4. Re:Hello ZFS by MarsDefenseMinister · · Score: 1

      Just so everybody knows, ZFS is available for Linux as a FUSE module. It's easy to get it working, and lots of fun to tinker with. I have it set up right now in a test configuration with an old 80 gig drive, and an 11 gig drive. 91 gigs total, in external USB enclosures. And I created files on an NFS server the same size as each of the drives, and told ZFS to use those files as mirrors. On a 100 megabit link. And surprisingly enough, it's actually not too slow to use!

      But the reason I have it set up is not to use it that way forever, but to learn about ZFS administration to assess how well it'll work in an average Linux geek's setup. So far, I think it'll work very well for many people, even if you've just got one drive. ZFS can keep multiple copies of important data and repair data errors even if you're not in a mirrored or RAID configuration.

      --
      No weapon in the arsenals of the world is so formidable as the will and moral courage of free men.-Ronald Reagan
    5. Re:Hello ZFS by QuoteMstr · · Score: 1

      Obsolete? What would you replace it with then?

    6. Re:Hello ZFS by drsmithy · · Score: 2, Interesting

      Obsolete? What would you replace it with then?

      RAID6. Then a while after that, "RAID7" (or whatever they call triple-parity).

      In ca. 4-5 years[0], the combination of big drives (2TB+) and raw read error rates (roughly once every 12TB or so) will mean that during a rebuild of 6+ disk RAID5 arrays after a single drive failure, a second "drive failure" (probably just a single bad sector, but the end result is basically the same) will be - statistically speaking - pretty much guaranteed. RAID5 will be obsolete because it won't protect you from array failures (because every single-disk failure will become a double-disk failure). RAID6 will only give you the same protection as RAID5 today (because you will be vulnerable to a third drive failing during the rebuild in addition to the second) and "RAID7" will be needed to protect you from "triple disk failures".
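      A back-of-the-envelope sketch of that arithmetic, assuming read errors arrive independently at an average of one per 12TB read (the figure quoted above) and that a rebuild has to read every surviving disk in full. The function name and the Poisson model are illustrative assumptions, not a vendor formula:

```python
import math


def rebuild_error_probability(disks: int, disk_tb: float,
                              tb_per_error: float = 12.0) -> float:
    """Chance of hitting at least one read error while rebuilding a RAID5 array.

    Assumes independent errors averaging one per `tb_per_error` terabytes read,
    and that all (disks - 1) surviving drives are read end to end.
    """
    if disks < 2:
        raise ValueError("a rebuild needs at least one surviving disk")
    tb_read = (disks - 1) * disk_tb
    expected_errors = tb_read / tb_per_error
    # Poisson model: P(no error) = exp(-expected_errors)
    return 1.0 - math.exp(-expected_errors)


if __name__ == "__main__":
    for n in (6, 8, 12, 16):
        p = rebuild_error_probability(n, 2.0)
        print(f"{n:2d} x 2TB RAID5: P(read error during rebuild) = {p:.0%}")
```

      Under these assumptions a 6-disk array of 2TB drives reads 10TB during a rebuild, so the expected number of read errors is already about 0.83, and it keeps climbing as drives or arrays get bigger.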

      On a more positive note, with current error rates, RAID10 should last until ca. 10TB drives before SATA array elements have to be "triple mirrored" (although this is far enough down the track that I expect the basic assumptions here to have changed). "Enterprise" hardware also has (much) longer to go, because the read error rate is better and drives typically (much) smaller.

      (Even today, IMHO, anyone using drives bigger than 250G in 6+ disk arrays without either RAID6 or RAID10 is crazy.)

      [0]This is actually being pretty generous. It's certain we'll see 2TB drives well before then, but I'm taking a timeframe where they will be "common" rather than "high end".

    7. Re:Hello ZFS by renoX · · Score: 1

      >The problem is that ECC memory costs more, simply because there is 12.5% more memory.

      The big issue is that ECC memory doesn't cost only 12.5% more than regular memory, otherwise you'd see that lots of knowledgeable (or correctly guided) people would buy ECC.

    8. Re:Hello ZFS by AntrygRevok.net · · Score: 1

      ECC may be available
      ( I always build systems with http://www.crucial.com/ ECC RAM,
      and no I'm nae affiliated,
      but they're the ONLY brand who've never once proven flaky,
      in my experience. . . )

      but the problem is that almost no motherboards support ECC.

      Gigabyte's GC-RAMDISK and GO-RAMDISK ( up-to 4GB ) hardware-ram-drive without ECC *support*,
      is typical of this idiocy:
      The only way to make the things trustworthy is to run a RAID5 or RAID6 array of 'em,
      and that gets bloody expensive
      ( though the speed. . . kernel-raid5 is quick, eh? )

      http://usa.asus.com/ is the only consumer-brand I know-of,
      other-than maybe Abit ( it's been awhile for me to've seen one of 'em. . . )
      that provides ECC support.

      Gigabyte generally don't, and if they don't. . . who does?

      Supermicro & Tyan? No SLI/Crossfire/sea-of-ports on them, eh?

      --
      Try also my gallery: http://photo.net/photos/AntrygRevo
  6. I think this has happened to me by jdigital · · Score: 4, Interesting

    I think I suffered from a series of Type III errors (rtfa). After merging lots of poorly maintained backups of my /home file system I decided to write a little script to look for duplicate files (using file size as a first indicator, then md5 for ties). The script would identify duplicates and move files around into a more orderly structure based on type, etc. After doing this i noticed that a small number of my mp3's now contain chunks of other songs in them. My script was only working with whole files, so I have no idea how this happened. When I refer back to the original copies of the mp3s the files are uncorrupted.
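    For reference, the size-then-md5 approach can be sketched roughly like this. It is not the original script, just an illustration with made-up names (`md5_of`, `find_duplicates`), and it only reports duplicates rather than moving anything:

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under `root` that share both size and MD5."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

    Note that a script like this only ever reads whole files, so by itself it has no way of splicing one file into another. Re-hashing each file after it has been moved and comparing against the hash taken before the move would at least show at which step the contents changed.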

    Of course, no one believes me. But maybe this presentation is on to something. Or perhaps I did something in a bonehead fashion totally unrelated.

    --
    :wq ~ ~ ~ ~ ~
    1. Re:I think this has happened to me by jdigital · · Score: 1

      Of course the fa that I was referring to is here. Much more informative than AC's post if I may say...

      --
      :wq ~ ~ ~ ~ ~
    2. Re:I think this has happened to me by KevinColyer · · Score: 1

      I have experienced having small chunks of other songs inside mp3 files on my mp3 player. Of course being a cheap player I assumed it was the player... I have a few problems when writing to it from Linux. I shall look more closely now!

      Perhaps the FAT filesystem is interpreted differently on the player to how Linux expects it to be? (Or VV)

    3. Re:I think this has happened to me by mikolas · · Score: 1

      I have had the same kind of problems a few times. I store all my stuff on a server with RAID5, but there have been a couple of times when transferring music from the server (via SMB) to MP3 player (via USB) has corrupted files. I never solved the problem as the original files were intact so I did not go through the effort. However, after reading the article I just might do something about it as I got a bit more worried about the data integrity of my lifetime personal file collection that I store on the server.

    4. Re:I think this has happened to me by Anonymous Coward · · Score: 0

      You have more problems than that. Did you notice that one of your posts was monospaced, and the other was not? Obviously, your Slashdot preferences are flip-flopping, probably due to chunks of someone else's preferences randomly being copied into yours.

      Of course, no one believes this. Perhaps someone did something in a bonehead fashion totally unrelated.

    5. Re:I think this has happened to me by Reziac · · Score: 1

      Oh, I believe you... I'm reminded of a "HD crash" a friend suffered. Long story short, I wound up doing file reconstruction, and from the pieces (almost always some multiple of 4k) of files stuck inside other files, concluded that there was probably nothing wrong with the HD, but rather, the RAID controller was writing intact data but to random locations -- like it had got its "which HD this data belongs on" count off by one. And there was evidence that the corruption started long before anyone noticed the problem.

      Occurs to me that RAID may not be the only HD controller system that can misfire this way. Who knows what specific actions might tickle a chipset bug, such as someone else mentioned up above?

      Yet another reason to never ever toss old backups, because the backup made since just MIGHT contain errors!

      --
      ~REZ~ #43301. Who'd fake being me anyway?
  7. MySQL? by Jason+Earl · · Score: 4, Funny

    I was expecting an article on using MySQL in production.

    1. Re:MySQL? by glwtta · · Score: 1

      They said many paths - MySQL is just the most common.

      --
      sic transit gloria mundi
    2. Re:MySQL? by Anonymous Coward · · Score: 0

      Mod parent insightful, not funny. :P

  8. Just wait till the banks data gets munged by Colin+Smith · · Score: 1

    That'll get fixed lickety split.

    --
    Deleted
  9. Re:Iffy cards? Try crappy drivers by Anonymous Coward · · Score: 0
    Your whole post makes no sense.
    1. AC was specifically talking about Ethernet checksums.
    2. A "typicalslashdotuser's" Linux box doesn't process "datasets of several Petabytes" looking for data corruption
    3. Which poster are you talking about? The words came straight from Alan Cox
    4. Do you have any problems using said card with an alternate OS?
  10. No by ElMiguel · · Score: 3, Interesting

    as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.

    Sorry, but that is absurd. Nothing can absolutely protect against data errors (even if they only happen in the hard disk). For example, errors can corrupt ZFS data in a way that turns out to have the same checksum. Or errors can corrupt both the data and the checksum so they match each other.

    This is ECC 101 really.

    1. Re:No by Wesley+Felter · · Score: 1

      For example, errors can corrupt ZFS data in a way that turns out to have the same checksum. Or errors can corrupt both the data and the checksum so they match each other.

      You can use SHA as the checksum algorithm; the chance of undetected corruption is infinitesimal.

    2. Re:No by Anonymous Coward · · Score: 0

      Um... do you have a reference for that type of corruption happening in ZFS? First the checksums aren't kept with the data, second if that happens, the data is regenerated from parity information. ZFS is self healing. And by parity information, I mean the raid parity information. so even if data gets corrupted, it is repaired.

    3. Re:No by Slashcrap · · Score: 5, Funny

      Or errors can corrupt both the data and the checksum so they match each other.

      This is about as likely as simultaneously winning every current national and regional lottery on the planet. And then doing it again next week.

      And if we're talking about a 512 bit hash then it's possible that a new planet full of lotteries will spontaneously emerge from the quantum vacuum. And you'll win all those too.

    4. Re:No by Anonymous Coward · · Score: 0

      A checksum is very small compared to the data. Even an ideally distributed 128 bit(=16bytes) checksum matches 2^3968 different contents of a 512 byte block of data. Change it to one of those contents and the error won't be detected. The only thing which stands between you and that kind of error is the very small likelihood of it: only 1 in 2^128 random contents match the checksum. Suppose that a block was read a thousand times per second since the beginning of time and it has always returned different random data, then you would still not need to expect that one of those blocks could have bypassed a proper 128 bit checksum. That's good enough for most applications.
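      The numbers can be checked with a few lines of exact arithmetic. "The beginning of time" is taken here to be about 13.8 billion years, which is an assumption added for this sketch, as are the variable names:

```python
from fractions import Fraction

BLOCK_BITS = 512 * 8      # one 512-byte block
CHECKSUM_BITS = 128       # ideally distributed 128-bit checksum

# Every checksum value is shared by 2^(4096 - 128) block contents.
colliding_contents_log2 = BLOCK_BITS - CHECKSUM_BITS

# Assumption: "the beginning of time" is roughly 13.8 billion years ago.
SECONDS_PER_YEAR = 31_557_600            # 365.25 days
AGE_OF_UNIVERSE_YEARS = 13_800_000_000
reads = 1000 * SECONDS_PER_YEAR * AGE_OF_UNIVERSE_YEARS

# Each random block slips past the checksum with probability 2^-128, so the
# chance that any of them did is at most reads / 2^128 (union bound).
p_undetected = Fraction(reads, 2 ** CHECKSUM_BITS)

if __name__ == "__main__":
    print(f"contents per checksum value: 2^{colliding_contents_log2}")
    print(f"random blocks read: about {float(reads):.2e}")
    print(f"chance any one slipped through: at most {float(p_undetected):.2e}")
```

      Even with a thousand random reads per second for that long, the bound on an undetected mismatch stays far below one in a quadrillion.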

    5. Re:No by TruthfulLiar · · Score: 2, Funny

      > And if we're talking about a 512 bit hash then it's possible that a new planet full of lotteries will spontaneously emerge from the quantum vacuum. And you'll win all those too.

      If this happens, be sure to keep the money from the quantum vacuum lotteries in a separate account, or it will annihilate with your real money.

    6. Re:No by StarfishOne · · Score: 1

      I'm amazed that none of you have ever heard of the Girlfriend Money experiment: when a girlfriend (especially your own) looks at a certain amount of money, she'll cause the collapse of the money's superposition. This _always_ results in the money disappearing both completely and instantaneously. ;o

    7. Re:No by Anonymous Coward · · Score: 0

      ZFS checksums are SHA256 and NOT stored *in* the block but in the block above.

  11. Hah

    That is nothing compared to the actual storage technology. Attempting to recover data packed at a density of 1 GB/sq.in. from a disk spinning at 10,000 revolutions per minute where the actual data is stored in a micron thin layer of rust on the surface of the disk is manifestly impossible.

    1. Re:Hah by Cajun+Hell · · Score: 3, Insightful

      Sometimes I think we're lucky this stuff works at all.

      --
      "Believe me!" -- Donald Trump
    2. Re:Hah by Lorkki · · Score: 1

      Some people call it luck, others call it engineering.

  12. ...but in the market speed sells not correctness. by ozzee · · Score: 3, Interesting

    Ah - this is the bane of computer technology.

    One time I remember writing some code and it was very fast and almost always correct. The guy I was working with exclaimed "I can give you the wrong answer in zero seconds" and I shut up and did it the slower way that was right every time.

    This mentality of speed at the cost of correctness is prevalent. For example, I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM. I have assembled many computers and unfortunately there have been a number of times where ECC memory was not an option. In almost every case where I have used ECC memory, the computer was noticeably more stable. Case in point, the most recent machine that I built has never (as far as I know) crashed and I've thrown some really nasty workloads its way. On the other hand, a couple of notebooks I have have crashed more often than I care to remember and there is no ECC option. Not to mention the ridicule I get for suggesting that people invest the extra $30 for a "non server" machine. Go figure. Suggesting that stability is the realm of "server" machines and inferring that end user machines should be relegated to a realm of lowered standards of reliability makes very little sense to me, especially when the investment of $30 to $40 is absolutely minuscule if it prevents a single failure. What I think (see lawsuit coming on) is that memory manufacturers will sell marginal-quality products to the non ECC crowd because there is no way of validating memory quality.

    I think there needs to be a significant change in the marketing of products to ensure that metrics of data integrity play a more significant role in decision making. It won't happen until the consumer demands it and I can't see that happening any time soon. Maybe, hopefully, I am wrong.

  13. Missing option. by Neanderthal+Ninny · · Score: 1

    I remember a long time ago that cosmic rays (actually the ElectroMagnetic Field disruption they caused) created some of those errors.

    1. Re:Missing option. by Intron · · Score: 1

      The typical energy of a cosmic ray is around 300 MeV. Interestingly, around mid 1990's the feature size of SRAM cells got small enough that a 300 MeV event could flip the state. This means that the cache memory now needs ECC as well as main memory, but I don't see that happening in too many CPUs. Reference:

      http://www.srim.org/SER/SERTrends.htm

      --
      Intron: the portion of DNA which expresses nothing useful.
  14. what we have lost by cdrguru · · Score: 4, Insightful

    It amazes me how much has been lost over the years towards the "consumerization" of computers.

    Large mainframe systems have had data integrity problems solved for a long, long time. It is today unthinkable that any hardware issues or OS issues could corrupt data on IBM mainframe systems and operating systems.

    Personal computers, on the other hand, have none of the protections that have been present since the 1970s on mainframes. Yes, corruption can occur anywhere in the path from the CPU to the physical disk itself or during a read operation. There is no checking, period. And not only are failures unlikely to be quickly detected but they cannot be diagnosed to isolate the problem. All you can do is try throwing parts at the problem, replacing functional units like the disk drive or controller. These days, there is no separate controller - it's on the motherboard - so your "functional unit" can almost be considered to be the computer.

    How often is data corrupted on a personal computer? It is clear it doesn't happen all that often, but in the last forty years or so we have actually gone backwards in our ability to detect and diagnose such problems. Nearly all businesses today are using personal computers to at least display information if not actually maintain and process it. What assurance do you have that corruption is not taking place? None, really.

    A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit. In the right place, such changes could lead to penalties, interest and possible loss of a key customer.

    Why have we gone backwards in this area when compared to a mainframe system of forty years ago? Certainly software has gotten more complex but basic issues of data integrity have fallen by the wayside. Much of this was done in hardware previously. It could be done cheaply in firmware and software today with minimal cost and minimal overhead. But it is not done.

    1. Re:what we have lost by glwtta · · Score: 1

      Yeah, go figure, cheap stuff is built to lower standards than really high-end stuff.

      A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit.

      I would think it's extremely unlikely that such random corruption would happen on some byte somewhere which actually gets interpreted as a meaningful digit; much more likely to either corrupt some format or produce some noticeable garbage somewhere (not "wrong-yet-meaningful" data). Or just go completely unnoticed - I recently discovered (don't ask how) that you can write relatively large chunks of garbage to many parts of an excel file without producing any noticeable effect.

      Why have we gone backwards in this area when compared to a mainframe system of forty years ago?

      You could ask that about almost anything: we've had some spectacular advances in audio quality, so why do we settle for lossy formats and the "passable-at-best" sound of iPods?

      Answer: It's Good Enough (TM).

      --
      sic transit gloria mundi
    2. Re:what we have lost by Anonymous Coward · · Score: 0

      But consumer level stuff is so much cheaper that I could buy maybe 100 PCs instead of one mainframe. I can run multiple redundant machines each checking the other if I really want to. If a machine dies, I throw it in the trash and move to the next and at the end it's still cheaper than a mainframe.

      Plus technology is reaching such a quality and cheapness level that the overhead of redundant checking is too much of a cost or disadvantage for the minuscule chance that something may be corrupted.

    3. Re:what we have lost by suv4x4 · · Score: 3, Interesting

      Why have we gone backwards in this area when compared to a mainframe system of forty years ago?

      For the same reason why experienced car drivers crash in ridiculous situations: they are too sure of themselves.

      The industry is so huge, that the separate areas of our computers just accept the rest is a magic box that should magically operate as is written in the spec. Screwups don't happen too often, and when they happen they are not detectable, hence no one woke up to it.

      That said don't feel bad, we're not going downwards. It just so happened speed and flashy graphics will play important role for another couple of years. Then after we max this out, the industry will seek to improve another parameter of their products, and sooner or later we'll hit back the data integrity issue :D

      Look at hard disks: does the casual consumer need more than 500 GB? So now we see the advent of faster hybrid (flash+disk) storage devices, or pure flash memory devices.

      So we've tackled storage size, we're about to tackle storage speed. And when it's fast enough, what's next, encryption and additional integrity checks. Something for the bullet list of features...

    4. Re:what we have lost by ozzee · · Score: 1
      Plus technology is reaching such a quality and cheapness level that the overhead of redundant checking is too much of a cost or disadvantage for the minuscule chance that something may be corrupted.

      Are you sure? The actual cost is minuscule and significantly less than the cost of potential errors.

    5. Re:what we have lost by hxnwix · · Score: 1

      There is no checking, period.

      I'm sorry, but for even the crappiest PC clones, you've been wrong since 1998. Even the worst commodity disk interfaces have had checking since then: UDMA uses a 16-bit CRC for data; SATA uses a 32-bit CRC for commands as well. Most servers and workstations have had ECC memory for even longer. Furthermore, if you cared at all about your data, you already had end-to-end CRC/ECC.

      Yeah, mainframes are neat, but they don't save you from end-to-end integrity checking unless you really don't give a damn about it in the first place, being the sort of person who wakes up every morning to a tall steaming glass of IBM koolaid...
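
      To illustrate the point about link-level CRCs, here is a minimal Python sketch showing that both a 16-bit and a 32-bit CRC catch a single flipped bit. It uses the standard library's `binascii.crc_hqx` and `zlib.crc32` as stand-ins; these are not claimed to be the exact CRC parameters that UDMA or SATA use on the wire, and the 512-byte "sector" and the flipped bit position are made up for the example.

```python
import binascii
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def crc16(data: bytes) -> int:
    # 16-bit CRC-CCITT from the standard library (illustrative stand-in).
    return binascii.crc_hqx(data, 0xFFFF)


def crc32(data: bytes) -> int:
    # 32-bit CRC from zlib (illustrative stand-in).
    return zlib.crc32(data) & 0xFFFFFFFF


sector = bytes(range(256)) * 2       # hypothetical 512-byte sector
damaged = flip_bit(sector, 1234)     # one bit flipped "in the cable"

# The receiver recomputes the CRC and compares it with the one sent along.
crc16_caught = crc16(sector) != crc16(damaged)
crc32_caught = crc32(sector) != crc32(damaged)
print(crc16_caught, crc32_caught)
```

      Note that this only protects the hop the CRC covers. Once the receiver has checked and discarded the CRC, anything that happens to the data afterwards goes unnoticed unless another layer checks again.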
    6. Re:what we have lost by Anonymous Coward · · Score: 0

      Why have we gone backwards in this area when compared to a mainframe system of forty years ago?
      $.
    7. Re:what we have lost by tonkdude · · Score: 1

      Actually, even today a mainframe running OS/390 and CICS still has problems with disk corruption. If a machine crash or power outage happens when a file is being extended via a CA split, the only way to recover the file is from backup and then forward applying the CICS journals.

  15. RAM = the weakest link by DigiShaman · · Score: 3, Interesting

    It's well known that ECC and other forms of error correction are found at all levels of software and hardware. For example, hard drives have their own internal error correction, while the file system the drive is formatted with may have another. The same goes for the CPU, serial buses, network adapters (both the physical IEEE 802.x connection and the TCP/IP stack) and various forms of software error correction.

    Basically, the modern computer has various hardware and software layers of error correction, some stacked on top of each other and others operating on their own.

    We do have a weak link with desktops regarding RAM, however. While modern workstations and servers are generally installed with ECC RAM, our desktops are not. Also worth mentioning, most custom-built clone PCs are for the desktop market. This has become a huge problem given that the voltage and timing requirements don't leave much room for tolerance. The fact that memory density has been going up only makes the chances for "bit flips" even worse. I can't tell you how many times I've run into data corruption due to improper RAM settings. Running a few passes with Memtest86+ will reveal this nasty issue. Hell, even Windows Vista now includes a utility to check for faulty RAM read/write issues; that's how big the problem has become in the industry. As such, the desktop market severely needs to embrace ECC RAM like the server and workstation market. These days, to not use ECC is asking for trouble. And yes, you would take a 1 to 2% performance hit, but so what; data integrity is more important.

    Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too given the P965 has been designed for Core 2 Duo and Quad Core CPUs.

    --
    Life is not for the lazy.
    1. Re:RAM = the weakest link by IvyKing · · Score: 1

      We do have a weak link with desktops regarding RAM, however. While modern workstations and servers are generally installed with ECC RAM, our desktops are not.


      The major failing of the original Apple Xserve 'supercomputer' cluster was the lack of ECC - ISTR an estimate of a memory error every few hours (estimate made by Del Cecchi on comp.arch), which would severely limit the kinds of problems that could be solved on the system. I also remember the original systems being replaced a few months later with Xserves that had ECC.


      And yes, you would take a 1 to 2% performance hit, but so what; Data integrity is more impor[t]ant.


      A 1 to 2% performance hit is less costly than having to do multiple runs to make sure the data didn't get munged.


      Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too given the P965 has been designed for Core 2 Duo and Quad Core CPUs.


      You're right about the crying shame - what you have is a high end games machine. Perhaps AMD still has a chance if their chipsets support ECC RAM. Something similar came up a few years ago on one of the Sun newsgroups about the latest Apple box being able to run rings around a much more expensive Sun box - the one limitation of the Apple box was lack of ECC.
    2. Re:RAM = the weakest link by BiggerIsBetter · · Score: 1

      The issue with chipsets is about market segmentation rather than newer=better. P965 is a "mainstream desktop" chipset, while say, a 975X is a "performance desktop" and/or "workstation" chipset and so supports ECC. The performance hit isn't a factor, but the price hit for the extra logic apparently is.

      --
      Forget thrust, drag, lift and weight. Airplanes fly because of money.
    3. Re:RAM = the weakest link by Anonymous Coward · · Score: 1, Informative

      Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too given the P965 has been designed for Core 2 Duo and Quad Core CPUs.

      You meant 975x, not 965x. The successor of 975x is X38 (Bearlake-X) chipset supporting ECC DRAM. It should debut this month.

    4. Re:RAM = the weakest link by DigiShaman · · Score: 1

      You meant 975x, not 965x.

      Correct. A typo on my part.

      --
      Life is not for the lazy.
    5. Re:RAM = the weakest link by KonoWatakushi · · Score: 2, Insightful

      You're right about the crying shame - what you have is a high end games machine. Perhaps AMD still has a chance if their chipsets support ECC RAM.

      The nice thing about AMD is that with the integrated memory controller, you don't need support in the chipset. I'm not sure about Semprons, but all of the Athlons support ECC memory. The thing you have to watch out for is BIOS/motherboard support. If the vendor doesn't include the necessary traces on the board or the configuration settings in the BIOS, it won't work. It is worth noting that unbuffered ECC ram will work in non-ECC boards, but without actually using the ECC bits, so you have to make sure that the board explicitly supports ECC, and is not merely compatible.

      It is a shame though, and however nice a chip the Core2 is, AMD is the obvious choice if you care about your data.
    6. Re:RAM = the weakest link by Anonymous Coward · · Score: 1, Informative

      Sad given that ECC logic is so simple it's basically FREE.

      What's worse? It IS free!
      Motherboard chips (e.g. south bridge, north bridge) are generally limited in size NOT by the transistors inside but by the number of IO connections. There's silicon to burn, so to speak, and therefore plenty of room to add features like this.

      How do I know this? Oh wait, my company made them.... We never had to worry about state-of-the-art process technology because it wasn't worth it. We could afford to be several generations behind for exactly this reason.

  16. checksum offload by goarilla · · Score: 1

    Doesn't checksum offload mean that the functionality gets offloaded to another device, like say an expensive NIC, and thus removes that overhead from the CPU?

    1. Re:checksum offload by Anonymous Coward · · Score: 1, Interesting

      Correct. However, there are two problems. Firstly, it's not an expensive NIC these days - virtually all Gigabit Ethernet chips do at least some kind of TCP offload, and if these chips miscompute the checksum (or don't detect the error) due to being a cheap chip, you're worse off than doing it in software.

      Also, these don't protect against errors on the card or PCI bus. (If the data was corrupted on the card or on the bus after the checksum validation but before it got to system RAM for any reason, this corruption would not be detected. But if the checksum validation was happening in software after the data was written to RAM, the corruption would be detected by the OS. It'd assume it's a network transmission error instead of a bad network card, but (in TCP) it would arrange for a retransmittal of the data.)
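
      The second point can be sketched in a few lines of Python: a 16-bit one's-complement checksum in the style of RFC 1071 (the kind TCP uses), verified by the CPU over the bytes as they sit in RAM. The payload and the simulated bus damage are invented for the example, and `verify_in_host_memory` is a hypothetical helper name, not part of any real network stack.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                   # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify_in_host_memory(payload: bytes, expected: int) -> bool:
    # Runs on the CPU over the bytes as they sit in RAM, i.e. after the
    # card and the bus have already had their chance to damage them.
    return internet_checksum(payload) == expected


sent = b"payload as it left the sender"
checksum = internet_checksum(sent)

# Simulate damage on the bus, after a NIC would have validated the packet.
in_ram = bytearray(sent)
in_ram[3] ^= 0x10

clean_ok = verify_in_host_memory(sent, checksum)
damaged_ok = verify_in_host_memory(bytes(in_ram), checksum)
print(clean_ok, damaged_ok)
```

      With offload, the comparison happens on the card before the copy into RAM, so the flipped bit in `in_ram` would never be looked at. Done in software, the same check fails and TCP can ask for a retransmit.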

    2. Re:checksum offload by shird · · Score: 1

      Yes, but then there is the risk the data gets corrupted between the NIC and CPU. Doing the checksum at the CPU checks the integrity of the data at the end-point, rather than on its way to the CPU.

      --
      I.O.U One Sig.
  17. How much does scrubbing cost? by skeptictank · · Score: 2, Interesting

    Can anyone point me toward some information on the hit to CPU and I/O throughput for scrubbing?

    1. Re:How much does scrubbing cost? by feld · · Score: 1

      I was wondering the same thing... I don't have scrubbing enabled on my Opteron workstation. I should do a memory benchmark test or two and turn it on to see how it compares.

    2. Re:How much does scrubbing cost? by Detritus · · Score: 1

      Scrubbing for RAM is an insignificant amount of overhead. All it involves is doing periodic read/write cycles on each memory location to detect and correct errors. This can be done as a low-priority kernel task or as part of the timer interrupt-service-routine.

      --
      Mea navis aericumbens anguillis abundat
    3. Re:How much does scrubbing cost? by skeptictank · · Score: 1

      If a system were operating in an environment where a failure is more likely, would it be desirable to increase the frequency of access to a given memory location? It seems reasonable that this would be the case. I am looking at an application that could be exposed to a higher level of cosmic rays than would be normal for ground-based workstations.

    4. Re:How much does scrubbing cost? by Detritus · · Score: 1

      If you want to be thorough about it, you need to determine the acceptable probability of an uncorrectable error in the memory system, the rate at which errors occur, and the scrub rate needed to meet or exceed your reliability target. If you scrub when the system is idle, you will probably find that the scrub rate is much higher than the minimum rate needed to meet your reliability target. In really hostile environments, you may need a stronger ECC and/or a different memory organization.
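
      As a rough worked example of that calculation, the Python sketch below assumes a single-error-correcting code per memory word and bit upsets arriving as a Poisson process, so a word becomes uncorrectable when two or more upsets land in it between scrubs. Every number in it (upset rate, word width, memory size, reliability target) is hypothetical; real figures have to come from measurements for the environment in question.

```python
import math


def p_uncorrectable(bits_per_word: int, upsets_per_bit_hour: float,
                    scrub_interval_hours: float) -> float:
    """P(two or more upsets hit one word between scrubs), Poisson model."""
    m = bits_per_word * upsets_per_bit_hour * scrub_interval_hours
    # 1 - e^-m * (1 + m), written with expm1 to limit rounding error.
    return -math.expm1(-m) - m * math.exp(-m)


def max_scrub_interval_hours(bits_per_word: int, upsets_per_bit_hour: float,
                             words: int, target_failures_per_hour: float) -> float:
    """Small-m approximation: failures/hour ~= words * (n*lam)**2 * T / 2."""
    rate = bits_per_word * upsets_per_bit_hour
    return 2.0 * target_failures_per_hour / (words * rate * rate)


# Hypothetical inputs: 72-bit words (64 data + 8 check bits), 1 GiB of data,
# a made-up upset rate, and a made-up target of 1e-6 failures per hour.
BITS = 72
LAM = 1e-9
WORDS = 2 ** 27
TARGET = 1e-6

interval = max_scrub_interval_hours(BITS, LAM, WORDS, TARGET)
achieved = WORDS * p_uncorrectable(BITS, LAM, interval) / interval
print(f"scrub at least every {interval:.2f} h, "
      f"giving about {achieved:.2e} uncorrectable errors per hour")
```

      In this model the failure rate grows roughly linearly with the scrub interval, which is why scrubbing whenever the system is idle tends to overshoot the target by a wide margin.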

      --
      Mea navis aericumbens anguillis abundat
  18. Speed Kills by Anonymous Coward · · Score: 0

    The subject is the comment.

  19. Re: ...but in the market speed sells not correctne by Anonymous Coward · · Score: 1, Insightful

    I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM.

    I do understand it. They live in the real world, where computers are fallible, no matter how much you spend on data integrity. It's a matter of diminishing return. Computers without ECC are mostly stable and when they're not, they typically exhibit problems on a higher level. I've had faulty RAM once. Only one bit was unstable and only one test of the many Memtest routines triggered the defect. Even a fault that small caused problems with every other verified CD burning. Given that lots of other reasons can cause data integrity violations, many of which can't be avoided because they're rooted in the imperfections of human nature, it is more effective to have procedures in place to deal with problems than to avoid them 100%.

  20. Timely article ... by ScrewMaster · · Score: 2, Interesting

    As I sit here having just finished restoring NTLDR to my RAID 0 drive after the thing failed to boot. I compared the original file and the replacement, and they were off by ONE BIT.

    --
    The higher the technology, the sharper that two-edged sword.
    1. Re:Timely article ... by dotgain · · Score: 1
      A while ago I had an AMD K6-2 which couldn't gunzip one of the XFree86 tarballs (invalid compressed data - CRC error). I left memtest running over 24 hours which showed nothing, copying the file onto another machine (using the K6 as a fileserver) and gunzipping it there worked. I eventually bumped into someone with the same mobo and same problem, and figured binning the mobo was the fix.

      To be honest, most of the comments about ECC RAM here have convinced me that it's worth it just for more peace of mind - all those poppy MP3s I've got and never thought much about could well have been caused by corruption at my end all this time.

  21. Re: ...but in the market speed sells not correctne by ozzee · · Score: 1
    They live in the real world, where computers are fallible ...

    Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend on burning CDs that were no good, or running various memtests, not to mention the possible corrupted data you ended up saving and other unknown consequences? Had you bought ECC RAM, your problem would have been corrected or more than likely detected, not to mention that the memory manufacturers would need to push up their quality needs. The extra money for ECC would become minuscule if everyone bought only ECC RAM.

    Equating human fallibility to physical fallibility makes little sense. If we sought to use the same standards you're proposing for electrical engineering to say civil engineering, it would extend to it being OK for a building to topple because they're "fallible" - not good.

    If I could validate every data path in my computer at up to a 20% premium I would. That's better than an insurance policy. It happens to be that RAM is especially susceptible to errors that are very difficult to diagnose or even repeat and so reducing the probability of such errors is desirable, and at a small price like ECC RAM, it's a bargain of an insurance policy.

    BTW, I'm not saying that you don't put procedures in place to deal with hardware failure. I'm saying that treating the problem may be far more effective than treating the symptoms.

  22. Real-life proof of ZFS detecting problems by E-Lad · · Score: 3, Informative

    Give this blog entry a read:
    http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta

    And you'll understand :)

  23. HEY. by yoyhed · · Score: 2, Funny

    TFA doesn't list ALL the possible ways data can be corrupted. It fails to mention the scenario of Dark Data (an evil mirror of your data, happens more commonly with RAID 1) corrupting your data with Phazon. In this case, the only way to repair the corruption is to send your data on a quest to different hard drives across the world (nay, the GALAXY) to destroy the Seeds that spread the corruption.

    --
    WHO NEEDS SHIFT WHEN YOU HAVE CAPSLOCK/ DAMN1
    1. Re:HEY. by DeadChobi · · Score: 1

      Praise be to the almighty Torrent, that he may deliver us from data corruption!

      --
      SRSLY.
  24. my path to corruption.... by Anonymous Coward · · Score: 0

    Back in the good old MS-DOS days I managed to get one of the first VESA Local Bus IDE cards, which promised great transfer rates over ISA cards. Well, I played with the jumper settings to enable fast DMA transfers but the hardware (or driver) was not up to spec... I booted up and the first thing I did was issue some "dir /s /p" commands (that was the unofficial visual speed test back then). I noticed that the listings got more corrupted with each pass (ouch); everything got screwed, so I had to get back to the default jumper settings and let the format begin... LOL

  25. I've seen it happen by Hans+Lehmann · · Score: 1

    I previously had a Shuttle desktop machine running Windows XP. One day I started noticing that when I copied files to a network file server, about 1 out of 20 or so would get corrupted, with larger files getting corrupted more often than smaller ones. Copying them to the local IDE hard drive caused no problems, and other machines did not have problems copying files to the same file server. I spent a lot of time swapping networking cards, etc. and not getting anywhere, until I plugged in a USB drive and noticed that files were also getting corrupted when copied to it.
    I then ran tests with large random files, doing diffs between the originals and the copies. The errors were always single bytes that had changed; the file size never changed. Interestingly, whenever there was a changed byte, the seventh and eighth bytes preceding the error were always the same values, although having those two values next to each other in a file did not always cause the error. The problem turned out to be a bad motherboard; the data path to some destinations like the NIC and USB ports would corrupt data, while the path to the IDE connectors would not.
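
    A test along those lines can be sketched in Python: generate random data, compare the original with the copy byte by byte, and print the bytes just before each mismatch to look for a pattern. This is only an illustration of the approach; the helper names are invented, and the corruption is simulated in memory rather than by copying through a faulty data path.

```python
import os


def differing_offsets(original: bytes, copy: bytes) -> list:
    """Offsets at which two equal-length byte strings disagree."""
    if len(original) != len(copy):
        raise ValueError("sizes differ; this check assumes equal lengths")
    return [i for i, (a, b) in enumerate(zip(original, copy)) if a != b]


def preceding_bytes(data: bytes, offset: int, count: int = 8) -> bytes:
    """The bytes just before an offset, to look for a triggering pattern."""
    return data[max(0, offset - count):offset]


original = os.urandom(4096)          # stand-in for a large random file
copy = bytearray(original)
copy[1000] ^= 0xFF                   # simulate one byte changed in transit
copy = bytes(copy)

offsets = differing_offsets(original, copy)
for off in offsets:
    print(off, preceding_bytes(original, off).hex())
```

    With real files the same comparison would be run over the copy read back from the suspect destination, ideally from another machine so the read does not go through the same bad path.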

    --
    09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
    1. Re:I've seen it happen by kg261 · · Score: 1

      And I have seen this happen on the IDE as well. In my case, the fan for the bridge chip had failed causing a bit error on disk writes every few hundred megabytes. This went on for I do not know how many months before I actually did a file copy and CMP to find the errors. Ethernet and other ports were fine.

  26. Re:YOU FAIl IT by ScrewMaster · · Score: 1

    Apparently this idiot has a bad network card. It's obviously corrupting outgoing packets.

    --
    The higher the technology, the sharper that two-edged sword.
  27. Re: ...but in the market speed sells not correctne by Anonymous Coward · · Score: 1, Interesting

    Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend on burning CDs that were no good, or running various memtests

    That's beside the point. Computers ARE fallible, with or without ECC RAM. That you think they could be perfect (infallible) is testament to the already low rate of hardware defects which harm data integrity. It's good enough. I've experienced and located infrequent defects in almost every conceivable component of a computer system. An ECC error does not mean that the RAM is faulty. It could be caused by an aging capacitor, by a badly designed mainboard or a bunch of other reasons. An error just tells you that something is wrong. You still have to look for the cause.

    If I could validate every data path in my computer at up to 20% premium, I would too. Unfortunately that is impossible, and not just because 20% is too small a premium to expect perfection. A stray particle from our sun could flip a bit in the processor and you'd be none the wiser. A seldom triggered off-by-one error in your favorite software could cause equally catastrophic mistakes as a flipped bit in main memory, and it wouldn't be caught by ECC RAM or any other available automatic integrity check. I'm not equating human fallibility to hardware problems. I'm explaining that at the current rate of faults in RAM modules, it is not the most common problem, which is precisely why it's rarely diagnosed correctly on the first try. That makes it a type of error which people don't want to pay money to avoid, as long as it can be found somehow. It turns out that it is surprisingly easy to detect too, because RAM rarely sits unused for very long, so even spurious defects show up on higher levels with a frequency that causes them to be noticed quickly. People have to be on the lookout for other defects and user errors all the time, they don't need to do anything extra to know that something is wrong when bad RAM is the cause. It just shows on a different level.

    It is much more important to have working high level checks, otherwise you're going to miss lots of flaws. That's why mission critical systems run data through redundant systems with different implementations by different people and compare the results. A "whole system parity check", if you will. RAID is designed with the same philosophy: Cheap, possibly faulty hardware is used and errors are detected on a higher level and corrected if possible. Real world systems just place the checks much closer to the user, or even beyond the user, where laws allow for correction of mistakes post factum. A flipped bit in the exponent of a financial transaction does not mean you lose a lot of money. It means you end up having to correct that error. But the real world gives you that opportunity, so you're fine with saving money by not trying to achieve infallibility.

  28. Hah-Ironing things out. by Anonymous Coward · · Score: 0

    "Attempting to recover data packed at a density of 1 GB/sq.in. from a disk spinning at 10,000 revolutions per minute where the actual data is stored in a micron thin layer of rust on the surface of the disk is manifestly impossible."

    It's not always iron oxide. Sometimes it's cobalt.

  29. Other schemes by jd · · Score: 1

    Now, as far as I know, there are many schemes for correcting and detecting errors. Some, like FEC, fix infrequent, scattered errors. Others, like turbocodes, fix sizeable blocks of errors. This leads to two questions: what is the benefit in using plain CRCs any more? And since disks are block-based not streamed, wouldn't block-based error-correction be more suitable for the disk?

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
    1. Re:Other schemes by Wesley+Felter · · Score: 1

      Now, as far as I know, there are many schemes for correcting and detecting errors. Some, like FEC, fix infrequent, scattered errors. Others, like turbocodes, fix sizeable blocks of errors. This leads to two questions: what is the benefit in using plain CRCs any more?

      CRCs are only used for detecting errors. Once you've detected a bad disk block, you can use replication (RAID 1), parity (RAID 4/5/Z), or some more advanced FEC (RAID 3/6/Z2) to correct the error. The benefit of CRCs is that you can read only the data blocks (saving bandwidth), check the CRC, and ignore the check blocks if the CRC is correct; you only need to read check blocks from disk when the CRC is wrong.
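
      A toy version of that read path, in Python: each data block carries a CRC, a single XOR parity block covers the stripe (as in RAID 4/5), and the parity is only consulted when a CRC fails. The three 8-byte "blocks" and the function names are made up for the sketch; it is not how any particular RAID implementation or ZFS lays out its data.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (single-parity RAID style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_stripe(stored, crcs, parity):
    """Return the data blocks, rebuilding at most one that fails its CRC."""
    bad = [i for i, block in enumerate(stored) if zlib.crc32(block) != crcs[i]]
    if not bad:
        return list(stored)              # common case: parity is never read
    if len(bad) > 1:
        raise IOError("more bad blocks than single parity can rebuild")
    i = bad[0]
    others = [block for j, block in enumerate(stored) if j != i]
    repaired = list(stored)
    repaired[i] = xor_blocks(others + [parity])
    return repaired


# Write path: store data blocks, their CRCs, and one parity block.
data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
crcs = [zlib.crc32(block) for block in data]
parity = xor_blocks(data)

# Later, one block comes back silently damaged.
stored = list(data)
stored[1] = b"BBBBxBBB"

recovered = read_stripe(stored, crcs, parity)
print(recovered == data)
```

      The CRC tells you which block is wrong; the parity alone could only tell you that something in the stripe is inconsistent, not where.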

  30. Much Ado About Very Little by Jane+Q.+Public · · Score: 1

    I was an early "adopter" of the Internet... and when I was on a slow dial-up line, even with checksums being done on-the-fly via hardware, and packets being re-sent willy-nilly due to insufficient transmission integrity, my data seemed to get corrupted almost as often as not.

    Today, with these "unreliable" hard drives, and (apparently, if we believe the post) less hardware checking being done, I very, very seldom receive data, or retrieve data from storage, that is detectably corrupted. My CRCs almost invariably check out after the download and storage, and I am a happy camper. :o)

    Once in a great while, I get a file or other data stream that has unacceptable corruption in it. But that has been very rare in recent years, and I have little cause to complain.

    Commercial interests that have big investments in their data might, of course, be justified in taking stiffer measures than the average consumer. Nothing new about that.

    I just don't see the problem here. The "free market" (Microsoft and certain others notwithstanding) has reached solutions that are acceptable to the customers. Where is the issue?

    1. Re:Much Ado About Very Little by Detritus · · Score: 1

      The free market doesn't always produce socially desirable results. Manufacturers can also get trapped in a race to the bottom. Just look at the current quality of floppy disk drives and their media. I can remember when they actually worked.

      --
      Mea navis aericumbens anguillis abundat
    2. Re:Much Ado About Very Little by Jane+Q.+Public · · Score: 1

      That is still the free market at work. The quality has gone down because there is no pressure to keep it up. NOBODY uses floppy disks anymore. There is no market, so there is no market pressure.

    3. Re:Much Ado About Very Little by Detritus · · Score: 1

      A case of chicken and egg. Many people stopped using them because the quality was so bad. There was a market failure in that even if you were willing to pay more, you couldn't buy stuff that worked reliably. Anyone interested in producing a quality product had left the market.

      --
      Mea navis aericumbens anguillis abundat
  31. Re: ...but in the market speed sells not correctne by ozzee · · Score: 1
    Computers ARE fallible, with or without ECC RAM.

    Yes. They are, but considerably less fallible with ECC. Remember, "I can give you the wrong answer in zero seconds." There's no point in computing at all unless there is a very high degree of confidence in computational results. Software is just as fallible as hardware but again, I can, and do, make considerable effort in making it less fallible.

    I am the default sysadmin for a family member's business. There was a time when the system was fraught with network failures on a constant basis. Printers stopped working, the WAN stopped working, etc., without any apparent explanation. So-called "experts" were called out to fix the problem, which was fixed and broken again the next day. Each and every time there was a failure, the cost was significant - huge, far more than ECC RAM. Just the time it took to rule out memory failure was more than the cost of ECC RAM. Under your fatalistic mode, he would be relegated to continual system failures. I reconfigured the systems, made some DHCP addresses virtually static and dropped the PPPoE in the box and put it on a Linux server, etc etc. It's been well over a year now and the only failure has been the occasional power interruption and my forgetting to set services to run on reboots. MUCH MUCH more reliable.

    As for flipping bits: DRAM is particularly susceptible to background radiation while the CPU data paths are not, for various reasons, and there is far more silicon devoted to memory than there is to the CPU, hence using ECC over main memory is a huge deal. I have had a lot of experience and I know that memory is a significant cause of reliability issues, hence again, the cost of ECC RAM is minuscule compared to the benefits.

    ECC - corrects errors. So even if you have a faulty bit, it is both detected AND corrected, which makes it so that you don't have to worry about something that would otherwise have stopped you in your tracks.
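
    How "detected AND corrected" works can be shown with the smallest textbook example, a Hamming(7,4) code, sketched here in Python. This is an illustration of the principle only: ECC memory commonly uses a wider code over 64-bit words that also detects double-bit errors, and the function names here are made up.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0 unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions 4,5,6,7
    return code[1:]


def decode(bits) -> tuple:
    """Return (nibble, error_position); position 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all 1 bits
    if syndrome:
        code[syndrome] ^= 1              # the syndrome names the flipped bit
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


word = encode(0b1011)
word[4] ^= 1                             # flip position 5 (index 4): a "bit flip"
value, fixed_at = decode(word)
print(value == 0b1011, fixed_at)
```

    A plain Hamming code like this one miscorrects when two bits flip in the same word, which is why practical memory ECC adds an extra parity bit to at least detect that case.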

    I find it very strange that you have a distinction between "mission critical" and some user's machine. I have a very high expectation and whenever there is a fault, I will diagnose it to remove it. Anything which is unrepeatable wastes a lot of my time and so ECC RAM removes a large set of those problems - for an extra 20% investment, it's not even above the noise of other things.

  32. ECC memory & Intel by Anonymous Coward · · Score: 0

    A serious weakness in modern PCs is the lack of ECC memory. I think this is caused primarily by Intel. To create market segmentation, Intel's mainstream chipsets (i815, i845, i865, i915, i945, i965 and later) do not support ECC memory. I believe this is actually market segmentation, and not a real cost reduction, because all mainstream chipsets before the i815, like the i440 LX & BX, support ECC.
    A side effect of this is that it's now very expensive to build a home PC with ECC memory, because you now have to buy an expensive mainboard with Intel's premium chipset (i875, i955, i975, etc.) to have ECC support.

  33. I have seen this many times, unfortunately. :-( by Terje+Mathisen · · Score: 2, Interesting

    We have 500+ servers worldwide, many of which contain the same program install images, which by definition should be identical:

    One master, all the others are copies.

    Starting maybe 15 years ago, when these directory structures were in the single-digit GB range, we started noticing strange errors, and after running full block-by-block compares between the master and several slave servers we determined that we had end-to-end error rates of about 1 in 10 GB.

    Initially we solved this by doubling the network load, i.e. always doing a full verify after every copy, but later on we found that keeping the same hw, but using sw packet checksums, was sufficient to stop this particular error mechanism.
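
    A "full verify after every copy" can be sketched in Python with a strong hash instead of a block-by-block compare. The file names and sizes below are invented, and the copy is local for the sake of a runnable example; over a network you would compute the digest on each end and compare only the digests, which avoids sending the data twice.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_and_verify(src: str, dst: str) -> bool:
    """Copy src to dst, then re-read both and compare their digests."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)


with tempfile.TemporaryDirectory() as tmp:
    master = os.path.join(tmp, "master.img")     # hypothetical install image
    replica = os.path.join(tmp, "replica.img")
    with open(master, "wb") as f:
        f.write(os.urandom(1 << 16))

    copy_ok = copy_and_verify(master, replica)

    # Simulate a silently flipped bit in the replica and check again.
    with open(replica, "r+b") as f:
        f.seek(100)
        byte = f.read(1)
        f.seek(100)
        f.write(bytes([byte[0] ^ 1]))
    damage_detected = sha256_of(master) != sha256_of(replica)

print(copy_ok, damage_detected)
```

    One caveat: a read-back immediately after the write may be served from the operating system's cache rather than from the disk, so a verify like this says more about the transfer path than about the medium.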

    One of the errors we saw was a data block where a single byte was repeated, overwriting the real data byte that should have followed it. This is almost certainly caused by a timing glitch which over-/under-runs a hardware FIFO. Having 32-bit CRCs on all Ethernet packets as well as 16-bit TCP checksums doesn't help if the path across the PCI bus is unprotected and the TCP checksum has been verified on the network card itself.

    Since then our largest volume sizes have increased into the 100 TB range, and I do expect that we now have other silent failure mechanisms: Basically, any time/location when data isn't explicitly covered by end-to-end verification is a silent failure waiting to happen. On disk volumes we try to protect against this by using file systems which can protect against lost writes as well as misplaced writes (i.e. the disk reports writing block 1000, but in reality it wrote to block 1064 on the next cylinder).

    NetApp's WAFL is good, but I expect Sun's ZFS to do an equally good job at a significantly lower cost.

    Terje

    --
    "almost all programming can be viewed as an exercise in caching"
    1. Re:I have seen this many times, unfortunately. :-( by saik0max0r · · Score: 1

      NetApp's WAFL is good, but I expect Sun's ZFS to do an equally good job at a significantly lower cost.

      Hard to say for certain, as comparing WAFL and ZFS ignores the often overlooked but additional integration that you get with a Filer as opposed to a more general purpose system running ZFS, which is still pretty green when it comes to this sort of stuff:

      http://mail.opensolaris.org/pipermail/zfs-discuss/2006-November/036124.html.

      It's been my experience that data corruption typically occurs in RAM (ECC), at the HBA, in the cabling, or at the drive level itself. The difference in firmware behavior between an integrated system and some drive you bought at Fry's is where true "End to End" data protection comes into play.

      Adding up that level of integration makes the file system merely a component in a very very expensive system ;)

  34. BS by Anonymous Coward · · Score: 0

    If data corruption was so possible, you would get segfaults all the time because loaded binaries would have wrong machine code bytes in them. And even 1 bit wrong in assembly is enough for segv.

    Every time you downloaded your fav *.unubtu package, it would be corrupted, so either bzip would fail or the program would crash if you tried to run it. Not to mention configuration files and the fact that debugging would be impossible in such a case.

    But I've never seen such segfaults on Linux, except on a box where the motherboard was damaged.

    So alan is full of it.

  35. Re:Iffy cards? Try crappy drivers by cnettel · · Score: 1

    Simple: the bus can be faulty (or the connection NIC-bus). The memory can go bad. If you do checksum offloading, you only verify the integrity at the endpoint in your machine. If you move it to the CPU, the path for data where it may be corrupted is shortened.

  36. Girlfriend? by Anonymous Coward · · Score: 0

    You must be new here.

    1. Re:Girlfriend? by StarfishOne · · Score: 2, Funny

      No, girlfriend waveforms can collapse in such a way that one can actually have one. This may not happen often when combined with the /. waveform.. but every now and then it does happen. ;)

  37. Re: ...but in the market speed sells not correctne by 51mon · · Score: 1

    This mentality of speed at the cost of correctness is prevalent...

    I used to sell firewalls. People always wanted to know how fast it would work (most were good up to around 100Mbps, when most people had 2Mbps pipes at most). Very few people asked detailed questions about what security policies it could enforce, or about the correctness and security of the firewall device itself.

    Everyone knew they needed something, but very few had a clue about selecting a good product. Speed they understood; network security in comparison is pretty tough. Other forms of correctness are, I think, also more difficult to comprehend.

    How many people know the safety rating of their automobile? Okay probably the wrong people to ask.

  38. Re: ...but in the market speed sells not correctne by Anonymous Coward · · Score: 0

    How many people know the safety rating of their automobile?

    Excellent example. If driving were so dangerous that buying a car for maximum safety would be commonplace, then people wouldn't drive. The general safety of cars is high enough that you don't need to worry about how safe a particular car is exactly. Same with RAM. If people thought that the defect rate of RAM warranted ECC RAM, then they wouldn't trust their computers at all.

  39. oblig xkcd by Anonymous Coward · · Score: 0
  40. Faulty offloading by Tribbles · · Score: 1

    I came across a NIC with faulty offloading. It was at a customer's site, and it took a month to diagnose.

    The only way I found out was with an Ethereal trace at each end - I could see that every 0x2000 bytes there was a corruption. We turned TCP segment offloading off, and it worked fine because the maximum packet size was 1536 bytes - before 0x2000.
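
    A rough sketch of how the two traces could be compared once the payloads are extracted, assuming both ends yield byte strings of equal length. The function names and the synthetic data are made up for illustration; this is not the procedure actually used at the customer's site.

```python
from functools import reduce
from math import gcd


def diff_offsets(sent, received):
    """Offsets at which two equal-length byte strings differ."""
    return [i for i, (a, b) in enumerate(zip(sent, received)) if a != b]


def common_stride(offsets):
    """GCD of the gaps between consecutive offsets (0 if fewer than two)."""
    gaps = [b - a for a, b in zip(offsets, offsets[1:])]
    return reduce(gcd, gaps, 0)


# Synthetic payload with one damaged byte at every 0x2000 boundary.
sent = bytes(i % 251 for i in range(0x8000))
received = bytearray(sent)
for off in range(0x2000, 0x8000, 0x2000):
    received[off] ^= 0xFF

offsets = diff_offsets(sent, bytes(received))
print([hex(o) for o in offsets], hex(common_stride(offsets)))
```

    A stride that lines up with a power of two, as here, points at a buffer or segment boundary rather than random line noise.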

  41. Even Microsoft tried to push for ECC RAM by Sits · · Score: 1

    The perils of RAM just seem to be one of those open secrets. Apparently even Microsoft has tried pushing for ECC RAM in all machines (including desktops) as memory errors have risen to the top 10 causes of system crashes according to their crash analysis.

    Earlier this decade I was living with strange, random crashes when booting Linux that would seemingly only occur when booting from cold (but not every time!). It was only years later when running a memtest on someone else's sticks (which turned out to be fine in their machine) that I learned that the motherboard was unhappy with the timing settings even though the RAM was rated as being compatible with said settings.

    A few years later on a different machine I was MD5summing a big file locally and the sum turned out to be different to what was expected. For whatever reason I tried doing an MD5sum of the same file over a network filesystem (as the server was running Samba) and was surprised to find that the sum came to the expected result. Upon rerunning the MD5sum locally the sum was again incorrect... in a manner that was different to the first try. Repeated runs on the same unchanging file were giving different MD5sums. Running a memtest went on to show the memory was dodgy.

    I have also seen a not inexpensive system throw up SCSI errors and effectively make the disks disappear from the operating system while doing RAID 5 with a hardware SCSI RAID controller after a single disk failure. Apparently if a SCSI disk fails in a particularly rare manner it is possible for it to keep the bus busy and thus disable communication with any other drives on the same bus. Thankfully the backups worked (the filesystem had been corrupted) but if your data is important you can never be too careful. As disk and memory sizes go up along with the rates at which data is transferred, the chances of rare circumstances like bit flips happening only increase. The question is: will you notice the problem before it's too late, and how much are you willing to pay (in terms of money and performance) to reduce the odds?
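
    The repeated-checksum test described above can be sketched as follows. The helper names and the temporary sample file are assumptions for illustration. Note that repeated reads may be served from the OS cache rather than the disk, so agreement here does not prove the hardware is healthy.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_agree(path, runs=3):
    """Hash the same file several times; True if every run gives the same digest."""
    return len({md5_of(path) for _ in range(runs)}) == 1


# Demo on a throwaway file; on sound hardware the runs should agree.
sample = os.urandom(1 << 16)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

print(digests_agree(sample_path))
```

    Disagreement between runs on an unchanging file, as in the story above, is a strong hint to run a memory test.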

    1. Re:Even Microsoft tried to push for ECC RAM by Anonymous Coward · · Score: 0

      That happens with direct-connect drives like the hot-swappable 80-pin SCA and drives connected via cable(s). Older drives that had logic between the hard drive and the edge connector didn't fail this way. You are correct, a drive that fails in a particular way will grab ahold of the bus and not let go, puking the array. Often, pulling the offending drive will allow the array to go online and a new drive can be inserted and rebuilt. Usually, the array goes offline before any data corruption can occur.

  42. data corruption happens often and easily by treat · · Score: 1

    Refusing to implement integrity checks at every level is data mismanagement.

    The filesystem should provide this.

    Linux people have been denying for years that hardware will cause data corruption. Therefore they can deny their own responsibility in detecting and correcting it.

    It is everyone's responsibility to make OS people aware of how often hardware causes data corruption.

    http://www.storagetruth.org/index.php/2006/data-corruption-happens-easily/

  43. Comment removed by account_deleted · · Score: 1

    Comment removed based on user account deletion

  44. Been there, seen that by Anonymous Coward · · Score: 0

    I used to work for a big data network company (you kids never heard of it), where the marketing department quoted the theoretical error rate of the CRC used on the packets as the actual error rate for the network. I got my first evidence this might be optimistic when I saw a chunk of text obviously from (big important corporate customer) mixed into the middle of text I was reading on my terminal in manufacturing. Well, I was just a tech, nobody I mentioned it to showed any sign of interest, so what could I do?

    Years later I learned that a defect in the backplane of one of our packet switch models had caused problems in the network for quite a while before it was discovered and fixed.

    Well, packets came in, CRC was verified then stripped, packets were disassembled and routed, new packets were assembled, new CRC calculated and attached, packets went out.

    What could go wrong?
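
    A toy model of the sequence above, assuming a CRC-32 trailer on each frame and a switch that recomputes it on the way out. All names here are hypothetical, and a SHA-256 digest stands in for whatever end-to-end check the endpoints might have used.

```python
import hashlib
import zlib


def frame(payload):
    """Append a 4-byte CRC-32 trailer to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(framed):
    """Verify and strip the CRC-32 trailer."""
    payload, crc = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != crc:
        raise ValueError("link CRC mismatch")
    return payload


def switch(framed, corrupt=False):
    """Verify and strip the CRC, optionally damage the payload, then re-frame."""
    payload = bytearray(unframe(framed))
    if corrupt:
        payload[0] ^= 0x01  # stands in for a backplane defect
    return frame(bytes(payload))


message = b"customer data"
end_to_end = hashlib.sha256(message).digest()

delivered = unframe(switch(frame(message), corrupt=True))  # link CRC passes
print(delivered == message)
print(hashlib.sha256(delivered).digest() == end_to_end)
```

    The outgoing CRC is computed over the already damaged payload, so every link check passes and only a check carried from sender to receiver notices the change.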

  45. Re: ...but in the market speed sells not correctne by ozzee · · Score: 1
    If people thought that the defect rate of RAM warranted ECC RAM, then they wouldn't trust their computers at all.

    Sarcasm, right?

  46. IBM System p machines have extensive ECC checks by Anonymous Coward · · Score: 0

    There was some IBM paper on how they made it that way. They have RAM data scrubbing and the disks have 520-byte sectors. All major paths, even inside the processor, are ECC checked.

    A recent edition of IEEE Spectrum has extensive information and links about this in multi-CPU machines.

    Unfortunately the same is not true for their System x (x86) machines.

  47. Glitch on the bus, Gus! by billstewart · · Score: 2, Funny
    Burned out fan, Stan.

    ...


    You can make up some more yourself....

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks