The Many Paths To Data Corruption
Runnin'Scared writes "Linux guru Alan Cox has a writeup on KernelTrap in which he talks about all the possible ways for data to get corrupted when being written to or read from a hard disk drive. Much of the information applies to all operating systems. He prefaces his comments noting that the details are entirely device specific, then dives right into a fascinating and somewhat disturbing path tracing data from the drive, through the cable, into the bus, main memory and CPU cache. He also discusses the transfer of data via TCP and cautions, 'unfortunately lots of high performance people use checksum offload which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic but in the market speed sells not correctness.'"
The most common way for young data to be corrupted is to be saved on a block that once contained pornographic data. As we all know, deleting data alone is not sufficient, as that will only remove the pointer to the data while leaving the block containing it undisturbed. This allows a young piece of data to easily see the old porn data as it is being written to that block. For this reason, it is imperative that you keep all pornographic data on separate physical drives.
In addition, you should never access young data and pornographic data in the same session, as the young impressionable data may get corrupted by the pornographic data if they exist in RAM at the same time.
Data corruption is a serious problem in computing today, and it is imperative that we take steps to stop our young innocent data from being corrupted.
There must be 50 ways to lose your data.
Oh, say does that Star-Spangled Banner entwine / The myrtle of Venus with Bacchus's vine?
As Alan Cox alluded to, there are benchmarks for data transfers, web performance, etc., but none for data integrity; it's kind of assumed, even if it perhaps shouldn't be. It also reminds me of various cluster software which will happily crash a node rather than risk data corruption (Sun Cluster & Oracle RAC both do this). What do you *really* want? Lightning fast performance, or the comfort of knowing that your data is intact & correct? For something like a rendering farm, you can probably tolerate a pixel or two being the wrong shade. If you're dealing with money, you want the data to be 100% correct, otherwise there's a world of hurt waiting to happen...
Some enterprise server systems use end-to-end protection, meaning the data block is longer. If you write 512 bytes of data + 12 bytes or so of check data and carry that through all of the layers, it can prevent the data corruption from going undiscovered. The check data usually includes the block's address, so that data written with correct CRC but in the wrong place will also be discovered. It is bad enough to have data corrupted by a hardware failure, much worse not to detect it.
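To make the idea above concrete, here is a minimal sketch of a 512-byte sector carrying 12 bytes of check data. The trailer layout (an 8-byte block address plus a CRC-32) and the names `protect`, `verify` and `IntegrityError` are assumptions made for illustration, not the format any particular enterprise system uses.

```python
import struct
import zlib

SECTOR_SIZE = 512
# Hypothetical 12-byte trailer: 8-byte block address + 4-byte CRC-32.
TRAILER = struct.Struct("<QI")


class IntegrityError(Exception):
    """Raised when a block fails its end-to-end check."""


def protect(lba: int, data: bytes) -> bytes:
    """Append check data covering both the payload and its address."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    crc = zlib.crc32(struct.pack("<Q", lba) + data)
    return data + TRAILER.pack(lba, crc)


def verify(expected_lba: int, block: bytes) -> bytes:
    """Return the payload, or raise if it is damaged or misplaced."""
    if len(block) != SECTOR_SIZE + TRAILER.size:
        raise ValueError("unexpected block length")
    data, trailer = block[:SECTOR_SIZE], block[SECTOR_SIZE:]
    stored_lba, stored_crc = TRAILER.unpack(trailer)
    if stored_lba != expected_lba:
        # A valid CRC in the wrong place is still an error.
        raise IntegrityError(
            f"block written for address {stored_lba}, read at {expected_lba}"
        )
    if zlib.crc32(struct.pack("<Q", stored_lba) + data) != stored_crc:
        raise IntegrityError("CRC mismatch")
    return data
```

Because the address is folded into the check data, a block that was written intact but landed at the wrong address fails `verify` even though its CRC is internally consistent.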
Intron: the portion of DNA which expresses nothing useful.
ZFS's end-to-end checksums detect many of these types of corruption; as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.
I am looking forward to the day when all RAM has ECC and all filesystems have checksums.
I think I suffered from a series of Type III errors (rtfa). After merging lots of poorly maintained backups of my /home file system I decided to write a little script to look for duplicate files (using file size as a first indicator, then md5 for ties). The script would identify duplicates and move files around into a more orderly structure based on type, etc. After doing this I noticed that a small number of my MP3s now contain chunks of other songs in them. My script was only working with whole files, so I have no idea how this happened. When I refer back to the original copies of the MP3s the files are uncorrupted.
Of course, no one believes me. But maybe this presentation is on to something. Or perhaps I did something in a bonehead fashion totally unrelated.
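For reference, the size-then-md5 approach described in this comment could look roughly like the sketch below. It is not the poster's script: the function names are made up, and it only reports duplicate groups rather than moving anything.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root with identical size and MD5."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_md5(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

A script like this reads every candidate file through RAM and the disk controller, so if either is flaky, the hashing pass itself is one place where silent corruption could show up.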
:wq ~ ~ ~ ~ ~
I was expecting an article on using MySQL in production.
That'll get fixed lickety split.
as long as ZFS itself, the CPU, and RAM are working correctly, no other errors can corrupt ZFS data.
Sorry, but that is absurd. Nothing can absolutely protect against data errors (even if they only happen in the hard disk). For example, errors can corrupt ZFS data in a way that turns out to have the same checksum. Or errors can corrupt both the data and the checksum so they match each other.
This is ECC 101 really.
That is nothing compared to the actual storage technology. Attempting to recover data packed at a density of 1 GB/sq.in. from a disk spinning at 10,000 revolutions per minute where the actual data is stored in a micron thin layer of rust on the surface of the disk is manifestly impossible.
Ah - this is the bane of computer technology.
One time I remember writing some code and it was very fast and almost always correct. The guy I was working with exclaimed "I can give you the wrong answer in zero seconds" and I shut up and did it the slower way that was right every time.
This mentality of speed at the cost of correctness is prevalent. For example, I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM. I have assembled many computers and unfortunately there have been a number of times where ECC memory was not an option. In almost every case where I have used ECC memory, the computer was noticeably more stable. Case in point, the most recent machine that I built has never (as far as I know) crashed and I've thrown some really nasty workloads its way. On the other hand, a couple of notebooks I have have crashed more often than I care to remember and there is no ECC option. Not to mention the ridicule I get for suggesting that people invest the extra $30 for a "non server" machine. Go figure. Suggesting that stability is the realm of "server" machines, and inferring that end user machines should be relegated to a realm of lowered standards of reliability, makes very little sense to me, especially when the investment of $30 to $40 is absolutely minuscule if it prevents a single failure. What I think (see lawsuit coming on) is that memory manufacturers will sell marginal-quality products to the non ECC crowd because there is no way of validating memory quality.
I think there needs to be a significant change in the marketing of products to ensure that metrics of data integrity play a more significant role in decision making. It won't happen until the consumer demands it and I can't see that happening any time soon. Maybe, hopefully, I am wrong.
I remember a long time ago that cosmic rays (actually the ElectroMagnetic Field disruption they caused) created some of those errors.
It amazes me how much has been lost over the years towards the "consumerization" of computers.
Large mainframe systems have had data integrity problems solved for a long, long time. It is today unthinkable that any hardware issues or OS issues could corrupt data on IBM mainframe systems and operating systems.
Personal computers, on the other hand, have none of the protections that have been present since the 1970s on mainframes. Yes, corruption can occur anywhere in the path from the CPU to the physical disk itself or during a read operation. There is no checking, period. And not only are failures unlikely to be quickly detected but they cannot be diagnosed to isolate the problem. All you can do is try throwing parts at the problem, replacing functional units like the disk drive or controller. These days, there is no separate controller - it's on the motherboard - so your "functional unit" can almost be considered to be the computer.
How often is data corrupted on a personal computer? It is clear it doesn't happen all that often, but in the last forty years or so we have actually gone backwards in our ability to detect and diagnose such problems. Nearly all businesses today are using personal computers to at least display information if not actually maintain and process it. What assurance do you have that corruption is not taking place? None, really.
A lot of businesses have few, if any, checks that would point out problems that could cost thousands of dollars because of a changed digit. In the right place, such changes could lead to penalties, interest and possible loss of a key customer.
Why have we gone backwards in this area when compared to a mainframe system of forty years ago? Certainly software has gotten more complex but basic issues of data integrity have fallen by the wayside. Much of this was done in hardware previously. It could be done cheaply in firmware and software today with minimal cost and minimal overhead. But it is not done.
It's well known that ECC and other forms of error correction are found at all levels of software and hardware. For example, hard drives have their own internal error correction while the file system they're formatted with may have another. Also worth mentioning are the CPU, serial buses, network adapters (both the physical IEEE 802.x connection and the TCP/IP stack) and other forms of software error correction.
Basically, the modern computer has various hardware and software layers of error correction stacked on top of each other if not at least by themselves.
We do have a weak link with desktops regarding RAM, however. While modern workstations and servers are generally installed with ECC RAM, our desktops are not. Also worth mentioning, most custom built clone PCs are for the desktop market. This has become a huge problem given the voltage and timing requirements don't leave much room for tolerance. The fact that memory density has been going up only makes the chances for "bit flips" even worse. I can't tell you how many times I've run into data corruption due to improper RAM settings. Running a few passes with Memtest86+ will reveal this nasty issue. Hell, even Windows Vista now includes a utility to check for faulty RAM read/write issues; that's how big the problem has become in the industry. As such, the desktop market severely needs to embrace ECC RAM like the server and workstation market. These days, to not use ECC is asking for trouble. And yes, you would take a 1 to 2% performance hit, but so what; data integrity is more important.
Note: The newer Intel P965 chipset does not support ECC memory while their older 965x does. Crying shame too given the P965 has been designed for Core 2 Duo and Quad Core CPUs.
Life is not for the lazy.
Doesn't checksum offload mean that the functionality gets offloaded to another device, like, say, an expensive NIC, and thus removes that overhead from the CPU?
Can anyone point me toward some information on the hit to CPU and I/O throughput for scrubbing?
The subject is the comment.
I can't understand why people don't spend the extra money on ECC memory *ALL THE TIME*. One failure over the lifetime of the computer and you have paid for your RAM.
I do understand it. They live in the real world, where computers are fallible, no matter how much you spend on data integrity. It's a matter of diminishing return. Computers without ECC are mostly stable and when they're not, they typically exhibit problems on a higher level. I've had faulty RAM once. Only one bit was unstable and only one test of the many Memtest routines triggered the defect. Even a fault that small caused problems with every other verified CD burning. Given that lots of other reasons can cause data integrity violations, many of which can't be avoided because they're rooted in the imperfections of human nature, it is more effective to have procedures in place to deal with problems than to avoid them 100%.
As I sit here having just finished restoring NTLDR to my RAID 0 drive after the thing failed to boot. I compared the original file and the replacement, and they were off by ONE BIT.
The higher the technology, the sharper that two-edged sword.
Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend on burning CDs that were no good, or running various memtests, not to mention the possible corrupted data you ended up saving and other unknown consequences? Had you bought ECC RAM, your problem would have been corrected or more than likely detected, not to mention that the memory manufacturers would need to push up their quality standards. The extra money for ECC would become minuscule if everyone bought only ECC RAM.
Equating human fallibility to physical fallibility makes little sense. If we sought to use the same standards you're proposing for electrical engineering to say civil engineering, it would extend to it being OK for a building to topple because they're "fallible" - not good.
If I could validate every data path in my computer at up to a 20% premium I would. That's better than an insurance policy. It happens to be that RAM is especially susceptible to errors that are very difficult to diagnose or even repeat and so reducing the probability of such errors is desirable, and at a small price like ECC RAM, it's a bargain of an insurance policy.
BTW, I'm not saying that you don't put procedures in place to deal with hardware failure. I'm saying that treating the problem may be far more effective than treating the symptoms.
Give this blog entry a read:
:)
http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
And you'll understand
TFA doesn't list ALL the possible ways data can be corrupted. It fails to mention the scenario of Dark Data (an evil mirror of your data, happens more commonly with RAID 1) corrupting your data with Phazon. In this case, the only way to repair the corruption is to send your data on a quest to different hard drives across the world (nay, the GALAXY) to destroy the Seeds that spread the corruption.
WHO NEEDS SHIFT WHEN YOU HAVE CAPSLOCK/ DAMN1
Back in the good old MS-DOS days I managed to get one of the first VESA Local Bus IDE cards, which promised great transfer rates over ISA cards. Well, I played with the jumper settings to enable fast DMA transfers but the hardware (or driver) was not up to spec... I booted up and the first thing I did was issue some "dir /s /p" commands (that was the unofficial visual speed test back then). I noticed that the listings got more corrupted with each pass (ouch), and everything got screwed, so I had to go back to the default jumper settings and let the format begin... LOL
I previously had a Shuttle desktop machine running Windows XP. One day I started noticing that when I copied files to a network file server, about 1 out of 20 or so would get corrupted, with larger files getting corrupted more often than smaller ones. Copying them to the local IDE hard drive caused no problems, and other machines did not have problems copying files to the same file server. I spent a lot of time swapping networking cards, etc. and not getting anywhere, until I plugged in a USB drive and noticed that files were also getting corrupted when copied to it.
I then ran tests with large random files, doing diffs between the originals and the copies. The errors were always single bytes that had changed; the file size never changed. Interestingly, whenever there was a changed byte, the seventh and eighth bytes preceding the error were always the same values, although having those two values next to each other in a file did not always cause the error. The problem turned out to be a bad motherboard; the data path to some destinations like the NIC and USB ports would corrupt data, while the path to the IDE connectors would not.
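A comparison like the one described can be sketched in a few lines. The function below is hypothetical, not the tool the poster used; it reports each differing offset together with the bytes that preceded it in the original, which is the kind of context that would expose a pattern such as "the seventh and eighth bytes before the error are always the same".

```python
def find_changed_bytes(original: bytes, copy: bytes, context: int = 8):
    """List (offset, original_byte, copied_byte, preceding_bytes) per mismatch."""
    if len(original) != len(copy):
        raise ValueError("file sizes differ; this check assumes equal length")
    mismatches = []
    for offset, (a, b) in enumerate(zip(original, copy)):
        if a != b:
            start = max(0, offset - context)
            mismatches.append((offset, a, b, original[start:offset]))
    return mismatches
```

Feeding it the contents of a random test file and its copy (for example via `open(path, "rb").read()`) gives a list that can be sorted or grouped to look for recurring offsets or byte patterns.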
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Apparently this idiot has a bad network card. It's obviously corrupting outgoing packets.
The higher the technology, the sharper that two-edged sword.
Computers are machines and don't need to be designed to be fallible. ECC is a small insurance policy to avoid problems exactly like the one you described. How much time did you spend on burning CDs that were no good, or running various memtests
That's beside the point. Computers ARE fallible, with or without ECC RAM. That you think they could be perfect (infallible) is testament to the already low rate of hardware defects which harm data integrity. It's good enough. I've experienced and located infrequent defects in almost every conceivable component of a computer system. An ECC error does not mean that the RAM is faulty. It could be caused by an aging capacitor, by a badly designed mainboard or a bunch of other reasons. An error just tells you that something is wrong. You still have to look for the cause.
If I could validate every data path in my computer at up to 20% premium, I would too. Unfortunately that is impossible, and not just because 20% is too small a premium to expect perfection. A stray particle from our sun could flip a bit in the processor and you'd be none the wiser. A seldom triggered off-by-one error in your favorite software could cause equally catastrophic mistakes as a flipped bit in main memory, and it wouldn't be caught by ECC RAM or any other available automatic integrity check. I'm not equating human fallibility to hardware problems. I'm explaining that at the current rate of faults in RAM modules, it is not the most common problem, which is precisely why it's rarely diagnosed correctly on the first try. That makes it a type of error which people don't want to pay money to avoid, as long as it can be found somehow. It turns out that it is surprisingly easy to detect too, because RAM rarely sits unused for very long, so even spurious defects show up on higher levels with a frequency that causes them to be noticed quickly. People have to be on the lookout for other defects and user errors all the time, they don't need to do anything extra to know that something is wrong when bad RAM is the cause. It just shows on a different level.
It is much more important to have working high level checks, otherwise you're going to miss lots of flaws. That's why mission critical systems run data through redundant systems with different implementations by different people and compare the results. A "whole system parity check", if you will. RAID is designed with the same philosophy: Cheap, possibly faulty hardware is used and errors are detected on a higher level and corrected if possible. Real world systems just place the checks much closer to the user, or even beyond the user, where laws allow for correction of mistakes post factum. A flipped bit in the exponent of a financial transaction does not mean you lose a lot of money. It means you end up having to correct that error. But the real world gives you that opportunity, so you're fine with saving money by not trying to achieve infallibility.
"Attempting to recover data packed at a density of 1 GB/sq.in. from a disk spinning at 10,000 revolutions per minute where the actual data is stored in a micron thin layer of rust on the surface of the disk is manifestly impossible."
It's not always iron oxide. Sometimes it's cobalt.
Now, as far as I know, there are many schemes for correcting and detecting errors. Some, like FEC, fix infrequent, scattered errors. Others, like turbocodes, fix sizeable blocks of errors. This leads to two questions: what is the benefit in using plain CRCs any more? And since disks are block-based not streamed, wouldn't block-based error-correction be more suitable for the disk?
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
I was an early "adopter" of the Internet... and when I was on a slow dial-up line, even with checksums being done on-the-fly via hardware, and packets being re-sent willy-nilly due to insufficient transmission integrity, my data seemed to get corrupted almost as often as not.
:o)
Today, with these "unreliable" hard drives, and (apparently, if we believe the post) less hardware checking being done, I very, very seldom receive data, or retrieve data from storage, that is detectably corrupted. My CRCs almost invariably check out after the download and storage, and I am a happy camper.
Once in a great while, I get a file or other data stream that has unacceptable corruption in it. But that has been very rare in recent years, and I have little cause to complain.
Commercial interests that have big investments in their data might, of course, be justified in taking stiffer measures than the average consumer. Nothing new about that.
I just don't see the problem here. The "free market" (Microsoft and certain others notwithstanding) has reached solutions that are acceptable to the customers. Where is the issue?
Yes. They are, but considerably less fallible with ECC. Remember, "I can give you the wrong answer in zero seconds." There's no point in computing at all unless there is a very high degree of confidence in computational results. Software is just as fallible as hardware but again, I can, and do, make considerable effort in making it less fallible.
I am the default sysadmin for a family member's business. There was a time when the system was fraught with network failures on a constant basis. Printers stopped working, the WAN stopped working, etc., without any apparent explanation. So called "experts" were called out to fix the problem, which was fixed one day and broken again the next. Each and every time there was a failure, the cost was significant - huge, far more than ECC RAM. Just the time it took to rule out memory failure was more than the cost of ECC RAM. Under your fatalistic mode, he would be relegated to continual system failures. I reconfigured the systems, made some DHCP addresses virtually static and dropped the PPPoE in the box and put it on a Linux server, etc etc. It's been well over a year now and the only failure has been the occasional power interruption and my forgetting to set services to run on reboots. MUCH MUCH more reliable.
As for flipping bits: DRAM is particularly susceptible to background radiation while the CPU data paths are not, for various reasons, and there is far more silicon devoted to memory than there is to the CPU, hence using ECC over main memory is a huge deal. I have had a lot of experience and I know that memory is a significant cause of reliability issues, hence again, the cost of ECC RAM is minuscule compared to the benefits.
ECC - corrects errors. So even if you have a faulty bit, it is both detected AND corrected, which makes it so that you don't have to worry about something that would otherwise have stopped you in your tracks.
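As a toy illustration of "detected AND corrected", here is a Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can repair any single flipped bit. This is only a sketch of the principle: ECC memory modules typically use a wider SECDED code over 64-bit words, and the function names here are made up for the example.

```python
def encode(data_bits):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of the flipped bit (or 0 if all parity checks pass).
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position
```

Flipping any one of the seven bits of `encode([1, 0, 1, 1])` and passing the result to `decode` returns the original four bits along with the position that was repaired. Two simultaneous flips would be miscorrected by this small code, which is why real memory ECC adds an extra parity bit to at least detect double errors.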
I find it very strange that you have a distinction between "mission critical" and some user's machine. I have a very high expectation and whenever there is a fault, I will diagnose it to remove it. Anything which is unrepeatable wastes a lot of my time and so ECC RAM removes a large set of those problems - for an extra 20% investment, it's not even above the noise of other things.
A serious weakness in modern PCs is the lack of ECC memory. I think this is caused primarily by Intel. To create market segmentation Intel's mainstream chipsets (i815, i845, i865, i915, i945, i965 and later) do not support ECC memory. I believe this is actually market segmentation, and not a real cost reduction, because all mainstream chipsets before the i815, like the i440 LX & BX, support ECC.
A side effect of this is that it's now very expensive to build a home PC with ECC memory, because you now have to buy an expensive mainboard with Intel's premium chipset (i875, i955, i975, etc.) to have ECC support.
We have 500+ servers worldwide, many of them contain the same program install images which by definition should be identical:
One master, all the others are copies.
Starting maybe 15 years ago, when these directory structures were in the single-digit GB range, we started noticing strange errors, and after running full block-by-block compares between the master and several slave servers we determined that we had end-to-end error rates of about 1 in 10 GB.
Initially we solved this by doubling the network load, i.e. always doing a full verify after every copy, but later on we found that keeping the same hw, but using sw packet checksums, was sufficient to stop this particular error mechanism.
One of the errors we saw was a data block where a single byte was repeated, overwriting the real data byte that should have followed it. This is almost certainly caused by a timing glitch which over-/under-runs a hardware FIFO. Having 32-bit CRCs on all Ethernet packets as well as 16-bit TCP checksums doesn't help if the path across the PCI bus is unprotected and the TCP checksum has been verified on the network card itself.
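The software checksum idea can be sketched as follows: compute the 16-bit Internet checksum (the one's-complement sum TCP uses, as described in RFC 1071) in host memory, after the data has crossed the bus, rather than trusting the NIC's verdict. The `repeat_byte_glitch` helper is a made-up stand-in for the FIFO failure described above.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


def repeat_byte_glitch(data: bytes, index: int) -> bytes:
    """Hypothetical fault: the byte at index is repeated over the next byte."""
    buf = bytearray(data)
    buf[index + 1] = buf[index]
    return bytes(buf)
```

If the sender's checksum over a payload is compared with `internet_checksum` recomputed on the receiving host's copy, a glitch like `repeat_byte_glitch(payload, 100)` on `payload = bytes(range(256))` changes the result and is caught. The same comparison done on the network card would pass, because the damage happens afterwards.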
Since then our largest volume sizes have increased into the 100 TB range, and I do expect that we now have other silent failure mechanisms: Basically, any time/location when data isn't explicitly covered by end-to-end verification is a silent failure waiting to happen. On disk volumes we try to protect against this by using file systems which can protect against lost writes as well as misplaced writes (i.e. the disk reports writing block 1000, but in reality it wrote to block 1064 on the next cylinder).
NetApp's WAFL is good, but I expect Sun's ZFS to do an equally good job at a significantly lower cost.
Terje
"almost all programming can be viewed as an exercise in caching"
If data corruption was so likely, you would get segfaults all the time because loaded binaries would have wrong machine code bytes in them. And even 1 bit wrong in assembly is enough for a segv.
Every time you downloaded your fav *.unubtu package, it would be corrupted, so either bzip would fail or the program would crash if you tried to run it. Not to mention configuration files and the fact that debugging would be impossible in such a case.
But I've never seen such segfaults on Linux, except on a box where the motherboard was damaged.
So Alan is full of it.
Simple: the bus can be faulty (or the connection NIC-bus). The memory can go bad. If you do checksum offloading, you only verify the integrity at the endpoint in your machine. If you move it to the CPU, the path for data where it may be corrupted is shortened.
You must be new here.
I used to sell firewalls. People always wanted to know how fast it would work (most were good up to around 100Mbps, when most people had at most 2Mbps pipes); very few people asked detailed questions about what security policies it could enforce, or the correctness and security of the firewall device itself.
Everyone knew they needed something, but very few had a clue about selecting a good product. Speed they understood; network security in comparison is pretty tough. Other forms of correctness are, I think, also more difficult to comprehend.
How many people know the safety rating of their automobile? Okay probably the wrong people to ask.
How many people know the safety rating of their automobile?
Excellent example. If driving were so dangerous that buying a car for maximum safety would be commonplace, then people wouldn't drive. The general safety of cars is high enough that you don't need to worry about how safe a particular car is exactly. Same with RAM. If people thought that the defect rate of RAM warranted ECC RAM, then they wouldn't trust their computers at all.
http://xkcd.com/54/
I came across a NIC with faulty offloading. It was at a customer's site, and it took a month to diagnose.
The only way I found out was with an Ethereal trace at each end - I could see that every 0x2000 bytes there was a corruption. We turned TCP segment offloading off, and it worked fine because the maximum packet size was 1536 bytes - before 0x2000.
The perils of RAM just seem to be one of those open secrets. Apparently even Microsoft has tried pushing for ECC RAM in all machines (including desktops) as memory errors have risen to the top 10 causes of system crashes according to their crash analysis.
Earlier this decade I was living with strange, random crashes when booting Linux that would seemingly only occur when booting from cold (but not every time!). It was only years later when running a memtest on someone else's sticks (which turned out to be fine in their machine) that I learned that the motherboard was unhappy with the timing settings even though the RAM was rated as being compatible with said settings.
A few years later on a different machine I was MD5summing a big file locally and the sum turned out to be different to what was expected. For whatever reason I tried doing an MD5sum of the same file over a network filesystem (as the server was running Samba) and was surprised to find that the sum came to the expected result. Upon rerunning the MD5sum locally the sum was again incorrect, in a manner that was different to the first try. Repeated runs on the same unchanging file were giving different MD5sums. Running a memtest went on to show the memory was dodgy.
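The "hash it again and see if the answer changes" test is easy to script. The sketch below uses made-up function names and simply hashes the same file several times; on healthy hardware every digest must be identical, so any disagreement points at RAM, the bus or the storage path rather than the file.

```python
import hashlib


def digest_runs(path, runs=5, chunk_size=1 << 20):
    """Hash the same file several times and return every digest."""
    digests = []
    for _ in range(runs):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests.append(h.hexdigest())
    return digests


def is_stable(digests):
    """True when every run produced the same digest."""
    return len(set(digests)) == 1
```

One caveat: repeated reads of the same file are usually served from the operating system's page cache, so this mostly exercises RAM and the CPU rather than the disk. For a file larger than memory, or after dropping caches, more of the path gets tested.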
I have also seen a not inexpensive system throw up SCSI errors and effectively make the disks disappear from the operating system while doing RAID 5 with a hardware SCSI RAID controller after a single disk failure. Apparently if a SCSI disk fails in a particularly rare manner it is possible for it to keep the bus busy and thus disable communication with any other drives on the same bus. Thankfully the backups worked (the filesystem had been corrupted) but if your data is important you can never be too careful. As disk and memory sizes go up along with the rates at which data is transferred, the chances of rare circumstances like bit flips happening only increase. The question is - will you notice the problem before it's too late, and how much are you willing to pay (in terms of money and performance) to reduce the odds?
Refusing to implement integrity checks at every level is data mismanagement.
The filesystem should provide this.
Linux people have been denying for years that hardware will cause data corruption. Therefore they can deny their own responsibility in detecting and correcting it.
It is everyone's responsibility to make OS people aware of how often hardware causes data corruption.
http://www.storagetruth.org/index.php/2006/data-corruption-happens-easily/
I used to work for a big data network company (you kids never heard of it), where the marketing department quoted the theoretical error rate of the CRC used on the packets as the actual error rate for the network. I got my first evidence this might be optimistic when I saw a chunk of text obviously from (big important corporate customer) mixed into the middle of text I was reading on my terminal in manufacturing. Well, I was just a tech, nobody I mentioned it to showed any sign of interest, so what could I do?
Years later I learned that a defect in the backplane of one of our packet switch models had caused problems in the network for quite a while before it was discovered and fixed.
Well, packets came in, CRC was verified then stripped, packets were disassembled and routed, new packets were assembled, new CRC calculated and attached, packets went out.
What could go wrong?
Sarcasm right ?
There was some IBM paper on how they made it that way. They have RAM data scrubbing and the disk has 520-byte sectors. Every major path, even inside the processor, is ECC checked.
IEEE Spectrum has, in a recent edition, extensive information and links about this in multi-CPU machines.
Unfortunately the same is not true for their System x (x86) machines.
You can make up some more yourself....
Bill Stewart