Disk Failure Rates More Myth Than Metric
Lucas123 writes "Mean time between failure ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10. This is nearly 15 times what vendors claim, and all of these studies show failure rates grow steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
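For anyone who wants to check the arithmetic in the summary, here is a rough sketch that converts an MTBF rating into years and into an annualized failure rate (AFR). It assumes a constant failure rate (exponential model) and round-the-clock operation, neither of which the summary states, and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760; assumes the drive runs around the clock


def mtbf_to_years(mtbf_hours: float) -> float:
    """Express an MTBF rating in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1_000_000, 1_500_000):
    print(f"{mtbf:>9,} h = {mtbf_to_years(mtbf):5.1f} years, "
          f"AFR = {annualized_failure_rate(mtbf):.2%}")
```

Under these assumptions a 1 million hour rating corresponds to an AFR a bit under 1%, which is the number to hold against the one-in-10 replacement rates the studies report.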
I don't understand how people are always complaining about their hard drives failing. In 30 years it hasn't happened to me yet.
I'm about to lug a huge Wang hard drive out to the trash pickup on Monday - weighs over 100 pounds... still runs. Actually it uses removable platters but still...
This space available.
...those that make backups and those that never had a hard drive fail.
If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.
The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.
"How to Do Nothing," kids activities, back in print!
My anecdotal converse is I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter I now claim a pair of 120s as my smallest.
I always seem to have a use for a drive, so I run them until failure.
I dub thee... Sir Phobos, Knight of Mars, Beater of Ass.
The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.
'Every story, if continued long enough, ends in death.' --Ernest Hemingway
I remember back in the mid 1980s when I received a service management manual from DEC, it had some information that really opened my eyes about what MTBF was really intended for. It had a calculation (I have long since forgotten the details) that allowed you to estimate how many service spares you would need to keep in stock to service any installed base of hardware, based on MTBF. This was intended for internal use in calculating spares inventory level for DEC service agents. High MTBF products needed fewer replacement parts in inventory, low MTBF parts needed lots of parts in stock. Presumably internal MTBF ratings were more accurate than those released to end users.
So anyway.. MTBF is not intended as an indicator of a specific unit's reliability. It is a statistical measurement to calculate how many spares are needed to keep a large population of machines working. It cannot be applied to a single unit in the way it can be applied to a large population of units.
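I no longer have the DEC formula, so the following is only a guess at the kind of calculation involved, not a reconstruction of it. It assumes failures arrive as a Poisson process with a constant rate, and it picks the smallest stock of spares that covers demand over one restocking period with a chosen confidence. The function names and the example numbers are hypothetical.

```python
import math


def expected_failures(units: int, mtbf_hours: float, period_hours: float) -> float:
    """Mean number of failures across the installed base during the period."""
    return units * period_hours / mtbf_hours


def spares_needed(units: int, mtbf_hours: float, period_hours: float,
                  confidence: float = 0.95) -> int:
    """Smallest stock k such that P(failures <= k) >= confidence (Poisson model)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    lam = expected_failures(units, mtbf_hours, period_hours)
    if lam > 700:
        raise ValueError("expected failure count too large for this simple sum")
    term = math.exp(-lam)  # P(exactly 0 failures)
    cumulative = term
    k = 0
    while cumulative < confidence and term > 0.0:
        k += 1
        term *= lam / k    # P(exactly k failures), built from P(k - 1)
        cumulative += term
    return k


# Hypothetical installed base: 1,000 units, 100,000 h MTBF, 30-day restock cycle.
print(spares_needed(1000, 100_000, 30 * 24))
```

The point it illustrates is the one above: a higher MTBF lowers the expected failure count for the whole population, and with it the number of parts on the shelf, without saying anything about which unit fails.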
Perhaps the classic example is the old tube-based computers like ENIAC: if a single tube has an MTBF of 1 year but the computer has 10,000 tubes, you'd be changing tubes (on average) more than once an hour, and you'd rarely even get an hour of uptime. (I hope I got that calculation vaguely correct.)
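The tube arithmetic can be checked with a couple of lines. This sketch assumes every tube fails independently at a constant rate and that any single tube failure stops the machine; the function name is just for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def hours_between_failures(component_mtbf_hours: float, component_count: int) -> float:
    """Average time between failures anywhere in a population of identical parts."""
    return component_mtbf_hours / component_count


interval = hours_between_failures(HOURS_PER_YEAR, 10_000)
print(f"{interval:.3f} hours between tube failures, "
      f"about {1 / interval:.2f} failures per hour")
```

So the estimate holds up: a one-year MTBF per tube across 10,000 tubes gives a failure roughly every 0.876 hours, slightly more than one per hour.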
I think that a lot of people are misunderstanding MTBF. A HD might have a MTBF of 100 years. This doesn't mean that the company expects the vast majority of consumers to have that HD running for 100 years without problems.
MTBF numbers are generated by running, say, thousands of hard drives of the same model and batch/lot, and seeing how long it takes before one fails. This may be a day or so. You then figure out how many total HD running hours it took before failure. If you have 1,000 HDs running, and it takes 40 hours before one fails, that's a 40,000 hr MTBF. But this number isn't generated by running, say, 10 hard drives, waiting for all of them to fail, and averaging that number.
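As a sketch of that bookkeeping: total drive-hours on test divided by the number of failures seen. It simplifies by assuming every drive runs for the full test duration (a failed unit is swapped out immediately), and the function name is hypothetical rather than anything a vendor publishes.

```python
def mtbf_from_test(units_on_test: int, hours_run: float, failures: int) -> float:
    """Estimate MTBF as accumulated drive-hours divided by observed failures."""
    if failures <= 0:
        raise ValueError("need at least one failure to form an estimate")
    return units_on_test * hours_run / failures


# The example above: 1,000 drives, first failure after 40 hours.
print(mtbf_from_test(1000, 40, 1))  # 40000.0
```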
Thus, because of the way MTBF numbers are generated, they may or may not reflect hard-drive reliability beyond a few weeks. It depends on our assumptions about hard-drive stress and usage beyond the length of time before the 1st HD of the 1,000 or so they were testing failed. Most likely, it says less and less about hard-drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands of hours or millions of hours!).
To be sure, all-else equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are more useful for predicting DOA, duds, or quick-failure; and are more useful to professionals who might be employing large arrays of HD's. They are not particularly useful for getting a good idea of how long your HD will actually last.
HD manufacturers also publish an expected life-cycle of their HD. But I usually put the most stock in the length of the warranty. That's what they're willing to put their money behind. Albeit, it's possible their strategy is just to warranty less than how long they expect 90% of HD's to last, so they can then sell them cheaper. But if you've had a HD and you've had it for longer than what the manufacturer publishes as the expected-life, what they're saying by that is you've basically got a good value, and will probably want to have something else on hand, and be backed up.
social sciences can never use experience to verify their statemen
Disk MTBF is quoted for 20C.
Here is an example of my server. At 18C ambient in a well cooled and well designed case with dedicated hard drive fans the Maxtors I use for RAID1 run at 29°C. My Media server which is in the loft with sub-16C ambient runs them at 24-34 depending on the position in the case (once again, proper high end case with dedicated hard drive fans).
Very few hard disk enclosures can bring the temperature down to 24-25C.
SANs or high density servers usually end up running disks at 30C+ while at 18C ambient. In fact I have seen disks run at 40C or more in "enterprise hardware".
From there on it is not amazing that they fail at a rate different from the quoted one. In fact I would have been very surprised if they didn't.
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
The problem is that the MTBF is calculated on an accelerated lifecycle test schedule. Life in general does not actually act like the accelerated test expanded out to 1day=1day. It is an approximation, and prone to errors because of the aggregated averages created by the test.
On average, a disk drive can last as long as the MTBF number. What are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, every part has an MTBF. They also have tolerance values: that is to say they are manufactured to a value with a given tolerance of accuracy. Each tolerance has to be calculated as one component out of tolerance could cause failure of complete sections of the drive itself. When you start calculating that kind of thing it becomes similar to an exercise in calculating safety on the space shuttle... damned complex in nature.
The tests remain valid because of a simple fact. In large data centers where you have large quantities of the same drive spinning in the same lifecycles, you will find that a percentage of them fail within days of each other. That means that there is a valid measurement of the parts in the drive, and how they will stand the test of life in a data center.
Is your data center an 'average' life for a drive? The accelerated lifecycle tests cannot tell you. All the testing does is look for failures of any given part over a number of power cycles, hours of use etc. It is quite improbable that your use of the drive will match that of the expanded testing life cycle.
The MTBF is a good estimation of when you can be certain of a failure of one part or another in your drive. There is ALWAYS room for it to fail prior to that number. ALWAYS.
Like any electronic device for consumers, if it doesn't fail in the first year, it's likely to last as long as you are likely to be using it. Replacement rates of consumer societies mean that manufacturers don't have to worry too much about MTBF as long as it's longer than the replacement/upgrade cycle.
If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.
Support NYCountryLawyer RIAA vs People
Warranty periods for 750 gig and 1 terabyte drives from Western Digital, Samsung, and Hitachi, are 3 years to 5 years according to the info on zipzoomfly.com.
A one year warranty doesn't seem that common. External drives seem to have one year warranties, but even SATA drives at Best Buy mostly have 3 years
Great post above. It also depends on how you count "failure." I've had external drives fail where the disk would still spin up, but the interface was the failure point. I took the disk out of the external enclosure and it worked just fine with a direct IDE (I know, who uses that anymore?) connection.
If I were running a data-based business I'd count that as a "failure" since I had to go deal with the drive, but the HD company probably wouldn't since no data was permanently lost.
While we are on the topic of failing drives, I think it would be appropriate to include a warning about USB drives and warranties.
I purchased a 500GB Western Digital My Book about a year and a half ago. I figured that a pre-fab USB enclosed drive would somehow be more reliable than building one myself with a regular 3.5" internal drive and my own separately purchased USB enclosure (you may dock me points for irrational thinking there). Of course, I started getting the click-of-death about a month ago, and I was unpleasantly surprised to discover that the warranty on the drive was only for 1 year, rather than the 3 year warranty that I would have gotten for a regular 3.5" 500GB Western Digital drive at the time. Meanwhile, my 750GB Seagate drive in an AMS VENUS enclosure has been chugging along just fine, and if it fails sometime in the next four years, I will still be able to exchange it under warranty.
The moral of the story is that, when there is a difference in the warranty periods (i.e., 1 year vs. 5 years), it makes a lot more sense to build your own USB enclosed drive rather than order a pre-fab USB enclosed drive.
An unjust law is no law at all. - St. Augustine
To make this sort of test work, it must be run over a much longer period of time. But in the process of designing, building, testing and refining disk drive hardware and firmware (software), there isn't that much extra time to test drive failure rates. Want to wait an extra 9 months before releasing that new drive, to get accurate MTBF numbers? Didn't think so. How many different disk controllers do they use in the MTBF tests, to approximate different real-world behaviors? Probably not that many.
Could they run longer tests, and revise MTBF numbers after the initial release of a drive? Sure, and many of them do, but that revised MTBF would almost always be lower, making it harder to sell the drives. On the other hand, newer drives are certainly available every quarter, so it may not be a bad idea to lower the apparent value of older drive models.
So, it's better to assume a drive will fail before you're done using it. They're mechanical devices with high-speed moving parts and very narrow tolerable ranges of operation (that drive head has to be far enough away from the platters not to hit them, but close enough to read smaller and smaller areas of data). Anyone who's worked in a data center, or even a small server room, knows that drives fail. When I've had around two hundred drives, of varying ages, sizes and manufacturers, in a data center, I observed a failure rate of five to ten drives per year. That is well short of the MTBF quoted for enterprise disk array drives (SCSI, FC, SAS, whatever), but drives fail. That's why we have RAID. Storage Review has a good overview of how to interpret MTBF values from drive manufacturers.
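To put a number on those observations, here is a small sketch that turns "N failures per year out of 200 drives" into an annual failure rate and the MTBF it would imply. It assumes the drives run around the clock and fail at a steady rate; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumes 24x7 operation


def annual_failure_rate(drive_count: int, failures_per_year: float) -> float:
    """Observed fraction of the fleet that fails in a year."""
    return failures_per_year / drive_count


def implied_mtbf_hours(drive_count: int, failures_per_year: float) -> float:
    """MTBF that would produce the observed failure count in a fleet this size."""
    return HOURS_PER_YEAR / annual_failure_rate(drive_count, failures_per_year)


for failures in (5, 10):
    print(f"{failures} failures/year: "
          f"AFR = {annual_failure_rate(200, failures):.1%}, "
          f"implied MTBF = {implied_mtbf_hours(200, failures):,.0f} hours")
```

Five to ten failures a year in a fleet of 200 works out to an implied MTBF of a few hundred thousand hours, compared with the 1 million hours or more that the story summary cites.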
But since 1981 I have had exactly zero catastrophic PC drive crashes. That's not to say I haven't seen some bad/relocated sectors, but hard failures? None. Granted that's only 20 drives. But in fact in my experience in PC's, midranges and mainframes in almost 30 years I have seen zero hard drive crashes.
Anecdotal reports of failures also need to consider the operating environment. If I have a server rack, and most servers in the rack have a drive failure in the first year, is it the drive design or the server design? Given the relative effort that usually goes into HDD design and box design, it's more likely to be due to poor thermal management in the drive enclosure. Back in the day when Apple made computers (yes, they did once, before they outsourced it) their thermal management was notoriously better than that of many of the vanilla PC boxes, and properly designed PC-format servers like the HP Kayaks were just as expensive as Macs. The same, of course, went for Sun, and that was one reason why elderly Mac and Sparc boxes would often keep chugging along as mail servers until there were just too many people sending big attachments.
One possibly related oddity that does interest me is laptop prices. The very cheap laptops are often advertised with optional 3 year warranties that cost as much as the laptop. Upmarket ones may have three year warranties for very little. I find myself wondering if the difference in price really does reflect better standards of manufacture so that the chance of a claim is much less, whether cheap laptops get abused and are so much more likely to fail, or whether the warranty cost is just built into the price of the more expensive models because most failures in fact occur in the first year.
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."
Hard drives have been becoming less and less reliable as densities increase. Seagate, WD, Hitachi, Maxtor, Toshiba, heck, they all die, often sooner than their warranties are up. They're mechanical devices, for crying out loud. So here's a bit of good advice: If you really care about your data, use a RAID array with redundancy (RAID 1 or 5). It will cost a bit more, but you'll sleep better at night. Thank you all for your kind attention. That is all.
There is another failure rate that you have to take into account: unrecoverable bit-read error-rate. This is detected as an error in the upstream connection, which can cause the controller to fail the drive. An unrecoverable read fails the ECC mechanism and can under circumstances be recovered by performing a re-read of the sector.
The error rate is on the order of one unrecoverable error per 10^14 bits read. On a busy system reading 1 MByte/s, that gives you approx. 10^7 seconds between unrecoverable read failures, which means one occurs about 3 times per year on average. So, forget MTBF on busy systems and hope that your controller is able to do re-reads on a disk. Otherwise, your busy system/array is not going to last very long.
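The estimate can be reproduced in a few lines. This sketch assumes the disk is read continuously at the stated rate, that 1 MByte means 10^6 bytes, and that errors are spread evenly over the bits read; the function name is made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_between_read_errors(bits_per_error: float, bytes_per_second: float) -> float:
    """Average time between unrecoverable read errors at a sustained read rate."""
    return bits_per_error / (bytes_per_second * 8)


interval = seconds_between_read_errors(1e14, 1e6)
print(f"{interval:.3g} seconds between errors, "
      f"{SECONDS_PER_YEAR / interval:.1f} errors per year")
```

That gives 1.25 x 10^7 seconds between errors, or roughly two and a half per year, in line with the "about 3 times per year" figure.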
Disk reliability metrics are much more science than myth. Like all science, this means you actually need to put some minimal effort into understanding them. Unlike myths :-)
Disks have two separate reliability metrics. The first is their expected lifetime. In general, disk failure follows a "bathtub distribution". Disks are much more likely to fail in the first few weeks of operation. If they make it past this phase, they become very reliable - for a while anyway. Once their expected lifetime is reached, their failure rate starts climbing steeply.
The often quoted MTBF numbers express the disk reliability during the "safe" part of this probability distribution. Therefore, a disk with an expected lifetime of, say, 4 years, can have an MTBF of 100 years. This sounds theoretical until you consider that if you have 200 such disks, you can expect that on average two of them will fail each year.
People running large data warehouses are painfully aware of these two separate numbers. They need to replace all "expired" disks, and also have enough redundancy to survive disk failures in the meantime.
The article goes so far as to state this:
"When the vendor specs a 300,000-hour MTBF -- which is common for consumer-level SATA drives -- they're saying that for a large population of drives, half will fail in the first 300,000 hours of operation," he says on his blog. "MTBF, therefore, says nothing about how long any particular drive will last."
However, this obviously flew over the head of the author:
The study also found that replacement rates grew constantly with age, which counters the usual common understanding that drive degradation sets in after a nominal lifetime of five years, Schroeder says.
Common understanding is that 5 years is a bloody long life expectancy for a hard disk! It would take divine intervention to stop failures from rising after such a long time!
MTBF is only valid during the "lifetime" of a drive. (For example, "lifetime" might mean the five years during which a drive is under warranty.) Thus, the MTBF is the mean time before failure if you replace the drive every five years with other drives with identical MTBF. Thus the 100-some year MTBF doesn't mean that an individual drive will last 100+ years, it means that your scheme of replacing every 5 years will work for an average time of 100+ years.
Of course, I think this is another deceptive definition from the hard drive industry... To me, the drive's lifetime ends when it fails, not "5 years".
Source: http://www.rpi.edu/~sofkam/fileserverdisks.html
If you think an MTBF of 100 years means the disk will last 100 years you're bound to be disappointed, because that's not what it means. MTBF is calculated in different ways by different companies, but generally there are at least two numbers you need to look at, MTBF and the design or expected lifetime. A disk with an MTBF of 200 000 hours and a lifetime of 20 000 hours means that 1 in 10 are expected to fail during their lifetime, or with 200 000 disks one will fail every hour. It does not mean the average drive will last 200 000 hours. After the lifetime is over all bets are off.
In short, the MTBF is a statistical measure of the expected failure rate during the expected lifetime of a device; it is not a measure of the expected lifetime of a device.
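The "1 in 10" figure for a 200 000 hour MTBF and a 20 000 hour lifetime can be sanity-checked as follows. The sketch assumes a constant failure rate during the design lifetime (an exponential model), which is my assumption and not something a datasheet promises; the function name is hypothetical.

```python
import math


def fraction_failed_within(lifetime_hours: float, mtbf_hours: float) -> float:
    """Share of drives expected to fail before lifetime_hours at a constant failure rate."""
    return 1.0 - math.exp(-lifetime_hours / mtbf_hours)


simple_ratio = 20_000 / 200_000
modelled = fraction_failed_within(20_000, 200_000)
print(f"lifetime / MTBF = {simple_ratio:.1%}, exponential model = {modelled:.2%}")
```

The plain ratio gives 10% and the exponential model gives about 9.5%, so "1 in 10" is a fair rule of thumb as long as the lifetime is short compared with the MTBF.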
Chernobyl 'not a wildlife haven' - BBC News
MTBF is NOT calculated for a single drive. MTBF is calculated based on an average for ANY pool size of drives.
If you have 10,000 drives, and the failure is 1 in 1,000,000 hours, you will have a failure every 100 hours.
Here's a good document on disk failure information:
http://research.google.com/archive/disk_failures.pdf
Last I checked.
-Clio
Karma: Bad (mostly from not giving a fuck)
Blog: http://clintjcl.wordpress.com
...that by the time the drive fails beyond that warranty, the vendor more likely than not won't have any drives that small in stock. So they'll replace it with whatever's on the shelf, which is usually an order of magnitude larger, at the very least.
To the guys who claim they've never lost a drive, you've had what? Maybe 3 or 4? I deal with several large RAID arrays, encompassing a few hundred drives and running 24/7. The power and cooling are very tightly controlled. Looking at our statistics, we have about a 5% failure rate for drives within the first year. About 10% over four years. SCSI drives seem to last longer than SATA drives, but they are also much more expensive. The MTBF numbers from the manufacturers are total BS. The best number to go by is the warranty, because that's what matters to the manufacturer. Depending on the expected failure rate of a particular model and the profit margin, they set the warranty period to minimize the number of replacements and still be able to make a profit. For some models that might mean a 5% or even 10% warranty replacement rate.