Disk Failure Rates More Myth Than Metric

Posted by Zonk on Saturday April 5, 2008 @07:30AM from the like-the-loch-ness-hard-drive dept.

Lucas123 writes "Mean time between failure ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10. This is nearly 15 times what vendors claim, and all of these studies show failure rates grow steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
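
The hour-to-year figures in the summary can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a drive running 24x7 and a constant failure rate (so the nominal annual failure rate is roughly hours-per-year divided by MTBF), and the function names are made up for this example. It also compares those nominal rates with the "one in 10" replacement rate quoted above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mtbf_to_years(mtbf_hours: float) -> float:
    """Express an MTBF in hours as years of continuous (24x7) running."""
    return mtbf_hours / HOURS_PER_YEAR


def nominal_afr(mtbf_hours: float) -> float:
    """Approximate annual failure rate implied by an MTBF.

    Assumption: constant failure rate and 24x7 operation, so the
    expected fraction failing per year is about HOURS_PER_YEAR / MTBF.
    """
    return HOURS_PER_YEAR / mtbf_hours


OBSERVED_REPLACEMENT_RATE = 0.10  # "greater than one in 10" from the summary

for mtbf in (1_000_000, 1_500_000):
    years = mtbf_to_years(mtbf)
    afr = nominal_afr(mtbf)
    ratio = OBSERVED_REPLACEMENT_RATE / afr
    print(f"MTBF {mtbf:>9,} h = {years:.0f} years, "
          f"nominal AFR {afr:.2%}, observed/nominal = {ratio:.1f}x")
```

Under these assumptions, a 10% annual replacement rate is an order of magnitude above what either MTBF figure implies, which is the gap the summary is pointing at.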

15 of 283 comments

Never had a drive fail by Jafafa+Hots · 2008-04-05 07:34 · Score: 4, Interesting

I've gone through many over the years, replacing them as they became too small (and still using some small ones, many years old, for minor tasks), and the only drive I've ever had partially fail is the one I accidentally launched across a room.
I don't understand how people are always complaining about their hard drives failing. In 30 years it hasn't happened to me yet.
I'm about to lug a huge Wang hard drive out to the trash pickup on Monday - weighs over 100 pounds... still runs. Actually it uses removable platters but still...

--
This space available.
1. Re:Never had a drive fail by Rosy+At+Random · 2008-04-05 09:43 · Score: 2, Interesting
  
  Am I the only one who wants to hear more about the drive that went ballistic?
  
  --
  Would you like a slice of toast?
2. Re:Never had a drive fail by Reziac · 2008-04-05 16:09 · Score: 2, Interesting
  
  I live where the power spikes and sags constantly. My machines are all on UPSs. And each PC has a decent quality PSU. And if a HD runs more than "pleasantly warm" to the touch, it gets its own dedicated fan. Consequently, I firmly believe all HDs are supposed to live A Long Time... the oldest of my 24/7 HDs right now is 10 years old, and has about 80,000 actual hours on it -- Like yourself, I think they're supposed to be worn out before being thrown out. :)
  
  Of course, yonder is a large stack of backups, which also help increase HD longevity. ;)
  
  --
  ~REZ~ #43301. Who'd fake being me anyway?
Misunderstanding MTBF by dh003i · 2008-04-05 07:55 · Score: 4, Interesting

I think that a lot of people are mis-understanding MTBF. A HD might have a MTBF of 100 years. This doesn't mean that the company expects the vast majority of consumers to have that HD running for 100 years without problems.

MTBF numbers are generated by running say thousands of hard-drives of the same model and batch/lot, and seeing how long it takes before 1 fails. This may be a day or so. You then figure out how many total HD running hours it took before failure. If you have 1,000 HD's running, and it takes 40 hours before one fails, that's a 40,000 hr MTBF. But this number isn't generated by running say 10 hard-drives, waiting for all of them to fail, and averaging that number.
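
A minimal sketch of the device-hours arithmetic described in the paragraph above. It follows the comment's simplified description (total running hours across the test population divided by the number of failures seen), not any particular vendor's documented procedure, and the function name is hypothetical.

```python
def observed_mtbf(drive_count: int, hours_on_test: float, failures: int) -> float:
    """Estimate MTBF as total drive-hours divided by failures observed.

    Assumption: every drive in the population runs for the whole test
    window, as in the simplified example from the comment.
    """
    if failures <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    total_drive_hours = drive_count * hours_on_test
    return total_drive_hours / failures


# The comment's example: 1,000 drives, first failure after 40 hours.
print(observed_mtbf(drive_count=1000, hours_on_test=40, failures=1))  # 40000.0
```

Note that the result depends only on the product of population size and test duration, so 1,000 drives for 40 hours and 10 drives for 4,000 hours give the same number even though the second test says far more about how drives age.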

Thus, because of the way MTBF numbers are generated, they may or may not reflect hard-drive reliability beyond a few weeks. It depends on our assumptions about hard-drive stress and usage beyond the length of time before the 1st HD of the 1,000 or so they were testing failed. Most likely, it says less and less about hard-drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands of hours or millions of hours!).

To be sure, all else equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are more useful for predicting DOA, duds, or quick-failure; and are more useful to professionals who might be employing large arrays of HD's. They are not particularly useful for getting a good idea of how long your HD will actually last.

HD manufacturers also publish an expected life-cycle of their HD. But I usually put the most stock in the length of the warranty. That's what they're willing to put their money behind. Albeit, it's possible their strategy is just to warranty less than how long they expect 90% of HD's to last, so they can then sell them cheaper. But if you've had a HD and you've had it for longer than what the manufacturer publishes as the expected-life, what they're saying by that is you've basically got a good value, and will probably want to have something else on hand, and be backed up.

--
social sciences can never use experience to verify their statemen
Temperature is the key by arivanov · 2008-04-05 07:55 · Score: 4, Interesting

Disk MTBF is quoted for 20C.

Here is an example of my server. At 18C ambient in a well cooled and well designed case with dedicated hard drive fans, the Maxtors I use for RAID1 run at 29°C. My media server, which is in the loft with sub-16C ambient, runs them at 24-34C depending on the position in the case (once again, proper high end case with dedicated hard drive fans).

Very few hard disk enclosures can bring the temperature down to 24-25C.

SANs or high density servers usually end up running disks at 30C+ while at 18C ambient. In fact I have seen disks run at 40C or more in "enterprise hardware".

From there on, it is not surprising that they fail at a rate different from the quoted one. In fact, I would have been very surprised if they didn't.

--
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/
1. Re:Temperature is the key by ABasketOfPups · 2008-04-05 08:10 · Score: 5, Interesting
  
  Google says that's just not what they've seen. "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at the very high temperatures is there a slight reversal of this trend."
  
  On the graph it's clear that 30-35C is best at three years. But up until then, 35-40C has lower failure rates, and both have lower rates by a lot than the 15-30C range.
Re:MTBF For Unused Drive? by zappepcs · 2008-04-05 07:56 · Score: 4, Interesting

The problem is that the MTBF is calculated on an accelerated lifecycle test schedule. Life in general does not actually act like the accelerated test expanded out to 1day=1day. It is an approximation, and prone to errors because of the aggregated averages created by the test.

On average, a disk drive can last as long as the MTBF number. What are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, every part has an MTBF. They also have tolerance values: that is to say they are manufactured to a value with a given tolerance of accuracy. Each tolerance has to be calculated as one component out of tolerance could cause failure of complete sections of the drive itself. When you start calculating that kind of thing it becomes similar to an exercise in calculating safety on the space shuttle... damned complex in nature.

The tests remain valid because of a simple fact. In large data centers where you have large quantities of the same drive spinning in the same lifecycles, you will find that a percentage of them fail within days of each other. That means that there is a valid measurement of the parts in the drive, and how they will stand the test of life in a data center.

Is your data center an 'average' life for a drive? The accelerated lifecycle tests cannot tell you. All the testing does is look for failures of any given part over a number of power cycles, hours of use etc. It is quite improbable that your use of the drive will match that of the expanded testing life cycle.

The MTBF is a good estimation of when you can be certain of a failure of one part or another in your drive. There is ALWAYS room for it to fail prior to that number. ALWAYS.

Like any electronic device for consumers, if it doesn't fail in the first year, it's likely to last as long as you are likely to be using it. Replacement rates of consumer societies mean that manufacturers don't have to worry too much about MTBF as long as it's longer than the replacement/upgrade cycle.

If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.

--
Support NYCountryLawyer RIAA vs People
Recycle, don't just dump it! by Anonymous Coward · 2008-04-05 08:02 · Score: 1, Interesting

He should look at the escalating price of gold too. The older the computer component, the more gold in the connectors and the thicker the gold on the traces, etc. Not to mention other precious metals involved in some of the components, such as platinum, palladium, etc. Perhaps the greatest consideration should be given to the fact that it would increase the heavy metal pollution at the dump it goes to.

Probably some nice magnets inside to play with too. :P
I don't know what you people do to your drives by gelfling · 2008-04-05 08:22 · Score: 2, Interesting

But since 1981 I have had exactly zero catastrophic PC drive crashes. That's not to say I haven't seen some bad/relocated sectors, but hard failures? None. Granted that's only 20 drives. But in fact in my experience in PC's, midranges and mainframes in almost 30 years I have seen zero hard drive crashes.
Re:Marketplace can't function without good data by commodoresloat · 2008-04-05 08:56 · Score: 3, Interesting

If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products. And that may be the exact reason why the vendors are providing bad data. On the flip side, however, if people knew how often drives failed, perhaps we'd buy more of them in order to always have backups.
Re:Never had a drive *not* fail. by Depili · 2008-04-05 09:09 · Score: 3, Interesting

The deathstars were all 80GB PATA disks, manufactured by a single plant. I had 8 of them, and all failed.
WD Green drive - marketing invention by Anonymous Coward · 2008-04-05 09:51 · Score: 1, Interesting

disclaimer: I work for Samsung 3.5" HDD Lab

One difference would be that the voice coil motor that pushes the head back and forth on seeks on Samsung drives runs slower, but quieter and lower power. Samsung drives generally have a reputation for being lower power. That has been one differentiating factor between Samsung versus Seagate, Fujitsu and Western Digital. However, an even bigger difference is the number of disks in the drive. The more disks, the harder all of the motors have to work.

There are differences from model to model within vendors as well. For each new model of hard drive you have a custom designed motor, enclosure, ICs, media, etc. The technology is moving so fast it is hard to follow. The current generation is the 1TB disks.

One funny example is that right now Western Digital is pushing their so-called "Green" 5400 rpm drives. Running at 5400 rpm does indeed use less power -- but they didn't set out to make a low power drive. Engineering was simply unable to get their 1TB drive to work at the higher performance 7200 rpm. So, they marketed it as a "green" drive, and had a huge success!
Re:Never had a drive *not* fail. by dgatwood · 2008-04-05 10:02 · Score: 2, Interesting

One drive, 24x7, approx. 12 years. Seagate. Why?

--
Check out my sci-fi/humor trilogy at PatriotsBooks.
Re:warranties by Anonymous Coward · 2008-04-05 10:29 · Score: 1, Interesting

Interesting ideas about hard drive reliability. I spend many hours each month looking at hard drive performance as part of my work. My job is to qualify drives (and other devices) for our servers. Also have a large volume of drives in the lab and in the field to monitor.

I see the useful life of most drives as 3 - 5 years. The drive supplier is going to cover failures inside that 5 year useful life. Most folks are replacing the obsolete gear as new hardware becomes available. The idea that a drive could actually have a million hour MTBF is just fantasy. I see lots of failures with less than 8000 hours (a year) on the drive. Those are certainly outside the "early life failure" category. They are just worn out. Lots of drives have defects that the user doesn't even know about. They don't generate SMART reports or they don't analyze what the report says. I see drives all the time that have only a few hundred hours on them that I wouldn't install in my system.

MTBF specs for hard drives are a marketing ploy.
Re:Never had a drive *not* fail. by monkaru · 2008-04-05 11:22 · Score: 2, Interesting

*laughs* Redundancy exists for a purpose. I ALWAYS assume hardware will fail. It does, you know. I guess that's why I still have the data from my 1966 756 byte Multics terminal account.