Calculating the Mean Time Between Failures?

Posted by Cliff on Thursday June 19, 2003 @12:52PM from the surely-they-don't-test-for-500k-hours dept.

Blue Booger asks: "I was looking over some fibrechannel hard drives and noticed that the Mean Time Between Failures was rated at 1.2 million hours. I thought that was pretty high, and figured it up to be close to 137 YEARS!! I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal? BTW, that is 57 years running 24 hours a day...the MTBF is rated as power on time. Here you can find Western Digital's glossary that defines the term MTBF (pdf). Here you can find a spec sheet on one of their 20GB IDE drives. I checked, and Seagate also lists similar MTBFs. How the heck are they coming up with these numbers?"
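The hours-to-years figures in the question can be checked with a few lines of Python. This is only a sketch of the arithmetic; the constant and function names are made up for illustration, and a year is assumed to be 365 days of continuous power-on time.

```python
# Convert a rated MTBF in power-on hours to years of 24/7 operation.
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, drive never powered off


def mtbf_hours_to_years(mtbf_hours: float) -> float:
    """Return the MTBF expressed in years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


fibre_channel_years = mtbf_hours_to_years(1_200_000)
ide_years = mtbf_hours_to_years(500_000)
print(f"Fibre Channel: about {fibre_channel_years:.0f} years")
print(f"IDE: about {ide_years:.0f} years")
```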

Duty Cycle by m0rph3us0 · 2003-06-19 12:59 · Score: 3, Informative

Usually they have a duty cycle associated with an MTBF which can drastically alter the MTBF at a 100% duty cycle.
1. Re:Duty Cycle by Blkdeath · 2003-06-19 14:08 · Score: 2, Informative
  
  Usually they have a duty cycle associated with an MTBF which can drastically alter the MTBF at a 100% duty cycle.
  Not to mention temperature. Read the environmental factors very carefully; if you exceed them by even 1 degree Celsius you can cut your MTBF just as drastically, if not more so.
  
  --
  BD Phone Home!
  Shameless plug. Like you weren't expecting it.
2. Re:Duty Cycle by itwerx · 2003-06-20 03:33 · Score: 2, Informative
  
  That 5-year warranty almost put Western Digital out of business when they all started failing at the 4-year mark!
  No, I'm not kidding. Some heads rolled over that...
You are wrong by Mensa+Babe · 2003-06-19 13:06 · Score: 4, Informative

I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?

If you have twenty drives with a twenty-year MTBF (Mean Time Between Failures) each, then you have one failure per year on average. Basic statistics is always working against you.
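As a rough sketch of that fleet arithmetic (assuming a constant failure rate and that failed drives are replaced, neither of which the post states explicitly), the expected number of failures per year is the number of drives divided by the MTBF in years. The function name is hypothetical.

```python
def expected_failures_per_year(num_drives: int, mtbf_years: float) -> float:
    """Expected failures per year for a fleet, assuming a constant failure rate."""
    return num_drives / mtbf_years


# Twenty drives with a twenty-year MTBF each: one failure per year on average.
print(expected_failures_per_year(20, 20.0))

# A single drive rated at 57 years fails far less often than once a year.
print(expected_failures_per_year(1, 57.0))
```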

--
Karma: Positive (probably because of superiour intellect)
MTBF calculation and estimation by crmartin · 2003-06-19 13:40 · Score: 4, Informative

You know, it's almost a shame to screw up the amusing notions /.ers come up with by adding actual information, but I can't help it, all those years of teaching I guess.

Okay, first of all: "mean time between failures" is obviously a statistical measure -- it is an average over a large number of individual items. In most electronic components (including light bulbs!) the statistical distribution of the time between failures is the exponential distribution, which has the odd property that it's "memory-less" -- it doesn't matter how long since the last failure it's been, the mean time to the next failure will still be the same. A consequence of this is that if the MTBF is 10,000 hours, the probability of failure in any particular hour would be 1/10,000th. So, if you set up 10,000 components, all running simultaneously, you'd expect one of them to fail within the first hour; conversely, if you ran them for 1000 hours, and 998 of them failed, you could be fairly certain that the MTBF would be around 10,000 hours.

Note, by the way, that this is only true when the failure time distribution is exponential -- so it works for electronic components, but not for, say, bicycles and cars and roller skates, which are more likely to fail the older they get.
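A minimal sketch of the exponential model described above, assuming failures really do follow an exponential distribution with mean equal to the MTBF. The function names are made up for illustration. The second function shows the "memory-less" property: the chance of failing in the next window is the same whether the part is new or has already survived for a while.

```python
import math


def survival(hours: float, mtbf_hours: float) -> float:
    """Probability that a part is still working after `hours` of operation."""
    return math.exp(-hours / mtbf_hours)


def prob_fail_within(hours: float, mtbf_hours: float) -> float:
    """Probability that a new part fails within `hours`."""
    return 1.0 - survival(hours, mtbf_hours)


def prob_fail_given_survived(age: float, window: float, mtbf_hours: float) -> float:
    """Probability of failing in the next `window` hours, given survival to `age`."""
    return 1.0 - survival(age + window, mtbf_hours) / survival(age, mtbf_hours)


mtbf = 10_000.0
# Failure probability in any one hour is very close to 1/10,000.
print(prob_fail_within(1.0, mtbf))
# Memorylessness: a new part and a 5,000-hour-old part face the same risk.
print(prob_fail_within(1_000.0, mtbf))
print(prob_fail_given_survived(5_000.0, 1_000.0, mtbf))
```

Under the same assumption, a drive rated at 500,000 hours has roughly a 7% chance of failing within four years of continuous use (`prob_fail_within(4 * 8760, 500_000)`), which is a long way from "lasts 57 years".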

This has an obvious problem, of course: if the MTBF is high, it can take forever to test. Consider, for example, something I worked on for NASA some years ago: trying to prove that a fly-by-wire system will have a mean time between failures of 1e10 hours. (This is about the same failure rate as the airframe, which is how they came up with the number.) 1e10 hours is about 1.141 million years, by the way.

(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, eg TWA 800? Hint: It doesn't require any foul play.)

At that point, you've got a couple of choices: first, you can make a lot of copies and run them simultaneously. Relatively easy for $50 disks, hard for billion dollar 747s.

Second, you can make the estimate by computation and modeling, which is what you do for web systems. Conceptually, it's pretty simple to do this, although it can be a kind of pain in the ass.

The third way, which is new and cool, is by Bayesian estimation of failure rates. This method lets you make increasingly accurate estimates of the failure rate based on short experiments. I don't have time to go into it, but there are some good sources available on the web.
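The post doesn't say which Bayesian method it has in mind. One common textbook choice, shown here purely as an assumption, is a Gamma prior on a constant failure rate combined with a Poisson count of failures. Each short experiment then updates the estimate by simple addition. The prior values and the experiment data below are invented for illustration.

```python
def update_failure_rate(prior_shape, prior_hours, failures, device_hours):
    """Gamma-Poisson conjugate update for a constant failure rate.

    The prior acts like `prior_shape` failures already seen in `prior_hours`
    device-hours; new observations are simply added on.
    """
    return prior_shape + failures, prior_hours + device_hours


def posterior_mean_rate(shape, hours):
    """Posterior mean of the failure rate, in failures per device-hour."""
    return shape / hours


# Hypothetical weak prior: about one failure per 100,000 device-hours.
shape, hours = 1.0, 100_000.0

# Hypothetical short experiments: (failures seen, device-hours accumulated).
experiments = [(0, 200_000.0), (1, 300_000.0), (1, 500_000.0)]

for failures, device_hours in experiments:
    shape, hours = update_failure_rate(shape, hours, failures, device_hours)
    rate = posterior_mean_rate(shape, hours)
    print(f"rate ~ {rate:.2e} per hour, i.e. roughly {1 / rate:,.0f} hours between failures")
```

The estimate tightens as device-hours accumulate, which is the appeal: many units run briefly in parallel stand in for one unit run for decades.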
1. Re:MTBF calculation and estimation by crmartin · 2003-06-19 13:52 · Score: 2, Informative
  
  Actually, here's some more references: at CiteSeer, a good (if expensive) book on practical examples, and my favorite textbook. I'll shut up now.
2. Re:MTBF calculation and estimation by CharlieG · 2003-06-19 17:49 · Score: 2, Informative
  
  Of course, the real problem is that neither electronics nor mechanical items have an exponential failure curve
  
  Mechanical items tend to fail due to wearout - aka, they become more likely to fail as time goes on
  
  Electronics follow a "Bathtub" curve. A high initial rate, that rapidly drops to a VERY low rate. It stays at that LOW rate of failure for MANY hours, and then the rate increases rapidly during "wearout" - sort of like the cross section of a bathtub - hence the name
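One way to picture the bathtub shape is as the sum of three failure rates: a falling "infant mortality" term, a flat random-failure term, and a rising wearout term. The sketch below uses Weibull hazard functions for the falling and rising parts. That modeling choice and every parameter value are assumptions picked only to make the shape visible, not measured data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate at time t > 0 (falls if shape < 1, rises if shape > 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Illustrative failure rate per hour at age t hours (hypothetical parameters)."""
    infant = weibull_hazard(t, shape=0.5, scale=50_000.0)   # early defects
    random_failures = 2e-6                                  # flat bottom of the tub
    wearout = weibull_hazard(t, shape=5.0, scale=60_000.0)  # old-age failures
    return infant + random_failures + wearout


for age in (10, 1_000, 20_000, 60_000, 80_000):
    print(f"{age:>6} h: {bathtub_hazard(age):.2e} failures/hour")
```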
  
  The whole concept of "Burn In" - or better yet, Stress Screening - is to remove the initial high rate of failure defects, without removing ANY of the bottom of the curve. A properly defined Stress Test can do this. These tests usually involve some sort of temperature/power cycling, along with some sort of vibration testing (usually a pseudo random vibration profile)
  
  I got my start in programming while running a stress screening lab. When the tests run 24/7, you either automate your data collection, or have folks work 24/7 - guess which is cheaper? So, I got to design the tests, and then write the software to run the test racks
  
  --
  -- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Re:Yes, I am right. by iamroot · 2003-06-19 13:46 · Score: 2, Informative

Mean Time Before Failure is the MEAN time before the disk would fail.

If they all failed within 20 years, how would the average disk have failed in 20 years???

MTBF (Mean Time Between Failures): Average time (expressed in hours) that a component works without failure. It is calculated by dividing the total number of operating hours observed by the total number of failures. Also, the length of time a user may reasonably expect a device or system to work before an incapacitating fault occurs.

20 drives/20 years is 1 drive a year.
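The quoted definition translates directly into code. This is a sketch with hypothetical names, assuming every drive in the test runs continuously for the whole observation period.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: continuous operation, non-leap year


def estimate_mtbf_hours(total_operating_hours: float, failures: int) -> float:
    """MTBF as defined above: total operating hours divided by failures observed."""
    if failures <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    return total_operating_hours / failures


# Twenty drives observed for one year with one failure among them.
fleet_hours = 20 * HOURS_PER_YEAR
mtbf = estimate_mtbf_hours(fleet_hours, failures=1)
print(f"{mtbf:,.0f} hours, or {mtbf / HOURS_PER_YEAR:.0f} years")
```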
Re:No, you are wrong by The+Clockwork+Troll · 2003-06-19 13:52 · Score: 2, Informative

So you are using the average of the times measured between deployment of hardware and individual drive failure, as opposed to the mean time between failures of individual drives, i.e. the mean time between necessary hardware replacements, which is what I would have thought more useful for evaluating hardware.
I would have called your average "mean time to failure" vs. "mean time between failures."
But as other posters mentioned, most of these stats are marketing bunk no matter how they're computed!

--

There are no karma whores, only moderation johns
No, You are wrong by Anonymous Coward · 2003-06-19 14:27 · Score: 1, Informative

The discrepancy emerges because you do not operate the drives past their end-of-life when you make the MTBF calculation. To illustrate, assume you have a particular model of hard drive with an MTBF of 57 years, and it reaches end-of-life after 5 years. What you can then conclude is that if you replace the drive every 5 years with a new drive (of the same model and MTBF), then you can expect your first failure at 57 years. Keep in mind that this is really just the most probable time for a failure to occur. Given any other time it is possible to calculate the probability of that time being your first occurrence of failure; 57 years just gives the greatest probability in this calculation (in this example). It does not mean that you should expect any single drive to operate for 57 years.
It all depends on the distribution... by anthony_dipierro · 2003-06-19 14:46 · Score: 3, Informative

I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?

Let's say I have a drive that has a 99% chance of failing after 10 years, and a 1% chance of failing after 4710 years. The MTBF is 57 years.

In fact, with the proper distribution (think 2^n) you could have an infinite MTBF, but still have a 99% chance of failure within 10 years. See for example the St. Petersburg paradox.
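Both points are easy to check numerically. The first part below reproduces the 57-year mean from the two-outcome example. The second part is a sketch of the St. Petersburg idea, assuming a lifetime of 2^n years occurs with probability 2^-n: every extra term adds one year to the mean, so the mean grows without bound. The function names are made up for illustration.

```python
def mean_lifetime(outcomes):
    """Mean of a discrete lifetime distribution given (probability, years) pairs."""
    return sum(p * years for p, years in outcomes)


# 99% of drives die at 10 years, 1% last 4710 years: the mean is still 57 years.
print(mean_lifetime([(0.99, 10.0), (0.01, 4710.0)]))


def truncated_st_petersburg_mean(terms: int) -> float:
    """Partial sum of the mean when a lifetime of 2**n years has probability 2**-n."""
    return sum((2.0 ** -n) * (2.0 ** n) for n in range(1, terms + 1))


# Each term contributes exactly one year, so the partial sums never settle down.
for terms in (10, 100, 1000):
    print(terms, truncated_st_petersburg_mean(terms))
```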
Re:mod parent up - he's right by photon317 · 2003-06-20 00:13 · Score: 2, Informative

Whatever your math may say, the industry's standard is the one you disagree with. If they stick 1000 drives in an array, run them for 1 year, and only a single drive fails during that year, the MTBF for that model of drive is 1000 years (1000 drive-years of operation divided by one failure).

--
11*43+456^2