Calculating the Mean Time Between Failures?

Duty Cycle by m0rph3us0 · 2003-06-19 12:59 · Score: 3, Informative

Usually they have a duty cycle associated with an MTBF which can drastically alter the MTBF at a 100% duty cycle.

Re:Duty Cycle by Jeremiah+Cornelius · 2003-06-19 13:35 · Score: 1

I NEVER had an IDE drive last more than 14 years!
;-)

--
"Flyin' in just a sweet place,
Never been known to fail..."
Re:Duty Cycle by Blkdeath · 2003-06-19 14:08 · Score: 2, Informative

Usually they have a duty cycle associated with an MTBF which can drastically alter the MTBF at a 100% duty cycle.
Not to mention temperature. Read the environmental factors very carefully; if you exceed them by even 1 degree Celsius you can cut your MTBF equally, if not more drastically.

--
BD Phone Home!
Shameless plug. Like you weren't expecting it.
Re:Duty Cycle by ottothecow · 2003-06-19 14:24 · Score: 1

or maybe they forgot to power up the drives and when they turned them on after using their time machine to move into the future they miraculously worked

--
Bottles.
Re:Duty Cycle by macdaddy357 · 2003-06-19 14:38 · Score: 1

Has the MTBF of major brand hard drives gone down in recent years like their warranties? They once had five year warranties, then three, then all at once, major manufacturers scaled their warranties back to one year. Are they cramming way too much data onto the platters, making the technology unreliable, or are they just cutting their costs at the expense of customers? Their shortening of warranties to one year seemingly all at once smells like collusion to me, which violates anti-trust laws. The FTC should investigate this.

--
How ya like dat?
Re:Duty Cycle by itwerx · 2003-06-20 03:33 · Score: 2, Informative

That 5-year warranty almost put Western Digital out of business when they all started failing at the 4-year mark!
No, I'm not kidding. Some heads rolled over that...

Do it like the pros. by Anonymous Coward · 2003-06-19 13:00 · Score: 1, Funny

Just make Some Wild Ass Guess(SWAG).

Like, my hard drive has a MTBF of 300,000 hours.

Re:Do it like the pros. by aminorex · 2003-06-20 04:48 · Score: 1

Hey, Baby, my hard drive has *never* failed.

--
-I like my women like I like my tea: green-

not just drives... by ryanmoffett · 2003-06-19 13:04 · Score: 4, Interesting

Cisco used to sell Catalyst 3548XL switches that were listed as having a MTBF of 120,000+ hours. Their current replacement for that line (3550) comes in at 163,000+ hours. We had 7 of 24 3548XL switches fail in the first year we had them. They had poor air flow from a tiny fan, no heatsinks and tons of hot chips. The newer model has the same issue, though they did stuff a cheap foam baffle in the case to get air to flow closer to the chips, none of which have heatsinks. I have no idea how they tested them and got a MTBF of 13 years.

Simple, it's called "lies" by Anonymous Coward · 2003-06-19 13:05 · Score: 5, Funny

Sure, the test engineers sit and rub their chins and write numbers on paper and do stupid tests in the lab, but in the end it comes down to this:

WD Guy 1 Hey, what's the MTBF for our new drive?
WD Guy 2 Dunno, what's Maxtor saying?
WD Guy 1 sez here "300,000" hours
WD Guy 2 okay, ours is 500,000 then
WD Guy 1 I smell a NEW VICE PRESIDENT

You are wrong by Mensa+Babe · 2003-06-19 13:06 · Score: 4, Informative

I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?

If you have twenty drives with twenty years MTBF (Mean Time Between Failures) each, then you have one failure per year on average. These are basic statistics fighting always against you.

--
Karma: Positive (probably because of superiour intellect)

Re:You are wrong by Anonymous Coward · 2003-06-19 13:11 · Score: 0

But, he only mentions two drives not twenty. Therefore his failure rate is far too high statistically.
Re:You are wrong by sstamps · 2003-06-20 03:49 · Score: 1

First, as already pointed out above, if you have 20 drives with a simple MTBF of 20 years (as you are alluding), then a single failure of one drive each year for twenty years yields a mean of around 10 years.

Second, manufacturer-specified MTBF has nothing to do with a simple, observational mean. It is a calculated statistic based on a statistical sample and extrapolated over a normal distribution. In real-life terms, it is close to meaningless.

Basically, it does not mean "Mean Time Between observed Failures". The word "Between" is almost a dead giveaway there by itself. How many drives have YOU seen that had a failure, were repaired, and put back into service?

About the only usefulness it has is as a comparison statistic and, even then, its value is dubious at best.

Like you said, though, redundancy is the practical answer, unless you are so unlucky as to get a batch of drives that fail nearly simultaneously in a single redundant system. Then, the mantra becomes "We have backups!".

You do have a backup, right? ;)

--
-SS "Teach the ignorant, care for the dumb, and punish the stupid."
Re:You are wrong by sstamps · 2003-06-20 04:08 · Score: 1

Actually, let me clarify that.

If you are assuming that a drive is repaired and placed back into service, and then fails again after another random 1-20 year period, then you would be correct.

I think the problem is that most of the geeks think in terms of practicality, and treat the situation as MTTF ("Mean-Time-To-Failure") instead of MTBF, since we all know that MTBF is BS anyway.

As a result, I clarify the statement above, myself assuming "simple MTTF", not "simple MTBF".

--
-SS "Teach the ignorant, care for the dumb, and punish the stupid."

Here's a wild-ass guess by HotNeedleOfInquiry · 2003-06-19 13:09 · Score: 4, Insightful

First they specify a sample period, perhaps a year. Then they multiply the number of units shipped during that time by the estimated hours per year that the drives are run, then divide by the number of units returned due to failure.

For example, shipped 2 million drives last year, each ran 2080 hours (8 hours * 5 days * 52 weeks), roughly 4 billion hours total. Out of those 2 million units, they got 3466 returns. So the average MTBF was 1.2 million hours.
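
The arithmetic in that example can be checked with a short sketch. The function name `field_mtbf` is made up here for illustration; the figures are the ones from the comment, and the estimate assumes every failed unit actually gets returned and counted.

```python
def field_mtbf(units_shipped, hours_per_unit, failures):
    """Estimate MTBF as total operating hours divided by observed failures."""
    if failures <= 0:
        raise ValueError("need at least one failure to form an estimate")
    total_hours = units_shipped * hours_per_unit
    return total_hours / failures

hours_per_year = 8 * 5 * 52          # 8 hours a day, 5 days a week, 52 weeks
total = 2_000_000 * hours_per_year   # total drive-hours in the sample period
mtbf = field_mtbf(2_000_000, hours_per_year, 3466)
print(hours_per_year, total, round(mtbf))
```

Note how sensitive the result is to the assumed hours per year: drives that actually run 24/7 would accumulate about four times as many hours for the same number of returns.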

--
"Eve of Destruction", it's not just for old hippies anymore...

Re:Here's a wild-ass guess by elmegil · 2003-06-19 13:21 · Score: 3, Interesting

A lot of hardware vendors actually test before they ship. But aside from that your basic math is about right. A controlled number of units is tested (possibly in stressed environments) and used to build the statistics that say what the expected MTBF should be.

--
7 November 2006: The day Americans realized corruption and incompetence weren't addressing 11 September 2001
Re:Here's a wild-ass guess by Anonymous Coward · 2003-06-19 15:03 · Score: 0

This is not what is done. You have no way of controlling for things like abuse, fraud, damage during shipment, etc. All testing which leads to the MTBF number is done by the manufacturer, or by an independent laboratory.

Look at the definition by aaarrrgggh · 2003-06-19 13:12 · Score: 3, Interesting

If they run 500 drives for 2,000 hours and observe only one failure, that is an MTBF of 1,000,000 hours (500 x 2,000 unit-hours divided by one failure).

Unfortunately, that equation doesn't take into account the fact that some equipment degrades over time; if a product is very reliable for 1,000 hours, and less reliable after that, just double the sample size (maybe triple for statistics), and see what you get.
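
A rough sketch of that effect, assuming lifetimes follow a Weibull distribution with an increasing failure rate (shape greater than 1). The shape and scale values are invented for illustration and are not measurements of any real drive; the function names are hypothetical too.

```python
import math

def naive_mtbf_from_short_test(n_units, test_hours, shape, scale):
    """MTBF a short test would report if lifetimes are Weibull(shape, scale)."""
    # Probability that a single unit fails during the test window.
    p_fail = 1.0 - math.exp(-((test_hours / scale) ** shape))
    expected_failures = n_units * p_fail
    return n_units * test_hours / expected_failures

def true_mean_life(shape, scale):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)

# Hypothetical wear-out parameters, chosen only for illustration.
SHAPE, SCALE = 2.0, 50_000.0

reported = naive_mtbf_from_short_test(500, 1_000, SHAPE, SCALE)
actual = true_mean_life(SHAPE, SCALE)
print(f"short-test MTBF: {reported:,.0f} h, true mean life: {actual:,.0f} h")
```

With a shape of 1.0 (no wear-out, i.e. the exponential case) the two numbers nearly agree; the gap only opens up when the failure rate grows with age.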

Real reliability calculations are much more difficult than just what users think MTBF means...

Re:Look at the definition by 0x0d0a · 2003-06-19 17:24 · Score: 1

Yup.

MTBF might have a little bit of value as a relative measure -- i.e. perhaps drives with an order of magnitude higher MTBF will last longer. It's a lot less useful as an absolute measure.

Lots of equipment (not just hard drives) has some sort of estimated lifetime. The best the manufacturer can ever do is estimate under semi-realistic conditions and extrapolate.

--
May we never see th

Labs by MazTaim · 2003-06-19 13:15 · Score: 4, Insightful

That's the key word.

MTBF is probably determined by taking a bunch of drives, putting them into PERFECT conditions that NEVER exist in the real world. Run them in a way that, although it tests all functionality, really doesn't provide true conditions for drives (i.e. head always reading/writing up and down the disk probably never seeking, disks always spinning, etc.). Something that drives never do in real life. Statistics...statistics...statistics...(speeling too :)

As a sidenote by Mensa+Babe · 2003-06-19 13:16 · Score: 1

I might add, that when I was contracted at a server farm, people there used to celebrate every day, when there was no hardware failure, and the record was four times in one month. But have they complained that the producers of hardware were lying to them, stating years of MTBF? No. And that's because they knew the basics of mathematics and knew how to use FDIV opcode in their brains. The only solution is redundancy.

--
Karma: Positive (probably because of superiour intellect)

Re:As a sidenote by NickDngr · 2003-06-19 13:27 · Score: 5, Funny

DISCLAIMER: The views expressed hereafter are not necessarily those of MENSA, which I am only a member of.

Shouldn't that be "The views expressed hereafter are not necessarily those of MENSA, of which I am only a member." I would think proper grammar usage would be a prerequisite for being a MENSA member.

--
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?
Re:As a sidenote by pompousjerk · 2003-06-19 13:50 · Score: 0, Offtopic

I would think proper grammar usage would be a prerequisite for being a MENSA member.

Mmmm-hmm. You understood it; the "grammar" rule you are citing is artificial. English isn't created by a bunch of old guys in Universities; it's in every mind, and it's constantly changing. "Don't split infinitives," "don't use no double negatives" (an especially stupid rule, considering you have to think about it to not understand the sentence!), and the like tend to do little more than limit the expressiveness of English.

Apparently you just need to feel good about yourself or something. ("Look at me! I pointed out the error of a Mensa member....wheeeeee.....")

Obligatory plug of a book that will explain it: The Language Instinct, second-to-last chapter.

Also, mods: ad hominem, offtopic (this is as well, so fire away if you see fit)
Re:As a sidenote by Anonymous Coward · 2003-06-19 13:53 · Score: 0

I would think proper grammar usage would be a prerequisite for being a MENSA member.

Well, it isn't.
Re:As a sidenote by The+Clockwork+Troll · 2003-06-19 14:03 · Score: 4, Funny

I would think proper grammar usage would be a prerequisite for being a MENSA member.
<input type="radio" name="gift" value="IQ" disabled>
<input type="radio" name="gift" value="money" disabled>
<input type="radio" name="gift" value="penis size" disabled>
<input type="radio" name="gift" value="ability to nitpick trivia" checked>

--

There are no karma whores, only moderation johns
Re:As a sidenote by Blkdeath · 2003-06-19 14:15 · Score: 1

Apparently you just need to feel good about yourself or something. ("Look at me! I pointed out the error of a Mensa member....wheeeeee.....")

What of the pomposity of foisting membership in a group of self-righteous egotists on a public forum? Over-compensation, anyone?
The only problem with her scheme is that Mensa membership is easy to come by. Some people just find it an undesirable association. But if she feels she has to emphasize her membership because perhaps her posts can't stand on their own merits, well, that's fine by me.

--
BD Phone Home!
Shameless plug. Like you weren't expecting it.
Re:As a sidenote by bluephone · 2003-06-19 15:48 · Score: 1

I found it much more gratifying to turn MENSA down, personally. "Please join!" Umm, no thanks. Maybe I'm just an easily amused genius... :)

--
jX [ Make everything as simple as possible, but no simpler. - Einstein ]
Re:As a sidenote by itwerx · 2003-06-20 03:50 · Score: 1

I couldn't agree more. I knew a guy who was a Mensa member and after a protracted attempt on his part to get me to join I finally went to a meeting.
Maybe it was just that particular chapter but I got the distinct impression that they were a bunch of egotistical losers!

Marketing BS... by Alomex · 2003-06-19 13:23 · Score: 3, Insightful

Anybody who has a large number of drives running knows that the figures have become meaningless over time. They used to predict to the T the expected time of failure. They are now a marketing term assuming "a duty cycle" and computed by an absurd "units x time to failure". Using that system, the MTBF of the Honda Civic engine is 100,000 years as there are 1 million Civics out there and none of them had their engine seize up in the first month.

Somebody ought to sue them for deceptive advertisement.

No, you are wrong by anthony_dipierro · 2003-06-19 13:25 · Score: 3, Insightful

Actually, you are wrong... If you have one drive fail per year for 20 years, then the mean time between failures is 10.5 years.

Re:No, you are wrong by The+Clockwork+Troll · 2003-06-19 13:30 · Score: 1

Uh, how do you mean? (literally)
Is there some special definition for MTBF that changes how "mean time between" is interpreted?

--

There are no karma whores, only moderation johns
Re:No, you are wrong by anthony_dipierro · 2003-06-19 13:40 · Score: 1

A mean is an average. The average of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 is 10.5.
Re:No, you are wrong by Mensa+Babe · 2003-06-19 13:50 · Score: 1

Is there some special definition for MTBF that changes how "mean time between" is interpreted?

If by some special definition you mean a simple linear multiplication (or division, depending on your point of view), then yes. Anthony Dipierro probably was mistakenly thinking about a decibel or other logarithmically scaled unit system.

--
Karma: Positive (probably because of superiour intellect)
Re:No, you are wrong by The+Clockwork+Troll · 2003-06-19 13:52 · Score: 2, Informative

So you are using the average of the times measured between deployment of hardware and individual drive failure, as opposed to the mean time between failures of individual drives, i.e. the mean time between necessary hardware replacements, which is what I would have thought more useful for evaluating hardware.
I would have called your average "mean time to failure" vs. "mean time between failures."
But as other posters mentioned, most of these stats are marketing bunk no matter how they're computed!

--

There are no karma whores, only moderation johns
Re:No, you are wrong by Anonymous Coward · 2003-06-19 13:52 · Score: 0

That's the mode, moron. Mode is the "middle" # in a sequence (given array a, where a.length() = n, sort a, then take a[n/2]). Mean is: double sum=0; for(int i=0; i
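
For what it's worth, the "middle" value of a sorted sequence is conventionally called the median rather than the mode (the mode is the most frequent value), and the sum divided by the count is the mean. A quick check with Python's standard `statistics` module, using the 1-to-20 sequence from the parent comment:

```python
import statistics

years_to_failure = list(range(1, 21))  # one drive failing each year, years 1..20

mean = statistics.mean(years_to_failure)      # sum divided by count
median = statistics.median(years_to_failure)  # middle of the sorted values
print(mean, median)
```

For this particular sequence the mean and the median happen to coincide.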
Re:No, you are wrong by pompousjerk · 2003-06-19 13:53 · Score: 1

Yeah, but you can have one for 0 years, too. :)
Re:No, you are wrong by crmartin · 2003-06-19 13:54 · Score: 1

Your interpretation is correct, assuming the exponential distribution (which is the common assumption.)
Re:No, you are wrong by anthony_dipierro · 2003-06-19 13:58 · Score: 1

I would have called your average "mean time to failure" vs. "mean time between failures."

Or better yet, "mean time before failure," thus preserving the acronym.
Re:No, you are wrong by Anonymous Coward · 2003-06-19 14:02 · Score: 0

http://www.t-cubed.com/faq_mtbf.htm
Re:No, you are wrong by The+Clockwork+Troll · 2003-06-19 17:33 · Score: 1

Do drive failures tend to arrive in something resembling a Poisson manner, in practice?

--

There are no karma whores, only moderation johns
Re:No, you are wrong by crmartin · 2003-06-20 05:02 · Score: 1

Yes. Strictly -- as someone else pointed out -- the failures tend to have a so-called "bathtub" distribution. That is, there's a high failure rate at first ("infant mortality") followed by a long Poisson/Markovian/exponential (you pick your term) stretch, followed by a higher failure rate as it gets old. In general, the "lifetime" of a component is the time to the inflection at the end of the exponential portion.

Enter The Matrix... by HaloZero · 2003-06-19 13:25 · Score: 2, Funny

Calculating the Mean Time Between Failures?
I prefer to measure time by the emergence of one integral anomaly to the next.

--
Informatus Technologicus

MTBF... by m0rph3us0 · 2003-06-19 13:28 · Score: 3, Insightful

The best way to determine *REAL* MTBF is how long the drives are warrantied for; no one warranties a product longer than it is supposed to last. When you see a company reduce its warranty, expect quality to drop in accordance.

Yes, I am right. by Mensa+Babe · 2003-06-19 13:32 · Score: 1

Actually, you are wrong... If you have one drive fail per year for 20 years, then the mean time between failures is 10.5 years.

If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once. These are twenty failures in twenty-year period on average, id est one failure per year. It is actually a matter of very simple mathematics.

--
Karma: Positive (probably because of superiour intellect)

Re:Yes, I am right. by Anonymous Coward · 2003-06-19 13:34 · Score: 0

Hey MENSA chick, your sentence ends in a preposition.
Maybe it's a different MENSA...
Re:Yes, I am right. by iamroot · 2003-06-19 13:46 · Score: 2, Informative

Mean Time Before Failure is the MEAN time before the disk would fail.

If they all failed within 20 years, how would the average disk have failed in 20 years???

MTBF (Mean Time Between Failures): Average time (expressed in
hours) that a component works without failure. It is calculated by dividing
the total number of operating hours observed by the total number of
failures. Also, the length of time a user may reasonably expect a device or
system to work before an incapacitating fault occurs.

20 drives/20 years is 1 drive a year.
Re:Yes, I am right. by Anonymous Coward · 2003-06-19 13:49 · Score: 0

Ya know, maybe you should spend less time lying about mensa membership and more time learning to try and not be so obvious with the fact that it's all a lie. Proving yourself an idiot three times in one thread of one story is quite an accomplishment.

They rate it like Lightbulbs... by Anonymous Coward · 2003-06-19 13:38 · Score: 0

They pick that one in the attic you turn on every Christmas for 15 minutes and say that it will last 10 years of occasional use...

MTBF calculation and estimation by crmartin · 2003-06-19 13:40 · Score: 4, Informative

You know, it's almost a shame to screw up the amusing notions /.ers come up with by adding actual information, but I can't help it, all those years of teaching I guess.

Okay, first of all: "mean time between failures" is obviously a statistical measure -- it is an average over a large number of individual items. In most electronic components (including light bulbs!) the statistical distribution of the time between failures is the exponential distribution, which has the odd property that it's "memory-less" -- it doesn't matter how long since the last failure it's been, the mean time to the next failure will still be the same. A consequence of this is that if the MTBF is 10,000 hours, the probability of failure in any particular hour would be 1/10,000th. So, if you set up 10,000 components, all running simultaneously, you'd expect one of them to fail within the first hour; conversely, if you ran them for 1000 hours, and 998 of them failed, you could be fairly certain that the MTBF would be around 10,000 hours.
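
A small sketch of those two properties, assuming exponentially distributed lifetimes with the 10,000-hour MTBF used above. The function names are invented for this example.

```python
import math

MTBF_HOURS = 10_000.0

def prob_fail_within(hours, mtbf=MTBF_HOURS):
    """P(failure within `hours`) for an exponential lifetime with mean `mtbf`."""
    return 1.0 - math.exp(-hours / mtbf)

def prob_fail_next_hour_given_age(age_hours, mtbf=MTBF_HOURS):
    """Conditional probability of failing in the next hour, given survival so far."""
    survive_age = math.exp(-age_hours / mtbf)
    survive_age_plus_1 = math.exp(-(age_hours + 1.0) / mtbf)
    return (survive_age - survive_age_plus_1) / survive_age

p_hour = prob_fail_within(1.0)
print(p_hour)                                  # close to 1/10,000
print(prob_fail_next_hour_given_age(5_000.0))  # same value: memoryless
print(10_000 * p_hour)                         # expected failures in hour one
```

The conditional probability does not depend on `age_hours` at all, which is exactly the "memory-less" property.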

Note, by the way, that this is only true when the failure time distribution is exponential -- so it works for electronic components, but not for, say, bicycles and cars and roller skates, which are more likely to fail the older they get.

This has an obvious problem, of course: if the MTBF is high, it can take forever to test. Consider, for example, something I worked on for NASA some years ago: trying to prove that a fly-by-wire system will have a mean time between failures of 1e10 hours. (This is about the same failure rate as the airframe, which is how they came up with the number.) 1e10 hours is about 1.141 million years, by the way.

(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, eg TWA 800? Hint: It doesn't require any foul play.)

At that point, you've got a couple of choices: first, you can make a lot of copies and run them simultaneously. Relatively easy for $50 disks, hard for billion dollar 747s.

Second, you can make the estimate by computation and modeling which is what you do for web systems. Conceptually, it's pretty simple to do this, although it can be a kind of pain in the ass.

The third way, which is new and cool, is by Bayesian estimation of failure rates. This method lets you make increasingly accurate estimates of the failure rate based on short experiments. I don't have time to go into it, but there are some good sources available on the web.
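
One common textbook form of this idea, offered only as a sketch and not necessarily the method meant above, is the conjugate Gamma-Poisson update: put a Gamma prior on the failure rate, then add observed failures to the shape and accumulated unit-hours to the rate. The prior and the experiment data below are invented for illustration.

```python
def update_gamma(prior_shape, prior_rate, failures, unit_hours):
    """Conjugate update: Gamma prior on the failure rate, Poisson failure counts."""
    return prior_shape + failures, prior_rate + unit_hours

def posterior_mean_rate(shape, rate):
    """Mean of a Gamma(shape, rate) distribution."""
    return shape / rate

# Hypothetical prior: roughly "one failure per 100,000 unit-hours", weakly held.
shape, rate = 1.0, 100_000.0

# Hypothetical short experiments: (failures observed, unit-hours accumulated).
experiments = [(0, 50_000.0), (1, 50_000.0), (0, 100_000.0)]

for failures, unit_hours in experiments:
    shape, rate = update_gamma(shape, rate, failures, unit_hours)
    # Plug-in estimate: reciprocal of the posterior mean failure rate.
    mtbf_estimate = 1.0 / posterior_mean_rate(shape, rate)
    print(f"after {failures} failure(s) in {unit_hours:,.0f} h: "
          f"MTBF estimate {mtbf_estimate:,.0f} h")
```

Each short experiment nudges the estimate, and the growing shape parameter means the posterior gets narrower as evidence accumulates.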

Re:MTBF calculation and estimation by crmartin · 2003-06-19 13:52 · Score: 2, Informative

Actually, here's some more references: at CiteSeer, a good (if expensive) book on practical examples, and my favorite textbook. I'll shut up now.
Re:MTBF calculation and estimation by Anonymous Coward · 2003-06-19 13:57 · Score: 0

Sir, I pass on the legendary Crown of Geekhood. Tradition dictates that the Crown is passed on when the wearer discovers one who is geekier than they are.

Wear the Crown proudly, and may you reign in peace!
Re:MTBF calculation and estimation by Anonymous Coward · 2003-06-19 14:38 · Score: 0

"if the MTBF is 10,000 hours, the probability of failure in any particular hour would be 1/10,000th."
The probability of the part failing after some time t is e^(-t/10,000).
Re:MTBF calculation and estimation by plsuh · 2003-06-19 15:42 · Score: 1

AMEN! I distinctly regret that I don't have mod points right now. I have an ABD in Economics and years of work experience doing econometrics, and this analysis nails how the MTBF calculation is done exactly.

One quibble I have (more with the HD manufacturers, not crmartin) is that HD have mechanical components (spindle, actuator, etc.) that are subject to wear. As a result, MTBF calculations that are appropriate for solid state electronic equipment not subject to physical wear are likely inappropriate for HD's.

--Paul
Re:MTBF calculation and estimation by CharlieG · 2003-06-19 17:49 · Score: 2, Informative

Of course, the real problem is that neither electronics nor mechanical items have an exponential failure curve

Mechanical items tend to fail due to wearout - aka, they become more likely to fail as time goes on

Electronics follow a "Bathtub" curve. A high initial rate, that rapidly drops to a VERY low rate. It stays at that LOW rate of failure for MANY hours, and then the rate increases rapidly during "wearout" - sort of like the cross section of a bathtub - hence the name
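
The shape can be sketched by adding up three Weibull failure rates: one that falls with age, one that is constant, and one that rises. Every parameter here is made up purely to draw the curve; none of it describes real hardware.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull(shape, scale) lifetime at time t."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def bathtub_hazard(t):
    """Sum of three competing failure mechanisms (all parameters are made up)."""
    infant = weibull_hazard(t, shape=0.5, scale=2_000_000.0)   # falls with age
    steady = weibull_hazard(t, shape=1.0, scale=500_000.0)     # constant
    wearout = weibull_hazard(t, shape=5.0, scale=60_000.0)     # rises with age
    return infant + steady + wearout

for hours in (10, 1_000, 20_000, 50_000, 80_000):
    print(f"{hours:>6} h: {bathtub_hazard(hours):.2e} failures per hour")
```

The printed rates start high, drop toward a floor, then climb again as the wear-out term takes over.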

The whole concept of "Burn In" - or better yet, Stress Screening is to remove the initial high rate of failure defects, without removing ANY of the bottom of the curve. A properly defined Stress Test can do this. These tests usually involve some sort of temperature/power cycling, along with some sort of vibration testing (usually a pseudo random vibration profile)

I got my start in programming while running a stress screening lab. When the tests run 24/7, you either automate your data collection, or have folks work 24/7 - guess which is cheaper? So, I got to design the tests, and then write the software to run the test racks

--
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Re:MTBF calculation and estimation by crmartin · 2003-06-20 04:57 · Score: 1

Sure, but in a "bathtub" distribution it's approximately exponential over most of the lifetime, so it's a decent approximation.
Re:MTBF calculation and estimation by crmartin · 2003-06-20 05:09 · Score: 1

Work it out; IF it's an exponential, THEN for failure rate parameter lambda, mean time between failures is 1/lambda. It's too painful to try to show this in HTML, but Trivedi's book cited above ("favorite text") shows the derivation.
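
For readers without the book to hand, the standard derivation is short. Assuming an exponential lifetime $T$ with density $f(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, integration by parts gives the mean:

```latex
\begin{aligned}
E[T] &= \int_0^\infty t \,\lambda e^{-\lambda t}\, dt \\
     &= \Bigl[ -t\, e^{-\lambda t} \Bigr]_0^\infty + \int_0^\infty e^{-\lambda t}\, dt \\
     &= 0 + \frac{1}{\lambda} \\
     &= \frac{1}{\lambda}.
\end{aligned}
```

So under the exponential assumption, and only under it, the MTBF is simply the reciprocal of the failure rate.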
Re:MTBF calculation and estimation by crmartin · 2003-06-20 05:26 · Score: 1

Thank you. I will try to reign ineffectively from home, as all good geeks would prefer.

Sadly, on some online geek test recently, I not only scored high, I scored #4 among everyone who had ever taken the test.

Then I couldn't decide whether to be embarrassed at being that much of a geek, or at only scoring fourth.
Re:MTBF calculation and estimation by Anonymous Coward · 2003-06-20 07:00 · Score: 0

If you have a failure rate of lambda, then MTBF = 1/lambda, as you said. However, you can also compute the probability that the part will operate without failure for a given time t. This probability is given by e^(-t/MTBF) = e^(-t*lambda). Of course this is all assuming exponential distribution, which is where you get all these formulas in the first place.
Re:MTBF calculation and estimation by crmartin · 2003-06-20 07:26 · Score: 1

Right. Just an ambiguity in what you were saying: 'after time t'. You're giving failure in time (t-t0), PMF instead of PDF.
Re:MTBF calculation and estimation by CharlieG · 2003-06-21 04:28 · Score: 1

Yes, it's exponential in the "Non-interesting" part of the curve. When you look at total failures (as a percentage), less than 10% of the total parts will fail during the "Bottom" of the curve, and the rest are fairly evenly split between the 2 sides. It's one of the reasons that companies can offer extended warranties on electronics as cheaply as they do, and for it to be their greatest profit center

Buy the product, use and abuse it during the original warranty period, and it'll break if it's going to break. REAL quality minded mfgs can do stress screening for less $$$ than the return cost of repairs, and cut the infant mortality (in the field) to almost nothing. This is where the often heard "Quality is free" mantra came from. It's that Quality can cost less than repairs do, and has the side benefit of getting a rep for good products

--
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
Re:MTBF calculation and estimation by crmartin · 2003-06-21 12:59 · Score: 1

I'm not quite sure why you read like we're arguing, since I don't think we are -- except to the extent that you don't think the lengthy period of Markovian behavior is interesting, while I think that's the most interesting part.

The point is this: the MTBF is computed for the Markovian part of the total life. The value for MTBF is computed as 1 / failure-rate IF AND ONLY IF the time distribution of failures is Markovian -- otherwise it's a more complicated function. The useful life is the length of time over which the failure rate is exponential: when you get to the upward inflection, that's the end of the useful life. Thus if a drive has a useful life of 5 years, and a failure rate of 1/50 years (that is, a MTBF of 50 years) then the chances are one in ten that the drive will fail during that five years.
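
The "one in ten" figure can be checked directly, assuming exponentially distributed lifetimes. The function name is invented for this sketch.

```python
import math

def prob_fail_within(years, mtbf_years):
    """P(failure within `years`) for an exponential lifetime with mean `mtbf_years`."""
    return 1.0 - math.exp(-years / mtbf_years)

p_exact = prob_fail_within(5.0, 50.0)
p_linear = 5.0 / 50.0   # the back-of-envelope version: useful life / MTBF
print(f"exact: {p_exact:.4f}, linear approximation: {p_linear:.4f}")
```

The linear shortcut is close whenever the useful life is short compared with the MTBF, and drifts upward from the exact value as the ratio grows.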

The warranty issue is something else: a warranty is exactly like insurance -- it's a bet that you won't have to pay off on the warranty too often. The amount of money you have to charge for the warranty is the amount at risk times the chances of a failure over the lifetime of the warranty. Thus if you buy a $100 CD player, and they offer you a $20 extended warranty for 3 years, they're saying they think the odds are less than one in five that they'll have to pay off during those three years. Given the number of people who promptly lose the paperwork, that's usually an excellent bet; as you say, that makes the extended warranty a profit center all in itself.
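
A sketch of that break-even reasoning, using the $100 player and $20 three-year warranty from the example. Converting the break-even probability into an implied MTBF additionally assumes exponential lifetimes; both function names are hypothetical.

```python
import math

def break_even_failure_probability(warranty_price, amount_at_risk):
    """Failure probability at which the expected payout equals the price charged."""
    return warranty_price / amount_at_risk

def mtbf_for_failure_probability(p_fail, period_years):
    """MTBF (years) giving probability p_fail of failing within the period,
    assuming exponentially distributed lifetimes."""
    return -period_years / math.log(1.0 - p_fail)

p = break_even_failure_probability(20.0, 100.0)
mtbf = mtbf_for_failure_probability(p, 3.0)
print(p, round(mtbf, 1))
```

In other words, the seller comes out ahead as long as the product's MTBF is longer than the printed figure, before even counting customers who never claim.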

No, you're wrong... by anthony_dipierro · 2003-06-19 13:44 · Score: 1

If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once.

No, you're wrong. The average time to failure is 20 years, not the maximum.

The average of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 is 10.5.

Re:No, you're wrong... by benjamindees · 2003-06-19 16:59 · Score: 1

Right, she's just making the assumption that the drives were all bought at different times.

That might be valid when you work in a datacenter that replaces x number of drives every year.

On average, if the MTBF of those drives is 20 years, one drive in such a group will fail every year.

--
"I assumed blithely that there were no elves out there in the darkness"

mod parent up - he's right by Paul+Jakma · 2003-06-19 13:52 · Score: 1

someone mod up the parent, they are correct, the mensa babe, despite their mensa membership, is wrong. MTBF of 20 drives, one failing each year is (Sigma(n=1 to 20){n})/20 which is 10.5 years.

A quick thought experiment should make it obvious that for a series of numbers (eg number of years between failure) where the highest number is n, that the mean of these numbers could never be n.

--
I use Friend/Foe + mod-point modifiers as a karma/reputation system.

Re:mod parent up - he's right by Paul+Jakma · 2003-06-19 13:55 · Score: 1

urg... with proviso that at least one of the series of numbers is not equal to n.

--
I use Friend/Foe + mod-point modifiers as a karma/reputation system.
Re:mod parent up - he's right by anthony_dipierro · 2003-06-19 14:11 · Score: 1

Hmm... Actually I just thought of something. If you have 20 hard drives, and one fails each year, but you fix it immediately after failure and then put it into service again, then the MTBF would be 20 years.

I guess that's why the term is the mean time between failures rather than the mean time before failure. Back when the term was invented it probably made economic sense to fix a hard drive when it breaks. Nowadays we're more likely to just throw it away.
Re:mod parent up - he's right by Paul+Jakma · 2003-06-19 14:20 · Score: 1

nice try, but if something fails and you then fix it, it has still failed. :)

--
I use Friend/Foe + mod-point modifiers as a karma/reputation system.
Re:mod parent up - he's right by anthony_dipierro · 2003-06-19 14:27 · Score: 1

Yeah, but the total operating time for each drive is 20 years if you fix them and they last. What I'm saying is, one drive breaks after 1 year, you fix it and it lasts the next 19, one drive breaks after 2 years, you fix it and it lasts the next 18, etc.

Alternatively, you could have 19 of the drives last 20 years, and the other one break once a year.

Actually, you wouldn't even have to fix them, if you replace them. If you keep 20 drives for 20 years, and one breaks every year, if you replace the one that breaks (immediately), that's an MTBF of 20 years.
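
The two readings being argued over in this subthread can be put side by side. This sketch just restates the numbers already in the comments: scenario A averages the ages at which the original drives die, scenario B divides total operating time by the failure count when failed drives are replaced immediately.

```python
# Scenario A: 20 drives, one fails at the end of each year, no replacement.
# The statistic is the mean time TO failure of the original drives.
times_to_failure = list(range(1, 21))
mean_time_to_failure = sum(times_to_failure) / len(times_to_failure)

# Scenario B: 20 slots kept populated for 20 years; each failed drive is
# replaced (or repaired) immediately, and one failure occurs per year.
slots, years, failures = 20, 20, 20
total_operating_years = slots * years
mean_time_between_failures = total_operating_years / failures

print(mean_time_to_failure, mean_time_between_failures)
```

Both posters' figures fall out of the same failure pattern; they differ only in what gets counted as operating time.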

So it's possible (though unlikely) that the person who wrote the original post was right, but just was talking about a different scenario than I thought.
Re:mod parent up - he's right by Paul+Jakma · 2003-06-19 14:37 · Score: 1

if it fails after a year, and you fix it and it lasts another 19 years, MTBF is (1+19)/2 = 10. It's failed twice on you.

--
I use Friend/Foe + mod-point modifiers as a karma/reputation system.
Re:mod parent up - he's right by anthony_dipierro · 2003-06-19 14:48 · Score: 1

if it fails after a year, and you fix it and it lasts another 19 years, MTBF is (1+19)/2 = 10. It's failed twice on you.

No, I never said it failed a second time. I said it failed after a year, then it lasted another 19 years. Then you stopped the experiment.
Re:mod parent up - he's right by Anonymous Coward · 2003-06-19 17:35 · Score: 0

hahahahaha. awesome.
Re:mod parent up - he's right by photon317 · 2003-06-20 00:13 · Score: 2, Informative

Whatever your math may say, the Industry's standard is the one you disagree with. If they stick 1000 drives in an array, run them for 1 year, and only a single drive fails during that year, the MTBF for that model of drive is 500 years.

--
11*43+456^2
Re:mod parent up - he's right by sstamps · 2003-06-20 04:23 · Score: 1

Anthony,

I think you are right in this case, even with that consideration.

Since every drive failed once within a 20-year period, then we have all of our data points on the lower half of the normal curve (with the mean of 20 being in the middle), and none above. Thus, the mean for this part of the experiment would actually be around 10 years. However, if we were to continue the experiment out to 40 years or longer, and gain more samples, then it might balance out, but those repaired drives would have to be "better than new", because they would have to make up for the large shortfall that occurred early-on.

That's why, once you look at this crap practically, the truth comes to light that one cannot expect abstracts to directly represent reality. No more so than unit-less data points can be translated to unit-ed ones.

--
-SS "Teach the ignorant, care for the dumb, and punish the stupid."
Re:mod parent up - he's right by anthony_dipierro · 2003-06-20 05:12 · Score: 1

Well, here's the thing. The way the numbers work, you don't have to wait to see each drive fail before calculating the MTBF. As soon as one drive fails, you can make a calculation. In fact, that's one of the reasons that hard drives usually have unrealistically high MTBFs. They test 500,000 drives for 1 hour, and only see 1 failure, so they call the MTBF 500,000 hours. But in reality the expected lifetime of a hard drive is not so cut and dried. It might have a very low probability of failing after 1 hour, but a much higher probability of failing after 1000 hours.
Re:mod parent up - he's right by crmartin · 2003-06-20 05:39 · Score: 1

That's the difference between "statistical inference" and what people who have had a probability class tend to do. If you've got 500,000 drives, test for an hour, and get one failure, you've got some evidence that the failure rate is 1 per 500,000 hours (== MTBF of 500,000 hours in the exponential case.) But the confidence in that estimate is very low. If you then replace the failed drive and do another 1 hour test, and get another single failure, then you still can estimate the MTBF as 500,000 hours -- but your confidence level is higher.

This trick is the key to the Bayesian estimation methods I mentioned above. Check out the CiteSeer and Google links for more papers than you ever wanted to read.

Now, consider if you get >1 failure in that second hour: there are several possible explanations:
- the failure rate is increasing over time
- the failure rate is 1/500,000 but it just happened that you had two random failures in that one hour
- the failure rate is really more than 1/500,000 and it just happened randomly that the first hour had only one failure.

In general, you've got to test a lot of items for a fair chunk of the expected lifetime to get the confidence level up.
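
To put a number on the "two random failures in one hour" explanation, here is a minimal sketch that treats failures as Poisson events at the hypothesised rate of 1 per 500,000 drive-hours. The Poisson model and the helper name are assumptions for illustration, not something stated in the post.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)

drives = 500_000
rate_per_drive_hour = 1 / 500_000   # hypothesised constant failure rate
hours = 1
lam = drives * rate_per_drive_hour * hours  # expected failures during the test

# Chance of seeing two or more failures purely by luck at that rate
p_two_or_more = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(f"P(2 or more failures) = {p_two_or_more:.3f}")
```

Under those assumptions the chance works out to 1 - 2/e, roughly a quarter, so a single one-hour run with two failures says very little about which of the three explanations is the right one.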
Re:mod parent up - he's right by anthony_dipierro · 2003-06-20 06:51 · Score: 1

That's the difference between "statistical inference" and what people who have had a probability class tend to do.

I don't know if it matters as much whether or not the person has had a probability class as whether or not the person is trying to legally boost MTBF ratings.

Sure, you can get better figures using different methods, but those better figures most likely will be lower, so why bother if you want to sell your product.
Re:mod parent up - he's right by crmartin · 2003-06-20 07:29 · Score: 1

Anthony, that's not really true. First of all, for a long part of the lifetime, exponential is a very good approximation. I can't cite it offhand, but I read a paper that showed it was better than 95% accurate. Secondly, the Bayesian method will give measures that you can advertise as correct (and people like Telcordia are very stubborn about it) with a relatively short and inexpensive experiment.
Re:mod parent up - he's right by Paul+Jakma · 2003-06-20 12:15 · Score: 1

*sigh* how did you get modded as informative?

The original post that started this thread (from mensa babe) stated "run 20 drives of MTBF 20 years, then one fails each year", a poster replied to say they were incorrect, it would mean drives had MTBF of 10.5 years. And I replied to say the correctee was right. (which they are).

Your example is for 1000 drives, running for 1 year, 1 failure. Going by the same math which I used to back up the aforementioned correctee, i.e. MTBF = Sigma(runtime)/failures, which is:

(1000 drives * 1 year)/(1 failure) = MTBF of 1000 years.

This is the textbook definition of MTBF, quoted in the glossary linked to in the story. So you're still wrong... and the math still holds.

NB: this does not take lifetime into account at all (they might /all/ fail after 1.1 years). See other posters' explanations for that. MTBF is a measure of the failure rate observed in a sample over a given period of time rather than an actual quantitative measure of how long something will likely run before it breaks. Hence you will get vastly different figures for different sample sets/time periods, the two examples in this post being prime examples:

20 drives, 20 years, 1 fail per year
= 10.5 year MTBF

Vs

1000 drives, 1 year, 1 fail
= 1000 years MTBF

These are both textbook MTBF figures.

Which demonstrates the biggest problem with MTBF: it's mostly meaningless unless one knows the details of the sample set and the conditions used to obtain the MTBF.
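
The two figures above can be reproduced with a short sketch of the Sigma(runtime)/failures formula. The function name is illustrative, and the single failure in the second case is assumed to happen right at the end of the year.

```python
def mtbf(runtimes, failures):
    """Textbook MTBF: summed runtime of every unit divided by observed failures."""
    return sum(runtimes) / failures

# 20 drives, no replacement, one dies at the end of each year: runtimes 1..20 years
case_a = mtbf(range(1, 21), failures=20)

# 1000 drives run for 1 year with a single failure (assumed to occur at the very end)
case_b = mtbf([1] * 1000, failures=1)

print(case_a, case_b)  # 10.5 1000.0
```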

--
I use Friend/Foe + mod-point modifiers as a karma/reputation system.

Ack!! The irony is too much. by OwnerOfWhinyCat · 2003-06-19 14:08 · Score: 1

I'm not picking. REALLY. It's just too funny that someone complaining about bad mathematics in the thread, and the C compiler, in the same post, would live on a planet with a 10 month year.

float y = (float) 1/2 * x; // should yield better results.

Re:Ack!! The irony is too much. by Anonymous Coward · 2003-06-19 14:28 · Score: 0

Ever heard the term "back of the envelope calculation"?

No, You are wrong by Anonymous Coward · 2003-06-19 14:27 · Score: 1, Informative

The discrepancy emerges because you do not operate the drives past their end-of-life when you make the MTBF calculation. To illustrate, assume you have a particular model of hard drive with an MTBF of 57 years, and it reaches end-of-life after 5 years. What you can then conclude is that if you replace the drive every 5 years with a new drive (of the same model and MTBF), then you can expect your first failure at 57 years. Keep in mind that this is really just the most probable time for a failure to occur. Given any other time, it is possible to calculate the probability of that time being your first occurrence of failure; 57 years just gives the greatest probability in this calculation (in this example). It does not mean that you should expect any single drive to operate for 57 years.

Why is hardware always mean? by cookd · 2003-06-19 14:37 · Score: 4, Funny

Whatever happened to *NICE* time between failure?

--
Time flies like an arrow. Fruit flies like a banana.

Re:Why is hardware always mean? by MarkGriz · 2003-06-20 01:55 · Score: 1

Actually, the *NICE* time is before failure.
The *MEAN* time isn't until after it fails.

--
Beauty is in the eye of the beerholder.

It all depends on the distribution... by anthony_dipierro · 2003-06-19 14:46 · Score: 3, Informative

I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?

Let's say I have a drive that has a 99% chance of failing after 10 years, and a 1% chance of failing after 4710 years. The MTBF is 57 years.

In fact, with the proper distribution (think 2^n) you could have an infinite MTBF, but still have a 99% chance of failure within 10 years. See for example the St. Petersburg paradox.
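
A quick sketch to check the arithmetic. The first part uses the two-point distribution from the post. The second is a plain St. Petersburg-style toy (lifetime 2^n years with probability 2^-n), which I'm assuming only to show how a mean can diverge; it does not reproduce the 99% figure.

```python
from fractions import Fraction

# Two-point lifetime distribution from the post: (probability, lifetime in years)
outcomes = [(Fraction(99, 100), 10), (Fraction(1, 100), 4710)]
mean_lifetime = sum(p * years for p, years in outcomes)
print(mean_lifetime)  # 57

def st_petersburg_partial_mean(terms):
    """Partial mean when lifetime 2**n years has probability 2**-n, n = 1..terms."""
    return sum(Fraction(1, 2**n) * 2**n for n in range(1, terms + 1))

# Every term contributes exactly 1 year, so the mean grows without bound.
print(st_petersburg_partial_mean(10), st_petersburg_partial_mean(1000))  # 10 1000
```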

Re:It all depends on the distribution... by metamatic · 2003-06-20 07:58 · Score: 1

Let's say I have a drive that has a 99% chance of failing after 1 year, and only a 1% chance of lasting for 10 years. There's a 99% chance I bought it from Micropolis.

--
GCHQ Quantum Insert installed. If only our tongues were made of glass, how much more careful we would be when we speak

Sidestepping sidenotes by chewy · 2003-06-19 23:07 · Score: 1

After reading the original poster's (Mensa Babe) slashdot-blurp, I have come to the conclusion that she must be a very intelligent girl with a lot of anger towards all males with their brains in the wrong place.

As for the language bit, a LOT of really intelligent people are totally dyslexic when it comes to grammar and spelling, since it makes no sense anyway IMHO :)

You've come to the right place. by Compact+Dick · 2003-06-20 00:36 · Score: 2, Funny

I'm sure our resident expert is more than willing to help.

--
Use ISO 8601 dates [YYYY-MM-DD]

Design Lifetime by Detritus · 2003-06-20 01:00 · Score: 2, Interesting

MTBF figures are usually associated with a design lifetime. That hard drive may have a 300,000 hour MTBF based on a 100% duty cycle and a 5 year design lifetime. That tells you the expected failure rate for the first 5 years of operation. After that point, the failure rate may increase rapidly.
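
As a rough illustration of what such a figure implies inside the design lifetime, here is a sketch that assumes a constant failure rate (exponential lifetimes). That assumption, and the function names, are mine and not part of any manufacturer's spec.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def annual_failure_probability(mtbf_hours, duty_cycle=1.0):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    powered_hours = HOURS_PER_YEAR * duty_cycle
    return 1 - math.exp(-powered_hours / mtbf_hours)

def survival_probability(mtbf_hours, years, duty_cycle=1.0):
    """Chance that one drive is still running after `years` under the same assumption."""
    return math.exp(-years * HOURS_PER_YEAR * duty_cycle / mtbf_hours)

print(f"{annual_failure_probability(300_000):.1%} chance of failure per year")
print(f"{survival_probability(300_000, 5):.1%} chance of surviving 5 years")
```

So a 300,000 hour MTBF at 100% duty cycle corresponds to roughly a 3% chance of failure per drive per year, which says nothing about what happens after the 5 year mark.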

--
Mea navis aericumbens anguillis abundat

failure. by leuk_he · 2003-06-20 01:56 · Score: 1

Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, e.g. TWA 800? (Hint: It doesn't require any foul play.)

Let me try (please reply if I am heading in the right direction):

1. This 1 in a million years does not count when the airplane is burning, or some other component has failed.
2. The airplane has lots of components. Suppose that a failed door could lead to failure of the airframe. If the plane has 3 doors, the chance goes to 4 failures in a million years. LOTS of components can fail. There are lots of planes and lots of years.

Re:failure. by crmartin · 2003-06-20 04:42 · Score: 1

"There are lots of planes and lots of years" is the right answer: I don't recall the exact figures right now, but at the time of the TWA 800 crash I predicted that it would turn out to be an airframe failure (on the heuristic of preferring failure to malice) because when I worked the numbers it turned out that MTBF of 747s was about that same 1e10.

By the way, your supposition about answer (1) is correct. It's just a definitional thing: we're really talking about proximate and root causes. You don't count it as an airframe failure if someone throws a cigarette butt into the trash and smoke and fire causes the crash, and while the airframe certainly fails if someone flies the plane into a mountain, that's still not the immediate cause.

All too simple by agentZ · 2003-06-20 02:38 · Score: 1

I think we're looking too deeply into this concept. The Western Digital definition for MTBF says, the MTBF "is calculated by dividing the total number of operating hours observed by the total number of failures. Also, the length of time a user may reasonably expect a device or system to work before an incapacitating fault occurs."

This means that they hook a whole bunch of drives up and run them for a while. Add the total hours of drive operation up and divide by the number of failures. Their estimate of 500,000 hours could correspond to the following experiment:

Connect 100 drives and run them, 24/7, for 200 days. This yields 100 * 24 * 200 = 480,000, or about 500,000, drive hours. If there was only one drive failure in that time, then you could say that the MTBF is 500,000 hours. Granted, that's not 500,000 hours for ONE drive, but it's spread across all drive hours completed.

Re:All too simple by Anonymous Coward · 2003-06-20 03:12 · Score: 0

You are close. This is the real answer to the question. What you miss is that once a drive fails, it will be replaced with another drive. You do this because you need to get a total operating time of 500,000 hours (this number will be determined statistically, so that the experiment is meaningful). If one of the drives fails, then after 200 days of 24/7 operation you have something less than 500,000, and the experiment has to continue until you reach that 500,000 hour mark. This isn't good for deadlines, so the drives get replaced.
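
A small sketch of why replacement helps the schedule: with every slot kept filled, the time needed to hit a drive-hour target is fixed in advance. The helper name and the 100-slot rig are assumptions carried over from the parent's example.

```python
def days_to_reach(target_drive_hours, slots, hours_per_day=24):
    """Days of testing needed when every slot is kept filled by swapping failed drives."""
    return target_drive_hours / (slots * hours_per_day)

print(days_to_reach(500_000, slots=100))  # about 208.3 days
```

Without replacement, each failed drive stops contributing hours and the end date slips by however much runtime was lost.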

Confusion? by Anonymous Coward · 2003-06-20 03:14 · Score: 0

Are we confusing mean time between failures (MTBF) with mean time to failure (MTTF)?

Something else is going on . . . by nixman99 · 2003-06-20 03:41 · Score: 1

My lab has about 5 each 3524, 3548, and 3550-24, average age 2.5 years, and no failures (hardware or crashes). Other offices I know of have had similar experiences with the 3500 series.

Either you just happened to get a bad batch, or you've got environmental problems. Make sure there is sufficient air flow around the units, and check the power harmonics on the circuit. Most consumer grade (read cheap) electronics use crappy power supplies which cause harmonics on the power line. One or two isn't a big deal, but if you get 7 or 8 cheap computers and hubs on the same circuit, you will have problems.

Bad Example by Detritus · 2003-06-20 03:47 · Score: 1

I had a VW Jetta that blew its engine when it was a few weeks old. One of the pistons disintegrated due to a defect in manufacturing. The dealer sent the parts back to the manufacturer for failure analysis. It's rare, but such things do happen.

--
Mea navis aericumbens anguillis abundat

MTBF = infinity by frink_exp · 2003-06-20 04:54 · Score: 1

I guess it depends on what you consider failure, but if it really fails then it can only happen once! The next failure is never coming 'cause there's no way it can fail again so MTBF = infinity

--
'Q' is for Dr. Tran

What MTBF Really means... by giberti · 2003-06-20 07:45 · Score: 1

(drives * hours) / failures = MTBF

Where:
failures = actual number of drives RETURNED to manufacturer
drives = total number of production drives built
hours = actual failure point (about 5 years)

--

AF-Design, web development.

Common misconceptions... by SagSaw · 2003-06-20 10:34 · Score: 1

For most items, the MTBF is not how long you can expect an item to operate without failures. For most MTBF calculations, half of the units will fail before 33% or so of the MTBF. (I don't have the derivation of this number in front of me, but I can probably dig it up if somebody wants it).

As far as the high MTBFs mentioned by the submitter, I can think of at least two perfectly valid methods that accurately determine long MTBFs:

First, you have the theoretical MTBF. This is where you look at the chance of failure for each (important) component and mathematically determine how long a unit would be expected to survive.

Second, you can test the device to failure under accelerated environmental conditions (high temperature, over-voltage, excessive vibrations, etc.) and extrapolate from the accelerated failure testing the failure distribution under typical conditions.

Neither method is perfect, but both are better than assuming that because only 1 unit out of 100 failed in the first year, the product has an MTBF of 50.5 years.

--
Come test your mettle in the world of Alter Aeon!

I wonder how complex their model is by Anonymous Coward · 2003-06-20 19:50 · Score: 0

The definition WD references in their glossary is really ambiguous.

MTBF (Mean Time Between Failures) - Average time (expressed in hours) that a component works without failure. It is calculated by dividing the total number of operating hours observed by the total number of failures. Also, the length of time a user may reasonably expect a device or system to work before an incapacitating fault occurs.

#1 - In the marketplace - just b/c the component "exists" doesn't mean it's in use.
#2 - How do they estimate the mean operating time? (There are whole graduate courses on estimation of statistical parameters.)
#3 - How do they deal with censored data (i.e. HDs that never fail during the study period)? When do they terminate the study?

I'm sure given a decent sized data set, and the ambiguity of this def'n, I could come up with a MTBF that met any kind of marketing goal.

Poisson Distribution... by Anonymous Coward · 2003-06-22 07:01 · Score: 0

Mean Time Between Failures is usually assumed to follow from a Poisson process. Given that assumption, one can take N hard drives and run them for three or four months. Based upon how many of those N die in the first three to four months, one can make a pretty good stab as to the average time it takes one of their drives to fail.
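
A minimal sketch of that estimate, assuming exponential lifetimes (what a Poisson process gives you) and that failed drives are not replaced during the test. The test numbers below are hypothetical, not from any real drive.

```python
import math

def mtbf_from_failed_fraction(n_drives, failed, test_hours):
    """Estimate MTBF assuming exponential lifetimes and no replacement of failed drives."""
    surviving_fraction = 1 - failed / n_drives
    # Under the exponential model, surviving_fraction = exp(-test_hours / MTBF)
    return -test_hours / math.log(surviving_fraction)

# Hypothetical test: 1000 drives, 120 days, 5 failures
estimate = mtbf_from_failed_fraction(1000, 5, 120 * 24)
print(f"{estimate:,.0f} hours")
```

When only a small fraction fails, this lands very close to the simpler drives * hours / failures figure, which is why a few months of testing can produce a six-digit MTBF.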

Slashdot Mirror

100 comments