Calculating the Mean Time Between Failures?
Blue Booger asks: "I was looking over some fibrechannel hard drives and noticed that the Mean Time Between Failures was rated at 1.2 million hours. I thought that was pretty high, and figured it up to be close to 137 YEARS!! I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal? BTW, that is 57 years running 24 hours a day...the MTBF is rated as power on time. Here you can find Western Digital's glossary that defines the term MTBF (pdf). Here you can find a spec sheet on one of their 20GB IDE drives. I checked, and Seagate also lists similar MTBFs. How the heck are they coming up with these numbers?"
Usually they have a duty cycle associated with an MTBF which can drastically alter the MTBF at a 100% duty cycle.
Just make Some Wild Ass Guess (SWAG).
Like, my hard drive has a MTBF of 300,000 hours.
Cisco used to sell Catalyst 3548XL switches that were listed as having a MTBF of 120,000+ hours. Their current replacement for that line (3550) comes in at 163,000+ hours. We had 7 of 24 3548XL switches fail in the first year we had them. They had poor air flow from a tiny fan, no heatsinks and tons of hot chips. The newer model has the same issue, though they did stuff a cheap foam baffle in the case to get air to flow closer to the chips, none of which have heatsinks. I have no idea how they tested them and got a MTBF of 13 years.
If you have twenty drives with twenty years MTBF (Mean Time Between Failures) each, then you have one failure per year on average. These are basic statistics, always fighting against you.
Karma: Positive (probably because of superiour intellect)
For example, suppose a manufacturer shipped 2 million drives last year and each ran 2080 hours (8 hours * 5 days * 52 weeks), roughly 4.16 billion hours in total. Out of those 2 million units, they got 3466 returns. So the MTBF works out to about 1.2 million hours.
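The arithmetic in this example is easy to check with a few lines of Python. This is only a sketch: the function name is made up, and the 8 hours a day, 5 days a week usage pattern is an assumption chosen because it reproduces the 2080 hours quoted.

```python
def field_mtbf(units, hours_per_unit, failures):
    """Point estimate of MTBF: total operating hours divided by failures."""
    if failures <= 0:
        raise ValueError("need at least one failure for a point estimate")
    return units * hours_per_unit / failures


# Assumed usage pattern: 8 hours/day, 5 days/week, 52 weeks/year.
hours_per_year = 8 * 5 * 52
total_hours = 2_000_000 * hours_per_year
mtbf = field_mtbf(2_000_000, hours_per_year, 3466)

print(f"{total_hours:,} drive-hours in the field")
print(f"MTBF estimate: {mtbf:,.0f} hours")
```

Note that the total comes out in the billions of drive-hours, and that drives which failed but were never returned do not show up in the count at all.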
"Eve of Destruction", it's not just for old hippies anymore...
If they run 500 drives for 1,000 hours and observe only one failure, that is an MTBF of 500,000 hours.
Unfortunately, that equation doesn't take into account the fact that some equipment degrades over time; if a product is very reliable for 1,000 hours, and less reliable after that, just double the sample size (maybe triple for statistics), and see what you get.
Real reliability calculations are much more difficult than just what users think MTBF means...
That's the key word.
:)
MTBF is probably determined by taking a bunch of drives and putting them into PERFECT conditions that NEVER exist in the real world. They run them in a way that, although it tests all functionality, really doesn't reflect true conditions for drives (i.e. head always reading/writing up and down the disk, probably never seeking, disks always spinning, etc.). Something that drives never do in real life. Statistics...statistics...statistics...(speeling too
I might add, that when I was contracted at a server farm, people there used to celebrate every day, when there was no hardware failure, and the record was four times in one month. But have they complained that the producers of hardware were lying to them, stating years of MTBF? No. And that's because they knew the basics of mathematics and knew how to use FDIV opcode in their brains. The only solution is redundancy.
Anybody who has a large number of drives running knows that the figures have become meaningless over time. They used to predict the expected time of failure to a T. They are now a marketing term assuming "a duty cycle" and computed by an absurd "units x time to failure". Using that system, the MTBF of the Honda Civic engine is 100,000 years, as there are 1 million Civics out there and none of them had their engine seize up in the first month.
Somebody ought to sue them for deceptive advertisement.
Actually, you are wrong... If you have one drive fail per year for 20 years, then the mean time between failures is 10.5 years.
I prefer to measure time by the emergence of one integral anomaly to the next.
Informatus Technologicus
The best way to determine *REAL* MTBF is how long the drives are warrantied for; no one warranties a product longer than it is supposed to last. When you see a company reduce its warranty, expect quality to drop accordingly.
If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once. These are twenty failures in twenty-year period on average, id est one failure per year. It is actually a matter of very simple mathematics.
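The fleet arithmetic here can be written down in a couple of lines. A rough sketch, assuming a constant failure rate and that every failed drive is swapped for an identical new one; the function names and the 2,000-drive server farm are made up for illustration.

```python
def fleet_failures_per_year(n_drives, mtbf_years):
    """Expected failures per year across a fleet (failed drives are replaced)."""
    return n_drives / mtbf_years


def years_between_fleet_failures(n_drives, mtbf_years):
    """Average gap between failures somewhere in the fleet."""
    return mtbf_years / n_drives


# Twenty drives, each with a twenty-year MTBF.
small_fleet = fleet_failures_per_year(20, 20)

# Hypothetical server farm: 2,000 drives rated at 57 years (about 500,000 hours).
farm = fleet_failures_per_year(2_000, 57)

print(f"20 drives: {small_fleet:.1f} failures/year")
print(f"2,000 drives: {farm:.1f} failures/year, "
      f"one every {years_between_fleet_failures(2_000, 57) * 365:.1f} days")
```

The per-drive MTBF stays the same in both cases; only the number of drives sharing the failure rate changes.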
They pick that one in the attic you turn on every Christmas for 15 minutes and say that it will last 10 years of occasional use...
You know, it's almost a shame to screw up the amusing notions /.ers come up with by adding actual information, but I can't help it, all those years of teaching I guess.
Okay, first of all: "mean time between failures" is obviously a statistical measure -- it is an average over a large number of individual items. In most electronic components (including light bulbs!) the statistical distribution of the time between failures is the exponential distribution, which has the odd property that it's "memory-less" -- it doesn't matter how long since the last failure it's been, the mean time to the next failure will still be the same. A consequence of this is that if the MTBF is 10,000 hours, the probability of failure in any particular hour would be 1/10,000th. So, if you set up 10,000 components, all running simultaneously, you'd expect one of them to fail within the first hour; conversely, if you ran them for 1000 hours, and 998 of them failed, you could be fairly certain that the MTBF would be around 10,000 hours.
Note, by the way, that this is only true when the failure time distribution is exponential -- so it works for electronic components, but not for, say, bicycles and cars and roller skates, which are more likely to fail the older they get.
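For anyone who wants to poke at the memoryless property, here is a minimal sketch. It assumes the exponential model described above; the function name and the 5,000-hour checkpoint are just picked for illustration.

```python
import math


def survival(t, mtbf):
    """P(no failure by time t) under an exponential failure model."""
    return math.exp(-t / mtbf)


mtbf = 10_000.0  # hours

# Probability that a brand-new unit fails during its first hour.
p_new = 1 - survival(1, mtbf)

# Probability that a unit which already survived 5,000 hours fails in the next hour.
p_old = 1 - survival(5_001, mtbf) / survival(5_000, mtbf)

# Expected failures in the first hour among 10,000 units running side by side.
expected_first_hour = 10_000 * p_new

# Fraction of units still alive when the clock reaches the MTBF itself.
alive_at_mtbf = survival(mtbf, mtbf)

print(p_new, p_old, expected_first_hour, alive_at_mtbf)
```

The two hourly probabilities agree (up to floating-point noise), which is the memoryless property. The last value is exp(-1): well under half the units are still running at the MTBF.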
This has an obvious problem, of course: if the MTBF is high, it can take forever to test. Consider, for example, something I worked on for NASA some years ago: trying to prove that a fly-by-wire system will have a mean time between failures of 1e10 hours. (This is about the same failure rate as the airframe, which is how they came up with the number.) 1e10 hours is about 1.141 million years, by the way.
(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, eg TWA 800? Hint: It doesn't require any foul play.)
At that point, you've got a couple of choices: first, you can make a lot of copies and run them simultaneously. Relatively easy for $50 disks, hard for billion dollar 747s.
Second, you can make the estimate by computation and modeling which is what you do for web systems. Conceptually, it's pretty simple to do this, although it can be a kind of pain in the ass.
The third way, which is new and cool, is by Bayesian estimation of failure rates. This method lets you make increasingly accurate estimates of the failure rate based on short experiments. I don't have time to go into it, but there are some good sources available on the web.
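To give a flavor of the Bayesian approach, here is one textbook version, not necessarily the method the poster had in mind: put a Gamma prior on the failure rate, treat failures as a Poisson process, and update after each short experiment. The prior and the three test runs below are invented numbers.

```python
def update(prior_shape, prior_hours, failures, test_hours):
    """Gamma prior on the failure rate + Poisson-process data -> Gamma posterior.

    The Gamma is parameterized by a shape (pseudo-failures) and a rate
    (pseudo-hours); conjugacy means the update is just two additions.
    """
    return prior_shape + failures, prior_hours + test_hours


# Hypothetical prior: roughly "one failure per 100,000 hours".
shape, hours = 1.0, 100_000.0

# Hypothetical short experiments: (failures seen, drive-hours accumulated).
experiments = [(0, 50_000.0), (1, 200_000.0), (0, 150_000.0)]

for failures, test_hours in experiments:
    shape, hours = update(shape, hours, failures, test_hours)
    rate = shape / hours  # posterior mean failure rate, per hour
    print(f"after {test_hours:>9,.0f} h with {failures} failure(s): "
          f"rate ~ {rate:.2e}/h, implied MTBF ~ {1 / rate:,.0f} h")
```

Each short run tightens the estimate a little, and a run with zero failures still carries information, which is the attraction when the MTBF is far too long to test directly.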
If I have twenty drives, each of which is estimated to fail once in a twenty-year period, then in such a twenty-year period every one of those disks is estimated to fail once.
No, you're wrong. The average time to failure is 20 years, not the maximum.
The average of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 is 10.5.
Someone mod up the parent; they are correct. The Mensa babe, despite their Mensa membership, is wrong. The MTBF of 20 drives, one failing each year, is (Sigma(n=1 to 20){n})/20, which is 10.5 years.
A quick thought experiment should make it obvious that for a series of numbers (eg number of years between failure) where the highest number is n, that the mean of these numbers could never be n.
I use Friend/Foe + mod-point modifiers as a karma/reputation system.
I'm not picking. REALLY. It's just too funny that someone complaining about bad mathematics in the thread, and the C compiler, in the same post, would live on a planet with a 10 month year.
// Should yield better results: the cast applies to the 1 before the division, so this is 0.5f * x.
float y = (float) 1 / 2 * x;
The discrepancy emerges because you do not operate the drives past their end-of-life when you make the MTBF calculation. To illustrate, assume you have a particular model of hard drive with a MTBF of 57 years, and it reaches end-of-life after 5 years. What you can then conclude is that if you replace the drive every 5 years with a new drive (of the same model and MTBF), then you can expect your first failure at 57 years. Keep in mind that this is really just the most probable time for a failure to occur. Given any other time, it is possible to calculate the probability of that time being your first occurrence of failure; 57 years just gives the greatest probability in this calculation (in this example). It does not mean that you should expect any single drive to operate for 57 years.
Whatever happened to *NICE* time between failure?
Time flies like an arrow. Fruit flies like a banana.
I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal?
Let's say I have a drive that has a 99% chance of failing after 10 years, and a 1% chance of failing after 4710 years. The MTBF is 57 years.
In fact, with the proper distribution (think 2^n) you could have an infinite MTBF, but still have a 99% chance of failure within 10 years. See for example the St. Petersburg paradox.
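Both claims are quick to check numerically. A small sketch; the function names are made up, and the second function is the plain St. Petersburg setup (lifetime 2^n years with probability 2^-n), cut off after a finite number of terms, rather than a distribution tuned to the 99% figure.

```python
def mean_lifetime(outcomes):
    """Mean of a discrete lifetime distribution given (probability, years) pairs."""
    return sum(p * years for p, years in outcomes)


# 99% of drives die at 10 years, 1% last 4710 years.
two_point = [(0.99, 10), (0.01, 4710)]
print(f"two-point mean: {mean_lifetime(two_point):.1f} years")


def truncated_st_petersburg_mean(n_terms):
    """Mean contribution of the first n_terms outcomes of a 2**n-year lifetime
    that occurs with probability 2**-n. Every term contributes exactly 1 year."""
    return sum((0.5 ** n) * (2 ** n) for n in range(1, n_terms + 1))


for n_terms in (10, 20, 50):
    print(n_terms, truncated_st_petersburg_mean(n_terms))
```

The truncated mean keeps growing with the number of terms included, so the full sum has no finite value, even though long lifetimes are individually very unlikely.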
After reading the original poster's (Mensa Babe) slashdot-blurp, I have come to the conclusion that she must be a very intelligent girl with a lot of anger towards all males with their brains in the wrong place.
:)
As for the language bit, a lot of really intelligent people are totally dyslexic when it comes to grammar and spelling, since it makes no sense anyway IMHO
I'm sure our resident expert is more than willing to help.
Use ISO 8601 dates [YYYY-MM-DD]
MTBF figures are usually associated with a design lifetime. That hard drive may have a 300,000 hour MTBF based on a 100% duty cycle and a 5 year design lifetime. That tells you the expected failure rate for the first 5 years of operation. After that point, the failure rate may increase rapidly.
Mea navis aericumbens anguillis abundat
(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, say, eg TWA 800? Hint: It doesn't require any foul play.)
Let me try (please reply if I am heading in the right direction):
1. This 1-in-a-million-years figure does not count cases where the airplane is burning or some other component has failed.
2. The airplane has lots of components. Suppose that if a door fails, this could lead to failure of the airframe. If the plane has 3 doors, the odds worsen to 4 failures in a million years. LOTS of components can fail. There are lots of planes and lots of years.
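The second guess can be made concrete. A sketch, assuming every component fails independently at a constant rate and that any single failure brings down the whole system; the door MTBFs and the fleet size are hypothetical.

```python
def series_mtbf(component_mtbfs):
    """MTBF of a system that fails as soon as any one component fails.

    With independent, constant failure rates, the rates simply add up.
    """
    total_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / total_rate


airframe_years = 1_000_000.0
door_years = [1_000_000.0] * 3  # hypothetical: three doors, same MTBF as the airframe

plane_years = series_mtbf([airframe_years] + door_years)

fleet_size = 1_000  # hypothetical number of planes flying
years_between_fleet_failures = plane_years / fleet_size

print(f"one plane: MTBF ~ {plane_years:,.0f} years")
print(f"fleet of {fleet_size:,}: one failure every ~{years_between_fleet_failures:,.0f} years")
```

Four equal failure sources give 4 failures per million years for one plane, and spreading that over a large fleet shrinks the gap between failures further.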
I think we're looking too deeply into this concept. The Western Digital definition for MTBF says, the MTBF "is calculated by dividing the total number of operating hours observed by the total number of failures. Also, the length of time a user may reasonably expect a device or system to work before an incapacitating fault occurs."
This means that they hook a whole bunch of drives up and run them for a while. Add the total hours of drive operation up and divide by the number of failures. Their estimate of 500,000 hours could correspond to the following experiment:
Connect 100 drives and run them, 24/7, for 200 days. This yields 100 * 24 * 200 = 480,000, or about 500,000 drive hours. If there was only one drive failure in that time, then you could say that the MTBF is 500,000 hours. Granted, that's not 500,000 hours for ONE drive, but it's spread across all drive hours completed.
Are we confusing mean time between failures (MTBF) with mean time to failure (MTTF)?
My lab has about 5 each 3524, 3548, and 3550-24, average age 2.5 years, and no failures (hardware or crashes). Other offices I know of have had similar experiences with the 3500 series.
Either you just happened to get a bad batch, or you've got environmental problems. Make sure there is sufficient air flow around the units, and check the power harmonics on the circuit. Most consumer grade (read cheap) electronics use crappy power supplies, which cause harmonics on the power line. One or two isn't a big deal, but if you get 7 or 8 cheap computers and hubs on the same circuit, you will have problems.
I had a VW Jetta that blew its engine when it was a few weeks old. One of the pistons disintegrated due to a defect in manufacturing. The dealer sent the parts back to the manufacturer for failure analysis. It's rare, but such things do happen.
I guess it depends on what you consider failure, but if it really fails then it can only happen once! The next failure is never coming 'cause there's no way it can fail again so MTBF = infinity
'Q' is for Dr. Tran
(drives * hours) / failures = MTBF
Where:
failures = actual number of drives RETURNED to manufacturer
drives = total number of production drives built
hours = actual failure point (about 5 years)
AF-Design, web development.
For most items, the MTBF is not how long you can expect an item to operate without failures. Under the exponential model that most MTBF calculations assume, half of the units will fail before 69% or so of the MTBF, and roughly 63% will have failed by the MTBF itself. (I don't have the derivation of these numbers in front of me, but I can probably dig it up if somebody wants it.)
As far as the high MTBFs mentioned by the submitter, I can think of at least two perfectly valid methods that accurately determine long MTBFs:
First, you have the theoretical MTBF. This is where you look at the chance of failure for each (important) component and mathematically determine how long a unit would be expected to survive.
Second, you can test the device to failure under accelerated environmental conditions (high temperature, over-voltage, excessive vibrations, etc.) and extrapolate from the accelerated failure testing the failure distribution under typical conditions.
Neither method is perfect, but better than assuming that because only 1 unit out of 100 failed in the first year, the product has an MTBF of 50.5 years.
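For the accelerated-testing route, one common way to do the extrapolation for temperature stress is the Arrhenius model. That is my choice of model for this sketch, not something the post specifies, and the 0.7 eV activation energy and the 40 C / 85 C temperatures are assumed values picked only for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV per kelvin


def arrhenius_acceleration(ea_ev, use_temp_c, stress_temp_c):
    """How many hours at use temperature one hour at stress temperature is worth,
    under the Arrhenius model with activation energy ea_ev (in eV)."""
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1 / t_use - 1 / t_stress))


# Assumed values: 0.7 eV activation energy, 40 C in the field, 85 C in the oven.
factor = arrhenius_acceleration(0.7, 40.0, 85.0)

oven_hours = 1_000.0
print(f"acceleration factor ~ {factor:.1f}")
print(f"{oven_hours:,.0f} oven hours ~ {oven_hours * factor:,.0f} field hours")
```

The result is very sensitive to the activation energy, which is one reason extrapolated MTBFs deserve some skepticism.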
Come test your mettle in the world of Alter Aeon!
The definition WD references in their glossary is really ambiguous.
MTBF (Mean Time Between Failures) - Average time (expressed in hours) that a component works without failure. It is calculated by dividing the total number of operating hours observed by the total number of failures. Also, the length of time a user may reasonably expect a device or system to work before an incapacitating fault occurs.
#1 - In the marketplace - just b/c the component "exists" doesn't mean it's in use.
#2 - How do they estimate the mean operating time? (There are whole graduate courses on estimation of statistical parameters.)
#3 - How do they deal with censored data (i.e. HDs that never fail during the study period)? When do they terminate the study?
I'm sure given a decent sized data set, and the ambiguity of this def'n, I could come up with a MTBF that met any kind of marketing goal.
Failures are usually assumed to follow a Poisson process. Given that assumption, one can take N hard drives and run them for three or four months. Based upon how many of those N die in the first three to four months, one can make a pretty good stab at the average time it takes one of their drives to fail.
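How good a stab that is depends on how many failures actually show up. A sketch under the same Poisson assumption, with a hypothetical test of 100 drives run around the clock for 100 days; the candidate MTBF values are arbitrary.

```python
import math


def prob_at_most(k, drive_hours, mtbf):
    """P(at most k failures) over drive_hours of testing, for a given true MTBF,
    assuming failures arrive as a Poisson process."""
    m = drive_hours / mtbf  # expected number of failures
    return sum(math.exp(-m) * m ** i / math.factorial(i) for i in range(k + 1))


# Hypothetical test: 100 drives, 24 hours a day, 100 days.
drive_hours = 100 * 24 * 100

for true_mtbf in (100_000, 500_000, 1_200_000):
    p = prob_at_most(1, drive_hours, true_mtbf)
    print(f"true MTBF {true_mtbf:>9,} h: P(0 or 1 failures) = {p:.3f}")
```

Seeing zero or one failure is quite likely under both the 500,000-hour and the 1.2-million-hour candidates, so a short test like this rules out bad drives much better than it separates good ones from great ones.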