Calculating the Mean Time Between Failures?
Blue Booger asks: "I was looking over some fibrechannel hard drives and noticed that the Mean Time Between Failures was rated at 1.2 million hours. I thought that was pretty high, and figured it up to be close to 137 YEARS!! I went to check some regular IDE drives just for comparison, and they were rated at 500,000 hours (57 years). Now, as I understand it, this is supposed to be the average time that you can expect the drive to last before failures. I rarely have an IDE drive last more than 4 years, and my record is 10 years, so what is the deal? BTW, that is 57 years running 24 hours a day...the MTBF is rated as power on time. Here you can find Western Digital's glossary that defines the term MTBF (pdf). Here you can find a spec sheet on one of their 20GB IDE drives. I checked, and Seagate also lists similar MTBFs. How the heck are they coming up with these numbers?"
Cisco used to sell Catalyst 3548XL switches that were listed as having an MTBF of 120,000+ hours. Their current replacement for that line (3550) comes in at 163,000+ hours. We had 7 of 24 3548XL switches fail in the first year we had them. They had poor air flow from a tiny fan, no heatsinks and tons of hot chips. The newer model has the same issue, though they did stuff a cheap foam baffle in the case to get air to flow closer to the chips, none of which have heatsinks. I have no idea how they tested them and got an MTBF of 13 years.
If you have twenty drives with a twenty-year MTBF (Mean Time Between Failures) each, then you have one failure per year on average. Basic statistics are always fighting against you.
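To put a number on the twenty-drives point, here is a minimal sketch. It assumes a constant failure rate and independent drives. The function names are made up for illustration.

```python
import math

def expected_failures(n_units, mtbf_years, years=1.0):
    # Constant-rate assumption: each unit fails at 1/MTBF per year,
    # so a fleet of n units sees n * years / MTBF failures on average.
    return n_units * years / mtbf_years

def prob_at_least_one_failure(n_units, mtbf_years, years=1.0):
    # With exponential lifetimes, the chance that no unit fails is
    # exp(-expected failures); take the complement.
    return 1.0 - math.exp(-expected_failures(n_units, mtbf_years, years))

fleet_failures = expected_failures(20, 20.0)
fleet_risk = prob_at_least_one_failure(20, 20.0)
print(fleet_failures)        # 1.0
print(round(fleet_risk, 3))  # 0.632
```

So even with a 20-year MTBF per drive, the odds of losing at least one of the twenty in a given year are roughly 63% under these assumptions.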
Karma: Positive (probably because of superiour intellect)
For example, say a manufacturer shipped 2 million drives last year and each ran 2080 hours (8 hours * 5 days * 52 weeks), roughly 4.16 billion hours total. Out of those 2 million units, they got 3466 returns. So the average MTBF was about 1.2 million hours.
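The field-return arithmetic is just total device-hours divided by the number of failures. A quick sketch follows. The function name is hypothetical, and treating every return as exactly one failure is an assumption.

```python
def field_mtbf_hours(units, hours_per_unit, failures):
    # MTBF estimate = accumulated device-hours / observed failures.
    total_device_hours = units * hours_per_unit
    return total_device_hours / failures

units_shipped = 2_000_000
hours_per_unit = 8 * 5 * 52   # 2080 hours: 8 h/day, 5 days/week, 52 weeks
returns = 3466

total_hours = units_shipped * hours_per_unit
mtbf = field_mtbf_hours(units_shipped, hours_per_unit, returns)
print(total_hours)   # 4160000000
print(round(mtbf))   # about 1.2 million hours
```

Note that no single drive in this calculation ran for more than 2080 hours, which is how a 137-year figure can come out of one year of data.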
"Eve of Destruction", it's not just for old hippies anymore...
That's the key word.
:)
MTBF is probably determined by taking a bunch of drives and putting them into PERFECT conditions that NEVER exist in the real world, then running them in a way that, although it tests all functionality, really doesn't reflect true conditions for drives (i.e. head always reading/writing up and down the disk, probably never seeking, disks always spinning, etc.). Something that drives never do in real life. Statistics...statistics...statistics...(speeling too
DISCLAIMER: The views expressed hereafter are not necessarily those of MENSA, which I am only a member of.
Shouldn't that be "The views expressed hereafter are not necessarily those of MENSA, of which I am only a member." I would think proper grammar usage would be a prerequisite for being a MENSA member.
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?
You know, it's almost a shame to screw up the amusing notions /.ers come up with by adding actual information, but I can't help it, all those years of teaching I guess.
Okay, first of all: "mean time between failures" is obviously a statistical measure -- it is an average over a large number of individual items. In most electronic components (including light bulbs!) the statistical distribution of the time between failures is the exponential distribution, which has the odd property that it's "memory-less" -- it doesn't matter how long it's been since the last failure, the mean time to the next failure will still be the same. A consequence of this is that if the MTBF is 10,000 hours, the probability of failure in any particular hour would be about 1/10,000th. So, if you set up 10,000 components, all running simultaneously, you'd expect one of them to fail within the first hour; conversely, if you ran those 10,000 for 1000 hours, and 998 of them failed, you could be fairly certain that the MTBF would be around 10,000 hours.
Note, by the way, that this is only true when the failure time distribution is exponential -- so it works for electronic components, but not for, say, bicycles and cars and roller skates, which are more likely to fail the older they get.
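For anyone who wants to check the exponential-distribution claims numerically, here is a small sketch. The helper names are invented for illustration. It assumes failed components are not replaced during the run.

```python
import math

def prob_fail_within(t_hours, mtbf_hours):
    # Exponential lifetime: P(failure by time t) = 1 - exp(-t / MTBF).
    return 1.0 - math.exp(-t_hours / mtbf_hours)

def prob_fail_within_given_survived(t_hours, survived_hours, mtbf_hours):
    # P(fail in the next t hours | already survived s hours).
    surv_s = math.exp(-survived_hours / mtbf_hours)
    surv_s_plus_t = math.exp(-(survived_hours + t_hours) / mtbf_hours)
    return (surv_s - surv_s_plus_t) / surv_s

mtbf = 10_000.0
n_components = 10_000

first_hour = n_components * prob_fail_within(1, mtbf)         # close to 1
after_1000_hours = n_components * prob_fail_within(1000, mtbf)  # about 952

# Memorylessness: a unit that has already run 5000 hours is no more
# likely to fail in the next 1000 hours than a brand-new one.
fresh = prob_fail_within(1000, mtbf)
aged = prob_fail_within_given_survived(1000, 5000, mtbf)
print(first_hour, after_1000_hours, fresh, aged)
```

Without replacement the expected count after 1000 hours is a bit under 1000, because the units that have already failed can't fail again. It's still in the same ballpark as the figure quoted above.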
This has an obvious problem, of course: if the MTBF is high, it can take forever to test. Consider, for example, something I worked on for NASA some years ago: trying to prove that a fly-by-wire system will have a mean time between failures of 1e10 hours. (This is about the same failure rate as the airframe, which is how they came up with the number.) 1e10 hours is about 1.141 million years, by the way.
(Pop quiz: if MTBF is a million years, how do you explain the occasional airframe failure, e.g., TWA 800? Hint: It doesn't require any foul play.)
At that point, you've got a couple of choices: first, you can make a lot of copies and run them simultaneously. Relatively easy for $50 disks, hard for billion dollar 747s.
Second, you can make the estimate by computation and modeling which is what you do for web systems. Conceptually, it's pretty simple to do this, although it can be a kind of pain in the ass.
The third way, which is new and cool, is by Bayesian estimation of failure rates. This method lets you make increasingly accurate estimates of the failure rate based on short experiments. I don't have time to go into it, but there are some good sources available on the web.
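The post doesn't say which Bayesian method is meant, so the following is only one common textbook formulation, not necessarily what was used. It puts a Gamma prior on a constant failure rate and updates it with failures observed over accumulated device-hours. The class name, the prior values and the test data are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GammaRateBelief:
    # Gamma(shape, rate) belief about the failure rate, in failures per hour.
    shape: float         # acts like a pseudo-count of failures
    rate: float          # acts like pseudo device-hours observed

    def update(self, failures, device_hours):
        # Conjugate update for exponential lifetimes / Poisson failure counts.
        return GammaRateBelief(self.shape + failures, self.rate + device_hours)

    @property
    def mean_failure_rate(self):
        return self.shape / self.rate

    @property
    def mtbf_estimate(self):
        # Reciprocal of the posterior mean failure rate (a point estimate,
        # not the posterior mean of the MTBF itself).
        return self.rate / self.shape

# Hypothetical weak prior: "about 1 failure per 100,000 device-hours".
belief = GammaRateBelief(shape=1.0, rate=100_000.0)

# Hypothetical short experiments: (failures, device-hours accumulated).
after_first = belief.update(2, 1_000_000.0)
after_second = after_first.update(1, 2_000_000.0)

print(round(after_first.mtbf_estimate))   # 366667
print(round(after_second.mtbf_estimate))  # 775000
```

Each short run tightens the estimate a little, which is the appeal: you don't have to wait for a full MTBF's worth of hours on any one unit.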
<input type="radio" name="gift" value="money" disabled>
<input type="radio" name="gift" value="penis size" disabled>
<input type="radio" name="gift" value="ability to nitpick trivia" checked>
There are no karma whores, only moderation johns
Whatever happened to *NICE* time between failure?
Time flies like an arrow. Fruit flies like a banana.