Disk Failure Rates More Myth Than Metric

Never had a drive fail by Jafafa Hots · 2008-04-05 07:34 · Score: 4, Interesting

I've gone through many over the years, replacing them as they became too small - still using some small ones many years old for minor tasks, etc. - and the only drive I've ever had partially fail is the one I accidentally launched across a room.

I don't understand how people are always complaining about their hard drives failing. In 30 years it hasn't happened to me yet.

I'm about to lug a huge Wang hard drive out to the trash pickup on Monday - weighs over 100 pounds... still runs. Actually it uses removable platters but still...

--
This space available.

Re:Never had a drive fail by Anonymous Coward · 2008-04-05 07:43 · Score: 5, Funny

Wait. You've got a huge Wang, and you're throwing it out? D00d, that's just uncool. Give it to someone else at least. It would be fun to ask people "wanna come see my huge Wang?" just to see their reaction! :)

hah. captcha word: largest
Re:Never had a drive fail by Anonymous Coward · 2008-04-05 07:44 · Score: 3, Insightful

Drive failures are actually fairly common, but usually the failures are due to cooling issues. Given that most PCs aren't really set up to ensure decent hard drive cooling, it is probable that the failure ratings are inflated due to operation outside of the expected operational parameters (which are probably not conservative enough for real usage). In my opinion, if you have more than a single hard drive closely stacked in your case you should have some sort of hard drive fan.
Re:Never had a drive fail by serviscope_minor · 2008-04-05 08:08 · Score: 5, Funny

I'm about to lug a huge Wang hard drive out to the trash pickup on Monday - weighs over 100 pounds... still runs. Actually it uses removable platters but still...

<Indiana Jones> IT BELONGS IN A MUSEUM!</Indiana Jones>

--
SJW n. One who posts facts.
Re:Never had a drive fail by hedwards · 2008-04-05 08:09 · Score: 3, Informative

I think cooling issues are somewhat less common than most people think, but they are definitely significant. And I wouldn't care to suggest that people neglect to handle heat dissipation on general principle.

Dirty, spiky power is a much larger problem. A few years back I had 3 or 4 nearly identical WD 80gig drives die within a couple of months of each other. They were replaced with identical drives that are still chugging along fine all this time later. The only major difference is that I gave each system a cheapo UPS.

Being somewhat cheap, I tend to use disks until they wear out completely. After a few years I shift the disks to storing things which are permanently archived elsewhere, or to swap. Seems to work out fine; the only problem is what happens if the swap goes bad while I'm using it.
Re:Never had a drive fail by GIL_Dude · 2008-04-05 08:09 · Score: 3, Insightful

I'd agree with you there; I have had probably 8 or 9 hard drives fail over the years (I currently have 10 running in the house right now and I have 8 running at my desk at work, so I do have a lot of drives). I am sure that I have caused some of the failures by just what you are talking about - I've maxed out the cases (for example my server has 4 drives in it, but was designed for 2 - I had to make my own bracket to jam the 4th in there, the 3rd went in place of a floppy). But I've never done anything about cooling and I probably caused this myself. Although to hear the noises coming from some of the platters when they failed I'm sure at least a couple weren't just heat. For example at work I have had 2 drives fail in just bog standard HP Compaq dc7700 desktops (without cramming in extra stuff). Sometimes they just up and die, other times I must have helped them along with heat.
Re:Never had a drive fail by kesuki · 2008-04-05 08:13 · Score: 3, Informative

And I had 5 fail this year. Welcome to the law of averages. Note: I own about 15 hard drives, including the 5 that failed.

--
https://www.gnu.org/philosophy/free-sw.html
Re:Never had a drive fail by STrinity · 2008-04-05 08:26 · Score: 4, Funny

I'm about to lug a huge Wang
There needs to be a -1 "Too Easy" moderation option.

--
Les Miserables Volume 1 now up with my reading of
Re:Never had a drive fail by afidel · 2008-04-05 09:01 · Score: 4, Informative

I would tend to agree with that. I run a datacenter that's cooled to 74 degrees and has good clean power from the online UPSes, and I've had 6 drive failures out of about 500 drives over the last 22 months. Three were from older servers that weren't always properly cooled (the company had a crappy AC unit in their old data closet). The other three all died in their first month or two after installation. So properly treated server-class drives are dying at a rate of about .5% per year for me; I'd say that jibes with manufacturer MTBF.
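The arithmetic behind a figure like that can be sketched roughly as follows. This is only a sketch: it assumes a fixed population of 500 drives for the whole 22 months and a constant failure rate, and the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annualized_failure_rate(failures, drives, months):
    """Observed failures per drive per year, as a fraction."""
    drive_years = drives * (months / 12.0)
    return failures / drive_years


def implied_mtbf_hours(afr):
    """MTBF implied by an annualized failure rate (assumes a constant rate)."""
    return HOURS_PER_YEAR / afr


all_failures = annualized_failure_rate(6, 500, 22)
well_cooled_only = annualized_failure_rate(3, 500, 22)

print(f"all six failures:      {all_failures:.2%} per year")      # 0.65% per year
print(f"excluding old servers: {well_cooled_only:.2%} per year")  # 0.33% per year
print(f"implied MTBF (all six): {implied_mtbf_hours(all_failures):,.0f} hours")
```

Under those assumptions the two rates land on either side of the "about .5% per year" quoted above, and the implied MTBF is on the order of a million hours.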

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:Never had a drive fail by Depili · 2008-04-05 09:15 · Score: 3, Informative

Excess heat can cause the lubricant of an HD to go bad, which causes weird noises; logic board failures and head positioning failures also cause quite a racket.
In my experience most drives fail without any indication from SMART tests (i.e., logic board failures); bad sectors are quite rare nowadays.
Re:Never had a drive fail by Zak3056 · 2008-04-05 10:00 · Score: 3, Funny

The only possible response to that is this Penny Arcade.

--
What part of "shall not be infringed" is so hard to understand?

There are only two kind of peeps... by loki969 · 2008-04-05 07:35 · Score: 5, Insightful

...those that make backups and those that never had a hard drive fail.

Re:There are only two kind of peeps... by Raineer · 2008-04-05 07:44 · Score: 5, Insightful

I see it the other way... Once I start taking backups my HDD's never fail, it's when I forget that they crash.
Re:There are only two kind of peeps... by squidinkcalligraphy · 2008-04-05 15:16 · Score: 4, Insightful

"Backups are for wimps. Real men upload their data to an FTP site and
have everyone else mirror it." -Linus Torvalds

--
"I think it would be a good idea" Gandhi, on Western Civilisation

Marketplace can't function without good data by dpbsmith · 2008-04-05 07:38 · Score: 5, Insightful

If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.

The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.

--

"How to Do Nothing," kids activities, back in print!

Re:Marketplace can't function without good data by commodoresloat · 2008-04-05 08:56 · Score: 3, Interesting

If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.

And that may be the exact reason why the vendors are providing bad data. On the flip side, however, if people knew how often drives failed, perhaps we'd buy more of them in order to always have backups.
Re:Marketplace can't function without good data by petermgreen · 2008-04-05 13:43 · Score: 3, Insightful

An MTBF is only meaningful when combined with an operating lifespan over which it was measured and after which it is advised that customers needing high reliability replace their drives.

Also, the manufacturer needs to specify the conditions of the test (temperature, humidity, etc.), and customers requiring reliability need to ensure they run near those conditions.

If you do a 1000 hour test and all your drives have a design fault that causes a large proportion of them to fail after about 5000 hours of usage, you probably won't notice the fault, but 7 months down the line customers who run the drive 24/7 will.

The problem is of course that by the time you have done proper testing (= running the drives for their expected lifespan under realistic operating conditions and seeing what proportion fail during that time and when) for a device with an expected lifetime in years, the device is obsolete.

--
note: i'm known as plugwash most places but i screwd up registering that here somehow in the past and now can't register

Never had a drive *not* fail. by Murphy Murph · 2008-04-05 07:40 · Score: 4, Informative

I've gone through many over the years, replacing them as they became too small - still using some small ones many years old for minor tasks, etc. - and the only drive I've ever had partially fail is the one I accidentally launched across a room.

My anecdotal converse is I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter I now claim a pair of 120s as my smallest.
I always seem to have a use for a drive, so I run them until failure.

--
I dub thee... Sir Phobos, Knight of Mars, Beater of Ass.

Re:Never had a drive *not* fail. by Depili · 2008-04-05 09:09 · Score: 3, Interesting

The Deathstars were all 80GB PATA disks manufactured by a single plant; I had 8 of them, and all failed.
Re:Never had a drive *not* fail. by KillerBob · 2008-04-05 10:35 · Score: 4, Informative

Admittedly, it's a different environment entirely than what you're running, but let me see if I can shed some light on it for you....

I administer a small server, which runs its services in virtual sandboxes. One physical box, but through KVM the Apache/PHP/MySQL is in one sandbox, the SMTP/IMAP is in another, etc. Each VM image is about 20GB, give or take, and the machine has two physical hard drives. My backup is periodic, and incremental. And the backup alternates between the drives... at any given time each hard drive will have two copies of every VM, not counting the one that's actually running.

Now... here's where the full system backup comes in: because it's a virtual machine, it's only a single 20GB file. Backing it up is as easy as shutting down the VM and copying the file. Recovering from a backup is where it gets even easier... all I have to do is copy that one file back, and start it up. Poof. *everything* is back the way it was at the time of the backup. Total time to recover? Less than a minute.

And the host OS is easy to rebuild, too, because there's no configuration files to worry about. SSH and KVM are the only services the host is running, and for the most part an out of the box configuration for most Linux distributions will handle it quite nicely.

So... I guess to answer your question... in my case a complete system backup makes administering, and recovering from "oh shit" moments a hell of a lot easier. :) If you have the hard drive storage space available, I'd definitely suggest going that route.

--
If you believe everything you read, you'd better not read. - Japanese proverb

warranties by qw0ntum · 2008-04-05 07:45 · Score: 4, Insightful

The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.

--
'Every story, if continued long enough, ends in death.' --Ernest Hemingway

Re:warranties by ooloogi · 2008-04-05 09:33 · Score: 3, Insightful

Warranties beyond about two years become largely meaningless for this purpose, because after a drive is getting older people often won't bother claiming warranty for what is by then such a small drive. The cost of shipping/transport is likely to be more than the marginal $/GB on a new drive.

So in this way a manufacturer can get away with a long warranty, without necessarily incurring a cost for unreliability.

What MTBF is for. by sakusha · 2008-04-05 07:51 · Score: 5, Insightful

I remember back in the mid 1980s when I received a service management manual from DEC, it had some information that really opened my eyes about what MTBF was really intended for. It had a calculation (I have long since forgotten the details) that allowed you to estimate how many service spares you would need to keep in stock to service any installed base of hardware, based on MTBF. This was intended for internal use in calculating spares inventory level for DEC service agents. High MTBF products needed fewer replacement parts in inventory, low MTBF parts needed lots of parts in stock. Presumably internal MTBF ratings were more accurate than those released to end users.

So anyway.. MTBF is not intended as an indicator of a specific unit's reliability. It is a statistical measurement to calculate how many spares are needed to keep a large population of machines working. It cannot be applied to a single unit in the way it can be applied to a large population of units.
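The DEC calculation itself is not reproduced here, so the following is only a guess at the general shape of such an estimate, not the manual's actual method. It assumes failures across the installed base follow a Poisson distribution with a constant rate, and picks the smallest stock that covers a restocking period with a chosen confidence. All names and figures are hypothetical.

```python
import math


def expected_failures(units, mtbf_hours, period_hours):
    """Mean number of failures across the installed base in one period."""
    return units * period_hours / mtbf_hours


def spares_needed(units, mtbf_hours, period_hours, confidence=0.95):
    """Smallest stock s with P(failures <= s) >= confidence (Poisson model)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    mean = expected_failures(units, mtbf_hours, period_hours)
    term = math.exp(-mean)  # P(X = 0)
    cumulative = term
    spares = 0
    while cumulative < confidence:
        spares += 1
        term *= mean / spares  # P(X = spares) from P(X = spares - 1)
        cumulative += term
    return spares


# Hypothetical installed base: 200 units, restocked every 30 days (720 hours).
low_mtbf_stock = spares_needed(200, 100_000, 720)
high_mtbf_stock = spares_needed(200, 1_000_000, 720)
print(low_mtbf_stock, high_mtbf_stock)  # 4 1
```

With these made-up numbers the low-MTBF part needs four spares on the shelf and the high-MTBF part needs one, which is the qualitative point: the figure sizes inventory for a population and says little about any one unit.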

Perhaps the classical example is about the old tube-based computers like ENIAC, if a single tube has an MTBF of 1 year, but the computer has 10,000 tubes, you'd be changing tubes (on average) more than once an hour, you'd rarely even get an hour of uptime. (I hope I got that calculation vaguely correct)
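For what it's worth, the tube arithmetic checks out. A quick sketch, assuming tubes fail independently at a constant rate and that any single tube failure stops the machine (the function name is just illustrative):

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def hours_between_failures(component_mtbf_hours, component_count):
    """Mean time between failures for the whole system of identical parts."""
    return component_mtbf_hours / component_count


interval = hours_between_failures(HOURS_PER_YEAR, 10_000)
print(f"{interval:.3f} hours, about {interval * 60:.0f} minutes")
# 0.876 hours, about 53 minutes
```

So on average a tube would go a little more often than once an hour.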

Re:What MTBF is for. by sakusha · 2008-04-05 08:09 · Score: 3, Informative

Thanks. I read your comment and got to thinking about it a bit more. I vaguely recall that in those olden days, MTBF was not an estimate, it was calculated from the service reports of failed parts. The calculations were released in monthly reports so we could increase our spares inventory to cover parts that were proving to be less reliable than estimated. But then, those were the days when every installed CPU was serviced by authorized agents, so data gathering was 100% accurate.
Re:What MTBF is for. by davelee · 2008-04-05 08:22 · Score: 4, Informative

MTBFs are designed to specify a RATE of failure, not the expected lifetime. This is because disk manufacturers don't test MTBF by running 100 drives until they die, but rather running say, 10000 drives and counting the number that fail during some period of months perhaps. As drives age, clearly the failure rate will increase and thus the "MTBF" will shrink.

long story short -- a 3 year old drive will not have the same MTBF as a brand new drive. And a MTBF of 1 million hours doesn't mean that the median drive will live to 1 million hours.

Misunderstanding MTBF by dh003i · 2008-04-05 07:55 · Score: 4, Interesting

I think that a lot of people are misunderstanding MTBF. An HD might have an MTBF of 100 years. This doesn't mean that the company expects the vast majority of consumers to have that HD running for 100 years without problems.

MTBF numbers are generated by running say thousands of hard-drives of the same model and batch/lot, and seeing how long it takes before 1 fails. This may be a day or so. You then figure out how many total HD running hours it took before failure. If you have 1,000 HD's running, and it takes 40 hours before one fails, that's a 40,000 hr MTBF. But this number isn't generated by running say 10 hard-drives, waiting for all of them to fail, and averaging that number.
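The device-hours calculation described above can be written out like this. It is a sketch of the simple "total running hours divided by failures" estimate only; real test procedures differ between manufacturers, and the function name is made up.

```python
def mtbf_from_test(units_on_test, hours_run, failures_observed):
    """Point estimate of MTBF: accumulated device-hours per observed failure."""
    if failures_observed == 0:
        raise ValueError("no failures observed; MTBF cannot be estimated this way")
    device_hours = units_on_test * hours_run
    return device_hours / failures_observed


# 1,000 drives, first failure after 40 hours.
print(mtbf_from_test(1_000, 40, 1))  # 40000.0
```

Note that the test only ever saw the drives for 40 hours each, which is why the resulting number says so little about behavior years later.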

Thus, because of the way MTBF numbers are generated, they may or may not reflect hard-drive reliability beyond a few weeks. It depends on our assumptions about hard-drive stress and usage beyond the length of time before the 1st HD of the 1,000 or so they were testing failed. Most likely, it says less and less about hard-drive reliability beyond that initial point of failure (which is on the order of tens or hundreds of hours, not hundreds of thousands of hours or millions of hours!).

To be sure, all-else equal, a higher MTBF is better than a lower one. But as far as I'm concerned, those numbers are more useful for predicting DOA, duds, or quick-failure; and are more useful to professionals who might be employing large arrays of HD's. They are not particularly useful for getting a good idea of how long your HD will actually last.

HD manufacturers also publish an expected life-cycle of their HD. But I usually put the most stock in the length of the warranty. That's what they're willing to put their money behind. Albeit, it's possible their strategy is just to warranty less than how long they expect 90% of HD's to last, so they can then sell them cheaper. But if you've had a HD and you've had it for longer than what the manufacturer publishes as the expected-life, what they're saying by that is you've basically got a good value, and will probably want to have something else on hand, and be backed up.

--
social sciences can never use experience to verify their statemen

Temperature is the key by arivanov · 2008-04-05 07:55 · Score: 4, Interesting

Disk MTBF is quoted for 20C.

Here is an example of my server. At 18C ambient, in a well cooled and well designed case with dedicated hard drive fans, the Maxtors I use for RAID1 run at 29°C. My media server, which is in the loft with sub-16C ambient, runs them at 24-34C depending on the position in the case (once again, a proper high-end case with dedicated hard drive fans).

Very few hard disk enclosures can bring the temperature down to 24-25C.

SANs or high density servers usually end up running disks at 30C+ while at 18C ambient. In fact I have seen disks run at 40C or more in "enterprise hardware".

From there on it is not amazing that they fail at a rate different from the quoted one. In fact I would have been very surprised if they did.

--
Baker's Law: Misery no longer loves company. Nowadays it insists on it
http://www.sigsegv.cx/

Re:Temperature is the key by ABasketOfPups · 2008-04-05 08:10 · Score: 5, Interesting

Google says that's just not what they've seen. "The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at the very high temperatures is there a slight reversal of this trend."

On the graph it's clear that 30-35C is best at three years. But up until then, 35-40C has lower failure rates, and both have lower rates by a lot than the 15-30C range.
Re:Temperature is the key by Jugalator · 2008-04-05 08:13 · Score: 3, Informative

I agree, I had a Maxtor disk that ran at something like 50-60 C and wondered when it was going to fail, never really treated it as my safest drive. And lo and behold, after ~3-4 years the first warnings on bad sectors started cropping up, and a year later Windows panicked and told me to immediately back it up if I hadn't already because I guess the number of SMART errors were building up.

On the other hand, I had a Samsung disk that ran at 40 C tops, in a worse drive bay too! The Maxtor one had free air passage in the middle bay (no drives nearby), where the Samsung was side-by-side with the metal casing.

So I'm thinking there can be some measurable differences between drive brands, and a study of this, along with perhaps relationship with brand failure rates would be most interesting!

--
Beware: In C++, your friends can see your privates!

Re:MTBF For Unused Drive? by zappepcs · 2008-04-05 07:56 · Score: 4, Interesting

The problem is that the MTBF is calculated on an accelerated lifecycle test schedule. Life in general does not actually act like the accelerated test expanded out to 1day=1day. It is an approximation, and prone to errors because of the aggregated averages created by the test.

On average, a disk drive can last as long as the MTBF number. What are the chances that you have an average drive? They are slim. Each component in the drive, every resistor, every capacitor, every part has an MTBF. They also have tolerance values: that is to say they are manufactured to a value with a given tolerance of accuracy. Each tolerance has to be calculated as one component out of tolerance could cause failure of complete sections of the drive itself. When you start calculating that kind of thing it becomes similar to an exercise in calculating safety on the space shuttle... damned complex in nature.

The tests remain valid because of a simple fact. In large data centers where you have large quantities of the same drive spinning in the same lifecycles, you will find that a percentage of them fail within days of each other. That means that there is a valid measurement of the parts in the drive, and how they will stand the test of life in a data center.

Is your data center an 'average' life for a drive? The accelerated lifecycle tests cannot tell you. All the testing does is look for failures of any given part over a number of power cycles, hours of use etc. It is quite improbable that your use of the drive will match that of the expanded testing life cycle.

The MTBF is a good estimation of when you can be certain of a failure of one part or another in your drive. There is ALWAYS room for it to fail prior to that number. ALWAYS.

Like any electronic device for consumers, if it doesn't fail in the first year, it's likely to last as long as you are likely to be using it. Replacement rates of consumer societies mean that manufacturers don't have to worry too much about MTBF as long as it's longer than the replacement/upgrade cycle.

If you are worried about data loss, implement a good data backup program and quit worrying about drive MTBFs.

--
Support NYCountryLawyer RIAA vs People

Re:Failure rates ! warranty period. by ABasketOfPups · 2008-04-05 07:57 · Score: 5, Informative

Warranty periods for 750 gig and 1 terabyte drives from Western Digital, Samsung, and Hitachi, are 3 years to 5 years according to the info on zipzoomfly.com.

A one year warranty doesn't seem that common. External drives seem to have one year warranties, but even SATA drives at Best Buy mostly have 3 years.

Build your own USB drives by omnirealm · 2008-04-05 08:14 · Score: 3, Informative

While we are on the topic of failing drives, I think it would be appropriate to include a warning about USB drives and warranties.

I purchased a 500GB Western Digital My Book about a year and a half ago. I figured that a pre-fab USB enclosed drive would somehow be more reliable than building one myself with a regular 3.5" internal drive and my own separately purchased USB enclosure (you may dock me points for irrational thinking there). Of course, I started getting the click-of-death about a month ago, and I was unpleasantly surprised to discover that the warranty on the drive was only for 1 year, rather than the 3 year warranty that I would have gotten for a regular 3.5" 500GB Western Digital drive at the time. Meanwhile, my 750GB Seagate drive in an AMS VENUS enclosure has been chugging along just fine, and if it fails sometime in the next four years, I will still be able to exchange it under warranty.

The moral of the story is that, when there is a difference in the warranty periods (i.e., 1 year vs. 5 years), it makes a lot more sense to build your own USB enclosed drive rather than order a pre-fab USB enclosed drive.

--
An unjust law is no law at all. - St. Augustine

MTBF is a useful statistical measure by Kupfernigk · 2008-04-05 08:39 · Score: 3, Insightful

which many people confuse with MTTF (mean time to failure) - which is relevant in predicting the life of equipment. It needs to be stated clearly that MTBF applies to populations; if I have 1000 hard drives with a MTBF of 1 million hours, I would on average expect one failure every thousand hours. These are failures rather than wearouts, which are a completely different phenomenon.

Anecdotal reports of failures also need to consider the operating environment. If I have a server rack, and most servers in the rack have a drive failure in the first year, is it the drive design or the server design? Given the relative effort that usually goes into HDD design and box design, it's more likely to be due to poor thermal management in the drive enclosure. Back in the day when Apple made computers (yes, they did once, before they outsourced it) their thermal management was notoriously better than that of many of the vanilla PC boxes, and properly designed PC-format servers like the HP Kayaks were just as expensive as Macs. The same, of course, went for Sun, and that was one reason why elderly Mac and Sparc boxes would often keep chugging along as mail servers until there were just too many people sending big attachments.

One possibly related oddity that does interest me is laptop prices. The very cheap laptops are often advertised with optional 3 year warranties that cost as much as the laptop. Upmarket ones may have three year warranties for very little. I find myself wondering if the difference in price really does reflect better standards of manufacture so that the chance of a claim is much less, whether cheap laptops get abused and are so much more likely to fail, or whether the warranty cost is just built into the price of the more expensive models because most failures in fact occur in the first year.

--
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."

MTBF assumes drives are replaced every few years by AySz88 · 2008-04-05 09:32 · Score: 3, Informative

MTBF is only valid during the "lifetime" of a drive. (For example, "lifetime" might mean the five years during which a drive is under warranty.) Thus, the MTBF is the mean time before failure if you replace the drive every five years with other drives with identical MTBF. Thus the 100-some year MTBF doesn't mean that an individual drive will last 100+ years, it means that your scheme of replacing every 5 years will work for an average time of 100+ years.
Of course, I think this is another deceptive definition from the hard drive industry... To me, the drive's lifetime ends when it fails, not "5 years".
Source: http://www.rpi.edu/~sofkam/fileserverdisks.html

Re:MTBF For Unused Drive? by mollymoo · 2008-04-05 09:47 · Score: 4, Informative

Maybe they mean the MTBF for drives that are just on, but not being used. I've never put any stock into those numbers, because I've had too many drives fail to believe that they're supposed to be lasting 100 years.

If you think an MTBF of 100 years means the disk will last 100 years you're bound to be disappointed, because that's not what it means. MTBF is calculated in different ways by different companies, but generally there are at least two numbers you need to look at: MTBF and the design or expected lifetime. A disk with an MTBF of 200 000 hours and a lifetime of 20 000 hours means that 1 in 10 are expected to fail during their lifetime, or with 200 000 disks one will fail every hour. It does not mean the average drive will last 200 000 hours. After the lifetime is over all bets are off.
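Those two figures can be checked with a small sketch. It assumes a constant failure rate during the design lifetime (an exponential model), which is an assumption made here for illustration rather than anything a manufacturer has stated; the function names are hypothetical.

```python
import math


def fraction_failing(mtbf_hours, lifetime_hours):
    """Share of drives expected to fail within the lifetime (exponential model)."""
    return 1.0 - math.exp(-lifetime_hours / mtbf_hours)


def fleet_hours_between_failures(mtbf_hours, disks):
    """Mean time between failures somewhere in a fleet of identical disks."""
    return mtbf_hours / disks


share = fraction_failing(200_000, 20_000)
print(f"{share:.1%} fail within the design lifetime")  # 9.5%
print(fleet_hours_between_failures(200_000, 200_000))  # 1.0
```

The exponential model gives about 9.5%, slightly under the simple ratio of lifetime to MTBF, because a drive that has already failed cannot fail again; "1 in 10" is a fair rounding.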

In short, the MTBF is a statistical measure of the expected failure rate during the expected lifetime of a device, it is not a measure of the expected lifetime of a device.

--
Chernobyl 'not a wildlife haven' - BBC News

Re:MTBF For Unused Drive? by SuperQ · 2008-04-05 10:14 · Score: 3, Informative

MTBF is NOT calculated for a single drive. MTBF is calculated based on an average for ANY pool size of drives.

If you have 10,000 drives, and the failure is 1 in 1,000,000 hours, you will have a failure every 100 hours.
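A quick simulation gives a feel for that pool figure. This is a sketch under stated assumptions: every drive fails at a constant rate (exponential lifetimes), a failed drive is swapped immediately for an identical one, and the 100,000 hour observation window and the seed are arbitrary choices.

```python
import random


def simulate_pool(drives, mtbf_hours, horizon_hours, seed=0):
    """Count failures in a pool where each failed drive is replaced at once."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(drives):
        # Time of the first failure in this drive slot.
        t = rng.expovariate(1.0 / mtbf_hours)
        while t < horizon_hours:
            failures += 1
            # The replacement drive starts its own exponential lifetime.
            t += rng.expovariate(1.0 / mtbf_hours)
    return failures


horizon = 100_000
failures = simulate_pool(10_000, 1_000_000, horizon)
mean_interval = horizon / failures
print(f"{failures} failures, one every {mean_interval:.1f} hours on average")
```

The result should come out close to one failure per 100 hours, with some run-to-run scatter if the seed is changed.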

Here's a good document on disk failure information:
http://research.google.com/archive/disk_failures.pdf

Slashdot Mirror
