Disk Failure Rates More Myth Than Metric

Posted by Zonk on Saturday April 5, 2008 @07:30AM from the like-the-loch-ness-hard-drive dept.

Lucas123 writes "Mean time between failure (MTBF) ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10. This is nearly 15 times what vendors claim, and all of these studies show failure rates grow steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
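
As a rough check on the arithmetic in the summary, the sketch below converts an MTBF in hours to years of continuous operation and to the annual failure rate it would imply. It assumes 8,760 powered-on hours per year and a constant failure rate; the function names are illustrative and do not come from any of the studies.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming the drive is powered on all year


def mtbf_years(mtbf_hours: float) -> float:
    """Express an MTBF figure as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def implied_annual_failure_rate(mtbf_hours: float) -> float:
    """Fraction of a large population expected to fail per year
    if failures arrive at a constant rate of 1/MTBF per hour."""
    return HOURS_PER_YEAR / mtbf_hours


OBSERVED_RATE = 0.10  # "one in 10" replacement rate from the summary

for mtbf in (1_000_000, 1_500_000):
    afr = implied_annual_failure_rate(mtbf)
    print(
        f"MTBF {mtbf:,} h = {mtbf_years(mtbf):.0f} years, "
        f"implied annual failure rate {afr:.2%}, "
        f"observed/implied = {OBSERVED_RATE / afr:.1f}x"
    )
```

Under these assumptions a one-in-10 annual replacement rate is roughly 11 to 17 times the implied rate, which brackets the "nearly 15 times" figure.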

6 of 283 comments

Never had a drive *not* fail. by Murphy+Murph · 2008-04-05 07:40 · Score: 4, Informative

I've gone through many over the years, replacing them as they became too small - still using some small ones many years old for minor tasks, etc. - and the only drive I've ever had partially fail is the one I accidentally launched across a room.

My anecdotal converse is I have never had a hard drive not fail. I am a bit on the cheap side of the spectrum, I'll admit, but having lost my last 40GB drives this winter I now claim a pair of 120s as my smallest.
I always seem to have a use for a drive, so I run them until failure.

--
I dub thee... Sir Phobos, Knight of Mars, Beater of Ass.
1. Re:Never had a drive *not* fail. by KillerBob · 2008-04-05 10:35 · Score: 4, Informative
  
  Admittedly, it's a different environment entirely than what you're running, but let me see if I can shed some light on it for you....
  
  I administer a small server, which runs its services in virtual sandboxes. One physical box, but through KVM the Apache/PHP/MySQL is in one sandbox, the SMTP/IMAP is in another, etc. Each VM image is about 20GB, give or take, and the machine has two physical hard drives. My backup is periodic, and incremental. And the backup alternates between the drives... at any given time each hard drive will have two copies of every VM, not counting the one that's actually running.
  
  Now... here's where the full system backup comes in: because it's a virtual machine, it's only a single 20GB file. Backing it up is as easy as shutting down the VM and copying the file. Recovering from a backup is where it gets even easier... all I have to do is copy that one file back, and start it up. Poof. *everything* is back the way it was at the time of the backup. Total time to recover? Less than a minute.
  
  And the host OS is easy to rebuild, too, because there's no configuration files to worry about. SSH and KVM are the only services the host is running, and for the most part an out of the box configuration for most Linux distributions will handle it quite nicely.
  
  So... I guess to answer your question... in my case a complete system backup makes administering, and recovering from "oh shit" moments a hell of a lot easier. :) If you have the hard drive storage space available, I'd definitely suggest going that route.
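
A minimal sketch of the copy-the-image approach described above, assuming the VM has already been shut down and its disk is a single file. The function names, directory layout, and timestamped file naming are hypothetical, and the sketch makes full copies rather than the incremental backups mentioned in the comment.

```python
import shutil
import time
from pathlib import Path


def backup_image(image: Path, backup_dirs: list[Path], run_number: int) -> Path:
    """Copy a shut-down VM image into one of the backup directories.

    Successive runs alternate between the directories (for example,
    one per physical drive) by using run_number modulo their count.
    """
    target_dir = backup_dirs[run_number % len(backup_dirs)]
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{image.stem}-{stamp}{image.suffix}"
    shutil.copy2(image, target)  # copies contents and file metadata
    return target


def restore_image(backup: Path, image: Path) -> None:
    """Restore by copying the backup file back over the VM image."""
    shutil.copy2(backup, image)
```

Stopping and restarting the VM is left out on purpose, since how that is done depends on how KVM is managed on the host.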
  
  --
  If you believe everything you read, you'd better not read. - Japanese proverb
Re:Failure rates != warranty period. by ABasketOfPups · 2008-04-05 07:57 · Score: 5, Informative

Warranty periods for 750 gig and 1 terabyte drives from Western Digital, Samsung, and Hitachi are 3 years to 5 years according to the info on zipzoomfly.com.

A one-year warranty doesn't seem that common. External drives seem to have one-year warranties, but even SATA drives at Best Buy mostly have 3 years.
Re:What MTBF is for. by davelee · 2008-04-05 08:22 · Score: 4, Informative

MTBFs are designed to specify a RATE of failure, not the expected lifetime. This is because disk manufacturers don't test MTBF by running 100 drives until they die, but rather by running, say, 10000 drives and counting the number that fail during some period of months. As drives age, the failure rate will clearly increase and thus the "MTBF" will shrink.

Long story short: a 3-year-old drive will not have the same MTBF as a brand new drive. And an MTBF of 1 million hours doesn't mean that the median drive will live to 1 million hours.
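
The population test described above can be sketched as follows. The drive count, test length, and failure count are made-up numbers, and the estimate assumes failed drives are replaced so the population stays constant. The median calculation assumes a constant failure rate, which is exactly the assumption that stops holding as drives age.

```python
import math


def estimate_mtbf(drives: int, test_hours: float, failures: int) -> float:
    """Estimate MTBF as total drive-hours divided by observed failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return drives * test_hours / failures


def median_life_constant_rate(mtbf_hours: float) -> float:
    """Median lifetime if the failure rate never changed (exponential model)."""
    return mtbf_hours * math.log(2)


# Hypothetical test: 10,000 drives run for 90 days with 22 failures.
mtbf = estimate_mtbf(drives=10_000, test_hours=90 * 24, failures=22)
print(f"estimated MTBF: {mtbf:,.0f} hours")
print(f"median life at a constant rate: {median_life_constant_rate(mtbf):,.0f} hours")
```

Even in this idealized model the median drive fails at about 69% of the MTBF. With a failure rate that rises with age, the median falls further below it.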
Re:Never had a drive fail by afidel · 2008-04-05 09:01 · Score: 4, Informative

I would tend to agree with that. I run a datacenter that's cooled to 74 degrees and has good clean power from the online UPSes, and I've had 6 drive failures out of about 500 drives over the last 22 months. Three were from older servers that weren't always properly cooled (the company had a crappy AC unit in their old data closet). The other three all died in their first month or two after installation. So properly treated server-class drives are dying at a rate of about 0.5% per year for me; I'd say that jibes with manufacturer MTBF.
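
For comparison, here is a small sketch that annualizes the figures quoted above. It assumes all 500 drives were in service for the full 22 months, which the comment does not state, and the function name is illustrative.

```python
def annualized_failure_rate(failures: int, drives: int, months: float) -> float:
    """Failures per drive per year, assuming a fixed population for the whole period."""
    return failures / drives * 12 / months


all_failures = annualized_failure_rate(failures=6, drives=500, months=22)
well_cooled_only = annualized_failure_rate(failures=3, drives=500, months=22)
vendor_implied = 8760 / 1_000_000  # a 1 million hour MTBF at a constant rate

print(f"all six failures:       {all_failures:.2%} per year")
print(f"well-cooled drives only: {well_cooled_only:.2%} per year")
print(f"implied by 1M hour MTBF: {vendor_implied:.2%} per year")
```

On these assumptions both observed rates come out below the rate a 1 million hour MTBF would imply.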

--
There are 4 boxes to use in the defense of liberty: soap, ballot, jury, ammo. Use in that order. Starting now.
Re:MTBF For Unused Drive? by mollymoo · 2008-04-05 09:47 · Score: 4, Informative

Maybe they mean the MTBF for drives that are just on, but not being used. I've never put any stock into those numbers, because I've had too many drives fail to believe that they're supposed to be lasting 100 years.

If you think an MTBF of 100 years means the disk will last 100 years you're bound to be disappointed, because that's not what it means. MTBF is calculated in different ways by different companies, but generally there are at least two numbers you need to look at, MTBF and the design or expected lifetime. A disk with an MTBF of 200 000 hours and a lifetime of 20 000 hours means that 1 in 10 are expected to fail during their lifetime, or with 200 000 disks one will fail every hour. It does not mean the average drive will last 200 000 hours. After the lifetime is over all bets are off.

In short, the MTBF is a statistical measure of the expected failure rate during the expected lifetime of a device; it is not a measure of the expected lifetime of a device.
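
The two figures in the example above can be reproduced with a short sketch. The function names are illustrative. The "1 in 10" figure is the simple ratio of lifetime to MTBF, and an exponential model with a constant failure rate (an assumption) gives a slightly lower number.

```python
import math


def fraction_failing_in_lifetime(mtbf_hours: float, lifetime_hours: float) -> float:
    """Simple ratio of design lifetime to MTBF (good when lifetime << MTBF)."""
    return lifetime_hours / mtbf_hours


def fraction_failing_exponential(mtbf_hours: float, lifetime_hours: float) -> float:
    """Same quantity under a constant-rate (exponential) failure model."""
    return 1 - math.exp(-lifetime_hours / mtbf_hours)


def fleet_failures_per_hour(disks: int, mtbf_hours: float) -> float:
    """Expected failures per hour across a fleet of in-lifetime disks."""
    return disks / mtbf_hours


print(fraction_failing_in_lifetime(200_000, 20_000))           # simple ratio
print(round(fraction_failing_exponential(200_000, 20_000), 4))  # exponential model
print(fleet_failures_per_hour(200_000, 200_000))                # failures per hour
```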

--
Chernobyl 'not a wildlife haven' - BBC News