Disk Failure Rates More Myth Than Metric
Lucas123 writes "Mean time between failure ratings suggest that disks can last from 1 million to 1.5 million hours, or 114 to 170 years, but study after study shows that those metrics are inaccurate for determining hard drive life. One study found that some disk drive replacement rates were greater than one in 10. This is nearly 15 times what vendors claim, and all of these studies show that failure rates grow steadily with the age of the hardware. One former EMC employee turned consultant said, 'I don't think [disk array manufacturers are] going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the numbers.'"
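For anyone who wants to check the arithmetic in the summary, here is a minimal sketch of how an MTBF rating converts into years and into an implied annual failure rate. It assumes a constant failure rate (exponentially distributed lifetimes) and drives running around the clock; both are simplifications, and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mtbf_to_years(mtbf_hours):
    """Convert an MTBF rating in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours):
    """Chance that one drive fails within a year of 24x7 use.

    Assumption: constant failure rate, so lifetimes are exponential.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (1_000_000, 1_500_000):
    years = mtbf_to_years(mtbf)
    afr = annualized_failure_rate(mtbf)
    print(f"MTBF {mtbf} h = {years:.0f} years, implied AFR {afr:.2%}")
```

Under those assumptions the vendor ratings imply well under 1% of drives failing per year, so a replacement rate above one in 10 is somewhere between roughly 11 and 17 times the rated figure, depending on which MTBF you start from.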
...those that make backups and those that never had a hard drive fail.
If everyone knows how much a disk drive costs, and nobody can find out how long a disk drive really will last, there is no way the marketplace can reward the vendors of durable and reliable products.
The inevitable result is a race to the bottom. Buyers will reason they might as well buy cheap, because they at least know they're saving money, rather than paying for quality and likely not getting it.
"How to Do Nothing," kids activities, back in print!
Drive failures are actually fairly common, but usually the failures are due to cooling issues. Given that most PCs aren't really set up to ensure decent hard drive cooling, it is probable that the observed failure rates are inflated by operation outside the expected operational parameters (which are probably not conservative enough for real usage). In my opinion, if you have more than a single hard drive closely stacked in your case, you should have some sort of hard drive fan.
The best metric is probably going to be the length of warranty the manufacturer offers. They have financial incentive to find out the REAL mean time until failure in calculating the warranty.
'Every story, if continued long enough, ends in death.' --Ernest Hemingway
I remember back in the mid 1980s when I received a service management manual from DEC, it had some information that really opened my eyes about what MTBF was really intended for. It had a calculation (I have long since forgotten the details) that allowed you to estimate how many service spares you would need to keep in stock to service any installed base of hardware, based on MTBF. This was intended for internal use in calculating spares inventory level for DEC service agents. High MTBF products needed fewer replacement parts in inventory, low MTBF parts needed lots of parts in stock. Presumably internal MTBF ratings were more accurate than those released to end users.
So anyway, MTBF is not intended as an indicator of a specific unit's reliability. It is a statistical measurement used to calculate how many spares are needed to keep a large population of machines working. It cannot be applied to a single unit in the way it can be applied to a large population of units.
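To be clear, this is not the DEC formula (those details are lost), but a rough sketch of how a spares estimate of that kind could work. It assumes failures are independent and arrive at a constant rate, so the number of failures in a restocking period is Poisson distributed. The function names, the 30-day period, and the 95% target are all hypothetical choices for illustration.

```python
import math


def expected_failures(units, mtbf_hours, period_hours):
    """Average number of failures across the installed base in one period."""
    return units * period_hours / mtbf_hours


def poisson_cdf(k, mean):
    """Probability of seeing k or fewer failures when the average is `mean`."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i)
               for i in range(k + 1))


def spares_needed(units, mtbf_hours, period_hours, confidence=0.95):
    """Smallest stock that covers the period's failures with the given confidence."""
    mean = expected_failures(units, mtbf_hours, period_hours)
    spares = 0
    while poisson_cdf(spares, mean) < confidence:
        spares += 1
    return spares


# 1,000 installed units, restocked every 30 days (720 hours)
for mtbf in (1_000_000, 100_000):
    print(f"MTBF {mtbf} h: stock {spares_needed(1000, mtbf, 720)} spares")
```

The point is the same one the manual was making: the number only means something for a population. With a high MTBF the shelf needs very few parts; cut the MTBF by ten and the required stock climbs sharply.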
Perhaps the classic example is the old tube-based computers like ENIAC: if a single tube has an MTBF of 1 year but the computer has 10,000 tubes, you'd be changing a tube (on average) more than once an hour, and you'd rarely get even an hour of uptime. (I hope I got that calculation vaguely correct.)
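The calculation does check out. A quick sketch, assuming the tubes fail independently at a constant rate and that any single tube failure takes the machine down (the function name is just for illustration):

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def system_mtbf_hours(component_mtbf_hours, component_count):
    """MTBF of a machine that stops when any one of its identical parts fails.

    Assumption: independent parts with a constant failure rate, so the
    failure rates simply add up.
    """
    return component_mtbf_hours / component_count


tube_mtbf = 1 * HOURS_PER_YEAR  # one tube: MTBF of 1 year
hours = system_mtbf_hours(tube_mtbf, 10_000)
print(f"{hours:.3f} hours ({hours * 60:.0f} minutes) between tube failures")
```

That works out to a tube failure a bit more often than once an hour on average, which matches the estimate above.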
I'd agree with you there; I have had probably 8 or 9 hard drives fail over the years (I currently have 10 running in the house right now and I have 8 running at my desk at work, so I do have a lot of drives). I am sure that I have caused some of the failures by just what you are talking about - I've maxed out the cases (for example my server has 4 drives in it, but was designed for 2 - I had to make my own bracket to jam the 4th in there, the 3rd went in place of a floppy). But I've never done anything about cooling and I probably caused this myself. Although to hear the noises coming from some of the platters when they failed I'm sure at least a couple weren't just heat. For example at work I have had 2 drives fail in just bog standard HP Compaq dc7700 desktops (without cramming in extra stuff). Sometimes they just up and die, other times I must have helped them along with heat.
Anecdotal reports of failures also need to consider the operating environment. If I have a server rack, and most servers in the rack have a drive failure in the first year, is it the drive design or the server design? Given the relative effort that usually goes into HDD design and box design, it's more likely to be due to poor thermal management in the drive enclosure. Back in the day when Apple made computers (yes, they did once, before they outsourced it) their thermal management was notoriously better than that of many of the vanilla PC boxes, and properly designed PC-format servers like the HP Kayaks were just as expensive as Macs. The same, of course, went for Sun, and that was one reason why elderly Mac and Sparc boxes would often keep chugging along as mail servers until there were just too many people sending big attachments.
One possibly related oddity that does interest me is laptop prices. The very cheap laptops are often advertised with optional 3 year warranties that cost as much as the laptop. Upmarket ones may have three year warranties for very little. I find myself wondering whether the difference in price really does reflect better standards of manufacture, so that the chance of a claim is much less; whether cheap laptops get abused and so are much more likely to fail; or whether the warranty cost is just built into the price of the more expensive models, because most failures in fact occur in the first year.
From scarped cliff or quarried stone she cries "A thousand types are gone, I care for nothing, no not one."