Disk Drive Failures 15 Times What Vendors Say
jcatcw writes "A Carnegie Mellon University study indicates that customers are replacing disk drives more frequently than vendor estimates of mean time to failure (MTTF) would require. The study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for the drives indicated MTTF between 1 and 1.5 million hours. That should mean annual failure rates of at most 0.88%, but annual replacement rates were between 2% and 4%. The study also shows no evidence that Fibre Channel drives are any more reliable than SATA drives."
Yes, and it's mentioned in the report.
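For reference, the 0.88% figure follows from dividing the hours in a year by the MTTF. A minimal sketch of that arithmetic, assuming a constant failure rate and drives powered on around the clock (the function name is made up for illustration):

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming the drive runs 24/7

def annual_failure_rate(mttf_hours):
    """Approximate fraction of drives expected to fail per year for a given MTTF."""
    return HOURS_PER_YEAR / mttf_hours

# The two data sheet MTTF values quoted in the summary
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:,} h -> AFR {annual_failure_rate(mttf):.2%}")
```

A 1 million hour MTTF works out to roughly 0.88% per year and 1.5 million hours to roughly 0.58%, so replacement rates of 2% to 4% are several times the data sheet figure.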
The best part about the entire thing is the very last quote:
"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."
Just common sense.
liqbase
In the article, they mention that the study didn't track actual failures, just how often customers *thought* there was a failure and replaced their drive. There are all sorts of reasons someone might think a drive has failed. They're not all correct. I can't begin to guess what percentage of those perceived failures were for real.
This study is not news. All it says is that people *think* their hard drives fail more often than the mean time to failure.
TFA seems surprised by SATA drives lasting as long as Fibre...why on earth would your data interface have any consequences on the drive internals? Or are we assuming Interface = Data Throughput?
I have had 3 personal use hard drives go bad in the last 5 years, they were either Maxtor or Western Digital. I am not hard on the drives other than leaving them on 24/7. The drives that failed were all just for data backup and I put them in big, well ventilated boxes. With this use I would think the drives would last for years (at least 5 years), but nope! The drives did not arrive broken either, they all functioned great for 1-2 years before dying. The quality of consumer hard drives nowadays is way, WAY low, and the manufacturers should do something about it.
I don't consider myself a fluke because I know quite a few other people who have had similar problems. What's the deal?
Also, does anyone else find this quote interesting?:
"and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive."
It's a f$#*ing hard drive! Jesus H Tapdancing Christ, how can they call that premature wear? Do they calculate the MTTF by just letting the drive sit idle and never reading and writing to it? That actually wouldn't surprise me.
Hey, there is only one Return and it's not of the King, it's of the Jedi.
Give me 6 month failure rates.
... 60 months? That would be the info that I'd need. Where's the big failure spike? I'm going to be replacing them right before that.
Start with 100 drives. Continuous usage.
How many fail in the first 6 months? 12 months? 18 months?
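The kind of table being asked for here can be sketched as a simple binning of drive ages at failure. This is only an illustration: the function name and the sample ages are made up, and the numbers are placeholders, not measurements from the study.

```python
from collections import Counter

def failures_per_interval(failure_ages_months, interval_months=6, horizon_months=60):
    """Count failures in each interval (0-6], (6-12], ... up to the horizon."""
    counts = Counter()
    for age in failure_ages_months:
        if 0 < age <= horizon_months:
            # Ceiling division: month 6 lands in the first bucket, month 7 in the second
            bucket_end = -(-age // interval_months) * interval_months
            counts[bucket_end] += 1
    return [(end, counts[end])
            for end in range(interval_months, horizon_months + 1, interval_months)]

# Made-up ages at failure (in months) for a hypothetical batch of 100 drives; NOT real data
example_ages = [2, 5, 14, 30, 41, 43, 44, 47]
for end, n in failures_per_interval(example_ages):
    print(f"interval ending at month {end:>2}: {n} failed")
```

With real replacement logs in place of the placeholder list, a spike in one interval would show up directly in the counts.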
Slightly off-topic, but if you haven't checked the Self-Monitoring, Analysis and Reporting Technology (SMART) info provided by your drive to see if it is having errors, you probably should. You can download smartmontools, which works on Linux/Unix and Windows. Your Linux distro may have it included, but may not have the daemon running to automatically monitor the drive (smartd).
To view the SMART info for drive /dev/sda do:
smartctl -a /dev/sda
To do a full disk read check (can take hours) do:
smartctl -t long /dev/sda
Sadly, I just found read errors on a 375-hour-old drive (manufacturer's software claimed that repair succeeded). Fortunately, they were on the Windows partition :-)
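If you would rather script the check than type it by hand, here is a minimal sketch that wraps the two smartctl invocations mentioned above. The helper names are made up, /dev/sda is only an example device, and it assumes smartmontools is installed and that you have the privileges smartctl usually needs (typically root).

```python
import shutil
import subprocess

def smartctl_command(device, long_test=False):
    """Build the smartctl call: '-t long' starts a full self-test, '-a' prints all SMART info."""
    if long_test:
        return ["smartctl", "-t", "long", device]
    return ["smartctl", "-a", device]

def run_smartctl(device, long_test=False):
    """Run smartctl and return its text output, or None if smartctl is not installed."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(smartctl_command(device, long_test),
                            capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    # Example device only; adjust to match your system
    output = run_smartctl("/dev/sda")
    print(output if output is not None else "smartctl not found; install smartmontools")
```

Note that the long test runs in the background on the drive itself, so its result shows up later in the self-test log section of the smartctl -a output.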
I had a 4MB 72-pin parity SIMM go bad one time...this was about 12 years ago in a 486 I used to have. It just didn't work one day (it worked for the first two months). Turn the computer on, get past BIOS start, bam...parity error before bootloader could even start. Reboot, try again, parity error. Turn off parity checking, it actually started to boot and then crashed. The RAM was obviously very defective...when I took that 1 stick out the computer booted normally even with parity on, if I tried to boot with just that stick it would never even POST. That's the only time I have ever seen memory fail...but then it came from a really shady local dealer who regularly scammed people...this same guy had a rack of "shareware" DOS games with neatly printed labels (all labels he printed) for like $5/disk, all of the disks completely blank (not even formatted). I had happened to get one of those when I got the RAM, and my friend did too (from another part of the rack, we didn't give much thought to that at the time, was just an "oh, this looks like it might be neat" thing). Neither disk was even formatted. The CDROM drives he sold me and my friend died within a month also (about a month after the RAM). Amazingly the store was still in business when I went back with the stick of RAM...he looked at it with a magnifying glass, claimed it was "scratched" and therefore abused. I burned rubber out of his parking lot, tossing a lot of gravel against the windows, then I found a reputable place to get RAM (though this was back in the days when 4MB cost $200). 2 days later I drove by, the place was boarded up and closed. Both CDROM drives died within 2 days of each other a month later. Nothing that came out of that place worked.
...is that it detects SMART disk errors in normal use (i.e. you don't have to be watching the BIOS screens when your PC boots).
When I was trying the Vista RC, it told me that my drive was close to failing. I, of course, didn't believe it at first, but I ran the Seagate test floppy and it agreed. So I sent it back to Seagate for a free replacement.
About the only feature that impressed me in Vista, sadly. (And I'm not sure it should have impressed me, tbh. I'm assuming XP never did this as I've never seen/heard of such a feature.)
There is no context in which it is appropriate to apply metric reasoning to computers.
It's exactly this kind of bullshit that irritates me. Suppose you look at a file. It's 95,015,327 bytes long. You're claiming that referring to the file as being 95MB is "inappropriate"?
I'm a software engineer, fully versed in binary math, and the fact that computers refer to that file as being 90MB still really pisses me off. It's pointless and annoying.
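The two figures come from dividing the same byte count by different units. A quick sketch of the arithmetic, using the file size from the example above (the variable names are just for illustration):

```python
size_bytes = 95_015_327  # the file size from the example above

decimal_mb = size_bytes / 10**6  # metric megabytes: 1 MB = 1,000,000 bytes
binary_mib = size_bytes / 2**20  # binary mebibytes: 1 MiB = 1,048,576 bytes

print(f"{decimal_mb:.1f} MB (decimal)")
print(f"{binary_mib:.1f} MiB (binary)")
```

The decimal division gives about 95.0 and the binary one about 90.6, which is where the "95MB" versus "90MB" disagreement comes from.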
ZFS: because love is never having to say fsck