Disk Drive Failures 15 Times What Vendors Say

Posted by Zonk on Friday March 2, 2007 @09:15AM from the cough-sputter-wheeze-choke dept.

jcatcw writes "A Carnegie Mellon University study indicates that customers are replacing disk drives more frequently than vendor estimates of mean time to failure (MTTF) would suggest. The study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for the drives indicated MTTF between 1 and 1.5 million hours. That should mean annual failure rates of at most 0.88%, but annual replacement rates were between 2% and 4%. The study also shows no evidence that Fibre Channel drives are any more reliable than SATA drives."
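
For anyone wondering where the 0.88% comes from: it is roughly the number of hours in a year divided by the datasheet MTTF. A minimal sketch of that conversion follows, assuming an 8,760-hour year and drives that run around the clock; the function name is made up for illustration and is not taken from the study.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years (assumption)


def nominal_afr(mttf_hours):
    """Nominal annual failure rate implied by a datasheet MTTF, as a fraction.

    Assumes drives are powered on all year, so a fleet accumulates
    HOURS_PER_YEAR drive-hours per drive per year.
    """
    return HOURS_PER_YEAR / mttf_hours


# The two datasheet MTTF figures quoted in the summary.
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")
```

The 1 million hour figure gives the 0.88% quoted above; the 1.5 million hour figure gives an even lower nominal rate.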

Re:Repeat? by georgewilliamherbert · 2007-03-02 09:20 · Score: 3, Informative

We did both this study and the Google study in the first couple of days after FAST was over. Completely redundant....
In other news... by Mr.+Underbridge · 2007-03-02 09:22 · Score: 4, Informative

...Carnegie Mellon researchers can't tell a mean from a median. This is inherently a long-tailed distribution in which the mean will be much higher than the median. Imagine a simple situation in which failure rates are 50%/yr, but those that last beyond a year last a long time. Mean time to failure might be 1000 years. You simply can't compare the statistics the way they have without knowing a lot more about the distribution than I saw in the article. Perhaps I missed it while skimming.
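
A toy illustration of the mean-versus-median point, using made-up lifetimes rather than anything from the paper: half the drives die at six months and the other half last about two thousand years. The numbers are chosen only so that the mean comes out to 1000 years.

```python
import statistics

# Hypothetical fleet of 100 drives (lifetimes in years):
# half fail in the first year, the rest last a very long time.
lifetimes_years = [0.5] * 50 + [1999.5] * 50

mean_life = statistics.mean(lifetimes_years)
# median_low picks the lower of the two middle values for an even-sized sample.
median_life = statistics.median_low(lifetimes_years)

print(f"mean lifetime:   {mean_life} years")    # 1000.0
print(f"median lifetime: {median_life} years")  # 0.5
```

With a distribution shaped like this, a 1000-year mean says almost nothing about how many drives you replace in year one.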
1. Re:In other news... by Falkkin · 2007-03-02 09:57 · Score: 3, Informative
  
  In other news, Carnegie Mellon researchers know more about statistics than you give them credit for; blame ComputerWorld for crappy coverage of what the paper says. If you read the paper or the abstract, the researchers actually claim the opposite of what you are suggesting, namely, that the "infant mortality effect" (bathtub curve) often claimed for hard drives isn't actually the case. See Figure 4 in the paper and Section 5 ("Statistical properties of disk failures"). The paper is online here:
  
  http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
Re:Repeat? by ajs · 2007-03-02 09:34 · Score: 5, Informative

The best part about the entire thing is the very last quote:

"If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."

Just common sense.

It's "common sense," but not as useful as one might hope. What MTTF tells you is, within some expected margin of error, how much failure you should plan on in a statistically significant farm. So, for example, I know of an installation that has thousands of disks used for everything from root disks on relatively drop-in-replaceable compute servers to storage arrays. On the budgetary side, that installation wants to know how much replacement cost to expect per annum. On the admin side, that installation wants to be prepared with an appropriate number of redundant systems, and wants to be able to assert a failure probability for key systems. That is, if you have a RAID array with 5 disks and one spare, then you want to know the probability that three disks will fail on it in the, let's say, 6 hour worst-case window before you can replace any of them. That probability is non-zero, and must be accounted for in your computation of anticipated downtime, along with every other unlikely, but possible event that you can account for.

When a vendor tells you to expect a 0.2% failure rate, but it's really 2-4%, that's a HUGE shift in the impact to your organization.
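
To put rough numbers on the RAID example, here is a back-of-the-envelope sketch. It assumes a constant failure rate over the year and fully independent drives, which real drives may well not obey, so treat the output as an order-of-magnitude guide only. The 6 drives, 3 failures and 6 hour window come from the example above; the annual rates are the 0.88%, 2% and 4% figures from the story summary, and the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: non-leap year, drives always on


def p_fail_in_window(afr, window_hours):
    """Chance that one drive fails within the window.

    Assumes a constant hazard rate chosen so that the yearly
    failure probability equals afr.
    """
    hazard_per_hour = -math.log1p(-afr) / HOURS_PER_YEAR
    return -math.expm1(-hazard_per_hour * window_hours)


def p_at_least_k(n, k, p):
    """Binomial tail: chance that k or more of n independent drives fail."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for afr in (0.0088, 0.02, 0.04):
    p = p_fail_in_window(afr, window_hours=6)
    print(f"AFR {afr:.2%}: P(3+ of 6 drives fail in 6 h) = {p_at_least_k(6, 3, p):.2e}")
```

Because the triple-failure probability scales roughly with the cube of the per-drive rate, a vendor figure that is off by a factor of a few can put the array-level estimate off by one or two orders of magnitude.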

When you just have one or a handful of disks in your server at home, that's a very different situation from a datacenter full of systems with all kinds of disk needs.
Re:Interface matters why? by mollymoo · 2007-03-02 09:42 · Score: 5, Informative

TFA seems surprised by SATA drives lasting as long as Fibre...why on earth would your data interface have any consequences on the drive internals?

Fibre Channel drives, like SCSI drives, are assumed to be "enterprise" drives and therefore better built than "consumer" SATA and PATA drives. It's nothing inherent to the interface, but a consequence of the environment in which that interface is expected to be used. At least, that's the idea.

--
Chernobyl 'not a wildlife haven' - BBC News
Re:Personally I am SHOCKED by Beardo+the+Bearded · 2007-03-02 09:43 · Score: 4, Informative

What, really?

The same companies that lie about the capacity on EVERY SINGLE DRIVE they make? You don't think that they're a bunch of lying fucking weasels? (We're both using sarcasm here.)

I don't care how you spin it. 1024 is the multiple. NOT 1000!

Failure doesn't get fixed because making a drive more reliable means it costs more. If it costs more, it's not going to get purchased.

--
ECHELON is a government program to find words like bomb, jihad, plutonium, assassinate, and anarchy.
Re:Even better ... by Falkkin · 2007-03-02 10:01 · Score: 5, Informative

This is handled in the paper. See this graph: http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/img14b.PNG

Unfortunately there is no big "spike"; the average replacement rate just grows and grows with time.
just assume 3 years by crabpeople · 2007-03-02 10:05 · Score: 4, Informative

A good rule of thumb is 3 years. Most hard drives fail in 3 years. I don't know why, but I'm currently seeing a lot of bad 2004-branded drives and consider that right on schedule. Last year the 02-03 drives were the ones failing left and right. I just pulled one this morning that's stamped March 04. Just started acting up a few days ago. Like clockwork.

--
I'll just use my special getting high powers one more time...
Re:Personally I am SHOCKED by Lord+Ender · 2007-03-02 10:11 · Score: 3, Informative

Before computers were used in real engineering, we could get away with "k" sometimes meaning 1024 (like in memory addresses) and sometimes meaning 1000 (like in network speeds). Those days are past. Now that computers are part of real engineering work, even the slightest amount of ambiguity is not acceptable.

Differentiating between "k" (=1000) and "ki" (=1024) is a sign that the computer industry is finally maturing. It's called progress.

--
A slashdotter who didn't build his own computer is like a Jedi who didn't build his own lightsaber.
Off-Topic: SI Units by ewhac · 2007-03-02 10:21 · Score: 5, Informative
I just can't believe that the same vendors that would misrepresent the capacity of their disk by redefining a Gigabyte as 1,000,000,000 bytes instead of 1,073,741,824 bytes would misrepresent their MTBF too!

Not that this is actually relevant or anything, but there's been a long-standing schism between the computing community and the scientific community concerning the meaning of the SI prefixes Kilo, Mega, and Giga. Until computers showed up, Kilo, Mega, and Giga referred exclusively to multipliers of exactly 1,000, 1,000,000, and 1,000,000,000, respectively. Then, when computers showed up and people had to start speaking of large storage sizes, the computing guys overloaded the prefixes to mean powers of two which were "close enough." Thus, when one speaks of computer storage, Kilo, Mega, and Giga refer to 2**10, 2**20, and 2**30 bytes, respectively. Kilo, Mega, and Giga, when used in this way, are properly slang, but they've gained traction in the mainstream, causing confusion among members of differing disciplines.
As such, there has been a decree to give the powers of two their own prefix names (standardized by the IEC rather than as part of SI itself). The following have been established:
- 2**10: Kibi (abbreviated Ki)
- 2**20: Mebi (Mi)
- 2**30: Gibi (Gi)
These new prefixes are gaining traction in some circles. If you have a recent release of Linux handy, type /sbin/ifconfig and look at the RX and TX byte counts. It uses the new prefixes.
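
For a sense of scale, here is a small sketch of the gap between the two conventions. The 500 GB drive is a made-up example, not a figure from the article.

```python
# Decimal (SI) prefixes versus binary (Ki/Mi/Gi) prefixes, in bytes.
KILO, MEGA, GIGA = 10**3, 10**6, 10**9
KIBI, MEBI, GIBI = 2**10, 2**20, 2**30


def decimal_gb_to_gib(gigabytes):
    """Convert a capacity stated in decimal gigabytes to gibibytes."""
    return gigabytes * GIGA / GIBI


print(GIBI - GIGA)                       # 73741824 bytes of difference per "giga"
print(f"{1 - GIGA / GIBI:.1%}")          # 6.9% shortfall at the giga level
print(f"{decimal_gb_to_gib(500):.1f}")   # a hypothetical 500 GB drive is 465.7 GiB
```

The shortfall grows with each prefix step, which is why the complaint got louder as drives moved from megabytes to gigabytes.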
Schwab
--
Editor, A1-AAA AmeriCaptions
Re:Not So Fuzzy math by Annoying · 2007-03-02 10:22 · Score: 4, Informative

0.88% != 0.88
0.0088 * 15 = 0.132 (13%)
13% you say? The excerpt says 2%-4%. RTA, though, and you'll see they report up to 13% on some systems.
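
Going the other way, a quick sketch of how each observed rate compares with the 0.88% datasheet figure. The rates are the ones quoted in this thread; the helper name is made up for illustration.

```python
DATASHEET_AFR = 0.0088  # nominal rate quoted in the story summary


def times_datasheet(observed_afr, nominal_afr=DATASHEET_AFR):
    """How many times higher an observed annual rate is than the nominal one."""
    return observed_afr / nominal_afr


for observed in (0.02, 0.04, 0.13):
    print(f"observed {observed:.0%} is {times_datasheet(observed):.1f}x the datasheet rate")
```

So the "15 times" in the headline lines up with the worst systems, while the typical 2%-4% range is more like two to five times the datasheet rate.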
Re:Personally I am SHOCKED by Chonine · 2007-03-02 11:39 · Score: 3, Informative

Standard metric is indeed powers of 10, and a megabyte is indeed 10^6 bytes.
To clear up the confusion, a separate notation for binary multiples, as in 2^20 bytes, was developed. That would be a mebibyte.
http://en.wikipedia.org/wiki/Mebibyte
Re:Repeat? by ShakaUVM · 2007-03-02 13:30 · Score: 3, Informative

Except MTBF is just pulled out of their asses. Look at the development cycle of a hard drive. Look at the MTBF. I used to work for an engineering company, and have worked doing test suites to determine MTBF. Sure, there's numbers involved, but it's probably 60% wishful thinking and 40% science.

Believe me, they aren't determining an 11 year MTBF empirically.
Re:Actually, one useful feature of Vista... by Matt+Perry · 2007-03-02 14:03 · Score: 3, Informative

When I was trying the Vista RC, it told me that my drive was close to failing. ... About the only feature that impressed me in Vista, sadly.
Be sad no more. SmartMonTools will run on UNIX or Windows and notify you if it detects SMART errors. For the Windows installer, look for the phrase "Install the Windows package" on the smartmontools home page.

--
Slashdot: Failed Car Analogies. Amateur Lawyering. Anecdote Battles.