Everything You Know About Disks Is Wrong

Posted by kdawson on Tuesday February 20, 2007 @01:34PM from the mean-time dept.

modapi writes "Google's wasn't the best storage paper at FAST '07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the 'Best Paper' award. Bianca Schroeder, of CMU's Parallel Data Lab, submitted 'Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?' The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, 'consumer' vs. 'enterprise' drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper's key points."
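
For a sense of scale, here is a rough sketch of what a quoted MTTF of 1,000,000 hours would imply per year. It assumes a constant failure rate (exponentially distributed lifetimes), which is the textbook reading of a datasheet MTTF and not something the summary states; the function name is made up for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours; leap years ignored


def annual_failure_rate(mttf_hours: float) -> float:
    """Probability that a drive fails within one year.

    Assumption: constant failure rate, so lifetimes are exponential
    with mean mttf_hours.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


afr = annual_failure_rate(1_000_000)
print(f"Implied annual failure rate: {afr:.2%}")
```

Under that assumption the implied annual failure rate comes out a little under 0.9%, which is the kind of figure field replacement data can be compared against.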

MTBF by seanadams.com · 2007-02-20 13:36 · Score: 5, Interesting

MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.

Suppose a tire manufacturer drove their tires around the block, and then observed that not one of the four tires had gone bald. Could they then claim an enormous MTBF? Of course not, but that is no less absurd than the testing being reported by hard drive manufacturers.
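
To put numbers on the point about a single mean hiding the shape, here is a small sketch that treats lifetimes as a two-part mixture: an early spike at one month and a wear-out cluster centered at five years. Both values come from the example above; collapsing each cluster to a single point and the helper names are simplifying assumptions.

```python
EARLY_YEARS = 1 / 12  # early-failure spike at one month
LATE_YEARS = 5.0      # center of the wear-out cluster


def mixture_mean(early_fraction: float) -> float:
    """Mean lifetime when early_fraction of drives die in the early spike."""
    return early_fraction * EARLY_YEARS + (1 - early_fraction) * LATE_YEARS


def early_fraction_for_mean(target_mean_years: float) -> float:
    """Share of early failures needed to pull the mean down to the target."""
    return (LATE_YEARS - target_mean_years) / (LATE_YEARS - EARLY_YEARS)


p = early_fraction_for_mean(2.0)
print(f"Early-failure share needed for a 2-year mean: {p:.1%}")
print(f"Mean with 10% early failures: {mixture_mean(0.10):.2f} years")
```

With these numbers, a two-year mean needs roughly 61% of drives to die in the first month, even though almost no individual drive lasts anywhere near two years.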
1. Re:MTBF by gvc · 2007-02-20 14:04 · Score: 3, Interesting
  
  MT[TB]F has become a completely BS metric because it is so poorly understood. It only works if your failure rate is linear with respect to time. Even if you test for a stupendously huge period of time, it is still misleading because of the bathtub curve effect. You might get an MTBF of say, two years, when the reality is that the distribution has a big spike at one month, and the rest of the failures forming a wide bell curve centered at say, five years.
  The simplest model for survival analysis is that the failure rate is constant. That yields an exponential distribution, which I would not characterize as a bell curve. The Weibull distribution more aptly models things (like people and disks) that eventually wear out; i.e. the failure rate increases with time (but not linearly).
  With the right model, it is possible to extrapolate life expectancy from a short trial. It is just that the manufacturers have no incentive to tell the truth, so they don't. Vendors never tell the truth unless some standardized measurement is imposed on them.
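
A minimal sketch of the two models mentioned above, using the standard Weibull hazard formula h(t) = (k/λ)(t/λ)^(k-1). The shape and scale values are illustrative assumptions, not fitted to any drive data.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate at time t for a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean(shape: float, scale: float) -> float:
    """Mean lifetime (the MTTF) of a Weibull distribution."""
    return scale * math.gamma(1 + 1 / shape)


SCALE_YEARS = 5.0  # illustrative scale parameter

for years in (1, 2, 4):
    constant = weibull_hazard(years, shape=1.0, scale=SCALE_YEARS)  # exponential case
    wearing = weibull_hazard(years, shape=3.0, scale=SCALE_YEARS)   # wear-out case
    print(f"year {years}: constant={constant:.3f}/yr  wear-out={wearing:.3f}/yr")

print(f"MTTF, shape 1: {weibull_mean(1.0, SCALE_YEARS):.2f} years")
print(f"MTTF, shape 3: {weibull_mean(3.0, SCALE_YEARS):.2f} years")
```

A shape of 1 reduces to the exponential model with a flat failure rate, while a shape above 1 gives a rate that climbs as the drive ages. The two cases have similar mean lifetimes but very different risk in any given year.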
So SSD's are not only faster, but more reliable? by gelfling · 2007-02-20 14:32 · Score: 3, Interesting

I wonder if anyone looked at what actually failed in the drives? An arm, a platter, an actuator, a board, an MPU?

Would an analysis tell us that SSDs are not only faster but more reliable and if so by how much?
Re:Desktop vs Server usage. by Lumpy · 2007-02-20 14:38 · Score: 4, Interesting

Or she forgot to put in the part that Enterprise drives are replaced on a schedule BEFORE they fail. At Comcast I used to have 30-some servers with 25-50 drives each scattered about the state. Every hard drive was replaced every 3 years to avoid failures. These servers (TV ad insertion servers) made us between $4,500 and $13,000 a minute they were in operation, in spurts of 15 minutes down, 3-5 minutes inserting ads. Downtime was not acceptable so we replaced them on a regular basis.

Most enterprise-level operations that rely on their data replace drives before they fail. In fact, the replacement rate was increased to every 2 years, not for failure prevention but for capacity increases.

--
Do not look at laser with remaining good eye.
How much does handling matter? by RebornData · 2007-02-20 14:43 · Score: 5, Interesting

What's interesting to me is that neither of these papers mentions the issue of pre-installation handling. The good folks over at Storage Review seem to be of the opinion that the shocks and bumps that happen to a drive between the factory and the final installation are the most significant factor in drive reliability (much more than brand, for example).

The Google paper talks a bit about certain drive "vintages" being problematic, but I wonder if they buy drives in large lots, and perhaps some lots might have been handled roughly during shipping. If they could trace back each hard drive to the original order, perhaps they could look to see if there's a correlation between failure and shipping lot.

-R
Re:moving parts by blackest_k · 2007-02-20 15:37 · Score: 3, Interesting

Still doesn't mean it will last. Got a 1 gig USB flash drive here, dead in less than 8 weeks with very few reads and writes. It will not identify itself. It might have 99,900 write cycles left but it's still trashed.
Let's face it, there is no reliable storage media; the only way to be safe is multiple copies.

--
Blarney Quality Restaurant, Plants
and Google contradicts. by bill_mcgonigle · 2007-02-20 16:33 · Score: 4, Interesting

Well, the article actually says that drives don't have a spike of failures at the beginning.

Hmm, the Google paper says they do, from 3-6 months (Figure 2).

Which leaves us with confirmation that 50% of all studies are wrong.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
disk spin-up is most responsible for failure ? by cats-paw · 2007-02-20 16:52 · Score: 3, Interesting

I keep hearing this persistent rumor that it's disk spin-up which is the most significant contribution to disk failure. The moral of the story is that systems which are left on 24/7 are less likely to see HD failures than systems turned on/off every day.

Now if that's really true, wouldn't it be quite simple for the manufacturers to simply spin up the disk more slowly by putting in very simple and reliable motor control circuitry?

Does anyone have any real evidence, i.e. not anecdotal, that this is really true?

--
Absolute statements are never true
Re:Desktop vs Server usage. by yoprst · 2007-02-20 18:05 · Score: 3, Interesting

It's broadcasting, dude! No downtime is allowed. Here in Soviet Russia we (broadcasters) do exactly the same, except that we prefer a 2-year period.