Slashdot Mirror


Google Releases Paper on Disk Reliability

oski4410 writes "The Google engineers just published a paper on Failure Trends in a Large Disk Drive Population. Based on a study of 100,000 disk drives over 5 years they find some interesting stuff. To quote from the abstract: 'Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.'"

7 of 267 comments

  1. Conclusion by llZENll · · Score: 3, Informative

    This is awesome, but the conclusion of such an interesting study leaves a lot to be desired. FTA...

    "In this study we report on the failure characteristics of consumer-grade disk drives. To our knowledge, the study is unprecedented in that it uses a much larger population size than has been previously reported and presents a comprehensive analysis of the correlation between failures and several parameters that are believed to affect disk lifetime. Such analysis is made possible by a new highly parallel health data collection and analysis infrastructure, and by the sheer size of our computing deployment.

    One of our key findings has been the lack of a consistent pattern of higher failure rates for higher temperature drives or for those drives at higher utilization levels. Such correlations have been repeatedly highlighted by previous studies, but we are unable to confirm them by observing our population. Although our data do not allow us to conclude that there is no such correlation, it provides strong evidence to suggest that other effects may be more prominent in affecting disk drive reliability in the context of a professionally managed data center deployment.

    Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."

  2. Similar paper by reset_button · · Score: 4, Informative

    I was at the talk, and it was very interesting. CMU also had a paper (PDF) about disk failures at the same conference (in fact, the two were presented one after the other).

  3. and in the meanwhile... by pedantic+bore · · Score: 3, Informative

    ... at the same conference, Bianca Schroeder presented a paper on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin and a dozen papers by John Elerath...

    C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

    --
    Am I part of the core demographic for Swedish Fish?

  4. Re:Hmm by Anonymous Coward · · Score: 4, Informative

    There are several SMART signals which are highly correlated with drive errors, but the authors note that 56% of the failed drives had no occurrences of these highly correlated errors. Even considering all SMART signals, 36% of failed drives still had no SMART signals reported.

    So, if you have errors in those highly correlated categories your drives are probably going to fail, but if you do not have errors in these categories your drives can still fail.
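    To make the arithmetic behind those percentages concrete, here is a minimal sketch. It assumes only the two figures quoted above; the function name and constants are made up for illustration and are not from the paper. A predictor that fires only when a signal is present cannot flag a drive that failed without showing that signal, which caps the share of failures it can catch.

```python
# Figures quoted in the comment above; the names are illustrative assumptions.
NO_STRONG_SIGNAL = 0.56  # failed drives with none of the highly correlated errors
NO_SIGNAL_AT_ALL = 0.36  # failed drives with no SMART signal of any kind


def max_detectable_fraction(silent_fraction: float) -> float:
    """Upper bound on the share of failures a signal-based predictor can flag.

    Drives that failed without ever showing the signal are invisible to a
    predictor built on that signal, so they cap its best possible coverage.
    """
    if not 0.0 <= silent_fraction <= 1.0:
        raise ValueError("silent_fraction must be between 0 and 1")
    return 1.0 - silent_fraction


if __name__ == "__main__":
    strong = max_detectable_fraction(NO_STRONG_SIGNAL)
    any_signal = max_detectable_fraction(NO_SIGNAL_AT_ALL)
    print(f"strong signals only: at most {strong:.0%} of failures")
    print(f"any SMART signal:    at most {any_signal:.0%} of failures")
```

    This is only a ceiling on coverage. It says nothing about false alarms, which would have to be measured separately.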

  5. Re:OS X SMART tool? by kimvette · · Score: 3, Informative

    http://sourceforge.net/projects/smartmontools

    Not exactly point & click but it'll do.

    --
    The Christian Right is Neither (Christian nor right). See: Matthew 23, Matthew 25, Ezekiel 16:48-50

  6. You can get IDE/SATA drives FAILURE RATES Here by Augur · · Score: 5, Informative

    One of the largest retailers in Russia (and maybe in Europe: more than 300 terminals for in-person orders in a former factory building, busy 24/7), "Pro Sunrise", released information on failure rates of the major PC components (CPUs, video cards, motherboards, IDE/SATA drives, etc.) it sold in Q1-Q2 of 2005.

    http://pro.sunrise.ru/articletext.asp?reg=30&id=283 - the article (in Russian, but the diagrams are self-explanatory).

    http://pro.sunrise.ru/docs/30/image001.gif - IDE/SATA (3.5" formfactor)

    http://pro.sunrise.ru/docs/30/image002.gif - HDD (2.5" notebook formfactor)

    In short, Maxtor has the most returns and IBM/Hitachi the fewest.

    Toshiba is worst in 2.5", and Seagate is best.

    The chance of a drive failing ranges from 1/20 (Maxtor) to 1/70 (Hitachi).

  7. Re:That would be corporate dynamite by gbjbaanb · · Score: 3, Informative

    When a friend broke down, she asked the breakdown man who came which cars were the most reliable. He said he wasn't allowed to comment but that "he carried no Honda parts". I guess the same thing applies here - Google won't say, they'd get sued.

    On the other hand, hard drives change so much that this year's model will have totally different design and mechanics from next year's, so IBM's crappy Deskstar range should not be a reason to blame their (ok, Hitachi's) current line.

    If you do want to know more about which drives are best, check out StorageReview and enter details of your drives into their reliability database.